Python API (cnvlib package)

Module cnvlib contents[source]

Parse a file as a copy number or copy ratio table (.cnn, .cnr).

The one function exposed at the top level, read, loads a file in CNVkit’s native format and returns a CopyNumArray instance. For your own scripting, you can usually accomplish what you need using the CopyNumArray and GenomicArray methods available on this returned object.

Core classes

The core objects used throughout CNVkit. The base class is GenomicArray. All of these classes wrap a pandas DataFrame instance accessible through the .data attribute which can be used for any manipulations that aren’t already provided by methods in the wrapper class.


A generic array of genomic positions.

class cnvlib.gary.GenomicArray(data_table, meta_dict=None)[source]

Bases: object

An array of genomic intervals. Base class for genomic data structures.

Can represent most BED-like tabular formats with arbitrary additional columns.


Combine this array’s data with another GenomicArray (in-place).

Any optional columns must match between both arrays.


Create a new CNA, adding the specified extra columns to this CNA.


Wrap the named columns in this instance’s metadata.


Wrap the given pandas dataframe in this instance’s metadata.


Wrap the given rows in this instance’s metadata.


Select chromosomes w/ integer names, ignoring any ‘chr’ prefixes.


Iterate over bins grouped by chromosome name.

by_ranges(other, mode='inner', keep_empty=True)[source]

Group rows by another GenomicArray’s bin coordinate ranges.

Returns an iterable of (bin, GenomicArray of overlapping rows)). Usually used for grouping probes or SNVs by CNV segments.

mode determines what to do with bins that overlap a boundary of the selection. Values are:

  • inner: Drop the bins on the selection boundary, don’t emit them.
  • outer: Keep/emit those bins as they are.
  • trim: Emit those bins but alter their boundaries to match the selection; the bin start or end position is replaced with the selection boundary position. [default]

Bins in this array that fall outside the other array’s bins are skipped.


Concatenate several GenomicArrays, keeping this array’s metadata.

This array’s data table is not implicitly included in the result.


Iterate over plain coordinates of each bin: chromosome, start, end.

With also, also include those columns.

Example, yielding rows in BED format:

>>> probes.coords(also=["name", "strand"])

Create an independent copy of this object.


Remove any optional columns from this GenomicArray.

Returns a new copy with only the core columns retained:
log2 value, chromosome, start, end, bin name.
classmethod from_columns(columns, meta_dict=None)[source]

Create a new instance from column arrays, given as a dict.

classmethod from_rows(rows, columns=None, meta_dict=None)[source]

Create a new instance from a list of rows, as tuples or arrays.

in_range(chrom=None, start=None, end=None, mode='inner')[source]

Get the GenomicArray portion within the given genomic range.

mode works as in by_ranges: outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.

in_ranges(chrom=None, starts=None, ends=None, mode='inner')[source]

Get the GenomicArray portion within the specified ranges.

Same as in_ranges but the starts and ends are arrays of equal length, and the output concatenates all the selected bins.


Extract a subset of columns, reusing this instance’s metadata.

match_to_bins(other, key, default=0.0, fill=False, summary_func=<function median>)[source]

Take values of the other array at each of this array’s bins.

Assign default to indices that fall outside the other array’s bins, or chromosomes that appear in self but not other.

Return an array of the key column values in other corresponding to this array’s bin locations, the same length as this array.

classmethod read(infile, sample_id=None)[source]
static row2label(row)[source]
select(selector=None, **kwargs)[source]

Take a subset of rows where the given condition is true.

Arguments can be a function (lambda expression) returning a bool, which will be used to select True rows, and/or keyword arguments like gene=”Background” or chromosome=”chr7”, which will select rows where the keyed field equals the specified value.


Randomize the order of bins in this array (in-place).


Sort this array’s bins in-place, with smart chromosome ordering.


Sort this array’s columns in-place, per class definition.


Write the wrapped data table to a file or handle in tabular format.

The format is BED-like, but with a header row included and with arbitrary extra columns.

To combine multiple samples in one file and/or convert to another format, see the ‘export’ subcommand.


CNVkit’s core data structure, a copy number array.

class cnvlib.cnary.CopyNumArray(data_table, meta_dict=None)[source]

Bases: cnvlib.gary.GenomicArray

An array of genomic intervals, treated like aCGH probes.

Required columns: chromosome, start, end, gene, log2

Optional columns: gc, rmask, spread, weight, probes

by_gene(ignore=('-', '.', 'CGH'))[source]

Iterate over probes grouped by gene name.

Emits pairs of (gene name, CNA of rows with same name)

Groups each series of intergenic bins as a ‘Background’ gene; any ‘Background’ bins within a gene are grouped with that gene. Bins with names in ignore are treated as ‘Background’ bins, but retain their name.

center_all(estimator=<function median>, skip_low=False)[source]

Recenter coverage values to the autosomes’ average (in-place).


Drop bins with extremely low log2 coverage values.

These are generally bins that had no reads mapped, and so were substituted with a small dummy log2 value to avoid divide-by-zero errors.


Get the uninformed expected copy ratios of each bin.

Create an array of log2 coverages like a “flat” reference.

This is a neutral copy ratio at each autosome (log2 = 0.0) and sex chromosomes based on whether the reference is male (XX or XY).


Get the relative log-coverage of chrX in a sample.

guess_average_depth(segments=None, window=100)[source]

Estimate the effective average read depth from variance.

Assume read depths are Poisson distributed, converting log2 values to absolute counts. Then the mean depth equals the variance , and the average read depth is the estimated mean divided by the estimated variance. Use robust estimators (Tukey’s biweight location and midvariance) to compensate for outliers and overdispersion.

With segments, take the residuals of this array’s log2 values from those of the segments to remove the confounding effect of real CNVs.

If window is an integer, calculate and subtract a smoothed trendline to remove the effect of CNVs without segmentation (skipped if segments are given).


guess_xx(male_reference=False, verbose=True)[source]

Guess whether a sample is female from chrX relative coverages.

Recommended cutoff values:
-0.5 – raw target data, not yet corrected +0.5 – probe data already corrected on a male profile

Difference in log2 value of each bin from its segment mean.

If segments are just regions (e.g. RegionArray) with no log2 values precalculated, subtract the median of this array’s log2 values within each region. If no segments are given, subtract each chromosome’s median.


Adjust chrX coverages (divide in half) for apparent female samples.

squash_genes(summary_func=<function biweight_location>, squash_background=False, ignore=('-', '.', 'CGH'))[source]

Combine consecutive bins with the same targeted gene name.

The ignore parameter lists bin names that not be counted as genes to be output.

Parameter summary_func is a function that summarizes an array of coverage values to produce the “squashed” gene’s coverage value. By default this is the biweight location, but you might want median, mean, max, min or something else in some cases.


An array of genomic regions or features.

class cnvlib.rary.RegionArray(data_table, meta_dict=None)[source]

Bases: cnvlib.gary.GenomicArray

An array of genomic intervals.

classmethod read(fname, sample_id=None, fmt=None)[source]

Read regions in any of the expected file formats.

Iterates over tuples of the tabular contents. Header lines are skipped.

Start and end coordinates are base-0, half-open.

write(outfile=<open file '<stdout>', mode 'w'>, fmt='bed', verbose=True)[source]


An array of genomic intervals, treated as variant loci.

class cnvlib.vary.VariantArray(data_table, meta_dict=None)[source]

Bases: cnvlib.gary.GenomicArray

An array of genomic intervals, treated as variant loci.

Required columns: chromosome, start, end, ref, alt

mirrored_baf(above_half=True, tumor_boost=True)[source]

Mirrored B-allele frequencies (BAFs).

Flip BAFs to be all above 0.5 (if above_half) or below 0.5, for consistency.

With tumor_boost, normalize tumor-sample allele frequencies to the matched normal allele frequencies.

classmethod read_vcf(infile, sample_id=None, normal_id=None, min_depth=None, skip_hom=False, skip_reject=False, skip_somatic=False)[source]

Parse SNV coordinates from a VCF file into a VariantArray.


TumorBoost normalization of tumor-sample allele frequencies.

De-noises the signal for detecting LOH.

Interface to CNVkit sub-commands


The public API for each of the commands defined in the CNVkit workflow.

Command-line interface and corresponding API for CNVkit.

cnvlib.commands.batch_make_reference(normal_bams, target_bed, antitarget_bed, male_reference, fasta, annotate, short_names, split, target_avg_size, access, antitarget_avg_size, antitarget_min_size, output_reference, output_dir, processes, by_count)[source]

Build the CN reference from normal samples, targets and antitargets.

cnvlib.commands.batch_run_sample(bam_fname, target_bed, antitarget_bed, ref_fname, output_dir, male_reference=False, scatter=False, diagram=False, rlibpath=None, by_count=False)[source]

Run the pipeline on one BAM file.

cnvlib.commands.batch_write_coverage(bed_fname, bam_fname, out_fname, by_count)[source]

Run coverage on one sample, write to file.

cnvlib.commands.do_access(fa_fname, exclude_fnames=(), min_gap_size=5000)[source]

List the locations of accessible sequence regions in a FASTA file.

cnvlib.commands.do_antitarget(target_bed, access_bed=None, avg_bin_size=100000, min_bin_size=None)[source]

Derive a background/antitarget BED file from a target BED file.

cnvlib.commands.do_breaks(probes, segments, min_probes=1)[source]

List the targeted genes in which a copy number breakpoint occurs.

cnvlib.commands.do_call(cnarr, variants=None, method='threshold', ploidy=2, purity=None, is_reference_male=False, is_sample_female=False, thresholds=(-1.1, -0.25, 0.2, 0.7))[source]
cnvlib.commands.do_coverage(bed_fname, bam_fname, by_count=False, min_mapq=0)[source]

Calculate coverage in the given regions from BAM read depths.

cnvlib.commands.do_fix(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]

Combine target and antitarget coverages and correct for biases.

cnvlib.commands.do_gainloss(probes, segments=None, male_reference=False, threshold=0.2, min_probes=3, skip_low=False)[source]

Identify targeted genes with copy number gain or loss.

cnvlib.commands.do_heatmap(cnarrs, show_range=None, do_desaturate=False)[source]

Plot copy number for multiple samples as a heatmap.

cnvlib.commands.do_import_theta(segarr, theta_results_fname, ploidy=2)[source]
cnvlib.commands.do_reference(target_fnames, antitarget_fnames, fa_fname=None, male_reference=False, do_gc=True, do_edge=True, do_rmask=True)[source]

Compile a coverage reference from the given files (normal samples).

cnvlib.commands.do_reference_flat(targets, antitargets, fa_fname=None, male_reference=False)[source]

Compile a neutral-coverage reference from the given intervals.

Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.

cnvlib.commands.do_rescale(cnarr, ploidy=2, purity=None, is_reference_male=False, is_sample_female=False)[source]
cnvlib.commands.do_scatter(cnarr, segments=None, variants=None, show_range=None, show_gene=None, background_marker=None, do_trend=False, window_width=1000000.0, y_min=None, y_max=None, title=None)[source]

Plot probe log2 coverages and CBS calls together.

show_gene: name of gene to highligh show_range: chromosome name or coordinate string like “chr1:20-30”

cnvlib.commands.do_targets(bed_fname, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]

Transform bait intervals into targets more suitable for CNVkit.


Parse the command line.


Display this program’s version.

cnvlib.commands.rescale_baf(purity, observed_baf, normal_baf=0.5)[source]

Adjust B-allele frequencies for sample purity.

t_baf*purity + n_baf*(1-purity) = obs_baf obs_baf - n_baf * (1-purity) = t_baf * purity t_baf = (obs_baf - n_baf * (1-purity))/purity
cnvlib.commands.verify_gender_arg(cnarr, gender_arg, is_male_reference)[source]

The following modules implement lower-level functionality specific to each of the CNVkit sub-commands.


Supporting functions for the ‘antitarget’ command.

cnvlib.antitarget.find_background_regions(access_chroms, target_chroms, pad_size)[source]

Take coordinates of accessible regions and targets; emit antitargets.

cnvlib.antitarget.get_background(target_bed, access_bed, avg_bin_size, min_bin_size)[source]

Generate background intervals from target intervals.


  • Invert target intervals

  • Subtract the inverted targets from accessible regions

  • For each of the resulting regions:

    • Shrink by a fixed margin on each end
    • If it’s smaller than min_bin_size, skip
    • Divide into equal-size (region_size/avg_bin_size) portions
    • Emit the (chrom, start, end) coords of each portion
cnvlib.antitarget.guess_chromosome_regions(target_chroms, telomere_size)[source]

Determine (minimum) chromosome lengths from target coordinates.


Call copy number variants from segmented log2 ratios., ploidy, purity, is_reference_male, is_sample_female)[source]

Calculate absolute copy number values from segment or bin log2 ratios., ploidy, purity, is_reference_male, is_sample_female)[source]

Absolute, expected and reference copy number in a DataFrame., ploidy, is_sample_female)[source]

Absolute integer number of expected copies in each bin.

I.e. the given ploidy for autosomes, and XY or XX sex chromosome counts according to the sample’s specified gender., ploidy, is_reference_male)[source]

Calculate absolute copy number values from segment or bin log2 ratios., ploidy, is_reference_male)[source]

Absolute integer number of reference copies in each bin.

I.e. the given ploidy for autosomes, 1 or 2 X according to the reference gender, and always 1 copy of Y., ploidy, thresholds, is_reference_male)[source]

Call integer copy number using hard thresholds for each level.

Integer values are assigned for log2 ratio values less than each given threshold value in sequence, counting up from zero. Above the last threshold value, integer copy numbers are called assuming full purity, diploidy, and rounding up.

Default thresholds follow this heuristic for calling CNAs in a tumor sample: For single-copy gains and losses, assume 50% tumor cell clonality (including normal cell contamination). Then:

R> log2(2:6 / 4)
-1.0  -0.4150375  0.0  0.3219281  0.5849625

Allowing for random noise of +/- 0.1, the cutoffs are:

DEL(0)  <  -1.1
LOSS(1) <  -0.25
GAIN(3) >=  +0.2
AMP(4)  >=  +0.7

For germline samples, better precision could be achieved with:

LOSS(1) <  -0.4
GAIN(3) >=  +0.3, absolutes, ploidy, is_reference_male, min_abs_val=0.001, round_to_int=False)[source]

Convert absolute copy numbers to log2 ratios.

Optionally round copy numbers to integers.

Account for reference gender & ploidy of sex chromosomes.


Supporting functions for the ‘antitarget’ command.


Count the total number of mapped reads in a BAM file.

Uses the BAM index to do this quickly.

cnvlib.coverage.bedcov(bed_fname, bam_fname, min_mapq)[source]

Calculate depth of all regions in a BED file via samtools (pysam) bedcov.

i.e. mean pileup depth across each region.

cnvlib.coverage.interval_coverages(bed_fname, bam_fname, by_count, min_mapq)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.interval_coverages_count(bed_fname, bam_fname, min_mapq)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.interval_coverages_pileup(bed_fname, bam_fname, min_mapq)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.region_depth_count(bamfile, chrom, start, end, min_mapq)[source]

Calculate depth of a region via pysam count.

i.e. counting the number of read starts in a region, then scaling for read length and region width to estimate depth.

Coordinates are 0-based, per pysam.


Chromosome diagram drawing functions.

This uses and abuses Biopython’s BasicChromosome module. It depends on ReportLab, too, so we isolate this functionality here so that the rest of CNVkit will run without it. (And also to keep the codebase tidy.)

cnvlib.diagram.bc_chromosome_draw_label(self, cur_drawing, label_name)[source]

Monkeypatch to Bio.Graphics.BasicChromosome.Chromosome._draw_label.

Draw a label for the chromosome. Mod: above the chromosome, not below.

cnvlib.diagram.bc_organism_draw(org, title, wrap=12)[source]

Modified copy of Bio.Graphics.BasicChromosome.Organism.draw.

Instead of stacking chromosomes horizontally (along the x-axis), stack rows vertically, then proceed with the chromosomes within each row.


  • title: The output title of the produced document.
cnvlib.diagram.build_chrom_diagram(features, chr_sizes, sample_id)[source]

Create a PDF of color-coded features on chromosomes.

cnvlib.diagram.create_diagram(cnarr, segarr, threshold, min_probes, outfname, male_reference)[source]

Create the diagram.


Export CNVkit objects and files to other formats.


Map chromosome names to integers in the order encountered.

cnvlib.export.export_bed(segments, ploidy, is_reference_male, is_sample_female, label, show)[source]

Convert a copy number array to a BED-like DataFrame.

For each region in each sample (possibly filtered according to show), the columns are:

  • reference sequence name
  • start (0-indexed)
  • end
  • sample name or given label
  • integer copy number

By default (show=”ploidy”), skip regions where copy number is the default ploidy, i.e. equal to 2 or the value set by –ploidy. If show=”variant”, skip regions where copy number is neutral, i.e. equal to the reference ploidy on autosomes, or half that on sex chromosomes.


Biodiscovery Nexus Copy Number “basic” format.

Only represents one sample per file.

cnvlib.export.export_nexus_ogt(sample_fname, vcf_fname, sample_id, min_depth=20, min_weight=0.0)[source]

Biodiscovery Nexus Copy Number “Custom-OGT” format.

To create the b-allele frequencies column, alterate allele frequencies from the VCF are aligned to the .cnr file bins. Bins that contain no variants are left blank; if a bin contains multiple variants, then the frequencies are all “mirrored” to be above .5, then the median of those values is taken.


SEG format for copy number segments.

Segment breakpoints are not the same across samples, so samples are listed in serial with the sample ID as the left column.

cnvlib.export.export_theta(tumor, reference)[source]

Convert tumor segments and normal .cnr or reference .cnn to THetA input.

Follows the THetA segmentation import script but avoid repeating the pileups, since we already have the mean depth of coverage in each target bin.

The options for average depth of coverage and read length do not matter crucially for proper operation of THetA; increased read counts per bin simply increase the confidence of THetA’s results.

THetA2 input format is tabular, with columns:
ID, chrm, start, end, tumorCount, normalCount

where chromosome IDs (“chrm”) are integers 1 through 24.

cnvlib.export.export_vcf(segments, ploidy, is_reference_male, is_sample_female, sample_id=None)[source]

Convert segments to Variant Call Format.

For now, only 1 sample per VCF. (Overlapping CNVs seem tricky.)


cnvlib.export.fmt_cdt(sample_ids, table)[source]

Format as CDT.

cnvlib.export.fmt_gct(sample_ids, table)[source]
cnvlib.export.fmt_jtv(sample_ids, table)[source]

Format for Java TreeView.

cnvlib.export.format_theta_row(seg, cnarr, chrom_id, log2_to_count)[source]

Convert a segment’s info to a row of THetA input.

For the normal/reference bin count, take the mean of the bin values within each segment so that segments match between tumor and normal.


Merge probe values from multiple samples into a 2D table (of sorts).

dict of {sample ID: (probes, values)}
list-of-tuples: (probe, log2 coverages...)
cnvlib.export.segments2vcf(segments, ploidy, is_reference_male, is_sample_female)[source]

Convert copy number segments to VCF records.


Supporting functions for the ‘fix’ command.

cnvlib.fix.apply_weights(cnarr, ref_matched, epsilon=0.0001)[source]

Calculate weights for each bin.

Weights are derived from:

  • bin sizes
  • average bin coverage depths in the reference
  • the “spread” column of the reference.
cnvlib.fix.center_by_window(cnarr, fraction, sort_key)[source]

Smooth out biases according to the trait specified by sort_key.

E.g. correct GC-biased probes by windowed averaging across similar-GC probes; or for similar interval sizes.

cnvlib.fix.edge_gains(target_sizes, gap_sizes, insert_size)[source]

Calculate coverage gain from neighboring baits’ flanking reads.

Letting i = insert size, t = target size, g = gap to neighboring bait, the gain of coverage due to a nearby bait, if g < i, is:

(i-g)^2 / 4it

If the neighbor flank extends beyond the target (t+g < i), reduce by:

(i-t-g)^2 / 4it

If a neighbor overlaps the target, treat it as adjacent (gap size 0).

cnvlib.fix.edge_losses(target_sizes, insert_size)[source]

Calculate coverage losses at the edges of baited regions.

Letting i = insert size and t = target size, the proportional loss of coverage near the two edges of the baited region (combined) is:


If the “shoulders” extend outside the bait $(t < i), reduce by:

(i-t)^2 / 4it

on each side, or (i-t)^2 / 2it total.

cnvlib.fix.get_edge_bias(cnarr, margin)[source]

Quantify the “edge effect” of the target tile and its neighbors.

The result is proportional to the change in the target’s coverage due to these edge effects, i.e. the expected loss of coverage near the target edges and, if there are close neighboring tiles, gain of coverage due to “spill over” reads from the neighbor tiles.

(This is not the actual change in coverage. This is just a tribute.)

cnvlib.fix.load_adjust_coverages(pset, ref_pset, skip_low, fix_gc, fix_edge, fix_rmask)[source]

Load and filter probe coverages; correct using reference and GC.


Flag the probes with excessively low or inconsistent coverage.

Returns a bool array where True indicates probes that failed the checks.

cnvlib.fix.match_ref_to_probes(ref_pset, probes)[source]

Filter the reference probes to match the target or antitarget probe set.


Import from other formats to the CNVkit format.


Search the given paths for ‘targetcoverage’ CSV files.

Per the convention we use in our Picard applets, the target coverage file names end with ‘.targetcoverage.csv’; anti-target coverages end with ‘.antitargetcoverage.csv’.


Parse a Picard CalculateHsMetrics PER_TARGET_COVERAGE file.

Return a CopyNumArray.

Input column names:
chrom (str), start, end, length (int), name (str), %gc, mean_coverage, normalized_coverage (float)
cnvlib.importers.import_seg(segfname, chrom_names, chrom_prefix, from_log10)[source]

Parse a SEG file as an iterable of CopyNumArray instances.

Map (string) chromosome IDs to names. (Applied before chrom_prefix.) e.g. {‘23’: ‘X’, ‘24’: ‘Y’, ‘25’: ‘M’}
chrom_prefix: prepend this string to chromosome names
(usually ‘chr’ or None)

from_log10: Convert values from log10 to log2.


Parse THetA results into a data structure.

Columns: NLL, mu, C, p*


Fix the duplicated gene names Picard spits out.

Return a string containing the single gene name, sans duplications and pipe characters.

Picard CalculateHsMetrics combines the labels of overlapping intervals by joining all labels with ‘|’, e.g. ‘BRAF|BRAF’ – no two distinct targeted genes actually overlap, though, so these dupes are redundant. Meaningless target names are dropped, e.g. ‘CGH|FOO|-‘ resolves as ‘FOO’. In case of ambiguity, the longest name is taken, e.g. “TERT|TERT Promoter” resolves as “TERT Promoter”.


Supporting functions for the ‘reference’ command.


Create neutral-coverage probes from intervals.


Calculate the GC and lowercase (RepeatMasked) content of a string.

cnvlib.reference.combine_probes(filenames, fa_fname, is_male_reference, skip_low, fix_gc, fix_edge, fix_rmask)[source]

Calculate the median coverage of each bin across multiple samples.

List of .cnn files, as generated by ‘coverage’ or ‘import-picard’. fa_fname: fil columns for GC and RepeatMasker genomic values.
A single CopyNumArray summarizing the coverages of the input samples, including each bin’s “average” coverage, “spread” of coverages, and genomic GC content.
cnvlib.reference.get_fasta_stats(probes, fa_fname)[source]

Calculate GC and RepeatMasker content of each bin in the FASTA genome.

cnvlib.reference.reference2regions(reference, coord_only=False)[source]

Extract iterables of target and antitarget regions from a reference.

Like loading two BED files with ngfrills.parse_regions.


Warn about target probes where coverage is poor.

Prints a formatted table to stderr.


Supports the sub-commands breaks and gainloss.

Supporting functions for the text/tabular-reporting commands.

Namely: breaks, gainloss.

cnvlib.reports.gainloss_by_gene(probes, threshold, skip_low=False)[source]

Identify genes where average bin copy ratio value exceeds threshold.

NB: Must shift sex-chromosome values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.

cnvlib.reports.gainloss_by_segment(probes, segments, threshold, skip_low=False)[source]

Identify genes where segmented copy ratio exceeds threshold.

NB: Must shift sex-chromosome values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.

cnvlib.reports.get_breakpoints(intervals, segments, min_probes)[source]

Identify CBS segment breaks within the targeted intervals.

cnvlib.reports.get_gene_intervals(all_probes, ignore=('-', '.', 'CGH'))[source]

Tally genomic locations of each targeted gene.

Return a dict of chromosomes to a list of tuples: (gene name, start, end).

cnvlib.reports.group_by_genes(probes, skip_low)[source]

Group probe and coverage data by gene.

Return an iterable of genes, in chromosomal order, associated with their location and coverages:

[(gene, chrom, start, end, [coverages]), ...]


Segmentation of copy number values.

cnvlib.segmentation.do_segmentation(cnarr, method, threshold=None, variants=None, skip_low=False, skip_outliers=10, save_dataframe=False, rlibpath=None)[source]

Infer copy number segments from the given coverage table.

cnvlib.segmentation.drop_outliers(cnarr, width, factor)[source]

Drop outlier bins with log2 ratios too far from the trend line.

Outliers are the log2 values factor times the 90th quantile of absolute deviations from the rolling average, within a window of given width. The 90th quantile is about 1.97 standard deviations if the log2 values are Gaussian, so this is similar to calling outliers factor * 1.97 standard deviations from the rolling mean. For a window size of 50, the breakdown point is 2.5 outliers within a window, which is plenty robust for our needs.

cnvlib.segmentation.repair_segments(segments, orig_probes)[source]

Post-process segmentation output.

  1. Ensure every chromosome has at least one segment.
  2. Ensure first and last segment ends match 1st/last bin ends (but keep log2 as-is).
  3. Store probe-level gene names, comma-separated, as the segment name.

Convert R dataframe contents (SEG) to our native tabular format.

Return a pandas.Dataframe with CNA columns.


Combine contiguous segments.

cnvlib.segmentation.transfer_names_weights(segments, cnarr, ignore=('-', '.', 'CGH'))[source]

Copy gene names from cnarr to the segmented segarr.

Segment name is the comma-separated list of bin gene names.


Transform bait intervals into targets more suitable for CNVkit., refflat_fname)[source]

Apply RefSeq gene names to a list of targeted regions., refflat_fname, default_name='-')[source]

Replace the interval gene names with those at the same loc in refFlat.txt, names)[source]

Try filtering names again. Format the row for yielding., exclude=('mRNA', ))[source]

Remove less-meaningful accessions from the given set.[source]

Parse one line of refFlat.txt; return relevant info.

Pair up the exon start and end positions. Add strand direction to each chromosome as a “+”/”-” suffix (it will be stripped off later).[source]

Parse genes; merge those with same name and overlapping regions.

Returns a dict of: {(chrom, strand): [(gene start, gene end, gene name), ...]}[source]

Reduce multi-accession interval labels to the minimum consistent.

So: BED or interval_list files have a label for every region. We want this to be a short, unique string, like the gene name. But if an interval list is instead a series of accessions, including additional accessions for sub-regions of the gene, we can extract a single accession that covers the maximum number of consecutive regions that share this accession.


... chr1 879125 879593 + mRNA|JX093079,ens|ENST00000342066,mRNA|JX093077,ref|SAMD11,mRNA|AF161376,mRNA|JX093104 chr1 880158 880279 + ens|ENST00000483767,mRNA|AF161376,ccds|CCDS3.1,ref|NOC2L ...


chr1 879125 879593 + mRNA|AF161376 chr1 880158 880279 + mRNA|AF161376[source]

Return the shortest trimmed name from the given set., avg_size)[source]

Split large tiled intervals into smaller, consecutive targets.

For each of the tiled regions:

  • Divide into equal-size (tile_size/avg_size) portions
  • Emit the (chrom, start, end) coords of each portion

Bin the regions according to avg_size.

Helper modules


CNV utilities.

cnvlib.core.assert_equal(msg, **values)[source]

Evaluate and compare two or more values for equality.

Sugar for a common assertion pattern. Saves re-evaluating (and retyping) the same values for comparison and error reporting.


>>> assert_equal("Mismatch", expected=1, saw=len(['xx', 'yy']))
ValueError: Mismatch: expected = 1, saw = 2
cnvlib.core.check_unique(items, title)[source]

Ensure all items in an iterable are identical; return that one item.


Strip directory and all extensions from a filename.


Strip directory and final extension from a filename.


Create a sorting key from chromosome label.

Sort by integers first, then letters or strings. The prefix “chr” (case-insensitive), if present, is stripped automatically for sorting.

E.g. chr1 < chr2 < chr10 < chrX < chrY < chrM


Create a sort key function that gets chromosome label at a list index.

cnvlib.core.write_dataframe(outfname, dframe, header=True)[source]

Write a pandas.DataFrame to a tabular file.

cnvlib.core.write_text(outfname, text, *more_texts)[source]

Write one or more strings (blocks of text) to a file.

cnvlib.core.write_tsv(outfname, rows, colnames=None)[source]

Write rows, with optional column header, to tabular file.


Robust estimators of central tendency and scale.

For use in evaluating performance of copy number estimation.

cnvlib.metrics.biweight_location(a, initial=None, c=6.0, epsilon=0.0001)[source]

Compute the biweight location for an array.

The biweight is a robust statistic for determining the central location of a distribution.

cnvlib.metrics.biweight_midvariance(a, initial=None, c=9.0, epsilon=0.0001)[source]

Compute the biweight midvariance for an array.

The biweight midvariance is a robust statistic for determining the midvariance (i.e. the standard deviation) of a distribution.


cnvlib.metrics.confidence_interval_bootstrap(bins, alpha, bootstraps=100)[source]

Confidence interval for segment mean log2 value, estimated by bootstrap.


Estimators of scale: standard deviation, MAD, biweight midvariance.

Calculates all of these values for an array of deviations and returns them as a tuple.


Compute the difference between the array’s first and third quartiles.

cnvlib.metrics.median_absolute_deviation(a, scale_to_sd=True)[source]

Compute the median absolute deviation (MAD) of array elements.

The MAD is defined as: median(abs(a - median(a))).



Return the modal value of an array’s values.

The “mode” is the location of peak density among the values, estimated using a Gaussian kernel density estimator.

arr is a 1-D array of floating-point values, e.g. bin log2 ratio values.

cnvlib.metrics.prediction_interval(bins, alpha)[source]

Prediction interval, estimated by percentiles.


Rousseeuw & Croux’s (1993) Q_n, an alternative to MAD.

Qn := Cn first quartile of (|x_i - x_j|: i < j)

where Cn is a constant depending on n.

Finite-sample correction factors must be used to calibrate the scale of Qn for small-to-medium-sized samples.

n E[Qn] – —– 10 1.392 20 1.193 40 1.093 60 1.064 80 1.048 100 1.038 200 1.019
cnvlib.metrics.segment_mean(cnarr, skip_low=False)[source]

Weighted average of bin log2 values.


NGS utilities.


Safely run a command and get stdout; print stderr if there’s an error.

Like subprocess.check_output, but silent in the normal case where the command logs unimportant stuff to stderr. If there is an error, then the full error message(s) is shown in the exception message.


Create dirs and move an existing file to avoid overwriting, if necessary.

If a file already exists at the given path, it is renamed with an integer suffix to clear the way.

cnvlib.ngfrills.safe_write(*args, **kwds)[source]

Write to a filename or file-like object with error handling.

If given a file name, open it. If the path includes directories that don’t exist yet, create them. If given a file-like object, just pass it through.

cnvlib.ngfrills.temp_write_text(*args, **kwds)[source]

Save text to a temporary file.

NB: This won’t work on Windows b/c the file stays open.


Utilities for multi-core parallel processing.

class cnvlib.parallel.SerialPool[source]

Bases: object

Mimic the multiprocessing.Pool interface, but run in serial.

apply_async(func, args)[source]

Just call the function.



Defines several constants used in the pipeline.

Hard-coded parameters for CNVkit. These should not change between runs.


Plotting utilities.

cnvlib.plots.chromosome_sizes(probes, to_mb=False)[source]

Create an ordered mapping of chromosome names to sizes.

cnvlib.plots.cnv_on_chromosome(axis, probes, segments, genes, background_marker=None, do_trend=False, y_min=None, y_max=None)[source]

Draw a scatter plot of probe values with CBS calls overlaid.

Argument ‘genes’ is a list of tuples: (start, end, gene name)

cnvlib.plots.cnv_on_genome(axis, probes, segments, pad, do_trend=False, y_min=None, y_max=None)[source]

Plot coverages and CBS calls for all chromosomes on one plot.

cnvlib.plots.cvg2rgb(cvg, desaturate)[source]

Choose a shade of red or blue representing log2-coverage value.

cnvlib.plots.gene_coords_by_name(probes, names)[source]

Find the chromosomal position of each named gene in probes.

Returns a dict: {chromosome: [(start, end, gene name), ...]}

cnvlib.plots.gene_coords_by_range(probes, chrom, start, end, ignore=('-', '.', 'CGH'))[source]

Find the chromosomal position of all genes in a range.

Returns a dict: {chromosome: [(start, end, gene), ...]}

cnvlib.plots.group_snvs_by_segments(snv_posns, snv_freqs, segments, chrom=None)[source]

Group SNP allele frequencies by segment.

Return an iterable of: start, end, value.


Parse a chromosomal range specification.

Range spec string should look like chr1:1234-5678 or chr1:1234- or chr1:-5678, where missing start becomes 0 and missing end becomes None.


Group the tumor shift values by chromosome (for statistical testing).

cnvlib.plots.plot_x_dividers(axis, chrom_sizes, pad)[source]

Plot vertical dividers and x-axis labels given the chromosome sizes.

Returns a table of the x-position offsets of each chromosome.

Draws vertical black lines between each chromosome, with padding. Labels each chromosome range with the chromosome name, centered in the region, under a tick. Sets the x-axis limits to the covered range.

cnvlib.plots.setup_chromosome(axis, probes=None, segments=None, variants=None, y_min=None, y_max=None, y_label=None)[source]

Configure axes for plotting a single chromosome’s data.

probes, segments, and variants should already be subsetted to the region that will be plotted.

cnvlib.plots.setup_genome(axis, probes, segments, variants, y_min=None, y_max=None)[source]

Configure axes for plotting a whole genomes’s data.

cnvlib.plots.snv_on_chromosome(axis, variants, segments, genes, do_trend, do_boost=False)[source]
cnvlib.plots.snv_on_genome(axis, variants, chrom_sizes, segments, do_trend, pad, do_boost=False)[source]

Plot a scatter-plot of SNP chromosomal positions and shifts.

cnvlib.plots.test_loh(bins, alpha=0.0025)[source]

Test each chromosome’s SNP shifts and the combined others’.

The statistical test is Mann-Whitney, a one-sided non-parametric test for difference in means.


Extract chromosome, start, end from a string or tuple.


“chr1” -> (“chr1”, None, None) “chr1:100-123” -> (“chr1”, 100, 123) (“chr1”, 100, 123) -> (“chr1”, 100, 123)


Signal smoothing functions.

cnvlib.smoothing.check_inputs(x, width)[source]

Transform width into a half-window size.

width is either a fraction of the length of x or an integer size of the whole window. The output half-window size is truncated to the length of x if needed.

cnvlib.smoothing.fit_edges(x, y, wing, polyorder=3)[source]

Apply polynomial interpolation to the edges of y, in-place.

Calculates a polynomial fit (of order polyorder) of x within a window of width twice wing, then updates the smoothed values y in the half of the window closest to the edge.

cnvlib.smoothing.outlier_iqr(a, c=3.0)[source]

Detect outliers as a multiple of the IQR from the median.

By convention, “outliers” are points more than 1.5 * IQR from the median, and “extremes” or extreme outliers are those more than 3.0 * IQR.


MAD-Median rule for detecting outliers.

Returns: a boolean array of the same size, where outlier indices are True.

X_i is an outlier if:

 | X_i - M |
_____________  > K ~= 2.24

 MAD / 0.6745

where $K = sqrt( X^2_{0.975,1} )$, the square root of the 0.975 quantile of a chi-squared distribution with 1 degree of freedom.

This is a very robust rule with the highest possible breakdown point of 0.5.


  • Davies & Gather (1993) The Identification of Multiple Outliers.
  • Rand R. Wilcox (2012) Introduction to robust estimation and hypothesis testing. Ch.3: Estimating measures of location and scale.
cnvlib.smoothing.rolling_median(x, width)[source]

Rolling median with mirrored edges.

cnvlib.smoothing.rolling_outlier_iqr(x, width, c=3.0)[source]

Detect outliers as a multiple of the IQR from the median.

By convention, “outliers” are points more than 1.5 * IQR from the median (~2 SD if values are normally distributed), and “extremes” or extreme outliers are those more than 3.0 * IQR (~4 SD).

cnvlib.smoothing.rolling_outlier_quantile(x, width, q, m)[source]

Return a boolean mask of outliers by multiples of a quantile in a window.

Outliers are the array elements outside m times the q‘th quantile of deviations from the smoothed trend line, as calculated from the trend line residuals. (For example, take the magnitude of the 95th quantile times 5, and mark any elements greater than that value as outliers.)

This is the smoothing method used in BIC-seq (doi:10.1073/pnas.1110574108) with the parameters width=200, q=.95, m=5 for WGS.

cnvlib.smoothing.rolling_outlier_std(x, width, stdevs)[source]

Return a boolean mask of outliers by stdev within a rolling window.

Outliers are the array elements outside stdevs standard deviations from the smoothed trend line, as calculated from the trend line residuals.

cnvlib.smoothing.rolling_quantile(x, width, quantile)[source]

Rolling quantile (0–1) with mirrored edges.

cnvlib.smoothing.rolling_std(x, width)[source]

Rolling quantile (0–1) with mirrored edges.

cnvlib.smoothing.smoothed(x, width, do_fit_edges=False)[source]

Smooth the values in x with the Kaiser windowed filter.



x : array-like
1-dimensional numeric data set.
width : float
Fraction of x’s total length to include in the rolling window (i.e. the proportional window width), or the integer size of the window.