Python API (cnvlib package)

Module cnvlib contents

cnvlib.read(fname)[source]

Parse a file as a copy number or copy ratio table (.cnn, .cnr).

The one function exposed at the top level, read, loads a file in CNVkit’s native format and returns a CopyNumArray instance. For your own scripting, you can usually accomplish what you need using the CopyNumArray and GenomicArray methods available on this returned object.

Core classes

The core objects used throughout CNVkit. The base class is GenomicArray. All of these classes wrap a pandas DataFrame instance accessible through the .data attribute which can be used for any manipulations that aren’t already provided by methods in the wrapper class.

gary

A generic array of genomic positions.

class cnvlib.gary.GenomicArray(data_table, meta_dict=None)[source]

Bases: future.types.newobject.newobject

An array of genomic intervals. Base class for genomic data structures.

Can represent most BED-like tabular formats with arbitrary additional columns.

add(other)[source]

Combine this array’s data with another GenomicArray (in-place).

Any optional columns must match between both arrays.

add_columns(**columns)[source]

Create a new CNA, adding the specified extra columns to this CNA.

as_columns(**columns)[source]

Wrap the named columns in this instance’s metadata.

as_dataframe(dframe)[source]

Wrap the given pandas dataframe in this instance’s metadata.

as_rows(rows)[source]

Wrap the given rows in this instance’s metadata.

autosomes(also=())[source]

Select chromosomes w/ integer names, ignoring any ‘chr’ prefixes.

by_chromosome()[source]

Iterate over bins grouped by chromosome name.

by_ranges(other, mode='inner', keep_empty=True)[source]

Group rows by another GenomicArray’s bin coordinate ranges.

Returns an iterable of (bin, GenomicArray of overlapping rows) pairs. Usually used for grouping probes or SNVs by CNV segments.

mode determines what to do with bins that overlap a boundary of the selection. Values are:

  • inner: Drop the bins on the selection boundary, don’t emit them.
  • outer: Keep/emit those bins as they are.
  • trim: Emit those bins but alter their boundaries to match the selection; the bin start or end position is replaced with the selection boundary position.

Bins in this array that fall outside the other array’s bins are skipped.
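
The three modes can be illustrated with a minimal sketch on plain (start, end) tuples for a single chromosome. The helper select_bins is hypothetical and not part of cnvlib; it assumes half-open coordinates and only mirrors the behaviour described above.

```python
def select_bins(bins, start, end, mode="inner"):
    """Select (start, end) bins relative to the range [start, end).

    Hypothetical helper for illustration only; not part of cnvlib.
    """
    if mode not in ("inner", "outer", "trim"):
        raise ValueError("unknown mode: %r" % (mode,))
    selected = []
    for b_start, b_end in bins:
        if b_end <= start or b_start >= end:
            continue  # entirely outside the selection: always skipped
        straddles = b_start < start or b_end > end
        if not straddles or mode == "outer":
            selected.append((b_start, b_end))  # keep the bin as it is
        elif mode == "trim":
            # clip the straddling bin to the selection boundaries
            selected.append((max(b_start, start), min(b_end, end)))
        # mode == "inner": straddling bins are dropped
    return selected


bins = [(0, 100), (100, 200), (200, 300), (300, 400)]
print(select_bins(bins, 150, 350, mode="inner"))  # [(200, 300)]
print(select_bins(bins, 150, 350, mode="outer"))  # [(100, 200), (200, 300), (300, 400)]
print(select_bins(bins, 150, 350, mode="trim"))   # [(150, 200), (200, 300), (300, 350)]
```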

chromosome
concat(others)[source]

Concatenate several GenomicArrays, keeping this array’s metadata.

This array’s data table is not implicitly included in the result.

coords(also=())[source]

Iterate over plain coordinates of each bin: chromosome, start, end.

With also, also include those columns.

Example, yielding rows in BED format:

>>> probes.coords(also=["gene", "strand"])
copy()[source]

Create an independent copy of this object.

drop_extra_columns()[source]

Remove any optional columns from this GenomicArray.

Returns a new copy with only the core columns retained:
log2 value, chromosome, start, end, bin name.
end
classmethod from_columns(columns, meta_dict=None)[source]

Create a new instance from column arrays, given as a dict.

classmethod from_rows(rows, columns=None, meta_dict=None)[source]

Create a new instance from a list of rows, as tuples or arrays.

in_range(chrom=None, start=None, end=None, mode='inner')[source]

Get the GenomicArray portion within the given genomic range.

mode works as in by_ranges: outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.

in_ranges(chrom=None, starts=None, ends=None, mode='inner')[source]

Get the GenomicArray portion within the specified ranges.

Same as in_range, but the starts and ends are arrays of equal length, and the output concatenates all the selected bins.

keep_columns(columns)[source]

Extract a subset of columns, reusing this instance’s metadata.

labels()[source]
match_to_bins(other, key, default=0.0, fill=False, summary_func=<function median>)[source]

Take values of the other array at each of this array’s bins.

Assign default to indices that fall outside the other array’s bins, or chromosomes that appear in self but not other.

Return an array of the key column values in other corresponding to this array’s bin locations, the same length as this array.

static row2label(row)[source]
sample_id
select(selector=None, **kwargs)[source]

Take a subset of rows where the given condition is true.

Arguments can be a function (lambda expression) returning a bool, which will be used to select True rows, and/or keyword arguments like gene="Background" or chromosome="chr7", which will select rows where the keyed field equals the specified value.
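
The two selection styles can be sketched on a plain list of dict rows. This is an illustrative stand-in, not the cnvlib implementation (which operates on the wrapped DataFrame); the row fields and values are made up for the example.

```python
def select(rows, selector=None, **kwargs):
    """Filter dict rows by a predicate and/or field=value pairs.

    Illustrative stand-in only; not the cnvlib implementation.
    """
    result = list(rows)
    if selector is not None:
        # keep rows for which the function returns True
        result = [row for row in result if selector(row)]
    for field, value in kwargs.items():
        # keep rows whose keyed field equals the given value
        result = [row for row in result if row[field] == value]
    return result


rows = [
    {"chromosome": "chr7", "gene": "EGFR", "log2": 1.2},
    {"chromosome": "chr7", "gene": "Background", "log2": 0.1},
    {"chromosome": "chr9", "gene": "CDKN2A", "log2": -2.0},
]
gained = select(rows, lambda row: row["log2"] > 0.5)
chr7_background = select(rows, chromosome="chr7", gene="Background")
```

Both kinds of argument can be combined in one call, in which case a row must satisfy all of the conditions.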

shuffle()[source]

Randomize the order of bins in this array (in-place).

sort()[source]

Sort this array’s bins in-place, with smart chromosome ordering.

sort_columns()[source]

Sort this array’s columns in-place, per class definition.

start

cnary

CNVkit’s core data structure, a copy number array.

class cnvlib.cnary.CopyNumArray(data_table, meta_dict=None)[source]

Bases: cnvlib.gary.GenomicArray

An array of genomic intervals, treated like aCGH probes.

Required columns: chromosome, start, end, gene, log2

Optional columns: gc, rmask, spread, weight, probes

by_gene(ignore=('-', '.', 'CGH'))[source]

Iterate over probes grouped by gene name.

Groups each series of intergenic bins as a ‘Background’ gene; any ‘Background’ bins within a gene are grouped with that gene. Bins with names in ignore are treated as ‘Background’ bins, but retain their name.

Bins’ gene names are split on commas to accommodate overlapping genes and bins that cover multiple genes.

Return an iterable of pairs of: (gene name, CNA of rows with same name)

center_all(estimator=<function nanmedian>, skip_low=False, by_chrom=True)[source]

Recenter coverage values to the autosomes’ average (in-place).

compare_sex_chromosomes(male_reference=False, skip_low=False)[source]

Compare coverage ratios of sex chromosomes versus autosomes.

drop_low_coverage()[source]

Drop bins with extremely low log2 coverage values.

These are generally bins that had no reads mapped due to sample-specific issues. A very small coverage value (log2 ratio) may have been substituted to avoid domain or divide-by-zero errors.

expect_flat_log2(is_male_reference=None)[source]

Get the uninformed expected copy ratios of each bin.

Create an array of log2 coverages like a “flat” reference.

This is a neutral copy ratio at each autosome (log2 = 0.0) and sex chromosomes based on whether the reference is male (XX or XY).

guess_xx(male_reference=False, verbose=True)[source]

Guess whether a sample is female from chrX relative coverages.

Recommended cutoff values:
-0.5 – raw target data, not yet corrected
+0.5 – probe data already corrected on a male profile
log2
residuals(segments=None)[source]

Difference in log2 value of each bin from its segment mean.

If segments are just regions (e.g. GenomicArray) with no log2 values precalculated, subtract the median of this array’s log2 values within each region. If no segments are given, subtract each chromosome’s median.
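
For the case where no segments are given, the calculation reduces to subtracting each chromosome's median. A minimal sketch on (chromosome, log2) pairs, using only the standard library; the function name and input layout are assumptions for illustration:

```python
from collections import defaultdict
from statistics import median


def residuals_by_chromosome(bins):
    """bins: list of (chromosome, log2) pairs; returns residuals in input order.

    Illustrative sketch only; not the cnvlib implementation.
    """
    by_chrom = defaultdict(list)
    for chrom, log2 in bins:
        by_chrom[chrom].append(log2)
    medians = {chrom: median(values) for chrom, values in by_chrom.items()}
    # each bin's residual is its distance from its own chromosome's median
    return [log2 - medians[chrom] for chrom, log2 in bins]


bins = [("chr1", 0.0), ("chr1", 0.5), ("chr1", 1.0),
        ("chr2", -1.0), ("chr2", -0.5)]
print(residuals_by_chromosome(bins))  # [-0.5, 0.0, 0.5, -0.25, 0.25]
```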

shift_xx(male_reference=False, is_xx=None)[source]

Adjust chrX coverages (divide in half) for apparent female samples.

squash_genes(summary_func=<function biweight_location>, squash_background=False, ignore=('-', '.', 'CGH'))[source]

Combine consecutive bins with the same targeted gene name.

The ignore parameter lists bin names that should not be counted as genes to be output.

Parameter summary_func is a function that summarizes an array of coverage values to produce the “squashed” gene’s coverage value. By default this is the biweight location, but you might want median, mean, max, min or something else in some cases.
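
A rough sketch of the squashing step on plain tuples. It assumes the bins belong to one chromosome and are already sorted, and it uses the median as the summary function because the default biweight location is not in the standard library. Names listed in ignore are simply left out of the output here, which is an assumption about the intended behaviour.

```python
from itertools import groupby
from statistics import median


def squash_genes(bins, summary_func=median, ignore=("-", ".", "CGH")):
    """bins: list of (gene, start, end, log2), sorted by position.

    Illustrative sketch only; not the cnvlib implementation.
    """
    squashed = []
    # groupby merges only *consecutive* bins that share a gene name
    for gene, group in groupby(bins, key=lambda b: b[0]):
        group = list(group)
        if gene in ignore:
            continue  # assumption: ignored names are not emitted as genes
        log2 = summary_func([b[3] for b in group])
        squashed.append((gene, group[0][1], group[-1][2], log2))
    return squashed


bins = [
    ("A", 0, 100, 0.25), ("A", 100, 200, 0.75), ("A", 200, 300, 0.5),
    ("-", 300, 400, 0.0),
    ("B", 400, 500, -1.0), ("B", 500, 600, -0.5),
]
print(squash_genes(bins))  # [('A', 0, 300, 0.5), ('B', 400, 600, -0.75)]
```

Passing summary_func=max or summary_func=min changes only how each gene's log2 value is summarized, not how bins are grouped.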

vary

An array of genomic intervals, treated as variant loci.

class cnvlib.vary.VariantArray(data_table, meta_dict=None)[source]

Bases: cnvlib.gary.GenomicArray

An array of genomic intervals, treated as variant loci.

Required columns: chromosome, start, end, ref, alt

baf_by_ranges(ranges, summary_func=<function nanmedian>, above_half=None, tumor_boost=True)[source]

Aggregate variant (b-allele) frequencies in each given bin.

Get the average BAF in each of the bins of another genomic array: BAFs are mirrored (see mirrored_baf), grouped in each bin of the ranges genomic array (an instance of GenomicArray or a subclass), and summarized with summary_func, by default the median.

Parameters above_half and tumor_boost are the same as in mirrored_baf.

heterozygous(min_freq=None, max_freq=None)[source]
mirrored_baf(above_half=None, tumor_boost=False)[source]

Mirrored B-allele frequencies (BAFs).

If above_half is set to True or False, flip BAFs to be all above 0.5 or below 0.5, respectively, for consistency. Otherwise mirror in the direction of the majority of BAFs.

With tumor_boost, normalize tumor-sample allele frequencies to the matched normal sample’s allele frequencies.
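
The mirroring rule can be sketched on a plain list of frequencies. This sketch covers only the above_half logic; TumorBoost normalization is not shown. The majority direction is decided here by counting values on each side of 0.5, with ties going upward, which is an assumption and may differ from how cnvlib decides it.

```python
def mirrored_baf(bafs, above_half=None):
    """Mirror B-allele frequencies around 0.5.

    Illustrative sketch only; tumor_boost is not implemented here.
    """
    if above_half is None:
        # assumption: the majority is decided by simple counts, ties go up
        n_above = sum(1 for baf in bafs if baf > 0.5)
        n_below = sum(1 for baf in bafs if baf < 0.5)
        above_half = n_above >= n_below
    if above_half:
        return [max(baf, 1.0 - baf) for baf in bafs]
    return [min(baf, 1.0 - baf) for baf in bafs]


bafs = [0.25, 0.75, 0.5, 0.625]
print(mirrored_baf(bafs, above_half=True))   # [0.75, 0.75, 0.5, 0.625]
print(mirrored_baf(bafs, above_half=False))  # [0.25, 0.25, 0.5, 0.375]
print(mirrored_baf(bafs))                    # [0.75, 0.75, 0.5, 0.625]
```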

tumor_boost()[source]

TumorBoost normalization of tumor-sample allele frequencies.

De-noises the signal for detecting LOH.

Interface to CNVkit sub-commands

commands

The public API for each of the commands defined in the CNVkit workflow.

Command-line interface and corresponding API for CNVkit.

cnvlib.commands.batch_make_reference(normal_bams, target_bed, antitarget_bed, male_reference, fasta, annotate, short_names, target_avg_size, access, antitarget_avg_size, antitarget_min_size, output_reference, output_dir, processes, by_count, method)[source]

Build the CN reference from normal samples, targets and antitargets.

cnvlib.commands.batch_run_sample(bam_fname, target_bed, antitarget_bed, ref_fname, output_dir, male_reference, scatter, diagram, rlibpath, by_count, skip_low, method, processes)[source]

Run the pipeline on one BAM file.

cnvlib.commands.batch_write_coverage(bed_fname, bam_fname, out_fname, by_count, processes)[source]

Run coverage on one sample, write to file.

cnvlib.commands.csvstring(text)[source]
cnvlib.commands.do_access(fa_fname, exclude_fnames=(), min_gap_size=5000)[source]

List the locations of accessible sequence regions in a FASTA file.

cnvlib.commands.do_antitarget(target_bed, access_bed=None, avg_bin_size=100000, min_bin_size=None)[source]

Derive a background/antitarget BED file from a target BED file.

cnvlib.commands.do_breaks(probes, segments, min_probes=1)[source]

List the targeted genes in which a copy number breakpoint occurs.

cnvlib.commands.do_call(cnarr, variants=None, method='threshold', ploidy=2, purity=None, is_reference_male=False, is_sample_female=False, filters=None, thresholds=(-1.1, -0.25, 0.2, 0.7))[source]
cnvlib.commands.do_coverage(bed_fname, bam_fname, by_count=False, min_mapq=0, processes=1)[source]

Calculate coverage in the given regions from BAM read depths.

cnvlib.commands.do_fix(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]

Combine target and antitarget coverages and correct for biases.

cnvlib.commands.do_gainloss(cnarr, segments=None, threshold=0.2, min_probes=3, skip_low=False, male_reference=False, is_sample_female=None)[source]

Identify targeted genes with copy number gain or loss.

cnvlib.commands.do_gender(cnarrs, is_male_reference)[source]

Guess samples’ gender from the relative coverage of chromosome X.

cnvlib.commands.do_heatmap(cnarrs, show_range=None, do_desaturate=False)[source]

Plot copy number for multiple samples as a heatmap.

cnvlib.commands.do_import_theta(segarr, theta_results_fname, ploidy=2)[source]
cnvlib.commands.do_metrics(cnarrs, segments, skip_low=False)[source]

Compute coverage deviations and other metrics for self-evaluation.

cnvlib.commands.do_reference(target_fnames, antitarget_fnames, fa_fname=None, male_reference=False, do_gc=True, do_edge=True, do_rmask=True)[source]

Compile a coverage reference from the given files (normal samples).

cnvlib.commands.do_reference_flat(targets, antitargets, fa_fname=None, male_reference=False)[source]

Compile a neutral-coverage reference from the given intervals.

Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.

cnvlib.commands.do_scatter(cnarr, segments=None, variants=None, show_range=None, show_gene=None, background_marker=None, do_trend=False, window_width=1000000.0, y_min=None, y_max=None, title=None)[source]

Plot probe log2 coverages and CBS calls together.

show_gene: name of gene to highlight
show_range: chromosome name or coordinate string like “chr1:20-30”

cnvlib.commands.do_targets(bed_fname, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]

Transform bait intervals into targets more suitable for CNVkit.

cnvlib.commands.parse_args(args=None)[source]

Parse the command line.

cnvlib.commands.print_version(_args)[source]

Display this program’s version.

cnvlib.commands.verify_gender_arg(cnarr, gender_arg, is_male_reference)[source]
cnvlib.commands.zip_repeater(iterable, repeatable)[source]

Repeat a single segmentation to match the number of copy ratio inputs

The following modules implement lower-level functionality specific to each of the CNVkit sub-commands.

antitarget

Supporting functions for the ‘antitarget’ command.

cnvlib.antitarget.find_background_regions(access_chroms, target_chroms, pad_size)[source]

Take coordinates of accessible regions and targets; emit antitargets.

cnvlib.antitarget.get_background(target_bed, access_bed, avg_bin_size, min_bin_size)[source]

Generate background intervals from target intervals.

Procedure:

  • Invert target intervals

  • Subtract the inverted targets from accessible regions

  • For each of the resulting regions:

    • Shrink by a fixed margin on each end
    • If it’s smaller than min_bin_size, skip
    • Divide into equal-size (region_size/avg_bin_size) portions
    • Emit the (chrom, start, end) coords of each portion
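
The per-region part of this procedure (shrink, skip, divide, emit) can be sketched as follows. The function split_region and its margin argument are hypothetical names for illustration; inverting the targets and subtracting them from the accessible regions is not shown.

```python
def split_region(chrom, start, end, margin, avg_bin_size, min_bin_size):
    """Divide one background region into roughly equal-size bins.

    Hypothetical helper for illustration only; not part of cnvlib.
    """
    # shrink by a fixed margin on each end
    start += margin
    end -= margin
    size = end - start
    if size < min_bin_size:
        return []  # too small after shrinking: skip
    # number of equal-size portions, about region_size / avg_bin_size
    n_bins = max(1, round(size / avg_bin_size))
    # integer arithmetic keeps the bins contiguous and ends exactly at `end`
    return [(chrom,
             start + i * size // n_bins,
             start + (i + 1) * size // n_bins)
            for i in range(n_bins)]


print(split_region("chr1", 0, 310000, 5000, 100000, 50000))
# [('chr1', 5000, 105000), ('chr1', 105000, 205000), ('chr1', 205000, 305000)]
```
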
cnvlib.antitarget.guess_chromosome_regions(target_chroms, telomere_size)[source]

Determine (minimum) chromosome lengths from target coordinates.

call

Call copy number variants from segmented log2 ratios.

cnvlib.call.absolute_clonal(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]

Calculate absolute copy number values from segment or bin log2 ratios.

cnvlib.call.absolute_dataframe(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]

Absolute, expected and reference copy number in a DataFrame.

cnvlib.call.absolute_expect(cnarr, ploidy, is_sample_female)[source]

Absolute integer number of expected copies in each bin.

I.e. the given ploidy for autosomes, and XY or XX sex chromosome counts according to the sample’s specified gender.

cnvlib.call.absolute_pure(cnarr, ploidy, is_reference_male)[source]

Calculate absolute copy number values from segment or bin log2 ratios.

cnvlib.call.absolute_reference(cnarr, ploidy, is_reference_male)[source]

Absolute integer number of reference copies in each bin.

I.e. the given ploidy for autosomes, 1 or 2 X according to the reference gender, and always 1 copy of Y.
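
The rule reads directly as a small lookup. In this sketch the function name is hypothetical, and sex chromosomes are assumed to be named X and Y with an optional "chr" prefix.

```python
def reference_copies(chromosome, ploidy, is_reference_male):
    """Expected number of reference copies for one chromosome.

    Hypothetical helper for illustration only; not part of cnvlib.
    """
    # assumption: sex chromosomes are labelled X/Y, optionally with "chr"
    name = chromosome[3:] if chromosome.lower().startswith("chr") else chromosome
    if name == "X":
        return 1 if is_reference_male else 2
    if name == "Y":
        return 1  # always one copy of Y, whatever the reference gender
    return ploidy


print(reference_copies("chr3", 2, True))   # 2
print(reference_copies("chrX", 2, True))   # 1
print(reference_copies("chrX", 2, False))  # 2
print(reference_copies("Y", 2, False))     # 1
```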

cnvlib.call.absolute_threshold(cnarr, ploidy, thresholds, is_reference_male)[source]

Call integer copy number using hard thresholds for each level.

Integer values are assigned for log2 ratio values less than each given threshold value in sequence, counting up from zero. Above the last threshold value, integer copy numbers are called assuming full purity, diploidy, and rounding up.

Default thresholds follow this heuristic for calling CNAs in a tumor sample: For single-copy gains and losses, assume 50% tumor cell clonality (including normal cell contamination). Then:

R> log2(2:6 / 4)
-1.0  -0.4150375  0.0  0.3219281  0.5849625

Allowing for random noise of +/- 0.1, the cutoffs are:

DEL(0)  <  -1.1
LOSS(1) <  -0.25
GAIN(3) >=  +0.2
AMP(4)  >=  +0.7

For germline samples, better precision could be achieved with:

LOSS(1) <  -0.4
GAIN(3) >=  +0.3
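
The thresholding rule can be sketched for a single log2 ratio as follows. This is a simplified illustration that assumes a diploid autosome and uses the strict "less than" comparison stated above; the function name is hypothetical and sex-chromosome adjustments are left out.

```python
import math


def call_by_threshold(log2_ratio, thresholds=(-1.1, -0.25, 0.2, 0.7)):
    """Integer copy number for one diploid autosomal bin or segment.

    Hypothetical helper for illustration only; not part of cnvlib.
    """
    # count up from zero: the first threshold the value falls below wins
    for copies, cutoff in enumerate(thresholds):
        if log2_ratio < cutoff:
            return copies
    # above the last threshold: assume full purity and diploidy, round up
    return int(math.ceil(2 * 2 ** log2_ratio))


for value in (-1.5, -0.5, 0.0, 0.5, 0.7, 1.5):
    print(value, call_by_threshold(value))
# -1.5 0
# -0.5 1
# 0.0 2
# 0.5 3
# 0.7 4
# 1.5 6
```
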
cnvlib.call.log2_ratios(cnarr, absolutes, ploidy, is_reference_male, min_abs_val=0.001, round_to_int=False)[source]

Convert absolute copy numbers to log2 ratios.

Optionally round copy numbers to integers.

Account for reference gender & ploidy of sex chromosomes.

cnvlib.call.rescale_baf(purity, observed_baf, normal_baf=0.5)[source]

Adjust B-allele frequencies for sample purity.

Math:
t_baf * purity + n_baf * (1-purity) = obs_baf
obs_baf - n_baf * (1-purity) = t_baf * purity
t_baf = (obs_baf - n_baf * (1-purity)) / purity
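
The final line of the derivation translates directly into code. The function below is a sketch named tumor_baf to keep it distinct from the library function; it does no clipping or validation of the result.

```python
def tumor_baf(purity, observed_baf, normal_baf=0.5):
    """Solve obs_baf = t_baf*purity + n_baf*(1-purity) for t_baf.

    Illustrative sketch only; purity must be greater than zero.
    """
    return (observed_baf - normal_baf * (1 - purity)) / purity


print(tumor_baf(1.0, 0.3))    # 0.3  (a pure sample is left unchanged)
print(tumor_baf(0.5, 0.375))  # 0.25
```

A result below 0 or above 1 would suggest that the purity estimate does not fit the observed frequency.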

coverage

Supporting functions for the ‘coverage’ command.

cnvlib.coverage.bam_total_reads(bam_fname)[source]

Count the total number of mapped reads in a BAM file.

Uses the BAM index to do this quickly.

cnvlib.coverage.bedcov(bed_fname, bam_fname, min_mapq)[source]

Calculate depth of all regions in a BED file via samtools (pysam) bedcov.

i.e. mean pileup depth across each region.

cnvlib.coverage.interval_coverages(bed_fname, bam_fname, by_count, min_mapq, processes)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.interval_coverages_count(bed_fname, bam_fname, min_mapq, procs=1)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.interval_coverages_pileup(bed_fname, bam_fname, min_mapq, procs=1)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.region_depth_count(bamfile, chrom, start, end, gene, min_mapq)[source]

Calculate depth of a region via pysam count.

i.e. counting the number of read starts in a region, then scaling for read length and region width to estimate depth.

Coordinates are 0-based, per pysam.

cnvlib.coverage.rm(path)[source]

Safely remove a file.

cnvlib.coverage.to_chunks(bed_fname, chunk_size=5000)[source]

Split the bed-file into chunks for parallelization

diagram

Chromosome diagram drawing functions.

This uses and abuses Biopython’s BasicChromosome module. It depends on ReportLab, too, so we isolate this functionality here so that the rest of CNVkit will run without it. (And also to keep the codebase tidy.)

cnvlib.diagram.bc_chromosome_draw_label(self, cur_drawing, label_name)[source]

Monkeypatch to Bio.Graphics.BasicChromosome.Chromosome._draw_label.

Draw a label for the chromosome. Mod: above the chromosome, not below.

cnvlib.diagram.bc_organism_draw(org, title, wrap=12)[source]

Modified copy of Bio.Graphics.BasicChromosome.Organism.draw.

Instead of stacking chromosomes horizontally (along the x-axis), stack rows vertically, then proceed with the chromosomes within each row.

Arguments:

  • title: The output title of the produced document.
cnvlib.diagram.build_chrom_diagram(features, chr_sizes, sample_id)[source]

Create a PDF of color-coded features on chromosomes.

cnvlib.diagram.create_diagram(cnarr, segarr, threshold, min_probes, outfname, is_reference_male, is_sample_female)[source]

Create the diagram.

export

Export CNVkit objects and files to other formats.

cnvlib.export.create_chrom_ids(segments)[source]

Map chromosome names to integers in the order encountered.

cnvlib.export.export_bed(segments, ploidy, is_reference_male, is_sample_female, label, show)[source]

Convert a copy number array to a BED-like DataFrame.

For each region in each sample (possibly filtered according to show), the columns are:

  • reference sequence name
  • start (0-indexed)
  • end
  • sample name or given label
  • integer copy number

By default (show="ploidy"), skip regions where copy number is the default ploidy, i.e. equal to 2 or the value set by --ploidy. If show="variant", skip regions where copy number is neutral, i.e. equal to the reference ploidy on autosomes, or half that on sex chromosomes.

cnvlib.export.export_nexus_basic(sample_fname)[source]

Biodiscovery Nexus Copy Number “basic” format.

Only represents one sample per file.

cnvlib.export.export_nexus_ogt(sample_fname, vcf_fname, sample_id, min_depth=20, min_weight=0.0)[source]

Biodiscovery Nexus Copy Number “Custom-OGT” format.

To create the b-allele frequencies column, alternate allele frequencies from the VCF are aligned to the .cnr file bins. Bins that contain no variants are left blank; if a bin contains multiple variants, then the frequencies are all “mirrored” to be above or below .5 (majority rules), then the median of those values is taken.

cnvlib.export.export_seg(sample_fnames)[source]

SEG format for copy number segments.

Segment breakpoints are not the same across samples, so samples are listed in serial with the sample ID as the left column.

cnvlib.export.export_theta(tumor_segs, normal_cn)[source]

Convert tumor segments and normal .cnr or reference .cnn to THetA input.

Follows the THetA segmentation import script but avoids repeating the pileups, since we already have the mean depth of coverage in each target bin.

The options for average depth of coverage and read length do not matter crucially for proper operation of THetA; increased read counts per bin simply increase the confidence of THetA’s results.

THetA2 input format is tabular, with columns:
ID, chrm, start, end, tumorCount, normalCount

where chromosome IDs (“chrm”) are integers 1 through 24.

cnvlib.export.export_theta_snps(varr)[source]

Generate THetA’s SNP per-allele read count “formatted.txt” files.

cnvlib.export.export_vcf(segments, ploidy, is_reference_male, is_sample_female, sample_id=None)[source]

Convert segments to Variant Call Format.

For now, only 1 sample per VCF. (Overlapping CNVs seem tricky.)

Spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf

cnvlib.export.fmt_cdt(sample_ids, table)[source]

Format as CDT.

cnvlib.export.fmt_gct(sample_ids, table)[source]
cnvlib.export.fmt_jtv(sample_ids, table)[source]

Format for Java TreeView.

cnvlib.export.merge_samples(filenames)[source]

Merge probe values from multiple samples into a 2D table (of sorts).

Input:
dict of {sample ID: (probes, values)}
Output:
list-of-tuples: (probe, log2 coverages...)
cnvlib.export.ref_means_nbins(tumor_segs, normal_cn)[source]

Extract segments’ reference mean log2 values and probe counts.

Code paths:

wt_mdn  wt_old  probes  norm -> norm, nbins
+       *       *       -       0,  wt_mdn
-       +       +       -       0,  wt_old * probes
-       +       -       -       0,  wt_old * size?
-       -       +       -       0,  probes
-       -       -       -       0,  size?

+       -       +       +       norm, probes
+       -       -       +       norm, bin counts
-       +       +       +       norm, probes
-       +       -       +       norm, bin counts
-       -       +       +       norm, probes
-       -       -       +       norm, bin counts
cnvlib.export.segments2vcf(segments, ploidy, is_reference_male, is_sample_female)[source]

Convert copy number segments to VCF records.

cnvlib.export.theta_read_counts(log2_ratio, nbins, avg_depth=500, avg_bin_width=200)[source]

Calculate segments’ read counts from log2-ratios.

Math:
nbases = read_length * read_count
and
nbases = bin_width * read_depth
where
read_depth = read_depth_ratio * avg_depth
So:
read_length * read_count = bin_width * read_depth
read_count = bin_width * read_depth / read_length
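
The arithmetic for a single bin can be sketched as follows. The depth ratio is assumed to be 2 raised to the log2 ratio, and read_length=100 is a hypothetical value chosen for illustration; scaling up to a whole segment by its number of bins is not shown.

```python
def bin_read_count(log2_ratio, avg_depth=500, bin_width=200, read_length=100):
    """Estimated read count for one bin from its log2 ratio.

    Illustrative sketch only; read_length is an assumed parameter.
    """
    # read_depth = read_depth_ratio * avg_depth
    read_depth = (2.0 ** log2_ratio) * avg_depth
    # read_count = bin_width * read_depth / read_length
    return bin_width * read_depth / read_length


print(bin_read_count(0.0))   # 1000.0
print(bin_read_count(1.0))   # 2000.0
print(bin_read_count(-1.0))  # 500.0
```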

fix

Supporting functions for the ‘fix’ command.

cnvlib.fix.apply_weights(cnarr, ref_matched, epsilon=0.0001)[source]

Calculate weights for each bin.

Weights are derived from:

  • bin sizes
  • average bin coverage depths in the reference
  • the “spread” column of the reference.
cnvlib.fix.center_by_window(cnarr, fraction, sort_key)[source]

Smooth out biases according to the trait specified by sort_key.

E.g. correct GC-biased bins by windowed averaging across similar-GC bins; or for similar interval sizes.

cnvlib.fix.edge_gains(target_sizes, gap_sizes, insert_size)[source]

Calculate coverage gain from neighboring baits’ flanking reads.

Letting i = insert size, t = target size, g = gap to neighboring bait, the gain of coverage due to a nearby bait, if g < i, is:

(i-g)^2 / 4it

If the neighbor flank extends beyond the target (t+g < i), reduce by:

(i-t-g)^2 / 4it

If a neighbor overlaps the target, treat it as adjacent (gap size 0).
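
For a single target and one neighboring bait, these formulas can be sketched with scalars as follows. The function name is hypothetical, and no array handling or input validation is attempted.

```python
def edge_gain(target_size, gap_size, insert_size):
    """Proportional coverage gain from one neighboring bait.

    Hypothetical scalar helper for illustration only; not part of cnvlib.
    """
    gap = max(0, gap_size)  # an overlapping neighbor counts as adjacent
    if gap >= insert_size:
        return 0.0  # neighbor too far away to contribute flanking reads
    # (i-g)^2 / 4it
    gain = (insert_size - gap) ** 2 / (4 * insert_size * target_size)
    if target_size + gap < insert_size:
        # the flank extends past the target: subtract (i-t-g)^2 / 4it
        gain -= ((insert_size - target_size - gap) ** 2
                 / (4 * insert_size * target_size))
    return gain


print(round(edge_gain(200, 50, 250), 6))   # 0.2
print(round(edge_gain(100, 50, 250), 6))   # 0.3
print(edge_gain(200, 300, 250))            # 0.0
```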

cnvlib.fix.edge_losses(target_sizes, insert_size)[source]

Calculate coverage losses at the edges of baited regions.

Letting i = insert size and t = target size, the proportional loss of coverage near the two edges of the baited region (combined) is:

i/2t

If the “shoulders” extend outside the bait (t < i), reduce by:

(i-t)^2 / 4it

on each side, or (i-t)^2 / 2it total.
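
A scalar sketch of the loss calculation, applying the combined correction for both sides when the target is smaller than the insert size. The function name is hypothetical.

```python
def edge_loss(target_size, insert_size):
    """Proportional coverage loss at the two edges of a baited region.

    Hypothetical scalar helper for illustration only; not part of cnvlib.
    """
    # i / 2t
    loss = insert_size / (2 * target_size)
    if target_size < insert_size:
        # shoulders extend outside the bait: subtract (i-t)^2 / 2it in total
        loss -= ((insert_size - target_size) ** 2
                 / (2 * insert_size * target_size))
    return loss


print(edge_loss(500, 250))            # 0.25
print(round(edge_loss(200, 250), 6))  # 0.6
```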

cnvlib.fix.get_edge_bias(cnarr, margin)[source]

Quantify the “edge effect” of the target tile and its neighbors.

The result is proportional to the change in the target’s coverage due to these edge effects, i.e. the expected loss of coverage near the target edges and, if there are close neighboring tiles, gain of coverage due to “spill over” reads from the neighbor tiles.

(This is not the actual change in coverage. This is just a tribute.)

cnvlib.fix.load_adjust_coverages(cnarr, ref_cnarr, skip_low, fix_gc, fix_edge, fix_rmask)[source]

Load and filter probe coverages; correct using reference and GC.

cnvlib.fix.mask_bad_bins(cnarr)[source]

Flag the bins with excessively low or inconsistent coverage.

Returns a bool array where True indicates bins that failed the checks.

cnvlib.fix.match_ref_to_sample(ref_cnarr, samp_cnarr)[source]

Filter the reference bins to match the sample (target or antitarget).

importers

Import from other formats to the CNVkit format.

cnvlib.importers.find_picard_files(file_and_dir_names)[source]

Search the given paths for ‘targetcoverage’ CSV files.

Per the convention we use in our Picard applets, the target coverage file names end with ‘.targetcoverage.csv’; anti-target coverages end with ‘.antitargetcoverage.csv’.

cnvlib.importers.parse_theta_results(fname)[source]

Parse THetA results into a data structure.

Columns: NLL, mu, C, p*

reference

Supporting functions for the ‘reference’ command.

cnvlib.reference.bed2probes(bed_fname)[source]

Create neutral-coverage probes from intervals.

cnvlib.reference.calculate_gc_lo(subseq)[source]

Calculate the GC and lowercase (RepeatMasked) content of a string.

cnvlib.reference.combine_probes(filenames, fa_fname, is_male_reference, skip_low, fix_gc, fix_edge, fix_rmask)[source]

Calculate the median coverage of each bin across multiple samples.

Input:
List of .cnn files, as generated by ‘coverage’ or ‘import-picard’.
fa_fname: FASTA file used to fill in the columns for GC and RepeatMasker genomic values.
Returns:
A single CopyNumArray summarizing the coverages of the input samples, including each bin’s “average” coverage, “spread” of coverages, and genomic GC content.
cnvlib.reference.get_fasta_stats(probes, fa_fname)[source]

Calculate GC and RepeatMasker content of each bin in the FASTA genome.

cnvlib.reference.reference2regions(reference, coord_only=False)[source]

Extract iterables of target and antitarget regions from a reference.

Like loading two BED files with ngfrills.parse_regions.

cnvlib.reference.warn_bad_probes(probes)[source]

Warn about target probes where coverage is poor.

Prints a formatted table to stderr.

reports

Supports the sub-commands breaks and gainloss.

Supporting functions for the text/tabular-reporting commands.

Namely: breaks, gainloss.

cnvlib.reports.gainloss_by_gene(cnarr, threshold, skip_low=False)[source]

Identify genes where average bin copy ratio value exceeds threshold.

NB: Must shift sex-chromosome values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.

cnvlib.reports.gainloss_by_segment(cnarr, segments, threshold, skip_low=False)[source]

Identify genes where segmented copy ratio exceeds threshold.

NB: Must shift sex-chromosome values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.

cnvlib.reports.get_breakpoints(intervals, segments, min_probes)[source]

Identify CBS segment breaks within the targeted intervals.

cnvlib.reports.get_gene_intervals(all_probes, ignore=('-', '.', 'CGH'))[source]

Tally genomic locations of each targeted gene.

Return a dict of chromosomes to a list of tuples: (gene name, start, end).

cnvlib.reports.group_by_genes(cnarr, skip_low)[source]

Group probe and coverage data by gene.

Return an iterable of genes, in chromosomal order, associated with their location and coverages:

[(gene, chrom, start, end, [coverages]), ...]

segmentation

Segmentation of copy number values.

cnvlib.segmentation.do_segmentation(cnarr, method, threshold=None, variants=None, skip_low=False, skip_outliers=10, save_dataframe=False, rlibpath=None, processes=1)[source]

Infer copy number segments from the given coverage table.

cnvlib.segmentation.drop_outliers(cnarr, width, factor)[source]

Drop outlier bins with log2 ratios too far from the trend line.

Outliers are log2 values that deviate from the rolling average by more than factor times the 90th quantile of absolute deviations, within a window of the given width. The 90th quantile is about 1.97 standard deviations if the log2 values are Gaussian, so this is similar to calling outliers factor * 1.97 standard deviations from the rolling mean. For a window size of 50, the breakdown point is 2.5 outliers within a window, which is plenty robust for our needs.

cnvlib.segmentation.repair_segments(segments, orig_probes)[source]

Post-process segmentation output.

  1. Ensure every chromosome has at least one segment.
  2. Ensure first and last segment ends match 1st/last bin ends (but keep log2 as-is).
cnvlib.segmentation.squash_segments(segments)[source]

Combine contiguous segments.

cnvlib.segmentation.transfer_fields(segments, cnarr, ignore=('-', '.', 'CGH'))[source]

Map gene names, weights, depths from cnarr bins to segarr segments.

Segment gene name is the comma-separated list of bin gene names. Segment weight is the sum of bin weights, and depth is the (weighted) mean of bin depths.
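
The three summaries can be sketched for the bins inside one segment. This sketch assumes each bin is a dict with gene, weight and depth keys; that gene names are de-duplicated in order of first appearance, that ignored names are dropped, and that "-" is the placeholder for an empty list are all assumptions for illustration.

```python
def summarize_segment(bins, ignore=("-", ".", "CGH")):
    """Summarize gene names, weight and depth of the bins in one segment.

    Illustrative sketch only; not the cnvlib implementation.
    """
    genes = []
    for b in bins:
        for name in b["gene"].split(","):
            # assumption: unique names, kept in order of first appearance
            if name not in ignore and name not in genes:
                genes.append(name)
    total_weight = sum(b["weight"] for b in bins)
    if total_weight > 0:
        # weighted mean of the bin depths
        depth = sum(b["depth"] * b["weight"] for b in bins) / total_weight
    else:
        depth = sum(b["depth"] for b in bins) / len(bins)
    return {"gene": ",".join(genes) if genes else "-",
            "weight": total_weight,
            "depth": depth}


bins = [
    {"gene": "EGFR", "weight": 1.0, "depth": 100.0},
    {"gene": "EGFR,EGFR-AS1", "weight": 3.0, "depth": 200.0},
    {"gene": "-", "weight": 0.0, "depth": 0.0},
]
print(summarize_segment(bins))
# {'gene': 'EGFR,EGFR-AS1', 'weight': 4.0, 'depth': 175.0}
```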

target

Transform bait intervals into targets more suitable for CNVkit.

cnvlib.target.add_refflat_names(region_rows, refflat_fname)[source]

Apply RefSeq gene names to a list of targeted regions.

cnvlib.target.assign_names(region_rows, refflat_fname, default_name='-')[source]

Replace the interval gene names with those at the same location in refFlat.txt.

cnvlib.target.emit(coords, names)[source]

Try filtering names again. Format the row for yielding.

cnvlib.target.filter_names(names, exclude=('mRNA', ))[source]

Remove less-meaningful accessions from the given set.

cnvlib.target.parse_refflat_line(line)[source]

Parse one line of refFlat.txt; return relevant info.

Pair up the exon start and end positions. Add strand direction to each chromosome as a “+”/“-” suffix (it will be stripped off later).

cnvlib.target.read_refflat_genes(fname)[source]

Parse genes; merge those with same name and overlapping regions.

Returns a dict of: {(chrom, strand): [(gene start, gene end, gene name), ...]}

cnvlib.target.shorten_labels(interval_rows)[source]

Reduce multi-accession interval labels to the minimum consistent.

So: BED or interval_list files have a label for every region. We want this to be a short, unique string, like the gene name. But if an interval list is instead a series of accessions, including additional accessions for sub-regions of the gene, we can extract a single accession that covers the maximum number of consecutive regions that share this accession.

e.g.

...
chr1 879125 879593 + mRNA|JX093079,ens|ENST00000342066,mRNA|JX093077,ref|SAMD11,mRNA|AF161376,mRNA|JX093104
chr1 880158 880279 + ens|ENST00000483767,mRNA|AF161376,ccds|CCDS3.1,ref|NOC2L
...

becomes:

chr1 879125 879593 + mRNA|AF161376
chr1 880158 880279 + mRNA|AF161376

cnvlib.target.shortest_name(names)[source]

Return the shortest trimmed name from the given set.

cnvlib.target.split_targets(region_rows, avg_size)[source]

Split large tiled intervals into smaller, consecutive targets.

For each of the tiled regions:

  • Divide into equal-size (tile_size/avg_size) portions
  • Emit the (chrom, start, end) coords of each portion

Bin the regions according to avg_size.
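The splitting step could look roughly like the sketch below. The rounding rule and the way boundaries are spread over the region are assumptions made for illustration, and `split_targets_sketch` is a hypothetical name.

```python
def split_targets_sketch(region_rows, avg_size):
    """Yield (chrom, start, end) sub-intervals of roughly avg_size each."""
    for chrom, start, end in region_rows:
        span = end - start
        # Number of portions: region size / avg_size, rounded, at least 1
        nbins = max(1, int(round(span / avg_size)))
        # Integer boundaries spread as evenly as possible across the region
        bounds = [start + (i * span) // nbins for i in range(nbins + 1)]
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            yield chrom, lo, hi


targets = list(split_targets_sketch([("chr1", 0, 1000), ("chr2", 100, 400)], 250))
```

A 1000-bp region with `avg_size=250` yields four consecutive 250-bp targets. The 300-bp region is close enough to one bin that it is emitted unchanged.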

Helper modules

core

CNV utilities.

cnvlib.core.assert_equal(msg, **values)[source]

Evaluate and compare two or more values for equality.

Sugar for a common assertion pattern. Saves re-evaluating (and retyping) the same values for comparison and error reporting.
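A helper with this behavior might be written as in the following sketch. It is not cnvlib's code: the name `assert_equal_sketch` and the exact message format are assumptions, and the message order relies on keyword arguments preserving their order (Python 3.7+).

```python
def assert_equal_sketch(msg, **values):
    """Raise ValueError unless all keyword values compare equal."""
    items = list(values.items())  # keyword order is preserved
    if not items:
        return
    first_value = items[0][1]
    if any(value != first_value for _key, value in items[1:]):
        # Report every name and value so the caller need not re-evaluate them
        details = ", ".join("%s = %r" % (key, value) for key, value in items)
        raise ValueError("%s: %s" % (msg, details))


# Passes silently because both values are 2
assert_equal_sketch("Mismatch", expected=2, saw=len(["xx", "yy"]))
```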

Example:

>>> assert_equal("Mismatch", expected=1, saw=len(['xx', 'yy']))
...
ValueError: Mismatch: expected = 1, saw = 2

cnvlib.core.call_quiet(*args)[source]

Safely run a command and get stdout; print stderr if there’s an error.

Like subprocess.check_output, but silent in the normal case where the command logs unimportant stuff to stderr. If there is an error, then the full error message(s) is shown in the exception message.
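One way to get this behavior with the standard library is sketched below. The name `call_quiet_sketch`, the exception type, and the message layout are assumptions. Only `subprocess.Popen` and `communicate` are standard calls.

```python
import subprocess
import sys


def call_quiet_sketch(*args):
    """Run a command; return its stdout, raising with stderr text on failure."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # Only surface stderr when the command actually failed
        raise RuntimeError("Command failed:\n$ %s\n\n%s"
                           % (" ".join(args), err.decode(errors="replace")))
    return out


# Stderr chatter from a successful command is captured and discarded
out = call_quiet_sketch(
    sys.executable, "-c",
    "import sys; sys.stderr.write('noise'); print('ok')")
```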

cnvlib.core.check_unique(items, title)[source]

Ensure all items in an iterable are identical; return that one item.

cnvlib.core.ensure_path(fname)[source]

Create dirs and move an existing file to avoid overwriting, if necessary.

If a file already exists at the given path, it is renamed with an integer suffix to clear the way.

cnvlib.core.fbase(fname)[source]

Strip directory and all extensions from a filename.

cnvlib.core.rbase(fname)[source]

Strip directory and final extension from a filename.
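The difference between fbase and rbase is easiest to see on a file name with two extensions. The sketch below uses only `os.path`. The function names and the sample path are hypothetical, and cnvlib's own versions may treat some extensions specially.

```python
import os.path


def rbase_sketch(fname):
    """Strip the directory and only the final extension."""
    return os.path.splitext(os.path.basename(fname))[0]


def fbase_sketch(fname):
    """Strip the directory and everything from the first dot onward."""
    return os.path.basename(fname).split(".", 1)[0]


path = "/data/Sample1.targetcoverage.cnn"  # made-up path
short_name = fbase_sketch(path)   # "Sample1"
longer_name = rbase_sketch(path)  # "Sample1.targetcoverage"
```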

cnvlib.core.safe_write(*args, **kwds)[source]

Write to a filename or file-like object with error handling.

If given a file name, open it. If the path includes directories that don’t exist yet, create them. If given a file-like object, just pass it through.

cnvlib.core.sorter_chrom(label)[source]

Create a sorting key from chromosome label.

Sort by integers first, then letters or strings. The prefix “chr” (case-insensitive), if present, is stripped automatically for sorting.

E.g. chr1 < chr2 < chr10 < chrX < chrY < chrM
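A sort key producing the order in this example could be built as in the sketch below. Placing X and Y ahead of other non-numeric names is inferred from the example ordering rather than taken from the cnvlib source, and `sorter_chrom_sketch` is a hypothetical name.

```python
def sorter_chrom_sketch(label):
    """Return a tuple key: numbered chromosomes, then X/Y, then the rest."""
    chrom = label[3:] if label.lower().startswith("chr") else label
    if chrom.isdecimal():
        return (0, int(chrom), "")   # numeric names sort by integer value
    if chrom in ("X", "Y"):
        return (1, 0, chrom)         # sex chromosomes come next
    return (2, 0, chrom)             # everything else, alphabetically


labels = ["chrM", "chr10", "chrX", "chr2", "chrY", "chr1"]
ordered = sorted(labels, key=sorter_chrom_sketch)
```

Comparing integers rather than strings is what puts chr2 before chr10.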

cnvlib.core.sorter_chrom_at(index)[source]

Create a sort key function that gets chromosome label at a list index.

cnvlib.core.temp_write_text(*args, **kwds)[source]

Save text to a temporary file.

NB: This won’t work on Windows because the file stays open.

cnvlib.core.write_dataframe(outfname, dframe, header=True)[source]

Write a pandas.DataFrame to a tabular file.

cnvlib.core.write_text(outfname, text, *more_texts)[source]

Write one or more strings (blocks of text) to a file.

cnvlib.core.write_tsv(outfname, rows, colnames=None)[source]

Write rows, with optional column header, to tabular file.

metrics

Robust metrics to evaluate performance of copy number estimates.

cnvlib.metrics.confidence_interval_bootstrap(bins, alpha, bootstraps=100, smoothed=True)[source]

Confidence interval for segment mean log2 value, estimated by bootstrap.
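For orientation, a plain percentile bootstrap of the mean can be sketched as follows. This is a generic illustration under stated assumptions: it takes a list of log2 values instead of a bins object, ignores bin weights, and does not model the `smoothed` option. The `seed` parameter is added here only to make the example reproducible.

```python
import random


def confidence_interval_bootstrap_sketch(values, alpha, bootstraps=100, seed=0):
    """Return (low, high) bounds of a (1 - alpha) interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(bootstraps))
    # Take the alpha/2 and 1 - alpha/2 percentiles of the bootstrap means
    lo_idx = int((alpha / 2) * (bootstraps - 1))
    hi_idx = int(round((1 - alpha / 2) * (bootstraps - 1)))
    return means[lo_idx], means[hi_idx]


values = [-0.25, 0.125, 0.0, 0.25, -0.125, 0.5]  # made-up log2 ratios
low, high = confidence_interval_bootstrap_sketch(values, alpha=0.05)
```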

cnvlib.metrics.ests_of_scale(deviations)[source]

Estimators of scale: standard deviation, MAD, biweight midvariance.

Calculates all of these values for an array of deviations and returns them as a tuple.

cnvlib.metrics.prediction_interval(bins, alpha)[source]

Prediction interval, estimated by percentiles.

cnvlib.metrics.segment_mean(cnarr, skip_low=False)[source]

Weighted average of bin log2 values.
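The weighted average itself is a one-line formula, sum(w_i * x_i) / sum(w_i). The sketch below takes plain lists rather than a CopyNumArray and does not model the `skip_low` option. The function name and arguments are hypothetical.

```python
def segment_mean_sketch(log2_values, weights=None):
    """Weighted mean of log2 values; unweighted if no weights are given."""
    if weights is None:
        weights = [1.0] * len(log2_values)
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    return sum(x * w for x, w in zip(log2_values, weights)) / total


# A heavily weighted bin at 0.0 pulls the mean toward 0
mean = segment_mean_sketch([0.0, 1.0], weights=[3.0, 1.0])
```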

ngfrills

NGS utilities.

parallel

Utilities for multi-core parallel processing.

class cnvlib.parallel.SerialFuture(result)[source]

Bases: future.types.newobject.newobject

Mimic the concurrent.futures.Future interface.

result()[source]

class cnvlib.parallel.SerialPool[source]

Bases: future.types.newobject.newobject

Mimic the concurrent.futures.Executor interface, but run in serial.

map(func, iterable)[source]

Just apply the function to iterable.

shutdown(wait=True)[source]

Do nothing.

submit(func, *args)[source]

Just call the function on the arguments.
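The pattern behind these two classes can be sketched in a few lines: a future that already holds its result, and a pool whose submit and map run the work immediately in the calling process. The class names below are hypothetical stand-ins, not the cnvlib classes.

```python
class SerialFutureSketch:
    """Holds an already-computed result behind a Future-like interface."""

    def __init__(self, result):
        self._result = result

    def result(self):
        return self._result


class SerialPoolSketch:
    """Executor-like object that runs every task immediately, in serial."""

    def submit(self, func, *args):
        # Call the function now and wrap the return value
        return SerialFutureSketch(func(*args))

    def map(self, func, iterable):
        return map(func, iterable)

    def shutdown(self, wait=True):
        pass  # nothing to clean up


pool = SerialPoolSketch()
future = pool.submit(pow, 2, 3)
mapped = list(pool.map(abs, [-1, 2]))
pool.shutdown()
```

Because the interface matches the calls used on a `concurrent.futures` executor, calling code can switch between serial and parallel execution without branching.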

cnvlib.parallel.pick_pool(*args, **kwds)[source]

params

Defines several constants used in the pipeline.

Hard-coded parameters for CNVkit. These should not change between runs.

plots

Plotting utilities.

cnvlib.plots.chromosome_sizes(probes, to_mb=False)[source]

Create an ordered mapping of chromosome names to sizes.

cnvlib.plots.cnv_on_chromosome(axis, probes, segments, genes, background_marker=None, do_trend=False, y_min=None, y_max=None)[source]

Draw a scatter plot of probe values with CBS calls overlaid.

Argument ‘genes’ is a list of tuples: (start, end, gene name)

cnvlib.plots.cnv_on_genome(axis, probes, segments, pad, do_trend=False, y_min=None, y_max=None)[source]

Plot coverages and CBS calls for all chromosomes on one plot.

cnvlib.plots.cvg2rgb(cvg, desaturate)[source]

Choose a shade of red or blue representing log2-coverage value.

cnvlib.plots.gene_coords_by_name(probes, names)[source]

Find the chromosomal position of each named gene in probes.

Returns a dict: {chromosome: [(start, end, gene name), ...]}

cnvlib.plots.gene_coords_by_range(probes, chrom, start, end, ignore=('-', '.', 'CGH'))[source]

Find the chromosomal position of all genes in a range.

Returns a dict: {chromosome: [(start, end, gene), ...]}

cnvlib.plots.group_snvs_by_segments(snv_posns, snv_freqs, segments, chrom=None)[source]

Group SNP allele frequencies by segment.

Return an iterable of: start, end, value.

cnvlib.plots.parse_range_text(text)[source]

Parse a chromosomal range specification.

Range spec string should look like chr1:1234-5678 or chr1:1234- or chr1:-5678, where missing start becomes 0 and missing end becomes None.
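A regular expression is enough to parse this syntax. The sketch below follows the rules stated above (missing start becomes 0, missing end becomes None). It does no coordinate conversion, which may differ from cnvlib, and the names are hypothetical.

```python
import re

# chromosome, colon, optional start, hyphen, optional end
_RANGE_RE = re.compile(r"^([^:]+):(\d*)-(\d*)$")


def parse_range_text_sketch(text):
    """Return (chromosome, start, end) parsed from e.g. 'chr1:1234-5678'."""
    match = _RANGE_RE.match(text.strip())
    if not match:
        raise ValueError("Invalid range spec: %r" % text)
    chrom, start, end = match.groups()
    return chrom, (int(start) if start else 0), (int(end) if end else None)


full = parse_range_text_sketch("chr1:1234-5678")
open_end = parse_range_text_sketch("chr1:1234-")
open_start = parse_range_text_sketch("chr1:-5678")
```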

cnvlib.plots.partition_by_chrom(chrom_snvs)[source]

Group the tumor shift values by chromosome (for statistical testing).

cnvlib.plots.plot_x_dividers(axis, chrom_sizes, pad)[source]

Plot vertical dividers and x-axis labels given the chromosome sizes.

Returns a table of the x-position offsets of each chromosome.

Draws vertical black lines between each chromosome, with padding. Labels each chromosome range with the chromosome name, centered in the region, under a tick. Sets the x-axis limits to the covered range.

cnvlib.plots.setup_chromosome(axis, probes=None, segments=None, variants=None, y_min=None, y_max=None, y_label=None)[source]

Configure axes for plotting a single chromosome’s data.

probes, segments, and variants should already be subsetted to the region that will be plotted.

cnvlib.plots.setup_genome(axis, probes, segments, variants, y_min=None, y_max=None)[source]

Configure axes for plotting a whole genome’s data.

cnvlib.plots.snv_on_chromosome(axis, variants, segments, genes, do_trend, do_boost=False)[source]

cnvlib.plots.snv_on_genome(axis, variants, chrom_sizes, segments, do_trend, pad, do_boost=False)[source]

Plot a scatter-plot of SNP chromosomal positions and shifts.

cnvlib.plots.test_loh(bins, alpha=0.0025)[source]

Test each chromosome’s SNP shifts and the combined others’.

The statistical test is Mann-Whitney, a one-sided non-parametric rank test for a difference in location (it compares rank distributions rather than means).

cnvlib.plots.unpack_range(a_range)[source]

Extract chromosome, start, end from a string or tuple.

Examples:

"chr1" -> ("chr1", None, None)
"chr1:100-123" -> ("chr1", 100, 123)
("chr1", 100, 123) -> ("chr1", 100, 123)

smoothing

Signal smoothing functions.

cnvlib.smoothing.check_inputs(x, width)[source]

Transform width into a half-window size.

width is either a fraction of the length of x or an integer size of the whole window. The output half-window size is truncated to the length of x if needed.

cnvlib.smoothing.fit_edges(x, y, wing, polyorder=3)[source]

Apply polynomial interpolation to the edges of y, in-place.

Calculates a polynomial fit (of order polyorder) of x within a window of width twice wing, then updates the smoothed values y in the half of the window closest to the edge.

cnvlib.smoothing.outlier_iqr(a, c=3.0)[source]

Detect outliers as a multiple of the IQR from the median.

By convention, “outliers” are points more than 1.5 * IQR from the median, and “extremes” or extreme outliers are those more than 3.0 * IQR.
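The rule as described can be sketched with the standard library's `statistics.quantiles` (Python 3.8+). The quartile interpolation method chosen here is an assumption, and `outlier_iqr_sketch` is a hypothetical name.

```python
import statistics


def outlier_iqr_sketch(values, c=3.0):
    """Return a list of booleans, True where a value is an outlier."""
    # Quartiles with linear interpolation between data points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    threshold = c * (q3 - q1)
    return [abs(v - median) > threshold for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
mask = outlier_iqr_sketch(data)
```

For this data the quartiles are 3.25, 5.5 and 7.75, so the IQR is 4.5 and only the value 100 lies more than 3.0 * IQR from the median.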

cnvlib.smoothing.outlier_mad_median(a)[source]

MAD-Median rule for detecting outliers.

Returns: a boolean array of the same size, where outlier indices are True.

X_i is an outlier if:

$$ \frac{|X_i - M|}{\mathrm{MAD} / 0.6745} > K \approx 2.24 $$

where $K = \sqrt{\chi^2_{0.975,1}}$, the square root of the 0.975 quantile of a chi-squared distribution with 1 degree of freedom.

This is a very robust rule with the highest possible breakdown point of 0.5.
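The rule translates directly into code. The sketch below uses only the `statistics` module and takes M to be the sample median. The function name and the default `k=2.24` (the rounded value quoted above) are choices made for this example.

```python
import statistics


def outlier_mad_median_sketch(values, k=2.24):
    """Return a list of booleans, True where |x - M| / (MAD / 0.6745) > k."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    scale = mad / 0.6745  # rescale MAD to estimate the standard deviation
    # Compare against k * scale instead of dividing, so MAD == 0 is safe
    return [abs(v - median) > k * scale for v in values]


mask = outlier_mad_median_sketch([1, 2, 3, 4, 100])
```

Here the median is 3 and the MAD is 1, so the cutoff is about 3.32 and only the value 100 is flagged.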

See:

  • Davies & Gather (1993) The Identification of Multiple Outliers.
  • Rand R. Wilcox (2012) Introduction to robust estimation and hypothesis testing. Ch.3: Estimating measures of location and scale.

cnvlib.smoothing.rolling_median(x, width)[source]

Rolling median with mirrored edges.
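Mirroring the edges means padding each end of the array with a reversed copy of its outermost values, so that every position has a full window. In the sketch below the half-window size `wing` is passed directly, whereas the documented function takes `width`. The name and exact padding convention are assumptions.

```python
import statistics


def rolling_median_sketch(x, wing):
    """Median over a centered window of 2 * wing + 1, with mirrored edges."""
    x = list(x)
    if wing < 1:
        return x
    if wing > len(x):
        raise ValueError("wing must not exceed the length of x")
    # Pad each end with a reversed copy of the wing outermost values
    padded = x[:wing][::-1] + x + x[-wing:][::-1]
    width = 2 * wing + 1
    return [statistics.median(padded[i:i + width]) for i in range(len(x))]


smoothed_values = rolling_median_sketch([1, 9, 2, 8, 3], wing=1)
```

With `wing=1` the padded sequence is [1, 1, 9, 2, 8, 3, 3], and the output has the same length as the input.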

cnvlib.smoothing.rolling_outlier_iqr(x, width, c=3.0)[source]

Detect outliers as a multiple of the IQR from the median.

By convention, “outliers” are points more than 1.5 * IQR from the median (~2 SD if values are normally distributed), and “extremes” or extreme outliers are those more than 3.0 * IQR (~4 SD).

cnvlib.smoothing.rolling_outlier_quantile(x, width, q, m)[source]

Return a boolean mask of outliers by multiples of a quantile in a window.

Outliers are the array elements outside m times the q’th quantile of deviations from the smoothed trend line, as calculated from the trend line residuals. (For example, take the magnitude of the 95th quantile times 5, and mark any elements greater than that value as outliers.)

This is the smoothing method used in BIC-seq (doi:10.1073/pnas.1110574108) with the parameters width=200, q=.95, m=5 for WGS.

cnvlib.smoothing.rolling_outlier_std(x, width, stdevs)[source]

Return a boolean mask of outliers by stdev within a rolling window.

Outliers are the array elements outside stdevs standard deviations from the smoothed trend line, as calculated from the trend line residuals.

cnvlib.smoothing.rolling_quantile(x, width, quantile)[source]

Rolling quantile (0–1) with mirrored edges.

cnvlib.smoothing.rolling_std(x, width)[source]

Rolling standard deviation with mirrored edges.

cnvlib.smoothing.smoothed(x, width, do_fit_edges=False)[source]

Smooth the values in x with the Kaiser windowed filter.

See: https://en.wikipedia.org/wiki/Kaiser_window

Parameters:

x : array-like
1-dimensional numeric data set.
width : float
Fraction of x’s total length to include in the rolling window (i.e. the proportional window width), or the integer size of the window.