cnvlib package

Module cnvlib contents

The one function exposed at the top level, read, loads a file in CNVkit’s BED-like tabular format and returns a CopyNumArray instance. For your own scripting, you can usually accomplish what you need using just the CopyNumArray and GenomicArray methods available on this returned object (see Core classes).

To load other file formats, see Tabular file I/O (tabio). To run functions equivalent to CNVkit commands within Python, see Interface to CNVkit sub-commands.

Core classes

The core objects used throughout CNVkit. The base class is GenomicArray from scikit-genome package. All of these classes wrap a pandas DataFrame instance, which is accessible through the .data attribute and can be used for any manipulations that aren’t already provided by methods in the wrapper class.

cnary

CNVkit’s core data structure, a copy number array.

class cnvlib.cnary.CopyNumArray(data_table, meta_dict=None)[source]

Bases: skgenome.gary.GenomicArray

An array of genomic intervals, treated like aCGH probes.

Required columns: chromosome, start, end, gene, log2

Optional columns: gc, rmask, spread, weight, probes

by_gene(ignore=('-', '.', 'CGH'))[source]

Iterate over probes grouped by gene name.

Group each series of intergenic bins as an “Antitarget” gene; any “Antitarget” bins within a gene are grouped with that gene.

Bins’ gene names are split on commas to accommodate overlapping genes and bins that cover multiple genes.

Parameters:ignore (list or tuple of str) – Gene names to treat as “Antitarget” bins instead of real genes, grouping these bins with the surrounding gene or intergenic region. These bins will still retain their name in the output.
Yields:tuple – Pairs of: (gene name, CNA of rows with same name)
center_all(estimator=<unbound method Series.median>, by_chrom=True, skip_low=False, verbose=False)[source]

Re-center log2 values to the autosomes’ average (in-place).

Parameters:
  • estimator (str or callable) – Function to estimate central tendency. If a string, must be one of ‘mean’, ‘median’, ‘mode’, ‘biweight’ (for biweight location). Median by default.
  • skip_low (bool) – Whether to drop very-low-coverage bins (via drop_low_coverage) before estimating the center value.
  • by_chrom (bool) – If True, first apply estimator to each chromosome separately, then apply estimator to the per-chromosome values, to reduce the impact of uneven targeting or extreme aneuploidy. Otherwise, apply estimator to all log2 values directly.
compare_sex_chromosomes(male_reference=False, skip_low=False)[source]

Compare coverage ratios of sex chromosomes versus autosomes.

Perform 4 Mood’s median tests of the log2 coverages on chromosomes X and Y, separately shifting for assumed male and female chromosomal sex. Compare the chi-squared values obtained to infer whether the male or female assumption fits the data better.

Parameters:
  • male_reference (bool) – Whether a male reference copy number profile was used to normalize the data. If so, a male sample should have log2 values of 0 on X and Y, and female +1 on X, deep negative (below -3) on Y. Otherwise, a male sample should have log2 values of -1 on X and 0 on Y, and female 0 on X, deep negative (below -3) on Y.
  • skip_low (bool) – If True, drop very-low-coverage bins (via drop_low_coverage) before comparing log2 coverage ratios. Included for completeness, but shouldn’t affect the result much since the M-W test is nonparametric and p-values are not used here.
Returns:

  • bool – True if the sample appears male.
  • dict – Calculated values used for the inference: relative log2 ratios of chromosomes X and Y versus the autosomes; the Mann-Whitney U values from each test; and ratios of U values for male vs. female assumption on chromosomes X and Y.

drop_low_coverage(verbose=False)[source]

Drop bins with extremely low log2 coverage or copy ratio values.

These are generally bins that had no reads mapped due to sample-specific issues. A very small log2 ratio or coverage value may have been substituted to avoid domain or divide-by-zero errors.

expect_flat_log2(is_male_reference=None)[source]

Get the uninformed expected copy ratios of each bin.

Create an array of log2 coverages like a “flat” reference.

This is a neutral copy ratio at each autosome (log2 = 0.0) and sex chromosomes based on whether the reference is male (XX or XY).

guess_xx(male_reference=False, verbose=True)[source]

Detect chromosomal sex; return True if a sample is probably female.

Uses compare_sex_chromosomes to calculate coverage ratios of the X and Y chromosomes versus autosomes.

Parameters:
  • male_reference (bool) – Was this sample normalized to a male reference copy number profile?
  • verbose (bool) – If True, print (i.e. log to console) the ratios of the log2 coverages of the X and Y chromosomes versus autosomes, the “maleness” ratio of male vs. female expectations for each sex chromosome, and the inferred chromosomal sex.
Returns:

True if the coverage ratios indicate the sample is female.

Return type:

bool

log2
residuals(segments=None)[source]

Difference in log2 value of each bin from its segment mean.

Parameters:segments (GenomicArray, CopyNumArray, or None) –

Determines the “mean” value to which self log2 values are relative:

  • If CopyNumArray, use the log2 values as the segment means to subtract.
  • If GenomicArray with no log2 values, group self by these ranges and subtract each group’s median log2 value.
  • If None, subtract each chromosome’s median.
Returns:Residual log2 values from self relative to segments; same length as self.
Return type:array
shift_xx(male_reference=False, is_xx=None)[source]

Adjust chrX log2 ratios to match the ploidy of the reference sex.

I.e. add 1 to chrX log2 ratios for a male sample vs. female reference, or subtract 1 for a female sample vs. male reference, so that chrX log2 values are comparable across samples with different chromosomal sex.

smoothed(window=None, by_arm=True)[source]

Smooth log2 values with a sliding window.

Account for chromosome and (optionally) centromere boundaries. Use bin weights if present.

Returns:Smoothed log2 values from self, the same length as self.
Return type:array
squash_genes(summary_func=<function biweight_location>, squash_antitarget=False, ignore=('-', '.', 'CGH'))[source]

Combine consecutive bins with the same targeted gene name.

Parameters:
  • summary_func (callable) – Function to summarize an array of log2 values to produce a new log2 value for a “squashed” (i.e. reduced) region. By default this is the biweight location, but you might want median, mean, max, min or something else in some cases.
  • squash_antitarget (bool) – If True, also reduce consecutive “Antitarget” bins into a single bin. Otherwise, keep “Antitarget” and ignored bins as they are in the output.
  • ignore (list or tuple of str) – Bin names to be treated as “Antitarget” instead of as unique genes.
Returns:

Another, usually smaller, copy of self with each gene’s bins reduced to a single bin with appropriate values.

Return type:

CopyNumArray

vary

An array of genomic intervals, treated as variant loci.

class cnvlib.vary.VariantArray(data_table, meta_dict=None)[source]

Bases: skgenome.gary.GenomicArray

An array of genomic intervals, treated as variant loci.

Required columns: chromosome, start, end, ref, alt

baf_by_ranges(ranges, summary_func=<function nanmedian>, above_half=None, tumor_boost=False)[source]

Aggregate variant (b-allele) frequencies in each given bin.

Get the average BAF in each of the bins of another genomic array: BAFs are mirrored above/below 0.5 (per above_half), grouped in each bin of ranges, and summarized into one value per bin with summary_func (default median).

Parameters:
  • ranges (GenomicArray or subclass) – Bins for grouping the variants in self.
  • above_half (bool) – The same as in mirrored_baf.
  • tumor_boost (bool) – The same as in mirrored_baf.
Returns:

Average b-allele frequency in each range; same length as ranges. May contain NaN values where no variants overlap a range.

Return type:

float array

heterozygous()[source]

Subset to only heterozygous variants.

Use ‘zygosity’ or ‘n_zygosity’ genotype values (if present) to exclude variants with value 0.0 or 1.0. If these columns are missing, or there are no heterozygous variants, then return the full (input) set of variants.

Returns:The subset of self with heterozygous genotype, or allele frequency between the specified thresholds.
Return type:VariantArray
mirrored_baf(above_half=None, tumor_boost=False)[source]

Mirrored B-allele frequencies (BAFs).

Parameters:
  • above_half (bool or None) – If specified, flip BAFs to be all above 0.5 (True) or below 0.5 (False), respectively, for consistency. Otherwise, if None, mirror in the direction of the majority of BAFs.
  • tumor_boost (bool) – Normalize tumor-sample allele frequencies to the matched normal sample’s allele frequencies.
Returns:

Mirrored b-allele frequencies, the same length as self. May contain NaN values.

Return type:

float array

tumor_boost()[source]

TumorBoost normalization of tumor-sample allele frequencies.

De-noises the signal for detecting LOH.

See: TumorBoost, Bengtsson et al. 2010

zygosity_from_freq(het_freq=0.0, hom_freq=1.0)[source]

Set zygosity (genotype) according to allele frequencies.

Creates or replaces ‘zygosity’ column if ‘alt_freq’ column is present, and ‘n_zygosity’ if ‘n_alt_freq’ is present.

Parameters:
  • het_freq (float) – Assign zygosity 0.5 (heterozygous), otherwise 0.0 (i.e. reference genotype), to variants with alt allele frequency of at least this value.
  • hom_freq (float) – Assign zygosity 1.0 (homozygous) to variants with alt allele frequency of at least this value.

Interface to CNVkit sub-commands

commands

The public API for each of the commands defined in the CNVkit workflow.

Command-line interface and corresponding API for CNVkit.

cnvlib.commands.do_target(bait_arr, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]

Transform bait intervals into targets more suitable for CNVkit.

cnvlib.commands.do_access(fa_fname, exclude_fnames=(), min_gap_size=5000, skip_noncanonical=True)[source]

List the locations of accessible sequence regions in a FASTA file.

cnvlib.commands.do_antitarget(targets, access=None, avg_bin_size=150000, min_bin_size=None)[source]

Derive off-targt (“antitarget”) bins from target regions.

cnvlib.commands.do_autobin(bam_fname, method, targets=None, access=None, bp_per_bin=100000.0, target_min_size=20, target_max_size=50000, antitarget_min_size=500, antitarget_max_size=1000000)[source]

Quickly calculate reasonable bin sizes from BAM read counts.

Parameters:
  • bam_fname (string) – BAM filename.
  • method (string) – One of: ‘wgs’ (whole-genome sequencing), ‘amplicon’ (targeted amplicon capture), ‘hybrid’ (hybridization capture).
  • targets (GenomicArray) – Targeted genomic regions (for ‘hybrid’ and ‘amplicon’).
  • access (GenomicArray) – Sequencing-accessible regions of the reference genome (for ‘hybrid’ and ‘wgs’).
  • bp_per_bin (int) – Desired number of sequencing read nucleotide bases mapped to each bin.
Returns:

((target depth, target avg. bin size),

(antitarget depth, antitarget avg. bin size))

Return type:

2-tuple of 2-tuples

cnvlib.commands.do_coverage(bed_fname, bam_fname, by_count=False, min_mapq=0, processes=1)[source]

Calculate coverage in the given regions from BAM read depths.

cnvlib.commands.do_reference(target_fnames, antitarget_fnames=None, fa_fname=None, male_reference=False, female_samples=None, do_gc=True, do_edge=True, do_rmask=True)[source]

Compile a coverage reference from the given files (normal samples).

cnvlib.commands.do_reference_flat(targets, antitargets=None, fa_fname=None, male_reference=False)[source]

Compile a neutral-coverage reference from the given intervals.

Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.

cnvlib.commands.do_fix(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]

Combine target and antitarget coverages and correct for biases.

cnvlib.commands.do_segmentation(cnarr, method, threshold=None, variants=None, skip_low=False, skip_outliers=10, min_weight=0, save_dataframe=False, rlibpath=None, rscript_path='Rscript', processes=1)[source]

Infer copy number segments from the given coverage table.

cnvlib.commands.do_call(cnarr, variants=None, method='threshold', ploidy=2, purity=None, is_reference_male=False, is_sample_female=False, filters=None, thresholds=(-1.1, -0.25, 0.2, 0.7))[source]
cnvlib.commands.do_scatter(cnarr, segments=None, variants=None, show_range=None, show_gene=None, do_trend=False, by_bin=False, window_width=1000000.0, y_min=None, y_max=None, antitarget_marker=None, segment_color='darkorange', title=None)[source]

Plot probe log2 coverages and segmentation calls together.

cnvlib.commands.do_heatmap(cnarrs, show_range=None, do_desaturate=False, by_bin=False)[source]

Plot copy number for multiple samples as a heatmap.

cnvlib.commands.do_breaks(probes, segments, min_probes=1)[source]

List the targeted genes in which a copy number breakpoint occurs.

cnvlib.commands.do_genemetrics(cnarr, segments=None, threshold=0.2, min_probes=3, skip_low=False, male_reference=False, is_sample_female=None)[source]

Identify targeted genes with copy number gain or loss.

cnvlib.commands.do_genemetrics(cnarr, segments=None, threshold=0.2, min_probes=3, skip_low=False, male_reference=False, is_sample_female=None)[source]

Identify targeted genes with copy number gain or loss.

cnvlib.commands.do_sex(cnarrs, is_male_reference)[source]

Guess samples’ sex from the relative coverage of chromosomes X and Y.

cnvlib.commands.do_sex(cnarrs, is_male_reference)[source]

Guess samples’ sex from the relative coverage of chromosomes X and Y.

cnvlib.commands.do_metrics(cnarrs, segments=None, skip_low=False)[source]

Compute coverage deviations and other metrics for self-evaluation.

cnvlib.commands.do_segmetrics(cnarr, segarr, location_stats=(), spread_stats=(), interval_stats=(), alpha=0.05, bootstraps=100)[source]

Compute segment-level metrics from bin-level log2 ratios.

cnvlib.commands.do_import_theta(segarr, theta_results_fname, ploidy=2)[source]
cnvlib.commands.do_import_rna(gene_count_fnames, in_format, gene_resource_fname, correlations_fname=None, normal_fnames=(), do_gc=True, do_txlen=True, max_log2=3)[source]

Convert a cohort of per-gene read counts to CNVkit .cnr format.

The expected data source is TCGA gene-level expression counts for individual samples, but other sources should be fine, too.

The following modules implement lower-level functionality specific to each of the CNVkit sub-commands.

access

List the locations of accessible sequence regions in a FASTA file.

Inaccessible regions, e.g. telomeres and centromeres, are masked out with N in the reference genome sequence; this script scans those to identify the coordinates of the accessible regions (those between the long spans of N’s).

cnvlib.access.do_access(fa_fname, exclude_fnames=(), min_gap_size=5000, skip_noncanonical=True)[source]

List the locations of accessible sequence regions in a FASTA file.

cnvlib.access.drop_noncanonical_contigs(region_tups)[source]

Drop contigs with noncanonical names.

region_tups is an iterable of (chrom, start, end) tuples.

Yield the same, but dropping noncanonical chrom.

cnvlib.access.get_regions(fasta_fname)[source]

Find accessible sequence regions (those not masked out with ‘N’).

cnvlib.access.join_regions(regions, min_gap_size)[source]

Filter regions, joining those separated by small gaps.

cnvlib.access.log_this(chrom, run_start, run_end)[source]

Log a coordinate range, then return it as a tuple.

antitarget

Supporting functions for the ‘antitarget’ command.

cnvlib.antitarget.compare_chrom_names(a_regions, b_regions)[source]
cnvlib.antitarget.do_antitarget(targets, access=None, avg_bin_size=150000, min_bin_size=None)[source]

Derive off-targt (“antitarget”) bins from target regions.

cnvlib.antitarget.drop_noncanonical_contigs(accessible, targets, verbose=True)[source]

Drop contigs that are not targeted or canonical chromosomes.

Antitargets will be binned over chromosomes that:

  • Appear in the sequencing-accessible regions of the reference genome sequence, and
  • Contain at least one targeted region, or
  • Are named like a canonical chromosome (1-22,X,Y for human)

This allows antitarget binning to pick up canonical chromosomes that do not contain any targets, as well as non-canonical or oddly named chromosomes that were targeted.

cnvlib.antitarget.get_antitargets(targets, accessible, avg_bin_size, min_bin_size)[source]

Generate antitarget intervals between/around target intervals.

Procedure:

  • Invert target intervals

  • Subtract the inverted targets from accessible regions

  • For each of the resulting regions:

    • Shrink by a fixed margin on each end
    • If it’s smaller than min_bin_size, skip
    • Divide into equal-size (region_size/avg_bin_size) portions
    • Emit the (chrom, start, end) coords of each portion
cnvlib.antitarget.guess_chromosome_regions(targets, telomere_size)[source]

Determine (minimum) chromosome lengths from target coordinates.

cnvlib.antitarget.is_canonical_contig_name(name)[source]

autobin

Estimate reasonable bin sizes from BAM read counts or depths.

cnvlib.autobin.average_depth(rc_table, read_length)[source]

Estimate the average read depth across the genome.

Returns:Median of the per-chromosome mean read depths, weighted by chromosome size.
Return type:float
cnvlib.autobin.do_autobin(bam_fname, method, targets=None, access=None, bp_per_bin=100000.0, target_min_size=20, target_max_size=50000, antitarget_min_size=500, antitarget_max_size=1000000)[source]

Quickly calculate reasonable bin sizes from BAM read counts.

Parameters:
  • bam_fname (string) – BAM filename.
  • method (string) – One of: ‘wgs’ (whole-genome sequencing), ‘amplicon’ (targeted amplicon capture), ‘hybrid’ (hybridization capture).
  • targets (GenomicArray) – Targeted genomic regions (for ‘hybrid’ and ‘amplicon’).
  • access (GenomicArray) – Sequencing-accessible regions of the reference genome (for ‘hybrid’ and ‘wgs’).
  • bp_per_bin (int) – Desired number of sequencing read nucleotide bases mapped to each bin.
Returns:

((target depth, target avg. bin size),

(antitarget depth, antitarget avg. bin size))

Return type:

2-tuple of 2-tuples

cnvlib.autobin.hybrid(rc_table, read_len, bam_fname, targets, access=None)[source]

Hybrid capture sequencing.

cnvlib.autobin.idxstats2ga(table, bam_fname)[source]
cnvlib.autobin.midsize_file(fnames)[source]

Select the median-size file from several given filenames.

If an even number of files is given, selects the file just below the median.

cnvlib.autobin.region_size_by_chrom(regions)[source]
cnvlib.autobin.sample_midsize_regions(regions, max_num)[source]

Randomly select a subset of up to max_num regions.

cnvlib.autobin.sample_region_cov(bam_fname, regions, max_num=100)[source]

Calculate read depth in a randomly sampled subset of regions.

cnvlib.autobin.shared_chroms(*tables)[source]

Intersection of DataFrame .chromosome values.

cnvlib.autobin.total_region_size(regions)[source]

Aggregate area of all genomic ranges in regions.

cnvlib.autobin.update_chrom_length(rc_table, regions)[source]

batch

The ‘batch’ command.

cnvlib.batch.batch_make_reference(normal_bams, target_bed, antitarget_bed, male_reference, fasta, annotate, short_names, target_avg_size, access_bed, antitarget_avg_size, antitarget_min_size, output_reference, output_dir, processes, by_count, method)[source]

Build the CN reference from normal samples, targets and antitargets.

cnvlib.batch.batch_run_sample(bam_fname, target_bed, antitarget_bed, ref_fname, output_dir, male_reference, plot_scatter, plot_diagram, rlibpath, by_count, skip_low, method, processes)[source]

Run the pipeline on one BAM file.

cnvlib.batch.batch_write_coverage(bed_fname, bam_fname, out_fname, by_count, processes)[source]

Run coverage on one sample, write to file.

call

Call copy number variants from segmented log2 ratios.

cnvlib.call.absolute_clonal(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]

Calculate absolute copy number values from segment or bin log2 ratios.

cnvlib.call.absolute_dataframe(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]

Absolute, expected and reference copy number in a DataFrame.

cnvlib.call.absolute_expect(cnarr, ploidy, is_sample_female)[source]

Absolute integer number of expected copies in each bin.

I.e. the given ploidy for autosomes, and XY or XX sex chromosome counts according to the sample’s specified chromosomal sex.

cnvlib.call.absolute_pure(cnarr, ploidy, is_reference_male)[source]

Calculate absolute copy number values from segment or bin log2 ratios.

cnvlib.call.absolute_reference(cnarr, ploidy, is_reference_male)[source]

Absolute integer number of reference copies in each bin.

I.e. the given ploidy for autosomes, 1 or 2 X according to the reference sex, and always 1 copy of Y.

cnvlib.call.absolute_threshold(cnarr, ploidy, thresholds, is_reference_male)[source]

Call integer copy number using hard thresholds for each level.

Integer values are assigned for log2 ratio values less than each given threshold value in sequence, counting up from zero. Above the last threshold value, integer copy numbers are called assuming full purity, diploidy, and rounding up.

Default thresholds follow this heuristic for calling CNAs in a tumor sample: For single-copy gains and losses, assume 50% tumor cell clonality (including normal cell contamination). Then:

R> log2(2:6 / 4)
-1.0  -0.4150375  0.0  0.3219281  0.5849625

Allowing for random noise of +/- 0.1, the cutoffs are:

DEL(0)  <  -1.1
LOSS(1) <  -0.25
GAIN(3) >=  +0.2
AMP(4)  >=  +0.7

For germline samples, better precision could be achieved with:

LOSS(1) <  -0.4
GAIN(3) >=  +0.3
cnvlib.call.do_call(cnarr, variants=None, method='threshold', ploidy=2, purity=None, is_reference_male=False, is_sample_female=False, filters=None, thresholds=(-1.1, -0.25, 0.2, 0.7))[source]
cnvlib.call.log2_ratios(cnarr, absolutes, ploidy, is_reference_male, min_abs_val=0.001, round_to_int=False)[source]

Convert absolute copy numbers to log2 ratios.

Optionally round copy numbers to integers.

Account for reference sex & ploidy of sex chromosomes.

cnvlib.call.rescale_baf(purity, observed_baf, normal_baf=0.5)[source]

Adjust B-allele frequencies for sample purity.

Math:

t_baf*purity + n_baf*(1-purity) = obs_baf
obs_baf - n_baf * (1-purity) = t_baf * purity
t_baf = (obs_baf - n_baf * (1-purity))/purity

coverage

Supporting functions for the ‘antitarget’ command.

cnvlib.coverage.bedcov(bed_fname, bam_fname, min_mapq)[source]

Calculate depth of all regions in a BED file via samtools (pysam) bedcov.

i.e. mean pileup depth across each region.

cnvlib.coverage.detect_bedcov_columns(text)[source]

Determine which ‘bedcov’ output columns to keep.

Format is the input BED plus a final appended column with the count of basepairs mapped within each row’s region. The input BED might have 3 columns (regions without names), 4 (named regions), or more (arbitrary columns after ‘gene’).

cnvlib.coverage.do_coverage(bed_fname, bam_fname, by_count=False, min_mapq=0, processes=1)[source]

Calculate coverage in the given regions from BAM read depths.

cnvlib.coverage.interval_coverages(bed_fname, bam_fname, by_count, min_mapq, processes)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.interval_coverages_count(bed_fname, bam_fname, min_mapq, procs=1)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.interval_coverages_pileup(bed_fname, bam_fname, min_mapq, procs=1)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.region_depth_count(bamfile, chrom, start, end, gene, min_mapq)[source]

Calculate depth of a region via pysam count.

i.e. counting the number of read starts in a region, then scaling for read length and region width to estimate depth.

Coordinates are 0-based, per pysam.

diagram

Chromosome diagram drawing functions.

This uses and abuses Biopython’s BasicChromosome module. It depends on ReportLab, too, so we isolate this functionality here so that the rest of CNVkit will run without it. (And also to keep the codebase tidy.)

cnvlib.diagram.bc_chromosome_draw_label(self, cur_drawing, label_name)[source]

Monkeypatch to Bio.Graphics.BasicChromosome.Chromosome._draw_label.

Draw a label for the chromosome. Mod: above the chromosome, not below.

cnvlib.diagram.bc_organism_draw(org, title, wrap=12)[source]

Modified copy of Bio.Graphics.BasicChromosome.Organism.draw.

Instead of stacking chromosomes horizontally (along the x-axis), stack rows vertically, then proceed with the chromosomes within each row.

Parameters:
  • org – The chromosome diagram object being modified.
  • title (str) – The output title of the produced document.
  • wrap (int) – Maximum number of chromosomes per row; the remainder will be wrapped to the next row(s).
cnvlib.diagram.build_chrom_diagram(features, chr_sizes, sample_id, title=None)[source]

Create a PDF of color-coded features on chromosomes.

cnvlib.diagram.create_diagram(cnarr, segarr, threshold, min_probes, outfname, title=None)[source]

Create the diagram.

export

Export CNVkit objects and files to other formats.

cnvlib.export.assign_ci_start_end(segarr, cnarr)[source]

Assign ci_start and ci_end fields to segments.

Values for each segment indicate the CI boundary points within that segment, i.e. the right CI boundary for the left-side breakpoint (segment start), and left CI boundary for the right-side breakpoint (segment end).

This is a little unintuitive because the CI refers to the breakpoint, not the segment, but we’re storing the info in an array of segments.

Calculation: Just use the boundaries of the bins left- and right-adjacent to each segment breakpoint.

cnvlib.export.export_bed(segments, ploidy, is_reference_male, is_sample_female, label, show)[source]

Convert a copy number array to a BED-like DataFrame.

For each region in each sample (possibly filtered according to show), the columns are:

  • reference sequence name
  • start (0-indexed)
  • end
  • sample name or given label
  • integer copy number

By default (show=”ploidy”), skip regions where copy number is the default ploidy, i.e. equal to 2 or the value set by –ploidy. If show=”variant”, skip regions where copy number is neutral, i.e. equal to the reference ploidy on autosomes, or half that on sex chromosomes.

cnvlib.export.export_gistic_markers(cnr_fnames)[source]

Generate a GISTIC 2.0 “markers” file from a set of .cnr files.

GISTIC documentation:

ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm http://genepattern.broadinstitute.org/ftp/distribution/genepattern/modules_public_server_doc/GISTIC2.pdf http://gdac.broadinstitute.org/runs/analyses__2013_05_23/reports/cancer/KICH-TP/CopyNumber_Gistic2/nozzle.html

The markers file identifies the marker names and positions of the markers in the original dataset (before segmentation). It is a three column, tab-delimited file with an optional header. The column headers are:

  1. Marker Name
  2. Chromosome
  3. Marker Position (in bases)

GISTIC also needs an accompanying SEG file generated from corresponding .cns files.

cnvlib.export.export_nexus_basic(cnarr)[source]

Biodiscovery Nexus Copy Number “basic” format.

Only represents one sample per file.

cnvlib.export.export_nexus_ogt(cnarr, varr, min_weight=0.0)[source]

Biodiscovery Nexus Copy Number “Custom-OGT” format.

To create the b-allele frequencies column, alterate allele frequencies from the VCF are aligned to the .cnr file bins. Bins that contain no variants are left blank; if a bin contains multiple variants, then the frequencies are all “mirrored” to be above or below .5 (majority rules), then the median of those values is taken.

cnvlib.export.export_seg(sample_fnames, chrom_ids=False)[source]

SEG format for copy number segments.

Segment breakpoints are not the same across samples, so samples are listed in serial with the sample ID as the left column.

cnvlib.export.export_theta(tumor_segs, normal_cn)[source]

Convert tumor segments and normal .cnr or reference .cnn to THetA input.

Follows the THetA segmentation import script but avoid repeating the pileups, since we already have the mean depth of coverage in each target bin.

The options for average depth of coverage and read length do not matter crucially for proper operation of THetA; increased read counts per bin simply increase the confidence of THetA’s results.

THetA2 input format is tabular, with columns:
ID, chrm, start, end, tumorCount, normalCount

where chromosome IDs (“chrm”) are integers 1 through 24.

cnvlib.export.export_theta_snps(varr)[source]

Generate THetA’s SNP per-allele read count “formatted.txt” files.

cnvlib.export.export_vcf(segments, ploidy, is_reference_male, is_sample_female, sample_id=None, cnarr=None)[source]

Convert segments to Variant Call Format.

For now, only 1 sample per VCF. (Overlapping CNVs seem tricky.)

Spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf

cnvlib.export.fmt_cdt(sample_ids, table)[source]

Format as CDT.

See:

cnvlib.export.fmt_gct(sample_ids, table)[source]
cnvlib.export.fmt_jtv(sample_ids, table)[source]

Format for Java TreeView.

cnvlib.export.merge_samples(filenames)[source]

Merge probe values from multiple samples into a 2D table (of sorts).

Input:
dict of {sample ID: (probes, values)}
Output:
list-of-tuples: (probe, log2 coverages…)
cnvlib.export.ref_means_nbins(tumor_segs, normal_cn)[source]

Extract segments’ reference mean log2 values and probe counts.

Code paths:

wt_mdn  wt_old  probes  norm -> norm, nbins
+       *       *       -       0,  wt_mdn
-       +       +       -       0,  wt_old * probes
-       +       -       -       0,  wt_old * size?
-       -       +       -       0,  probes
-       -       -       -       0,  size?

+       -       +       +       norm, probes
+       -       -       +       norm, bin counts
-       +       +       +       norm, probes
-       +       -       +       norm, bin counts
-       -       +       +       norm, probes
-       -       -       +       norm, bin counts
cnvlib.export.segments2vcf(segments, ploidy, is_reference_male, is_sample_female)[source]

Convert copy number segments to VCF records.

cnvlib.export.theta_read_counts(log2_ratio, nbins, avg_depth=500, avg_bin_width=200, read_len=100)[source]

Calculate segments’ read counts from log2-ratios.

Math:
nbases = read_length * read_count
and
nbases = bin_width * read_depth
where
read_depth = read_depth_ratio * avg_depth
So:
read_length * read_count = bin_width * read_depth read_count = bin_width * read_depth / read_length

fix

Supporting functions for the ‘fix’ command.

cnvlib.fix.apply_weights(cnarr, ref_matched, epsilon=0.0001)[source]

Calculate weights for each bin.

Weights are derived from:

  • bin sizes
  • average bin coverage depths in the reference
  • the “spread” column of the reference.
cnvlib.fix.center_by_window(cnarr, fraction, sort_key)[source]

Smooth out biases according to the trait specified by sort_key.

E.g. correct GC-biased bins by windowed averaging across similar-GC bins; or for similar interval sizes.

cnvlib.fix.do_fix(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]

Combine target and antitarget coverages and correct for biases.

cnvlib.fix.edge_gains(target_sizes, gap_sizes, insert_size)[source]

Calculate coverage gain from neighboring baits’ flanking reads.

Letting i = insert size, t = target size, g = gap to neighboring bait, the gain of coverage due to a nearby bait, if g < i, is:

.. math :: (i-g)^2 / 4it

If the neighbor flank extends beyond the target (t+g < i), reduce by:

.. math :: (i-t-g)^2 / 4it

If a neighbor overlaps the target, treat it as adjacent (gap size 0).

cnvlib.fix.edge_losses(target_sizes, insert_size)[source]

Calculate coverage losses at the edges of baited regions.

Letting i = insert size and t = target size, the proportional loss of coverage near the two edges of the baited region (combined) is:

\[i/2t\]

If the “shoulders” extend outside the bait $(t < i), reduce by:

\[(i-t)^2 / 4it\]

on each side, or (i-t)^2 / 2it total.

cnvlib.fix.get_edge_bias(cnarr, margin)[source]

Quantify the “edge effect” of the target tile and its neighbors.

The result is proportional to the change in the target’s coverage due to these edge effects, i.e. the expected loss of coverage near the target edges and, if there are close neighboring tiles, gain of coverage due to “spill over” reads from the neighbor tiles.

(This is not the actual change in coverage. This is just a tribute.)

cnvlib.fix.load_adjust_coverages(cnarr, ref_cnarr, skip_low, fix_gc, fix_edge, fix_rmask)[source]

Load and filter probe coverages; correct using reference and GC.

cnvlib.fix.mask_bad_bins(cnarr)[source]

Flag the bins with excessively low or inconsistent coverage.

Returns:A boolean array where True indicates bins that failed the checks.
Return type:np.array
cnvlib.fix.match_ref_to_sample(ref_cnarr, samp_cnarr)[source]

Filter the reference bins to match the sample (target or antitarget).

heatmap

The ‘heatmap’ command.

cnvlib.heatmap.do_heatmap(cnarrs, show_range=None, do_desaturate=False, by_bin=False)[source]

Plot copy number for multiple samples as a heatmap.

cnvlib.heatmap.set_colorbar(axis)[source]

importers

Import from other formats to the CNVkit format.

cnvlib.importers.do_import_picard(fname, too_many_no_coverage=100)[source]
cnvlib.importers.do_import_theta(segarr, theta_results_fname, ploidy=2)[source]
cnvlib.importers.parse_theta_results(fname)[source]

Parse THetA results into a data structure.

Columns: NLL, mu, C, p*

cnvlib.importers.unpipe_name(name)[source]

Fix the duplicated gene names Picard spits out.

Return a string containing the single gene name, sans duplications and pipe characters.

Picard CalculateHsMetrics combines the labels of overlapping intervals by joining all labels with ‘|’, e.g. ‘BRAF|BRAF’ – no two distinct targeted genes actually overlap, though, so these dupes are redundant. Meaningless target names are dropped, e.g. ‘CGH|FOO|-‘ resolves as ‘FOO’. In case of ambiguity, the longest name is taken, e.g. “TERT|TERT Promoter” resolves as “TERT Promoter”.

import_rna

cnvlib.import_rna.aggregate_gene_counts(filenames)[source]
cnvlib.import_rna.aggregate_rsem(fnames)[source]

Pull out the expected read counts from each RSEM file.

The format of RSEM’s *_rsem.genes.results output files is tab-delimited with a header row. We extract the Ensembl gene ID, expected read counts, and transcript lengths from each file.

Returns:
  • sample_counts (DataFrame) – Row index is Ensembl gene ID, column index is filename.
  • tx_lengths (Series) – Gene lengths.
cnvlib.import_rna.do_import_rna(gene_count_fnames, in_format, gene_resource_fname, correlations_fname=None, normal_fnames=(), do_gc=True, do_txlen=True, max_log2=3)[source]

Convert a cohort of per-gene read counts to CNVkit .cnr format.

The expected data source is TCGA gene-level expression counts for individual samples, but other sources should be fine, too.

metrics

Robust metrics to evaluate performance of copy number estimates.

cnvlib.metrics.do_metrics(cnarrs, segments=None, skip_low=False)[source]

Compute coverage deviations and other metrics for self-evaluation.

cnvlib.metrics.ests_of_scale(deviations)[source]

Estimators of scale: standard deviation, MAD, biweight midvariance.

Calculates all of these values for an array of deviations and returns them as a tuple.

cnvlib.metrics.zip_repeater(iterable, repeatable)[source]

Repeat a single segmentation to match the number of copy ratio inputs

reference

Supporting functions for the ‘reference’ command.

cnvlib.reference.bed2probes(bed_fname)[source]

Create a neutral-coverage CopyNumArray from a file of regions.

cnvlib.reference.calculate_gc_lo(subseq)[source]

Calculate the GC and lowercase (RepeatMasked) content of a string.

cnvlib.reference.combine_probes(filenames, fa_fname, is_male_reference, sexes, skip_low, fix_gc, fix_edge, fix_rmask)[source]

Calculate the median coverage of each bin across multiple samples.

Parameters:
  • filenames (list) – List of string filenames, corresponding to targetcoverage.cnn and antitargetcoverage.cnn files, as generated by ‘coverage’ or ‘import-picard’.
  • fa_fname (str) – Reference genome sequence in FASTA format, used to extract GC and RepeatMasker content of each genomic bin.
  • is_male_reference (bool) –
  • skip_low (bool) –
  • fix_gc (bool) –
  • fix_edge (bool) –
  • fix_rmask (bool) –
Returns:

One object summarizing the coverages of the input samples, including each bin’s “average” coverage, “spread” of coverages, and GC content.

Return type:

CopyNumArray

cnvlib.reference.do_reference(target_fnames, antitarget_fnames=None, fa_fname=None, male_reference=False, female_samples=None, do_gc=True, do_edge=True, do_rmask=True)[source]

Compile a coverage reference from the given files (normal samples).

cnvlib.reference.do_reference_flat(targets, antitargets=None, fa_fname=None, male_reference=False)[source]

Compile a neutral-coverage reference from the given intervals.

Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.

cnvlib.reference.fasta_extract_regions(fa_fname, intervals)[source]

Extract an iterable of regions from an indexed FASTA file.

Input: FASTA file name; iterable of (seq_id, start, end) (1-based) Output: iterable of string sequences.

cnvlib.reference.get_fasta_stats(cnarr, fa_fname)[source]

Calculate GC and RepeatMasker content of each bin in the FASTA genome.

cnvlib.reference.infer_sexes(cnn_fnames, is_male_reference)[source]

Map sample IDs to inferred chromosomal sex, where possible.

For samples where the source file is empty or does not include either sex chromosome, that sample ID will not be in the returned dictionary.

cnvlib.reference.reference2regions(refarr)[source]

Split reference into target and antitarget regions.

cnvlib.reference.warn_bad_bins(cnarr, max_name_width=50)[source]

Warn about target bins where coverage is poor.

Prints a formatted table to stderr.

reports

Supports the sub-commands breaks and genemetrics.

Supporting functions for the text/tabular-reporting commands.

Namely: breaks, genemetrics.

cnvlib.reports.do_breaks(probes, segments, min_probes=1)[source]

List the targeted genes in which a copy number breakpoint occurs.

cnvlib.reports.do_genemetrics(cnarr, segments=None, threshold=0.2, min_probes=3, skip_low=False, male_reference=False, is_sample_female=None)[source]

Identify targeted genes with copy number gain or loss.

cnvlib.reports.gene_metrics_by_gene(cnarr, threshold, skip_low=False)[source]

Identify genes where average bin copy ratio value exceeds threshold.

NB: Adjust the sample’s sex-chromosome log2 values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.

cnvlib.reports.gene_metrics_by_segment(cnarr, segments, threshold, skip_low=False)[source]

Identify genes where segmented copy ratio exceeds threshold.

In the output table, show each segment’s weight and probes as segment_weight and segment_probes, alongside the gene-level weight and probes.

NB: Adjust the sample’s sex-chromosome log2 values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.

cnvlib.reports.get_breakpoints(intervals, segments, min_probes)[source]

Identify segment breaks within the targeted intervals.

cnvlib.reports.get_gene_intervals(all_probes, ignore=('-', '.', 'CGH'))[source]

Tally genomic locations of each targeted gene.

Return a dict of chromosomes to a list of tuples: (gene name, starts, end), where gene name is a string, starts is a sorted list of probe start positions, and end is the last probe’s end position as an integer. (The endpoints are redundant since probes are adjacent.)

cnvlib.reports.group_by_genes(cnarr, skip_low)[source]

Group probe and coverage data by gene.

Return an iterable of genes, in chromosomal order, associated with their location and coverages:

[(gene, chrom, start, end, [coverages]), …]

scatter

Command-line interface and corresponding API for CNVkit.

cnvlib.scatter.choose_segment_color(segment, highlight_color, default_bright=True)[source]

Choose a display color based on a segment’s CNA status.

Uses the fields added by the ‘call’ command. If these aren’t present, use highlight_color for everything.

For sex chromosomes, some single-copy deletions or gains might not be highlighted, since sample sex isn’t used to infer the neutral ploidies.

cnvlib.scatter.chromosome_scatter(cnarr, segments, variants, show_range, show_gene, antitarget_marker, do_trend, by_bin, window_width, y_min, y_max, title, segment_color)[source]

Plot a specified region on one chromosome.

Possibilities:

     Options | Shown
------------ | --------
-c      | -g | Genes | Region
------- | -- | ----- | ------
-       | +  | given | auto: gene(s) + margin
chr     | -  | none  | whole chrom
chr     | +  | given | whole chrom
chr:s-e | -  | all   | given
chr:s-e | +  | given | given
cnvlib.scatter.cnv_on_chromosome(axis, probes, segments, genes, antitarget_marker=None, do_trend=False, x_limits=None, y_min=None, y_max=None, segment_color='darkorange')[source]

Draw a scatter plot of probe values with optional segments overlaid.

Parameters:genes (list) – Of tuples: (start, end, gene name)
cnvlib.scatter.cnv_on_genome(axis, probes, segments, do_trend=False, y_min=None, y_max=None, segment_color='darkorange')[source]

Plot bin ratios and/or segments for all chromosomes on one plot.

cnvlib.scatter.do_scatter(cnarr, segments=None, variants=None, show_range=None, show_gene=None, do_trend=False, by_bin=False, window_width=1000000.0, y_min=None, y_max=None, antitarget_marker=None, segment_color='darkorange', title=None)[source]

Plot probe log2 coverages and segmentation calls together.

cnvlib.scatter.genome_scatter(cnarr, segments=None, variants=None, do_trend=False, y_min=None, y_max=None, title=None, segment_color='darkorange')[source]

Plot all chromosomes, concatenated on one plot.

cnvlib.scatter.get_segment_vafs(variants, segments)[source]

Group SNP allele frequencies by segment.

Assume variants and segments were already subset to one chromosome.

Yields:tuple – (segment, value)
cnvlib.scatter.highlight_genes(axis, genes, y_posn)[source]

Show gene regions with background color and a text label.

cnvlib.scatter.select_range_genes(cnarr, segments, variants, show_range, show_gene, window_width)[source]

Determine which datapoints to show based on the given options.

Behaviors:

start/end   show_gene
   +           +       given region + genes; err if any gene outside it
   -           +       window +/- around genes
   +           -       given region, highlighting any genes within it
   -           -       whole chromosome, no genes

If show_range is a chromosome name only, no start/end positions, then the whole chromosome will be shown.

If region start/end coordinates are given and show_gene is ‘’ or ‘,’ (or all commas, etc.), then instead of highlighting all genes in the selection, no genes will be highlighted.

cnvlib.scatter.set_xlim_from(axis, probes=None, segments=None, variants=None)[source]

Configure axes for plotting a single chromosome’s data.

Parameters:
cnvlib.scatter.setup_chromosome(axis, y_min=None, y_max=None, y_label=None)[source]

Configure axes for plotting a single chromosome’s data.

cnvlib.scatter.snv_on_chromosome(axis, variants, segments, genes, do_trend, by_bin, segment_color)[source]
cnvlib.scatter.snv_on_genome(axis, variants, chrom_sizes, segments, do_trend, segment_color)[source]

Plot a scatter-plot of SNP chromosomal positions and shifts.

segmentation

Segmentation of copy number values.

cnvlib.segmentation.do_segmentation(cnarr, method, threshold=None, variants=None, skip_low=False, skip_outliers=10, min_weight=0, save_dataframe=False, rlibpath=None, rscript_path='Rscript', processes=1)[source]

Infer copy number segments from the given coverage table.

cnvlib.segmentation.drop_outliers(cnarr, width, factor)[source]

Drop outlier bins with log2 ratios too far from the trend line.

Outliers are the log2 values factor times the 90th quantile of absolute deviations from the rolling average, within a window of given width. The 90th quantile is about 1.97 standard deviations if the log2 values are Gaussian, so this is similar to calling outliers factor * 1.97 standard deviations from the rolling mean. For a window size of 50, the breakdown point is 2.5 outliers within a window, which is plenty robust for our needs.

cnvlib.segmentation.transfer_fields(segments, cnarr, ignore=('-', '.', 'CGH'))[source]

Map gene names, weights, depths from cnarr bins to segarr segments.

Segment gene name is the comma-separated list of bin gene names. Segment weight is the sum of bin weights, and depth is the (weighted) mean of bin depths.

Also: Post-process segmentation output.

  1. Ensure every chromosome has at least one segment.
  2. Ensure first and last segment ends match 1st/last bin ends (but keep log2 as-is).

segmetrics

Robust metrics to evaluate performance of copy number estimates.

cnvlib.segmetrics.calc_intervals(bins_log2s, weights, func)[source]

Compute a stat that yields intervals (low & high values)

cnvlib.segmetrics.confidence_interval_bootstrap(values, weights, alpha, bootstraps=100, smoothed=True)[source]

Confidence interval for segment mean log2 value, estimated by bootstrap.

cnvlib.segmetrics.do_segmetrics(cnarr, segarr, location_stats=(), spread_stats=(), interval_stats=(), alpha=0.05, bootstraps=100)[source]

Compute segment-level metrics from bin-level log2 ratios.

cnvlib.segmetrics.make_ci_func(alpha, bootstraps)[source]
cnvlib.segmetrics.make_pi_func(alpha)[source]

Prediction interval, estimated by percentiles.

cnvlib.segmetrics.segment_mean(cnarr, skip_low=False)[source]

Weighted average of bin log2 values.

target

Transform bait intervals into targets more suitable for CNVkit.

cnvlib.target.do_target(bait_arr, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]

Transform bait intervals into targets more suitable for CNVkit.

cnvlib.target.filter_names(names, exclude=('mRNA', ))[source]

Remove less-meaningful accessions from the given set.

cnvlib.target.shorten_labels(gene_labels)[source]

Reduce multi-accession interval labels to the minimum consistent.

So: BED or interval_list files have a label for every region. We want this to be a short, unique string, like the gene name. But if an interval list is instead a series of accessions, including additional accessions for sub-regions of the gene, we can extract a single accession that covers the maximum number of consecutive regions that share this accession.

e.g.:

...
mRNA|JX093079,ens|ENST00000342066,mRNA|JX093077,ref|SAMD11,mRNA|AF161376,mRNA|JX093104
ens|ENST00000483767,mRNA|AF161376,ccds|CCDS3.1,ref|NOC2L
...

becomes:

...
mRNA|AF161376
mRNA|AF161376
...
cnvlib.target.shortest_name(names)[source]

Return the shortest trimmed name from the given set.

Helper modules

cmdutil

Functions reused within command-line implementations.

cnvlib.cmdutil.load_het_snps(vcf_fname, sample_id=None, normal_id=None, min_variant_depth=20, zygosity_freq=None, tumor_boost=False)[source]
cnvlib.cmdutil.read_cna(infile, sample_id=None, meta=None)[source]

Read a CNVkit file (.cnn, .cnr, .cns) to create a CopyNumArray object.

cnvlib.cmdutil.verify_sample_sex(cnarr, sex_arg, is_male_reference)[source]
cnvlib.cmdutil.write_dataframe(outfname, dframe, header=True)[source]

Write a pandas.DataFrame to a tabular file.

cnvlib.cmdutil.write_text(outfname, text, *more_texts)[source]

Write one or more strings (blocks of text) to a file.

cnvlib.cmdutil.write_tsv(outfname, rows, colnames=None)[source]

Write rows, with optional column header, to tabular file.

core

CNV utilities.

cnvlib.core.assert_equal(msg, **values)[source]

Evaluate and compare two or more values for equality.

Sugar for a common assertion pattern. Saves re-evaluating (and retyping) the same values for comparison and error reporting.

Example:

>>> assert_equal("Mismatch", expected=1, saw=len(['xx', 'yy']))
...
ValueError: Mismatch: expected = 1, saw = 2
cnvlib.core.call_quiet(*args)[source]

Safely run a command and get stdout; print stderr if there’s an error.

Like subprocess.check_output, but silent in the normal case where the command logs unimportant stuff to stderr. If there is an error, then the full error message(s) is shown in the exception message.

cnvlib.core.check_unique(items, title)[source]

Ensure all items in an iterable are identical; return that one item.

cnvlib.core.ensure_path(fname)[source]

Create dirs and move an existing file to avoid overwriting, if necessary.

If a file already exists at the given path, it is renamed with an integer suffix to clear the way.

cnvlib.core.fbase(fname)[source]

Strip directory and all extensions from a filename.

cnvlib.core.temp_write_text(*args, **kwds)[source]

Save text to a temporary file.

NB: This won’t work on Windows b/c the file stays open.

descriptives

Robust estimators of central tendency and scale.

See:
https://en.wikipedia.org/wiki/Robust_measures_of_scale https://astropy.readthedocs.io/en/latest/_modules/astropy/stats/funcs.html
cnvlib.descriptives.biweight_location(a, **kwargs)[source]

Compute the biweight location for an array.

The biweight is a robust statistic for estimating the central location of a distribution.

cnvlib.descriptives.biweight_midvariance(a, **kwargs)[source]

Compute the biweight midvariance for an array.

The biweight midvariance is a robust statistic for determining the midvariance (i.e. the standard deviation) of a distribution.

See:

cnvlib.descriptives.gapper_scale(a, **kwargs)[source]

Scale estimator based on gaps between order statistics.

See:

  • Wainer & Thissen (1976)
  • Beers, Flynn, and Gebhardt (1990)
cnvlib.descriptives.interquartile_range(a, **kwargs)[source]

Compute the difference between the array’s first and third quartiles.

cnvlib.descriptives.mean_squared_error(a, **kwargs)[source]

Mean squared error (MSE).

By default, assume the input array a is the residuals/deviations/error, so MSE is calculated from zero. Another reference point for calculating the error can be specified with initial.

cnvlib.descriptives.median_absolute_deviation(a, **kwargs)[source]

Compute the median absolute deviation (MAD) of array elements.

The MAD is defined as: median(abs(a - median(a))).

See: https://en.wikipedia.org/wiki/Median_absolute_deviation

cnvlib.descriptives.modal_location(a, **kwargs)[source]

Return the modal value of an array’s values.

The “mode” is the location of peak density among the values, estimated using a Gaussian kernel density estimator.

Parameters:a (np.array) – A 1-D array of floating-point values, e.g. bin log2 ratio values.
cnvlib.descriptives.on_array(default=None)[source]

Ensure a is a numpy array with no missing/NaN values.

cnvlib.descriptives.on_weighted_array(default=None)[source]

Ensure a and w are equal-length numpy arrays with no NaN values.

For weighted descriptives – a is the array of values, w is weights.

  1. Drop any cells in a that are NaN from both a and w
  2. Replace any remaining NaN cells in w with 0.
cnvlib.descriptives.q_n(a, **kwargs)[source]

Rousseeuw & Croux’s (1993) Q_n, an alternative to MAD.

Qn := Cn first quartile of (|x_i - x_j|: i < j)

where Cn is a constant depending on n.

Finite-sample correction factors must be used to calibrate the scale of Qn for small-to-medium-sized samples.

n E[Qn] – —– 10 1.392 20 1.193 40 1.093 60 1.064 80 1.048 100 1.038 200 1.019
cnvlib.descriptives.weighted_mad(a, w, **kwargs)[source]

Median absolute deviation (MAD) with weights.

cnvlib.descriptives.weighted_median(a, w, **kwargs)[source]

Weighted median of a 1-D numeric array.

cnvlib.descriptives.weighted_std(a, w, **kwargs)[source]

Standard deviation with weights.

parallel

Utilities for multi-core parallel processing.

class cnvlib.parallel.SerialFuture(result)[source]

Bases: future.types.newobject.newobject

Mimic the concurrent.futures.Future interface.

result()[source]
class cnvlib.parallel.SerialPool[source]

Bases: future.types.newobject.newobject

Mimic the concurrent.futures.Executor interface, but run in serial.

map(func, iterable)[source]

Just apply the function to iterable.

shutdown(wait=True)[source]

Do nothing.

submit(func, *args)[source]

Just call the function on the arguments.

cnvlib.parallel.pick_pool(*args, **kwds)[source]
cnvlib.parallel.rm(path)[source]

Safely remove a file.

cnvlib.parallel.to_chunks(bed_fname, chunk_size=5000)[source]

Split a BED file into chunk_size-line parts for parallelization.

params

Defines several constants used in the pipeline.

Hard-coded parameters for CNVkit. These should not change between runs.

plots

Plotting utilities.

cnvlib.plots.chromosome_sizes(probes, to_mb=False)[source]

Create an ordered mapping of chromosome names to sizes.

cnvlib.plots.cvg2rgb(cvg, desaturate)[source]

Choose a shade of red or blue representing log2-coverage value.

cnvlib.plots.gene_coords_by_name(probes, names)[source]

Find the chromosomal position of each named gene in probes.

Returns:Of: {chromosome: [(start, end, gene name), …]}
Return type:dict
cnvlib.plots.gene_coords_by_range(probes, chrom, start, end, ignore=('-', '.', 'CGH'))[source]

Find the chromosomal position of all genes in a range.

Returns:Of: {chromosome: [(start, end, gene), …]}
Return type:dict
cnvlib.plots.get_repeat_slices(values)[source]

Find the location and size of each repeat in values.

cnvlib.plots.plot_x_dividers(axis, chrom_sizes, pad=None)[source]

Plot vertical dividers and x-axis labels given the chromosome sizes.

Draws vertical black lines between each chromosome, with padding. Labels each chromosome range with the chromosome name, centered in the region, under a tick. Sets the x-axis limits to the covered range.

Returns:A table of the x-position offsets of each chromosome.
Return type:OrderedDict
cnvlib.plots.translate_region_to_bins(region, bins)[source]

Map genomic coordinates to bin indices.

Return a tuple of (chrom, start, end), just like unpack_range.

cnvlib.plots.translate_segments_to_bins(segments, bins)[source]
cnvlib.plots.update_binwise_positions(cnarr, segments=None, variants=None)[source]

Convert start/end positions from genomic to bin-wise coordinates.

Instead of chromosomal basepairs, the positions indicate enumerated bins.

Revise the start and end values for all GenomicArray instances at once, where the cnarr bins are mapped to corresponding segments, and variants are grouped into cnarr bins as well – if multiple variants rows fall within a single bin, equally-spaced fractional positions are used.

Returns copies of the 3 input objects with revised start and end arrays.

cnvlib.plots.update_binwise_positions_simple(cnarr)[source]

rna

RNA expression levels quantified per gene.

Process per-gene expression levels, or the equivalent, by cohort.

cnvlib.rna.align_gene_info_to_samples(gene_info, sample_counts, tx_lengths, normal_ids)[source]

Align columns and sort.

Also calculate weights and add to gene_info as ‘weight’, along with transcript lengths as ‘tx_length’.

cnvlib.rna.attach_gene_info_to_cnr(sample_counts, sample_data_log2, gene_info, read_len=100)[source]

Join gene info to each sample’s log2 expression ratios.

Add the Ensembl gene info to the aggregated gene expected read counts, dropping genes that are not in the Ensembl table I.e., filter probes down to those genes that have names/IDs in the gene resource table.

Split out samples to individual .cnr files, keeping (most) gene info.

cnvlib.rna.before(char)[source]
cnvlib.rna.correct_cnr(cnr, do_gc, do_txlen, max_log2)[source]

Apply bias corrections & smoothing.

  • Biases: ‘gc’, ‘length’
  • Smoothing: rolling triangle window using weights.
cnvlib.rna.dedupe_ens_hugo(dframe)[source]

Emit the “best” index from a group of the same Ensembl ID.

The RSEM gene rows are the data of interest, and they’re associated with Ensembl IDs to indicate the transcribed gene being measured in each row. The BioMart table of gene info can have duplicate rows for Ensembl ID, which would cause duplicate rows in RSEM import if joined naively. So, we need to select a single row for each group of “gene info” rows with the same Ensembl ID (column ‘gene_id’).

The keys we can use for this are:

  • Entrez ID (‘entrez_id’)
  • Ensembl gene name (‘gene’)
  • Entrez gene name (‘hugo_gene’)

Entrez vs. Ensembl IDs and gene names are potentially many-to-many, e.g. CALM1/2/3. However, if we also require that the Ensembl and HUGO gene names match within a group, that (usually? always?) results in a unique row remaining.

(Example: CALM1/2/3 IDs are many-to-many, but of the 3 Entrez IDs associated with Ensembl’s CALM1, only 1 is called CALM1 in the Entrez/corr. table.)

Failing that (no matches or multiple matches), prefer a lower Entrez ID, because well-characterized, protein-coding genes tend to have been discovered and accessioned first.

cnvlib.rna.dedupe_ens_no_hugo(dframe)[source]

Deduplicate Ensembl ID using Entrez ID but not HUGO gene name.

cnvlib.rna.dedupe_tx(dframe)[source]

Deduplicate table rows to select one transcript length per gene.

Choose the lowest-number Entrez ID and the transcript with the greatest support (primarily) and length (secondarily).

This is done at the end of Ensembl ID deduplication, after filtering on gene names and for single-row tables.

Returns an integer row index corresponding to the original table.

cnvlib.rna.filter_probes(sample_counts)[source]

Filter probes to only include high-quality, transcribed genes.

The human genome has ~25,000 protein coding genes, yet the RSEM output includes ~58,000 rows. Some of these rows correspond to read counts over control probes (e.g. spike-in sequences). Some rows correspond to poorly mapped genes in contigs that have not been linked to the 24 chromosomes (e.g. HLA region). Others correspond to pseudo-genes and non-coding genes. For the purposes of copy number inference, these rows are best removed.

cnvlib.rna.load_cnv_expression_corr(fname)[source]
cnvlib.rna.load_gene_info(gene_resource, corr_fname, default_r=0.1)[source]

Read gene info from BioMart, and optionally TCGA, into a dataframe.

RSEM only outputs the Ensembl ID. We have used the BioMART tool in Ensembl to export a list of all Ensembl genes with a RefSeq mRNA (meaning it is high quality, curated, and bona fide gene) and resides on chromosomes 1-22, X, or Y. The tool also outputs the GC content of the gene, chromosomal coordinates of the gene, and HUGO gene symbol.

The gene resource input can be obtained from a resource bundle we provide (for reference genome hg19) or generated from BioMart.

cnvlib.rna.locate_entrez_dupes(dframe)[source]

In case the same Entrez ID was assigned to multiple Ensembl IDs.

Use HUGO vs. HGNC name again, similar to dedupe_hugo, but instead of emiting the indices of the rows to keep, emit the indices of the extra rows – their correlation values will then be filled in with a default value (np.nan or 0.1).

It will then be as if those genes hadn’t appeared in the TCGA tables at all, i.e. CNV-expression correlation is unknown, but all entries are still retained in the BioMart table (gene_info).

cnvlib.rna.normalize_read_depths(sample_depths, normal_ids)[source]

Normalize read depths within each sample.

Some samples have higher sequencing depth and therefore read depths need to be normalized within each sample. TCGA recommends an upper quartile normalization.

After normalizing read depths within each sample, normalize (median-center) within each gene, across samples.

Finally, convert to log2 ratios.

cnvlib.rna.safe_log2(values, min_log2)[source]

Transform values to log2 scale, safely handling zeros.

Parameters:
  • values (np.array) – Absolute-scale values to transform. Should be non-negative.
  • min_log2 (float) – Assign input zeros this log2-scaled value instead of -inf. Rather than hard-clipping, input values near 0 (especially below 2^min_log2) will be squeezed a bit above min_log2 in the log2-scale output.
cnvlib.rna.tsl2int(tsl)[source]

Convert an Ensembl Transcript Support Level (TSL) code to an integer.

The code has the format “tsl([1-5]|NA)”.

See: https://www.ensembl.org/Help/Glossary?id=492

samutil

BAM utilities.

cnvlib.samutil.bam_total_reads(bam_fname)[source]

Count the total number of mapped reads in a BAM file.

Uses the BAM index to do this quickly.

cnvlib.samutil.ensure_bam_index(bam_fname)[source]

Ensure a BAM file is indexed, to enable fast traversal & lookup.

For MySample.bam, samtools will look for an index in these files, in order:

  • MySample.bam.bai
  • MySample.bai
cnvlib.samutil.ensure_bam_sorted(bam_fname, by_name=False, span=50)[source]

Test if the reads in a BAM file are sorted as expected.

by_name=True: reads are expected to be sorted by query name. Consecutive read IDs are in alphabetical order, and read pairs appear together.

by_name=False: reads are sorted by position. Consecutive reads have increasing position.

cnvlib.samutil.get_read_length(bam, span=1000)[source]

Get (median) read length from first few reads in a BAM file.

Illumina reads all have the same length; other sequencers might not.

Parameters:
  • bam (str or pysam.Samfile) – Filename or pysam-opened BAM file.
  • n (int) – Number of reads used to calculate median read length.
cnvlib.samutil.idxstats(bam_fname, drop_unmapped=False)[source]

Get chromosome names, lengths, and number of mapped/unmapped reads.

Use the BAM index (.bai) to get the number of reads and size of each chromosome. Contigs with no mapped reads are skipped.

cnvlib.samutil.is_newer_than(target_fname, orig_fname)[source]

Compare file modification times.

segfilters

Filter copy number segments.

cnvlib.segfilters.ampdel(segarr)[source]

Merge segments by amplified/deleted/neutral copy number status.

Follow the clinical reporting convention:

  • 5+ copies (2.5-fold gain) is amplification
  • 0 copies is homozygous/deep deletion
  • CNAs of lesser degree are not reported

This is recommended only for selecting segments corresponding to actionable, usually focal, CNAs. Any real and potentially informative but lower-level CNAs will be dropped.

cnvlib.segfilters.bic(segarr)[source]

Merge segments by Bayesian Information Criterion.

See: BIC-seq (Xi 2011), doi:10.1073/pnas.1110574108

cnvlib.segfilters.ci(segarr)[source]

Merge segments by confidence interval (overlapping 0).

Segments with lower CI above 0 are kept as gains, upper CI below 0 as losses, and the rest with CI overlapping zero are collapsed as neutral.

cnvlib.segfilters.cn(segarr)[source]

Merge segments by integer copy number.

cnvlib.segfilters.enumerate_changes(levels)[source]

Assign a unique integer to each run of identical values.

Repeated but non-consecutive values will be assigned different integers.

cnvlib.segfilters.require_column(*colnames)[source]

Wrapper to coordinate the segment-filtering functions.

Verify that the given columns are in the CopyNumArray the wrapped function takes. Also log the number of rows in the array before and after filtration.

cnvlib.segfilters.sem(segarr)[source]

Merge segments by Standard Error of the Mean (SEM).

Use each segment’s SEM value to estimate a 95% confidence interval (via zscore). Segments with lower CI above 0 are kept as gains, upper CI below 0 as losses, and the rest with CI overlapping zero are collapsed as neutral.

cnvlib.segfilters.squash_by_groups(cnarr, levels, by_arm=False)[source]

Reduce CopyNumArray rows to a single row within each given level.

cnvlib.segfilters.squash_region(cnarr)[source]

Reduce a CopyNumArray to 1 row, keeping fields sensible.

Most fields added by the segmetrics command will be dropped.

smoothing

Signal smoothing functions.

cnvlib.smoothing.check_inputs(x, width, as_series=True)[source]

Transform width into a half-window size.

width is either a fraction of the length of x or an integer size of the whole window. The output half-window size is truncated to the length of x if needed.

cnvlib.smoothing.convolve_unweighted(window, signal, wing)[source]

Convolve a weighted window over array signal.

Input array is assumed padded by _pad_array; output has padding removed.

cnvlib.smoothing.convolve_weighted(window, signal, weights)[source]

Convolve a weighted window over a weighted signal array.

Source: https://stackoverflow.com/a/46232913/10049

cnvlib.smoothing.guess_window_size(x, weights=None)[source]

Choose a reasonable window size given the signal.

Inspired by Silverman’s rule: bandwidth is proportional to signal’s standard deviation and the length of the signal ^ 4/5.

cnvlib.smoothing.kaiser(x, width=None, weights=None, do_fit_edges=False)[source]

Smooth the values in x with the Kaiser windowed filter.

See: https://en.wikipedia.org/wiki/Kaiser_window

Parameters:
  • x (array-like) – 1-dimensional numeric data set.
  • width (float) – Fraction of x’s total length to include in the rolling window (i.e. the proportional window width), or the integer size of the window.
cnvlib.smoothing.outlier_iqr(a, c=3.0)[source]

Detect outliers as a multiple of the IQR from the median.

By convention, “outliers” are points more than 1.5 * IQR from the median, and “extremes” or extreme outliers are those more than 3.0 * IQR.

cnvlib.smoothing.outlier_mad_median(a)[source]

MAD-Median rule for detecting outliers.

X_i is an outlier if:

 | X_i - M |
_____________  > K ~= 2.24

 MAD / 0.6745

where $K = sqrt( X^2_{0.975,1} )$, the square root of the 0.975 quantile of a chi-squared distribution with 1 degree of freedom.

This is a very robust rule with the highest possible breakdown point of 0.5.

Returns:A boolean array of the same size as a, where outlier indices are True.
Return type:np.array

References

  • Davies & Gather (1993) The Identification of Multiple Outliers.
  • Rand R. Wilcox (2012) Introduction to robust estimation and hypothesis testing. Ch.3: Estimating measures of location and scale.
cnvlib.smoothing.rolling_median(x, width)[source]

Rolling median with mirrored edges.

cnvlib.smoothing.rolling_outlier_iqr(x, width, c=3.0)[source]

Detect outliers as a multiple of the IQR from the median.

By convention, “outliers” are points more than 1.5 * IQR from the median (~2 SD if values are normally distributed), and “extremes” or extreme outliers are those more than 3.0 * IQR (~4 SD).

cnvlib.smoothing.rolling_outlier_quantile(x, width, q, m)[source]

Detect outliers by multiples of a quantile in a window.

Outliers are the array elements outside m times the q’th quantile of deviations from the smoothed trend line, as calculated from the trend line residuals. (For example, take the magnitude of the 95th quantile times 5, and mark any elements greater than that value as outliers.)

This is the smoothing method used in BIC-seq (doi:10.1073/pnas.1110574108) with the parameters width=200, q=.95, m=5 for WGS.

Returns:A boolean array of the same size as x, where outlier indices are True.
Return type:np.array
cnvlib.smoothing.rolling_outlier_std(x, width, stdevs)[source]

Detect outliers by stdev within a rolling window.

Outliers are the array elements outside stdevs standard deviations from the smoothed trend line, as calculated from the trend line residuals.

Returns:A boolean array of the same size as x, where outlier indices are True.
Return type:np.array
cnvlib.smoothing.rolling_quantile(x, width, quantile)[source]

Rolling quantile (0–1) with mirrored edges.

cnvlib.smoothing.rolling_std(x, width)[source]

Rolling quantile (0–1) with mirrored edges.

cnvlib.smoothing.savgol(x, width=None, weights=None, order=None)[source]

Savitzky-Golay smoothing.