Python API (cnvlib package)

Module cnvlib contents

cnvlib.read(fname)[source]

Parse a file as a copy number or copy ratio table (.cnn, .cnr).

The one function exposed at the top level, read, loads a file in CNVkit’s BED-like tabular format and returns a CopyNumArray instance. For your own scripting, you can usually accomplish what you need using just the CopyNumArray and GenomicArray methods available on this returned object (see Core classes).

To load other file formats, see Tabular file I/O. To run functions equivalent to CNVkit commands within Python, see Interface to CNVkit sub-commands.

Core classes

The core objects used throughout CNVkit. The base class is GenomicArray. All of these classes wrap a pandas DataFrame instance, accessible through the .data attribute, which can be used for any manipulations that aren’t already provided by methods in the wrapper class.

gary

Base class for an array of annotated genomic regions.

class cnvlib.genome.gary.GenomicArray(data_table, meta_dict=None)[source]

Bases: future.types.newobject.newobject

An array of genomic intervals. Base class for genomic data structures.

Can represent most BED-like tabular formats with arbitrary additional columns.

add(other)[source]

Combine this array’s data with another GenomicArray (in-place).

Any optional columns must match between both arrays.

add_columns(**columns)[source]

Add the given columns to a copy of this GenomicArray.

Parameters:**columns (array) – Keyword arguments where the key is the new column’s name and the value is an array of the same length as self which will be the new column’s values.
Returns:A new instance of self with the given columns included in the underlying dataframe.
Return type:GenomicArray or subclass
as_columns(**columns)[source]

Wrap the named columns in this instance’s metadata.

as_dataframe(dframe)[source]

Wrap the given pandas dataframe in this instance’s metadata.

as_rows(rows)[source]

Wrap the given rows in this instance’s metadata.

autosomes(also=())[source]

Select chromosomes with integer names, ignoring any ‘chr’ prefixes.

by_chromosome()[source]

Iterate over bins grouped by chromosome name.

by_ranges(other, mode='inner', keep_empty=True)[source]

Group rows by another GenomicArray’s bin coordinate ranges.

For example, this can be used to group SNVs by CNV segments.

Bins in this array that fall outside the other array’s bins are skipped.

Parameters:
  • other (GenomicArray) – Another GA instance.
  • mode (string) –

    Determines what to do with bins that overlap a boundary of the selection. Possible values are:

    • inner: Drop the bins on the selection boundary, don’t emit them.
    • outer: Keep/emit those bins as they are.
    • trim: Emit those bins but alter their boundaries to match the selection; the bin start or end position is replaced with the selection boundary position.
  • keep_empty (bool) – Whether to also yield other bins with no overlapping bins in self, or to skip them when iterating.
Yields:

tuple – (other bin, GenomicArray of overlapping rows in self)
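
The three mode values can be illustrated on plain (start, end) tuples. This is a minimal sketch, not the cnvlib implementation: select_bins is a hypothetical helper, and it assumes half-open coordinates on a single chromosome.

```python
def select_bins(bins, start, end, mode="inner"):
    """Select (start, end) bins relative to the range [start, end).

    Hypothetical helper for illustration only; assumes half-open
    coordinates and a single chromosome.
    """
    if mode not in ("inner", "outer", "trim"):
        raise ValueError("Unknown mode: %r" % (mode,))
    selected = []
    for b_start, b_end in bins:
        if b_end <= start or b_start >= end:
            continue  # No overlap with the selection at all
        straddles = b_start < start or b_end > end
        if not straddles:
            selected.append((b_start, b_end))
        elif mode == "outer":
            selected.append((b_start, b_end))  # Keep the bin as it is
        elif mode == "trim":
            # Replace the bin boundary with the selection boundary
            selected.append((max(b_start, start), min(b_end, end)))
        # mode == "inner": the straddling bin is dropped
    return selected


bins = [(0, 100), (100, 200), (200, 300), (300, 400)]
print(select_bins(bins, 150, 300, "inner"))  # [(200, 300)]
print(select_bins(bins, 150, 300, "outer"))  # [(100, 200), (200, 300)]
print(select_bins(bins, 150, 300, "trim"))   # [(150, 200), (200, 300)]
```

Only the bin (100, 200) straddles the selection boundary here, so it is the only bin whose treatment differs between the three modes.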

chromosome
concat(others)[source]

Concatenate several GenomicArrays, keeping this array’s metadata.

This array’s data table is not implicitly included in the result.

coords(also=())[source]

Iterate over plain coordinates of each bin: chromosome, start, end.

Parameters:
  • also (str, or iterable of strings) – Also include these columns from self, in addition to chromosome, start, and end.

Example, yielding rows in BED format:

>>> probes.coords(also=["gene", "strand"])

copy()[source]

Create an independent copy of this object.

cut(other, combine=None)[source]

Split this array’s regions at the boundaries in other.

drop_extra_columns()[source]

Remove any optional columns from this GenomicArray.

Returns:A new copy with only the minimal set of columns required by the class (e.g. chromosome, start, end for GenomicArray; may be more for subclasses).
Return type:GenomicArray or subclass
end
filter(func=None, **kwargs)[source]

Take a subset of rows where the given condition is true.

Parameters:
  • func (callable) – A boolean function which will be applied to each row to keep rows where the result is True.
  • **kwargs (string) – Keyword arguments like chromosome="chr7" or gene="Background", which will keep rows where the keyed field equals the specified value.
Returns:

Subset of self where the specified condition is True.

Return type:

GenomicArray

flatten(combine=None)[source]

Split this array’s regions where they overlap.

classmethod from_columns(columns, meta_dict=None)[source]

Create a new instance from column arrays, given as a dict.

classmethod from_rows(rows, columns=None, meta_dict=None)[source]

Create a new instance from a list of rows, as tuples or arrays.

in_range(chrom=None, start=None, end=None, mode='inner')[source]

Get the GenomicArray portion within the given genomic range.

Parameters:
  • chrom (str or None) – Chromosome name to select. Use None if self has only one chromosome.
  • start (int or None) – Start coordinate of range to select, in 0-based coordinates. If None, start from 0.
  • end (int or None) – End coordinate of range to select. If None, select to the end of the chromosome.
  • mode (str) – As in by_ranges: outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.
Returns:

The subset of self enclosed by the specified range.

Return type:

GenomicArray

in_ranges(chrom=None, starts=None, ends=None, mode='inner')[source]

Get the GenomicArray portion within the specified ranges.

Similar to in_range, but concatenating the selections of all the regions specified by the starts and ends arrays.

Parameters:
  • chrom (str or None) – Chromosome name to select. Use None if self has only one chromosome.
  • starts (int array, or None) – Start coordinates of ranges to select, in 0-based coordinates. If None, start from 0.
  • ends (int array, or None) – End coordinates of ranges to select. If None, select to the end of the chromosome. If starts and ends are both specified, they must be arrays of equal length.
  • mode (str) – As in by_ranges: outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.
Returns:

Concatenation of all the subsets of self enclosed by the specified ranges.

Return type:

GenomicArray

into_ranges(other, column, default, summary_func=None)[source]

Re-bin values from column into the corresponding ranges in other.

Match overlapping/intersecting rows from other to each row in self. Then, within each range in other, extract the value(s) from column in self, using the function summary_func to produce a single value if multiple bins in self map to a single range in other.

For example, group SNVs (self) by CNV segments (other) and calculate the median (summary_func) of each SNV group’s allele frequencies.

Parameters:
  • other (GenomicArray) – Bins to
  • column (string) – Column name in self to extract values from.
  • default – Value to assign to indices in other that do not overlap any bins in self. Type should be the same as or compatible with the output field specified by column, or the output of summary_func.
  • summary_func (callable, dict of string-to-callable, or None) –

    Specify how to reduce 1 or more rows in self into a single value for the corresponding row in other.

    • If callable, apply it to the column field of each group of rows in self.
    • If a single-element dict of column name to callable, apply the callable to that field in self instead of column.
    • If None, use an appropriate summarizing function for the datatype of column in self (e.g. median of numbers, concatenation of strings).
    • If some other value, assign that value to other wherever there is an overlap.
Returns:

The extracted and summarized values from self corresponding to other’s genomic ranges, the same length as other.

Return type:

pd.Series
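
The re-binning idea described above (group SNVs by segment, then take the median of each group) can be sketched with the standard library. This is an illustration under stated assumptions, not the cnvlib implementation: rebin is a hypothetical helper, all intervals are on one chromosome, and coordinates are half-open.

```python
from statistics import median


def rebin(points, ranges, default, summary_func=median):
    """Summarize point values within each range.

    points: list of (start, end, value) tuples, playing the role of self.
    ranges: list of (start, end) tuples, playing the role of other.
    Returns one value per range; ranges with no overlapping points
    receive `default`.
    """
    summarized = []
    for r_start, r_end in ranges:
        values = [value for (p_start, p_end, value) in points
                  if p_start < r_end and p_end > r_start]
        summarized.append(summary_func(values) if values else default)
    return summarized


# Hypothetical SNV allele frequencies and CNV segments
snvs = [(10, 11, 0.4), (50, 51, 0.6), (70, 71, 0.5), (250, 251, 0.9)]
segments = [(0, 100), (100, 200), (200, 300)]
print(rebin(snvs, segments, default=None))  # [0.5, None, 0.9]
```

The result has the same length as the list of ranges, matching the return value described above.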

keep_columns(columns)[source]

Extract a subset of columns, reusing this instance’s metadata.

labels()[source]
merge(bp=0, stranded=False, combine=None)[source]

Merge adjacent or overlapping regions into single rows.

Similar to ‘bedtools merge’.
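
As a rough sketch of what merging adjacent or overlapping regions means, the hypothetical helper below works on plain (start, end) tuples from a single chromosome. It assumes that bp is the maximum gap allowed between regions that are still merged (as with the -d option of ‘bedtools merge’); that reading of bp is an assumption, and strandedness and column combining are ignored.

```python
def merge_intervals(intervals, bp=0):
    """Merge intervals that overlap or lie within `bp` bases of each other.

    Hypothetical helper; intervals are (start, end) tuples on one
    chromosome, in any order.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + bp:
            # Extend the previous region instead of starting a new one
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


regions = [(0, 10), (10, 20), (25, 30), (5, 8)]
print(merge_intervals(regions))        # [(0, 20), (25, 30)]
print(merge_intervals(regions, bp=5))  # [(0, 30)]
```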

resize_ranges(bp, chrom_sizes=None)[source]

Resize each genomic bin by a fixed number of bases at each end.

Bin ‘start’ values have a minimum of 0, and chrom_sizes can specify each chromosome’s maximum ‘end’ value.

Similar to ‘bedtools slop’.

Parameters:
  • bp (int) – Number of bases in each direction to expand or shrink each bin. Applies to ‘start’ and ‘end’ values symmetrically, and may be positive (expand) or negative (shrink).
  • chrom_sizes (dict of string-to-int) – Chromosome name to length in base pairs. If given, all chromosomes in self must be included.
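
The clamping behaviour described above can be sketched on plain tuples. This is a hypothetical stand-alone helper, not the cnvlib method; it does not handle bins that shrink to zero or negative width when bp is negative.

```python
def resize_bins(bins, bp, chrom_sizes=None):
    """Expand (bp > 0) or shrink (bp < 0) each bin at both ends.

    bins: list of (chromosome, start, end) tuples.
    chrom_sizes: optional dict of chromosome name to length in bases.
    """
    resized = []
    for chrom, start, end in bins:
        new_start = max(0, start - bp)  # 'start' has a minimum of 0
        new_end = end + bp
        if chrom_sizes is not None:
            # 'end' may not exceed the chromosome length
            new_end = min(new_end, chrom_sizes[chrom])
        resized.append((chrom, new_start, new_end))
    return resized


bins = [("chr1", 50, 150), ("chr1", 900, 980)]
print(resize_bins(bins, 100, {"chr1": 1000}))
# [('chr1', 0, 250), ('chr1', 800, 1000)]
print(resize_bins(bins, -10))
# [('chr1', 60, 140), ('chr1', 910, 970)]
```
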
static row2label(row)[source]
sample_id
shuffle()[source]

Randomize the order of bins in this array (in-place).

sort()[source]

Sort this array’s bins in-place, with smart chromosome ordering.

sort_columns()[source]

Sort this array’s columns in-place, per class definition.

squash(combine=None)[source]

Combine some groups of rows, by some criteria, into single rows.

start
subdivide(avg_size, min_size=0, verbose=False)[source]

Split this array’s regions into roughly equal-sized sub-regions.

subtract(other)[source]

Remove the overlapping regions in other from this array.

total_range_size()[source]

Total number of bases covered by all (merged) regions.

cnary

CNVkit’s core data structure, a copy number array.

class cnvlib.cnary.CopyNumArray(data_table, meta_dict=None)[source]

Bases: cnvlib.genome.gary.GenomicArray

An array of genomic intervals, treated like aCGH probes.

Required columns: chromosome, start, end, gene, log2

Optional columns: gc, rmask, spread, weight, probes

by_gene(ignore=('-', '.', 'CGH'))[source]

Iterate over probes grouped by gene name.

Group each series of intergenic bins as a “Background” gene; any “Background” bins within a gene are grouped with that gene.

Bins’ gene names are split on commas to accommodate overlapping genes and bins that cover multiple genes.

Parameters:ignore (list or tuple of str) – Gene names to treat as “Background” bins instead of real genes, grouping these bins with the surrounding gene or background region. These bins will still retain their name in the output.
Yields:tuple – Pairs of: (gene name, CNA of rows with same name)
center_all(estimator=<unbound method Series.median>, skip_low=False, by_chrom=True)[source]

Re-center log2 values to the autosomes’ average (in-place).

Parameters:
  • estimator (str or callable) – Function to estimate central tendency. If a string, must be one of ‘mean’, ‘median’, ‘mode’, ‘biweight’ (for biweight location). Median by default.
  • skip_low (bool) – Whether to drop very-low-coverage bins (via drop_low_coverage) before estimating the center value.
  • by_chrom (bool) – If True, first apply estimator to each chromosome separately, then apply estimator to the per-chromosome values, to reduce the impact of uneven targeting or extreme aneuploidy. Otherwise, apply estimator to all log2 values directly.
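
The effect of by_chrom can be seen in a small stand-alone sketch. The helper below is hypothetical and uses plain (chromosome, log2) tuples rather than a CopyNumArray; it assumes the input contains autosomal bins only and returns new values instead of modifying anything in-place.

```python
from statistics import median


def center_log2(bins, by_chrom=True, estimator=median):
    """Subtract an estimated center from every log2 value.

    bins: list of (chromosome, log2) tuples.
    """
    if by_chrom:
        per_chrom = {}
        for chrom, log2 in bins:
            per_chrom.setdefault(chrom, []).append(log2)
        # Estimate each chromosome first, then summarize those estimates
        center = estimator([estimator(vals) for vals in per_chrom.values()])
    else:
        center = estimator([log2 for _, log2 in bins])
    return [(chrom, log2 - center) for chrom, log2 in bins]


# chr1 is heavily targeted and elevated; chr2 and chr3 have one bin each
bins = [("chr1", 1.0)] * 5 + [("chr2", 0.0), ("chr3", 0.5)]
print(center_log2(bins, by_chrom=True)[0])   # ('chr1', 0.5)
print(center_log2(bins, by_chrom=False)[0])  # ('chr1', 0.0)
```

With by_chrom=False the many chr1 bins dominate the estimate and the gain on chr1 is centered away; with by_chrom=True each chromosome contributes one value, so the gain remains visible.
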
compare_sex_chromosomes(male_reference=False, skip_low=False)[source]

Compare coverage ratios of sex chromosomes versus autosomes.

Perform 4 Mood’s median tests of the log2 coverages on chromosomes X and Y, separately shifting for assumed male and female chromosomal sex. Compare the chi-squared values obtained to infer whether the male or female assumption fits the data better.

Parameters:
  • male_reference (bool) – Whether a male reference copy number profile was used to normalize the data. If so, a male sample should have log2 values of 0 on X and Y, and female +1 on X, deep negative (below -3) on Y. Otherwise, a male sample should have log2 values of -1 on X and 0 on Y, and female 0 on X, deep negative (below -3) on Y.
  • skip_low (bool) – If True, drop very-low-coverage bins (via drop_low_coverage) before comparing log2 coverage ratios. Included for completeness, but shouldn’t affect the result much since the M-W test is nonparametric and p-values are not used here.
Returns:

  • bool – True if the sample appears male.
  • dict – Calculated values used for the inference: relative log2 ratios of chromosomes X and Y versus the autosomes; the Mann-Whitney U values from each test; and ratios of U values for male vs. female assumption on chromosomes X and Y.

drop_low_coverage(verbose=False)[source]

Drop bins with extremely low log2 coverage or copy ratio values.

These are generally bins that had no reads mapped due to sample-specific issues. A very small log2 ratio or coverage value may have been substituted to avoid domain or divide-by-zero errors.

expect_flat_log2(is_male_reference=None)[source]

Get the uninformed expected copy ratios of each bin.

Create an array of log2 coverages like a “flat” reference.

This is a neutral copy ratio (log2 = 0.0) on each autosome, with sex-chromosome values set according to whether the reference is female (XX) or male (XY).

guess_xx(male_reference=False, verbose=True)[source]

Detect chromosomal sex; return True if a sample is probably female.

Uses compare_sex_chromosomes to calculate coverage ratios of the X and Y chromosomes versus autosomes.

Parameters:
  • male_reference (bool) – Was this sample normalized to a male reference copy number profile?
  • verbose (bool) – If True, print (i.e. log to console) the ratios of the log2 coverages of the X and Y chromosomes versus autosomes, the “maleness” ratio of male vs. female expectations for each sex chromosome, and the inferred chromosomal sex.
Returns:

True if the coverage ratios indicate the sample is female.

Return type:

bool

log2
residuals(segments=None)[source]

Difference in log2 value of each bin from its segment mean.

Parameters:segments (GenomicArray, CopyNumArray, or None) –

Determines the “mean” value to which self log2 values are relative:

  • If CopyNumArray, use the log2 values as the segment means to subtract.
  • If GenomicArray with no log2 values, group self by these ranges and subtract each group’s median log2 value.
  • If None, subtract each chromosome’s median.
Returns:Residual log2 values from self relative to segments; same length as self.
Return type:array
shift_xx(male_reference=False, is_xx=None)[source]

Adjust chrX log2 ratios (subtract 1) for apparent female samples.

squash_genes(summary_func=<function biweight_location>, squash_background=False, ignore=('-', '.', 'CGH'))[source]

Combine consecutive bins with the same targeted gene name.

Parameters:
  • summary_func (callable) – Function to summarize an array of log2 values to produce a new log2 value for a “squashed” (i.e. reduced) region. By default this is the biweight location, but you might want median, mean, max, min or something else in some cases.
  • squash_background (bool) – If True, also reduce consecutive “Background” bins into a single bin. Otherwise, keep “Background” and ignored bins as they are in the output.
  • ignore (list or tuple of str) – Bin names to be treated as “Background” instead of as unique genes.
Returns:

Another, usually smaller, copy of self with each gene’s bins reduced to a single bin with appropriate values.

Return type:

CopyNumArray

vary

An array of genomic intervals, treated as variant loci.

class cnvlib.vary.VariantArray(data_table, meta_dict=None)[source]

Bases: cnvlib.genome.gary.GenomicArray

An array of genomic intervals, treated as variant loci.

Required columns: chromosome, start, end, ref, alt

baf_by_ranges(ranges, summary_func=<function nanmedian>, above_half=None, tumor_boost=True)[source]

Aggregate variant (b-allele) frequencies in each given bin.

Get the average BAF in each of the bins of another genomic array: BAFs are mirrored above/below 0.5 (per above_half), grouped in each bin of ranges, and summarized into one value per bin with summary_func (default median).

Parameters:
  • ranges (GenomicArray or subclass) – Bins for grouping the variants in self.
  • above_half (bool) – The same as in mirrored_baf.
  • tumor_boost (bool) – The same as in mirrored_baf.
Returns:

Average b-allele frequency in each range; same length as ranges. May contain NaN values where no variants overlap a range.

Return type:

float array

heterozygous()[source]

Subset to only heterozygous variants.

Use ‘zygosity’ or ‘n_zygosity’ genotype values (if present) to exclude variants with value 0.0 or 1.0. If these columns are missing, or there are no heterozygous variants, then return the full (input) set of variants.

Returns:The subset of self with heterozygous genotype, or allele frequency between the specified thresholds.
Return type:VariantArray
mirrored_baf(above_half=None, tumor_boost=False)[source]

Mirrored B-allele frequencies (BAFs).

Parameters:
  • above_half (bool or None) – If specified, flip BAFs to be all above 0.5 (True) or below 0.5 (False), respectively, for consistency. Otherwise, if None, mirror in the direction of the majority of BAFs.
  • tumor_boost (bool) – Normalize tumor-sample allele frequencies to the matched normal sample’s allele frequencies.
Returns:

Mirrored b-allele frequencies, the same length as self. May contain NaN values.

Return type:

float array
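
Mirroring can be sketched as reflecting each frequency around 0.5. The helper below is hypothetical and only illustrates the above_half behaviour described above: it ignores tumor_boost and NaN handling, and when above_half is None it resolves a tie in favour of mirroring above 0.5, which is an assumption.

```python
def mirror_bafs(bafs, above_half=None):
    """Reflect b-allele frequencies to one side of 0.5."""
    if above_half is None:
        # Mirror in the direction of the majority of values
        n_above = sum(1 for baf in bafs if baf > 0.5)
        n_below = sum(1 for baf in bafs if baf < 0.5)
        above_half = n_above >= n_below
    if above_half:
        return [0.5 + abs(baf - 0.5) for baf in bafs]
    return [0.5 - abs(baf - 0.5) for baf in bafs]


print(mirror_bafs([0.25, 0.75, 0.5], above_half=True))   # [0.75, 0.75, 0.5]
print(mirror_bafs([0.25, 0.75, 0.5], above_half=False))  # [0.25, 0.25, 0.5]
print(mirror_bafs([0.25, 0.25, 0.75]))                   # [0.25, 0.25, 0.25]
```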

tumor_boost()[source]

TumorBoost normalization of tumor-sample allele frequencies.

De-noises the signal for detecting LOH.

See: TumorBoost, Bengtsson et al. 2010

zygosity_from_freq(het_freq=0.0, hom_freq=1.0)[source]

Set zygosity (genotype) according to allele frequencies.

Creates or replaces ‘zygosity’ column if ‘alt_freq’ column is present, and ‘n_zygosity’ if ‘n_alt_freq’ is present.

Parameters:
  • het_freq (float) – Assign zygosity 0.5 (heterozygous), otherwise 0.0 (i.e. reference genotype), to variants with alt allele frequency of at least this value.
  • hom_freq (float) – Assign zygosity 1.0 (homozygous) to variants with alt allele frequency of at least this value.
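
The two cutoffs partition allele frequencies into three genotype values. A minimal sketch on a plain list, using a hypothetical helper name rather than the VariantArray method:

```python
def assign_zygosity(alt_freqs, het_freq=0.0, hom_freq=1.0):
    """Map alt allele frequencies to zygosity values 0.0, 0.5 or 1.0."""
    zygosity = []
    for freq in alt_freqs:
        if freq >= hom_freq:
            zygosity.append(1.0)  # Homozygous for the alt allele
        elif freq >= het_freq:
            zygosity.append(0.5)  # Heterozygous
        else:
            zygosity.append(0.0)  # Reference genotype
    return zygosity


print(assign_zygosity([0.1, 0.5, 0.97], het_freq=0.25, hom_freq=0.95))
# [0.0, 0.5, 1.0]
print(assign_zygosity([0.0, 0.5, 1.0]))
# [0.5, 0.5, 1.0]
```

With the default cutoffs, every variant is treated as heterozygous unless its alt allele frequency is exactly 1.0.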

Tabular file I/O

tabio

I/O for tabular formats of genomic data (regions or features).

cnvlib.tabio.read(infile, fmt='tab', into=None, sample_id=None, meta=None, **kwargs)[source]

Read tabular data from a file or stream into a genome object.

Supported formats: see READERS

If a format supports multiple samples, return the sample specified by sample_id, or if unspecified, return the first sample and warn if there were other samples present in the file.

Parameters:
  • infile (handle or string) – Filename or opened file-like object to read.
  • fmt (string) – File format.
  • into (class) – GenomicArray class or subclass to instantiate, overriding the default for the target file format.
  • sample_id (string) – Sample identifier.
  • meta (dict) – Metadata, as arbitrary key-value pairs.
  • **kwargs – Additional keyword arguments to the format-specific reader function.
Returns:

The data from the given file instantiated as into, if specified, or the default base class for the given file format (usually GenomicArray).

Return type:

GenomicArray or subclass

cnvlib.tabio.read_auto(infile)[source]

Auto-detect a file’s format and use an appropriate parser to read it.

cnvlib.tabio.read_cna(infile, sample_id=None, meta=None)[source]

Read a tabular file to create a CopyNumArray object.

cnvlib.tabio.sniff_region_format(fname)[source]

Guess the format of the given file by reading the first line.

Returns:The detected format name, or None if the file is empty.
Return type:str or None
cnvlib.tabio.write(garr, outfile=None, fmt='tab', verbose=True, **kwargs)[source]

Write a genome object to a file or stream.

Interface to CNVkit sub-commands

commands

The public API for each of the commands defined in the CNVkit workflow.

Command-line interface and corresponding API for CNVkit.

cnvlib.commands.do_target(bait_arr, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]

Transform bait intervals into targets more suitable for CNVkit.

cnvlib.commands.do_access(fa_fname, exclude_fnames=(), min_gap_size=5000)[source]

List the locations of accessible sequence regions in a FASTA file.

cnvlib.commands.do_antitarget(targets, access=None, avg_bin_size=150000, min_bin_size=None)[source]

Derive a background/antitarget BED file from a target BED file.

cnvlib.commands.do_autobin(bam_fname, method, targets=None, access=None, bp_per_bin=100000.0, target_min_size=20, target_max_size=20000, antitarget_min_size=500, antitarget_max_size=500000)[source]

Quickly calculate reasonable bin sizes from BAM read counts.

Parameters:
  • bam_fname (string) – BAM filename.
  • method (string) – One of: ‘wgs’ (whole-genome sequencing), ‘amplicon’ (targeted amplicon capture), ‘hybrid’ (hybridization capture).
  • targets (GenomicArray) – Targeted genomic regions (for ‘hybrid’ and ‘amplicon’).
  • access (GenomicArray) – Sequencing-accessible regions of the reference genome (for ‘hybrid’ and ‘wgs’).
  • bp_per_bin (int) – Desired number of sequencing read nucleotide bases mapped to each bin.
Returns:

((target depth, target avg. bin size),

(antitarget depth, antitarget avg. bin size))

Return type:

2-tuple of 2-tuples

cnvlib.commands.do_coverage(bed_fname, bam_fname, by_count=False, min_mapq=0, processes=1)[source]

Calculate coverage in the given regions from BAM read depths.

cnvlib.commands.do_reference(target_fnames, antitarget_fnames=None, fa_fname=None, male_reference=False, female_samples=None, do_gc=True, do_edge=True, do_rmask=True)[source]

Compile a coverage reference from the given files (normal samples).

cnvlib.commands.do_reference_flat(targets, antitargets=None, fa_fname=None, male_reference=False)[source]

Compile a neutral-coverage reference from the given intervals.

Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.

cnvlib.commands.do_fix(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]

Combine target and antitarget coverages and correct for biases.

cnvlib.commands.do_segmentation(cnarr, method, threshold=None, variants=None, skip_low=False, skip_outliers=10, save_dataframe=False, rlibpath=None, processes=1)[source]

Infer copy number segments from the given coverage table.

cnvlib.commands.do_call(cnarr, variants=None, method='threshold', ploidy=2, purity=None, is_reference_male=False, is_sample_female=False, filters=None, thresholds=(-1.1, -0.25, 0.2, 0.7))[source]
cnvlib.commands.do_scatter(cnarr, segments=None, variants=None, show_range=None, show_gene=None, background_marker=None, do_trend=False, window_width=1000000.0, y_min=None, y_max=None, title=None, segment_color='darkorange')[source]

Plot probe log2 coverages and segmentation calls together.

cnvlib.commands.do_heatmap(cnarrs, show_range=None, do_desaturate=False)[source]

Plot copy number for multiple samples as a heatmap.

cnvlib.commands.do_breaks(probes, segments, min_probes=1)[source]

List the targeted genes in which a copy number breakpoint occurs.

cnvlib.commands.do_gainloss(cnarr, segments=None, threshold=0.2, min_probes=3, skip_low=False, male_reference=False, is_sample_female=None)[source]

Identify targeted genes with copy number gain or loss.

cnvlib.commands.do_sex(cnarrs, is_male_reference)[source]

Guess samples’ sex from the relative coverage of chromosomes X and Y.

cnvlib.commands.do_metrics(cnarrs, segments=None, skip_low=False)[source]

Compute coverage deviations and other metrics for self-evaluation.

cnvlib.commands.do_import_theta(segarr, theta_results_fname, ploidy=2)[source]

The following modules implement lower-level functionality specific to each of the CNVkit sub-commands.

access

List the locations of accessible sequence regions in a FASTA file.

Inaccessible regions, e.g. telomeres and centromeres, are masked out with N in the reference genome sequence; this script scans those to identify the coordinates of the accessible regions (those between the long spans of N’s).

cnvlib.access.do_access(fa_fname, exclude_fnames=(), min_gap_size=5000)[source]

List the locations of accessible sequence regions in a FASTA file.

cnvlib.access.get_regions(fasta_fname)[source]

Find accessible sequence regions (those not masked out with ‘N’).

cnvlib.access.join_regions(regions, min_gap_size)[source]

Filter regions, joining those separated by small gaps.

cnvlib.access.log_this(chrom, run_start, run_end)[source]

Log a coordinate range, then return it as a tuple.

antitarget

Supporting functions for the ‘antitarget’ command.

cnvlib.antitarget.compare_chrom_names(a_regions, b_regions)[source]
cnvlib.antitarget.do_antitarget(targets, access=None, avg_bin_size=150000, min_bin_size=None)[source]

Derive a background/antitarget BED file from a target BED file.

cnvlib.antitarget.get_background(targets, accessible, avg_bin_size, min_bin_size)[source]

Generate background intervals from target intervals.

Procedure:

  • Invert target intervals

  • Subtract the inverted targets from accessible regions

  • For each of the resulting regions:

    • Shrink by a fixed margin on each end
    • If it’s smaller than min_bin_size, skip
    • Divide into equal-size (region_size/avg_bin_size) portions
    • Emit the (chrom, start, end) coords of each portion
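
The per-region steps of the procedure above can be sketched as follows. The helper is hypothetical: the margin argument stands in for the fixed margin mentioned above, and rounding region_size/avg_bin_size to the nearest whole number of bins (with at least one bin) is an assumption.

```python
def split_region(chrom, start, end, margin, avg_bin_size, min_bin_size):
    """Shrink one region, then divide it into equal-size bins."""
    # Shrink by a fixed margin on each end
    start += margin
    end -= margin
    size = end - start
    # Skip regions that are too small
    if size < min_bin_size:
        return []
    # Divide into approximately size/avg_bin_size equal portions
    n_bins = max(1, int(round(size / avg_bin_size)))
    bounds = [start + (size * i) // n_bins for i in range(n_bins + 1)]
    # Emit the (chrom, start, end) coordinates of each portion
    return [(chrom, bounds[i], bounds[i + 1]) for i in range(n_bins)]


print(split_region("chr1", 0, 1000, margin=50, avg_bin_size=300, min_bin_size=100))
# [('chr1', 50, 350), ('chr1', 350, 650), ('chr1', 650, 950)]
print(split_region("chr1", 0, 150, margin=50, avg_bin_size=300, min_bin_size=100))
# []
```
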
cnvlib.antitarget.guess_chromosome_regions(targets, telomere_size)[source]

Determine (minimum) chromosome lengths from target coordinates.

call

Call copy number variants from segmented log2 ratios.

cnvlib.call.absolute_clonal(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]

Calculate absolute copy number values from segment or bin log2 ratios.

cnvlib.call.absolute_dataframe(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]

Absolute, expected and reference copy number in a DataFrame.

cnvlib.call.absolute_expect(cnarr, ploidy, is_sample_female)[source]

Absolute integer number of expected copies in each bin.

I.e. the given ploidy for autosomes, and XY or XX sex chromosome counts according to the sample’s specified chromosomal sex.

cnvlib.call.absolute_pure(cnarr, ploidy, is_reference_male)[source]

Calculate absolute copy number values from segment or bin log2 ratios.

cnvlib.call.absolute_reference(cnarr, ploidy, is_reference_male)[source]

Absolute integer number of reference copies in each bin.

I.e. the given ploidy for autosomes, 1 or 2 X according to the reference sex, and always 1 copy of Y.

cnvlib.call.absolute_threshold(cnarr, ploidy, thresholds, is_reference_male)[source]

Call integer copy number using hard thresholds for each level.

Integer values are assigned for log2 ratio values less than each given threshold value in sequence, counting up from zero. Above the last threshold value, integer copy numbers are called assuming full purity, diploidy, and rounding up.

Default thresholds follow this heuristic for calling CNAs in a tumor sample: For single-copy gains and losses, assume 50% tumor cell clonality (including normal cell contamination). Then:

R> log2(2:6 / 4)
-1.0  -0.4150375  0.0  0.3219281  0.5849625

Allowing for random noise of +/- 0.1, the cutoffs are:

DEL(0)  <  -1.1
LOSS(1) <  -0.25
GAIN(3) >=  +0.2
AMP(4)  >=  +0.7

For germline samples, better precision could be achieved with:

LOSS(1) <  -0.4
GAIN(3) >=  +0.3
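
The thresholding rule can be sketched for a single log2 ratio. This is an illustration only, with a hypothetical helper name; it ignores sex chromosomes and the reference sex, and above the last threshold it takes the ceiling of 2 * 2^log2, following the "full purity, diploidy, and rounding up" description above.

```python
import math


def call_by_threshold(log2_ratio, thresholds=(-1.1, -0.25, 0.2, 0.7)):
    """Call an integer copy number from one log2 ratio."""
    # Count up from zero copies, one level per threshold
    for copies, cutoff in enumerate(thresholds):
        if log2_ratio < cutoff:
            return copies
    # Above the last threshold: assume a pure, diploid sample and round up
    return int(math.ceil(2 * 2 ** log2_ratio))


for value in (-1.5, -0.5, 0.0, 0.5, 1.0):
    print(value, call_by_threshold(value))
# -1.5 0
# -0.5 1
# 0.0 2
# 0.5 3
# 1.0 4
```
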
cnvlib.call.do_call(cnarr, variants=None, method='threshold', ploidy=2, purity=None, is_reference_male=False, is_sample_female=False, filters=None, thresholds=(-1.1, -0.25, 0.2, 0.7))[source]
cnvlib.call.log2_ratios(cnarr, absolutes, ploidy, is_reference_male, min_abs_val=0.001, round_to_int=False)[source]

Convert absolute copy numbers to log2 ratios.

Optionally round copy numbers to integers.

Account for reference sex & ploidy of sex chromosomes.

cnvlib.call.rescale_baf(purity, observed_baf, normal_baf=0.5)[source]

Adjust B-allele frequencies for sample purity.

Math:

t_baf*purity + n_baf*(1-purity) = obs_baf
obs_baf - n_baf * (1-purity) = t_baf * purity
t_baf = (obs_baf - n_baf * (1-purity))/purity
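
The last line of the derivation translates directly into code. The function below is a stand-alone sketch with a hypothetical name, not the cnvlib function; it applies no clipping, so noisy inputs can yield values outside the range 0 to 1.

```python
def tumor_baf(purity, observed_baf, normal_baf=0.5):
    """Solve t_baf*purity + n_baf*(1-purity) = obs_baf for t_baf."""
    return (observed_baf - normal_baf * (1 - purity)) / purity


# A 50%-pure sample with observed BAF 0.75 implies a tumor BAF of 1.0
print(tumor_baf(0.5, 0.75))  # 1.0
# A fully pure sample needs no adjustment
print(tumor_baf(1.0, 0.75))  # 0.75
```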

coverage

Supporting functions for the ‘coverage’ command.

cnvlib.coverage.bedcov(bed_fname, bam_fname, min_mapq)[source]

Calculate depth of all regions in a BED file via samtools (pysam) bedcov.

i.e. mean pileup depth across each region.

cnvlib.coverage.detect_bedcov_columns(text)[source]
cnvlib.coverage.do_coverage(bed_fname, bam_fname, by_count=False, min_mapq=0, processes=1)[source]

Calculate coverage in the given regions from BAM read depths.

cnvlib.coverage.interval_coverages(bed_fname, bam_fname, by_count, min_mapq, processes)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.interval_coverages_count(bed_fname, bam_fname, min_mapq, procs=1)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.interval_coverages_pileup(bed_fname, bam_fname, min_mapq, procs=1)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.region_depth_count(bamfile, chrom, start, end, gene, min_mapq)[source]

Calculate depth of a region via pysam count.

i.e. counting the number of read starts in a region, then scaling for read length and region width to estimate depth.

Coordinates are 0-based, per pysam.
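
The scaling step can be sketched as simple arithmetic. The helper and its arguments are hypothetical; it assumes the read count and a mean read length have already been obtained for the region, and it treats depth as total read bases divided by region width.

```python
def estimate_depth(n_reads, mean_read_length, start, end):
    """Estimate mean depth from a read count in a 0-based region."""
    width = end - start
    if width <= 0:
        return 0.0
    # Total bases contributed by the reads, spread across the region
    return n_reads * mean_read_length / width


# 200 reads of 100 bases each over a 2000-base region
print(estimate_depth(200, 100, 1000, 3000))  # 10.0
```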

diagram

Chromosome diagram drawing functions.

This uses and abuses Biopython’s BasicChromosome module. It depends on ReportLab, too, so we isolate this functionality here so that the rest of CNVkit will run without it. (And also to keep the codebase tidy.)

cnvlib.diagram.bc_chromosome_draw_label(self, cur_drawing, label_name)[source]

Monkeypatch to Bio.Graphics.BasicChromosome.Chromosome._draw_label.

Draw a label for the chromosome. Mod: above the chromosome, not below.

cnvlib.diagram.bc_organism_draw(org, title, wrap=12)[source]

Modified copy of Bio.Graphics.BasicChromosome.Organism.draw.

Instead of stacking chromosomes horizontally (along the x-axis), stack rows vertically, then proceed with the chromosomes within each row.

Parameters:
  • org – The chromosome diagram object being modified.
  • title (str) – The output title of the produced document.
  • wrap (int) – Maximum number of chromosomes per row; the remainder will be wrapped to the next row(s).
cnvlib.diagram.build_chrom_diagram(features, chr_sizes, sample_id)[source]

Create a PDF of color-coded features on chromosomes.

cnvlib.diagram.create_diagram(cnarr, segarr, threshold, min_probes, outfname)[source]

Create the diagram.

export

Export CNVkit objects and files to other formats.

cnvlib.export.create_chrom_ids(segments)[source]

Map chromosome names to integers in the order encountered.
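
A minimal sketch of such a mapping, using a hypothetical helper on a plain list of names. Numbering from 1 is an assumption made for the illustration.

```python
def chrom_ids_in_order(chrom_names):
    """Assign consecutive integers to names in order of first appearance."""
    ids = {}
    for name in chrom_names:
        if name not in ids:
            ids[name] = len(ids) + 1
    return ids


print(chrom_ids_in_order(["chr1", "chr1", "chr2", "chrX", "chr2"]))
# {'chr1': 1, 'chr2': 2, 'chrX': 3}
```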

cnvlib.export.export_bed(segments, ploidy, is_reference_male, is_sample_female, label, show)[source]

Convert a copy number array to a BED-like DataFrame.

For each region in each sample (possibly filtered according to show), the columns are:

  • reference sequence name
  • start (0-indexed)
  • end
  • sample name or given label
  • integer copy number

By default (show="ploidy"), skip regions where copy number is the default ploidy, i.e. equal to 2 or the value set by --ploidy. If show="variant", skip regions where copy number is neutral, i.e. equal to the reference ploidy on autosomes, or half that on sex chromosomes.

cnvlib.export.export_nexus_basic(cnarr)[source]

Biodiscovery Nexus Copy Number “basic” format.

Only represents one sample per file.

cnvlib.export.export_nexus_ogt(cnarr, varr, min_weight=0.0)[source]

Biodiscovery Nexus Copy Number “Custom-OGT” format.

To create the b-allele frequencies column, alternate allele frequencies from the VCF are aligned to the .cnr file bins. Bins that contain no variants are left blank; if a bin contains multiple variants, then the frequencies are all “mirrored” to be above or below .5 (majority rules), then the median of those values is taken.

cnvlib.export.export_seg(sample_fnames)[source]

SEG format for copy number segments.

Segment breakpoints are not the same across samples, so samples are listed in serial with the sample ID as the left column.

cnvlib.export.export_theta(tumor_segs, normal_cn)[source]

Convert tumor segments and normal .cnr or reference .cnn to THetA input.

Follows the THetA segmentation import script but avoids repeating the pileups, since we already have the mean depth of coverage in each target bin.

The options for average depth of coverage and read length do not matter crucially for proper operation of THetA; increased read counts per bin simply increase the confidence of THetA’s results.

THetA2 input format is tabular, with columns:
ID, chrm, start, end, tumorCount, normalCount

where chromosome IDs (“chrm”) are integers 1 through 24.

cnvlib.export.export_theta_snps(varr)[source]

Generate THetA’s SNP per-allele read count “formatted.txt” files.

cnvlib.export.export_vcf(segments, ploidy, is_reference_male, is_sample_female, sample_id=None)[source]

Convert segments to Variant Call Format.

For now, only 1 sample per VCF. (Overlapping CNVs seem tricky.)

Spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf

cnvlib.export.fmt_cdt(sample_ids, table)[source]

Format as CDT.

cnvlib.export.fmt_gct(sample_ids, table)[source]
cnvlib.export.fmt_jtv(sample_ids, table)[source]

Format for Java TreeView.

cnvlib.export.merge_samples(filenames)[source]

Merge probe values from multiple samples into a 2D table (of sorts).

Input:
dict of {sample ID: (probes, values)}
Output:
list-of-tuples: (probe, log2 coverages...)
cnvlib.export.ref_means_nbins(tumor_segs, normal_cn)[source]

Extract segments’ reference mean log2 values and probe counts.

Code paths:

wt_mdn  wt_old  probes  norm -> norm, nbins
+       *       *       -       0,  wt_mdn
-       +       +       -       0,  wt_old * probes
-       +       -       -       0,  wt_old * size?
-       -       +       -       0,  probes
-       -       -       -       0,  size?

+       -       +       +       norm, probes
+       -       -       +       norm, bin counts
-       +       +       +       norm, probes
-       +       -       +       norm, bin counts
-       -       +       +       norm, probes
-       -       -       +       norm, bin counts
cnvlib.export.segments2vcf(segments, ploidy, is_reference_male, is_sample_female)[source]

Convert copy number segments to VCF records.

cnvlib.export.theta_read_counts(log2_ratio, nbins, avg_depth=500, avg_bin_width=200, read_len=100)[source]

Calculate segments’ read counts from log2-ratios.

Math:
nbases = read_length * read_count
and
nbases = bin_width * read_depth
where
read_depth = read_depth_ratio * avg_depth
So:
read_length * read_count = bin_width * read_depth
read_count = bin_width * read_depth / read_length
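
The arithmetic above can be checked with a small sketch. The default values are copied from the signature; the helper name `read_count_sketch` is hypothetical, and recovering the depth ratio as `2 ** log2_ratio` is an assumption about how the log2 ratio enters the calculation. The sketch works per bin and ignores the `nbins` argument.

```python
def read_count_sketch(log2_ratio, avg_depth=500, avg_bin_width=200, read_len=100):
    """Approximate read count for a single bin from its log2 ratio.

    Assumption: read_depth_ratio = 2 ** log2_ratio.
    """
    read_depth = (2.0 ** log2_ratio) * avg_depth
    # read_count = bin_width * read_depth / read_length
    return avg_bin_width * read_depth / read_len


for ratio in (-1.0, 0.0, 1.0):
    print(ratio, read_count_sketch(ratio))
```

With these defaults a neutral bin (log2 ratio 0) corresponds to 1000 reads, and each unit of log2 ratio doubles or halves that count.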

fix

Supporting functions for the ‘fix’ command.

cnvlib.fix.apply_weights(cnarr, ref_matched, epsilon=0.0001)[source]

Calculate weights for each bin.

Weights are derived from:

  • bin sizes
  • average bin coverage depths in the reference
  • the “spread” column of the reference.
cnvlib.fix.center_by_window(cnarr, fraction, sort_key)[source]

Smooth out biases according to the trait specified by sort_key.

E.g. correct GC-biased bins by windowed averaging across similar-GC bins; or for similar interval sizes.

cnvlib.fix.do_fix(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]

Combine target and antitarget coverages and correct for biases.

cnvlib.fix.edge_gains(target_sizes, gap_sizes, insert_size)[source]

Calculate coverage gain from neighboring baits’ flanking reads.

Letting i = insert size, t = target size, g = gap to neighboring bait, the gain of coverage due to a nearby bait, if g < i, is:

\[(i-g)^2 / 4it\]

If the neighbor flank extends beyond the target (t+g < i), reduce by:

\[(i-t-g)^2 / 4it\]

If a neighbor overlaps the target, treat it as adjacent (gap size 0).
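
A scalar sketch of the rule above, for one target and one neighboring bait. The helper name `edge_gain_sketch` is hypothetical, and the real function presumably operates on arrays of sizes rather than single values.

```python
def edge_gain_sketch(target_size, gap_size, insert_size):
    """Coverage gain for one target from one neighboring bait."""
    i = float(insert_size)
    t = float(target_size)
    g = max(float(gap_size), 0.0)  # overlapping neighbor: treat as adjacent
    if g >= i:
        return 0.0  # neighbor is too far away to contribute reads
    gain = (i - g) ** 2 / (4 * i * t)
    if t + g < i:
        # The neighbor's flank extends beyond the far edge of the target
        gain -= (i - t - g) ** 2 / (4 * i * t)
    return gain


print(edge_gain_sketch(target_size=400, gap_size=100, insert_size=200))
print(edge_gain_sketch(target_size=50, gap_size=100, insert_size=200))
```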

cnvlib.fix.edge_losses(target_sizes, insert_size)[source]

Calculate coverage losses at the edges of baited regions.

Letting i = insert size and t = target size, the proportional loss of coverage near the two edges of the baited region (combined) is:

\[i/2t\]

If the “shoulders” extend outside the bait (t < i), reduce by:

\[(i-t)^2 / 4it\]

on each side, or (i-t)^2 / 2it total.
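
The same rule as a scalar sketch, using the combined (i-t)^2 / 2it reduction. The helper name `edge_loss_sketch` is hypothetical; the real function presumably works on arrays of target sizes.

```python
def edge_loss_sketch(target_size, insert_size):
    """Proportional coverage loss at the two edges of one baited region."""
    i = float(insert_size)
    t = float(target_size)
    loss = i / (2 * t)
    if t < i:
        # Shoulders extend outside the bait: subtract (i-t)^2 / 2it in total
        loss -= (i - t) ** 2 / (2 * i * t)
    return loss


print(edge_loss_sketch(target_size=400, insert_size=200))
print(edge_loss_sketch(target_size=100, insert_size=200))
```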

cnvlib.fix.get_edge_bias(cnarr, margin)[source]

Quantify the “edge effect” of the target tile and its neighbors.

The result is proportional to the change in the target’s coverage due to these edge effects, i.e. the expected loss of coverage near the target edges and, if there are close neighboring tiles, gain of coverage due to “spill over” reads from the neighbor tiles.

(This is not the actual change in coverage. This is just a tribute.)

cnvlib.fix.load_adjust_coverages(cnarr, ref_cnarr, skip_low, fix_gc, fix_edge, fix_rmask)[source]

Load and filter probe coverages; correct using reference and GC.

cnvlib.fix.mask_bad_bins(cnarr)[source]

Flag the bins with excessively low or inconsistent coverage.

Returns:A boolean array where True indicates bins that failed the checks.
Return type:np.array
cnvlib.fix.match_ref_to_sample(ref_cnarr, samp_cnarr)[source]

Filter the reference bins to match the sample (target or antitarget).

importers

Import from other formats to the CNVkit format.

cnvlib.importers.do_import_theta(segarr, theta_results_fname, ploidy=2)[source]
cnvlib.importers.find_picard_files(file_and_dir_names)[source]

Search the given paths for ‘targetcoverage’ CSV files.

Per the convention we use in our Picard applets, the target coverage file names end with ‘.targetcoverage.csv’; anti-target coverages end with ‘.antitargetcoverage.csv’.

cnvlib.importers.parse_theta_results(fname)[source]

Parse THetA results into a data structure.

Columns: NLL, mu, C, p*

metrics

Robust metrics to evaluate performance of copy number estimates.

cnvlib.metrics.confidence_interval_bootstrap(bins, alpha, bootstraps=100, smoothed=True)[source]

Confidence interval for segment mean log2 value, estimated by bootstrap.

cnvlib.metrics.do_metrics(cnarrs, segments=None, skip_low=False)[source]

Compute coverage deviations and other metrics for self-evaluation.

cnvlib.metrics.ests_of_scale(deviations)[source]

Estimators of scale: standard deviation, MAD, biweight midvariance.

Calculates all of these values for an array of deviations and returns them as a tuple.

cnvlib.metrics.prediction_interval(bins, alpha)[source]

Prediction interval, estimated by percentiles.

cnvlib.metrics.segment_mean(cnarr, skip_low=False)[source]

Weighted average of bin log2 values.

cnvlib.metrics.zip_repeater(iterable, repeatable)[source]

Repeat a single segmentation to match the number of copy ratio inputs.

reference

Supporting functions for the ‘reference’ command.

cnvlib.reference.bed2probes(bed_fname)[source]

Create neutral-coverage probes from intervals.

cnvlib.reference.calculate_gc_lo(subseq)[source]

Calculate the GC and lowercase (RepeatMasked) content of a string.

cnvlib.reference.combine_probes(filenames, fa_fname, is_male_reference, is_female_sample, skip_low, fix_gc, fix_edge, fix_rmask)[source]

Calculate the median coverage of each bin across multiple samples.

Parameters:
  • filenames (list) – List of string filenames, corresponding to targetcoverage.cnn and antitargetcoverage.cnn files, as generated by ‘coverage’ or ‘import-picard’.
  • fa_fname (str) – Reference genome sequence in FASTA format, used to extract GC and RepeatMasker content of each genomic bin.
  • is_male_reference (bool) –
  • skip_low (bool) –
  • fix_gc (bool) –
  • fix_edge (bool) –
  • fix_rmask (bool) –
Returns:

One object summarizing the coverages of the input samples, including each bin’s “average” coverage, “spread” of coverages, and GC content.

Return type:

CopyNumArray

cnvlib.reference.do_reference(target_fnames, antitarget_fnames=None, fa_fname=None, male_reference=False, female_samples=None, do_gc=True, do_edge=True, do_rmask=True)[source]

Compile a coverage reference from the given files (normal samples).

cnvlib.reference.do_reference_flat(targets, antitargets=None, fa_fname=None, male_reference=False)[source]

Compile a neutral-coverage reference from the given intervals.

Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.

cnvlib.reference.fasta_extract_regions(fa_fname, intervals)[source]

Extract an iterable of regions from an indexed FASTA file.

Input: FASTA file name; iterable of (seq_id, start, end) (1-based)
Output: iterable of string sequences.

cnvlib.reference.get_fasta_stats(cnarr, fa_fname)[source]

Calculate GC and RepeatMasker content of each bin in the FASTA genome.

cnvlib.reference.reference2regions(refarr)[source]

Split reference into target and antitarget regions.

cnvlib.reference.warn_bad_probes(probes, max_name_width=50)[source]

Warn about target probes where coverage is poor.

Prints a formatted table to stderr.

reports

Supporting functions for the text/tabular-reporting commands, namely breaks and gainloss.

cnvlib.reports.gainloss_by_gene(cnarr, threshold, skip_low=False)[source]

Identify genes where average bin copy ratio value exceeds threshold.

NB: Must shift sex-chromosome values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.

cnvlib.reports.gainloss_by_segment(cnarr, segments, threshold, skip_low=False)[source]

Identify genes where segmented copy ratio exceeds threshold.

NB: Must shift sex-chromosome values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.

cnvlib.reports.get_breakpoints(intervals, segments, min_probes)[source]

Identify CBS segment breaks within the targeted intervals.

cnvlib.reports.get_gene_intervals(all_probes, ignore=('-', '.', 'CGH'))[source]

Tally genomic locations of each targeted gene.

Return a dict of chromosomes to a list of tuples: (gene name, start, end).

cnvlib.reports.group_by_genes(cnarr, skip_low)[source]

Group probe and coverage data by gene.

Return an iterable of genes, in chromosomal order, associated with their location and coverages:

[(gene, chrom, start, end, [coverages]), ...]

segmentation

Segmentation of copy number values.

cnvlib.segmentation.do_segmentation(cnarr, method, threshold=None, variants=None, skip_low=False, skip_outliers=10, save_dataframe=False, rlibpath=None, processes=1)[source]

Infer copy number segments from the given coverage table.

cnvlib.segmentation.drop_outliers(cnarr, width, factor)[source]

Drop outlier bins with log2 ratios too far from the trend line.

Outliers are the log2 values that deviate from the rolling average by more than factor times the 90th quantile of absolute deviations, within a window of the given width. The 90th quantile is about 1.97 standard deviations if the log2 values are Gaussian, so this is similar to calling outliers factor * 1.97 standard deviations from the rolling mean. For a window size of 50, the breakdown point is 2.5 outliers within a window, which is plenty robust for our needs.

cnvlib.segmentation.repair_segments(segments, orig_probes)[source]

Post-process segmentation output.

  1. Ensure every chromosome has at least one segment.
  2. Ensure first and last segment ends match 1st/last bin ends (but keep log2 as-is).
cnvlib.segmentation.squash_segments(segments)[source]

Combine contiguous segments.

cnvlib.segmentation.transfer_fields(segments, cnarr, ignore=('-', '.', 'CGH'))[source]

Map gene names, weights, depths from cnarr bins to segarr segments.

Segment gene name is the comma-separated list of bin gene names. Segment weight is the sum of bin weights, and depth is the (weighted) mean of bin depths.

target

Transform bait intervals into targets more suitable for CNVkit.

cnvlib.target.do_target(bait_arr, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]

Transform bait intervals into targets more suitable for CNVkit.

cnvlib.target.filter_names(names, exclude=('mRNA', ))[source]

Remove less-meaningful accessions from the given set.

cnvlib.target.shorten_labels(gene_labels)[source]

Reduce multi-accession interval labels to the minimum consistent.

So: BED or interval_list files have a label for every region. We want this to be a short, unique string, like the gene name. But if an interval list is instead a series of accessions, including additional accessions for sub-regions of the gene, we can extract a single accession that covers the maximum number of consecutive regions that share this accession.

e.g.:

...
mRNA|JX093079,ens|ENST00000342066,mRNA|JX093077,ref|SAMD11,mRNA|AF161376,mRNA|JX093104
ens|ENST00000483767,mRNA|AF161376,ccds|CCDS3.1,ref|NOC2L
...

becomes:

...
mRNA|AF161376
mRNA|AF161376
...
cnvlib.target.shortest_name(names)[source]

Return the shortest trimmed name from the given set.

Helper modules

core

CNV utilities.

cnvlib.core.assert_equal(msg, **values)[source]

Evaluate and compare two or more values for equality.

Sugar for a common assertion pattern. Saves re-evaluating (and retyping) the same values for comparison and error reporting.

Example:

>>> assert_equal("Mismatch", expected=1, saw=len(['xx', 'yy']))
...
ValueError: Mismatch: expected = 1, saw = 2
cnvlib.core.call_quiet(*args)[source]

Safely run a command and get stdout; print stderr if there’s an error.

Like subprocess.check_output, but silent in the normal case where the command logs unimportant stuff to stderr. If there is an error, then the full error message(s) is shown in the exception message.

cnvlib.core.check_unique(items, title)[source]

Ensure all items in an iterable are identical; return that one item.

cnvlib.core.ensure_path(fname)[source]

Create dirs and move an existing file to avoid overwriting, if necessary.

If a file already exists at the given path, it is renamed with an integer suffix to clear the way.

cnvlib.core.fbase(fname)[source]

Strip directory and all extensions from a filename.
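
One simple way to get this behavior with the standard library is sketched below. It is an approximation, not the cnvlib code: `fbase_sketch` is a hypothetical name, and treating everything after the first dot in the base name as extension is an assumption.

```python
import os.path


def fbase_sketch(fname):
    """Strip the directory and all extensions from a file name.

    Assumption: everything after the first '.' in the base name
    counts as extension.
    """
    return os.path.basename(fname).split(".", 1)[0]


print(fbase_sketch("/data/run1/Sample.targetcoverage.cnn"))
```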

cnvlib.core.safe_write(*args, **kwds)[source]

Write to a filename or file-like object with error handling.

If given a file name, open it. If the path includes directories that don’t exist yet, create them. If given a file-like object, just pass it through.

cnvlib.core.temp_write_text(*args, **kwds)[source]

Save text to a temporary file.

NB: This won’t work on Windows b/c the file stays open.

cnvlib.core.write_dataframe(outfname, dframe, header=True)[source]

Write a pandas.DataFrame to a tabular file.

cnvlib.core.write_text(outfname, text, *more_texts)[source]

Write one or more strings (blocks of text) to a file.

cnvlib.core.write_tsv(outfname, rows, colnames=None)[source]

Write rows, with optional column header, to tabular file.

descriptives

Robust estimators of central tendency and scale.

See:
https://en.wikipedia.org/wiki/Robust_measures_of_scale
https://astropy.readthedocs.io/en/latest/_modules/astropy/stats/funcs.html
cnvlib.descriptives.biweight_location(a, initial=None, c=6.0, epsilon=0.001, max_iter=5)[source]

Compute the biweight location for an array.

The biweight is a robust statistic for estimating the central location of a distribution.

cnvlib.descriptives.biweight_midvariance(a, initial=None, c=9.0, epsilon=0.001)[source]

Compute the biweight midvariance for an array.

The biweight midvariance is a robust statistic for determining the midvariance (i.e. the standard deviation) of a distribution.

cnvlib.descriptives.gapper_scale(a)[source]

Scale estimator based on gaps between order statistics.

See:

  • Wainer & Thissen (1976)
  • Beers, Flynn, and Gebhardt (1990)
cnvlib.descriptives.interquartile_range(a)[source]

Compute the difference between the array’s first and third quartiles.

cnvlib.descriptives.mean_squared_error(a, initial=None)[source]

Mean squared error (MSE).

By default, assume the input array a is the residuals/deviations/error, so MSE is calculated from zero. Another reference point for calculating the error can be specified with initial.

cnvlib.descriptives.median_absolute_deviation(a, scale_to_sd=True)[source]

Compute the median absolute deviation (MAD) of array elements.

The MAD is defined as: median(abs(a - median(a))).

See: https://en.wikipedia.org/wiki/Median_absolute_deviation
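
The definition can be written out directly with the standard library. In this sketch the name `mad_sketch` is hypothetical, and the factor 1.4826 used for `scale_to_sd` is an assumption: it is the usual constant that makes the MAD comparable to the standard deviation for normally distributed data.

```python
from statistics import median


def mad_sketch(values, scale_to_sd=True):
    """median(abs(a - median(a))), optionally rescaled to SD units."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if scale_to_sd:
        # Assumption: the usual normal-consistency constant
        mad *= 1.4826
    return mad


data = [1, 2, 3, 4, 100]
print(mad_sketch(data, scale_to_sd=False))
print(mad_sketch(data))
```

The single extreme value 100 barely affects the result, which is the point of using the MAD instead of the standard deviation.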

cnvlib.descriptives.modal_location(a)[source]

Return the modal value of an array’s values.

The “mode” is the location of peak density among the values, estimated using a Gaussian kernel density estimator.

Parameters:a (np.array) – A 1-D array of floating-point values, e.g. bin log2 ratio values.
cnvlib.descriptives.narray(a)[source]

Ensure a is a numpy array with no missing/NaN values.

cnvlib.descriptives.q_n(a)[source]

Rousseeuw & Croux’s (1993) Q_n, an alternative to MAD.

Qn := Cn first quartile of (|x_i - x_j|: i < j)

where Cn is a constant depending on n.

Finite-sample correction factors must be used to calibrate the scale of Qn for small-to-medium-sized samples.

n      E[Qn]
--     -----
10     1.392
20     1.193
40     1.093
60     1.064
80     1.048
100    1.038
200    1.019
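
An uncorrected version of the estimator can be sketched in a few lines. The name `q_n_sketch` is hypothetical, and the order statistic and constant are assumptions taken from the usual statement of the estimator: the k-th smallest pairwise distance with k = h(h-1)/2 and h = floor(n/2) + 1, multiplied by 2.2219. The finite-sample correction is deliberately left out, and the quadratic pairwise loop is only suitable for small arrays.

```python
from itertools import combinations


def q_n_sketch(values):
    """Uncorrected Qn: a constant times a low-order pairwise distance."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    h = n // 2 + 1
    k = h * (h - 1) // 2  # roughly the first quartile of the distances
    dists = sorted(abs(x - y) for x, y in combinations(values, 2))
    return 2.2219 * dists[k - 1]


print(q_n_sketch([1, 2, 3, 4, 5]))
```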
cnvlib.descriptives.warray(a, w)[source]

Ensure a and w are equal-length numpy arrays with no NaN values.

For weighted descriptives – a is the array of values, w is weights.

  1. Drop any cells in a that are NaN from both a and w
  2. Replace any remaining NaN cells in w with 0.
cnvlib.descriptives.weighted_median(a, weights)[source]

Weighted median of a 1-D numeric array.
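
One common definition is sketched below: sort the values, accumulate their weights, and return the first value at which the running weight reaches half the total. This is an assumption about the convention; the name `weighted_median_sketch` is hypothetical, and the cnvlib function may break ties or interpolate between neighbors differently.

```python
def weighted_median_sketch(values, weights):
    """Lower weighted median of a 1-D sequence (no interpolation)."""
    pairs = sorted(zip(values, weights))
    half = sum(w for _, w in pairs) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("empty input")


print(weighted_median_sketch([1, 2, 3], [1, 1, 1]))
print(weighted_median_sketch([1, 2, 3], [1, 1, 10]))
```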

parallel

Utilities for multi-core parallel processing.

class cnvlib.parallel.SerialFuture(result)[source]

Bases: future.types.newobject.newobject

Mimic the concurrent.futures.Future interface.

result()[source]
class cnvlib.parallel.SerialPool[source]

Bases: future.types.newobject.newobject

Mimic the concurrent.futures.Executor interface, but run in serial.

map(func, iterable)[source]

Just apply the function to iterable.

shutdown(wait=True)[source]

Do nothing.

submit(func, *args)[source]

Just call the function on the arguments.

cnvlib.parallel.pick_pool(*args, **kwds)[source]
cnvlib.parallel.rm(path)[source]

Safely remove a file.

cnvlib.parallel.to_chunks(bed_fname, chunk_size=5000)[source]

Split the bed-file into chunks for parallelization.

params

Defines several constants used in the pipeline.

Hard-coded parameters for CNVkit. These should not change between runs.

plots

Plotting utilities.

cnvlib.plots.chromosome_sizes(probes, to_mb=False)[source]

Create an ordered mapping of chromosome names to sizes.

cnvlib.plots.cvg2rgb(cvg, desaturate)[source]

Choose a shade of red or blue representing log2-coverage value.

cnvlib.plots.gene_coords_by_name(probes, names)[source]

Find the chromosomal position of each named gene in probes.

Returns:Of: {chromosome: [(start, end, gene name), ...]}
Return type:dict
cnvlib.plots.gene_coords_by_range(probes, chrom, start, end, ignore=('-', '.', 'CGH'))[source]

Find the chromosomal position of all genes in a range.

Returns:Of: {chromosome: [(start, end, gene), ...]}
Return type:dict
cnvlib.plots.partition_by_chrom(chrom_snvs)[source]

Group the tumor shift values by chromosome (for statistical testing).

cnvlib.plots.plot_x_dividers(axis, chrom_sizes, pad=None)[source]

Plot vertical dividers and x-axis labels given the chromosome sizes.

Draws vertical black lines between each chromosome, with padding. Labels each chromosome range with the chromosome name, centered in the region, under a tick. Sets the x-axis limits to the covered range.

Returns:A table of the x-position offsets of each chromosome.
Return type:OrderedDict
cnvlib.plots.test_loh(bins, alpha=0.0025)[source]

Test each chromosome’s SNP shifts and the combined others’.

The statistical test is Mann-Whitney, a one-sided non-parametric test for difference in means.

samutil

BAM utilities.

cnvlib.samutil.bam_total_reads(bam_fname)[source]

Count the total number of mapped reads in a BAM file.

Uses the BAM index to do this quickly.

cnvlib.samutil.ensure_bam_index(bam_fname)[source]

Ensure a BAM file is indexed, to enable fast traversal & lookup.

For MySample.bam, samtools will look for an index in these files, in order:

  • MySample.bam.bai
  • MySample.bai
cnvlib.samutil.ensure_bam_sorted(bam_fname, by_name=False, span=50)[source]

Test if the reads in a BAM file are sorted as expected.

by_name=True: reads are expected to be sorted by query name. Consecutive read IDs are in alphabetical order, and read pairs appear together.

by_name=False: reads are sorted by position. Consecutive reads have increasing position.

cnvlib.samutil.get_read_length(bam, span=1000)[source]

Get (median) read length from first few reads in a BAM file.

Illumina reads all have the same length; other sequencers might not.

Parameters:
  • bam (str or pysam.Samfile) – Filename or pysam-opened BAM file.
  • span (int) – Number of reads used to calculate median read length.
cnvlib.samutil.idxstats(bam_fname, drop_unmapped=False)[source]

Get chromosome names, lengths, and number of mapped/unmapped reads.

Use the BAM index (.bai) to get the number of reads and size of each chromosome. Contigs with no mapped reads are skipped.

cnvlib.samutil.is_newer_than(target_fname, orig_fname)[source]

Compare file modification times.

segfilters

Filter copy number segments.

cnvlib.segfilters.ampdel(segarr)[source]

Merge segments by amplified/deleted/neutral copy number status.

Follow the clinical reporting convention:

  • 5+ copies (2.5-fold gain) is amplification
  • 0 copies is homozygous/deep deletion
  • CNAs of lesser degree are not reported

This is recommended only for selecting segments corresponding to actionable, usually focal, CNAs. Real and potentially informative but lower-level CNAs will be merged together.
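
The reporting convention above amounts to a three-way classification of the integer copy number. The sketch below shows only that classification; the name `ampdel_status_sketch` and the label strings are hypothetical, and the merging of adjacent segments that share a status is not shown.

```python
def ampdel_status_sketch(copy_number):
    """Classify an integer copy number by the clinical convention."""
    if copy_number >= 5:
        return "amplification"
    if copy_number == 0:
        return "deep deletion"
    return "not reported"


for cn in (0, 1, 2, 4, 5, 12):
    print(cn, ampdel_status_sketch(cn))
```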

cnvlib.segfilters.bic(segarr)[source]

Merge segments by Bayesian Information Criterion.

See: BIC-seq (Xi 2011), doi:10.1073/pnas.1110574108

cnvlib.segfilters.ci(segarr)[source]

Merge segments by confidence interval (overlapping 0).

Segments with lower CI above 0 are kept as gains, upper CI below 0 as losses, and the rest with CI overlapping zero are collapsed as neutral.

cnvlib.segfilters.cn(segarr)[source]

Merge segments by integer copy number.

cnvlib.segfilters.enumerate_changes(levels)[source]

Assign a unique integer to each run of identical values.

Repeated but non-consecutive values will be assigned different integers.
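
A plain-Python sketch of this run labeling, assuming labels start at 0 (the original text does not say where numbering begins). The name `enumerate_changes_sketch` is hypothetical.

```python
def enumerate_changes_sketch(levels):
    """Give each run of identical consecutive values its own integer."""
    ids = []
    current = -1
    previous = object()  # sentinel that never equals a real level
    for level in levels:
        if level != previous:
            current += 1
            previous = level
        ids.append(current)
    return ids


print(enumerate_changes_sketch([2, 2, 3, 3, 2, 2, 0]))
```

Note that the second run of 2s gets a new label rather than reusing the first one, so grouping by these labels only ever merges adjacent rows.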

cnvlib.segfilters.require_column(*colnames)[source]

Wrapper to coordinate the segment-filtering functions.

Verify that the given columns are in the CopyNumArray the wrapped function takes. Also log the number of rows in the array before and after filtration.

cnvlib.segfilters.sem(segarr)[source]

Merge segments by Standard Error of the Mean (SEM).

Use each segment’s SEM value to estimate a 95% confidence interval (via zscore). Segments with lower CI above 0 are kept as gains, upper CI below 0 as losses, and the rest with CI overlapping zero are collapsed as neutral.
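
The gain/loss/neutral decision for a single segment can be sketched as below. The name `sem_call_sketch` and the label strings are hypothetical, and z = 1.96 is assumed as the usual two-sided 95% z-score.

```python
def sem_call_sketch(mean_log2, sem, z=1.96):
    """Call a segment from an approximate 95% CI around its mean log2."""
    lower = mean_log2 - z * sem
    upper = mean_log2 + z * sem
    if lower > 0:
        return "gain"
    if upper < 0:
        return "loss"
    return "neutral"  # the interval overlaps zero


print(sem_call_sketch(0.5, 0.1))
print(sem_call_sketch(0.1, 0.1))
```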

cnvlib.segfilters.squash_by_groups(cnarr, levels)[source]

Reduce CopyNumArray rows to a single row within each given level.

cnvlib.segfilters.squash_region(cnarr)[source]

Reduce a CopyNumArray to 1 row, keeping fields sensible.

Most fields added by the segmetrics command will be dropped.

smoothing

Signal smoothing functions.

cnvlib.smoothing.check_inputs(x, width)[source]

Transform width into a half-window size.

width is either a fraction of the length of x or an integer size of the whole window. The output half-window size is truncated to the length of x if needed.

cnvlib.smoothing.fit_edges(x, y, wing, polyorder=3)[source]

Apply polynomial interpolation to the edges of y, in-place.

Calculates a polynomial fit (of order polyorder) of x within a window of width twice wing, then updates the smoothed values y in the half of the window closest to the edge.

cnvlib.smoothing.outlier_iqr(a, c=3.0)[source]

Detect outliers as a multiple of the IQR from the median.

By convention, “outliers” are points more than 1.5 * IQR from the median, and “extremes” or extreme outliers are those more than 3.0 * IQR.
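
A standard-library sketch of this rule follows. The name `outlier_iqr_sketch` is hypothetical, and the quartile method is an assumption: quartile conventions differ between libraries, and this sketch uses the "inclusive" method of `statistics.quantiles`.

```python
from statistics import median, quantiles


def outlier_iqr_sketch(values, c=3.0):
    """Flag points more than c * IQR away from the median."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    center = median(values)
    return [abs(v - center) > c * iqr for v in values]


print(outlier_iqr_sketch([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```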

cnvlib.smoothing.outlier_mad_median(a)[source]

MAD-Median rule for detecting outliers.

X_i is an outlier if:

\[\frac{|X_i - M|}{MAD / 0.6745} > K \approx 2.24\]

where $K = \sqrt{\chi^2_{0.975,1}}$, the square root of the 0.975 quantile of a chi-squared distribution with 1 degree of freedom.

This is a very robust rule with the highest possible breakdown point of 0.5.

Returns:A boolean array of the same size as a, where outlier indices are True.
Return type:np.array

References

  • Davies & Gather (1993) The Identification of Multiple Outliers.
  • Rand R. Wilcox (2012) Introduction to robust estimation and hypothesis testing. Ch.3: Estimating measures of location and scale.
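
The rule can be sketched with the standard library as follows. The name `outlier_mad_median_sketch` is hypothetical. The comparison is written as a multiplication so that a MAD of zero does not cause a division error; in that case any value differing from the median is flagged, which is a choice made for this sketch.

```python
from statistics import median


def outlier_mad_median_sketch(values, k=2.24):
    """Flag X_i where |X_i - M| / (MAD / 0.6745) > K."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    scale = mad / 0.6745
    # Equivalent to the ratio test, but safe when scale is zero
    return [abs(v - center) > k * scale for v in values]


print(outlier_mad_median_sketch([1, 2, 3, 4, 100]))
```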
cnvlib.smoothing.rolling_median(x, width)[source]

Rolling median with mirrored edges.
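
To illustrate what mirrored edges means, the sketch below pads the input by reflecting its first and last few values before taking the median over each window. The name `rolling_median_sketch` is hypothetical, and the details are assumptions: width is an odd integer window size here, and the reflection includes the edge value itself.

```python
from statistics import median


def rolling_median_sketch(x, width):
    """Rolling median over an odd integer window, with reflected edges."""
    x = list(x)
    wing = width // 2
    if wing == 0 or len(x) <= wing:
        return x
    # Reflect `wing` values at each end so every point has a full window
    padded = x[wing - 1::-1] + x + x[:-wing - 1:-1]
    return [median(padded[i:i + 2 * wing + 1]) for i in range(len(x))]


print(rolling_median_sketch([1, 9, 2, 3, 4], width=3))
```

The output has the same length as the input, and the spike at the second position is smoothed away.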

cnvlib.smoothing.rolling_outlier_iqr(x, width, c=3.0)[source]

Detect outliers as a multiple of the IQR from the median.

By convention, “outliers” are points more than 1.5 * IQR from the median (~2 SD if values are normally distributed), and “extremes” or extreme outliers are those more than 3.0 * IQR (~4 SD).

cnvlib.smoothing.rolling_outlier_quantile(x, width, q, m)[source]

Detect outliers by multiples of a quantile in a window.

Outliers are the array elements outside m times the q-th quantile of deviations from the smoothed trend line, as calculated from the trend line residuals. (For example, take the magnitude of the 95th quantile times 5, and mark any elements greater than that value as outliers.)

This is the smoothing method used in BIC-seq (doi:10.1073/pnas.1110574108) with the parameters width=200, q=.95, m=5 for WGS.

Returns:A boolean array of the same size as x, where outlier indices are True.
Return type:np.array
cnvlib.smoothing.rolling_outlier_std(x, width, stdevs)[source]

Detect outliers by stdev within a rolling window.

Outliers are the array elements outside stdevs standard deviations from the smoothed trend line, as calculated from the trend line residuals.

Returns:A boolean array of the same size as x, where outlier indices are True.
Return type:np.array
cnvlib.smoothing.rolling_quantile(x, width, quantile)[source]

Rolling quantile (0–1) with mirrored edges.

cnvlib.smoothing.rolling_std(x, width)[source]

Rolling standard deviation with mirrored edges.

cnvlib.smoothing.smoothed(x, width, do_fit_edges=False)[source]

Smooth the values in x with the Kaiser windowed filter.

See: https://en.wikipedia.org/wiki/Kaiser_window

Parameters:
  • x (array-like) – 1-dimensional numeric data set.
  • width (float) – Fraction of x’s total length to include in the rolling window (i.e. the proportional window width), or the integer size of the window.