cnvlib package¶
Module cnvlib contents¶
The one function exposed at the top level, read, loads a file in CNVkit’s BED-like tabular format and returns a CopyNumArray instance. For your own scripting, you can usually accomplish what you need using just the CopyNumArray and GenomicArray methods available on this returned object (see Core classes).
To load other file formats, see Tabular file I/O (tabio). To run functions equivalent to CNVkit commands within Python, see Interface to CNVkit sub-commands.
Core classes¶
The core objects used throughout CNVkit. The base class is GenomicArray from
scikit-genome package. All of these classes wrap a pandas DataFrame instance, which is accessible through
the .data attribute and can be used for any manipulations that aren’t
already provided by methods in the wrapper class.
cnary¶
CNVkit’s core data structure, a copy number array.
- class cnvlib.cnary.CopyNumArray(data_table: DataFrame, meta_dict: dict[str, str] | None = None)[source]¶
Bases:
GenomicArrayAn array of genomic intervals, treated like aCGH probes.
Required columns: chromosome, start, end, gene, log2
Optional columns: gc, rmask, spread, weight, probes
- property log2¶
- chr_x_filter(diploid_parx_genome: str | None = None) Series[source]¶
All regions on X, potentially without PAR1/2.
- chr_y_filter(diploid_parx_genome: str | None = None) Series[source]¶
All regions on Y, potentially without PAR1/2.
- autosomes(diploid_parx_genome: str | None = None, also: Series | None = None) CopyNumArray[source]¶
Overrides GenomeArray.autosomes().
- by_gene(ignore: tuple[str, str, str] = ('-', '.', 'CGH')) Iterator[tuple[str, CopyNumArray]][source]¶
Iterate over probes grouped by gene name.
Group each series of intergenic bins as an “Antitarget” gene; any “Antitarget” bins within a gene are grouped with that gene.
Bins’ gene names are split on commas to accommodate overlapping genes and bins that cover multiple genes.
- center_all(estimator: Union[Callable, str]=<function Series.median>, by_chrom: bool = True, skip_low: bool = False, verbose: bool = False, diploid_parx_genome: str | None = None) None[source]¶
Re-center log2 values to the autosomes’ average (in-place).
- Parameters:
estimator (str or callable) – Function to estimate central tendency. If a string, must be one of ‘mean’, ‘median’, ‘mode’, ‘biweight’ (for biweight location). Median by default.
skip_low (bool) – Whether to drop very-low-coverage bins (via drop_low_coverage) before estimating the center value.
by_chrom (bool) – If True, first apply estimator to each chromosome separately, then apply estimator to the per-chromosome values, to reduce the impact of uneven targeting or extreme aneuploidy. Otherwise, apply estimator to all log2 values directly.
diploid_parx_genome (String) – Whether to include the PAR1/2 on chr X from the given genome (build) as part of the autosomes
- drop_low_coverage(verbose: bool = False) CopyNumArray[source]¶
Drop bins with extremely low log2 coverage or copy ratio values.
These are generally bins that had no reads mapped due to sample-specific issues. A very small log2 ratio or coverage value may have been substituted to avoid domain or divide-by-zero errors.
- squash_genes(summary_func: Callable = <function biweight_location>, squash_antitarget: bool = False, ignore: tuple[str, str, str]=('-', '.', 'CGH')) CopyNumArray[source]¶
Combine consecutive bins with the same targeted gene name.
- Parameters:
summary_func (callable) – Function to summarize an array of log2 values to produce a new log2 value for a “squashed” (i.e. reduced) region. By default this is the biweight location, but you might want median, mean, max, min or something else in some cases.
squash_antitarget (bool) – If True, also reduce consecutive “Antitarget” bins into a single bin. Otherwise, keep “Antitarget” and ignored bins as they are in the output.
ignore (list or tuple of str) – Bin names to be treated as “Antitarget” instead of as unique genes.
- Returns:
Another, usually smaller, copy of self with each gene’s bins reduced to a single bin with appropriate values.
- Return type:
- shift_xx(is_haploid_x_reference: bool = False, is_xx: bool_ | None = None, diploid_parx_genome: str | None = None) CopyNumArray[source]¶
Adjust chrX log2 ratios to match the ploidy of the reference sex.
I.e. add 1 to chrX log2 ratios for a male sample vs. female reference, or subtract 1 for a female sample vs. male reference, so that chrX log2 values are comparable across samples with different chromosomal sex.
- guess_xx(is_haploid_x_reference: bool = False, diploid_parx_genome: str | None = None, verbose: bool = True) bool_ | None[source]¶
Detect chromosomal sex; return True if a sample is probably female.
Uses compare_sex_chromosomes to calculate coverage ratios of the X and Y chromosomes versus autosomes.
- Parameters:
is_haploid_x_reference (bool) – Was this sample normalized to a male reference copy number profile?
verbose (bool) – If True, print (i.e. log to console) the ratios of the log2 coverages of the X and Y chromosomes versus autosomes, the “maleness” ratio of male vs. female expectations for each sex chromosome, and the inferred chromosomal sex.
- Returns:
True if the coverage ratios indicate the sample is female.
- Return type:
- compare_sex_chromosomes(is_haploid_x_reference: bool = False, diploid_parx_genome: str | None = None, skip_low: bool = False) tuple[bool_, dict[str, float64 | float]] | tuple[bool_, dict[str, float64]] | tuple[None, dict[Any, Any]][source]¶
Compare coverage ratios of sex chromosomes versus autosomes.
Perform 4 Mood’s median tests of the log2 coverages on chromosomes X and Y, separately shifting for assumed male and female chromosomal sex. Compare the chi-squared values obtained to infer whether the male or female assumption fits the data better.
- Parameters:
is_haploid_x_reference (bool) – Whether a male reference copy number profile was used to normalize the data. If so, a male sample should have log2 values of 0 on X and Y, and female +1 on X, deep negative (below -3) on Y. Otherwise, a male sample should have log2 values of -1 on X and 0 on Y, and female 0 on X, deep negative (below -3) on Y.
skip_low (bool) – If True, drop very-low-coverage bins (via drop_low_coverage) before comparing log2 coverage ratios. Included for completeness, but shouldn’t affect the result much since the M-W test is nonparametric and p-values are not used here.
- Returns:
bool – True if the sample appears male.
dict – Calculated values used for the inference: relative log2 ratios of chromosomes X and Y versus the autosomes; the Mann-Whitney U values from each test; and ratios of U values for male vs. female assumption on chromosomes X and Y.
- expect_flat_log2(is_haploid_x_reference: bool | None = None, diploid_parx_genome: str | None = None) ndarray[source]¶
Get the uninformed expected copy ratios of each bin.
Create an array of log2 coverages like a “flat” reference.
This is a neutral copy ratio at each autosome (log2 = 0.0) and sex chromosomes based on whether the reference is male (XX or XY).
- residuals(segments: CopyNumArray | GenomicArray | None = None) Series[source]¶
Difference in log2 value of each bin from its segment mean.
- Parameters:
segments (GenomicArray, CopyNumArray, or None) –
Determines the “mean” value to which self log2 values are relative:
If CopyNumArray, use the log2 values as the segment means to subtract.
If GenomicArray with no log2 values, group self by these ranges and subtract each group’s median log2 value.
If None, subtract each chromosome’s median.
- Returns:
Residual log2 values from self relative to segments; same length as self.
- Return type:
array
vary¶
An array of genomic intervals, treated as variant loci.
- class cnvlib.vary.VariantArray(data_table: DataFrame, meta_dict: dict[str, str] | None = None)[source]¶
Bases:
GenomicArrayAn array of genomic intervals, treated as variant loci.
Required columns: chromosome, start, end, ref, alt
- baf_by_ranges(ranges, summary_func=<function nanmedian>, above_half=None, tumor_boost=False)[source]¶
Aggregate variant (b-allele) frequencies in each given bin.
Get the average BAF in each of the bins of another genomic array: BAFs are mirrored above/below 0.5 (per above_half), grouped in each bin of ranges, and summarized into one value per bin with summary_func (default median).
- Parameters:
ranges (GenomicArray or subclass) – Bins for grouping the variants in self.
above_half (bool) – The same as in mirrored_baf.
tumor_boost (bool) – The same as in mirrored_baf.
- Returns:
Average b-allele frequency in each range; same length as ranges. May contain NaN values where no variants overlap a range.
- Return type:
float array
- zygosity_from_freq(het_freq=0.0, hom_freq=1.0)[source]¶
Set zygosity (genotype) according to allele frequencies.
Creates or replaces ‘zygosity’ column if ‘alt_freq’ column is present, and ‘n_zygosity’ if ‘n_alt_freq’ is present.
- heterozygous() VariantArray[source]¶
Subset to only heterozygous variants.
Use ‘zygosity’ or ‘n_zygosity’ genotype values (if present) to exclude variants with value 0.0 or 1.0. If these columns are missing, or there are no heterozygous variants, then return the full (input) set of variants.
- Returns:
The subset of self with heterozygous genotype, or allele frequency between the specified thresholds.
- Return type:
- mirrored_baf(above_half=None, tumor_boost=False)[source]¶
Mirrored B-allele frequencies (BAFs).
- Parameters:
above_half (bool or None) – If specified, flip BAFs to be all above 0.5 (True) or below 0.5 (False), respectively, for consistency. Otherwise, if None, mirror in the direction of the majority of BAFs.
tumor_boost (bool) – Normalize tumor-sample allele frequencies to the matched normal sample’s allele frequencies.
- Returns:
Mirrored b-allele frequencies, the same length as self. May contain NaN values.
- Return type:
float array
Interface to CNVkit sub-commands¶
commands¶
The public API for each of the commands defined in the CNVkit workflow.
Command-line interface and corresponding API for CNVkit.
- cnvlib.commands.do_target(bait_arr: GenomicArray, annotate: str | None = None, do_short_names: bool = False, do_split: bool = False, avg_size: int | float = 266.6666666666667) GenomicArray[source]¶
Transform bait intervals into targets more suitable for CNVkit.
- Parameters:
bait_arr (GenomicArray) – Bait intervals from a BED or interval file.
annotate (str, optional) – Path to annotation file (BED, GFF, refFlat, etc.) to assign gene names to target regions.
do_short_names (bool, optional) – Reduce multi-accession bait labels to be short and consistent. Default is False.
do_split (bool, optional) – Split large target intervals into smaller pieces. Default is False.
avg_size (float, optional) – Average target size when splitting large intervals. Default is 200/0.75 (~267 bp).
- Returns:
Processed target intervals ready for CNVkit analysis.
- Return type:
- cnvlib.commands.do_access(fa_fname: str, exclude_fnames: Iterable[str] = (), min_gap_size: int = 5000, skip_noncanonical: bool = True) GenomicArray[source]¶
List the locations of accessible sequence regions in a FASTA file.
- Parameters:
fa_fname (str) – Path to FASTA file to analyze.
exclude_fnames (Iterable[str], optional) – Paths to BED files of regions to exclude from accessibility.
min_gap_size (int, optional) – Minimum gap size between accessible regions to keep them separate. Default is 5000.
skip_noncanonical (bool, optional) – Skip non-canonical chromosomes (e.g., chr*_random, chrUn_*). Default is True.
- Returns:
Accessible genomic regions.
- Return type:
- cnvlib.commands.do_antitarget(targets: GenomicArray, access: GenomicArray | None = None, avg_bin_size: int = 150000, min_bin_size: int | None = None) GenomicArray[source]¶
Derive off-target (“antitarget”) bins from target regions.
- Parameters:
targets (GenomicArray) – Target genomic regions.
access (GenomicArray, optional) – Accessible genomic regions to constrain antitarget bins.
avg_bin_size (int, optional) – Average size for antitarget bins. Default is 150000.
min_bin_size (int, optional) – Minimum size for antitarget bins. If not specified, calculated as 2 * avg_bin_size * (2^MIN_REF_COVERAGE).
- Returns:
Antitarget genomic regions.
- Return type:
- cnvlib.commands.do_autobin(bam_fname: str, method: str, targets: GenomicArray | None = None, access: GenomicArray | None = None, bp_per_bin: float = 100000.0, target_min_size: int = 20, target_max_size: int = 50000, antitarget_min_size: int = 500, antitarget_max_size: int = 1000000, fasta: str | None = None) tuple[tuple[float, int], tuple[float, int]][source]¶
Quickly calculate reasonable bin sizes from BAM read counts.
- Parameters:
bam_fname (str) – Path to BAM file.
method (str) – Sequencing method: ‘wgs’ (whole-genome sequencing), ‘amplicon’ (targeted amplicon capture), or ‘hybrid’ (hybridization capture).
targets (GenomicArray, optional) – Targeted genomic regions (required for ‘hybrid’ and ‘amplicon’).
access (GenomicArray, optional) – Sequencing-accessible regions of the reference genome (for ‘hybrid’ and ‘wgs’).
bp_per_bin (float, optional) – Desired number of sequencing read nucleotide bases mapped to each bin. Default is 100000.0.
target_min_size (int, optional) – Minimum target bin size. Default is 20.
target_max_size (int, optional) – Maximum target bin size. Default is 50000.
antitarget_min_size (int, optional) – Minimum antitarget bin size. Default is 500.
antitarget_max_size (int, optional) – Maximum antitarget bin size. Default is 1000000.
fasta (str, optional) – Path to reference genome FASTA file.
- Returns:
Nested tuple: ((target depth, target avg. bin size), (antitarget depth, antitarget avg. bin size)).
- Return type:
- Raises:
ValueError – If targets are required for the method but not provided or empty.
- cnvlib.commands.do_coverage(bed_fname: str, bam_or_bg_fname: str, by_count: bool = False, min_mapq: int = 0, processes: int = 1, fasta: str | None = None) CopyNumArray[source]¶
Calculate coverage in the given regions from BAM read depths.
- Parameters:
bed_fname (str) – Path to BED file defining regions to measure coverage.
bam_or_bg_fname (str) – Path to BAM file containing aligned reads or bedGraph file (.bed.gz).
by_count (bool, optional) – Calculate coverage by read count instead of read depth. Default is False. Ignored for bedGraph input.
min_mapq (int, optional) – Minimum mapping quality score to include a read. Default is 0. Ignored for bedGraph input.
processes (int, optional) – Number of parallel processes to use. Default is 1. Ignored for bedGraph input.
fasta (str, optional) – Path to reference genome FASTA file.
- Returns:
Coverage values for each region.
- Return type:
- Raises:
RuntimeError – If the BAM file is not sorted by coordinates.
FileNotFoundError – If bedGraph file lacks required tabix index.
- cnvlib.commands.do_reference(target_fnames: list[str], antitarget_fnames: list[str] | None = None, fa_fname: str | None = None, is_haploid_x_reference: bool = False, diploid_parx_genome: str | None = None, female_samples: bool | None = None, do_gc: bool = True, do_edge: bool = True, do_rmask: bool = True, do_cluster: bool = False, min_cluster_size: int = 4) CopyNumArray[source]¶
Compile a pooled coverage reference from normal samples.
Creates a reference profile by combining coverage data from multiple normal (control) samples. The reference represents the expected neutral copy number baseline and is used to normalize tumor/test sample coverage for CNV detection.
The reference includes per-bin coverage values, GC content, RepeatMasker information, and other features used for bias correction. Sample sex is either provided or automatically inferred from X/Y chromosome coverage.
- Parameters:
target_fnames (list of str) – Paths to target coverage files (.targetcoverage.cnn or .cnn) from normal samples. These are typically generated by the coverage command on on-target/baited regions.
antitarget_fnames (list of str, optional) – Paths to antitarget coverage files (.antitargetcoverage.cnn) from the same normal samples. Must match target_fnames in order and count. If None, only targets are used (e.g., for amplicon sequencing). Default: None
fa_fname (str, optional) – Path to reference genome FASTA file. Required for GC content and RepeatMasker calculations. If None, GC and RM corrections are skipped. Default: None
is_haploid_x_reference (bool, optional) – True if the reference samples are male (haploid X chromosome). Affects expected X chromosome coverage normalization. Default: False (assumes female reference or mixed)
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’). PAR regions on X/Y are treated as diploid even in male samples. Default: None
female_samples (bool, optional) – Explicit sex designation for all samples. If None, sex is inferred from X/Y chromosome coverage patterns automatically. True = all female, False = all male. Default: None (auto-infer)
do_gc (bool, optional) – Calculate and include GC content for each bin from FASTA. Requires fa_fname. Used for GC bias correction. Default: True
do_edge (bool, optional) – Calculate and include “edge effect” metric (bin distance from interval edges). Used for edge bias correction. Default: True
do_rmask (bool, optional) – Calculate and include RepeatMasker content for each bin from FASTA. Requires fa_fname. Used for repeat region bias correction. Default: True
do_cluster (bool, optional) – Apply hierarchical clustering to identify and exclude outlier samples. Helps create a more robust reference when sample quality varies. Default: False
min_cluster_size (int, optional) – Minimum cluster size when do_cluster=True. Smaller clusters are considered outliers and excluded. Default: 4
- Returns:
Pooled reference with combined coverage values, GC content, RepeatMasker content, and other features. This .cnn file is used as input to the fix command for normalizing tumor samples.
- Return type:
Notes
The reference creation process:
Sex inference (if not provided): Examines X/Y chromosome coverage ratios in each sample to determine chromosomal sex.
Coverage pooling: Combines coverage across samples using robust statistics (typically median).
Bias feature calculation: Extracts GC%, RepeatMasker%, edge effects from the reference genome FASTA.
Optional clustering: Identifies and removes outlier samples.
The resulting reference should be reused for all samples prepared with the same capture kit, sequencing platform, and protocol.
Warnings are issued for bins with unusually low or zero coverage, which may indicate problematic baited regions.
See also
combine_probesInternal function that performs the actual reference creation
infer_sexesInfers sample sex from X/Y chromosome coverage
warn_bad_binsWarns about bins with problematic coverage
Examples
Basic reference from normal samples: >>> ref = do_reference( … [‘N1.targetcoverage.cnn’, ‘N2.targetcoverage.cnn’], … [‘N1.antitargetcoverage.cnn’, ‘N2.antitargetcoverage.cnn’], … fa_fname=’hg19.fa’ … )
Reference without FASTA (no GC/RM correction): >>> ref = do_reference( … [‘N1.targetcoverage.cnn’, ‘N2.targetcoverage.cnn’], … [‘N1.antitargetcoverage.cnn’, ‘N2.antitargetcoverage.cnn’] … )
Reference with explicit male samples: >>> ref = do_reference( … target_fnames=[‘M1.cnn’, ‘M2.cnn’], … fa_fname=’hg38.fa’, … female_samples=False, … diploid_parx_genome=’hg38’ … )
- cnvlib.commands.do_reference_flat(targets: str, antitargets: str | None = None, fa_fname: None = None, is_haploid_x_reference: bool = False, diploid_parx_genome: str | None = None) CopyNumArray[source]¶
Compile a neutral-coverage reference from the given intervals.
Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.
- Parameters:
targets (str) – Path to BED file with target intervals.
antitargets (str, optional) – Path to BED file with antitarget intervals.
fa_fname (str, optional) – Path to reference genome FASTA file for GC/RepeatMasker calculations.
is_haploid_x_reference (bool, optional) – True if reference should use male (haploid X) chromosome coverage. Default is False.
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’).
- Returns:
Neutral reference with flat (expected neutral) log2 coverage values and calculated GC/RepeatMasker features.
- Return type:
- cnvlib.commands.do_fix(target_raw: CopyNumArray, antitarget_raw: CopyNumArray, reference: CopyNumArray, diploid_parx_genome: str | None = None, do_gc: bool = True, do_edge: bool = True, do_rmask: bool = True, do_cluster: bool = False, smoothing_window_fraction: None = None) CopyNumArray[source]¶
Normalize tumor/test sample coverage using a reference and correct biases.
This is the core normalization step in the CNVkit pipeline. It combines target and antitarget coverage data, applies the reference baseline, corrects for systematic biases (GC content, edge effects, RepeatMasker), and produces log2 copy ratios ready for segmentation.
The “fix” name refers to fixing/correcting coverage biases.
- Parameters:
target_raw (CopyNumArray) – Raw coverage data for on-target/baited regions from the test sample. Typically a .targetcoverage.cnn file from the coverage command.
antitarget_raw (CopyNumArray) – Raw coverage data for off-target/antitarget regions from the test sample. Typically a .antitargetcoverage.cnn file from the coverage command. Can be empty for amplicon sequencing.
reference (CopyNumArray) – Pooled reference created from normal samples using do_reference. Contains expected neutral coverage and bias correction features.
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’). Ensures PAR regions on X/Y chromosomes are treated as diploid during normalization. Default: None
do_gc (bool, optional) – Apply GC bias correction using GC content from reference. Corrects for systematic coverage variations due to GC%. Default: True
do_edge (bool, optional) – Apply edge effect correction for target bins. Corrects for reduced coverage near bait interval boundaries. Default: True
do_rmask (bool, optional) – Apply RepeatMasker correction for antitarget bins. Corrects for coverage variations in repetitive regions. Default: True
do_cluster (bool, optional) – If the reference contains multiple sub-clusters (from hierarchical clustering during reference creation), select the cluster with highest correlation to this sample. Default: False
smoothing_window_fraction (None, optional) – Reserved parameter for future smoothing functionality. Currently not implemented. Default: None
- Returns:
Normalized copy number ratios (.cnr file) with log2 ratios centered around 0 (neutral copy number). Ready for segmentation with the segment command.
- Return type:
Notes
The normalization process:
Bias correction (separately for targets and antitargets):
GC correction: Adjusts for GC content-dependent coverage bias
Edge correction: Adjusts for lower coverage near bait edges (targets only)
RepeatMasker correction: Adjusts for repeats (antitargets only)
Reference subtraction:
Subtracts reference log2 values from sample log2 values
If do_cluster=True, selects best-matching reference cluster
Weighting:
Applies statistical weights based on coverage variability
Down-weights unreliable bins
Centering:
Centers the genome-wide distribution around log2=0
Accounts for sex chromosomes and PAR regions
The output .cnr file contains one row per bin with columns:
chromosome, start, end, gene: Genomic location
log2: Copy ratio in log2 scale (0 = diploid, +1 = gain, -1 = loss)
depth: Read depth
weight: Statistical weight for segmentation
See also
do_referenceCreates the reference used for normalization
load_adjust_coveragesInternal function that performs bias correction
apply_weightsCalculates statistical weights for bins
Examples
Basic usage: >>> fixed = do_fix( … target_raw=read_cna(‘Sample.targetcoverage.cnn’), … antitarget_raw=read_cna(‘Sample.antitargetcoverage.cnn’), … reference=read_cna(‘reference.cnn’) … )
With cluster selection: >>> fixed = do_fix( … target_raw, antitarget_raw, reference, … do_cluster=True # Select best-matching reference cluster … )
For male sample with PAR handling: >>> fixed = do_fix( … target_raw, antitarget_raw, reference, … diploid_parx_genome=’hg38’ … )
- cnvlib.commands.do_segmentation(cnarr: CNA, method: str, diploid_parx_genome: str | None = None, threshold: float | None = None, variants: VariantArray | None = None, skip_low: bool = False, skip_outliers: int = 10, min_weight: int = 0, save_dataframe: bool = False, rscript_path: str = 'Rscript', processes: int = 1, smooth_cbs: bool = False) CNA[source]¶
Infer copy number segments from the given coverage table.
- Parameters:
cnarr (CopyNumArray) – Bin-level copy number ratios (.cnr file).
method (str) – Segmentation algorithm: ‘cbs’, ‘flasso’, ‘haar’, ‘hmm’, ‘hmm-tumor’, or ‘hmm-germline’.
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’).
threshold (float, optional) – Significance threshold (for CBS/flasso/haar) or smoothing window size (for HMM methods). If None, uses method-specific defaults.
variants (VariantArray, optional) – Variant allele frequencies to incorporate into HMM segmentation.
skip_low (bool, optional) – Skip bins with low coverage. Default is False.
skip_outliers (int, optional) – Skip bins with log2 ratios more than this many standard deviations from the chromosome arm mean. Default is 10.
min_weight (int, optional) – Minimum weight threshold for including bins. Default is 0.
save_dataframe (bool, optional) – Return the R dataframe as a string along with segments. Default is False.
rscript_path (str, optional) – Path to Rscript executable for CBS/flasso methods. Default is “Rscript”.
processes (int, optional) – Number of parallel processes to use. Default is 1.
smooth_cbs (bool, optional) – Apply smoothing to CBS segmentation results. Default is False.
- Returns:
Segmented copy number data (.cns format). If save_dataframe=True, returns (segments, R dataframe string).
- Return type:
- Raises:
ValueError – If method is not one of the supported segmentation methods.
- cnvlib.commands.do_call(cnarr: CopyNumArray, variants: VariantArray | None = None, method: str = 'threshold', ploidy: int = 2, purity: float | None = None, is_haploid_x_reference: bool = False, is_sample_female: bool = False, diploid_parx_genome: str | None = None, filters: list[str] | None = None, thresholds: tuple[float, float, float, float] | ndarray = (-1.1, -0.25, 0.2, 0.7)) CopyNumArray[source]¶
Assign absolute integer copy number to each segment.
This is the main API function for calling absolute copy numbers from segmented log2 ratios. It supports multiple calling methods and can optionally incorporate variant allele frequencies and tumor purity information.
- Parameters:
cnarr (CopyNumArray) – Segmented copy number data (.cns file), typically from the segment command. Should contain ‘log2’ column with copy ratio values.
variants (VariantArray, optional) – Variant allele frequency data from VCF, used to call allele-specific copy numbers. If provided, ‘baf’ (B-allele frequency) will be added to the output, and ‘cn1’/’cn2’ (major/minor allele copy numbers) will be calculated.
method (str, optional) – Calling method to use. Options: - ‘threshold’: Apply log2 ratio thresholds (default) - ‘clonal’: Assume pure/clonal sample, infer from ploidy - ‘none’: Skip copy number calling, only apply filters Default: ‘threshold’
ploidy (int, optional) – Expected baseline ploidy of the sample. Usually 2 for diploid. Default: 2
purity (float, optional) – Estimated tumor purity (0.0 to 1.0). If provided and < 1.0, log2 ratios will be rescaled to account for normal cell contamination. Default: None (assume pure sample)
is_haploid_x_reference (bool, optional) – Whether the reference sample is male (haploid X chromosome). Affects X chromosome copy number interpretation. Default: False
is_sample_female (bool, optional) – Whether the test sample is female. Used with purity rescaling. Default: False
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’). Treats PAR regions as diploid even in male samples. Default: None
filters (list of str, optional) – Segment filters to apply. Options include ‘ci’ (confidence interval), ‘sem’ (standard error), ‘ampdel’ (amplification/deletion), etc. See segfilters module for full list. Default: None
thresholds (tuple of 4 floats or ndarray, optional) – Log2 ratio thresholds for calling copy numbers when method=’threshold’. Format: (del_threshold, loss_threshold, gain_threshold, amp_threshold) These map to copy numbers [0, 1, 2, 3, 4+] respectively. Default: (-1.1, -0.25, 0.2, 0.7)
- Returns:
Copy of input array with added ‘cn’ column (absolute copy number). If variants provided, also includes: - ‘baf’: B-allele frequency per segment - ‘cn1’: Major allele copy number (≥ cn2) - ‘cn2’: Minor allele copy number (≤ cn1)
- Return type:
- Raises:
ValueError – If method is not one of ‘threshold’, ‘clonal’, or ‘none’.
See also
absolute_clonalCalculate absolute copy numbers for pure samples
absolute_thresholdApply log2 thresholds to call copy numbers
rescale_bafRescale B-allele frequencies for tumor purity
Examples
Basic threshold calling: >>> calls = do_call(segments, method=’threshold’)
With tumor purity and variants: >>> calls = do_call(segments, variants=vcf, purity=0.7, ploidy=2)
With custom thresholds: >>> calls = do_call(segments, thresholds=(-1.5, -0.3, 0.3, 1.0))
- cnvlib.commands.do_scatter(cnarr: CopyNumArray, segments: CopyNumArray | None = None, variants: None = None, show_range: None = None, show_gene: None = None, do_trend: bool = False, by_bin: bool = False, window_width: float = 1000000.0, y_min: None = None, y_max: None = None, fig_size: None = None, antitarget_marker: None = None, segment_color: str = 'darkorange', title: None = None) Figure[source]¶
Plot probe log2 coverages and segmentation calls together.
- Parameters:
cnarr (CopyNumArray) – Bin-level copy number data to plot.
segments (CopyNumArray, optional) – Segmented copy number data to overlay.
variants (VariantArray, optional) – Variant allele frequency data to plot.
show_range (str, optional) – Genomic range to display (e.g., “chr1:1000000-2000000”).
show_gene (str, optional) – Gene name to zoom into.
do_trend (bool, optional) – Plot a smoothed trendline. Default is False.
by_bin (bool, optional) – Plot by bin index rather than genomic coordinates. Default is False.
window_width (float, optional) – Smoothing window width in bases (or bins if by_bin=True). Default is 1e6.
y_min (float, optional) – Minimum y-axis value.
y_max (float, optional) – Maximum y-axis value.
fig_size (tuple of float, optional) – Figure size as (width, height) in inches.
antitarget_marker (str, optional) – Marker style for antitarget bins.
segment_color (str, optional) – Color for segment lines. Default is SEG_COLOR.
title (str, optional) – Plot title.
- Returns:
The scatter plot figure.
- Return type:
- cnvlib.commands.do_heatmap(cnarrs, show_range=None, do_desaturate=False, by_bin=False, delim_sampl=False, vertical=False, title=None, ax=None)[source]¶
Plot copy number for multiple samples as a heatmap.
- Parameters:
cnarrs (list of CopyNumArray) – Copy number data for multiple samples.
show_range (str, optional) – Genomic range to display (e.g., “chr1:1000000-2000000”).
do_desaturate (bool, optional) – Desaturate colors in heatmap. Default is False.
by_bin (bool, optional) – Plot by bin position rather than genomic position. Default is False.
delim_sampl (bool, optional) – Draw delimiter lines between samples. Default is False.
vertical (bool, optional) – Orient heatmap vertically instead of horizontally. Default is False.
title (str, optional) – Title for the plot.
ax (matplotlib.axes.Axes, optional) – Existing axes to plot on. If None, creates new figure.
- Returns:
The axes object with the heatmap plot.
- Return type:
- cnvlib.commands.do_breaks(probes: CopyNumArray, segments: CopyNumArray, min_probes: int = 1) pd.DataFrame[source]¶
List the targeted genes in which a copy number breakpoint occurs.
- Parameters:
probes (CopyNumArray) – Bin-level copy number data.
segments (CopyNumArray) – Segmented copy number data.
min_probes (int, optional) – Minimum number of probes required on each side of the breakpoint. Default is 1.
- Returns:
Table with columns: gene, chromosome, location, change, probes_left, probes_right.
- Return type:
pd.DataFrame
- cnvlib.commands.do_genemetrics(cnarr: CopyNumArray, segments: CopyNumArray | None = None, threshold: float = 0.2, min_probes: int = 3, skip_low: bool = False, is_haploid_x_reference: bool = False, is_sample_female: None = None, diploid_parx_genome: str | None = None) pd.DataFrame[source]¶
Identify targeted genes with copy number gain or loss.
- Parameters:
cnarr (CopyNumArray) – Bin-level copy number data.
segments (CopyNumArray, optional) – Segmented copy number data. If provided, metrics are calculated per segment.
threshold (float, optional) – Minimum absolute log2 ratio to consider a gene altered. Default is 0.2.
min_probes (int, optional) – Minimum number of probes required to report a gene. Default is 3.
skip_low (bool, optional) – Skip bins with low coverage. Default is False.
is_haploid_x_reference (bool, optional) – Whether reference is male (haploid X). Default is False.
is_sample_female (bool, optional) – Whether sample is female. If None, inferred from data.
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’).
- Returns:
Table of genes with copy number alterations, including gene name, chromosome, log2 ratio, and probe counts.
- Return type:
pd.DataFrame
- cnvlib.commands.do_sex(cnarrs: list[CopyNumArray], is_haploid_x_reference: bool, diploid_parx_genome: str | None) pd.DataFrame[source]¶
Guess samples’ sex from the relative coverage of chromosomes X and Y.
- Parameters:
cnarrs (list of CopyNumArray) – Copy number data for one or more samples.
is_haploid_x_reference (bool) – Whether the reference sample is male (haploid X chromosome).
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’).
- Returns:
Table with columns: sample, sex, X_logratio, Y_logratio. Sex is inferred from X and Y chromosome coverage patterns.
- Return type:
pd.DataFrame
- cnvlib.commands.do_metrics(cnarrs: CopyNumArray, segments: CopyNumArray | None = None, skip_low: bool = False) pd.DataFrame[source]¶
Compute coverage deviations and other metrics for self-evaluation.
- Parameters:
cnarrs (CopyNumArray or list of CopyNumArray) – Bin-level copy number data for one or more samples.
segments (CopyNumArray, list of CopyNumArray, or None, optional) – Segmented copy number data. If None, computes metrics without segment-based residuals.
skip_low (bool, optional) – Skip bins with low coverage. Default is False.
- Returns:
Metrics table with columns: sample, segments, stdev, mad, iqr, bivar. Each row contains quality metrics for one sample.
- Return type:
pd.DataFrame
- cnvlib.commands.do_segmetrics(cnarr: CopyNumArray, segarr: CopyNumArray, location_stats: tuple[()] | list[str] = (), spread_stats: tuple[()] | list[str] = (), interval_stats: tuple[()] | list[str] = (), alpha: float = 0.05, bootstraps: int = 100, smoothed: bool = False, skip_low: bool = False) CopyNumArray[source]¶
Compute segment-level metrics from bin-level log2 ratios.
- Parameters:
cnarr (CopyNumArray) – Bin-level copy number data.
segarr (CopyNumArray) – Segmented copy number data.
location_stats (list of str, optional) – Location statistics to compute: ‘mean’, ‘median’, ‘mode’, ‘p_ttest’. Default is empty tuple.
spread_stats (list of str, optional) – Spread statistics to compute: ‘stdev’, ‘mad’, ‘mse’, ‘iqr’, ‘bivar’, ‘sem’. Default is empty tuple.
interval_stats (list of str, optional) – Interval statistics to compute: ‘ci’ (confidence interval), ‘pi’ (prediction interval). Default is empty tuple.
alpha (float, optional) – Significance level for confidence/prediction intervals. Default is 0.05.
bootstraps (int, optional) – Number of bootstrap iterations for confidence intervals. Default is 100.
smoothed (bool, optional) – Use smoothed bootstrap for confidence intervals. Default is False.
skip_low (bool, optional) – Skip bins with low coverage. Default is False.
- Returns:
Segmented data with additional statistical columns.
- Return type:
- cnvlib.commands.do_bintest(cnarr: CopyNumArray, segments: CopyNumArray | None = None, alpha: float = 0.005, target_only: bool = False) CopyNumArray[source]¶
Get a probability for each bin based on its Z-score.
Adds a column with p-values to the input .cnr. With segments, the Z-score is relative to the enclosing segment’s mean, otherwise it is relative to 0.
Bin p-values are corrected for multiple hypothesis testing by the Benjamini-Hochberg method.
- Parameters:
cnarr (CopyNumArray) – Bin-level copy number ratios.
segments (CopyNumArray, optional) – Segmented copy number data to use as baseline. If provided, Z-scores are calculated relative to segment means.
alpha (float, optional) – Significance threshold for filtering bins. Default is 0.005.
target_only (bool, optional) – Test only on-target bins, excluding antitarget bins. Default is False.
- Returns:
Bins where the probability is less than alpha.
- Return type:
- cnvlib.commands.do_import_theta(segarr: CopyNumArray, theta_results_fname: str, ploidy: int = 2) Iterator[CopyNumArray][source]¶
Import absolute copy number calls from THetA results.
THetA (Tumor Heterogeneity Analysis) infers tumor purity and clonal/subclonal copy number profiles. This function converts THetA’s output into CNVkit segment format with absolute copy numbers.
- Parameters:
segarr (CopyNumArray) – Input segmented copy number data (.cns file). Typically the segmentation results before calling absolute copy numbers.
theta_results_fname (str) – Path to THetA results file (.results or .withBounds file).
ploidy (int, optional) – Expected baseline ploidy (usually 2 for diploid). Used to calculate log2 ratios from absolute copy numbers. Default: 2.
- Yields:
CopyNumArray – One or more segment arrays with ‘cn’ (absolute copy number) and recalculated ‘log2’ values. THetA may output multiple solutions, each yielded separately.
Notes
Only autosomal segments are processed; sex chromosomes are excluded because THetA doesn’t handle them well.
Segments with unknown copy number (marked as None/X in THetA output) are dropped.
Copy number 0 is treated as 0.5 for log2 calculation to avoid -inf.
See also
parse_theta_resultsParses the THetA results file format
- cnvlib.commands.do_import_rna(gene_count_fnames, in_format, gene_resource_fname, correlations_fname=None, normal_fnames=(), do_gc=True, do_txlen=True, max_log2=3, diploid_parx_genome=None)[source]¶
Convert a cohort of per-gene read counts to CNVkit .cnr format.
The expected data source is TCGA gene-level expression counts for individual samples, but other sources should be fine, too.
- Parameters:
gene_count_fnames (list of str) – Paths to gene count files for all samples.
in_format (str) – Input file format: ‘rsem’ or ‘counts’.
gene_resource_fname (str) – Path to gene metadata/resource file.
correlations_fname (str, optional) – Path to TCGA gene expression/CNV correlation profiles.
normal_fnames (tuple of str, optional) – Paths to normal/control sample count files.
do_gc (bool, optional) – Apply GC bias correction. Default is True.
do_txlen (bool, optional) – Apply transcript length correction. Default is True.
max_log2 (float, optional) – Maximum log2 ratio to cap values at. Default is 3.
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’).
- Returns:
(all_data DataFrame, cnrs generator) - all_data: Summary table with gene info and log2-normalized values - cnrs: Generator of CopyNumArray objects, one per sample
- Return type:
- Raises:
RuntimeError – If input format is not recognized.
The following modules implement lower-level functionality specific to each of the CNVkit sub-commands.
access¶
List the locations of accessible sequence regions in a FASTA file.
Inaccessible regions, e.g. telomeres and centromeres, are masked out with N in the reference genome sequence; this script scans those to identify the coordinates of the accessible regions (those between the long spans of N’s).
- cnvlib.access.do_access(fa_fname: str, exclude_fnames: Iterable[str] = (), min_gap_size: int = 5000, skip_noncanonical: bool = True) GenomicArray[source]¶
List the locations of accessible sequence regions in a FASTA file.
- Parameters:
fa_fname (str) – Path to FASTA file to analyze.
exclude_fnames (Iterable[str], optional) – Paths to BED files of regions to exclude from accessibility.
min_gap_size (int, optional) – Minimum gap size between accessible regions to keep them separate. Default is 5000.
skip_noncanonical (bool, optional) – Skip non-canonical chromosomes (e.g., chr*_random, chrUn_*). Default is True.
- Returns:
Accessible genomic regions.
- Return type:
- cnvlib.access.drop_noncanonical_contigs(region_tups: Iterable[tuple]) Iterable[tuple][source]¶
Drop contigs with noncanonical names.
region_tups is an iterable of (chrom, start, end) tuples.
Yield the same, but dropping noncanonical chrom.
- cnvlib.access.get_regions(fasta_fname: str) Iterable[tuple][source]¶
Find accessible sequence regions (those not masked out with ‘N’).
- cnvlib.access.log_this(chrom: str, run_start: int, run_end: int) tuple[source]¶
Log a coordinate range, then return it as a tuple.
- cnvlib.access.join_regions(regions: GenomicArray, min_gap_size: int) tuple[source]¶
Filter regions, joining those separated by small gaps.
antitarget¶
Supporting functions for the ‘antitarget’ command.
- cnvlib.antitarget.do_antitarget(targets: GenomicArray, access: GenomicArray | None = None, avg_bin_size: int = 150000, min_bin_size: int | None = None) GenomicArray[source]¶
Derive off-target (“antitarget”) bins from target regions.
- Parameters:
targets (GenomicArray) – Target genomic regions.
access (GenomicArray, optional) – Accessible genomic regions to constrain antitarget bins.
avg_bin_size (int, optional) – Average size for antitarget bins. Default is 150000.
min_bin_size (int, optional) – Minimum size for antitarget bins. If not specified, calculated as 2 * avg_bin_size * (2^MIN_REF_COVERAGE).
- Returns:
Antitarget genomic regions.
- Return type:
- cnvlib.antitarget.get_antitargets(targets: GenomicArray, accessible: GenomicArray, avg_bin_size: int, min_bin_size: int | None) GenomicArray[source]¶
Generate antitarget intervals between/around target intervals.
Procedure:
Invert target intervals
Subtract the inverted targets from accessible regions
For each of the resulting regions:
Shrink by a fixed margin on each end
If it’s smaller than min_bin_size, skip
Divide into equal-size (region_size/avg_bin_size) portions
Emit the (chrom, start, end) coords of each portion
- cnvlib.antitarget.drop_noncanonical_contigs(accessible: GenomicArray, targets: GenomicArray, verbose: bool = True) GenomicArray[source]¶
Drop contigs that are not targeted or canonical chromosomes.
Antitargets will be binned over chromosomes that:
Appear in the sequencing-accessible regions of the reference genome sequence, and
Contain at least one targeted region, or
Are named like a canonical chromosome (1-22,X,Y for human)
This allows antitarget binning to pick up canonical chromosomes that do not contain any targets, as well as non-canonical or oddly named chromosomes that were targeted.
- cnvlib.antitarget.compare_chrom_names(a_regions: GenomicArray, b_regions: GenomicArray) tuple[set, set][source]¶
Check if chromosome names overlap, and preview each if not.
This summary message will help triage cases of e.g. “chr1” vs. “1” in the two genomic datasets being compared.
- cnvlib.antitarget.guess_chromosome_regions(targets: GenomicArray, telomere_size: int) GenomicArray[source]¶
Determine (minimum) chromosome lengths from target coordinates.
autobin¶
Estimate reasonable bin sizes from BAM read counts or depths.
- cnvlib.autobin.midsize_file(fnames: Iterable[str]) list[str][source]¶
Select the median-size file from several given filenames.
If an even number of files is given, selects the file just below the median.
- cnvlib.autobin.do_autobin(bam_fname: str, method: str, targets: GenomicArray | None = None, access: GenomicArray | None = None, bp_per_bin: float = 100000.0, target_min_size: int = 20, target_max_size: int = 50000, antitarget_min_size: int = 500, antitarget_max_size: int = 1000000, fasta: str | None = None) tuple[tuple[float, int], tuple[float, int]][source]¶
Quickly calculate reasonable bin sizes from BAM read counts.
- Parameters:
bam_fname (str) – Path to BAM file.
method (str) – Sequencing method: ‘wgs’ (whole-genome sequencing), ‘amplicon’ (targeted amplicon capture), or ‘hybrid’ (hybridization capture).
targets (GenomicArray, optional) – Targeted genomic regions (required for ‘hybrid’ and ‘amplicon’).
access (GenomicArray, optional) – Sequencing-accessible regions of the reference genome (for ‘hybrid’ and ‘wgs’).
bp_per_bin (float, optional) – Desired number of sequencing read nucleotide bases mapped to each bin. Default is 100000.0.
target_min_size (int, optional) – Minimum target bin size. Default is 20.
target_max_size (int, optional) – Maximum target bin size. Default is 50000.
antitarget_min_size (int, optional) – Minimum antitarget bin size. Default is 500.
antitarget_max_size (int, optional) – Maximum antitarget bin size. Default is 1000000.
fasta (str, optional) – Path to reference genome FASTA file.
- Returns:
Nested tuple: ((target depth, target avg. bin size), (antitarget depth, antitarget avg. bin size)).
- Return type:
- Raises:
ValueError – If targets are required for the method but not provided or empty.
- cnvlib.autobin.hybrid(rc_table: DataFrame, read_len: int | float, bam_fname: str, targets: GenomicArray, access: GenomicArray | None = None, fasta: str | None = None) tuple[source]¶
Hybrid capture sequencing.
- cnvlib.autobin.average_depth(rc_table: DataFrame, read_length: int | float) float[source]¶
Estimate the average read depth across the genome.
- Returns:
Median of the per-chromosome mean read depths, weighted by chromosome size.
- Return type:
- cnvlib.autobin.idxstats2ga(table: DataFrame, bam_fname: str) GenomicArray[source]¶
- cnvlib.autobin.sample_region_cov(bam_fname: str, regions: GenomicArray, max_num: int = 100, fasta: str | None = None) float[source]¶
Calculate read depth in a randomly sampled subset of regions.
- cnvlib.autobin.sample_midsize_regions(regions: GenomicArray, max_num: int) DataFrame[source]¶
Randomly select a subset of up to max_num regions.
Intersection of DataFrame .chromosome values.
- cnvlib.autobin.update_chrom_length(rc_table: DataFrame, regions: GenomicArray | None) DataFrame[source]¶
- cnvlib.autobin.region_size_by_chrom(regions: GenomicArray) DataFrame[source]¶
- cnvlib.autobin.total_region_size(regions: GenomicArray) int[source]¶
Aggregate area of all genomic ranges in regions.
batch¶
The ‘batch’ command.
- cnvlib.batch.batch_make_reference(normal_bams: list[str], target_bed: str | None, antitarget_bed: None, is_haploid_x_reference: bool, diploid_parx_genome: str | None, fasta: str, annotate: str | None, short_names: bool, target_avg_size: int, access_bed: None, antitarget_avg_size: int | None, antitarget_min_size: int | None, output_reference: None, output_dir: str, processes: int, by_count: bool, method: str, do_cluster: bool) tuple[str, str, str][source]¶
Build a complete copy number reference from normal samples.
This is the high-level workflow function used by the batch command to create a reference. It coordinates target/antitarget preparation, coverage calculation across normal samples, and reference construction.
The function handles three sequencing methods: hybrid capture (default), whole genome sequencing (WGS), and targeted amplicon sequencing.
- Parameters:
normal_bams (list of str) – Paths to BAM files from normal/control samples. If empty, creates a “flat” reference with uniform expected coverage.
target_bed (str, optional) – Path to BED file defining target/baited regions. Required for hybrid capture and amplicon. Optional for WGS (can be auto-generated).
antitarget_bed (None) – Path to pre-computed antitarget BED file. If None (typical), antitargets are automatically generated for hybrid capture.
is_haploid_x_reference (bool) – True if reference samples are male (haploid X chromosome).
diploid_parx_genome (str, optional) – Reference genome name (e.g., ‘hg19’, ‘hg38’) for pseudo-autosomal region handling on X/Y chromosomes.
fasta (str) – Path to reference genome FASTA file. Required for GC/RepeatMasker calculations and for auto-generating WGS targets.
annotate (str, optional) – Path to gene annotation file (e.g., refFlat.txt) to add gene names to targets. Recommended for WGS.
short_names (bool) – Shorten target names to gene symbols when possible.
target_avg_size (int) – Target average bin size in base pairs. If 0 or None, automatically determined. For WGS, defaults to 5000 bp bins.
access_bed (None) – Path to sequencing-accessible regions BED file (e.g., from access command). Used for WGS to exclude low-mappability regions.
antitarget_avg_size (int, optional) – Average size for antitarget bins in bp (hybrid capture only).
antitarget_min_size (int, optional) – Minimum size for antitarget bins in bp (hybrid capture only).
output_reference (None) – Path for output reference file. If None, writes to “{output_dir}/reference.cnn”.
output_dir (str) – Directory for output files (intermediate coverage files, reference).
processes (int) – Number of parallel processes for coverage calculation. 0 = use all available CPUs.
by_count (bool) – Calculate coverage by read count instead of read depth.
method (str) – Sequencing protocol: ‘hybrid’ (default), ‘wgs’, or ‘amplicon’. Determines target/antitarget handling and default parameters.
do_cluster (bool) – Apply hierarchical clustering to identify and separate reference sample subgroups (useful for heterogeneous normal cohorts).
- Returns:
Paths to: (reference .cnn file, targets .bed file, antitargets .bed file)
- Return type:
Notes
The reference creation workflow:
Target preparation (protocol-dependent):
Hybrid capture: Uses provided target_bed, generates antitargets
WGS: Auto-generates genome-wide bins, no antitargets
Amplicon: Uses provided target_bed, no antitargets
Coverage calculation (parallel across normal samples):
Runs coverage command on each normal BAM
Creates .targetcoverage.cnn and .antitargetcoverage.cnn files
Parallelized across samples and within each coverage calculation
Reference construction:
Pools coverage across normal samples
Calculates GC content, RepeatMasker, edge effects from FASTA
Optionally clusters samples and creates sub-references
Outputs reference.cnn file
Flat reference: If no normal BAMs provided, creates a uniform reference assuming equal coverage in all bins. Less accurate but useful when no normals are available.
Protocol differences:
hybrid: Uses both targets and antitargets, edge correction enabled
wgs: Large genome-wide bins, no antitargets, no edge correction
amplicon: Uses only targets, no antitargets, no edge correction
- Raises:
ValueError – If method=’wgs’ or ‘amplicon’ but antitargets are specified. If method=’wgs’ and neither targets, access, nor fasta provided. If method=’wgs’ and targets != access (when both given).
See also
do_referenceCreates reference from coverage files
do_reference_flatCreates flat reference without normal samples
batch_write_coverageCalculates coverage for a single BAM
autobinAutomatically determines optimal bin size
Examples
Hybrid capture with normal samples: >>> ref, tgt, anti = batch_make_reference( … normal_bams=[‘N1.bam’, ‘N2.bam’, ‘N3.bam’], … target_bed=’baits.bed’, … antitarget_bed=None, # Auto-generate … is_haploid_x_reference=False, … diploid_parx_genome=’hg38’, … fasta=’hg38.fa’, … annotate=’refFlat.txt’, … short_names=True, … target_avg_size=0, # Auto-determine … access_bed=None, … antitarget_avg_size=None, … antitarget_min_size=None, … output_reference=None, … output_dir=’output/’, … processes=8, … by_count=False, … method=’hybrid’, … do_cluster=False … )
WGS without normal samples (flat reference): >>> ref, tgt, anti = batch_make_reference( … normal_bams=[], … target_bed=None, # Auto-generate from FASTA … antitarget_bed=None, … is_haploid_x_reference=True, … diploid_parx_genome=’hg19’, … fasta=’hg19.fa’, … annotate=’refFlat.txt’, … short_names=False, … target_avg_size=5000, … access_bed=’access-5k.hg19.bed’, … antitarget_avg_size=None, … antitarget_min_size=None, … output_reference=’flat_reference.cnn’, … output_dir=’wgs_ref/’, … processes=1, … by_count=False, … method=’wgs’, … do_cluster=False … )
- cnvlib.batch.batch_write_coverage(bed_fname: str, bam_fname: str, out_fname: str, by_count: bool, processes: int, fasta: str) str[source]¶
Run coverage on one sample, write to file.
- cnvlib.batch.batch_run_sample(bam_fname: str, target_bed: str, antitarget_bed: str, ref_fname: str, output_dir: str, is_haploid_x_reference: bool, diploid_parx_genome: str | None, plot_scatter: bool, plot_diagram: bool, rscript_path: str, by_count: bool, skip_low: bool, seq_method: str, segment_method: str, processes: int, do_cluster: bool, fasta: None = None) None[source]¶
Run the pipeline on one BAM file.
call¶
Call copy number variants from segmented log2 ratios.
- cnvlib.call.do_call(cnarr: CopyNumArray, variants: VariantArray | None = None, method: str = 'threshold', ploidy: int = 2, purity: float | None = None, is_haploid_x_reference: bool = False, is_sample_female: bool = False, diploid_parx_genome: str | None = None, filters: list[str] | None = None, thresholds: tuple[float, float, float, float] | ndarray = (-1.1, -0.25, 0.2, 0.7)) CopyNumArray[source]¶
Assign absolute integer copy number to each segment.
This is the main API function for calling absolute copy numbers from segmented log2 ratios. It supports multiple calling methods and can optionally incorporate variant allele frequencies and tumor purity information.
- Parameters:
cnarr (CopyNumArray) – Segmented copy number data (.cns file), typically from the segment command. Should contain ‘log2’ column with copy ratio values.
variants (VariantArray, optional) – Variant allele frequency data from VCF, used to call allele-specific copy numbers. If provided, ‘baf’ (B-allele frequency) will be added to the output, and ‘cn1’/’cn2’ (major/minor allele copy numbers) will be calculated.
method (str, optional) – Calling method to use. Options: - ‘threshold’: Apply log2 ratio thresholds (default) - ‘clonal’: Assume pure/clonal sample, infer from ploidy - ‘none’: Skip copy number calling, only apply filters Default: ‘threshold’
ploidy (int, optional) – Expected baseline ploidy of the sample. Usually 2 for diploid. Default: 2
purity (float, optional) – Estimated tumor purity (0.0 to 1.0). If provided and < 1.0, log2 ratios will be rescaled to account for normal cell contamination. Default: None (assume pure sample)
is_haploid_x_reference (bool, optional) – Whether the reference sample is male (haploid X chromosome). Affects X chromosome copy number interpretation. Default: False
is_sample_female (bool, optional) – Whether the test sample is female. Used with purity rescaling. Default: False
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’). Treats PAR regions as diploid even in male samples. Default: None
filters (list of str, optional) – Segment filters to apply. Options include ‘ci’ (confidence interval), ‘sem’ (standard error), ‘ampdel’ (amplification/deletion), etc. See segfilters module for full list. Default: None
thresholds (tuple of 4 floats or ndarray, optional) – Log2 ratio thresholds for calling copy numbers when method=’threshold’. Format: (del_threshold, loss_threshold, gain_threshold, amp_threshold) These map to copy numbers [0, 1, 2, 3, 4+] respectively. Default: (-1.1, -0.25, 0.2, 0.7)
- Returns:
Copy of input array with added ‘cn’ column (absolute copy number). If variants provided, also includes: - ‘baf’: B-allele frequency per segment - ‘cn1’: Major allele copy number (≥ cn2) - ‘cn2’: Minor allele copy number (≤ cn1)
- Return type:
- Raises:
ValueError – If method is not one of ‘threshold’, ‘clonal’, or ‘none’.
See also
absolute_clonalCalculate absolute copy numbers for pure samples
absolute_thresholdApply log2 thresholds to call copy numbers
rescale_bafRescale B-allele frequencies for tumor purity
Examples
Basic threshold calling: >>> calls = do_call(segments, method=’threshold’)
With tumor purity and variants: >>> calls = do_call(segments, variants=vcf, purity=0.7, ploidy=2)
With custom thresholds: >>> calls = do_call(segments, thresholds=(-1.5, -0.3, 0.3, 1.0))
- cnvlib.call.log2_ratios(cnarr: CopyNumArray, absolutes: Series, ploidy: int, is_haploid_x_reference: bool, diploid_parx_genome: str | None, min_abs_val: float = 0.001, round_to_int: bool = False) Series[source]¶
Convert absolute copy numbers to log2 ratios.
Optionally round copy numbers to integers.
Account for reference sex & ploidy of sex chromosomes.
- cnvlib.call.absolute_threshold(cnarr: CopyNumArray, ploidy: int, thresholds: tuple[float, float, float, float] | ndarray, is_haploid_x_reference: bool) ndarray[source]¶
Call integer copy number using hard thresholds for each level.
Integer values are assigned for log2 ratio values less than each given threshold value in sequence, counting up from zero. Above the last threshold value, integer copy numbers are called assuming full purity, diploidy, and rounding up.
Default thresholds follow this heuristic for calling CNAs in a tumor sample: For single-copy gains and losses, assume 50% tumor cell clonality (including normal cell contamination). Then:
R> log2(2:6 / 4) -1.0 -0.4150375 0.0 0.3219281 0.5849625
Allowing for random noise of +/- 0.1, the cutoffs are:
DEL(0) < -1.1 LOSS(1) < -0.25 GAIN(3) >= +0.2 AMP(4) >= +0.7
For germline samples, better precision could be achieved with:
LOSS(1) < -0.4 GAIN(3) >= +0.3
- cnvlib.call.absolute_clonal(cnarr: CopyNumArray, ploidy: int, purity: float, is_haploid_x_reference: bool, diploid_parx_genome: str | None, is_sample_female: bool) Series[source]¶
Calculate absolute copy number values from segment or bin log2 ratios.
- cnvlib.call.absolute_pure(cnarr: CopyNumArray, ploidy: int, is_haploid_x_reference: bool) ndarray[source]¶
Calculate absolute copy number values from segment or bin log2 ratios.
- cnvlib.call.absolute_dataframe(cnarr: CopyNumArray, ploidy: int, purity: float, is_haploid_x_reference: bool, diploid_parx_genome: str | None, is_sample_female: bool) DataFrame[source]¶
Absolute, expected and reference copy number in a DataFrame.
- cnvlib.call.absolute_expect(cnarr: CopyNumArray, ploidy: int, diploid_parx_genome: str | None, is_sample_female: bool) Series[source]¶
Absolute integer number of expected copies in each bin.
I.e. the given ploidy for autosomes, and XY or XX sex chromosome counts according to the sample’s specified chromosomal sex.
- cnvlib.call.absolute_reference(cnarr: CopyNumArray, ploidy: int, diploid_parx_genome: str | None, is_haploid_x_reference: bool) Series[source]¶
Absolute integer number of reference copies in each bin.
I.e. the given ploidy for autosomes, 1 or 2 X according to the reference sex, and always 1 copy of Y.
- cnvlib.call.get_as_dframe_and_set_reference_and_expect_copies(cnarr: CopyNumArray, ploidy: int, is_haploid_x_reference: bool, diploid_parx_genome: str | None, is_sample_female: bool) DataFrame[source]¶
Determine the number copies of a chromosome expected and in reference.
For sex chromosomes, these values may not be the same ploidy as the autosomes. The “reference” number is the chromosome’s ploidy in the CNVkit reference, while “expect” is the chromosome’s neutral ploidy in the given sample, based on the specified sex of each. E.g., given a female sample and a male reference, on chromosome X the “reference” value is 1 but “expect” is 2. Note that the “reference” value for chromosome Y is always 1 (better: ploidy / 2, see implementation below) to avoid divide-by-zero problems. The default reference is thus XXY (i.e. Klinefelter syndrome).
- Returns:
A pair of integers: number of copies in the reference, and expected in the sample.
- Return type:
coverage¶
Supporting functions for the ‘antitarget’ command.
- cnvlib.coverage.is_bedgraph_format(fname: str) bool[source]¶
Check if input file is bedGraph format (.bed.gz).
- cnvlib.coverage.validate_bedgraph_index(fname: str) None[source]¶
Ensure bedGraph file has a tabix index.
- Raises:
FileNotFoundError – If the index file (.tbi or .csi) is not found.
- cnvlib.coverage.bedgraph_to_basecount(bed_fname: str, bedgraph_fname: str) DataFrame[source]¶
Calculate coverage in BED regions from a bedGraph file.
Reads a tabix-indexed bedGraph file and calculates the total depth (sum of depth * bases) for each region in the BED file.
- Parameters:
- Returns:
DataFrame with columns matching bedcov output: chromosome, start, end, gene (if present in BED), and basecount.
- Return type:
pd.DataFrame
- cnvlib.coverage.do_coverage(bed_fname: str, bam_or_bg_fname: str, by_count: bool = False, min_mapq: int = 0, processes: int = 1, fasta: str | None = None) CopyNumArray[source]¶
Calculate coverage in the given regions from BAM read depths.
- Parameters:
bed_fname (str) – Path to BED file defining regions to measure coverage.
bam_or_bg_fname (str) – Path to BAM file containing aligned reads or bedGraph file (.bed.gz).
by_count (bool, optional) – Calculate coverage by read count instead of read depth. Default is False. Ignored for bedGraph input.
min_mapq (int, optional) – Minimum mapping quality score to include a read. Default is 0. Ignored for bedGraph input.
processes (int, optional) – Number of parallel processes to use. Default is 1. Ignored for bedGraph input.
fasta (str, optional) – Path to reference genome FASTA file.
- Returns:
Coverage values for each region.
- Return type:
- Raises:
RuntimeError – If the BAM file is not sorted by coordinates.
FileNotFoundError – If bedGraph file lacks required tabix index.
- cnvlib.coverage.interval_coverages(bed_fname: str, bam_fname: str, by_count: bool, min_mapq: int, processes: int, fasta: str | None = None) CopyNumArray[source]¶
Calculate log2 coverages in the BAM file at each interval.
- cnvlib.coverage.interval_coverages_bedgraph(regions_bed_fname: str, bedgraph_fname: str) CopyNumArray[source]¶
Calculate log2 coverages from bedGraph file at each interval.
- Parameters:
- Returns:
Coverage values for each region.
- Return type:
- cnvlib.coverage.interval_coverages_count(bed_fname: str, bam_fname: str, min_mapq: int, procs: int = 1, fasta: None = None) Iterator[list[int | tuple[str, int, int, str, float, float]]][source]¶
Calculate log2 coverages in the BAM file at each interval.
- cnvlib.coverage.region_depth_count(bamfile: AlignmentFile, chrom: str, start: int, end: int, gene: str, min_mapq: int) tuple[int, tuple[str, int, int, str, float, float]][source]¶
Calculate depth of a region via pysam count.
i.e. counting the number of read starts in a region, then scaling for read length and region width to estimate depth.
Coordinates are 0-based, per pysam.
- cnvlib.coverage.interval_coverages_pileup(bed_fname: str, bam_fname: str, min_mapq: int, procs: int = 1, fasta: str | None = None) DataFrame[source]¶
Calculate log2 coverages in the BAM file at each interval.
- cnvlib.coverage.bedcov(bed_fname: str, bam_fname: str, min_mapq: int, fasta: str | None = None, max_depth: int | None = None) DataFrame[source]¶
Calculate depth of all regions in a BED file via samtools (pysam) bedcov.
i.e. mean pileup depth across each region.
- cnvlib.coverage.detect_bedcov_columns(text: str) list[str][source]¶
Determine which ‘bedcov’ output columns to keep.
bedcov outputs the input BED columns plus an appended numeric column (basecount). With some options (e.g. -d), an additional trailing numeric column may be appended; then basecount is the second-to-last column.
diagram¶
Chromosome diagram drawing functions.
This uses and abuses Biopython’s BasicChromosome module. It depends on ReportLab, too, so we isolate this functionality here so that the rest of CNVkit will run without it. (And also to keep the codebase tidy.)
- cnvlib.diagram.create_diagram(cnarr: CopyNumArray, segarr: CopyNumArray, threshold: float, min_probes: int, outfname: str, show_range: None = None, title: None = None, show_labels: bool = True) str[source]¶
Create the diagram.
- cnvlib.diagram.build_chrom_diagram(features: collections.defaultdict[str, list[tuple[int, int, int, None, colors.Color]]], chr_sizes: collections.OrderedDict, sample_id: str, title: None = None) Drawing[source]¶
Create a PDF of color-coded features on chromosomes.
export¶
Export CNVkit objects and files to other formats.
- cnvlib.export.merge_samples(filenames: list[str]) DataFrame[source]¶
Merge probe values from multiple samples into a 2D table (of sorts).
- Input:
dict of {sample ID: (probes, values)}
- Output:
list-of-tuples: (probe, log2 coverages…)
- cnvlib.export.fmt_jtv(sample_ids: list[str], table: DataFrame) tuple[list[str], map][source]¶
Format for Java TreeView.
- cnvlib.export.export_nexus_basic(cnarr: CopyNumArray) pd.DataFrame[source]¶
Biodiscovery Nexus Copy Number “basic” format.
Only represents one sample per file.
- cnvlib.export.export_nexus_ogt(cnarr: CopyNumArray, varr: VariantArray, min_weight: float = 0.0) pd.DataFrame[source]¶
Biodiscovery Nexus Copy Number “Custom-OGT” format.
To create the b-allele frequencies column, alterate allele frequencies from the VCF are aligned to the .cnr file bins. Bins that contain no variants are left blank; if a bin contains multiple variants, then the frequencies are all “mirrored” to be above or below .5 (majority rules), then the median of those values is taken.
- cnvlib.export.export_seg(sample_fnames: list[str], chrom_ids: bool = False) DataFrame[source]¶
SEG format for copy number segments.
Segment breakpoints are not the same across samples, so samples are listed in serial with the sample ID as the left column.
- cnvlib.export.export_bed(segments: CopyNumArray, ploidy: int, is_haploid_x_reference: bool, diploid_parx_genome: str | None, is_sample_female: bool, label: str, show: str) pd.DataFrame[source]¶
Convert a copy number array to a BED-like DataFrame.
For each region in each sample (possibly filtered according to show), the columns are:
reference sequence name
start (0-indexed)
end
sample name or given label
integer copy number
By default (show=”ploidy”), skip regions where copy number is the default ploidy, i.e. equal to 2 or the value set by –ploidy. If show=”variant”, skip regions where copy number is neutral, i.e. equal to the reference ploidy on autosomes, or half that on sex chromosomes.
- cnvlib.export.export_vcf(segments: CopyNumArray, ploidy: int, is_haploid_x_reference: bool, diploid_parx_genome: str | None, is_sample_female: bool, sample_id: None = None, cnarr: None = None) tuple[str, str][source]¶
Convert segments to Variant Call Format.
For now, only 1 sample per VCF. (Overlapping CNVs seem tricky.)
- cnvlib.export.assign_ci_start_end(segarr, cnarr)[source]¶
Assign ci_start and ci_end fields to segments.
Values for each segment indicate the CI boundary points within that segment, i.e. the right CI boundary for the left-side breakpoint (segment start), and left CI boundary for the right-side breakpoint (segment end).
This is a little unintuitive because the CI refers to the breakpoint, not the segment, but we’re storing the info in an array of segments.
Calculation: Just use the boundaries of the bins left- and right-adjacent to each segment breakpoint.
- cnvlib.export.segments2vcf(segments: CopyNumArray, ploidy: int, is_haploid_x_reference: bool, diploid_parx_genome: str | None, is_sample_female: bool) Iterator[tuple[str, int, str, str, str, str, str, str, str, str]][source]¶
Convert copy number segments to VCF records.
- cnvlib.export.export_gistic_markers(cnr_fnames)[source]¶
Generate a GISTIC 2.0 “markers” file from a set of .cnr files.
GISTIC documentation:
ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm http://genepattern.broadinstitute.org/ftp/distribution/genepattern/modules_public_server_doc/GISTIC2.pdf http://gdac.broadinstitute.org/runs/analyses__2013_05_23/reports/cancer/KICH-TP/CopyNumber_Gistic2/nozzle.html
The markers file identifies the marker names and positions of the markers in the original dataset (before segmentation). It is a three column, tab-delimited file with an optional header. The column headers are:
Marker Name
Chromosome
Marker Position (in bases)
GISTIC also needs an accompanying SEG file generated from corresponding .cns files.
- cnvlib.export.export_theta(tumor_segs: CopyNumArray, normal_cn: CopyNumArray | None) pd.DataFrame[source]¶
Convert tumor segments and normal .cnr or reference .cnn to THetA input.
Follows the THetA segmentation import script but avoid repeating the pileups, since we already have the mean depth of coverage in each target bin.
The options for average depth of coverage and read length do not matter crucially for proper operation of THetA; increased read counts per bin simply increase the confidence of THetA’s results.
- THetA2 input format is tabular, with columns:
ID, chrm, start, end, tumorCount, normalCount
where chromosome IDs (“chrm”) are integers 1 through 24.
- cnvlib.export.ref_means_nbins(tumor_segs: CopyNumArray, normal_cn: CopyNumArray | None) tuple[ndarray, pd.Series][source]¶
Extract segments’ reference mean log2 values and probe counts.
Code paths:
wt_mdn wt_old probes norm -> norm, nbins + * * - 0, wt_mdn - + + - 0, wt_old * probes - + - - 0, wt_old * size? - - + - 0, probes - - - - 0, size? + - + + norm, probes + - - + norm, bin counts - + + + norm, probes - + - + norm, bin counts - - + + norm, probes - - - + norm, bin counts
- cnvlib.export.theta_read_counts(log2_ratio: pd.Series | ndarray, nbins: pd.Series, avg_depth: int = 500, avg_bin_width: int = 200, read_len: int = 100) pd.Series[source]¶
Calculate segments’ read counts from log2-ratios.
- Math:
nbases = read_length * read_count
- and
nbases = bin_width * read_depth
- where
read_depth = read_depth_ratio * avg_depth
- So:
read_length * read_count = bin_width * read_depth read_count = bin_width * read_depth / read_length
- cnvlib.export.export_theta_snps(varr: VariantArray) Iterator[pd.DataFrame][source]¶
Generate THetA’s SNP per-allele read count “formatted.txt” files.
fix¶
Supporting functions for the ‘fix’ command.
- cnvlib.fix.do_fix(target_raw: CopyNumArray, antitarget_raw: CopyNumArray, reference: CopyNumArray, diploid_parx_genome: str | None = None, do_gc: bool = True, do_edge: bool = True, do_rmask: bool = True, do_cluster: bool = False, smoothing_window_fraction: None = None) CopyNumArray[source]¶
Normalize tumor/test sample coverage using a reference and correct biases.
This is the core normalization step in the CNVkit pipeline. It combines target and antitarget coverage data, applies the reference baseline, corrects for systematic biases (GC content, edge effects, RepeatMasker), and produces log2 copy ratios ready for segmentation.
The “fix” name refers to fixing/correcting coverage biases.
- Parameters:
target_raw (CopyNumArray) – Raw coverage data for on-target/baited regions from the test sample. Typically a .targetcoverage.cnn file from the coverage command.
antitarget_raw (CopyNumArray) – Raw coverage data for off-target/antitarget regions from the test sample. Typically a .antitargetcoverage.cnn file from the coverage command. Can be empty for amplicon sequencing.
reference (CopyNumArray) – Pooled reference created from normal samples using do_reference. Contains expected neutral coverage and bias correction features.
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’). Ensures PAR regions on X/Y chromosomes are treated as diploid during normalization. Default: None
do_gc (bool, optional) – Apply GC bias correction using GC content from reference. Corrects for systematic coverage variations due to GC%. Default: True
do_edge (bool, optional) – Apply edge effect correction for target bins. Corrects for reduced coverage near bait interval boundaries. Default: True
do_rmask (bool, optional) – Apply RepeatMasker correction for antitarget bins. Corrects for coverage variations in repetitive regions. Default: True
do_cluster (bool, optional) – If the reference contains multiple sub-clusters (from hierarchical clustering during reference creation), select the cluster with highest correlation to this sample. Default: False
smoothing_window_fraction (None, optional) – Reserved parameter for future smoothing functionality. Currently not implemented. Default: None
- Returns:
Normalized copy number ratios (.cnr file) with log2 ratios centered around 0 (neutral copy number). Ready for segmentation with the segment command.
- Return type:
Notes
The normalization process:
Bias correction (separately for targets and antitargets):
GC correction: Adjusts for GC content-dependent coverage bias
Edge correction: Adjusts for lower coverage near bait edges (targets only)
RepeatMasker correction: Adjusts for repeats (antitargets only)
Reference subtraction:
Subtracts reference log2 values from sample log2 values
If do_cluster=True, selects best-matching reference cluster
Weighting:
Applies statistical weights based on coverage variability
Down-weights unreliable bins
Centering:
Centers the genome-wide distribution around log2=0
Accounts for sex chromosomes and PAR regions
The output .cnr file contains one row per bin with columns:
chromosome, start, end, gene: Genomic location
log2: Copy ratio in log2 scale (0 = diploid, +1 = gain, -1 = loss)
depth: Read depth
weight: Statistical weight for segmentation
See also
do_referenceCreates the reference used for normalization
load_adjust_coveragesInternal function that performs bias correction
apply_weightsCalculates statistical weights for bins
Examples
Basic usage: >>> fixed = do_fix( … target_raw=read_cna(‘Sample.targetcoverage.cnn’), … antitarget_raw=read_cna(‘Sample.antitargetcoverage.cnn’), … reference=read_cna(‘reference.cnn’) … )
With cluster selection: >>> fixed = do_fix( … target_raw, antitarget_raw, reference, … do_cluster=True # Select best-matching reference cluster … )
For male sample with PAR handling: >>> fixed = do_fix( … target_raw, antitarget_raw, reference, … diploid_parx_genome=’hg38’ … )
- cnvlib.fix.load_adjust_coverages(cnarr: CopyNumArray, ref_cnarr: CopyNumArray, skip_low: bool, fix_gc: bool, fix_edge: bool, fix_rmask: bool, diploid_parx_genome: str | None, smoothing_window_fraction: None = None) tuple[CopyNumArray, CopyNumArray][source]¶
Load and filter probe coverages; correct using reference and GC.
- cnvlib.fix.mask_bad_bins(cnarr: CopyNumArray) Series[source]¶
Flag the bins with excessively low or inconsistent coverage.
- Returns:
A boolean array where True indicates bins that failed the checks.
- Return type:
np.array
- cnvlib.fix.match_ref_to_sample(ref_cnarr: CopyNumArray, samp_cnarr: CopyNumArray) CopyNumArray[source]¶
Filter the reference bins to match the sample (target or antitarget).
- cnvlib.fix.center_by_window(cnarr: CopyNumArray, fraction: float, sort_key: Series | ndarray) CopyNumArray[source]¶
Smooth out biases according to the trait specified by sort_key.
E.g. correct GC-biased bins by windowed averaging across similar-GC bins; or for similar interval sizes.
- cnvlib.fix.get_edge_bias(cnarr: CopyNumArray, margin: int) Series[source]¶
Quantify the “edge effect” of the target tile and its neighbors.
The result is proportional to the change in the target’s coverage due to these edge effects, i.e. the expected loss of coverage near the target edges and, if there are close neighboring tiles, gain of coverage due to “spill over” reads from the neighbor tiles.
(This is not the actual change in coverage. This is just a tribute.)
- cnvlib.fix.edge_losses(target_sizes: ndarray, insert_size: int) ndarray[source]¶
Calculate coverage losses at the edges of baited regions.
Letting i = insert size and t = target size, the proportional loss of coverage near the two edges of the baited region (combined) is:
\[i/2t\]If the “shoulders” extend outside the bait $(t < i), reduce by:
\[(i-t)^2 / 4it\]on each side, or (i-t)^2 / 2it total.
- cnvlib.fix.edge_gains(target_sizes: ndarray, gap_sizes: ndarray, insert_size: int) ndarray[source]¶
Calculate coverage gain from neighboring baits’ flanking reads.
Letting i = insert size, t = target size, g = gap to neighboring bait, the gain of coverage due to a nearby bait, if g < i, is:
.. math :: (i-g)^2 / 4it
If the neighbor flank extends beyond the target (t+g < i), reduce by:
.. math :: (i-t-g)^2 / 4it
If a neighbor overlaps the target, treat it as adjacent (gap size 0).
- cnvlib.fix.apply_weights(cnarr: CopyNumArray, ref_matched: CopyNumArray, log2_key: str, spread_key: str, epsilon: float = 0.0001) CopyNumArray[source]¶
Calculate weights for each bin.
Bin weight is an estimate of (1 - variance) and within the range
(0, 1].Weights are derived from:
Each bin’s size
Sample’s genome-wide average (on/off-target) coverage depth
Sample’s genome-wide observed (on/off-target) bin variances
And with a pooled reference:
Each bin’s coverage depth in the reference
The “spread” column of the reference (approx. stdev)
These estimates of variance assume the number of aligned reads per bin follows a Poisson distribution, approximately log-normal.
- Parameters:
cnarr (CopyNumArray) – Sample bins.
ref_match (CopyNumArray) – Reference bins.
log2_key (string) – The ‘log2’ column name in the reference to use. A clustered reference may have a suffix indicating the cluster, e.g. “log2_1”.
spread_key (string) – The ‘spread’ or ‘spread_<cluster_id>’ column name to use.
epsilon (float) – Minimum value for bin weights, to avoid 0-weight bins causing errors later during segmentation. (CBS doesn’t allow 0-weight bins.)
Returns (The input cnarr with a weight column added.)
heatmap¶
The ‘heatmap’ command.
- cnvlib.heatmap.do_heatmap(cnarrs, show_range=None, do_desaturate=False, by_bin=False, delim_sampl=False, vertical=False, title=None, ax=None)[source]¶
Plot copy number for multiple samples as a heatmap.
- Parameters:
cnarrs (list of CopyNumArray) – Copy number data for multiple samples.
show_range (str, optional) – Genomic range to display (e.g., “chr1:1000000-2000000”).
do_desaturate (bool, optional) – Desaturate colors in heatmap. Default is False.
by_bin (bool, optional) – Plot by bin position rather than genomic position. Default is False.
delim_sampl (bool, optional) – Draw delimiter lines between samples. Default is False.
vertical (bool, optional) – Orient heatmap vertically instead of horizontally. Default is False.
title (str, optional) – Title for the plot.
ax (matplotlib.axes.Axes, optional) – Existing axes to plot on. If None, creates new figure.
- Returns:
The axes object with the heatmap plot.
- Return type:
importers¶
Import from other formats to the CNVkit format.
- cnvlib.importers.do_import_picard(fname, too_many_no_coverage=100)[source]¶
Import coverage data from Picard CalculateHsMetrics output.
Converts Picard CalculateHsMetrics output to CNVkit format by reading the interval-level coverage data, cleaning up gene names, and calculating log2 ratios.
- Parameters:
- Returns:
CNVkit-compatible genomic array with columns including ‘gene’, ‘log2’, and ‘ratio’ (coverage relative to average).
- Return type:
Notes
Bins with zero coverage are assigned the NULL_LOG2_COVERAGE value to avoid math domain errors in log2 transformation.
Gene names from overlapping intervals are deduplicated (see unpipe_name).
See also
unpipe_nameCleans up duplicated gene names from Picard output
- cnvlib.importers.unpipe_name(name)[source]¶
Fix the duplicated gene names Picard spits out.
Return a string containing the single gene name, sans duplications and pipe characters.
Picard CalculateHsMetrics combines the labels of overlapping intervals by joining all labels with ‘|’, e.g. ‘BRAF|BRAF’ – no two distinct targeted genes actually overlap, though, so these dupes are redundant. Meaningless target names are dropped, e.g. ‘CGH|FOO|-’ resolves as ‘FOO’. In case of ambiguity, the longest name is taken, e.g. “TERT|TERT Promoter” resolves as “TERT Promoter”.
- cnvlib.importers.do_import_theta(segarr: CopyNumArray, theta_results_fname: str, ploidy: int = 2) Iterator[CopyNumArray][source]¶
Import absolute copy number calls from THetA results.
THetA (Tumor Heterogeneity Analysis) infers tumor purity and clonal/subclonal copy number profiles. This function converts THetA’s output into CNVkit segment format with absolute copy numbers.
- Parameters:
segarr (CopyNumArray) – Input segmented copy number data (.cns file). Typically the segmentation results before calling absolute copy numbers.
theta_results_fname (str) – Path to THetA results file (.results or .withBounds file).
ploidy (int, optional) – Expected baseline ploidy (usually 2 for diploid). Used to calculate log2 ratios from absolute copy numbers. Default: 2.
- Yields:
CopyNumArray – One or more segment arrays with ‘cn’ (absolute copy number) and recalculated ‘log2’ values. THetA may output multiple solutions, each yielded separately.
Notes
Only autosomal segments are processed; sex chromosomes are excluded because THetA doesn’t handle them well.
Segments with unknown copy number (marked as None/X in THetA output) are dropped.
Copy number 0 is treated as 0.5 for log2 calculation to avoid -inf.
See also
parse_theta_resultsParses the THetA results file format
import_rna¶
The ‘import-rna’ command.
- cnvlib.import_rna.do_import_rna(gene_count_fnames, in_format, gene_resource_fname, correlations_fname=None, normal_fnames=(), do_gc=True, do_txlen=True, max_log2=3, diploid_parx_genome=None)[source]¶
Convert a cohort of per-gene read counts to CNVkit .cnr format.
The expected data source is TCGA gene-level expression counts for individual samples, but other sources should be fine, too.
- Parameters:
gene_count_fnames (list of str) – Paths to gene count files for all samples.
in_format (str) – Input file format: ‘rsem’ or ‘counts’.
gene_resource_fname (str) – Path to gene metadata/resource file.
correlations_fname (str, optional) – Path to TCGA gene expression/CNV correlation profiles.
normal_fnames (tuple of str, optional) – Paths to normal/control sample count files.
do_gc (bool, optional) – Apply GC bias correction. Default is True.
do_txlen (bool, optional) – Apply transcript length correction. Default is True.
max_log2 (float, optional) – Maximum log2 ratio to cap values at. Default is 3.
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’).
- Returns:
(all_data DataFrame, cnrs generator) - all_data: Summary table with gene info and log2-normalized values - cnrs: Generator of CopyNumArray objects, one per sample
- Return type:
- Raises:
RuntimeError – If input format is not recognized.
- cnvlib.import_rna.aggregate_rsem(fnames)[source]¶
Pull out the expected read counts from each RSEM file.
The format of RSEM’s
*_rsem.genes.resultsoutput files is tab-delimited with a header row. We extract the Ensembl gene ID, expected read counts, and transcript lengths from each file.- Returns:
sample_counts (DataFrame) – Row index is Ensembl gene ID, column index is filename.
tx_lengths (Series) – Gene lengths.
metrics¶
Robust metrics to evaluate performance of copy number estimates.
- cnvlib.metrics.do_metrics(cnarrs: CopyNumArray, segments: CopyNumArray | None = None, skip_low: bool = False) pd.DataFrame[source]¶
Compute coverage deviations and other metrics for self-evaluation.
- Parameters:
cnarrs (CopyNumArray or list of CopyNumArray) – Bin-level copy number data for one or more samples.
segments (CopyNumArray, list of CopyNumArray, or None, optional) – Segmented copy number data. If None, computes metrics without segment-based residuals.
skip_low (bool, optional) – Skip bins with low coverage. Default is False.
- Returns:
Metrics table with columns: sample, segments, stdev, mad, iqr, bivar. Each row contains quality metrics for one sample.
- Return type:
pd.DataFrame
- cnvlib.metrics.zip_repeater(iterable: Iterator[Any], repeatable: list[CopyNumArray]) Iterator[tuple[CopyNumArray, CopyNumArray]][source]¶
Repeat a single segmentation to match the number of copy ratio inputs
reference¶
Supporting functions for the ‘reference’ command.
- cnvlib.reference.do_reference_flat(targets: str, antitargets: str | None = None, fa_fname: None = None, is_haploid_x_reference: bool = False, diploid_parx_genome: str | None = None) CopyNumArray[source]¶
Compile a neutral-coverage reference from the given intervals.
Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.
- Parameters:
targets (str) – Path to BED file with target intervals.
antitargets (str, optional) – Path to BED file with antitarget intervals.
fa_fname (str, optional) – Path to reference genome FASTA file for GC/RepeatMasker calculations.
is_haploid_x_reference (bool, optional) – True if reference should use male (haploid X) chromosome coverage. Default is False.
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’).
- Returns:
Neutral reference with flat (expected neutral) log2 coverage values and calculated GC/RepeatMasker features.
- Return type:
- cnvlib.reference.bed2probes(bed_fname: str) CopyNumArray[source]¶
Create a neutral-coverage CopyNumArray from a file of regions.
- cnvlib.reference.do_reference(target_fnames: list[str], antitarget_fnames: list[str] | None = None, fa_fname: str | None = None, is_haploid_x_reference: bool = False, diploid_parx_genome: str | None = None, female_samples: bool | None = None, do_gc: bool = True, do_edge: bool = True, do_rmask: bool = True, do_cluster: bool = False, min_cluster_size: int = 4) CopyNumArray[source]¶
Compile a pooled coverage reference from normal samples.
Creates a reference profile by combining coverage data from multiple normal (control) samples. The reference represents the expected neutral copy number baseline and is used to normalize tumor/test sample coverage for CNV detection.
The reference includes per-bin coverage values, GC content, RepeatMasker information, and other features used for bias correction. Sample sex is either provided or automatically inferred from X/Y chromosome coverage.
- Parameters:
target_fnames (list of str) – Paths to target coverage files (.targetcoverage.cnn or .cnn) from normal samples. These are typically generated by the coverage command on on-target/baited regions.
antitarget_fnames (list of str, optional) – Paths to antitarget coverage files (.antitargetcoverage.cnn) from the same normal samples. Must match target_fnames in order and count. If None, only targets are used (e.g., for amplicon sequencing). Default: None
fa_fname (str, optional) – Path to reference genome FASTA file. Required for GC content and RepeatMasker calculations. If None, GC and RM corrections are skipped. Default: None
is_haploid_x_reference (bool, optional) – True if the reference samples are male (haploid X chromosome). Affects expected X chromosome coverage normalization. Default: False (assumes female reference or mixed)
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’). PAR regions on X/Y are treated as diploid even in male samples. Default: None
female_samples (bool, optional) – Explicit sex designation for all samples. If None, sex is inferred from X/Y chromosome coverage patterns automatically. True = all female, False = all male. Default: None (auto-infer)
do_gc (bool, optional) – Calculate and include GC content for each bin from FASTA. Requires fa_fname. Used for GC bias correction. Default: True
do_edge (bool, optional) – Calculate and include “edge effect” metric (bin distance from interval edges). Used for edge bias correction. Default: True
do_rmask (bool, optional) – Calculate and include RepeatMasker content for each bin from FASTA. Requires fa_fname. Used for repeat region bias correction. Default: True
do_cluster (bool, optional) – Apply hierarchical clustering to identify and exclude outlier samples. Helps create a more robust reference when sample quality varies. Default: False
min_cluster_size (int, optional) – Minimum cluster size when do_cluster=True. Smaller clusters are considered outliers and excluded. Default: 4
- Returns:
Pooled reference with combined coverage values, GC content, RepeatMasker content, and other features. This .cnn file is used as input to the fix command for normalizing tumor samples.
- Return type:
Notes
The reference creation process:
Sex inference (if not provided): Examines X/Y chromosome coverage ratios in each sample to determine chromosomal sex.
Coverage pooling: Combines coverage across samples using robust statistics (typically median).
Bias feature calculation: Extracts GC%, RepeatMasker%, edge effects from the reference genome FASTA.
Optional clustering: Identifies and removes outlier samples.
The resulting reference should be reused for all samples prepared with the same capture kit, sequencing platform, and protocol.
Warnings are issued for bins with unusually low or zero coverage, which may indicate problematic baited regions.
See also
combine_probesInternal function that performs the actual reference creation
infer_sexesInfers sample sex from X/Y chromosome coverage
warn_bad_binsWarns about bins with problematic coverage
Examples
Basic reference from normal samples: >>> ref = do_reference( … [‘N1.targetcoverage.cnn’, ‘N2.targetcoverage.cnn’], … [‘N1.antitargetcoverage.cnn’, ‘N2.antitargetcoverage.cnn’], … fa_fname=’hg19.fa’ … )
Reference without FASTA (no GC/RM correction): >>> ref = do_reference( … [‘N1.targetcoverage.cnn’, ‘N2.targetcoverage.cnn’], … [‘N1.antitargetcoverage.cnn’, ‘N2.antitargetcoverage.cnn’] … )
Reference with explicit male samples: >>> ref = do_reference( … target_fnames=[‘M1.cnn’, ‘M2.cnn’], … fa_fname=’hg38.fa’, … female_samples=False, … diploid_parx_genome=’hg38’ … )
- cnvlib.reference.infer_sexes(cnn_fnames: list[str], is_haploid_x: bool, diploid_parx_genome: str | None) dict[str, bool_][source]¶
Map sample IDs to inferred chromosomal sex, where possible.
For samples where the source file is empty or does not include either sex chromosome, that sample ID will not be in the returned dictionary.
- cnvlib.reference.combine_probes(filenames: list[str], antitarget_fnames: list[str] | None, fa_fname: str | None, is_haploid_x: bool, diploid_parx_genome: str | None, sexes: dict[str, bool_ | bool], fix_gc: bool, fix_edge: bool, fix_rmask: bool, do_cluster: bool, min_cluster_size: int) CNA[source]¶
Calculate the median coverage of each bin across multiple samples.
- Parameters:
filenames (list) – List of string filenames, corresponding to targetcoverage.cnn and antitargetcoverage.cnn files, as generated by ‘coverage’ or ‘import-picard’.
fa_fname (str) – Reference genome sequence in FASTA format, used to extract GC and RepeatMasker content of each genomic bin.
is_haploid_x (bool)
do_cluster (bool)
fix_gc (bool)
fix_edge (bool)
fix_rmask (bool)
- Returns:
One object summarizing the coverages of the input samples, including each bin’s “average” coverage, “spread” of coverages, and GC content.
- Return type:
- cnvlib.reference.load_sample_block(filenames: list[str], fa_fname: str | None, is_haploid_x: bool, diploid_parx_genome: str | None, sexes: dict[str, bool_ | bool], skip_low: bool, fix_gc: bool, fix_edge: bool, fix_rmask: bool) tuple[pd.DataFrame, ndarray, ndarray][source]¶
Load and summarize a pool of *coverage.cnn files.
Run separately for the on-target and (optional) antitarget bins.
- Returns:
ref_df (pandas.DataFrame) – All columns needed for the reference CNA object, including aggregate log2 and spread.
all_logr (numpy.ndarray) – All sample log2 ratios, as a 2D matrix (rows=bins, columns=samples), to be used with do_cluster.
- cnvlib.reference.bias_correct_logr(cnarr: CNA, ref_columns: dict[str, pd.Series | ndarray], ref_edge_bias: pd.Series, ref_flat_logr: ndarray, sexes: dict[str, bool_ | bool], is_chr_x: pd.Series, is_chr_y: pd.Series, fix_gc: bool, fix_edge: bool, fix_rmask: bool, skip_low: bool, diploid_parx_genome: str | None) pd.Series[source]¶
Perform bias corrections on the sample.
- cnvlib.reference.shift_sex_chroms(cnarr: CNA, sexes: dict[str, bool_ | bool], ref_flat_logr: ndarray, is_chr_x: pd.Series, is_chr_y: pd.Series) None[source]¶
Shift sample X and Y chromosomes to match the reference sex.
Reference values:
XY: chrX -1, chrY -1 XX: chrX 0, chrY -1
Plan:
chrX: xx sample, xx ref: 0 (from 0) xx sample, xy ref: -= 1 (from -1) xy sample, xx ref: += 1 (from 0) +1 xy sample, xy ref: 0 (from -1) +1 chrY: xx sample, xx ref: = -1 (from -1) xx sample, xy ref: = -1 (from -1) xy sample, xx ref: 0 (from -1) +1 xy sample, xy ref: 0 (from -1) +1
- cnvlib.reference.summarize_info(all_logr: ndarray, all_depths: ndarray) dict[str, ndarray][source]¶
Average & spread of log2ratios and depths for a group of samples.
Can apply to all samples, or a given cluster of samples.
- cnvlib.reference.create_clusters(logr_matrix, min_cluster_size, sample_ids)[source]¶
Extract and summarize clusters of samples in logr_matrix.
Calculate correlation coefficients between all samples (columns).
Cluster the correlation matrix.
For each resulting sample cluster (down to a minimum size threshold), calculate the central log2 value for each bin, similar to the full pool. Also print the sample IDs in each cluster, if feasible.
Also recalculate and store the ‘spread’ of each cluster, though this might not be necessary/good.
Return a DataFrame of just the log2 values. Column names are
log2_iwhere i=1,2,… .
- cnvlib.reference.warn_bad_bins(cnarr: CopyNumArray, max_name_width: int = 50) None[source]¶
Warn about target bins where coverage is poor.
Prints a formatted table to stderr.
- cnvlib.reference.get_fasta_stats(cnarr: CNA, fa_fname: str) tuple[ndarray, ndarray][source]¶
Calculate GC and RepeatMasker content of each bin in the FASTA genome.
- cnvlib.reference.calculate_gc_lo(subseq: str) tuple[float, float][source]¶
Calculate the GC and lowercase (RepeatMasked) content of a string.
- cnvlib.reference.fasta_extract_regions(fa_fname: str, intervals: CNA) Iterator[str][source]¶
Extract an iterable of regions from an indexed FASTA file.
Input: FASTA file name; iterable of (seq_id, start, end) (1-based) Output: iterable of string sequences.
- cnvlib.reference.reference2regions(refarr: CopyNumArray) tuple[GenomicArray, GenomicArray][source]¶
Split reference into target and antitarget regions.
reports¶
Supports the sub-commands breaks and genemetrics.
Supporting functions for the text/tabular-reporting commands.
Namely: breaks, genemetrics.
- cnvlib.reports.do_breaks(probes: CopyNumArray, segments: CopyNumArray, min_probes: int = 1) pd.DataFrame[source]¶
List the targeted genes in which a copy number breakpoint occurs.
- Parameters:
probes (CopyNumArray) – Bin-level copy number data.
segments (CopyNumArray) – Segmented copy number data.
min_probes (int, optional) – Minimum number of probes required on each side of the breakpoint. Default is 1.
- Returns:
Table with columns: gene, chromosome, location, change, probes_left, probes_right.
- Return type:
pd.DataFrame
- cnvlib.reports.get_gene_intervals(all_probes: CopyNumArray, ignore: tuple[str, str, str] = ('-', '.', 'CGH')) collections.defaultdict[str, list[tuple[str, list[int], int]]][source]¶
Tally genomic locations of each targeted gene.
Return a dict of chromosomes to a list of tuples: (gene name, starts, end), where gene name is a string, starts is a sorted list of probe start positions, and end is the last probe’s end position as an integer. (The endpoints are redundant since probes are adjacent.)
- cnvlib.reports.get_breakpoints(intervals: collections.defaultdict[str, list[tuple[str, list[int], int]]], segments: CopyNumArray, min_probes: int) list[tuple[str, str, int, float64, int, int]][source]¶
Identify segment breaks within the targeted intervals.
- cnvlib.reports.do_genemetrics(cnarr: CopyNumArray, segments: CopyNumArray | None = None, threshold: float = 0.2, min_probes: int = 3, skip_low: bool = False, is_haploid_x_reference: bool = False, is_sample_female: None = None, diploid_parx_genome: str | None = None) pd.DataFrame[source]¶
Identify targeted genes with copy number gain or loss.
- Parameters:
cnarr (CopyNumArray) – Bin-level copy number data.
segments (CopyNumArray, optional) – Segmented copy number data. If provided, metrics are calculated per segment.
threshold (float, optional) – Minimum absolute log2 ratio to consider a gene altered. Default is 0.2.
min_probes (int, optional) – Minimum number of probes required to report a gene. Default is 3.
skip_low (bool, optional) – Skip bins with low coverage. Default is False.
is_haploid_x_reference (bool, optional) – Whether reference is male (haploid X). Default is False.
is_sample_female (bool, optional) – Whether sample is female. If None, inferred from data.
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’).
- Returns:
Table of genes with copy number alterations, including gene name, chromosome, log2 ratio, and probe counts.
- Return type:
pd.DataFrame
- cnvlib.reports.gene_metrics_by_gene(cnarr: CopyNumArray, threshold: float, skip_low: bool = False) Iterator[pd.Series][source]¶
Identify genes where average bin copy ratio value exceeds threshold.
NB: Adjust the sample’s sex-chromosome log2 values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.
- cnvlib.reports.gene_metrics_by_segment(cnarr: CopyNumArray, segments: CopyNumArray, threshold: float, skip_low: bool = False) Iterator[pd.Series][source]¶
Identify genes where segmented copy ratio exceeds threshold.
In the output table, show each segment’s weight and probes as segment_weight and segment_probes, alongside the gene-level weight and probes.
NB: Adjust the sample’s sex-chromosome log2 values beforehand with shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.
- cnvlib.reports.group_by_genes(cnarr: CopyNumArray, skip_low: bool) Iterator[pd.Series][source]¶
Group probe and coverage data by gene.
Return an iterable of genes, in chromosomal order, associated with their location and coverages:
[(gene, chrom, start, end, [coverages]), …]
scatter¶
The ‘scatter’ command for rendering copy number as scatter plots.
- cnvlib.scatter.do_scatter(cnarr: CopyNumArray, segments: CopyNumArray | None = None, variants: None = None, show_range: None = None, show_gene: None = None, do_trend: bool = False, by_bin: bool = False, window_width: float = 1000000.0, y_min: None = None, y_max: None = None, fig_size: None = None, antitarget_marker: None = None, segment_color: str = 'darkorange', title: None = None) Figure[source]¶
Plot probe log2 coverages and segmentation calls together.
- Parameters:
cnarr (CopyNumArray) – Bin-level copy number data to plot.
segments (CopyNumArray, optional) – Segmented copy number data to overlay.
variants (VariantArray, optional) – Variant allele frequency data to plot.
show_range (str, optional) – Genomic range to display (e.g., “chr1:1000000-2000000”).
show_gene (str, optional) – Gene name to zoom into.
do_trend (bool, optional) – Plot a smoothed trendline. Default is False.
by_bin (bool, optional) – Plot by bin index rather than genomic coordinates. Default is False.
window_width (float, optional) – Smoothing window width in bases (or bins if by_bin=True). Default is 1e6.
y_min (float, optional) – Minimum y-axis value.
y_max (float, optional) – Maximum y-axis value.
fig_size (tuple of float, optional) – Figure size as (width, height) in inches.
antitarget_marker (str, optional) – Marker style for antitarget bins.
segment_color (str, optional) – Color for segment lines. Default is SEG_COLOR.
title (str, optional) – Plot title.
- Returns:
The scatter plot figure.
- Return type:
- cnvlib.scatter.genome_scatter(cnarr: CopyNumArray, segments: CopyNumArray | None = None, variants: None = None, do_trend: bool = False, y_min: None = None, y_max: None = None, title: None = None, segment_color: str = 'darkorange') Figure[source]¶
Plot all chromosomes, concatenated on one plot.
- cnvlib.scatter.cnv_on_genome(axis: Axes, probes: CopyNumArray, segments: CopyNumArray, do_trend: bool = False, y_min: None = None, y_max: None = None, segment_color: str = 'darkorange') Axes[source]¶
Plot bin ratios and/or segments for all chromosomes on one plot.
- cnvlib.scatter.snv_on_genome(axis, variants, chrom_sizes, segments, do_trend, segment_color)[source]¶
Plot a scatter-plot of SNP chromosomal positions and shifts.
- cnvlib.scatter.chromosome_scatter(cnarr, segments, variants, show_range, show_gene, antitarget_marker, do_trend, by_bin, window_width, y_min, y_max, title, segment_color)[source]¶
Plot a specified region on one chromosome.
Possibilities:
Options | Shown ------------ | -------- -c | -g | Genes | Region ------- | -- | ----- | ------ - | + | given | auto: gene(s) + margin chr | - | none | whole chrom chr | + | given | whole chrom chr:s-e | - | all | given chr:s-e | + | given | given
- cnvlib.scatter.select_range_genes(cnarr, segments, variants, show_range, show_gene, window_width)[source]¶
Determine which datapoints to show based on the given options.
Behaviors:
start/end show_gene + + given region + genes; err if any gene outside it - + window +/- around genes + - given region, highlighting any genes within it - - whole chromosome, no genes
If show_range is a chromosome name only, no start/end positions, then the whole chromosome will be shown.
If region start/end coordinates are given and show_gene is ‘’ or ‘,’ (or all commas, etc.), then instead of highlighting all genes in the selection, no genes will be highlighted.
- cnvlib.scatter.cnv_on_chromosome(axis, probes, segments, genes, antitarget_marker=None, do_trend=False, x_limits=None, y_min=None, y_max=None, segment_color='darkorange')[source]¶
Draw a scatter plot of probe values with optional segments overlaid.
- Parameters:
genes (list) – Of tuples: (start, end, gene name)
- cnvlib.scatter.snv_on_chromosome(axis, variants, segments, genes, do_trend, by_bin, segment_color)[source]¶
- cnvlib.scatter.set_xlim_from(axis, probes=None, segments=None, variants=None) None[source]¶
Configure axes for plotting a single chromosome’s data.
- Parameters:
probes (CopyNumArray)
segments (CopyNumArray)
variants (VariantArray) – All should already be subsetted to the region that will be plotted.
- cnvlib.scatter.setup_chromosome(axis, y_min=None, y_max=None, y_label=None) None[source]¶
Configure axes for plotting a single chromosome’s data.
- cnvlib.scatter.choose_segment_color(segment, highlight_color, default_bright=True)[source]¶
Choose a display color based on a segment’s CNA status.
Uses the fields added by the ‘call’ command. If these aren’t present, use highlight_color for everything.
For sex chromosomes, some single-copy deletions or gains might not be highlighted, since sample sex isn’t used to infer the neutral ploidies.
segmentation¶
Segmentation of copy number values.
- cnvlib.segmentation.do_segmentation(cnarr: CNA, method: str, diploid_parx_genome: str | None = None, threshold: float | None = None, variants: VariantArray | None = None, skip_low: bool = False, skip_outliers: int = 10, min_weight: int = 0, save_dataframe: bool = False, rscript_path: str = 'Rscript', processes: int = 1, smooth_cbs: bool = False) CNA[source]¶
Infer copy number segments from the given coverage table.
- Parameters:
cnarr (CopyNumArray) – Bin-level copy number ratios (.cnr file).
method (str) – Segmentation algorithm: ‘cbs’, ‘flasso’, ‘haar’, ‘hmm’, ‘hmm-tumor’, or ‘hmm-germline’.
diploid_parx_genome (str, optional) – Reference genome name for pseudo-autosomal region handling (e.g., ‘hg19’, ‘hg38’, ‘mm10’).
threshold (float, optional) – Significance threshold (for CBS/flasso/haar) or smoothing window size (for HMM methods). If None, uses method-specific defaults.
variants (VariantArray, optional) – Variant allele frequencies to incorporate into HMM segmentation.
skip_low (bool, optional) – Skip bins with low coverage. Default is False.
skip_outliers (int, optional) – Skip bins with log2 ratios more than this many standard deviations from the chromosome arm mean. Default is 10.
min_weight (int, optional) – Minimum weight threshold for including bins. Default is 0.
save_dataframe (bool, optional) – Return the R dataframe as a string along with segments. Default is False.
rscript_path (str, optional) – Path to Rscript executable for CBS/flasso methods. Default is “Rscript”.
processes (int, optional) – Number of parallel processes to use. Default is 1.
smooth_cbs (bool, optional) – Apply smoothing to CBS segmentation results. Default is False.
- Returns:
Segmented copy number data (.cns format). If save_dataframe=True, returns (segments, R dataframe string).
- Return type:
- Raises:
ValueError – If method is not one of the supported segmentation methods.
- cnvlib.segmentation.drop_outliers(cnarr: CopyNumArray, width: int, factor: int) CopyNumArray[source]¶
Drop outlier bins with log2 ratios too far from the trend line.
Outliers are the log2 values factor times the 90th quantile of absolute deviations from the rolling average, within a window of given width. The 90th quantile is about 1.97 standard deviations if the log2 values are Gaussian, so this is similar to calling outliers factor * 1.97 standard deviations from the rolling mean. For a window size of 50, the breakdown point is 2.5 outliers within a window, which is plenty robust for our needs.
- cnvlib.segmentation.transfer_fields(segments: CopyNumArray, cnarr: CopyNumArray, ignore: tuple[str, str, str] = ('-', '.', 'CGH')) CopyNumArray[source]¶
Map gene names, weights, depths from cnarr bins to segarr segments.
Segment gene name is the comma-separated list of bin gene names. Segment weight is the sum of bin weights, and depth is the (weighted) mean of bin depths.
Also: Post-process segmentation output.
Ensure every chromosome has at least one segment.
Ensure first and last segment ends match 1st/last bin ends (but keep log2 as-is).
segmetrics¶
Robust metrics to evaluate performance of copy number estimates.
- cnvlib.segmetrics.do_segmetrics(cnarr: CopyNumArray, segarr: CopyNumArray, location_stats: tuple[()] | list[str] = (), spread_stats: tuple[()] | list[str] = (), interval_stats: tuple[()] | list[str] = (), alpha: float = 0.05, bootstraps: int = 100, smoothed: bool = False, skip_low: bool = False) CopyNumArray[source]¶
Compute segment-level metrics from bin-level log2 ratios.
- Parameters:
cnarr (CopyNumArray) – Bin-level copy number data.
segarr (CopyNumArray) – Segmented copy number data.
location_stats (list of str, optional) – Location statistics to compute: ‘mean’, ‘median’, ‘mode’, ‘p_ttest’. Default is empty tuple.
spread_stats (list of str, optional) – Spread statistics to compute: ‘stdev’, ‘mad’, ‘mse’, ‘iqr’, ‘bivar’, ‘sem’. Default is empty tuple.
interval_stats (list of str, optional) – Interval statistics to compute: ‘ci’ (confidence interval), ‘pi’ (prediction interval). Default is empty tuple.
alpha (float, optional) – Significance level for confidence/prediction intervals. Default is 0.05.
bootstraps (int, optional) – Number of bootstrap iterations for confidence intervals. Default is 100.
smoothed (bool, optional) – Use smoothed bootstrap for confidence intervals. Default is False.
skip_low (bool, optional) – Skip bins with low coverage. Default is False.
- Returns:
Segmented data with additional statistical columns.
- Return type:
- cnvlib.segmetrics.make_pi_func(alpha: float) Callable[source]¶
Prediction interval, estimated by percentiles.
- cnvlib.segmetrics.calc_intervals(bins_log2s: list[Series], weights: Series, func: Callable) tuple[ndarray, ndarray][source]¶
Compute a stat that yields intervals (low & high values)
- cnvlib.segmetrics.confidence_interval_bootstrap(values: ndarray, weights: ndarray, alpha: float, bootstraps: int = 100, smoothed: bool = False) ndarray[source]¶
Confidence interval for segment mean log2 value, estimated by bootstrap.
- cnvlib.segmetrics.segment_mean(cnarr: CopyNumArray, skip_low: bool = False) float64 | float[source]¶
Weighted average of bin log2 values.
target¶
Transform bait intervals into targets more suitable for CNVkit.
- cnvlib.target.do_target(bait_arr: GenomicArray, annotate: str | None = None, do_short_names: bool = False, do_split: bool = False, avg_size: int | float = 266.6666666666667) GenomicArray[source]¶
Transform bait intervals into targets more suitable for CNVkit.
- Parameters:
bait_arr (GenomicArray) – Bait intervals from a BED or interval file.
annotate (str, optional) – Path to annotation file (BED, GFF, refFlat, etc.) to assign gene names to target regions.
do_short_names (bool, optional) – Reduce multi-accession bait labels to be short and consistent. Default is False.
do_split (bool, optional) – Split large target intervals into smaller pieces. Default is False.
avg_size (float, optional) – Average target size when splitting large intervals. Default is 200/0.75 (~267 bp).
- Returns:
Processed target intervals ready for CNVkit analysis.
- Return type:
- cnvlib.target.shorten_labels(gene_labels: Series) Iterator[str][source]¶
Reduce multi-accession interval labels to the minimum consistent.
So: BED or interval_list files have a label for every region. We want this to be a short, unique string, like the gene name. But if an interval list is instead a series of accessions, including additional accessions for sub-regions of the gene, we can extract a single accession that covers the maximum number of consecutive regions that share this accession.
e.g.:
... mRNA|JX093079,ens|ENST00000342066,mRNA|JX093077,ref|SAMD11,mRNA|AF161376,mRNA|JX093104 ens|ENST00000483767,mRNA|AF161376,ccds|CCDS3.1,ref|NOC2L ...
becomes:
... mRNA|AF161376 mRNA|AF161376 ...
Helper modules¶
cmdutil¶
Functions reused within command-line implementations.
- cnvlib.cmdutil.read_cna(infile: str, sample_id: str | None = None, meta: None = None) CopyNumArray[source]¶
Read a CNVkit file (.cnn, .cnr, .cns) to create a CopyNumArray object.
- cnvlib.cmdutil.read_ga(infile: str, sample_id: None = None, meta: None = None) GenomicArray[source]¶
Read a CNVkit file (.cnn, .cnr, .cns) to create a GenomicArray (!) object.
- cnvlib.cmdutil.load_het_snps(vcf_fname: str, sample_id: str | None = None, normal_id: str | None = None, min_variant_depth: int = 20, zygosity_freq: None = None, tumor_boost: bool = False) VariantArray[source]¶
- cnvlib.cmdutil.verify_sample_sex(cnarr, sex_arg, is_haploid_x_reference, diploid_parx_genome)[source]¶
- cnvlib.cmdutil.write_tsv(outfname, rows, colnames=None) None[source]¶
Write rows, with optional column header, to tabular file.
core¶
CNV utilities.
- cnvlib.core.call_quiet(*args)[source]¶
Safely run a command and get stdout; print stderr if there’s an error.
Like subprocess.check_output, but silent in the normal case where the command logs unimportant stuff to stderr. If there is an error, then the full error message(s) is shown in the exception message.
- cnvlib.core.ensure_path(fname: str) bool[source]¶
Create dirs and move an existing file to avoid overwriting, if necessary.
If a file already exists at the given path, it is renamed with an integer suffix to clear the way.
- cnvlib.core.temp_write_text(text, mode='w+b')[source]¶
Save text to a temporary file.
NB: This won’t work on Windows b/c the file stays open.
- cnvlib.core.assert_equal(msg: str, **values) None[source]¶
Evaluate and compare two or more values for equality.
Sugar for a common assertion pattern. Saves re-evaluating (and retyping) the same values for comparison and error reporting.
Example:
>>> assert_equal("Mismatch", expected=1, saw=len(['xx', 'yy'])) ... ValueError: Mismatch: expected = 1, saw = 2
descriptives¶
Robust estimators of central tendency and scale.
- See:
https://en.wikipedia.org/wiki/Robust_measures_of_scale https://astropy.readthedocs.io/en/latest/_modules/astropy/stats/funcs.html
- cnvlib.descriptives.on_array(default: int | None = None) Callable[source]¶
Ensure a is a numpy array with no missing/NaN values.
- cnvlib.descriptives.on_weighted_array(default: None = None) Callable[source]¶
Ensure a and w are equal-length numpy arrays with no NaN values.
For weighted descriptives – a is the array of values, w is weights.
Drop any cells in a that are NaN from both a and w
Replace any remaining NaN cells in w with 0.
- cnvlib.descriptives.biweight_location(a: ndarray, initial: None = None, c: float = 6.0, epsilon: float = 0.001, max_iter: int = 5) float64[source]¶
Compute the biweight location for an array.
The biweight is a robust statistic for estimating the central location of a distribution.
- cnvlib.descriptives.modal_location(a: ndarray) float64[source]¶
Return the modal value of an array’s values.
The “mode” is the location of peak density among the values, estimated using a Gaussian kernel density estimator.
- Parameters:
a (np.array) – A 1-D array of floating-point values, e.g. bin log2 ratio values.
- cnvlib.descriptives.weighted_median(a: ndarray, weights: ndarray) float64[source]¶
Weighted median of a 1-D numeric array.
- cnvlib.descriptives.biweight_midvariance(a: ndarray, initial: int | float64 | None = None, c: float = 9.0, epsilon: float = 0.001) float64[source]¶
Compute the biweight midvariance for an array.
The biweight midvariance is a robust statistic for determining the midvariance (i.e. the standard deviation) of a distribution.
See:
- cnvlib.descriptives.gapper_scale(a)[source]¶
Scale estimator based on gaps between order statistics.
See:
Wainer & Thissen (1976)
Beers, Flynn, and Gebhardt (1990)
- cnvlib.descriptives.interquartile_range(a: ndarray) float64[source]¶
Compute the difference between the array’s first and third quartiles.
- cnvlib.descriptives.median_absolute_deviation(a: ndarray, scale_to_sd: bool = True) float64[source]¶
Compute the median absolute deviation (MAD) of array elements.
The MAD is defined as:
median(abs(a - median(a))).See: https://en.wikipedia.org/wiki/Median_absolute_deviation
- cnvlib.descriptives.weighted_mad(a, weights, scale_to_sd=True)[source]¶
Median absolute deviation (MAD) with weights.
- cnvlib.descriptives.weighted_std(a: ndarray, weights: ndarray) float64[source]¶
Standard deviation with weights.
- cnvlib.descriptives.mean_squared_error(a, initial=None)[source]¶
Mean squared error (MSE).
By default, assume the input array a is the residuals/deviations/error, so MSE is calculated from zero. Another reference point for calculating the error can be specified with initial.
- cnvlib.descriptives.q_n(a)[source]¶
Rousseeuw & Croux’s (1993) Q_n, an alternative to MAD.
Qn := Cn first quartile of (|x_i - x_j|: i < j)where Cn is a constant depending on n.
Finite-sample correction factors must be used to calibrate the scale of Qn for small-to-medium-sized samples.
n E[Qn] – —– 10 1.392 20 1.193 40 1.093 60 1.064 80 1.048 100 1.038 200 1.019
parallel¶
Utilities for multi-core parallel processing.
- class cnvlib.parallel.SerialPool[source]¶
Bases:
objectMimic the concurrent.futures.Executor interface, but run in serial.
- submit(func: Callable, *args) SerialFuture[source]¶
Just call the function on the arguments.
- class cnvlib.parallel.SerialFuture(result: Any = None, exception: Exception | None = None)[source]¶
Bases:
objectMimic the concurrent.futures.Future interface.
- cnvlib.parallel.pick_pool(nprocs: int) Iterator[SerialPool | ProcessPoolExecutor][source]¶
params¶
Defines several constants used in the pipeline.
Hard-coded parameters for CNVkit. These should not change between runs.
plots¶
Plotting utilities.
- cnvlib.plots.plot_chromosome_dividers(axis: Axes, chrom_sizes: collections.OrderedDict, pad: None = None, along: str = 'x') collections.OrderedDict[source]¶
Given chromosome sizes, plot divider lines and labels.
Draws black lines between each chromosome, with padding. Labels each chromosome range with the chromosome name, centered in the region, under a tick. Sets the axis limits to the covered range.
By default, the dividers are vertical and the labels are on the X axis of the plot. If the along parameter is ‘y’, this is transposed to horizontal dividers and the labels on the Y axis.
- Returns:
A table of the position offsets of each chromosome along the specified axis.
- Return type:
OrderedDict
- cnvlib.plots.chromosome_sizes(probes: CopyNumArray, to_mb: bool = False) collections.OrderedDict[source]¶
Create an ordered mapping of chromosome names to sizes.
- cnvlib.plots.translate_region_to_bins(region, bins)[source]¶
Map genomic coordinates to bin indices.
Return a tuple of (chrom, start, end), just like unpack_range.
- cnvlib.plots.update_binwise_positions(cnarr, segments=None, variants=None)[source]¶
Convert start/end positions from genomic to bin-wise coordinates.
Instead of chromosomal basepairs, the positions indicate enumerated bins.
Revise the start and end values for all GenomicArray instances at once, where the cnarr bins are mapped to corresponding segments, and variants are grouped into cnarr bins as well – if multiple variants rows fall within a single bin, equally-spaced fractional positions are used.
Returns copies of the 3 input objects with revised start and end arrays.
- cnvlib.plots.get_repeat_slices(values)[source]¶
Find the location and size of each repeat in values.
- cnvlib.plots.cvg2rgb(cvg: float, desaturate: bool) tuple[float, float, float][source]¶
Choose a shade of red or blue representing log2-coverage value.
rna¶
RNA expression levels quantified per gene.
Process per-gene expression levels, or the equivalent, by cohort.
- cnvlib.rna.before(char)[source]¶
Create a function that extracts the substring before a delimiter character.
This is a higher-order function (closure) that creates and returns a new function. The returned function takes a string and returns the portion of the string before the first occurrence of the specified delimiter character.
- Parameters:
char (str) – The delimiter character to split on.
- Returns:
A function that takes a string and returns the substring before the first occurrence of char. If char is not in the string, returns the entire string.
- Return type:
function
Examples
>>> get_base = before(".") >>> get_base("ENSG00000001.5") 'ENSG00000001' >>> get_base("simple") 'simple'
- cnvlib.rna.filter_probes(sample_counts)[source]¶
Filter probes to only include high-quality, transcribed genes.
The human genome has ~25,000 protein coding genes, yet the RSEM output includes ~58,000 rows. Some of these rows correspond to read counts over control probes (e.g. spike-in sequences). Some rows correspond to poorly mapped genes in contigs that have not been linked to the 24 chromosomes (e.g. HLA region). Others correspond to pseudo-genes and non-coding genes. For the purposes of copy number inference, these rows are best removed.
- cnvlib.rna.load_gene_info(gene_resource, corr_fname, default_r=0.1)[source]¶
Read gene info from BioMart, and optionally TCGA, into a dataframe.
RSEM only outputs the Ensembl ID. We have used the BioMART tool in Ensembl to export a list of all Ensembl genes with a RefSeq mRNA (meaning it is high quality, curated, and bona fide gene) and resides on chromosomes 1-22, X, or Y. The tool also outputs the GC content of the gene, chromosomal coordinates of the gene, and HUGO gene symbol.
The gene resource input can be obtained from a resource bundle we provide (for reference genome hg19) or generated from BioMart.
- cnvlib.rna.load_cnv_expression_corr(fname)[source]¶
Load CNV-expression correlation coefficients from a file.
Reads a tab-delimited file containing correlation coefficients between copy number variation and gene expression levels. This file is typically generated by the cnv_expression_correlate.py script from matched CNV and expression data, such as from TCGA cohorts.
- Parameters:
fname (str) – Path to the correlation coefficients file. The file must be tab-delimited with ‘Entrez_Gene_Id’ as a column that will be used as the index.
- Returns:
DataFrame indexed by Entrez Gene ID, containing correlation coefficient columns (e.g., ‘pearson_r’, ‘spearman_r’, ‘kendall_t’).
- Return type:
See also
load_gene_infoLoads and merges gene info with correlation data
- cnvlib.rna.tsl2int(tsl)[source]¶
Convert an Ensembl Transcript Support Level (TSL) code to an integer.
The code has the format “tsl([1-5]|NA)”.
- cnvlib.rna.dedupe_ens_hugo(dframe)[source]¶
Emit the “best” index from a group of the same Ensembl ID.
The RSEM gene rows are the data of interest, and they’re associated with Ensembl IDs to indicate the transcribed gene being measured in each row. The BioMart table of gene info can have duplicate rows for Ensembl ID, which would cause duplicate rows in RSEM import if joined naively. So, we need to select a single row for each group of “gene info” rows with the same Ensembl ID (column ‘gene_id’).
The keys we can use for this are:
Entrez ID (‘entrez_id’)
Ensembl gene name (‘gene’)
Entrez gene name (‘hugo_gene’)
Entrez vs. Ensembl IDs and gene names are potentially many-to-many, e.g. CALM1/2/3. However, if we also require that the Ensembl and HUGO gene names match within a group, that (usually? always?) results in a unique row remaining.
(Example: CALM1/2/3 IDs are many-to-many, but of the 3 Entrez IDs associated with Ensembl’s CALM1, only 1 is called CALM1 in the Entrez/corr. table.)
Failing that (no matches or multiple matches), prefer a lower Entrez ID, because well-characterized, protein-coding genes tend to have been discovered and accessioned first.
- cnvlib.rna.dedupe_ens_no_hugo(dframe)[source]¶
Deduplicate Ensembl ID using Entrez ID but not HUGO gene name.
- cnvlib.rna.dedupe_tx(dframe)[source]¶
Deduplicate table rows to select one transcript length per gene.
Choose the lowest-number Entrez ID and the transcript with the greatest support (primarily) and length (secondarily).
This is done at the end of Ensembl ID deduplication, after filtering on gene names and for single-row tables.
Returns an integer row index corresponding to the original table.
- cnvlib.rna.locate_entrez_dupes(dframe)[source]¶
In case the same Entrez ID was assigned to multiple Ensembl IDs.
Use HUGO vs. HGNC name again, similar to dedupe_hugo, but instead of emiting the indices of the rows to keep, emit the indices of the extra rows – their correlation values will then be filled in with a default value (np.nan or 0.1).
It will then be as if those genes hadn’t appeared in the TCGA tables at all, i.e. CNV-expression correlation is unknown, but all entries are still retained in the BioMart table (gene_info).
- cnvlib.rna.align_gene_info_to_samples(gene_info, sample_counts, tx_lengths, normal_ids)[source]¶
Align columns and sort.
Also calculate weights and add to gene_info as ‘weight’, along with transcript lengths as ‘tx_length’.
- cnvlib.rna.normalize_read_depths(sample_depths, normal_ids)[source]¶
Normalize read depths within each sample.
Some samples have higher sequencing depth and therefore read depths need to be normalized within each sample. TCGA recommends an upper quartile normalization.
After normalizing read depths within each sample, normalize (median-center) within each gene, across samples.
Finally, convert to log2 ratios.
- cnvlib.rna.safe_log2(values, min_log2)[source]¶
Transform values to log2 scale, safely handling zeros.
- Parameters:
values (np.array) – Absolute-scale values to transform. Should be non-negative.
min_log2 (float) – Assign input zeros this log2-scaled value instead of -inf. Rather than hard-clipping, input values near 0 (especially below 2^min_log2) will be squeezed a bit above min_log2 in the log2-scale output.
- cnvlib.rna.attach_gene_info_to_cnr(sample_counts, sample_data_log2, gene_info, read_len=100)[source]¶
Join gene info to each sample’s log2 expression ratios.
Add the Ensembl gene info to the aggregated gene expected read counts, dropping genes that are not in the Ensembl table I.e., filter probes down to those genes that have names/IDs in the gene resource table.
Split out samples to individual .cnr files, keeping (most) gene info.
samutil¶
BAM utilities.
- cnvlib.samutil.idxstats(bam_fname: str, drop_unmapped: bool = False, fasta: str | None = None) DataFrame[source]¶
Get chromosome names, lengths, and number of mapped/unmapped reads.
Use the BAM index (.bai) to get the number of reads and size of each chromosome. Contigs with no mapped reads are skipped.
- cnvlib.samutil.bam_total_reads(bam_fname: str, fasta: str | None = None) int64[source]¶
Count the total number of mapped reads in a BAM file.
Uses the BAM index to do this quickly.
- cnvlib.samutil.ensure_bam_index(bam_fname: str) str[source]¶
Ensure a BAM file is indexed, to enable fast traversal & lookup.
For MySample.bam, samtools will look for an index in these files, in order:
MySample.bam.bai
MySample.bai
- cnvlib.samutil.ensure_bam_sorted(bam_fname: str, by_name: bool = False, span: int = 50, fasta: str | None = None) bool[source]¶
Test if the reads in a BAM file are sorted as expected.
by_name=True: reads are expected to be sorted by query name. Consecutive read IDs are in alphabetical order, and read pairs appear together.
by_name=False: reads are sorted by position. Consecutive reads have increasing position.
- cnvlib.samutil.is_newer_than(target_fname: str, orig_fname: str) bool[source]¶
Compare file modification times.
segfilters¶
Filter copy number segments.
- cnvlib.segfilters.require_column(*colnames) Callable[source]¶
Wrapper to coordinate the segment-filtering functions.
Verify that the given columns are in the CopyNumArray the wrapped function takes. Also log the number of rows in the array before and after filtration.
- cnvlib.segfilters.squash_by_groups(cnarr: CopyNumArray, levels: Series, by_arm: bool = False) CopyNumArray[source]¶
Reduce CopyNumArray rows to a single row within each given level.
- cnvlib.segfilters.enumerate_changes(levels: Series) Series[source]¶
Assign a unique integer to each run of identical values.
Repeated but non-consecutive values will be assigned different integers.
- cnvlib.segfilters.squash_region(cnarr: DataFrame) DataFrame[source]¶
Reduce a CopyNumArray to 1 row, keeping fields sensible.
Most fields added by the segmetrics command will be dropped.
- cnvlib.segfilters.ampdel(segarr: CopyNumArray) CopyNumArray[source]¶
Merge segments by amplified/deleted/neutral copy number status.
Follow the clinical reporting convention:
5+ copies (2.5-fold gain) is amplification
0 copies is homozygous/deep deletion
CNAs of lesser degree are not reported
This is recommended only for selecting segments corresponding to actionable, usually focal, CNAs. Any real and potentially informative but lower-level CNAs will be dropped.
- cnvlib.segfilters.bic(segarr)[source]¶
Merge segments by Bayesian Information Criterion.
See: BIC-seq (Xi 2011), doi:10.1073/pnas.1110574108
- cnvlib.segfilters.ci(segarr: CopyNumArray) CopyNumArray[source]¶
Merge segments by confidence interval (overlapping 0).
Segments with lower CI above 0 are kept as gains, upper CI below 0 as losses, and the rest with CI overlapping zero are collapsed as neutral.
- cnvlib.segfilters.cn(segarr: CopyNumArray) CopyNumArray[source]¶
Merge segments by integer copy number.
- cnvlib.segfilters.sem(segarr: CopyNumArray, zscore: float = 1.96) CopyNumArray[source]¶
Merge segments by Standard Error of the Mean (SEM).
Use each segment’s SEM value to estimate a 95% confidence interval (via zscore). Segments with lower CI above 0 are kept as gains, upper CI below 0 as losses, and the rest with CI overlapping zero are collapsed as neutral.
smoothing¶
Signal-smoothing functions.
- cnvlib.smoothing.check_inputs(x: pd.Series | ndarray, width: int | float, as_series: bool = True, weights: ndarray | None = None) tuple[ndarray, int, pd.Series] | tuple[ndarray, int, ndarray, ndarray] | tuple[ndarray, int, ndarray][source]¶
Transform width into a half-window size.
width is either a fraction of the length of x or an integer size of the whole window. The output half-window size is truncated to the length of x if needed.
- cnvlib.smoothing.rolling_median(x: pd.Series, width: float) ndarray[source]¶
Rolling median with mirrored edges.
- cnvlib.smoothing.rolling_quantile(x: pd.Series, width: int, quantile: float) ndarray[source]¶
Rolling quantile (0–1) with mirrored edges.
- cnvlib.smoothing.convolve_weighted(window: ndarray, signal: ndarray, weights: ndarray, n_iter: int = 1) tuple[ndarray, ndarray][source]¶
Convolve a weighted window over a weighted signal array.
- cnvlib.smoothing.convolve_unweighted(window, signal, wing, n_iter=1)[source]¶
Convolve a weighted window over array signal.
Input array is assumed padded by _pad_array; output has padding removed.
- cnvlib.smoothing.guess_window_size(x: Series, weights: Series | None = None) int[source]¶
Choose a reasonable window size given the signal.
Inspired by Silverman’s rule: bandwidth is proportional to signal’s standard deviation and the length of the signal ^ 4/5.
- cnvlib.smoothing.kaiser(x, width=None, weights=None, do_fit_edges=False)[source]¶
Smooth the values in x with the Kaiser windowed filter.
See: https://en.wikipedia.org/wiki/Kaiser_window
- Parameters:
x (array-like) – 1-dimensional numeric data set.
width (float) – Fraction of x’s total length to include in the rolling window (i.e. the proportional window width), or the integer size of the window.
- cnvlib.smoothing.savgol(x: pd.Series | ndarray, total_width: int | None = None, weights: ndarray | None = None, window_width: int = 7, order: int = 3, n_iter: int = 1) ndarray[source]¶
Savitzky-Golay smoothing.
Fitted polynomial order is typically much less than half the window width.
total_width overrides n_iter.
- cnvlib.smoothing.outlier_iqr(a, c=3.0)[source]¶
Detect outliers as a multiple of the IQR from the median.
By convention, “outliers” are points more than 1.5 * IQR from the median, and “extremes” or extreme outliers are those more than 3.0 * IQR.
- cnvlib.smoothing.outlier_mad_median(a)[source]¶
MAD-Median rule for detecting outliers.
X_i is an outlier if:
| X_i - M | _____________ > K ~= 2.24 MAD / 0.6745
where $K = sqrt( X^2_{0.975,1} )$, the square root of the 0.975 quantile of a chi-squared distribution with 1 degree of freedom.
This is a very robust rule with the highest possible breakdown point of 0.5.
- Returns:
A boolean array of the same size as a, where outlier indices are True.
- Return type:
np.array
References
Davies & Gather (1993) The Identification of Multiple Outliers.
Rand R. Wilcox (2012) Introduction to robust estimation and hypothesis testing. Ch.3: Estimating measures of location and scale.
- cnvlib.smoothing.rolling_outlier_iqr(x, width, c=3.0)[source]¶
Detect outliers as a multiple of the IQR from the median.
By convention, “outliers” are points more than 1.5 * IQR from the median (~2 SD if values are normally distributed), and “extremes” or extreme outliers are those more than 3.0 * IQR (~4 SD).
- cnvlib.smoothing.rolling_outlier_quantile(x: pd.Series, width: int, q: float, m: int) pd.Series | ndarray[source]¶
Detect outliers by multiples of a quantile in a window.
Outliers are the array elements outside m times the q’th quantile of deviations from the smoothed trend line, as calculated from the trend line residuals. (For example, take the magnitude of the 95th quantile times 5, and mark any elements greater than that value as outliers.)
This is the smoothing method used in BIC-seq (doi:10.1073/pnas.1110574108) with the parameters width=200, q=.95, m=5 for WGS.
- Returns:
A boolean array of the same size as x, where outlier indices are True.
- Return type:
np.array
- cnvlib.smoothing.rolling_outlier_std(x, width, stdevs)[source]¶
Detect outliers by stdev within a rolling window.
Outliers are the array elements outside stdevs standard deviations from the smoothed trend line, as calculated from the trend line residuals.
- Returns:
A boolean array of the same size as x, where outlier indices are True.
- Return type:
np.array