Python API (cnvlib package)

Module cnvlib contents

cnvlib.read(fname)[source]

Parse a file as a copy number or copy ratio table (.cnn, .cnr).

Submodules

cnarray

The core object used throughout CNVkit. For your own scripting, you can usually accomplish what you need using CopyNumArray methods.

Definitions for the core data structure, a copy number array.

class cnvlib.cnarray.CopyNumArray(sample_id, extra_column_names=())[source]

Bases: object

An array of genomic intervals, treated like aCGH probes.

add_columns(**columns)[source]

Create a new CNA, adding the specified extra columns to this CNA.

by_bin(bins)[source]

Group rows by another CopyNumArray; trim row start/end to bin edges.

Returns an iterable of (bin, CopyNumArray of overlapping cnarray rows))

If a probe overlaps with a bin boundary, the probe start or end position is replaced with the bin boundary position. Probes outside any segments are skipped. This is appropriate for most other comparisons between CopyNumArray objects.

by_chromosome()[source]

Iterate over probes grouped by chromosome name.

by_gene(ignore=('-', 'CGH', '.'))[source]

Iterate over probes grouped by gene name.

Emits pairs of (gene name, CNA of rows with same name)

Groups each series of intergenic bins as a ‘Background’ gene; any ‘Background’ bins within a gene are grouped with that gene. Bins with names in ignore are treated as ‘Background’ bins, but retain their name.

by_segment(segments)[source]

Group cnarray rows by the segments that row midpoints land in.

Returns an iterable of segments and rows grouped by overlap with each segment.

Note that segments don’t necessarily cover all probes (some near telo/centromeres may have been dropped as outliers during segmentation). These probes are grouped with the nearest segment, so the endpoint of the first/last probe may not match the corresponding segment endpoint. This is appropriate if the segments were obtained from this probe array.

center_all(peak=False)[source]

Recenter coverage values to the autosomes’ average (in-place).

chromosome
copy()[source]

Create an independent copy of this object.

coverage
drop_extra_columns()[source]

Remove any optional columns from this CopyNumArray.

Returns a new copy with only the core columns retained:
log2 value, chromosome, start, end, bin name.
end
classmethod from_array(sample_id, array)[source]
classmethod from_columns(sample_id, **columns)[source]
classmethod from_rows(sample_id, row_data, extra_keys=())[source]
gene
in_range(chrom, start=0, end=None, trim=False)[source]

Get the CopyNumArray portion within the given genomic range.

If trim=True, include bins straddling the range boundaries, and trim the bins endpoints to the boundaries.

labels()[source]
merge(other)[source]

Combine this array’s data with another CopyNumArray (in-place).

Any optional columns must match between both arrays.

classmethod read(infile, sample_id=None)[source]

Parse a tabular table of coverage data from a handle or filename.

select(selector=None, **kwargs)[source]

Take a subset of rows where the given condition is true.

Arguments can be a function (lambda expression) returning a bool, which will be used to select True rows, and/or keyword arguments like gene=”Background” or chromosome=”chr7”, which will select rows where the keyed field equals the specified value.

shuffle()[source]

Randomize the order of bins in this array (in-place).

sort(key=None)[source]

Sort the bins in this array (in-place).

Optional argument ‘key’ is one of:

  • a function that computes a sorting key from a CopyNumArray row
  • a string identifier for an existing data column
  • a list/array/iterable of precomputed keys equal in length to the number of rows in this CopyNumArray.

By default, bins are sorted by chromosomal coordinates.

squash_genes(ignore=('-', 'CGH', '.'), squash_background=False, summary_stat=<function biweight_location>)[source]

Combine consecutive bins with the same targeted gene name.

The ignore parameter lists bin names that not be counted as genes to be output.

Parameter summary_stat is a function that summarizes an array of coverage values to produce the “squashed” gene’s coverage value. By default this is the biweight location, but you might want median, mean, max, min or something else in some cases.

Optional columns, if present, are dropped.

start
to_array(array)[source]
to_rows(rows)[source]

Like from_rows, reusing this instance’s metadata.

write(outfile=<open file '<stdout>', mode 'w'>)[source]

Write coverage data to a file or handle in tabular format.

This is similar to BED or BedGraph format, but with extra columns.

To combine multiple samples in one file and/or convert to another format, see the ‘export’ subcommand.

cnvlib.cnarray.row2label(row)[source]

commands

The public API for each of the commands defined in the CNVkit workflow.

Command-line interface and corresponding API for CNVkit.

class cnvlib.commands.SerialPool[source]

Bases: object

Mimic the multiprocessing.Pool interface, but run in serial.

apply_async(func, args)[source]

Just call the function.

close()[source]
join()[source]
cnvlib.commands.batch_make_reference(normal_bams, target_bed, antitarget_bed, male_reference, fasta, annotate, short_names, split, target_avg_size, access, antitarget_avg_size, antitarget_min_size, output_reference, output_dir, processes, by_count)[source]

Build the CN reference from normal samples, targets and antitargets.

cnvlib.commands.batch_run_sample(bam_fname, target_bed, antitarget_bed, ref_fname, output_dir, male_reference=False, scatter=False, diagram=False, rlibpath=None, by_count=False)[source]

Run the pipeline on one BAM file.

cnvlib.commands.batch_write_coverage(bed_fname, bam_fname, out_fname, by_count)[source]

Run coverage on one sample, write to file.

cnvlib.commands.create_heatmap(filenames, show_chromosome=None, do_desaturate=False)[source]

Plot copy number for multiple samples as a heatmap.

cnvlib.commands.create_loh(variants, min_depth=20, do_trend=False, sample_id=None)[source]

Plot allelic frequencies at each variant position in a VCF file.

cnvlib.commands.do_antitarget(target_bed, access_bed=None, avg_bin_size=150000, min_bin_size=None)[source]

Derive a background/antitarget BED file from a target BED file.

cnvlib.commands.do_breaks(probes, segments, min_probes=1)[source]

List the targeted genes in which a copy number breakpoint occurs.

cnvlib.commands.do_coverage(bed_fname, bam_fname, by_count=False)[source]

Calculate coverage in the given regions from BAM read depths.

cnvlib.commands.do_fix(target_raw, antitarget_raw, reference, do_gc=True, do_edge=True, do_rmask=True)[source]

Combine target and antitarget coverages and correct for biases.

cnvlib.commands.do_gainloss(probes, segments=None, male_reference=False, threshold=0.5, min_probes=3)[source]

Identify targeted genes with copy number gain or loss.

cnvlib.commands.do_reference(target_fnames, antitarget_fnames, fa_fname=None, male_reference=False)[source]

Compile a coverage reference from the given files (normal samples).

cnvlib.commands.do_reference_flat(target_list, antitarget_list, fa_fname=None, male_reference=False)[source]

Compile a neutral-coverage reference from the given intervals.

Combines the intervals, shifts chrX values if requested, and calculates GC and RepeatMasker content from the genome FASTA sequence.

cnvlib.commands.do_scatter(pset_cvg, pset_seg=None, vcf_fname=None, show_chromosome=None, show_gene=None, show_range=None, background_marker=None, do_trend=False, window_width=1000000.0, sample_id=None, min_variant_depth=20, y_min=None, y_max=None)[source]

Plot probe log2 coverages and CBS calls together.

cnvlib.commands.do_targets(bed_fname, out_fname, annotate=None, do_short_names=False, do_split=False, avg_size=266.6666666666667)[source]

Transform bait intervals into targets more suitable for CNVkit.

cnvlib.commands.parse_args(args=None)[source]

Parse the command line.

cnvlib.commands.pick_pool(nprocs)[source]
cnvlib.commands.print_version(_args)[source]

Display this program’s version.

antitarget

Supporting functions for the ‘antitarget’ command.

cnvlib.antitarget.find_background_regions(access_chroms, target_chroms, pad_size)[source]

Take coordinates of accessible regions and targets; emit antitargets.

Note that a chromosome must be present in the target library in order to be included in the antitargets generated here. So, if chrY is missing from your output files, it’s probably because it had no targets.

cnvlib.antitarget.get_background(target_bed, access_bed, avg_bin_size, min_bin_size)[source]

Generate background intervals from target intervals.

Procedure:

  • Invert target intervals

  • Subtract the inverted targets from accessible regions

  • For each of the resulting regions:

    • Shrink by a fixed margin on each end
    • If it’s smaller than min_bin_size, skip
    • Divide into equal-size (region_size/avg_bin_size) portions
    • Emit the (chrom, start, end) coords of each portion
cnvlib.antitarget.group_coords(coordinates)[source]

Group chromosomal coordinates into a dictionary.

cnvlib.antitarget.guess_chromosome_regions(target_chroms, telomere_size)[source]

Determine (minimum) chromosome lengths from target coordinates.

core

CNV utilities.

cnvlib.core.assert_equal(msg, **values)[source]

Evaluate and compare two or more values for equality.

Sugar for a common assertion pattern. Saves re-evaluating (and retyping) the same values for comparison and error reporting.

Example:

>>> assert_equal("Mismatch", expected=1, saw=len(['xx', 'yy']))
...
ValueError: Mismatch: expected = 1, saw = 2
cnvlib.core.check_unique(items, title)[source]

Ensure all items in an iterable are identical; return that one item.

cnvlib.core.expect_flat_cvg(cnarr, is_male_reference=None, chr_x=None)[source]

Get the uninformed expected copy ratios of each bin.

Create an array of log2 coverages like a “flat” reference.

This is a neutral copy ratio at each autosome (log2 = 0.0) and sex chromosomes based on whether the reference is male (XX or XY).

cnvlib.core.fbase(fname)[source]

Strip directory and all extensions from a filename.

cnvlib.core.get_relative_chrx_cvg(probes, chr_x=None)[source]

Get the relative log-coverage of chrX in a sample.

cnvlib.core.guess_chr_x(probes)[source]
cnvlib.core.guess_xx(probes, male_reference=False, chr_x=None, verbose=True)[source]

Guess whether a sample is female from chrX relative coverages.

Recommended cutoff values:
-0.5 – raw target data, not yet corrected +0.5 – probe data already corrected on a male profile
cnvlib.core.parse_tsv(infile, keep_header=False)[source]

Parse a tabular data table into an iterable of lists.

Rows are split on tabs. Header row is optionally included in the output.

cnvlib.core.rbase(fname)[source]

Strip directory and final extension from a filename.

cnvlib.core.shift_xx(probes, male_reference=False, chr_x=None)[source]

Adjust chrX coverages (divide in half) for apparent female samples.

cnvlib.core.sorter_chrom(label)[source]

Create a sorting key from chromosome label.

Sort by integers first, then letters or strings. The prefix “chr” (case-insensitive), if present, is stripped automatically for sorting.

E.g. chr1 < chr2 < chr10 < chrX < chrY

cnvlib.core.sorter_chrom_at(index)[source]

Create a sort key function that gets chromosome label at a list index.

cnvlib.core.write_tsv(outfname, table, colnames=None)[source]

Write the CGH file.

coverage

Supporting functions for the ‘antitarget’ command.

cnvlib.coverage.bam_total_reads(bam_fname)[source]

Count the total number of mapped reads in a BAM file.

Uses the BAM index to do this quickly.

cnvlib.coverage.bedcov(bed_fname, bam_fname)[source]

Calculate depth of all regions in a BED file via samtools (pysam) bedcov.

i.e. mean pileup depth across each region.

cnvlib.coverage.filter_column(col)[source]

Count the number of filtered reads in a pileup column.

cnvlib.coverage.filter_read(read)[source]

True if the given read should be counted towards coverage.

cnvlib.coverage.interval_coverages(bed_fname, bam_fname, by_count)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.interval_coverages_count(bed_fname, bam_fname)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.interval_coverages_pileup(bed_fname, bam_fname)[source]

Calculate log2 coverages in the BAM file at each interval.

cnvlib.coverage.region_depth_count(bamfile, chrom, start, end)[source]

Calculate depth of a region via pysam count.

i.e. counting the number of read starts in a region, then scaling for read length and region width to estimate depth.

Coordinates are 0-based, per pysam.

diagram

Chromosome diagram drawing functions.

This uses and abuses Biopython’s BasicChromosome module. It depends on ReportLab, too, so we isolate this functionality here so that the rest of CNVkit will run without it. (And also to keep the codebase tidy.)

cnvlib.diagram.bc_chromosome_draw_label(self, cur_drawing, label_name)[source]

Monkeypatch to Bio.Graphics.BasicChromosome.Chromosome._draw_label.

Draw a label for the chromosome. Mod: above the chromosome, not below.

cnvlib.diagram.bc_organism_draw(org, title, wrap=12)[source]

Modified copy of Bio.Graphics.BasicChromosome.Organism.draw.

Instead of stacking chromosomes horizontally (along the x-axis), stack rows vertically, then proceed with the chromosomes within each row.

Arguments:

  • title: The output title of the produced document.
cnvlib.diagram.build_chrom_diagram(features, chr_sizes, sample_id)[source]

Create a PDF of color-coded features on chromosomes.

cnvlib.diagram.create_diagram(cnarr, segarr, threshold, min_probes, outfname, male_reference)[source]

Create the diagram.

export

Export CNVkit objects and files to other formats.

class cnvlib.export.ProbeInfo(label, chrom, start, end, gene)

Bases: tuple

chrom

Alias for field number 1

end

Alias for field number 3

gene

Alias for field number 4

label

Alias for field number 0

start

Alias for field number 2

cnvlib.export.calculate_theta_fields(seg, ref_rows, chrom_id)[source]

Convert a segment’s info to a row of THetA input.

For the normal/reference bin count, take the mean of the bin values within each segment so that segments match between tumor and normal.

cnvlib.export.cna_absolutes(cnarr, ploidy, purity, is_reference_male, is_sample_female)[source]

Calculate absolute copy number values from segment or bin log2 ratios.

cnvlib.export.create_chrom_ids(segments)[source]

Map chromosome names to integers in the order encountered.

cnvlib.export.export_freebayes(sample_fnames, args)[source]

Export to FreeBayes –cnv-map format.

Which is BED-like, for each region in each sample which does not have neutral copy number (equal to 2 or the value set by –ploidy), with columns:

  • reference sequence
  • start (0-indexed)
  • end
  • sample name
  • copy number
cnvlib.export.export_nexus_basic(sample_fname)[source]

Biodiscovery Nexus Copy Number “basic” format.

Only represents one sample per file.

cnvlib.export.export_seg(sample_fnames)[source]

SEG format for copy number segments.

Segment breakpoints are not the same across samples, so samples are listed in serial with the sample ID as the left column.

cnvlib.export.export_theta(tumor, reference)[source]

Convert tumor segments and normal .cnr or reference .cnn to THetA input.

Follows the THetA segmentation import script but avoid repeating the pileups, since we already have the mean depth of coverage in each target bin.

The options for average depth of coverage and read length do not matter crucially for proper operation of THetA; increased read counts per bin simply increase the confidence of THetA’s results.

THetA2 input format is tabular, with columns:
ID, chrm, start, end, tumorCount, normalCount

where chromosome IDs (“chrm”) are integers 1 through 24.

cnvlib.export.fmt_cdt(sample_ids, rows)[source]

Format as CDT.

cnvlib.export.fmt_gct(sample_ids, rows)[source]
cnvlib.export.fmt_jtv(sample_ids, rows)[source]

Format for Java TreeView.

cnvlib.export.fmt_multi(sample_ids, rows)[source]
cnvlib.export.fmt_vcf(sample_ids, rows)[source]
cnvlib.export.merge_rows(rows)[source]

Combine equivalent rows of coverage data across multiple samples.

Check that probe info matches across all samples, then merge the log2 coverage values.

Input: a list of individual rows corresponding to the same probes from different coverage files. Output: a list starting with the single common Probe object, followed by the log2 coverage values from each sample, in order.

cnvlib.export.merge_samples(filenames)[source]

Merge probe values from multiple samples into a 2D table (of sorts).

Input:
dict of {sample ID: (probes, values)}
Output:
list-of-tuples: (probe, log2 coverages...)
cnvlib.export.rescale_copy_ratios(cnarr, purity=None, ploidy=2, round_to_integer=False, is_sample_female=None, is_reference_male=True)[source]

Rescale segment copy ratio values given a known tumor purity.

cnvlib.export.row_to_probe_coverage(row)[source]

Repack a parsed row into a ProbeInfo instance and coverage value.

cnvlib.export.segments2freebayes(segments, sample_name, ploidy, purity, is_reference_male, is_sample_female)[source]

Convert a copy number array to a BED-like format.

fix

Supporting functions for the ‘fix’ command.

cnvlib.fix.apply_weights(cnarr, ref_matched, epsilon=0.0001)[source]

Calculate weights for each bin.

Weights are derived from:

  • bin sizes
  • average bin coverage depths in the reference
  • the “spread” column of the reference.
cnvlib.fix.center_by_window(pset, fraction, sort_key)[source]

Smooth out biases according to the trait specified by sort_key.

E.g. correct GC-biased probes by windowed averaging across similar-GC probes; or for similar interval sizes.

cnvlib.fix.edge_gain(target_size, insert_size, gap_size)[source]

Calculate coverage gain from a neighboring bait’s flanking reads.

Letting i = insert size, t = target size, g = gap to neighboring bait, the gain of coverage due to a nearby bait, if g < i, is:

(i-g)^2 / 4it

If the neighbor flank extends beyond the target (t+g < i), reduce by:

(i-t-g)^2 / 4it
cnvlib.fix.edge_loss(target_size, insert_size)[source]

Calculate coverage loss at the edges of a baited region.

Letting i = insert size and t = target size, the proportional loss of coverage near the two edges of the baited region (combined) is:

i/2t

If the “shoulders” extend outside the bait $(t < i), reduce by:

(i-t)^2 / 4it

on each side, or (i-t)^2 / 2it total.

cnvlib.fix.load_adjust_coverages(pset, ref_pset, fix_gc, fix_edge, fix_rmask)[source]

Load and filter probe coverages; correct using reference and GC.

cnvlib.fix.make_edge_sorter(target_probes, margin)[source]

Create a sort-key function for tiling edge effects.

cnvlib.fix.match_ref_to_probes(ref_pset, probes)[source]

Filter the reference probes to match the target or antitarget probe set.

importers

Import from other formats to the CNVkit format.

cnvlib.importers.find_picard_files(file_and_dir_names)[source]

Search the given paths for ‘targetcoverage’ CSV files.

Per the convention we use in our Picard applets, the target coverage file names end with ‘.targetcoverage.csv’; anti-target coverages end with ‘.antitargetcoverage.csv’.

cnvlib.importers.import_seg(segfname, chrom_names, chrom_prefix, from_log10)[source]

Parse a SEG file. Emit pairs of (sample ID, CopyNumArray)

Values are converted from log10 to log2.

chrom_names:
Map (string) chromosome IDs to names. (Applied before chrom_prefix.) e.g. {‘23’: ‘X’, ‘24’: ‘Y’, ‘25’: ‘M’}
chrom_prefix: prepend this string to chromosome names
(usually ‘chr’ or None)
cnvlib.importers.load_targetcoverage_csv(fname)[source]

Parse a target or antitarget coverage file (.csv) into a CopyNumArray.

These files are generated by Picard CalculateHsMetrics. The fields of the .csv files are actually separated by tabs, not commas.

CSV column names:
chrom (str), start, end, length (int), name (str), %gc, mean_coverage, normalized_coverage (float)
cnvlib.importers.parse_theta_results(fname)[source]

Parse THetA results into a data structure.

Columns: NLL, mu, C, p*

cnvlib.importers.unpipe_name(name)[source]

Fix the duplicated gene names Picard spits out.

Return a string containing the single gene name, sans duplications and pipe characters.

Picard CalculateHsMetrics combines the labels of overlapping intervals by joining all labels with ‘|’, e.g. ‘BRAF|BRAF’ – no two distinct targeted genes actually overlap, though, so these dupes are redundant.

Also, in our convention, ‘CGH’ probes are selected intergenic regions, not meaningful gene names, so ‘CGH|FOO’ resolves as ‘FOO’.

metrics

Robust estimators of central tendency and scale.

For use in evaluating performance of copy number estimation.

See:
http://en.wikipedia.org/wiki/Robust_measures_of_scale http://astropy.readthedocs.org/en/latest/_modules/astropy/stats/funcs.html
cnvlib.metrics.biweight_location(a, initial=None, c=6.0, epsilon=0.0001)[source]

Compute the biweight location for an array.

The biweight is a robust statistic for determining the central location of a distribution.

cnvlib.metrics.biweight_midvariance(a, initial=None, c=9.0, epsilon=0.0001)[source]

Compute the biweight midvariance for an array.

The biweight midvariance is a robust statistic for determining the midvariance (i.e. the standard deviation) of a distribution.

See: http://en.wikipedia.org/wiki/Robust_measures_of_scale#The_biweight_midvariance http://astropy.readthedocs.org/en/latest/_modules/astropy/stats/funcs.html

cnvlib.metrics.ests_of_scale(deviations)[source]

Estimators of scale: standard deviation, MAD, biweight midvariance.

Calculates all of these values for an array of deviations and returns them as a tuple.

cnvlib.metrics.interquartile_range(a)[source]

Compute the difference between the array’s first and third quartiles.

cnvlib.metrics.median_absolute_deviation(a, scale_to_sd=True)[source]

Compute the median absolute deviation (MAD) of array elements.

The MAD is defined as: median(abs(a - median(a))).

See: http://en.wikipedia.org/wiki/Median_absolute_deviation

cnvlib.metrics.probe_deviations_from_segments(probes, segments, skip_low=True)[source]

Difference in CN estimate of each probe from its segment.

cnvlib.metrics.q_n(a)[source]

Rousseeuw & Croux’s (1993) Q_n, an alternative to MAD.

Qn := Cn first quartile of (|x_i - x_j|: i < j)

where Cn is a constant depending on n.

Finite-sample correction factors must be used to calibrate the scale of Qn for small-to-medium-sized samples.

n E[Qn] – —– 10 1.392 20 1.193 40 1.093 60 1.064 80 1.048 100 1.038 200 1.019
cnvlib.metrics.segment_mean(cnarr)[source]

Weighted average of bin log2 values, ignoring too-low-coverage bins.

ngfrills

NGS utilities.

cnvlib.ngfrills.call_quiet(*args)[source]

Safely run a command and get stdout; print stderr if there’s an error.

Like subprocess.check_output, but silent in the normal case where the command logs unimportant stuff to stderr. If there is an error, then the full error message(s) is shown in the exception message.

cnvlib.ngfrills.ensure_path(fname)[source]

Create dirs and move an existing file to avoid overwriting, if necessary.

If a file already exists at the given path, it is renamed with an integer suffix to clear the way.

cnvlib.ngfrills.safe_write(*args, **kwds)[source]

Write to a filename or file-like object with error handling.

If given a file name, open it. If the path includes directories that don’t exist yet, create them. If given a file-like object, just pass it through.

cnvlib.ngfrills.temp_write_text(*args, **kwds)[source]

Save text to a temporary file.

NB: This won’t work on Windows b/c the file stays open.

params

Hard-coded parameters for CNVkit. These should not change between runs.

plots

Plotting utilities.

cnvlib.plots.chromosome_sizes(probes, to_mb=False)[source]

Create an ordered mapping of chromosome names to sizes.

cnvlib.plots.cvg2rgb(cvg, desaturate)[source]

Choose a shade of red or blue representing log2-coverage value.

cnvlib.plots.gene_coords_by_name(probes, names)[source]

Find the chromosomal position of each named gene in probes.

Returns a dict: {chromosome: [(start, end, gene name), ...]}

cnvlib.plots.gene_coords_by_range(probes, chrom, start, end, skip=('Background', 'CGH', '-', '.'))[source]

Find the chromosomal position of all genes in a range.

Returns a dict: {chromosome: [(start, end, gene), ...]}

cnvlib.plots.group_snvs_by_segments(snv_posns, snv_freqs, segments, chrom)[source]

Group SNP allele frequencies by segment.

Return an iterable of: start, end, value.

cnvlib.plots.limit(x, lower, upper)[source]

Limit x to between lower and upper bounds.

cnvlib.plots.parse_range(text)[source]

Parse a chromosomal range specification.

Range spec string should look like: ‘chr1:1234-5678’

cnvlib.plots.partition_by_chrom(chrom_snvs)[source]

Group the tumor shift values by chromosome (for statistical testing).

cnvlib.plots.plot_chromosome(axis, probes, segments, chromosome, sample, genes, background_marker=None, do_trend=False, y_min=None, y_max=None)[source]

Draw a scatter plot of probe values with CBS calls overlaid.

Argument ‘genes’ is a list of tuples: (start, end, gene name)

cnvlib.plots.plot_genome(axis, probes, segments, pad, do_trend=False, y_min=None, y_max=None)[source]

Plot coverages and CBS calls for all chromosomes on one plot.

cnvlib.plots.plot_loh(axis, chrom_snvs, chrom_sizes, segments, do_trend, pad)[source]

Plot a scatter-plot of SNP chromosomal positions and shifts.

cnvlib.plots.plot_x_dividers(axis, chromosome_sizes, pad)[source]

Plot vertical dividers and x-axis labels given the chromosome sizes.

Returns a table of the x-position offsets of each chromosome.

Draws vertical black lines between each chromosome, with padding. Labels each chromosome range with the chromosome name, centered in the region, under a tick. Sets the x-axis limits to the covered range.

cnvlib.plots.probe_center(row)[source]

Return the midpoint of the probe location.

cnvlib.plots.test_loh(bins, alpha=0.0025)[source]

Test each chromosome’s SNP shifts and the combined others’.

The statistical test is Mann-Whitney, a one-sided non-parametric test for difference in means.

reference

Supporting functions for the ‘reference’ command.

cnvlib.reference.bed2probes(bed_fname)[source]

Create neutral-coverage probes from intervals.

cnvlib.reference.calculate_gc_lo(subseq)[source]

Calculate the GC and lowercase (RepeatMasked) content of a string.

cnvlib.reference.combine_probes(filenames, fa_fname, is_male_reference)[source]

Calculate the median coverage of each bin across multiple samples.

Input:
List of .cnn files, as generated by ‘coverage’ or ‘import-picard’. fa_fname: fil columns for GC and RepeatMasker genomic values.
Returns:
A single CopyNumArray summarizing the coverages of the input samples, including each bin’s “average” coverage, “spread” of coverages, and genomic GC content.
cnvlib.reference.get_fasta_stats(probes, fa_fname)[source]

Calculate GC and RepeatMasker content of each bin in the FASTA genome.

cnvlib.reference.mask_bad_probes(probes)[source]

Flag the probes with excessively low or inconsistent coverage.

Returns a bool array where True indicates probes that failed the checks.

cnvlib.reference.reference2regions(reference, coord_only=False)[source]

Extract iterables of target and antitarget regions from a reference CNA.

Like loading two BED files with ngfrills.parse_regions.

cnvlib.reference.warn_bad_probes(probes)[source]

Warn about target probes where coverage is poor.

Prints a formatted table to stderr.

reports

Supporting functions for the text/tabular-reporting commands.

Namely: breaks, gainloss.

cnvlib.reports.gainloss_by_gene(probes, threshold)[source]

Identify genes where average bin copy ratio value exceeds threshold.

NB: Must shift sex-chromosome values beforehand with core.shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.

cnvlib.reports.gainloss_by_segment(probes, segments, threshold)[source]

Identify genes where segmented copy ratio exceeds threshold.

NB: Must shift sex-chromosome values beforehand with core.shift_xx, otherwise all chrX/chrY genes may be reported gained/lost.

cnvlib.reports.get_breakpoints(intervals, segments, min_probes)[source]

Identify CBS segment breaks within the targeted intervals.

cnvlib.reports.get_gene_intervals(all_probes, skip=('Background', 'CGH', '-', '.'))[source]

Tally genomic locations of each targeted gene.

Return a dict of chromosomes to a list of tuples: (gene name, start, end).

cnvlib.reports.group_by_genes(probes)[source]

Group probe and coverage data by gene.

Return an iterable of genes, in chromosomal order, associated with their location and coverages:

[(gene, chrom, start, end, [coverages]), ...]

segmentation

Segmentation of copy number values.

cnvlib.segmentation.do_segmentation(probes_fname, save_dataframe, method, rlibpath=None)[source]

Infer copy number segments from the given coverage table.

cnvlib.segmentation.squash_segments(seg_pset)[source]

Combine contiguous segments.

smoothing

Signal smoothing functions.

cnvlib.smoothing.check_inputs(x, width)[source]

Transform width into a half-window size.

width is either a fraction of the length of x or an integer size of the whole window. The output half-window size is truncated to the length of x if needed.

cnvlib.smoothing.fit_edges(x, y, wing, polyorder=3)[source]

Apply polynomial interpolation to the edges of y, in-place.

Calculates a polynomial fit (of order polyorder) of x within a window of width twice wing, then updates the smoothed values y in the half of the window closest to the edge.

cnvlib.smoothing.outlier_iqr(a, c=1.5)[source]

Detect outliers as a multiple of the IQR from the median.

By convention, “outliers” are points more than 1.5 * IQR from the median, and “extremes” or extreme outliers are those more than 3.0 * IQR.

cnvlib.smoothing.outlier_mad_median(a)[source]

MAD-Median rule for detecting outliers.

Returns: a boolean array of the same size, where outlier indices are True.

X_i is an outlier if:

 | X_i - M |
_____________  > K ~= 2.24

 MAD / 0.6745

where $K = sqrt( X^2_{0.975,1} )$, the square root of the 0.975 quantile of a chi-squared distribution with 1 degree of freedom.

This is a very robust rule with the highest possible breakdown point of 0.5.

See:

  • Davies & Gather (1993) The Identification of Multiple Outliers.
  • Rand R. Wilcox (2012) Introduction to robust estimation and hypothesis testing. Ch.3: Estimating measures of location and scale.
cnvlib.smoothing.rolling_median(x, width)[source]

Rolling median.

Contributed by Peter Otten to comp.lang.python.

Source: https://bitbucket.org/janto/snippets/src/tip/running_median.py https://groups.google.com/d/msg/comp.lang.python/0OARyHF0wtA/SEs-glW4t6gJ

cnvlib.smoothing.smooth_genome_coverages(probes, smooth_func, width)[source]

Fit a trendline through probe coverages, handling chromosome boundaries.

Returns an array of smoothed coverage values, calculated with smooth_func and width, equal in length to probes.

cnvlib.smoothing.smoothed(x, width, do_fit_edges=False)[source]

Smooth the values in x with the Kaiser windowed filter.

See: http://en.wikipedia.org/wiki/Kaiser_window

Parameters:

x : array-like
1-dimensional numeric data set.
width : float
Fraction of x’s total length to include in the rolling window (i.e. the proportional window width), or the integer size of the window.