- Update gene names (the ‘gene’ column) in CNVkit .cnn/.cnr files, using gene annotations from another UCSC RefFlat, BED, or GFF file (e.g. refFlat.txt). This may be useful if you notice at the end of an analysis that vendor-annotated targets are not the desired gene names, and want to change the labeling without repeating the analysis with an updated target BED file.
- Create a table of correlation coefficients between gene copy number and mRNA expression. See: RNA expression
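As an illustration of what such a table contains, the sketch below computes one coefficient per gene across samples. It assumes Pearson's r and a hypothetical input layout (two dictionaries mapping gene name to per-sample values, with samples in the same order); the actual script may report different or additional coefficients, and the function names here are not part of CNVkit.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def correlate_genes(copy_number, expression):
    """Hypothetical layout: both arguments are {gene: [value per sample]}."""
    return {
        gene: pearson_r(copy_number[gene], expression[gene])
        for gene in copy_number
        if gene in expression
    }
```

Genes present in only one of the two inputs are skipped rather than reported with a missing value; that is a choice made for this sketch only.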
Update .cnn, .cnr and .cns files generated by earlier versions of CNVkit to add the “depth” column used in CNVkit version 0.8.0 and later. The script reads each input file, calculates absolute-scale depth from the “log2” value in each row, and writes a corresponding output file with a modified name; the input files are not modified in place.
Running this script is not necessary for new analyses, but may help ease the transition for analyses that have already begun.
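The conversion described above can be sketched as follows. This is a minimal illustration that assumes absolute-scale depth is simply 2 raised to the log2 value and that the input is tab-separated with a header row. The function name is hypothetical, and the sketch works on text in memory rather than reproducing the script's output file naming.

```python
import csv
import io


def add_depth_column(text):
    """Return tab-separated text with a 'depth' column derived from 'log2'."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    fields = list(reader.fieldnames or [])
    if "depth" in fields:
        return text  # already in the newer layout; nothing to do
    fields.append("depth")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Assumption: depth = 2 ** log2
        row["depth"] = f"{2.0 ** float(row['log2']):.6g}"
        writer.writerow(row)
    return out.getvalue()
```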
- Test each bin in a .cnr file individually for non-neutral copy number.
Specifically, calculate the probability of a bin’s log2 value versus a
normal distribution with a mean of 0 and standard deviation
back-calculated from bin weight. Output another .cnr file with z-test
probabilities in the additional column “ztest”; drop rows where the
probability is above the threshold (
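The per-bin test described above can be sketched as a two-sided z-test. The sketch assumes each bin already carries a standard deviation (how that value is back-calculated from the bin weight is not shown here), and the field name `sd`, the function names, and the default threshold are hypothetical choices for illustration.

```python
import math


def ztest_pvalue(log2, sd):
    """Two-sided p-value of a log2 ratio under Normal(mean=0, sd=sd)."""
    z = log2 / sd
    # For a standard normal, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def filter_bins(bins, alpha=0.005):
    """Keep bins whose p-value is at or below alpha, adding a 'ztest' field.

    bins: iterable of dicts with 'log2' and 'sd' keys (hypothetical layout).
    alpha: illustrative threshold, not necessarily the script's default.
    """
    kept = []
    for row in bins:
        p = ztest_pvalue(row["log2"], row["sd"])
        if p <= alpha:
            kept.append({**row, "ztest": p})
    return kept
```

This sketch applies no multiple-testing correction; with many thousands of bins, some correction is usually worth considering before interpreting the surviving rows.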
Use the read depths in one or more given BAM files to infer which regions were targeted in a hybrid capture or targeted amplicon capture sequencing protocol. This script can be used in case the original BED file of targeted intervals is unavailable. (However, CNVkit will give much better results if the true targeted intervals can be provided.) It works in 2 modes, guided and unguided:
Guided: Given candidate targets, such as all known exons in the reference genome, test the mean coverage depth in each candidate target and drop those that did not receive sufficient coverage, presumed to be those exons or genes that were not targeted by the sequencing library.
guess_baits.py Sample1.bam Sample2.bam -t ucsc-exons.bed -o baits.bed
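The guided filtering idea can be pictured with the small sketch below. It assumes per-base depths are already available in memory as one list per chromosome, uses half-open coordinates, and picks an arbitrary cutoff; the function names and the `min_depth` value are hypothetical and do not reflect how guess_baits.py reads BAM files or chooses its threshold.

```python
def mean_depth(depths, start, end):
    """Mean of per-base depths over the half-open interval [start, end)."""
    window = depths[start:end]
    return sum(window) / len(window) if window else 0.0


def keep_covered(candidates, depths_by_chrom, min_depth=5.0):
    """Drop candidate targets whose mean depth falls below min_depth.

    candidates: iterable of (chrom, start, end) tuples.
    depths_by_chrom: {chrom: [depth at each base]} (hypothetical layout).
    """
    kept = []
    for chrom, start, end in candidates:
        depths = depths_by_chrom.get(chrom, [])
        if mean_depth(depths, start, end) >= min_depth:
            kept.append((chrom, start, end))
    return kept
```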
Unguided: Scan every base in the sample BAM(s), inferring likely boundaries for enriched regions. (This is usually much slower than the guided approach.)
guess_baits.py -g access.hg19.bed Sample1.bam Sample2.bam -o baits.bed
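One simple way to picture the unguided scan is as run detection over per-base depths: consecutive bases at or above a cutoff form one enriched region. The sketch below assumes a plain list of depths for a single chromosome. The cutoff, the minimum region size, and the function name are hypothetical, and the real script's boundary inference may be more elaborate.

```python
def enriched_regions(depths, min_depth=5.0, min_size=1):
    """Return half-open (start, end) runs where depth >= min_depth."""
    regions = []
    run_start = None
    for pos, depth in enumerate(depths):
        if depth >= min_depth:
            if run_start is None:
                run_start = pos  # a new enriched run begins here
        elif run_start is not None:
            if pos - run_start >= min_size:
                regions.append((run_start, pos))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(depths) - run_start >= min_size:
        regions.append((run_start, len(depths)))
    return regions
```

Because every base is visited once, the cost grows with genome size rather than with the number of candidate targets, which is consistent with the unguided mode being the slower of the two.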
In either mode, the input region coordinates can be provided in any of the formats handled by skgenome.tabio, but it’s best to first run them through either the target command or the script skg_convert.py --flatten (see below) to ensure the input regions do not overlap.
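To check whether a set of regions still contains overlaps before passing it on, something like the following sketch could be used. It only detects overlaps and does not resolve them, so it says nothing about how the target command or --flatten handles them. Coordinates are assumed to be half-open, and the function name is hypothetical.

```python
def find_overlaps(regions):
    """Return (earlier, later) pairs of overlapping regions.

    regions: iterable of (chrom, start, end) tuples, half-open coordinates.
    Each region is compared against the region on the same chromosome that
    reaches furthest to the right among those sorted before it.
    """
    overlaps = []
    widest = None
    for region in sorted(regions):
        chrom, start, end = region
        if widest is not None and widest[0] == chrom and start < widest[2]:
            overlaps.append((widest, region))
        if widest is None or widest[0] != chrom or end > widest[2]:
            widest = region
    return overlaps
```

An empty result means the regions are mutually non-overlapping; regions that merely touch (one ends where the next begins) are not counted as overlapping.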
Extract target and antitarget BED files from a CNVkit reference file. While the batch command does this step automatically when an existing reference is provided, you may find this standalone script useful to recover the target and antitarget BED files that match the reference if those BED files are missing or you’re not sure which ones are correct.
Alternatively, once you have a stable CNVkit reference for your platform, you can use this script to drop the “bad” bins from your target and antitarget BED files (and subsequently built references) to avoid unnecessarily calculating coverage in those bins during future runs.
- Convert between any of the tabular data formats supported by skgenome.tabio, including BED and UCSC RefFlat (e.g. refFlat.txt from UCSC Genome Bioinformatics).
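To make the idea of such a conversion concrete, the sketch below turns UCSC RefFlat rows into four-column BED rows, one per exon. It relies on the standard RefFlat column order (gene name, transcript name, chromosome, strand, transcript start and end, CDS start and end, exon count, exon starts, exon ends). It is a standalone illustration with a hypothetical function name and does not use or reproduce skgenome.tabio.

```python
def refflat_to_bed(lines):
    """Yield one BED row (chrom, start, end, gene) per exon in RefFlat input."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        gene, chrom = fields[0], fields[2]
        # Exon coordinate lists are comma-separated with a trailing comma.
        starts = [int(s) for s in fields[9].split(",") if s]
        ends = [int(e) for e in fields[10].split(",") if e]
        for start, end in zip(starts, ends):
            yield f"{chrom}\t{start}\t{end}\t{gene}"
```

Overlapping exons from different transcripts of the same gene are emitted as-is, so the output may still need flattening before use as a target list.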