If you would like to quickly try CNVkit without installing it, try our app on DNAnexus.
To run CNVkit on your own machine, keep reading.
Download the source code from GitHub:
And read the README file.
Download the reference genome
Go to the UCSC Genome Bioinformatics website and download:
- Your species’ reference genome sequence, in FASTA format [required]
- Gene annotation database, via RefSeq or Ensembl, in “flat” format (e.g. refFlat.txt) [optional]
You probably already have the reference genome sequence. If your species’ genome is not available from UCSC, use whatever reference sequence you have. CNVkit only requires that your reference genome sequence be in FASTA format. Both the reference genome sequence and the annotation database must be single, uncompressed files.
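Since downloaded genomes often arrive gzip-compressed, a quick check can save a failed run later. The following is a minimal sketch, not part of CNVkit: the helper name check_fasta is hypothetical, and it only inspects the first bytes of the file (the gzip magic number, then the ">" that starts a FASTA header).

```shell
# Hypothetical helper (not a CNVkit command): classify a reference file.
# Prints "compressed", "ok", or "not-fasta".
check_fasta() {
    file="$1"
    # gzip files begin with the two magic bytes 0x1f 0x8b.
    magic=$(head -c 2 "$file" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo "compressed"
    # A FASTA file begins with a ">" header line.
    elif [ "$(head -c 1 "$file")" = ">" ]; then
        echo "ok"
    else
        echo "not-fasta"
    fi
}

# Example usage (file names are placeholders):
#   check_fasta hg19.fasta
#   gunzip -c hg19.fa.gz > hg19.fasta   # decompress into a single plain file
```

If the check prints "compressed", decompress the file first, as in the commented gunzip line.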
If your reference genome is the UCSC human genome hg19, a BED file of the
sequencing-accessible regions is included in the CNVkit distribution as
If you’re not using hg19, consider building the “access” file yourself from your reference genome sequence (say, mm10.fasta) using the access command:
cnvkit.py access mm10.fasta -s 10000 -o access-10kb.mm10.bed
We’ll use this file in the next step to ensure off-target bins (“antitargets”) are allocated only in chromosomal regions that can be mapped.
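To sanity-check the resulting file, you can total the number of bases it covers and compare that with the size of your genome. This is a small sketch using awk, not a CNVkit feature; the function name bed_total_bases is made up for illustration, and it assumes a standard BED layout where column 2 is the 0-based start and column 3 is the exclusive end.

```shell
# Hypothetical helper: sum the lengths of all regions in a BED file.
# BED intervals are 0-based and half-open, so each length is end - start.
bed_total_bases() {
    awk '!/^(track|browser|#)/ && NF >= 3 { total += $3 - $2 }
         END { printf "%.0f\n", total }' "$1"
}

# Example usage:
#   bed_total_bases access-10kb.mm10.bed
```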
Gene annotations: The gene annotations file (refFlat.txt) is useful to apply gene names to your baits BED file, if the BED file does not already have short, informative names for each bait interval. This file can be used in the next step.
If your targets look like:
chr1 1508981 1509154
chr1 2407978 2408183
chr1 2409866 2410095
Then you want refFlat.txt.
Otherwise, if they look like:
chr1 1508981 1509154 SSU72
chr1 2407978 2408183 PLCH2
chr1 2409866 2410095 PLCH2
Then you don’t need refFlat.txt.
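For a large baits file, the same check can be done without opening it. The sketch below is illustrative only: bed_name_status is a hypothetical helper that reports whether every interval has a fourth column. It cannot tell whether the names are informative (for example, a column full of placeholder values would still count as "named").

```shell
# Hypothetical helper: report whether every BED interval has a name column.
# Prints "unnamed" if any data line has fewer than 4 fields, else "named".
bed_name_status() {
    awk '!/^(track|browser|#)/ && NF > 0 && NF < 4 { missing = 1 }
         END { print (missing ? "unnamed" : "named") }' "$1"
}

# Example usage:
#   bed_name_status my_baits.bed
```

A result of "unnamed" suggests you will want refFlat.txt in the next step.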
Map sequencing reads to the reference genome
If you haven’t done so already, use a sequence mapping/alignment program such as BWA to map your sequencing reads to the reference genome sequence.
You should now have one or more BAM files corresponding to individual samples.
Build a reference from normal samples and infer tumor copy ratios
Here we’ll assume the BAM files are a collection of “tumor” and “normal” samples, although germline disease samples can be used equally well in place of tumor samples.
CNVkit uses the bait BED file (provided by the vendor of your capture kit), reference genome sequence, and sequencing-accessible regions along with your BAM files to:
- Create a pooled reference of per-bin copy number estimates from several normal samples; then
- Use this reference in processing all tumor samples that were sequenced with the same platform and library prep.
All of these steps are automated with the batch command. Assuming normal samples share the suffix “Normal.bam” and tumor samples “Tumor.bam”, a complete command could be:
cnvkit.py batch *Tumor.bam --normal *Normal.bam \
    --targets my_baits.bed --fasta hg19.fasta \
    --split --access data/access-5kb-mappable.hg19.bed \
    --output-reference my_reference.cnn --output-dir example/
The built-in help message describes these options and lists additional ones:
cnvkit.py batch -h
If you have no normal samples to use for the reference, you can create a “flat” reference, which assumes equal coverage in all bins, by using the -n flag (the short form of --normal) without specifying any additional BAM files:
cnvkit.py batch *Tumor.bam -n -t my_baits.bed -f hg19.fasta \
    --split --access data/access-5kb-mappable.hg19.bed \
    --output-reference my_flat_reference.cnn -d example2/
In either case, you should run this command with the reference genome sequence FASTA file to extract GC and RepeatMasker information for bias corrections, which enables CNVkit to improve the copy ratio estimates even without a paired normal sample.
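The choice between the two commands above depends only on whether any normal BAM files are present. As a rough sketch of that decision, the hypothetical wrapper below (suggest_batch_cmd and count_existing are illustrative names, not part of CNVkit) looks in a directory for the “Tumor.bam” and “Normal.bam” suffixes assumed earlier and prints, without running, the matching command. It assumes sh or bash globbing and reuses the file names from the examples above.

```shell
# Count how many of the given paths exist (an unmatched glob stays literal in sh/bash).
count_existing() {
    n=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Hypothetical wrapper: print the batch command suited to the BAM files in a directory.
# The command is only printed, never executed; paths are not quoted for reuse.
suggest_batch_cmd() {
    dir="$1"
    tumors=$(count_existing "$dir"/*Tumor.bam)
    normals=$(count_existing "$dir"/*Normal.bam)
    if [ "$tumors" -eq 0 ]; then
        echo "no *Tumor.bam files found in $dir" >&2
        return 1
    fi
    if [ "$normals" -gt 0 ]; then
        # Pooled reference built from the normal samples.
        echo "cnvkit.py batch $dir/*Tumor.bam --normal $dir/*Normal.bam --targets my_baits.bed --fasta hg19.fasta --split --access data/access-5kb-mappable.hg19.bed --output-reference my_reference.cnn --output-dir example/"
    else
        # Flat reference: -n with no BAM files after it.
        echo "cnvkit.py batch $dir/*Tumor.bam -n -t my_baits.bed -f hg19.fasta --split --access data/access-5kb-mappable.hg19.bed --output-reference my_flat_reference.cnn -d example2/"
    fi
}

# Example usage:
#   suggest_batch_cmd .
```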
If your targets are missing gene names, you can add them here with the --annotate option:
cnvkit.py batch *Tumor.bam -n *Normal.bam -t my_baits.bed -f hg19.fasta \
    --annotate refFlat.txt --split --access data/access-5kb-mappable.hg19.bed \
    --output-reference my_flat_reference.cnn -d example3/
Process more tumor samples
You can reuse the reference file you’ve previously constructed to extract copy number information from additional tumor sample BAM files, without repeating the steps above.
Assuming the new tumor samples share the suffix “Tumor.bam” (and let’s also spread the workload across all available CPUs with the -p option, and generate some figures):
cnvkit.py batch *Tumor.bam -r my_reference.cnn -p 0 --scatter --diagram -d example4/
The coordinates of the target and antitarget bins, the gene names for the targets, and the GC and RepeatMasker information for bias corrections are automatically extracted from the reference .cnn file you’ve built.
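If you’re curious what the reference actually carries, you can peek at it from the shell. The sketch below assumes the .cnn file is tab-separated text with a single header row; the helper names cnn_columns and cnn_bin_count are hypothetical, and the exact set of columns may differ between CNVkit versions.

```shell
# Hypothetical helper: list the column names in a .cnn file, one per line.
# Assumes tab-separated text whose first line is a header.
cnn_columns() {
    head -n 1 "$1" | tr '\t' '\n'
}

# Hypothetical helper: count the bins (data rows after the header).
cnn_bin_count() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}

# Example usage:
#   cnn_columns my_reference.cnn
#   cnn_bin_count my_reference.cnn
```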