Whole-genome sequencing and targeted amplicon capture¶

CNVkit is primarily designed for use on hybrid capture sequencing data, where off-target reads are present and can be used improve copy number estimates. However, CNVkit can also be used on whole-genome sequencing (WGS) and targeted amplicon sequencing (TAS) datasets by using alternative command-line options.

The batch command supports these workflows through the -m/--method option.

Whole-Genome Sequencing (WGS)¶

CNVkit treats WGS data as a capture of all of the genome’s sequencing-accessible regions, with no off-target regions.

The batch --method wgs option uses the given reference genome’s sequencing-accessible regions (“access” BED) as the “targets” – these will be calculated on the fly if not provided. No “antitarget” regions are used. Since the input does not contain useful per-target gene labels, a gene annotation database is required and used to label genes in the outputs:

cnvkit.py batch -m wgs -g data/access-5kb-mappable.hg19.bed --annotate refFlat.txt *.bam

Equivalently:

cnvkit.py target data/access-5kb-mappable.hg19.bed --split --short-names --annotate refFlat.txt -o targets.bed
# For each sample
cnvkit.py coverage Sample.bam targets.bed -p 0 -o Sample.targetcoverage.cnn
# Create an empty antitarget coverage file, header only
head -n1 Sample.targetcoverage.cnn -o Sample.antitargetcoverage.cnn
cnvkit.py reference *.targetcoverage.cnn *.antitargetcoverage.cnn -o ref-wgs.cnn
cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn ref-wgs.cnn --no-edge

To speed up WGS analyses, try any or all of the following:

Instead of analyzing the whole genome, use the “access” or “target” BED file to limit the analysis to just the genic regions. You can get such a BED file from the [UCSC Genome Browser](https://genome.ucsc.edu/cgi-bin/hgTables), for example.
Increase the “target” average bin size to 500 or 1000 bases.
Specify a smaller p-value threshold (segment -t). For the CBS method, 1e-6 may work well.
Use the -p/--processes option in the batch, coverage and segment commands to ensure all available CPUs are used.
Ensure you are using the most recent version of CNVkit. Each release includes some performance improvements.

The batch -m wgs option does all of these except the first automatically.

Targeted Amplicon Sequencing (TAS)¶

When amplicon sequencing is used as a targeted capture method, no off-target reads are sequenced. While this limits the copy number information available in the sequencing data versus hybrid capture, CNVkit can analyze TAS data using only on-target coverages and excluding all off-target regions from the analysis.

The batch -m amplicon option uses the given targets to infer coverage, and leaves the antitarget coverage file empty:

cnvkit.py batch -m amplicon -t targets.bed *.bam

Equivalently:

cnvkit.py target targets.bed --split -o targets.split.bed
# For each sample
cnvkit.py coverage Sample.bam targets.split.bed -p 0 -o Sample.targetcoverage.cnn
# Create an empty antitarget coverage file, header only
head -n1 Sample.targetcoverage.cnn -o Sample.antitargetcoverage.cnn
cnvkit.py reference *.targetcoverage.cnn *.antitargetcoverage.cnn -o ref-tas.cnn
cnvkit.py fix Sample.targetcoverage.cnn Sample.antitargetcoverage.cnn ref-tas.cnn --no-edge

This approach does not collect any copy number information between targeted regions, so it should only be used if you have in fact prepared your samples with a targeted amplicon sequencing protocol. It also does not attempt to normalize each amplicon at the gene level, though this may be addressed in a future version of CNVkit.