The relationship between a gene’s copy number and its mRNA expression varies across the genome. For a subset of genes, mostly housekeeping genes, the mRNA expression levels measured by transcriptome sequencing are largely explained by the underlying genomic copy number of the genic regions. CNVkit can use this information to estimate coarse-grained copy number from RNA sequencing of a process-matched cohort of samples.
Samples are processed simultaneously as a cohort, and two additional input files are needed to complete the processing pipeline:
- Gene info – a table of per-transcript and per-gene metadata exported from
Ensembl BioMart. A snapshot of this file for the human transcriptome is
bundled with CNVkit under
- CNV-expression correlation coefficients – calculated per gene via the
cnv_expression_correlate.py script included with CNVkit. A pre-calculated set of coefficients derived from TCGA melanoma datasets, obtained from cBioPortal, is bundled with CNVkit under
With the above files, the RNA analysis can be run on either kind of input file, like this:
    cnvkit.py import-rna [ *_rsem.genes.results | *.txt ] \
        --gene-resource data/ensembl-gene-info.hg38.tsv \
        --correlations data/tcga-skcm.cnv-expr-corr.tsv \
        --output out-summary.tsv --output-dir out/
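For scripted cohort runs, the command above can be assembled programmatically. The following is a minimal sketch, not part of CNVkit: the helper name `build_import_rna_cmd` and the sample file names are hypothetical, and only the options shown in the command above are used.

```python
import shlex


def build_import_rna_cmd(inputs, gene_resource, output_dir,
                         correlations=None, output=None):
    """Assemble an argument list for `cnvkit.py import-rna`.

    Hypothetical helper for illustration; it only strings together the
    options that appear in the example command.
    """
    cmd = ["cnvkit.py", "import-rna", *[str(p) for p in inputs],
           "--gene-resource", str(gene_resource)]
    if correlations is not None:  # recommended, but not required
        cmd += ["--correlations", str(correlations)]
    if output is not None:        # optional all-sample summary table
        cmd += ["--output", str(output)]
    cmd += ["--output-dir", str(output_dir)]
    return cmd


# Placeholder sample names; substitute the cohort's real files.
cmd = build_import_rna_cmd(
    ["s1_rsem.genes.results", "s2_rsem.genes.results"],
    gene_resource="data/ensembl-gene-info.hg38.tsv",
    output_dir="out/",
    correlations="data/tcga-skcm.cnv-expr-corr.tsv",
    output="out-summary.tsv",
)
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when sample paths contain spaces.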
Each gene’s read counts and average transcript length are taken from the input
file for each sample. Normalized, bias-corrected
*.cnr files are written to
--output-dir, and an optional summary table with all samples’ data is
Input file sources:
- RSEM: The third-party RNA quantification program RSEM produces
*_rsem.genes.results output files that can be used as input to the
- Gene counts: Alternatively, the gene Ensembl IDs and per-gene read counts can be read from a simple 2-column, tab-delimited file. This format is used by TCGA level 2 RNA expression data. You can also create the equivalent on your own from the output of another RNA quantification tool like Salmon or Kallisto.
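As an illustration of the gene-counts format, the sketch below converts a tab-delimited quantification table with a header row into a 2-column file of gene ID and read count. Several things here are assumptions rather than documented CNVkit requirements: the function name `write_gene_counts` is hypothetical, the default column names `Name` and `NumReads` are those of a Salmon-style gene-level table and may need changing for other tools, the output has no header row, counts are rounded to integers, and Ensembl version suffixes are stripped only on request.

```python
import csv


def write_gene_counts(quant_path, out_path, id_column="Name",
                      count_column="NumReads", strip_version=False):
    """Write a 2-column, tab-delimited gene ID / read count file.

    Returns the number of genes written. Column names are assumptions;
    adjust them to match the quantification tool's output.
    """
    n_written = 0
    with open(quant_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            gene_id = row[id_column]
            if strip_version:
                # e.g. "ENSG00000141510.17" -> "ENSG00000141510"
                gene_id = gene_id.split(".")[0]
            # Estimated counts may be fractional; round to whole reads.
            count = round(float(row[count_column]))
            writer.writerow([gene_id, count])
            n_written += 1
    return n_written
```

A call such as `write_gene_counts("quant.genes.sf", "sample1.txt", strip_version=True)` would then produce one input file per sample; the file names here are placeholders.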
*.cnr files can be used with CNVkit’s segment,
scatter, and other commands similarly to the
*.cnr files generated
from DNA sequencing data. Each gene is represented by a single bin, and bins
may be overlapping or even nested.
The cbs segmentation method performs reasonably well at producing a
coarse-grained segmentation of the file. The
hmm segmentation method also
When using the scatter command to plot these files, note that the bin
(gene-level) weights are particularly important, and are visually represented by
circle size. A smoothed trendline (
--trend) can be helpful to
supplement the coarse-grained segmentation.
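Because each gene is a single bin and bins may overlap or nest, it can be useful to check how often this happens in a given gene set before interpreting segments. The sketch below is an illustrative helper, not CNVkit code: `classify_bins` is a hypothetical name, and the input is assumed to be plain `(chromosome, start, end, gene)` tuples taken from the corresponding columns of a bin table.

```python
def classify_bins(bins):
    """Find gene-level bins that overlap or nest within an earlier bin.

    bins: iterable of (chromosome, start, end, gene) tuples.
    Returns (overlapping, nested): lists of gene names whose bin
    partially overlaps, or lies entirely inside, a preceding bin on
    the same chromosome.
    """
    overlapping, nested = [], []
    prev_chrom, max_end = None, 0
    # Sort by position; at equal starts, put the longer bin first.
    for chrom, start, end, gene in sorted(
            bins, key=lambda b: (b[0], b[1], -b[2])):
        if chrom != prev_chrom:
            prev_chrom, max_end = chrom, end
            continue
        if start < max_end:
            if end <= max_end:
                nested.append(gene)       # fully inside an earlier bin
            else:
                overlapping.append(gene)  # extends past the earlier bin
        max_end = max(max_end, end)
    return overlapping, nested


example = [
    ("chr1", 100, 500, "A"),
    ("chr1", 200, 300, "B"),  # inside A
    ("chr1", 400, 700, "C"),  # overlaps the end of A
    ("chr1", 800, 900, "D"),
    ("chr2", 100, 200, "E"),
]
print(classify_bins(example))  # (['C'], ['B'])
```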
- Input samples should be process-matched and ideally of the same source tissue type. The DNA source for each sample can be single cells or bulk tissue.
- The cohort size should be at least 5, preferably at least 10, samples.
- The --correlations input is not required but is strongly recommended. The TCGA melanoma cohort correlations can be used for analysis of any tissue type, not just neoplastic melanocytes. However, best results will usually be achieved with a correlations table specific to the test cohort. The script
cnv_expression_correlate.py generates this table from input tables of per-gene and per-sample copy number and expression levels, typically retrieved from cBioPortal for TCGA cancer-specific cohorts.
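To make the idea of a per-gene CNV-expression correlation concrete, here is a minimal sketch that computes a Pearson coefficient for each gene across the samples shared by a copy number table and an expression table. It is not the implementation in cnv_expression_correlate.py, whose statistics and file formats are not described here; the function names and the nested-dictionary input layout (`gene -> sample -> value`) are assumptions chosen for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either variable is constant
    return sxy / math.sqrt(sxx * syy)


def correlate_genes(copy_number, expression):
    """Per-gene correlation between copy number and expression.

    Both arguments map gene -> {sample: value}. Only genes and samples
    present in both tables are used; genes with fewer than two shared
    samples are skipped.
    """
    result = {}
    for gene in sorted(copy_number.keys() & expression.keys()):
        samples = sorted(copy_number[gene].keys() & expression[gene].keys())
        if len(samples) < 2:
            continue
        result[gene] = pearson_r(
            [copy_number[gene][s] for s in samples],
            [expression[gene][s] for s in samples],
        )
    return result
```

Genes whose coefficient is close to 1 are the ones for which expression tracks copy number most closely across the cohort.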