scikit-genome package

Module skgenome contents

Tabular file I/O (tabio)

tabio

I/O for tabular formats of genomic data (regions or features).

skgenome.tabio.get_filename(infile)[source]
skgenome.tabio.read(infile, fmt='tab', into=None, sample_id=None, meta=None, **kwargs)[source]

Read tabular data from a file or stream into a genome object.

Supported formats: see READERS

If a format supports multiple samples, return the sample specified by sample_id, or if unspecified, return the first sample and warn if there were other samples present in the file.

Parameters:
  • infile (handle or string) – Filename or opened file-like object to read.
  • fmt (string) – File format.
  • into (class) – GenomicArray class or subclass to instantiate, overriding the default for the target file format.
  • sample_id (string) – Sample identifier.
  • meta (dict) – Metadata, as arbitrary key-value pairs.
  • **kwargs – Additional keyword arguments to the format-specific reader function.
Returns:

The data from the given file instantiated as into, if specified, or the default base class for the given file format (usually GenomicArray).

Return type:

GenomicArray or subclass

skgenome.tabio.read_auto(infile)[source]

Auto-detect a file’s format and use an appropriate parser to read it.

skgenome.tabio.safe_write(outfile, verbose=True)[source]

Write to a filename or file-like object with error handling.

If given a file name, open it. If the path includes directories that don’t exist yet, create them. If given a file-like object, just pass it through.
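
The behavior described above can be sketched with the standard library alone. The name open_for_writing is hypothetical and this is not the library's implementation; it only illustrates dispatching on a path versus a handle and creating missing parent directories.

```python
import contextlib
import os


@contextlib.contextmanager
def open_for_writing(outfile):
    """Yield a writable handle for a path or an already-open file-like object."""
    if isinstance(outfile, str):
        # Create any missing parent directories before opening the file.
        parent = os.path.dirname(outfile)
        if parent:
            os.makedirs(parent, exist_ok=True)
        with open(outfile, "w") as handle:
            yield handle
    else:
        # Already a file-like object: pass it through without closing it.
        yield outfile
```

Used as a context manager, e.g. with open_for_writing("results/sample1/regions.bed") as handle, a path is opened and closed by the sketch, while a handle supplied by the caller is left open.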

skgenome.tabio.sniff_region_format(infile)[source]

Guess the format of the given file by reading the first line.

Returns: The detected format name, or None if the file is empty.
Return type: str or None
skgenome.tabio.write(garr, outfile=None, fmt='tab', verbose=True, **kwargs)[source]

Write a genome object to a file or stream.

Base class: GenomicArray

The base class of the core objects used throughout CNVkit and scikit-genome is GenomicArray. It wraps a pandas DataFrame instance, which is accessible through the .data attribute and can be used for any manipulations that aren’t already provided by methods in the wrapper class.

gary

Base class for an array of annotated genomic regions.

class skgenome.gary.GenomicArray(data_table, meta_dict=None)[source]

Bases: object

An array of genomic intervals. Base class for genomic data structures.

Can represent most BED-like tabular formats with arbitrary additional columns.

add(other)[source]

Combine this array’s data with another GenomicArray (in-place).

Any optional columns must match between both arrays.

add_columns(**columns)[source]

Add the given columns to a copy of this GenomicArray.

Parameters: **columns (array) – Keyword arguments where the key is the new column’s name and the value is an array of the same length as self which will be the new column’s values.
Returns: A new instance of self with the given columns included in the underlying dataframe.
Return type: GenomicArray or subclass
as_columns(**columns)[source]

Wrap the named columns in this instance’s metadata.

as_dataframe(dframe, reset_index=False)[source]

Wrap the given pandas DataFrame in this instance’s metadata.

as_rows(rows)[source]

Wrap the given rows in this instance’s metadata.

as_series(arraylike)[source]
autosomes(also=())[source]

Select chromosomes with integer names, ignoring any ‘chr’ prefixes.

by_arm(min_gap_size=100000.0, min_arm_bins=50)[source]

Iterate over bins grouped by chromosome arm (inferred).

by_chromosome()[source]

Iterate over bins grouped by chromosome name.

by_ranges(other, mode='outer', keep_empty=True)[source]

Group rows by another GenomicArray’s bin coordinate ranges.

For example, this can be used to group SNVs by CNV segments.

Bins in this array that fall outside the other array’s bins are skipped.

Parameters:
  • other (GenomicArray) – Another GA instance.
  • mode (string) –

    Determines what to do with bins that overlap a boundary of the selection. Possible values are:

    • inner: Drop the bins on the selection boundary, don’t emit them.
    • outer: Keep/emit those bins as they are.
    • trim: Emit those bins but alter their boundaries to match the selection; the bin start or end position is replaced with the selection boundary position.
  • keep_empty (bool) – Whether to also yield other bins with no overlapping bins in self, or to skip them when iterating.
Yields:

tuple – (other bin, GenomicArray of overlapping rows in self)
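
The three mode values are easiest to compare side by side. The sketch below is a plain-Python illustration, not the library's code: select_bins is a hypothetical helper, bins are (start, end) tuples on a single chromosome, and coordinates are assumed to be half-open.

```python
def select_bins(bins, start, end, mode="outer"):
    """Select (start, end) bins overlapping the half-open range [start, end)."""
    if mode not in ("inner", "outer", "trim"):
        raise ValueError(f"Unknown mode: {mode!r}")
    selected = []
    for bin_start, bin_end in bins:
        if bin_end <= start or bin_start >= end:
            continue  # no overlap with the selection at all
        straddles = bin_start < start or bin_end > end
        if straddles and mode == "inner":
            continue  # drop bins that cross a selection boundary
        if straddles and mode == "trim":
            # Clip the bin to the selection boundaries.
            bin_start, bin_end = max(bin_start, start), min(bin_end, end)
        selected.append((bin_start, bin_end))
    return selected


bins = [(0, 100), (100, 200), (200, 300)]
print(select_bins(bins, 50, 250, mode="inner"))  # -> [(100, 200)]
print(select_bins(bins, 50, 250, mode="outer"))  # -> [(0, 100), (100, 200), (200, 300)]
print(select_bins(bins, 50, 250, mode="trim"))   # -> [(50, 100), (100, 200), (200, 250)]
```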

chromosome
concat(others)[source]

Concatenate several GenomicArrays, keeping this array’s metadata.

This array’s data table is not implicitly included in the result.

coords(also=())[source]

Iterate over plain coordinates of each bin: chromosome, start, end.

Parameters:
  • also (str, or iterable of strings) – Also include these columns from self, in addition to chromosome, start, and end.

Example, yielding rows in BED format:

>>> probes.coords(also=["gene", "strand"])

copy()[source]

Create an independent copy of this object.

cut(other, combine=None)[source]

Split this array’s regions at the boundaries in other.

drop_extra_columns()[source]

Remove any optional columns from this GenomicArray.

Returns: A new copy with only the minimal set of columns required by the class (e.g. chromosome, start, end for GenomicArray; may be more for subclasses).
Return type: GenomicArray or subclass
end
filter(func=None, **kwargs)[source]

Take a subset of rows where the given condition is true.

Parameters:
  • func (callable) – A boolean function which will be applied to each row to keep rows where the result is True.
  • **kwargs (string) – Keyword arguments like chromosome="chr7" or gene="Antitarget", which will keep rows where the keyed field equals the specified value.
Returns:

Subset of self where the specified condition is True.

Return type:

GenomicArray
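
As an illustration of the two selection styles, here is a sketch over plain dicts rather than a GenomicArray. The helper filter_rows is hypothetical, and it assumes that func and the keyword conditions are combined with a logical AND, which the description above does not state explicitly.

```python
def filter_rows(rows, func=None, **kwargs):
    """Keep rows (dicts) that pass func and match every keyword value."""
    kept = []
    for row in rows:
        if func is not None and not func(row):
            continue  # the boolean function rejected this row
        if any(row[key] != value for key, value in kwargs.items()):
            continue  # at least one keyed field differs from the requested value
        kept.append(row)
    return kept


rows = [
    {"chromosome": "chr7", "start": 0, "end": 100, "gene": "EGFR"},
    {"chromosome": "chr7", "start": 100, "end": 200, "gene": "Antitarget"},
    {"chromosome": "chr8", "start": 0, "end": 50, "gene": "MYC"},
]
on_chr7 = filter_rows(rows, chromosome="chr7")
wide_antitargets = filter_rows(
    rows, func=lambda r: r["end"] - r["start"] >= 100, gene="Antitarget"
)
print(len(on_chr7), len(wide_antitargets))  # -> 2 1
```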

flatten(combine=None, split_columns=None)[source]

Split this array’s regions where they overlap.

classmethod from_columns(columns, meta_dict=None)[source]

Create a new instance from column arrays, given as a dict.

classmethod from_rows(rows, columns=None, meta_dict=None)[source]

Create a new instance from a list of rows, as tuples or arrays.

in_range(chrom=None, start=None, end=None, mode='outer')[source]

Get the GenomicArray portion within the given genomic range.

Parameters:
  • chrom (str or None) – Chromosome name to select. Use None if self has only one chromosome.
  • start (int or None) – Start coordinate of range to select, in 0-based coordinates. If None, start from 0.
  • end (int or None) – End coordinate of range to select. If None, select to the end of the chromosome.
  • mode (str) – As in by_ranges: outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.
Returns:

The subset of self enclosed by the specified range.

Return type:

GenomicArray

in_ranges(chrom=None, starts=None, ends=None, mode='outer')[source]

Get the GenomicArray portion within the specified ranges.

Similar to in_range, but concatenating the selections of all the regions specified by the starts and ends arrays.

Parameters:
  • chrom (str or None) – Chromosome name to select. Use None if self has only one chromosome.
  • starts (int array, or None) – Start coordinates of ranges to select, in 0-based coordinates. If None, start from 0.
  • ends (int array, or None) – End coordinates of ranges to select. If None, select to the end of the chromosome. If starts and ends are both specified, they must be arrays of equal length.
  • mode (str) – As in by_ranges: outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.
Returns:

Concatenation of all the subsets of self enclosed by the specified ranges.

Return type:

GenomicArray

intersection(other, mode='outer')[source]

Select the bins in self that overlap the regions in other.

The extra fields of self, but not other, are retained in the output.

into_ranges(other, column, default, summary_func=None)[source]

Re-bin values from column into the corresponding ranges in other.

Match overlapping/intersecting rows from other to each row in self. Then, within each range in other, extract the value(s) from column in self, using the function summary_func to produce a single value if multiple bins in self map to a single range in other.

For example, group SNVs (self) by CNV segments (other) and calculate the median (summary_func) of each SNV group’s allele frequencies.

Parameters:
  • other (GenomicArray) – Ranges into which the overlapping values of self will be summarized.
  • column (string) – Column name in self to extract values from.
  • default – Value to assign to indices in other that do not overlap any bins in self. Type should be the same as or compatible with the output field specified by column, or the output of summary_func.
  • summary_func (callable, dict of string-to-callable, or None) –

    Specify how to reduce 1 or more rows in self into a single value for the corresponding row in other.

    • If callable, apply it to the column field of each group of rows in self.
    • If a single-element dict of column name to callable, apply the callable to that field in self instead of column.
    • If None, use an appropriate summarizing function for the datatype of column in self (e.g. median of numbers, concatenation of strings).
    • If some other value, assign that value to other wherever there is an overlap.
Returns:

The extracted and summarized values from self corresponding to other’s genomic ranges, the same length as other.

Return type:

pd.Series
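
The SNV-by-segment example above can be sketched in plain Python. This is an illustration of the re-binning idea only, not the library's implementation: summarize_into_ranges is a hypothetical helper, the data are (position, value) points and half-open (start, end) ranges on a single chromosome, and the result is a list rather than a pd.Series.

```python
from statistics import median


def summarize_into_ranges(source, dest, default, summary_func=median):
    """Summarize (position, value) points within each (start, end) range in dest."""
    results = []
    for start, end in dest:
        values = [value for pos, value in source if start <= pos < end]
        # Ranges with no overlapping points receive the default value.
        results.append(summary_func(values) if values else default)
    return results


snvs = [(10, 0.48), (40, 0.52), (60, 0.50), (150, 0.98)]  # allele frequencies
segments = [(0, 100), (100, 200), (200, 300)]
print(summarize_into_ranges(snvs, segments, default=0.0))  # -> [0.5, 0.98, 0.0]
```

The output has one entry per range in dest, mirroring the "same length as other" guarantee described above.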

iter_ranges_of(other, column, mode='outer', keep_empty=True)[source]

Group rows by another GenomicArray’s bin coordinate ranges.

For example, this can be used to group SNVs by CNV segments.

Bins in this array that fall outside the other array’s bins are skipped.

Parameters:
  • other (GenomicArray) – Another GA instance.
  • column (string) – Column name in self to extract values from.
  • mode (string) –

    Determines what to do with bins that overlap a boundary of the selection. Possible values are:

    • inner: Drop the bins on the selection boundary, don’t emit them.
    • outer: Keep/emit those bins as they are.
    • trim: Emit those bins but alter their boundaries to match the selection; the bin start or end position is replaced with the selection boundary position.
  • keep_empty (bool) – Whether to also yield other bins with no overlapping bins in self, or to skip them when iterating.
Yields:

tuple – (other bin, GenomicArray of overlapping rows in self)

keep_columns(colnames)[source]

Extract a subset of columns, reusing this instance’s metadata.

labels()[source]
merge(bp=0, stranded=False, combine=None)[source]

Merge adjacent or overlapping regions into single rows.

Similar to ‘bedtools merge’.
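
A minimal sketch of interval merging on (start, end) tuples for one chromosome. It treats bp as the largest gap that is still bridged, in the manner of the -d option of bedtools merge; that reading of bp is an assumption, and merge_intervals is a hypothetical helper that ignores the stranded and combine options.

```python
def merge_intervals(intervals, bp=0):
    """Merge (start, end) intervals that overlap or lie within bp of each other."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + bp:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


regions = [(0, 10), (10, 20), (25, 30), (100, 110)]
print(merge_intervals(regions))        # -> [(0, 20), (25, 30), (100, 110)]
print(merge_intervals(regions, bp=5))  # -> [(0, 30), (100, 110)]
```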

resize_ranges(bp, chrom_sizes=None)[source]

Resize each genomic bin by a fixed number of bases at each end.

Bin ‘start’ values have a minimum of 0, and chrom_sizes can specify each chromosome’s maximum ‘end’ value.

Similar to ‘bedtools slop’.

Parameters:
  • bp (int) – Number of bases in each direction to expand or shrink each bin. Applies to ‘start’ and ‘end’ values symmetrically, and may be positive (expand) or negative (shrink).
  • chrom_sizes (dict of string-to-int) – Chromosome name to length in base pairs. If given, all chromosomes in self must be included.
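
The clamping rules can be sketched on (chromosome, start, end) tuples. The helper resize_bins is hypothetical and not the library's code; in particular, dropping bins that shrink to zero or negative length is an assumption made here to keep the output valid.

```python
def resize_bins(bins, bp, chrom_sizes=None):
    """Expand (bp > 0) or shrink (bp < 0) each (chrom, start, end) bin at both ends."""
    resized = []
    for chrom, start, end in bins:
        new_start = max(0, start - bp)  # 'start' never goes below 0
        new_end = end + bp
        if chrom_sizes is not None:
            # Raises KeyError if a chromosome is missing from chrom_sizes.
            new_end = min(new_end, chrom_sizes[chrom])
        if new_start < new_end:  # assumption: discard bins that vanish
            resized.append((chrom, new_start, new_end))
    return resized


bins = [("chr1", 5, 100), ("chr1", 950, 1000)]
print(resize_bins(bins, 10, chrom_sizes={"chr1": 1000}))
# -> [('chr1', 0, 110), ('chr1', 940, 1000)]
print(resize_bins(bins, -10))
# -> [('chr1', 15, 90), ('chr1', 960, 990)]
```
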
sample_id
shuffle()[source]

Randomize the order of bins in this array (in-place).

sort()[source]

Sort this array’s bins in-place, with smart chromosome ordering.

sort_columns()[source]

Sort this array’s columns in-place, per class definition.

squash(combine=None)[source]

Combine some groups of rows, by some criteria, into single rows.

start
subdivide(avg_size, min_size=0, verbose=False)[source]

Split this array’s regions into roughly equal-sized sub-regions.

subtract(other)[source]

Remove the overlapping regions in other from this array.
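
One way to picture the operation is as a one-way interval difference. The sketch below works on (start, end) tuples for a single chromosome and splits a region when an interval in other falls inside it. Whether the library splits regions in exactly this way is an assumption, and subtract_intervals is a hypothetical helper.

```python
def subtract_intervals(regions, others):
    """Remove the parts of each (start, end) region covered by any interval in others."""
    result = []
    for start, end in regions:
        cursor = start  # left edge of the part not yet examined
        for o_start, o_end in sorted(others):
            if o_end <= cursor or o_start >= end:
                continue  # no overlap with the remaining piece
            if o_start > cursor:
                result.append((cursor, o_start))  # keep the uncovered left part
            cursor = max(cursor, o_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


print(subtract_intervals([(0, 100), (200, 300)], [(20, 30), (50, 120)]))
# -> [(0, 20), (30, 50), (200, 300)]
```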

total_range_size()[source]

Total number of bases covered by all (merged) regions.

Genomic interval arithmetic

intersect

DataFrame-level intersection operations.

Calculate overlapping regions, similar to bedtools intersect.

The functions here operate on pandas DataFrame and Series instances, not GenomicArray types.

skgenome.intersect.by_ranges(table, other, mode, keep_empty)[source]

Group rows by another GenomicArray’s bin coordinate ranges.

skgenome.intersect.by_shared_chroms(table, other, keep_empty=True)[source]
skgenome.intersect.idx_ranges(table, starts, ends, mode)[source]

Iterate through sub-ranges.

skgenome.intersect.into_ranges(source, dest, src_col, default, summary_func)[source]

Group a column in source by regions in dest and summarize.

skgenome.intersect.iter_ranges(table, chrom, starts, ends, mode)[source]

Iterate through sub-ranges.

skgenome.intersect.iter_slices(table, other, mode, keep_empty)[source]

Yields indices to extract ranges from table.

Returns an iterable of integer arrays that can apply to Series objects, i.e. columns of table. These indices are of the DataFrame/Series’ Index, not array coordinates – so be sure to use DataFrame.loc, Series.loc, or Series getitem, as opposed to .iloc or indexing directly into Numpy arrays.

skgenome.intersect.venn(table, other, mode)[source]

merge

DataFrame-level merging operations.

Merge overlapping regions into single rows, similar to bedtools merge.

The functions here operate on pandas DataFrame and Series instances, not GenomicArray types.

skgenome.merge.flatten(table, combine=None, split_columns=None)[source]
skgenome.merge.merge(table, bp=0, stranded=False, combine=None)[source]

Merge overlapping rows in a DataFrame.

subdivide

DataFrame-level subdivide operation.

Split each region into similar-sized sub-regions.

The functions here operate on pandas DataFrame and Series instances, not GenomicArray types.

skgenome.subdivide.subdivide(table, avg_size, min_size=0, verbose=False)[source]

subtract

DataFrame-level subtraction operations.

Subtract one set of regions from another, returning the one-way difference.

The functions here operate on pandas DataFrame and Series instances, not GenomicArray types.

skgenome.subtract.subtract(table, other)[source]

Helper modules

chromsort

Operations on chromosome/contig/sequence names.

skgenome.chromsort.detect_big_chroms(sizes)[source]

Determine the number of “big” chromosomes from their lengths.

In the human genome, this returns 24, where the canonical chromosomes 1-22, X, and Y are considered “big”, while mitochondria and the alternative contigs are not. This allows us to exclude the non-canonical chromosomes from an analysis where they’re not relevant.

Returns:
  • n_big (int) – Number of “big” chromosomes in the genome.
  • thresh (int) – Length of the smallest “big” chromosomes.
skgenome.chromsort.sorter_chrom(label)[source]

Create a sorting key from chromosome label.

Sort by integers first, then letters or strings. The prefix “chr” (case-insensitive), if present, is stripped automatically for sorting.

E.g. chr1 < chr2 < chr10 < chrX < chrY < chrM
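
A sort key that reproduces the ordering in this example can be sketched as follows. The function chrom_sort_key is hypothetical, and the fixed ranking of X, Y, and M is an assumption chosen to match the example; the library's own rules for non-numeric names may differ.

```python
def chrom_sort_key(label):
    """Sort key: numeric chromosome names first, then the rest in a fixed order."""
    # Strip a leading "chr" prefix, case-insensitively.
    name = label[3:] if label.lower().startswith("chr") else label
    if name.isdecimal():
        return (0, int(name), "")
    # Assumed ranking of common non-numeric names; anything else sorts after them.
    special = {"X": 0, "Y": 1, "M": 2, "MT": 2}
    return (1, special.get(name.upper(), 3), name)


chroms = ["chrM", "chr10", "chrX", "chr2", "chrY", "chr1"]
print(sorted(chroms, key=chrom_sort_key))
# -> ['chr1', 'chr2', 'chr10', 'chrX', 'chrY', 'chrM']
```

Comparing integers rather than strings is what places chr2 before chr10.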

combiners

Combiner functions for Python list-like input.

skgenome.combiners.first_of(elems)[source]

Return the first element of the input.

skgenome.combiners.get_combiners(table, stranded=False, combine=None)[source]

Get a combine lookup suitable for table.

Parameters:
  • table (DataFrame) –
  • stranded (bool) –
  • combine (dict or None) – Column names to their value-combining functions, replacing or in addition to the defaults.
Returns:

Column names to their value-combining functions.

Return type:

dict

skgenome.combiners.join_strings(elems, sep=', ')[source]

Join a Series of strings by commas.

skgenome.combiners.last_of(elems)[source]

Return the last element of the input.

skgenome.combiners.make_const(val)[source]
skgenome.combiners.merge_strands(elems)[source]

rangelabel

Handle text genomic ranges as named tuples.

A range specification should look like chromosome:start-end, e.g. chr1:1234-5678, with 1-indexed integer coordinates. We also allow chr1:1234- or chr1:-5678, where missing start becomes 0 and missing end becomes None.

class skgenome.rangelabel.NamedRegion(chromosome, start, end, gene)

Bases: tuple

chromosome

Alias for field number 0

end

Alias for field number 2

gene

Alias for field number 3

start

Alias for field number 1

class skgenome.rangelabel.Region(chromosome, start, end)

Bases: tuple

chromosome

Alias for field number 0

end

Alias for field number 2

start

Alias for field number 1

skgenome.rangelabel.from_label(text, keep_gene=True)[source]

Parse a chromosomal range specification.

Parameters: text (string) – Range specification, which should look like chr1:1234-5678 or chr1:1234- or chr1:-5678, where missing start becomes 0 and missing end becomes None.
skgenome.rangelabel.to_label(row)[source]

Convert a Region or (chrom, start, end) tuple to a region label.

skgenome.rangelabel.unpack_range(a_range)[source]

Extract chromosome, start, end from a string or tuple.

Examples:

"chr1" -> ("chr1", None, None)
"chr1:100-123" -> ("chr1", 99, 123)
("chr1", 100, 123) -> ("chr1", 100, 123)
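
The conversions listed above can be reproduced with a small parser. This is a sketch under the conventions stated in this module's description (1-indexed labels, a missing start becoming 0, a missing end becoming None). The names parse_range and LABEL_RE are hypothetical, and the Region tuple is redefined locally so the block stands alone.

```python
import re
from collections import namedtuple

Region = namedtuple("Region", "chromosome start end")

# chromosome, optional start, optional end, e.g. chr1:1234-5678
LABEL_RE = re.compile(r"^([^:]+):(\d*)-(\d*)$")


def parse_range(a_range):
    """Return a Region from a label string or a (chromosome, start, end) tuple."""
    if not isinstance(a_range, str):
        return Region(*a_range)  # tuples pass through unchanged
    if ":" not in a_range:
        return Region(a_range, None, None)  # bare chromosome name
    match = LABEL_RE.match(a_range)
    if match is None:
        raise ValueError(f"Invalid range spec: {a_range!r}")
    chrom, start, end = match.groups()
    # Labels are 1-indexed: shift the start to 0-based, keep the end as given.
    start = int(start) - 1 if start else 0
    end = int(end) if end else None
    return Region(chrom, start, end)


print(parse_range("chr1:100-123"))
# -> Region(chromosome='chr1', start=99, end=123)
```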