The pbcore.io package provides a number of lightweight interfaces to PacBio data files and other standard bioinformatics file formats. Preferred usage is to import classes directly from the pbcore.io package, e.g.:
>>> from pbcore.io import CmpH5Reader
The classes within pbcore.io adhere to a few conventions, in order to provide a uniform API:
Each data file type is thought of as a container of a Record type; all Reader classes support streaming access, and CmpH5Reader and BasH5Reader additionally provide random access to alignments/reads.
The constructor argument needed to instantiate Reader and Writer objects can be either a filename (which can be suffixed by ".gz" for all but the h5 file types) or an open file handle. Given a filename, the class opens the file itself, using gzip when the name ends in ".gz"; given a file handle, it uses the handle as-is.
The reader/writer classes all support the context manager idiom. If you write:
>>> with CmpH5Reader("aligned_reads.cmp.h5") as r:
...     print(r[0].read())
the CmpH5Reader object will be automatically closed after the block within the “with” statement is executed.
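The conventions above (filename or open handle, optional ".gz" suffix, streaming iteration, context manager support) can be illustrated with a small standard-library sketch. This is not pbcore code: the class name SimpleFastaReader, its (name, sequence) tuple records, and the example file name are assumptions made only to show the pattern.

```python
import gzip
import io
import os
import tempfile


class SimpleFastaReader(object):
    """Hypothetical stand-in for a pbcore.io-style reader (not pbcore code)."""

    def __init__(self, f):
        if isinstance(f, str):
            # Filename: open it ourselves, using gzip for a ".gz" suffix.
            if f.endswith(".gz"):
                self.file = gzip.open(f, "rt")
            else:
                self.file = open(f, "r")
        else:
            # Anything else is assumed to be an open text-mode file handle.
            self.file = f

    def close(self):
        self.file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the "with" block ends, even if it raised.
        self.close()

    def __iter__(self):
        # Streaming access: yield one (name, sequence) record at a time.
        name, chunks = None, []
        for line in self.file:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if name is not None:
            yield name, "".join(chunks)


# Write a small gzipped FASTA file to read back (illustrative data only).
path = os.path.join(tempfile.mkdtemp(), "example.fasta.gz")
with gzip.open(path, "wt") as out:
    out.write(">read1\nGATT\nACA\n>read2\nCCGG\n")

# Construct from a filename; the file is closed when the block exits.
with SimpleFastaReader(path) as reader:
    records = list(reader)

# Construct from an already-open handle instead of a filename.
handle_records = list(SimpleFastaReader(io.StringIO(">r\nAC\n")))
```

Because the constructor branches on the argument type, the same iteration code works whether the caller passes a path or a handle, which is the uniformity the conventions aim for.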
The bas.h5 file format is a container format for PacBio reads, built on top of the HDF5 standard.
Note
In contrast to GFF, for example, the bas.h5 read coordinate system is 0-based and start-inclusive/end-exclusive, i.e. the same convention as Python and the C++ STL.
The cmp.h5 file format is an alignment format built on top of the HDF5 standard. It is a simple container format for PacBio alignment records.
Note
In contrast to GFF, for example, all cmp.h5 coordinate systems (reference, read) are 0-based and start-inclusive/end-exclusive, i.e. the same convention as Python and the C++ STL.
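To make the coordinate convention concrete, the following sketch converts between a 0-based, start-inclusive/end-exclusive interval (as used by bas.h5 and cmp.h5) and the 1-based, fully inclusive interval used by GFF. The helper names to_gff_interval and from_gff_interval are hypothetical and are not part of pbcore.

```python
def to_gff_interval(start, end):
    """Convert a 0-based, end-exclusive interval to 1-based, inclusive."""
    return start + 1, end


def from_gff_interval(start, end):
    """Convert a 1-based, inclusive interval to 0-based, end-exclusive."""
    return start - 1, end


seq = "GATTACA"
start, end = 2, 5

# Half-open coordinates map directly onto Python slicing.
sub = seq[start:end]

# The interval length is simply end - start in the half-open convention.
length = end - start

# The same bases expressed in GFF-style coordinates.
gff = to_gff_interval(start, end)
```

Only the start coordinate shifts by one; the end value is numerically the same in both conventions because an exclusive 0-based end equals an inclusive 1-based end.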
FASTA is a standard format for sequence data.
FASTQ is a standard format for sequence data with associated quality scores.
The GFF format is an open and flexible standard for representing genomic features.