Sequence Data (pyncbitk.objects.seqdata)#

Actual sequence data of biological sequences.

The NCBI C++ Toolkit provides a unified API for encoding the actual data of protein and nucleotide sequences. It supports textual representation, ordinal encoding (replacing A, C, G, T with 1, 2, 3, 4), as well as compressed bitmap representations.
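To give a feel for the trade-off between these representations, the sketch below computes how many bytes a sequence would occupy at a fixed number of bits per residue. It is plain Python and does not call pyncbitk; `storage_bytes` is a hypothetical helper introduced only for illustration.

```python
import math

def storage_bytes(n_residues: int, bits_per_residue: int) -> int:
    """Bytes needed to store n_residues at a fixed width, padding the last byte."""
    return math.ceil(n_residues * bits_per_residue / 8)

sequence = "ATTAGCCATGCATA"  # 14 bases
sizes = {
    "text (8 bits per base)": storage_bytes(len(sequence), 8),
    "4-bit": storage_bytes(len(sequence), 4),
    "2-bit": storage_bytes(len(sequence), 2),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

The narrower encodings save space at the cost of a smaller alphabet, which is why several storage classes exist side by side.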

See also

The Data Model chapter of the NCBI C++ Toolkit documentation.

Base Classes#

class pyncbitk.objects.seqdata.SeqData(Serial)#

An abstract base storage of sequence data.

classmethod __new__(*args, **kwargs)#
__init__(*args, **kwargs)#
__reduce__()#

Helper for pickle.

copy(pack=False)#

Create a copy of the sequence data.

Parameters:

pack (bool) – Whether to perform additional packing of the data stored in the copy, allowing a more compact representation but potentially changing the SeqData subtype.
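One way to picture why packing can change the subtype: a packer may pick the narrowest encoding that still represents every symbol present. The helper below is a hypothetical, pure-Python illustration of that idea; it does not call pyncbitk, and the toolkit's actual packing rules may differ.

```python
def smallest_encoding(sequence: str) -> str:
    """Name the narrowest fixed-width encoding able to hold `sequence`.

    Assumption of this sketch: unambiguous sequences fit in 2 bits per base,
    anything else falls back to 4 bits per base.
    """
    symbols = set(sequence.upper())
    if symbols <= set("ACGT"):
        return "2-bit"
    return "4-bit"

print(smallest_encoding("ATTAGCCATGCATA"))
print(smallest_encoding("ATTNGCCATGCATA"))
```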

class pyncbitk.objects.seqdata.SeqAaData(SeqData)#

An abstract base storage of amino-acid sequence data.

classmethod __new__(*args, **kwargs)#
classmethod encode(data)#

Encode the textual sequence to a compressed representation.

Parameters:

data (str, bytes, or buffer-like object) – The ASCII amino-acid sequence to be encoded. Python strings and any other object supporting the buffer protocol are supported.
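As a rough picture of what accepting "str, bytes, or buffer-like" input means, the sketch below normalizes those input kinds to ASCII bytes. `as_ascii_bytes` is a hypothetical helper written for this illustration and is not part of pyncbitk.

```python
def as_ascii_bytes(data) -> bytes:
    """Return `data` as ASCII bytes, accepting str or any buffer-like object."""
    if isinstance(data, str):
        # Raises UnicodeEncodeError if the text contains non-ASCII characters.
        return data.encode("ascii")
    # memoryview() raises TypeError if the object has no buffer support.
    return bytes(memoryview(data))

for value in ("MKVL", b"MKVL", bytearray(b"MKVL"), memoryview(b"MKVL")):
    print(type(value).__name__, as_ascii_bytes(value))
```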

__reduce__()#

Helper for pickle.

decode()#

Decode the contents of the sequence data.

Returns:

str – The decoded sequence data, as a Python string of NCBI-extended amino-acid symbols.

class pyncbitk.objects.seqdata.SeqNaData(SeqData)#

An abstract base storage of nucleotide sequence data.

classmethod __new__(*args, **kwargs)#
classmethod encode(data)#

Encode the textual sequence to a compressed representation.

Parameters:

data (str, bytes, or buffer-like object) – The ASCII nucleotide sequence to be encoded. Python strings and any other object supporting the buffer protocol are supported.

Raises:

ValueError – When the sequence data contains invalid characters that do not belong to the nucleotide alphabet.
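The kind of check that leads to this error can be sketched in plain Python. The alphabet below is an assumption (the 15 IUPAC nucleotide codes without U), and `check_nucleotides` is a hypothetical helper; the exact set of symbols accepted by the library may differ.

```python
# Assumed alphabet: the four bases plus the IUPAC ambiguity codes.
IUPAC_NA = frozenset("ACGTRYSWKMBDHVN")

def check_nucleotides(sequence: str) -> str:
    """Return `sequence` unchanged, or raise ValueError on foreign symbols."""
    invalid = sorted(set(sequence.upper()) - IUPAC_NA)
    if invalid:
        raise ValueError(f"invalid nucleotide symbols: {''.join(invalid)}")
    return sequence

check_nucleotides("ATTAGCCATGCATA")  # passes silently
try:
    check_nucleotides("ATTXG")
except ValueError as err:
    print(err)
```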

__reduce__()#

Helper for pickle.

decode()#

Decode the contents of the sequence data.

Returns:

str – The decoded sequence data, as a Python string of IUPAC nucleotide symbols.

Nucleotide Data#

class pyncbitk.objects.seqdata.IupacNaData(SeqNaData)#

Nucleotide sequence data stored as an IUPAC nucleotide string.

Example

>>> seqdata = IupacNaData("ATTAGCCATGCATA")
>>> seqdata.length
14
>>> seqdata.data
b'ATTAGCCATGCATA'

classmethod __new__(*args, **kwargs)#
classmethod encode(data)#

Encode the textual sequence to a compressed representation.

Parameters:

data (str, bytes, or buffer-like object) – The ASCII nucleotide sequence to be encoded. Python strings and any other object supporting the buffer protocol are supported.

Raises:

ValueError – When the sequence data contains invalid characters that do not belong to the nucleotide alphabet.

__buffer__(flags, /)#

Return a buffer object that exposes the underlying memory of the object.

__eq__(value, /)#

Return self==value.

__ge__(value, /)#

Return self>=value.

__gt__(value, /)#

Return self>value.

__init__(*args, **kwargs)#
__le__(value, /)#

Return self<=value.

__lt__(value, /)#

Return self<value.

__ne__(value, /)#

Return self!=value.

__reduce_ex__(protocol)#

Helper for pickle.

__repr__()#

Return repr(self).

decode()#

Decode the contents of the sequence data.

Returns:

str – The decoded sequence data, as a Python string of IUPAC nucleotide symbols.

data#

The sequence data as ASCII bytes.

Type:

bytes

length#

The length of the sequence data.

Type:

int

class pyncbitk.objects.seqdata.Ncbi2NaData(SeqNaData)#

Nucleotide sequence data stored with 2-bit encoding.

A nucleic acid containing no ambiguous bases can be encoded using a two-bit encoding per base, representing one of the four nucleobases: A, C, G or T. This encoding is the most compact for unambiguous sequences.
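The packing can be sketched in plain Python without pyncbitk. The sketch assumes the codes A=0, C=1, G=2, T=3, with four bases per byte and the first base in the two highest bits; under those assumptions it reproduces the bytes shown in the example for this class. `pack_2bit` and `unpack_2bit` are hypothetical helpers, not library functions.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack_2bit(sequence: str) -> bytes:
    """Pack an unambiguous nucleotide string at two bits per base."""
    packed = bytearray((len(sequence) + 3) // 4)  # last byte is zero-padded
    for i, base in enumerate(sequence):
        shift = 6 - 2 * (i % 4)  # first base of each byte uses the top bits
        packed[i // 4] |= CODES[base] << shift
    return bytes(packed)

def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover `length` bases; the length must be kept separately."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11]
        for i in range(length)
    )

packed = pack_2bit("ATTAGCCATGCATA")
print(packed)                   # b'<\x94\xe4\xc0'
print(unpack_2bit(packed, 14))  # ATTAGCCATGCATA
```

Because the last byte is padded with zero bits, which also encode A, the packed bytes alone cannot tell how many bases the sequence holds.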

Example

>>> seqdata = Ncbi2NaData.encode("ATTAGCCATGCATA")
>>> seqdata.data
b'<\x94\xe4\xc0'

classmethod __new__(*args, **kwargs)#
__buffer__(flags, /)#

Return a buffer object that exposes the underlying memory of the object.

__init__(*args, **kwargs)#
__reduce_ex__(protocol)#

Helper for pickle.

__repr__()#

Return repr(self).

data#

The sequence data in 2-bit encoded format.

Type:

bytes

class pyncbitk.objects.seqdata.Ncbi4NaData(SeqNaData)#

Nucleotide sequence data stored with 4-bit encoding.
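A four-bit code has room for ambiguity symbols. The sketch below assumes the common bitmask convention (A=1, C=2, G=4, T=8, with an ambiguity code being the union of the bases it stands for) and two bases per byte with the first base in the high nibble. These choices are assumptions of the sketch and are not stated on this page; the helper names are hypothetical.

```python
BITS = {"A": 1, "C": 2, "G": 4, "T": 8}
# Each ambiguity code maps to the bases it may represent.
AMBIGUOUS = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def code_4bit(symbol: str) -> int:
    """Bitmask of the bases that `symbol` can stand for."""
    return sum(BITS[base] for base in AMBIGUOUS.get(symbol, symbol))

def pack_4bit(sequence: str) -> bytes:
    """Pack a nucleotide string at four bits per base."""
    packed = bytearray((len(sequence) + 1) // 2)  # last nibble is zero-padded
    for i, symbol in enumerate(sequence):
        shift = 4 if i % 2 == 0 else 0  # even positions go in the high nibble
        packed[i // 2] |= code_4bit(symbol) << shift
    return bytes(packed)

print(pack_4bit("ACGTN").hex())  # 1248f0
```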

classmethod __new__(*args, **kwargs)#
__buffer__(flags, /)#

Return a buffer object that exposes the underlying memory of the object.

__init__(*args, **kwargs)#
__reduce_ex__(protocol)#

Helper for pickle.

__repr__()#

Return repr(self).

data#

The sequence data in 4-bit encoded format.

Type:

bytes

class pyncbitk.objects.seqdata.Ncbi8NaData(SeqNaData)#

Nucleotide sequence data stored with 8-bit encoding.

classmethod __new__(*args, **kwargs)#
__buffer__(flags, /)#

Return a buffer object that exposes the underlying memory of the object.

__init__(*args, **kwargs)#
__reduce_ex__(protocol)#

Helper for pickle.

__repr__()#

Return repr(self).

data#

The sequence data in 8-bit encoded format.

Type:

bytes

class pyncbitk.objects.seqdata.NcbiPNaData(SeqNaData)#

Nucleotide sequence data storing probabilities for each position.
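As an illustration of what per-position probabilities convey, the sketch below holds one row of probabilities per position and reads off the most likely base. The in-memory layout (a tuple of floats ordered A, C, G, T) is hypothetical and chosen for readability; it is not the storage format of this class, which exposes no data accessor on this page.

```python
BASES = "ACGT"

def most_likely(profile) -> str:
    """Return the highest-probability base at each position."""
    return "".join(
        BASES[max(range(len(BASES)), key=row.__getitem__)] for row in profile
    )

# One row per position: probabilities of A, C, G, T.
profile = [
    (0.70, 0.10, 0.10, 0.10),
    (0.05, 0.05, 0.10, 0.80),
    (0.25, 0.25, 0.40, 0.10),
]
print(most_likely(profile))  # ATG
```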

classmethod __new__(*args, **kwargs)#
__reduce__()#

Helper for pickle.

Protein Data#

class pyncbitk.objects.seqdata.IupacAaData(SeqAaData)#

Amino-acid sequence data stored in an IUPAC-IUB amino-acid string.

The IUPAC-IUB Commission on Biochemical Nomenclature defined a code of one-letter abbreviations for the 20 standard amino acids, as well as symbols for indeterminate and unknown residues.

References

  • IUPAC-IUB Commission on Biochemical Nomenclature. “A One-Letter Notation for Amino Acid Sequences” (1968). Journal of Biological Chemistry, 243(13), 3557–3559. doi:10.1016/S0021-9258(19)34176-6.

classmethod __new__(*args, **kwargs)#
classmethod encode(data)#

Encode the textual sequence to a compressed representation.

Parameters:

data (str, bytes, or buffer-like object) – The ASCII amino-acid sequence to be encoded. Python strings and any other object supporting the buffer protocol are supported.

__buffer__(flags, /)#

Return a buffer object that exposes the underlying memory of the object.

__init__(*args, **kwargs)#
__reduce_ex__(protocol)#

Helper for pickle.

__repr__()#

Return repr(self).

decode()#

Decode the contents of the sequence data.

Returns:

str – The decoded sequence data, as a Python string of NCBI-extended amino-acid symbols.

data#

The sequence data as ASCII bytes.

Type:

bytes

length#

The length of the sequence data.

Type:

int

class pyncbitk.objects.seqdata.Ncbi8AaData(SeqNaData)#

Amino-acid sequence data with support for modified residues.

classmethod __new__(*args, **kwargs)#
__reduce__()#

Helper for pickle.

class pyncbitk.objects.seqdata.NcbiEAaData(IupacAaData)#

Amino-acid sequence data storing an NCBI-extended string.

This representation adds symbols for the non-standard selenocysteine amino-acid (U) as well as support for termination or gap characters.

classmethod __new__(*args, **kwargs)#
classmethod encode(data)#

Encode the textual sequence to a compressed representation.

Parameters:

data (str, bytes, or buffer-like object) – The ASCII amino-acid sequence to be encoded. Python strings and any other object supporting the buffer protocol are supported.

__buffer__(flags, /)#

Return a buffer object that exposes the underlying memory of the object.

__init__(*args, **kwargs)#
__reduce_ex__(protocol)#

Helper for pickle.

__repr__()#

Return repr(self).

decode()#

Decode the contents of the sequence data.

Returns:

str – The decoded sequence data, as a Python string of NCBI-extended amino-acid symbols.

data#

The sequence data as ASCII bytes.

Type:

bytes

length#

The length of the sequence data.

Type:

int

class pyncbitk.objects.seqdata.NcbiPAaData(SeqNaData)#

Amino-acid sequence data storing probabilities for each position.

classmethod __new__(*args, **kwargs)#
__reduce__()#

Helper for pickle.

class pyncbitk.objects.seqdata.NcbiStdAa(SeqNaData)#

Amino-acid sequence data stored as ordinal encoding.

This encoding represents the NCBI-extended amino-acids as consecutive integer values, starting with 0 for the gap character.
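An ordinal encoding of this kind amounts to a lookup in a fixed symbol table. The sketch below is plain Python and does not call pyncbitk; the symbol order in `NCBI_STDAA` is an assumption of the sketch (only the gap at index 0 is stated above), and the helper names are hypothetical.

```python
# Assumed symbol order, with the gap character at index 0.
NCBI_STDAA = "-ABCDEFGHIKLMNPQRSTVWXYZU*OJ"

def to_ordinal(sequence: str) -> bytes:
    """Map each symbol to its index; unknown symbols raise ValueError."""
    return bytes(NCBI_STDAA.index(symbol) for symbol in sequence)

def from_ordinal(codes: bytes) -> str:
    """Map each index back to its symbol."""
    return "".join(NCBI_STDAA[code] for code in codes)

codes = to_ordinal("MKV-")
print(list(codes))          # [12, 10, 19, 0]
print(from_ordinal(codes))  # MKV-
```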

classmethod __new__(*args, **kwargs)#
__reduce__()#

Helper for pickle.

Gaps#

class pyncbitk.objects.seqdata.GapData(SeqData)#

A virtual sequence data storage representing a gap.

classmethod __new__(*args, **kwargs)#
__reduce__()#

Helper for pickle.