BLAST with the Object Manager#

In the previous section, we introduced how to run BLAST queries using by loading the data in BioSeq objects, and then using the Blast.run to get alignments. This approach presents limitations: every sequence we pass to BLAST needs to be a ContinuousInst, and we cannot use the full flexibility offered by the NCBI C++ Toolkit data model.

The object manager and scopes#

The object manager is a singleton that lives for the whole process that imports pyncbitk. It enables the creation of scopes, which allows the management of objects for a given code region. In Python, scopes implement the context manager protocol, which allows controlling when objects can be safely discarded when they go out of scope. Scopes can be created from an ObjectManager object, which can always be instantiated for free:

[1]:

from pyncbitk.objmgr import ObjectManager

with ObjectManager().scope() as scope:
    pass

The add method of the Scope object can be used to add a BioSeq to the current scope. Let’s create a test sequence we can add to the scope.

[2]:

from pyncbitk.objects.seq import BioSeq
from pyncbitk.objects.seqid import LocalId
from pyncbitk.objects.general import ObjectId
from pyncbitk.objects.seqdata import IupacNaData
from pyncbitk.objects.seqinst import ContinuousInst

inst = ContinuousInst(IupacNaData("ATGC"))
seqid = LocalId(ObjectId("seq1"))
seq = BioSeq(inst, seqid)

[3]:

with ObjectManager().scope() as scope:
    handle = scope.add_bioseq(seq)
    print(handle)

<pyncbitk.objmgr.BioSeqHandle object at 0x77b0fc2ee070>

The resulting object we get after adding a BioSeq to a scope is a BioSeqHandle: a handle which acts as a proxy to a BioSeq, allowing to retrieve the attributes of a sequence managed by the scope. Note that you shouldn’t use a handle obtained from a scope after that scope was closed, as the data may have been released already. Just like proxy and ref objects in Python, calling the handle will load the original sequence.

[4]:

with ObjectManager().scope() as scope:
    handle = scope.add_bioseq(seq)
    print(handle())

BioSeq(ContinuousInst(IupacNaData(b'ATGC'), strandedness='double', molecule='dna', length=4), LocalId(ObjectId('seq1')))

You can also obtain a BioSeqHandle by indexing the scope with a SeqId object, which allows retrieving data in a scope.

[5]:

with ObjectManager().scope() as scope:
    scope.add_bioseq(seq)

    handle = scope[seqid]
    print(handle())

BioSeq(ContinuousInst(IupacNaData(b'ATGC'), strandedness='double', molecule='dna', length=4), LocalId(ObjectId('seq1')))

Efficient BLAST data storage#

The main advantage of using the object manager for BLAST queries is that the sequence storage is much more flexible. In the previous example, we loaded the BioSeq using a FastaReader with the split=False parameter. This caused the FastaReader to only produce sequences where the instance was a ContinuousInst, i.e. a single long sequence encoded in an ASCII string, potentially wasting space for unknown regions. However, now that we are gonna use the object manager, we can load the sequence in the best instance type:

[6]:

import pathlib
from pyncbitk.objtools import FastaReader

k12 = FastaReader(pathlib.Path("data").joinpath("LN832404.fna")).read()
o157 = FastaReader(pathlib.Path("data").joinpath("AE014075.fna")).read()

type(k12.instance)

[6]:

pyncbitk.objects.seqinst.DeltaInst

The BioSeq we just loaded is now instantiated as a DeltaInst, which combines different sequence blocks that can each be in different encodings. This way, the sequence regions without ambiguity can be stored in Ncbi2NaData objects with 2-bit encoding, effectively saving space and improving data throughput.

Scoped data queries with BLAST#

Let’s make a global scope for the rest of this example, and add our sequences to it:

[7]:

scope = ObjectManager().scope()
scope.add_bioseq(k12)
scope.add_bioseq(o157)

[7]:

<pyncbitk.objmgr.BioSeqHandle at 0x77b0fc2cb0c0>

Once we have added the sequence data to the scope, we can refer to our query and subject sequences not in term of BioSeq objects, which actually contain the sequence data, but simply in terms of sequence location (SeqLoc objects) which describe a location in a sequence found in the scope. To use the two complete sequences, we can describe the locations as WholeSeqLoc objects:

[8]:

from pyncbitk.objects.seqloc import WholeSeqLoc
query_loc = WholeSeqLoc(k12.id)
subject_loc = WholeSeqLoc(o157.id)

When we have the locations of our sequences, we need to wrap them into a pyncbitk.algo.blast.SearchQuery, a dedicated object which stores a sequence location and the scope where this location can be resolved. If you have more than one query or subject, you can in turn wrap them into a pyncbitk.algo.blast.SearchQueryVector:

[9]:

from pyncbitk.algo.blast import SearchQuery
query_sq = SearchQuery(query_loc, scope)
subject_sq = SearchQuery(subject_loc, scope)

You can now run the BLAST search the same as before, passing SearchQuery and SearchQueryVector instead of BioSeq and BioSeqSet to use scoped data:

[10]:

from pyncbitk.algo.blast import BlastN
blastn = BlastN(evalue=1e-5)
results = blastn.run(query_sq, subject_sq)
print(len(results[0].alignments))

Building complex queries#

Now that the data is loaded in the local scope, it is really easy to run more complex queries about regions of each sequence without having to copy data around. For instance, to limit the homology search to the first 10,000 bases of each sequences, we could use a SeqIntervalLoc:

[11]:

from pyncbitk.objects.seqloc import SeqIntervalLoc
query_loc = SeqIntervalLoc(k12.id, 0, 9999)
subject_loc = SeqIntervalLoc(o157.id, 0, 9999)
results = blastn.run(SearchQuery(query_loc, scope), SearchQuery(subject_loc, scope))
print(len(results[0].alignments))

Otherwise, we can also create a sequence that references the K12 genome without having to copy the data around. For instance, thrL, the Threonine operon attenuator, is located at bases 190 to 255 of the K12 genome. To create a sequence referencing another sequence data, we can create a BioSeq with a RefInst instance, which reference a sequence at a given location.

[12]:

from pyncbitk.objects.seqinst import RefInst
thrl_id = LocalId(ObjectId("thrL"))
thrl = BioSeq(RefInst(SeqIntervalLoc(k12.id, 189, 254)), thrl_id)
scope.add_bioseq(thrl)

[12]:

<pyncbitk.objmgr.BioSeqHandle at 0x77b0fc2eed90>

Then we can use the whole thrL sequence as a query, without having to copy the data internally:

[13]:

query_loc = WholeSeqLoc(thrl.id)
subject_loc = WholeSeqLoc(o157.id)
results = blastn.run(SearchQuery(query_loc, scope), SearchQuery(subject_loc, scope))
for alignment in results[0].alignments:
    print(alignment[0].id, alignment[1].id, alignment.evalue, alignment.bitscore, alignment.matches, alignment.percent_identity, alignment.alignment_length, sep="\t")

thrL    AE014075.1      1.4299475982740963e-28  120.30864505849613      66      100.0   66