My first pairwise BLASTn#

The NCBI C++ Toolkit is a rich library that contains a comprehensive data model and interfaces to several sequence analysis methods. The most important one is BLAST, the Basic Local Alignment Search Tool, which allows retrieving sequences similar to a query in a database of sequences, among other things. The PyNCBItk library implements a high-level interface to the BLAST methods (such as blastn, blastp, etc.) in the pyncbitk.algo.blast module.

[1]:

import pyncbitk
pyncbitk.__version__

[1]:

'0.1.0-alpha.5'

One of the easiest application of BLAST is running a pairwise nucleotide BLAST. It takes a query sequence, a subject sequence, and returns the local alignments between these two sequences. Being one of the most common analyses, it should not be too hard to set-up, right?

Getting sequence data#

Let’s start by downloading some data, for instance two related genomes of Escherichia coli, of strains K12 and K.

[2]:

import urllib.request
import shutil
import pathlib

genomes = {
    "LN832404": 802133627,  # Escherichia coli K-12
    "AE014075": 26111730,   # Escherichia coli O157
}

for accession, id_ in genomes.items():
    with urllib.request.urlopen(f"https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?save=file&db=nuccore&report=fasta&id={id_}") as res:
        with pathlib.Path("data").joinpath(f"{accession}.fna").open("wb") as dst:
            shutil.copyfileobj(res, dst)

Now that we have two FASTA files, we should be able to run our blastn query. With the command line, we’d run something like:

$ blastn -query data/LN832404.fna -subject data/AE014075.fna

However, with the NCBI C++ Toolkit and PyNCBItk, we cannot simply use filenames: we need to load data first.

Loading data with the NCBI Toolkit#

Since our sequences are in FASTA format, we can use the FastaReader class from the pyncbitk.objtools module to load the sequences into Python objects. Let’s use the FastaReader, and ignore the split=False argument for now (don’t worry, I will explain later).

[3]:

from pyncbitk.objtools import FastaReader

k12 = FastaReader(pathlib.Path("data").joinpath("LN832404.fna"), split=False).read()
o157 = FastaReader(pathlib.Path("data").joinpath("AE014075.fna"), split=False).read()

The two objects we just loaded are instances of the BioSeq class. They store the identifier(s) and the sequence data for each of these two sequences.

Setting up the BLAST query#

Now that the data has been loaded, we can prepare a BLASTn runner, and configure it with the same configuration values as the command line:

[4]:

from pyncbitk.algo.blast import BlastN
blastn = BlastN(evalue=1e-5)

Running the BLAST query#

Our BLASTn runner is now configured! All we have to do is run the search: for this, we can use the run method common to all Blast runner objects.

[5]:

results = blastn.run(k12, o157)

Great! Our search succeeded. However… What exactly is in these results? The BLAST binaries in the command line let us configure the output, or redirects the alignments to the standard output. Here, none of that happened. So what exactly are those results?

[6]:

type(results)

[6]:

pyncbitk.algo.blast.SearchResultsSet

Analyzing the BLAST results#

Blast.run always returns a SearchResultSet, which is a list of SearchResults objects, with at most one for each query. Since we only had one query, we can just use the first (and only) element.

[7]:

result = results[0]
type(result)

[7]:

pyncbitk.algo.blast.SearchResults

We now have a single SearchResults, which contains the results for our query. We can check the identifier of the query with the query_id attribute, and the alignments with the alignments attribute:

[8]:

print("Query:", result.query_id)
print("Alignments:", len(result.alignments))

Query: LN832404.1
Alignments: 17578

Each SeqAlign objects in the alignments have various attributes that can be used to display the identifiers of the query and target sequences, the bit-score and E-value of the BLAST alignment:

[9]:

from itertools import islice
for alignment in islice(result.alignments, 10):  # show the first 10 alignments
    print(alignment[0].id, alignment[1].id, alignment.evalue, alignment.bitscore, alignment.matches, alignment.percent_identity, alignment.alignment_length, sep="\t")

LN832404.1      AE014075.1      0.0     124555.46097493326      71158   98.08132322536181       72550
LN832404.1      AE014075.1      0.0     121173.24278844919      69295   98.05433705957266       70670
LN832404.1      AE014075.1      0.0     105728.2906913323       60559   97.92852522639069       61840
LN832404.1      AE014075.1      0.0     82750.66711197387       47362   97.96467132751417       48346
LN832404.1      AE014075.1      0.0     77825.66691613918       45027   97.30097674820641       46276
LN832404.1      AE014075.1      0.0     76233.29226475798       43942   97.54051054384017       45050
LN832404.1      AE014075.1      0.0     66291.32006423191       37901   98.03419466645974       38661
LN832404.1      AE014075.1      0.0     64285.072272995705      36858   97.86522223992353       37662
LN832404.1      AE014075.1      0.0     62406.86366663838       36396   96.85969767936982       37576
LN832404.1      AE014075.1      0.0     59850.58839106325       34281   97.94851281465185       34999

Note that we didn’t get the respective coordinates in the query and subject sequences. That’s because the alignment format of the core object model (the SeqAlign class) actually stores the data in a compact form, which makes it more complicated to extract coordinates. Luckily, a dedicated class for that is available in the objtools module, the AlignMap:

[10]:

from itertools import islice
from pyncbitk.objtools import AlignMap
for alignment in islice(result.alignments, 5):  # show the first 5 alignments
    alimap = AlignMap(alignment.segments)
    print(alignment[0].id, alignment[1].id, alimap[0].sequence_start, alimap[0].sequence_stop, alimap[1].sequence_start, alimap[1].sequence_stop, sep="\t")

LN832404.1      AE014075.1      3423680 3496202 3858536 3931083
LN832404.1      AE014075.1      1140100 1210732 1241057 1311709
LN832404.1      AE014075.1      1890622 1952452 2038465 2100294
LN832404.1      AE014075.1      3905052 3953346 4409571 4457913
LN832404.1      AE014075.1      943396  989608  954985  1001251

Good job, you just ran your first pairwise BLAST!