My first pairwise BLASTn#
The NCBI C++ Toolkit is a rich library that contains a comprehensive data model and interfaces to several sequence analysis methods. The most important one is BLAST, the Basic Local Alignment Search Tool, which allows retrieving sequences similar to a query in a database of sequences, among other things. The PyNCBItk library implements a high-level interface to the BLAST methods (such as blastn, blastp, etc.) in the
pyncbitk.algo.blast module.
[1]:
import pyncbitk
pyncbitk.__version__
[1]:
'0.1.0-alpha.5'
One of the easiest application of BLAST is running a pairwise nucleotide BLAST. It takes a query sequence, a subject sequence, and returns the local alignments between these two sequences. Being one of the most common analyses, it should not be too hard to set-up, right?
Getting sequence data#
Let’s start by downloading some data, for instance two related genomes of Escherichia coli, of strains K12 and K.
[2]:
import urllib.request
import shutil
import pathlib
genomes = {
"LN832404": 802133627, # Escherichia coli K-12
"AE014075": 26111730, # Escherichia coli O157
}
for accession, id_ in genomes.items():
with urllib.request.urlopen(f"https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?save=file&db=nuccore&report=fasta&id={id_}") as res:
with pathlib.Path("data").joinpath(f"{accession}.fna").open("wb") as dst:
shutil.copyfileobj(res, dst)
Now that we have two FASTA files, we should be able to run our blastn query. With the command line, we’d run something like:
$ blastn -query data/LN832404.fna -subject data/AE014075.fna
However, with the NCBI C++ Toolkit and PyNCBItk, we cannot simply use filenames: we need to load data first.
Loading data with the NCBI Toolkit#
Since our sequences are in FASTA format, we can use the FastaReader class from the pyncbitk.objtools module to load the sequences into Python objects. Let’s use the FastaReader, and ignore the split=False argument for now (don’t worry, I will explain later).
[3]:
from pyncbitk.objtools import FastaReader
k12 = FastaReader(pathlib.Path("data").joinpath("LN832404.fna"), split=False).read()
o157 = FastaReader(pathlib.Path("data").joinpath("AE014075.fna"), split=False).read()
The two objects we just loaded are instances of the BioSeq class. They store the identifier(s) and the sequence data for each of these two sequences.
Setting up the BLAST query#
Now that the data has been loaded, we can prepare a BLASTn runner, and configure it with the same configuration values as the command line:
[4]:
from pyncbitk.algo.blast import BlastN
blastn = BlastN(evalue=1e-5)
Running the BLAST query#
Our BLASTn runner is now configured! All we have to do is run the search: for this, we can use the run method common to all Blast runner objects.
[5]:
results = blastn.run(k12, o157)
Great! Our search succeeded. However… What exactly is in these results? The BLAST binaries in the command line let us configure the output, or redirects the alignments to the standard output. Here, none of that happened. So what exactly are those results?
[6]:
type(results)
[6]:
pyncbitk.algo.blast.SearchResultsSet
Analyzing the BLAST results#
Blast.run always returns a SearchResultSet, which is a list of SearchResults objects, with at most one for each query. Since we only had one query, we can just use the first (and only) element.
[7]:
result = results[0]
type(result)
[7]:
pyncbitk.algo.blast.SearchResults
We now have a single SearchResults, which contains the results for our query. We can check the identifier of the query with the query_id attribute, and the alignments with the alignments attribute:
[8]:
print("Query:", result.query_id)
print("Alignments:", len(result.alignments))
Query: LN832404.1
Alignments: 17578
Each SeqAlign objects in the alignments have various attributes that can be used to display the identifiers of the query and target sequences, the bit-score and E-value of the BLAST alignment:
[9]:
from itertools import islice
for alignment in islice(result.alignments, 10): # show the first 10 alignments
print(alignment[0].id, alignment[1].id, alignment.evalue, alignment.bitscore, alignment.matches, alignment.percent_identity, alignment.alignment_length, sep="\t")
LN832404.1 AE014075.1 0.0 124555.46097493326 71158 98.08132322536181 72550
LN832404.1 AE014075.1 0.0 121173.24278844919 69295 98.05433705957266 70670
LN832404.1 AE014075.1 0.0 105728.2906913323 60559 97.92852522639069 61840
LN832404.1 AE014075.1 0.0 82750.66711197387 47362 97.96467132751417 48346
LN832404.1 AE014075.1 0.0 77825.66691613918 45027 97.30097674820641 46276
LN832404.1 AE014075.1 0.0 76233.29226475798 43942 97.54051054384017 45050
LN832404.1 AE014075.1 0.0 66291.32006423191 37901 98.03419466645974 38661
LN832404.1 AE014075.1 0.0 64285.072272995705 36858 97.86522223992353 37662
LN832404.1 AE014075.1 0.0 62406.86366663838 36396 96.85969767936982 37576
LN832404.1 AE014075.1 0.0 59850.58839106325 34281 97.94851281465185 34999
Note that we didn’t get the respective coordinates in the query and subject sequences. That’s because the alignment format of the core object model (the SeqAlign class) actually stores the data in a compact form, which makes it more complicated to extract coordinates. Luckily, a dedicated class for that is available in the objtools module, the AlignMap:
[10]:
from itertools import islice
from pyncbitk.objtools import AlignMap
for alignment in islice(result.alignments, 5): # show the first 5 alignments
alimap = AlignMap(alignment.segments)
print(alignment[0].id, alignment[1].id, alimap[0].sequence_start, alimap[0].sequence_stop, alimap[1].sequence_start, alimap[1].sequence_stop, sep="\t")
LN832404.1 AE014075.1 3423680 3496202 3858536 3931083
LN832404.1 AE014075.1 1140100 1210732 1241057 1311709
LN832404.1 AE014075.1 1890622 1952452 2038465 2100294
LN832404.1 AE014075.1 3905052 3953346 4409571 4457913
LN832404.1 AE014075.1 943396 989608 954985 1001251
Good job, you just ran your first pairwise BLAST!