{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a73d2428-1397-4ad4-952b-2703641918ee",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "# My first pairwise BLASTn"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51199bc0-f0d7-4d27-a039-d3fdd939cc8e",
   "metadata": {},
   "source": [
    "The NCBI C++ Toolkit is a rich library that provides a comprehensive data model and interfaces to several sequence analysis methods. The most important of these is [BLAST](https://en.wikipedia.org/wiki/BLAST_(biotechnology)), the Basic Local Alignment Search Tool, which, among other things, retrieves sequences similar to a query from a sequence database. The PyNCBItk library implements a high-level interface to the BLAST methods (such as `blastn`, `blastp`, etc.) in the `pyncbitk.algo.blast` module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82cd73f6-86d8-4e36-b757-ef2f7b1f68b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyncbitk\n",
    "pyncbitk.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db311999-60b5-4c5c-b618-847cd3e7c88c",
   "metadata": {},
   "source": [
    "One of the simplest applications of BLAST is a pairwise nucleotide BLAST. It takes a query sequence and a subject sequence, and returns the local alignments between the two. Being one of the most common analyses, it should not be too hard to set up, right?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cd0aa5e",
   "metadata": {},
   "source": [
    "## Getting sequence data\n",
    "\n",
    "Let's start by downloading some data, for instance two related genomes of *Escherichia coli*, from strains K-12 and O157."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7300e447-9d96-4e58-ac98-1989ae1cd5f7",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "import shutil\n",
    "import pathlib\n",
    "\n",
    "# accession -> numeric identifier passed to the download URL\n",
    "genomes = {\n",
    "    \"LN832404\": 802133627,  # Escherichia coli K-12\n",
    "    \"AE014075\": 26111730,   # Escherichia coli O157\n",
    "}\n",
    "\n",
    "# make sure the output folder exists before writing into it\n",
    "data = pathlib.Path(\"data\")\n",
    "data.mkdir(exist_ok=True)\n",
    "\n",
    "for accession, id_ in genomes.items():\n",
    "    url = f\"https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?save=file&db=nuccore&report=fasta&id={id_}\"\n",
    "    with urllib.request.urlopen(url) as res:\n",
    "        with data.joinpath(f\"{accession}.fna\").open(\"wb\") as dst:\n",
    "            shutil.copyfileobj(res, dst)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "247c8351-3c5d-430c-9da1-96ee0f7ac170",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Now that we have two FASTA files, we should be able to run our `blastn` query. With the command line, we'd run something like:\n",
    "```console\n",
    "$ blastn -query data/LN832404.fna -subject data/AE014075.fna\n",
    "```\n",
    "\n",
    "However, with the NCBI C++ Toolkit and PyNCBItk, we cannot simply use filenames: we need to load data first. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6814744d",
   "metadata": {},
   "source": [
    "## Loading data with the NCBI Toolkit\n",
    "\n",
    "Since our sequences are in FASTA format, we can use the `FastaReader` class from the `pyncbitk.objtools` module to load the sequences into Python objects. Let's use the `FastaReader`, and ignore the `split=False` argument for now (don't worry, I will explain later)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a33937b-e07d-4276-a61c-33c95ea08f49",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pyncbitk.objtools import FastaReader\n",
    "\n",
    "k12 = FastaReader(pathlib.Path(\"data\").joinpath(\"LN832404.fna\"), split=False).read()\n",
    "o157 = FastaReader(pathlib.Path(\"data\").joinpath(\"AE014075.fna\"), split=False).read()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c7c2a07-1e4a-47ee-80ae-358b5fbd00f3",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "The two objects we just loaded are instances of the `BioSeq` class. They store the identifier(s) and the sequence data for each of these two sequences. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc40219e",
   "metadata": {},
   "source": [
    "## Setting up the BLAST query\n",
    "\n",
    "Now that the data has been loaded, we can prepare a BLASTn runner and configure it with keyword arguments. Here, `evalue` plays the same role as the `-evalue` option of the command-line `blastn`, and keeps only alignments with an E-value of at most 1e-5:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b68e7442-f848-432f-8a71-5e12b90e4c51",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pyncbitk.algo.blast import BlastN\n",
    "blastn = BlastN(evalue=1e-5)"
   ]
  },
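  {
   "cell_type": "markdown",
   "id": "3f9a6b2c",
   "metadata": {},
   "source": [
    "To get a feel for what `evalue=1e-5` means, here is a rough worked estimate based on the standard Karlin-Altschul relation between bit-score and E-value. The sequence lengths are assumed round figures (about 5 million bases each), not the exact lengths of the genomes downloaded above, and the length corrections that BLAST applies internally are ignored:\n",
    "\n",
    "$$E = m \\cdot n \\cdot 2^{-S'} \\le 10^{-5} \\iff S' \\ge \\log_2\\frac{m \\cdot n}{10^{-5}} = \\log_2\\frac{(5 \\times 10^{6})^{2}}{10^{-5}} \\approx 61$$\n",
    "\n",
    "Here $m$ and $n$ are the query and subject lengths, and $S'$ is the bit-score. Under these assumptions, only alignments scoring roughly 61 bits or more would pass the threshold."
   ]
  },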
  {
   "cell_type": "markdown",
   "id": "4e323b24-eaf3-479a-98ca-f3f7bf7696fa",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "## Running the BLAST query\n",
    "\n",
    "Our BLASTn runner is now configured! All we have to do is run the search: for this, we can use the `run` method common to all `Blast` runner objects. \n",
    "\n",
    "<!-- using a dedicated object implementing [Command pattern](https://en.wikipedia.org/wiki/Command_pattern) allows reusing a BLASTn configuration for different query/subject pairs, which ensure they are always using the same parameters. In particular, `BlastN` instances are thread-safe, and the `BlastN.run -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71b05f4f-a078-40ea-819b-7d44222c0421",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "results = blastn.run(k12, o157)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa577c47-5a2e-49f9-9c7d-586a960281ee",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Great! Our search succeeded. However... what exactly is in these `results`? On the command line, the BLAST binaries let us configure the output format, and print the alignments to the standard output by default. Here, none of that happened. So what did we get back?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79d69708-e125-4a92-9c10-3159148a4ed2",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "type(results)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9987f81e-c82a-469d-b4ab-a24f8282a340",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "## Analyzing the BLAST results\n",
    "\n",
    "`Blast.run` always returns a `SearchResultSet`, which is a list of `SearchResults` objects, with at most one for each query. Since we only had one query, we can just use the first (and only) element."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0c5e66c-9c1f-407a-89d2-a54170a83bbf",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "result = results[0]\n",
    "type(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "398a96c3-1a24-4bc0-9f25-72e8ada44567",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "We now have a single `SearchResults`, which contains the results for our query. We can check the identifier of the query with the `query_id` attribute, and the alignments with the `alignments` attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8314fe6-2865-46b1-9617-bb73fc00e03e",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(\"Query:\", result.query_id)\n",
    "print(\"Alignments:\", len(result.alignments))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc6968a4-a1bb-495c-bdc0-b6bda31d2d4e",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Each `SeqAlign` object in the alignments has various attributes that can be used to display the identifiers of the query and target sequences, as well as the bit-score and E-value of the BLAST alignment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27fcbc81-6fae-47a7-ad70-cbdf32be2e8a",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from itertools import islice\n",
    "\n",
    "# columns: query id, target id, E-value, bit-score, matches, percent identity, alignment length\n",
    "for alignment in islice(result.alignments, 10):  # show the first 10 alignments\n",
    "    print(alignment[0].id, alignment[1].id, alignment.evalue, alignment.bitscore, alignment.matches, alignment.percent_identity, alignment.alignment_length, sep=\"\\t\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3b9b2e8",
   "metadata": {},
   "source": [
    "Note that we didn't get the coordinates of each alignment in the query and subject sequences. That's because the alignment format of the core object model (the `SeqAlign` class) stores the data in a compact form, which makes it more complicated to extract coordinates. Luckily, the `objtools` module provides a dedicated class for that, the `AlignMap`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "024b53af",
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import islice\n",
    "from pyncbitk.objtools import AlignMap\n",
    "for alignment in islice(result.alignments, 5):  # show the first 5 alignments\n",
    "    alimap = AlignMap(alignment.segments)\n",
    "    print(alignment[0].id, alignment[1].id, alimap[0].sequence_start, alimap[0].sequence_stop, alimap[1].sequence_start, alimap[1].sequence_stop, sep=\"\\t\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8044db0-2d54-40c0-bf6b-486947965bf5",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Good job, you just ran your first pairwise BLAST!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}