{ "cells": [ { "cell_type": "markdown", "id": "af18933e-ecb9-40ea-a582-ea54a3f1df58", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# BLAST with the Object Manager\n", "\n", "In the previous section, we introduced how to run BLAST queries using by loading the data in `BioSeq` objects, and then using the `Blast.run` to get alignments. This approach presents limitations: every sequence we pass to BLAST needs to be a `ContinuousInst`, and we cannot use the full flexibility offered by the NCBI C++ Toolkit data model. " ] }, { "cell_type": "markdown", "id": "9892fddd-8110-44e6-a402-a39cf8882b79", "metadata": { "editable": true, "raw_mimetype": "", "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## The object manager and scopes\n", "\n", "The object manager is a *singleton* that lives for the whole process that imports `pyncbitk`. It enables the creation of *scopes*, which allows the management of objects for a given code region. In Python, scopes implement the context manager protocol, which allows controlling when objects can be safely discarded when they go out of scope. Scopes can be created from an `ObjectManager` object, which can always be instantiated for free:" ] }, { "cell_type": "code", "execution_count": null, "id": "b2872300-ff03-4a3b-8746-3b4d338cc2ea", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "from pyncbitk.objmgr import ObjectManager\n", "\n", "with ObjectManager().scope() as scope:\n", " pass" ] }, { "cell_type": "markdown", "id": "e07589b9-24b4-4b7a-8372-a295503185ad", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The `add` method of the `Scope` object can be used to add a `BioSeq` to the current scope. Let's create a test sequence we can add to the scope." ] }, { "cell_type": "code", "execution_count": null, "id": "d7bcabfc-0be0-46df-88d1-8ad23c2e0285", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "from pyncbitk.objects.seq import BioSeq\n", "from pyncbitk.objects.seqid import LocalId\n", "from pyncbitk.objects.general import ObjectId\n", "from pyncbitk.objects.seqdata import IupacNaData\n", "from pyncbitk.objects.seqinst import ContinuousInst\n", "\n", "inst = ContinuousInst(IupacNaData(\"ATGC\"))\n", "seqid = LocalId(ObjectId(\"seq1\"))\n", "seq = BioSeq(inst, seqid)" ] }, { "cell_type": "code", "execution_count": null, "id": "bb7c1a75-d5cb-4920-a45f-a698f48f2c06", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "with ObjectManager().scope() as scope:\n", " handle = scope.add_bioseq(seq)\n", " print(handle)" ] }, { "cell_type": "markdown", "id": "dc406dc8-ddd8-4613-84b8-36ed4698ace8", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The resulting object we get after adding a `BioSeq` to a scope is a `BioSeqHandle`: a handle which acts as a proxy to a `BioSeq`, allowing to retrieve the attributes of a sequence managed by the scope. Note that you shouldn't use a handle obtained from a scope after that scope was closed, as the data may have been released already. Just like proxy and `ref` objects in Python, calling the handle will load the original sequence." ] }, { "cell_type": "code", "execution_count": null, "id": "0d3afb9f-d982-41c4-9701-de492dc1e766", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "with ObjectManager().scope() as scope:\n", " handle = scope.add_bioseq(seq)\n", " print(handle())" ] }, { "cell_type": "markdown", "id": "5eab37ae-766f-4f8c-8787-c27e55b56d99", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "You can also obtain a `BioSeqHandle` by indexing the scope with a `SeqId` object, which allows retrieving data in a scope." ] }, { "cell_type": "code", "execution_count": null, "id": "04846068-ff02-4572-abd7-dfacf593ce6e", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "with ObjectManager().scope() as scope:\n", " scope.add_bioseq(seq)\n", "\n", " handle = scope[seqid]\n", " print(handle())" ] }, { "cell_type": "markdown", "id": "565171c8-1b67-4a71-b3f4-c2929d5a8ca2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Efficient BLAST data storage" ] }, { "cell_type": "markdown", "id": "12bb8b9e-5f62-41ff-8387-3a744c9ed44b", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The main advantage of using the object manager for BLAST queries is that the sequence storage is much more flexible. In the previous example, we loaded the `BioSeq` using a `FastaReader` with the `split=False` parameter. This caused the `FastaReader` to only produce sequences where the instance was a `ContinuousInst`, i.e. a single long sequence encoded in an ASCII string, potentially wasting space for unknown regions. However, now that we are gonna use the object manager, we can load the sequence in the best instance type:" ] }, { "cell_type": "code", "execution_count": null, "id": "23086b23-4d29-4f72-9d9f-f5934fec6aa2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import pathlib\n", "from pyncbitk.objtools import FastaReader\n", "\n", "k12 = FastaReader(pathlib.Path(\"data\").joinpath(\"LN832404.fna\")).read()\n", "o157 = FastaReader(pathlib.Path(\"data\").joinpath(\"AE014075.fna\")).read()\n", "\n", "type(k12.instance)" ] }, { "cell_type": "markdown", "id": "402b3857-d057-4999-bcbd-7f3b0a08c68f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The `BioSeq` we just loaded is now instantiated as a `DeltaInst`, which combines different sequence blocks that can each be in different encodings. This way, the sequence regions without ambiguity can be stored in `Ncbi2NaData` objects with 2-bit encoding, effectively saving space and improving data throughput." ] }, { "cell_type": "markdown", "id": "5e5cc1ee-dd7b-46e1-a285-261aceb659ed", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Scoped data queries with BLAST" ] }, { "cell_type": "markdown", "id": "3893a24a-25d5-4a13-a265-437f594a7e2c", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Let's make a global scope for the rest of this example, and add our sequences to it:" ] }, { "cell_type": "code", "execution_count": null, "id": "f1bc047f-ba32-4570-a08a-5a2c5a872d62", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "scope = ObjectManager().scope()\n", "scope.add_bioseq(k12)\n", "scope.add_bioseq(o157)" ] }, { "cell_type": "markdown", "id": "d246fd78-08fd-459c-9c4f-f37f699d4b56", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Once we have added the sequence data to the scope, we can refer to our query and subject sequences not in term of `BioSeq` objects, which actually contain the sequence data, but simply in terms of sequence location (`SeqLoc` objects) which describe a location in a sequence found in the scope. To use the two complete sequences, we can describe the locations as `WholeSeqLoc` objects:" ] }, { "cell_type": "code", "execution_count": null, "id": "12acea2b-a56b-4944-bd21-3a9923a8ae27", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "from pyncbitk.objects.seqloc import WholeSeqLoc\n", "query_loc = WholeSeqLoc(k12.id)\n", "subject_loc = WholeSeqLoc(o157.id)" ] }, { "cell_type": "markdown", "id": "e8645e71-5ca4-4843-85db-ba324ffbdb02", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "When we have the locations of our sequences, we need to wrap them into a `pyncbitk.algo.blast.SearchQuery`, a dedicated object which stores a sequence location and the scope where this location can be resolved. If you have more than one query or subject, you can in turn wrap them into a `pyncbitk.algo.blast.SearchQueryVector`:" ] }, { "cell_type": "code", "execution_count": null, "id": "ed9457cc-5ccd-422c-8d87-cf26136c31d7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "from pyncbitk.algo.blast import SearchQuery\n", "query_sq = SearchQuery(query_loc, scope)\n", "subject_sq = SearchQuery(subject_loc, scope)" ] }, { "cell_type": "markdown", "id": "8fb2191a-7446-45ba-8a4c-619406d17bb4", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "You can now run the BLAST search the same as before, passing `SearchQuery` and `SearchQueryVector` instead of `BioSeq` and `BioSeqSet` to use scoped data:" ] }, { "cell_type": "code", "execution_count": null, "id": "eb893aa0-ef0f-4082-a0b0-a9e8c351374a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "from pyncbitk.algo.blast import BlastN\n", "blastn = BlastN(evalue=1e-5)\n", "results = blastn.run(query_sq, subject_sq)\n", "print(len(results[0].alignments))" ] }, { "cell_type": "markdown", "id": "d9c480c3-90b7-4a5d-92ff-8b649209ff62", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Building complex queries" ] }, { "cell_type": "markdown", "id": "fd9b0941-c75d-4f0b-ba92-8fb9b0713996", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Now that the data is loaded in the local scope, it is really easy to run more complex queries about regions of each sequence without having to copy data around. For instance, to limit the homology search to the first 10,000 bases of each sequences, we could use a `SeqIntervalLoc`:" ] }, { "cell_type": "code", "execution_count": null, "id": "7a1dfc7c-8f50-49cd-87c8-7e45be17240a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "from pyncbitk.objects.seqloc import SeqIntervalLoc\n", "query_loc = SeqIntervalLoc(k12.id, 0, 9999)\n", "subject_loc = SeqIntervalLoc(o157.id, 0, 9999)\n", "results = blastn.run(SearchQuery(query_loc, scope), SearchQuery(subject_loc, scope))\n", "print(len(results[0].alignments))" ] }, { "cell_type": "markdown", "id": "82a8555d-e007-4ef8-ae27-30d8900e9eac", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Otherwise, we can also create a sequence that references the K12 genome without having to copy the data around. For instance, `thrL`, the Threonine operon attenuator, is located at bases 190 to 255 of the K12 genome. To create a sequence referencing another sequence data, we can create a `BioSeq` with a `RefInst` instance, which reference a sequence at a given location." ] }, { "cell_type": "code", "execution_count": null, "id": "a48aeabc-eb43-4a53-abc9-d1d4de6ce647", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "from pyncbitk.objects.seqinst import RefInst\n", "thrl_id = LocalId(ObjectId(\"thrL\"))\n", "thrl = BioSeq(RefInst(SeqIntervalLoc(k12.id, 189, 254)), thrl_id)\n", "scope.add_bioseq(thrl)" ] }, { "cell_type": "markdown", "id": "3934eea2-bcb9-4166-9e0b-6c0873a056f7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Then we can use the whole `thrL` sequence as a query, without having to copy the data internally:" ] }, { "cell_type": "code", "execution_count": null, "id": "c64d78de-fe2b-427c-b0f3-38977da3b6bd", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "query_loc = WholeSeqLoc(thrl.id)\n", "subject_loc = WholeSeqLoc(o157.id)\n", "results = blastn.run(SearchQuery(query_loc, scope), SearchQuery(subject_loc, scope))\n", "for alignment in results[0].alignments:\n", " print(alignment[0].id, alignment[1].id, alignment.evalue, alignment.bitscore, alignment.matches, alignment.percent_identity, alignment.alignment_length, sep=\"\\t\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }