How can I retrieve NCBI summaries by gene name using Entrez?


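I have explored various options and solutions online, but I can't quite figure this out. I am new to Entrez, so I don't fully understand how it works, but here is my attempt.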

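My goal is to print the online summary. For Kat2a, for example, I would like it to print "Enables H3 histone acetyltransferase activity; chromatin binding activity; and histone acetyltransferase activity (H4-K12 specific). Involved in several processes..." and so on, taken from the Summary section on NCBI.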

from Bio import Entrez


def get_summary(gene_name):
    Entrez.email = 'x'

    query = f'{gene_name}[Gene Name]'
    handle = Entrez.esearch(db='gene', term=query)
    record = Entrez.read(handle)
    handle.close()

    NCBI_ids = record['IdList']
    for id in NCBI_ids:
        handle = Entrez.esummary(db='gene', id=id)
        record = Entrez.read(handle)
        print(record['Summary'])
    return 0

1 Answer

Use Biopython to get all gene IDs associated with the provided gene name [1] and collect the gene summary for each ID [2]

  • [1]: using Bio.Entrez.esearch
  • [2]: using Bio.Entrez.efetch

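You are on the right track. Here is an example that fleshes out the approach you started in your question. The function below (which can of course be customized further) accounts for the fact that Entrez.esearch returns at most 20 gene IDs by default (the function raises that limit to 100), and it also filters the query itself by organism (unless the default 'human' is set to None).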
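Before the full function, here is a minimal sketch of just the query-building step, to show which search terms end up being sent to esearch. The helper name build_gene_query is hypothetical (it is not part of Biopython); it only mirrors the conditional expression used in the function.

```python
def build_gene_query(gene_name, organism="human"):
    """Build an Entrez Gene search term (hypothetical helper, not a Biopython API)."""
    term = f"{gene_name}[Gene Name]"
    if organism:
        # Restrict hits to one organism; skipped when organism is None or empty.
        term = f"({term}) AND {organism}[Organism]"
    return term


print(build_gene_query("KAT2A"))
# (KAT2A[Gene Name]) AND human[Organism]
print(build_gene_query("ALDH*", organism=None))
# ALDH*[Gene Name]
```

Pasting the resulting term into the search box of the NCBI Gene web page is a quick way to check what a query matches before scripting it.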

import time
import xmltodict

from collections import defaultdict

from Bio import Entrez


def get_entrez_gene_summary(
    gene_name, email, organism="human", max_gene_ids=100
):
    """Returns the 'Summary' contents for provided input
    gene from the Entrez Gene database. All gene IDs 
    returned for input gene_name will have their docsum
    summaries 'fetched'.
    
    Args:
        gene_name (string): Official (HGNC) gene name 
           (e.g., 'KAT2A')
        email (string): Required email for making requests
        organism (string, optional): defaults to human. 
           Filters results only to match organism. Set to None
           to return all organism unfiltered.
        max_gene_ids (int, optional): Sets the number of Gene
           ID results to return (absolute max allowed is 10K).
        
    Returns:
        dict: Summaries for all gene IDs associated with 
           gene_name (where: keys → [orgn][gene name],
                      values → gene summary)
    """
    Entrez.email = email

    query = (
        f"{gene_name}[Gene Name]"
        if not organism
        else f"({gene_name}[Gene Name]) AND {organism}[Organism]"
    )
    handle = Entrez.esearch(db="gene", term=query, retmax=max_gene_ids)
    record = Entrez.read(handle)
    handle.close()

    gene_summaries = defaultdict(dict)
    gene_ids = record["IdList"]

    print(
        f"{len(gene_ids)} gene IDs returned associated with gene {gene_name}."
    )
    for gene_id in gene_ids:
        print(f"\tRetrieving summary for {gene_id}...")
        handle = Entrez.efetch(db="gene", id=gene_id, rettype="docsum")
        gene_dict = xmltodict.parse(
            "".join([x.decode(encoding="utf-8") for x in handle.readlines()]),
            dict_constructor=dict,
        )
        gene_docsum = gene_dict["eSummaryResult"]["DocumentSummarySet"][
            "DocumentSummary"
        ]
        name = gene_docsum.get("Name")
        summary = gene_docsum.get("Summary")
        gene_organism = gene_docsum.get("Organism")["CommonName"]
        gene_summaries[gene_organism][name] = summary
        handle.close()
        time.sleep(0.34)  # Requests to NCBI are rate limited to 3 per second

    return gene_summaries
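
As a side note on the code in the question: record['Summary'] most likely fails because the parsed esummary result for the gene database is nested. The sketch below assumes the layout record["DocumentSummarySet"]["DocumentSummary"], a list with one entry per ID, which is what Entrez.read has returned for db="gene" in my experience; the layout may differ between Biopython versions, so treat it as an assumption. The function names are hypothetical, and the demonstration runs on a hand-written stand-in record rather than a live request.

```python
def extract_gene_summaries(record):
    """Map gene name -> summary from a parsed Entrez Gene esummary record.

    Assumes the nested layout described above (an assumption, not a guarantee).
    """
    docsums = record["DocumentSummarySet"]["DocumentSummary"]
    return {doc["Name"]: doc["Summary"] for doc in docsums}


def fetch_gene_summary(gene_id, email):
    """Fetch and parse the esummary for one Gene ID (performs a network request)."""
    from Bio import Entrez  # imported here so the parsing helper stays dependency-free

    Entrez.email = email
    handle = Entrez.esummary(db="gene", id=gene_id)
    record = Entrez.read(handle)
    handle.close()
    return extract_gene_summaries(record)


# Offline demonstration with a hand-written stand-in for a parsed record.
fake_record = {
    "DocumentSummarySet": {
        "DocumentSummary": [{"Name": "KAT2A", "Summary": "example text"}]
    }
}
print(extract_gene_summaries(fake_record))
# {'KAT2A': 'example text'}
```

This variant avoids the xmltodict dependency, at the cost of relying on Biopython's parser for the record structure.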


Example 1 – Get the gene summary for KAT2A

>>> email = # [insert private email]
>>> gene_summaries = get_entrez_gene_summary("KAT2A", email)

Only one gene summary is returned (remember that the default is organism='human'):

1. KAT2A
KAT2A, or GCN5, is a histone acetyltransferase (HAT) that functions primarily as a transcriptional activator. It also functions as a repressor of NF-kappa-B (see MIM 164011) by promoting ubiquitination of the NF-kappa-B subunit RELA (MIM 164014) in a HAT-independent manner (Mao et al., 2009 [PubMed 19339690]).[supplied by OMIM, Sep 2009]

Example 2 – Use a wildcard and receive multiple genes for a single organism

For example, the query ALDH* (the asterisk is a wildcard) returns the gene summaries for all human aldehyde dehydrogenase genes:

>>> email = # enter private email
>>> gene_summaries = get_entrez_gene_summary("ALDH*", email, max_gene_ids=50)
28 gene IDs returned associated with gene ALDH*.
    Retrieving summary for 217...
    Retrieving summary for 216...
    Retrieving summary for 501...
    Retrieving summary for 220...
    Retrieving summary for 224...
    Retrieving summary for 7915...
    Retrieving summary for 218...
    Retrieving summary for 5832...
    Retrieving summary for 219...
    Retrieving summary for 10840...
    Retrieving summary for 8854...
    Retrieving summary for 8540...
    Retrieving summary for 223...
    Retrieving summary for 8659...
    Retrieving summary for 4329...
    Retrieving summary for 221...
    Retrieving summary for 222...
    Retrieving summary for 126133...
    Retrieving summary for 160428...
    Retrieving summary for 64577...
    Retrieving summary for 541...
    Retrieving summary for 100862662...
    Retrieving summary for 544...
    Retrieving summary for 543...
    Retrieving summary for 542...
    Retrieving summary for 101927751...
    Retrieving summary for 283665...
    Retrieving summary for 100874204...
>>> for i, (k, v) in enumerate(gene_summaries["human"].items()):
...    print(f"{i+1}. {k}")
...    print(v, end="\n\n")
1. ALDH2
This protein belongs to the aldehyde dehydrogenase family of proteins. Aldehyde dehydrogenase is the second enzyme of the major oxidative pathway of alcohol metabolism. Two major liver isoforms of aldehyde dehydrogenase, cytosolic and mitochondrial, can be distinguished by their electrophoretic mobilities, kinetic properties, and subcellular localizations. Most Caucasians have two major isozymes, while approximately 50% of East Asians have the cytosolic isozyme but not the mitochondrial isozyme. A remarkably higher frequency of acute alcohol intoxication among East Asians than among Caucasians could be related to the absence of a catalytically active form of the mitochondrial isozyme. The increased exposure to acetaldehyde in individuals with the catalytically inactive form may also confer greater susceptibility to many types of cancer. This gene encodes a mitochondrial isoform, which has a low Km for acetaldehydes, and is localized in mitochondrial matrix. Alternative splicing results in multiple transcript variants encoding distinct isoforms.[provided by RefSeq, Nov 2016]

2. ALDH1A1
The protein encoded by this gene belongs to the aldehyde dehydrogenase family. Aldehyde dehydrogenase is the next enzyme after alcohol dehydrogenase in the major pathway of alcohol metabolism. There are two major aldehyde dehydrogenase isozymes in the liver, cytosolic and mitochondrial, which are encoded by distinct genes, and can be distinguished by their electrophoretic mobility, kinetic properties, and subcellular localization. This gene encodes the cytosolic isozyme. Studies in mice show that through its role in retinol metabolism, this gene may also be involved in the regulation of the metabolic responses to high-fat diet. [provided by RefSeq, Mar 2011]

3. ALDH7A1
The protein encoded by this gene is a member of subfamily 7 in the aldehyde dehydrogenase gene family. These enzymes are thought to play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. This particular member has homology to a previously described protein from the green garden pea, the 26g pea turgor protein. It is also involved in lysine catabolism that is known to occur in the mitochondrial matrix. Recent reports show that this protein is found both in the cytosol and the mitochondria, and the two forms likely arise from the use of alternative translation initiation sites. An additional variant encoding a different isoform has also been found for this gene. Mutations in this gene are associated with pyridoxine-dependent epilepsy. Several related pseudogenes have also been identified. [provided by RefSeq, Jan 2011]

4. ALDH1A3
This gene encodes an aldehyde dehydrogenase enzyme that uses retinal as a substrate. Mutations in this gene have been associated with microphthalmia, isolated 8, and expression changes have also been detected in tumor cells. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jun 2014]

5. ALDH3A2
Aldehyde dehydrogenase isozymes are thought to play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. This gene product catalyzes the oxidation of long-chain aliphatic aldehydes to fatty acid. Mutations in the gene cause Sjogren-Larsson syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]

6. ALDH5A1
This protein belongs to the aldehyde dehydrogenase family of proteins. This gene encodes a mitochondrial NAD(+)-dependent succinic semialdehyde dehydrogenase. A deficiency of this enzyme, known as 4-hydroxybutyricaciduria, is a rare inborn error in the metabolism of the neurotransmitter 4-aminobutyric acid (GABA). In response to the defect, physiologic fluids from patients accumulate GHB, a compound with numerous neuromodulatory properties. Two transcript variants encoding distinct isoforms have been identified for this gene. [provided by RefSeq, Jul 2008]

7. ALDH3A1
Aldehyde dehydrogenases oxidize various aldehydes to the corresponding acids. They are involved in the detoxification of alcohol-derived acetaldehyde and in the metabolism of corticosteroids, biogenic amines, neurotransmitters, and lipid peroxidation. The enzyme encoded by this gene forms a cytoplasmic homodimer that preferentially oxidizes aromatic and medium-chain (6 carbons or more) saturated and unsaturated aldehyde substrates. It is thought to promote resistance to UV and 4-hydroxy-2-nonenal-induced oxidative damage in the cornea. The gene is located within the Smith-Magenis syndrome region on chromosome 17. Multiple alternatively spliced variants, encoding the same protein, have been identified. [provided by RefSeq, Sep 2008]

8. ALDH18A1
This gene is a member of the aldehyde dehydrogenase family and encodes a bifunctional ATP- and NADPH-dependent mitochondrial enzyme with both gamma-glutamyl kinase and gamma-glutamyl phosphate reductase activities. The encoded protein catalyzes the reduction of glutamate to delta1-pyrroline-5-carboxylate, a critical step in the de novo biosynthesis of proline, ornithine and arginine. Mutations in this gene lead to hyperammonemia, hypoornithinemia, hypocitrullinemia, hypoargininemia and hypoprolinemia and may be associated with neurodegeneration, cataracts and connective tissue diseases. Alternatively spliced transcript variants, encoding different isoforms, have been described for this gene. [provided by RefSeq, Jul 2008]

9. ALDH1B1
This protein belongs to the aldehyde dehydrogenases family of proteins. Aldehyde dehydrogenase is the second enzyme of the major oxidative pathway of alcohol metabolism. This gene does not contain introns in the coding sequence. The variation of this locus may affect the development of alcohol-related problems. [provided by RefSeq, Jul 2008]

10. ALDH1L1
The protein encoded by this gene catalyzes the conversion of 10-formyltetrahydrofolate, nicotinamide adenine dinucleotide phosphate (NADP+), and water to tetrahydrofolate, NADPH, and carbon dioxide. The encoded protein belongs to the aldehyde dehydrogenase family. Loss of function or expression of this gene is associated with decreased apoptosis, increased cell motility, and cancer progression. There is an antisense transcript that overlaps on the opposite strand with this gene locus. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jun 2012]

11. ALDH1A2
This protein belongs to the aldehyde dehydrogenase family of proteins. The product of this gene is an enzyme that catalyzes the synthesis of retinoic acid (RA) from retinaldehyde. Retinoic acid, the active derivative of vitamin A (retinol), is a hormonal signaling molecule that functions in developing and adult tissues. The studies of a similar mouse gene suggest that this enzyme and the cytochrome CYP26A1, concurrently establish local embryonic retinoic acid levels which facilitate posterior organ development and prevent spina bifida. Four transcript variants encoding distinct isoforms have been identified for this gene. [provided by RefSeq, May 2011]

12. AGPS
This gene is a member of the FAD-binding oxidoreductase/transferase type 4 family. It encodes a protein that catalyzes the second step of ether lipid biosynthesis in which acyl-dihydroxyacetonephosphate (DHAP) is converted to alkyl-DHAP by the addition of a long chain alcohol and the removal of a long-chain acid anion. The protein is localized to the inner aspect of the peroxisomal membrane and requires FAD as a cofactor. Mutations in this gene have been associated with rhizomelic chondrodysplasia punctata, type 3 and Zellweger syndrome. [provided by RefSeq, Jul 2008]

13. ALDH9A1
This protein belongs to the aldehyde dehydrogenase family of proteins. It has a high activity for oxidation of gamma-aminobutyraldehyde and other amino aldehydes. The enzyme catalyzes the dehydrogenation of gamma-aminobutyraldehyde to gamma-aminobutyric acid (GABA). This isozyme is a tetramer of identical 54-kD subunits. [provided by RefSeq, Jul 2008]

14. ALDH4A1
This protein belongs to the aldehyde dehydrogenase family of proteins. This enzyme is a mitochondrial matrix NAD-dependent dehydrogenase which catalyzes the second step of the proline degradation pathway, converting pyrroline-5-carboxylate to glutamate. Deficiency of this enzyme is associated with type II hyperprolinemia, an autosomal recessive disorder characterized by accumulation of delta-1-pyrroline-5-carboxylate (P5C) and proline. Alternatively spliced transcript variants encoding different isoforms have been identified for this gene. [provided by RefSeq, Jun 2009]

15. ALDH6A1
This gene encodes a member of the aldehyde dehydrogenase protein family. The encoded protein is a mitochondrial methylmalonate semialdehyde dehydrogenase that plays a role in the valine and pyrimidine catabolic pathways. This protein catalyzes the irreversible oxidative decarboxylation of malonate and methylmalonate semialdehydes to acetyl- and propionyl-CoA. Methylmalonate semialdehyde dehydrogenase deficiency is characterized by elevated beta-alanine, 3-hydroxypropionic acid, and both isomers of 3-amino and 3-hydroxyisobutyric acids in urine organic acids. Alternate splicing results in multiple transcript variants. [provided by RefSeq, Jun 2013]

16. ALDH3B1
This gene encodes a member of the aldehyde dehydrogenase protein family. Aldehyde dehydrogenases are a family of isozymes that may play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. The encoded protein is able to oxidize long-chain fatty aldehydes in vitro, and may play a role in protection from oxidative stress. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Feb 2014]

17. ALDH3B2
This gene encodes a member of the aldehyde dehydrogenase family, a group of isozymes that may play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. The gene of this particular family member is over 10 kb in length. Altered methylation patterns at this locus have been observed in spermatozoa derived from patients exhibiting reduced fecundity. [provided by RefSeq, Aug 2017]

18. ALDH16A1
This gene encodes a member of the aldehyde dehydrogenase superfamily. The family members act on aldehyde substrates and use nicotinamide adenine dinucleotide phosphate (NADP) as a cofactor. This gene is conserved in chimpanzee, dog, cow, mouse, rat, and zebrafish. The protein encoded by this gene interacts with maspardin, a protein that when truncated is responsible for Mast syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Apr 2010]

19. ALDH1L2
This gene encodes a member of both the aldehyde dehydrogenase superfamily and the formyl transferase superfamily. This member is the mitochondrial form of 10-formyltetrahydrofolate dehydrogenase (FDH), which converts 10-formyltetrahydrofolate to tetrahydrofolate and CO2 in an NADP(+)-dependent reaction, and plays an essential role in the distribution of one-carbon groups between the cytosolic and mitochondrial compartments of the cell. Alternatively spliced transcript variants have been found for this gene.[provided by RefSeq, Oct 2010]

20. ALDH8A1
This gene encodes a member of the aldehyde dehydrogenase family of proteins. The encoded protein has been implicated in the synthesis of 9-cis-retinoic acid and in the breakdown of the amino acid tryptophan. This enzyme converts 9-cis-retinal into the retinoid X receptor ligand 9-cis-retinoic acid, and has approximately 40-fold higher activity with 9-cis-retinal than with all-trans-retinal. In addition, this enzyme has been shown to catalyze the conversion of 2-aminomuconic semialdehyde to 2-aminomuconate in the kynurenine pathway of tryptophan catabolism. [provided by RefSeq, Jul 2018]

21. ALDH7A1P1
None

22. ALDH1L1-AS2
None

23. ALDH7A1P4
None

24. ALDH7A1P3
None

25. ALDH7A1P2
None

26. ALDH1A3-AS1
None

27. ALDH1A2-AS1
None

28. ALDH1L1-AS1
None
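
The last eight entries have no summary. Judging by their names, they appear to be pseudogenes and antisense transcripts, for which NCBI often provides no summary text. If only genes with a summary are of interest, the returned dictionary can be filtered afterwards. This is a minimal sketch; the helper name drop_missing_summaries and the sample data are made up for illustration.

```python
def drop_missing_summaries(summaries):
    """Keep only entries whose summary is neither None nor an empty string."""
    return {name: text for name, text in summaries.items() if text}


# Stand-in for gene_summaries["human"]; the summary text is a placeholder.
example = {"ALDH2": "placeholder summary", "ALDH7A1P1": None}
print(drop_missing_summaries(example))
# {'ALDH2': 'placeholder summary'}
```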

Example 3 – Receive thousands of genes across all organisms (unfiltered)

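Setting organism=None and max_gene_ids=10000 in the provided Python function for the same query (gene_name='ALDH*') returns 9010 gene IDs (i.e., at the time of writing, the Entrez Gene database holds 9,010 genes across all organisms whose names match ALDH*).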

For example:

>>> gene_summaries = get_entrez_gene_summary("ALDH*", email, organism=None, max_gene_ids=10000)
9010 gene IDs returned associated with gene ALDH*.
    Retrieving summary for 217...
    Retrieving summary for 216...
    Retrieving summary for 19378...
    Retrieving summary for 11669...
[...]
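
A note on run time: with one request per ID and a 0.34 s pause after each, 9010 IDs need at least 9010 × 0.34 s ≈ 3063 s, or roughly 51 minutes, before counting network latency. The E-utilities accept several IDs per request (as a comma-separated list), so one option is to send the IDs in batches. The sketch below only shows the batching step; the helper name chunked and the batch size of 200 are assumptions, and the IDs are stand-ins generated locally.

```python
def chunked(ids, size=200):
    """Yield successive lists of at most `size` IDs (hypothetical helper)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


gene_ids = [str(i) for i in range(9010)]  # stand-in IDs, not real Gene IDs
batches = list(chunked(gene_ids))

print(len(batches))      # 46 requests instead of 9010
print(len(batches[-1]))  # 10 IDs left over in the final batch

# Each batch could then be passed as id=",".join(batch) in a single request.
```

Note that the parsing in the function above expects a single DocumentSummary element. With several IDs per request, xmltodict returns a list of them instead, so the loop body would need to iterate over that list.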
