BioPython KeyError

Question

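I am an MPH student in an intro data science course, with beginner-level programming knowledge. I am running Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32, and using PyCharm as my IDE. I am using BioPython to build a web scraper and then saving the results in a DataFrame. The scraping code is this: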

from Bio import Entrez
import pandas

# gives a list of Citation IDs in response to a search word
def search(query):
    Entrez.email = '[email protected]'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='15',
                            retmode='xml',
                            datetype='pdat',  # the E-utilities parameter is spelled 'datetype'
                            mindate='2001/01/01',
                            maxdate='2010/01/01',
                            term=query
                            )
    results = Entrez.read(handle)
    return results

# Fetch the details for all the retrieved articles via the fetch utility.
def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = '[email protected]'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

if __name__ == '__main__':
    results = search('fever')
    id_list = results['IdList']
    papers = fetch_details(id_list)

Then, to save the results to a DataFrame, I have this:

pmid = []
title = []
pubyear = []
abstract = []

for i, paper in enumerate(papers['PubmedArticle']):
    pm = paper['MedlineCitation']['PMID']
    pmid.append(str(pm))
    tit = paper['MedlineCitation']['Article']['ArticleTitle']
    title.append(tit)
    pbyr = paper['MedlineCitation']['Article']['Journal']['JournalIssue']['PubDate']['Year']
    pubyear.append(pbyr)
    ab = paper['MedlineCitation']['Article']['Abstract']['AbstractText']
    abstract.append(str(ab))

# create empty dataframe
paper_df = pandas.DataFrame()

# add the PMID, Title, Publication Year, and Abstract columns
paper_df['Article_PMID'] = pmid
paper_df['Article_Title'] = title
paper_df['Publication_Year'] = pubyear
paper_df['Article_Abstract'] = abstract

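My problem is this: when the retmax parameter in the esearch call is only 15, everything works fine. I get 15 records, each with all 4 pieces of information I need. However, when I change it to 16, I get an error.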

Traceback (most recent call last):
  File "C:/Users/lztp/Documents/UT/1_PHM_2193_Intro_to_Data_Science/PyCharm_Projects/FP_Crawler_Module_1.py", line 69, in <module>
    pbyr = paper['MedlineCitation']['Article']['Journal']['JournalIssue']['PubDate']['Year']
KeyError: 'Year'

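My understanding is that this means 'Year' does not exist in the next record. How can I make it skip records with missing values and save only the records that have the values I need? I tried using a filter in the term parameter of esearch, but ran into another error. Is there a way to check whether the value is empty? Any ideas on how to do this would be greatly appreciated.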
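For reference, the lookup fails because the key is absent from that record, not because its value is empty, so the check is a membership test rather than an emptiness test. PubMed records can carry a free-text MedlineDate in place of Year. A minimal sketch on plain dictionaries, assuming the parsed records behave like ordinary nested `dict` objects; the sample data and the `has_year` helper are made up for illustration:

```python
# Hypothetical stand-ins for two PubDate entries; real ones come from Entrez.read.
pub_date_with_year = {'Year': '2005', 'Month': 'Mar'}
pub_date_without_year = {'MedlineDate': '2004 Nov-Dec'}


def has_year(pub_date):
    """Return True when the PubDate mapping contains a 'Year' key."""
    return 'Year' in pub_date


print(has_year(pub_date_with_year))
print(has_year(pub_date_without_year))
# .get returns None for a missing key instead of raising KeyError.
print(pub_date_without_year.get('Year'))
```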

python biopython pubmed
1 Answer
for paper in papers['PubmedArticle']:
    try:
        # Look up all four fields first; a missing key raises KeyError.
        pm = paper['MedlineCitation']['PMID']
        tit = paper['MedlineCitation']['Article']['ArticleTitle']
        pbyr = paper['MedlineCitation']['Article']['Journal']['JournalIssue']['PubDate']['Year']
        ab = paper['MedlineCitation']['Article']['Abstract']['AbstractText']
    except KeyError:
        # Skip records that lack any of the required fields.
        continue
    # Append only after every lookup succeeded, so the lists stay the same length.
    pmid.append(str(pm))
    title.append(tit)
    pubyear.append(pbyr)
    abstract.append(str(ab))

Just wrap the lookups in a try/except block to handle it.
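Skipping on `KeyError` drops every record that lacks any one field. If keeping those records with a placeholder is preferable, something along these lines may work. It is only a sketch, assuming the parsed records behave like nested dictionaries; `dig`, `paper_to_row`, and the sample data are hypothetical and not part of Biopython:

```python
def dig(mapping, *keys, default=None):
    """Follow keys through nested mappings; return default if any step fails."""
    current = mapping
    for key in keys:
        try:
            current = current[key]
        except (KeyError, TypeError, IndexError):
            return default
    return current


def paper_to_row(paper):
    """Build one flat row, using None for fields the record does not have."""
    citation = dig(paper, 'MedlineCitation', default={})
    article = dig(citation, 'Article', default={})
    pub_date = dig(article, 'Journal', 'JournalIssue', 'PubDate', default={})
    pmid = citation.get('PMID')
    abstract = dig(article, 'Abstract', 'AbstractText')
    return {
        'Article_PMID': None if pmid is None else str(pmid),
        'Article_Title': article.get('ArticleTitle'),
        # Fall back to the free-text MedlineDate when Year is absent.
        'Publication_Year': pub_date.get('Year', pub_date.get('MedlineDate')),
        'Article_Abstract': None if abstract is None else str(abstract),
    }


# Made-up records that mimic the nesting used in the question.
sample_papers = [
    {'MedlineCitation': {'PMID': '1', 'Article': {
        'ArticleTitle': 'Complete record',
        'Journal': {'JournalIssue': {'PubDate': {'Year': '2005'}}},
        'Abstract': {'AbstractText': ['Some text.']}}}},
    {'MedlineCitation': {'PMID': '2', 'Article': {
        'ArticleTitle': 'No Year, no abstract',
        'Journal': {'JournalIssue': {'PubDate': {'MedlineDate': '2004 Nov-Dec'}}}}}},
]

rows = [paper_to_row(paper) for paper in sample_papers]
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(rows)`, which builds the four columns in one step and keeps them aligned.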
