NotXMLError: Failed to parse the XML data

Question (votes: 0, answers: 2)

I am trying to retrieve full-text articles from PubMed Central using the Entrez module in Biopython. Here is my code for doing so.

import urllib3
import json
import requests
from Bio import Entrez
from Bio.Entrez import efetch, Parser
print(Parser.__file__)
pmcid = 'PMC2837563'

def print_text(pmcid):
    handle = efetch(db='pmc', id=pmcid, retmode='xml', rettype=None)
    #print(handle.read())
    record = Entrez.read(handle)
    print(record)

print_text(pmcid)


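handle.read() works, which means the data is fetched correctly. However, I cannot convert the fetched data into a Python object. It gives me the following error:

NotXMLError: Failed to parse the XML data (syntax error: line 1036, column 69). Please make sure that the input data are in XML format.

Can someone tell me what to do? According to the Biopython documentation, this appears to be the correct syntax.

python biopython pubmed rentrez pubmed-api
2 Answers
0
votes

Entrez.read(handle) fails on a namespace declaration that the released parser does not recognize: the one with uri http://www.niso.org/schemas/ali/1.0/ from the DTD. The GitHub version has a corrected Parser, but it is not currently available from pip. Compare: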

Current 1.79

def startNamespaceDeclHandler(self, prefix, uri):
    """Handle start of an XML namespace declaration."""
    if prefix == "xsi":
        # This is an xml schema
        self.schema_namespace = uri
        self.parser.StartElementHandler = self.schemaHandler
    else:
        # Note that the DTD for MathML specifies a default attribute
        # that declares the namespace for each MathML element. This means
        # that MathML element in the XML has an invisible MathML namespace
        # declaration that triggers a call to startNamespaceDeclHandler
        # and endNamespaceDeclHandler. Therefore we need to count how often
        # startNamespaceDeclHandler and endNamespaceDeclHandler were called
        # to find out their first and last invocation for each namespace.
        if prefix == "mml":
            assert uri == "http://www.w3.org/1998/Math/MathML"
        elif prefix == "xlink":
            assert uri == "http://www.w3.org/1999/xlink"
        else:
            raise ValueError("Unknown prefix '%s' with uri '%s'" % (prefix, uri))
        self.namespace_level[prefix] += 1
        self.namespace_prefix[uri] = prefix

GitHub

def startNamespaceDeclHandler(self, prefix, uri):
    """Handle start of an XML namespace declaration."""
    if prefix == "xsi":
        # This is an xml schema
        self.schema_namespace = uri
        self.parser.StartElementHandler = self.schemaHandler
    else:
        # Note that the DTD for MathML specifies a default attribute
        # that declares the namespace for each MathML element. This means
        # that MathML element in the XML has an invisible MathML namespace
        # declaration that triggers a call to startNamespaceDeclHandler
        # and endNamespaceDeclHandler. Therefore we need to count how often
        # startNamespaceDeclHandler and endNamespaceDeclHandler were called
        # to find out their first and last invocation for each namespace.
        if prefix == "mml":
            assert uri == "http://www.w3.org/1998/Math/MathML"
        elif prefix == "xlink":
            assert uri == "http://www.w3.org/1999/xlink"
        elif prefix == "ali":
            assert uri == "http://www.niso.org/schemas/ali/1.0/"
        else:
            raise ValueError(f"Unknown prefix '{prefix}' with uri '{uri}'")
        self.namespace_level[prefix] += 1
        self.namespace_prefix[uri] = prefix

So you can either swap in or edit the Parser.py file, or use a third-party library to convert the handle into built-in Python objects.

If you only want to download the full text of the article, you could try downloading the PDF via metapub and then extracting the text with textract.
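As a rough illustration of converting the handle into built-in Python objects without going through Bio.Entrez's parser, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than a third-party library. SAMPLE_XML is a made-up, heavily trimmed stand-in for a PMC response, extract_text is a hypothetical helper name, and the element names (article-title, body, p) are assumptions about the JATS layout that should be checked against the real output.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for what efetch(db='pmc', ...) returns.
# It declares the "ali" namespace that the released Bio.Entrez parser rejects.
SAMPLE_XML = b"""<?xml version="1.0"?>
<pmc-articleset>
  <article xmlns:ali="http://www.niso.org/schemas/ali/1.0/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
    <front>
      <article-meta>
        <title-group>
          <article-title>Example title</article-title>
        </title-group>
        <permissions>
          <ali:free_to_read/>
        </permissions>
      </article-meta>
    </front>
    <body>
      <sec>
        <title>Intro</title>
        <p>First paragraph.</p>
        <p>Second <italic>styled</italic> paragraph.</p>
      </sec>
    </body>
  </article>
</pmc-articleset>
"""


def extract_text(handle):
    """Parse a JATS-like XML stream into a plain dict of strings.

    Assumes the article elements are not in a default namespace.
    """
    root = ET.parse(handle).getroot()

    title_el = root.find(".//article-title")
    title = "".join(title_el.itertext()).strip() if title_el is not None else None

    # Collect the text of every <p> under <body>, flattening inline markup.
    paragraphs = []
    for body in root.iter("body"):
        for p in body.iter("p"):
            paragraphs.append("".join(p.itertext()).strip())

    return {"title": title, "paragraphs": paragraphs}


# io.BytesIO stands in for the handle returned by efetch(...).
record = extract_text(io.BytesIO(SAMPLE_XML))
print(record["title"])
for paragraph in record["paragraphs"]:
    print(paragraph)
```

With a real request, the handle from efetch would be passed to extract_text in place of the io.BytesIO object. ElementTree has no fixed list of allowed prefixes, so the ali declaration does not raise an error here.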



