Every line of my XML comes back with a bytes prefix; I stripped the prefix, but no parser can read the XML. How do I parse PMC XML?

I am trying to fetch the complete PMC full-text articles that match a search query. The search returns an IDList, which I then pass to EFetch to get the response.

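The response is not MEDLINE XML but full-text XML. The XML always starts with "b'". I removed that with a piece of code, but even after cleaning I still get a parse error when using xml.etree.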

The code I use to fetch:

from Bio import Entrez, Medline
handle = Entrez.efetch(db="pmc", id='10888568',retmode="xml")
records = Medline.parse(handle)
records = list(records)
handle.close()

The code I use to clean:

cleaned_data = {key.decode(): [value.decode() for value in values] for key, values in records[0].items()}

The code I use to read the XML:

import xml.etree.ElementTree as ET

# Parse XML data
root = ET.fromstring(cleaned_data)

# Extract text under the <article> tag
articles_text = []
for article in root.findall('.//article'):
    article_text = ET.tostring(article, method='text', encoding='unicode').strip()
    articles_text.append(article_text)

# Print the extracted text of each article
for idx, article_text in enumerate(articles_text, start=1):
    print(f"Article {idx}:\n{article_text}\n")

This produces the following error:

Traceback (most recent call last):

  File ~/miniconda3/envs/pubmed/lib/python3.12/site-packages/IPython/core/interactiveshell.py:3577 in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  Cell In[42], line 4
    root = ET.fromstring(cleaned_data)

  File ~/miniconda3/envs/pubmed/lib/python3.12/xml/etree/ElementTree.py:1330 in XML
    parser.feed(text)

  File <string>
ParseError: not well-formed (invalid token): line 1, column 6

A small part of the XML before cleaning, for reference:

{b'
<?xm': [b'version="1.0" ?>'], b'
<!DO': [b'YPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">'], b'
<pmc': [b'rticleset>
    <article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="1.3" xml:lang="en" article-type="review-article">'], b'
        <?': [b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>', b'operties open_access?>'], b'
        <p': [b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">', b'cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">'], b'': 
[b'estricted-by>pmc
        </restricted-by>', b'ec sec-type="intro" id="sec1-ijms-25-01506">', b'
        <title>1. Introduction</title>', b'
        <sec id="sec1dot1-ijms-25-01506">', b'
            <title>1.1. Introduction to Rheumatoid Arthritis</title>', b'
            <p>Rheumatoid arthritis (RA) is an autoimmune condition that primarily affects the joints, as well as connective tissue, muscles, and tendons. This disease has a prevalence of 0.54% in Europe [
                <xref rid="B1-ijms-25-01506" ref-type="bibr">1</xref>], typically appearing around the age of 45 and triplicating its incidence in women in comparison with men [
                <xref rid="B2-ijms-25-01506" ref-type="bibr">2</xref>,
                <xref rid="B3-ijms-25-01506" ref-type="bibr">3</xref>].
            </p>', b'
            <p>In this disease, there is an inflammation of the synovial membrane that covers the joints (referred to as synovitis) and the development of invasive synovial tissue known as pannus, which, over time, leads to the degradation of cartilage, bone, and the joint itself [
                <xref rid="B4-ijms-25-01506" ref-type="bibr">4</xref>]. The pannus is abnormally highly vascularized, promoting a pro-inflammatory environment that contributes to greater joint degradation [
                <xref rid="B5-ijms-25-01506" ref-type="bibr">5</xref>].
            </p>', b'
            <p>RA presents four stages based on symptom intensity and exacerbation: early, moderate, severe, and end-stage. The early phase is characterized by joint inflammation and stiffness, possibly accompanied by symptoms like fever, fatigue, and loss of appetite. In the moderate stage, joint cartilage is inflamed, leading to pain and reduced mobility. The severe stage involves bone inflammation, worsening the previously mentioned symptoms and potentially causing bone erosion. Patients at this stage may also experience muscle weakness or atrophy. Ultimately, in the end-stage, joints undergo complete erosion, losing their ability to facilitate bone mobility. Additionally, due to these changes, bone fusion, known as ankylosis, can occur, resulting in a definitive loss of function [
                <xref rid="B6-ijms-25-01506" ref-type="bibr">6</xref>]. However, not all RA patients experience all four stages, as treatment with disease-modifying antirheumatic drugs (DMARDs) aims for remission, reducing symptoms to enable a normal life [
                <xref rid="B7-ijms-25-01506" ref-type="bibr">7</xref>].
            </p>', b'
            <p>Regarding its causes, there is no single reason for this disease; rather, it results from a combination of genetic, epigenetic, microbiome-related, environmental, and immunological factors. One of the most significant genetic factors is human leukocyte antigen (HLA) polymorphism. Specifically, the HLA-DRB1*04 variant has been associated with the presence of rheumatoid nodules [
                <xref rid="B8-ijms-25-01506" ref-type="bibr">8</xref>,
                <xref rid="B9-ijms-25-01506" ref-type="bibr">9</xref>]. In terms of epigenetic factors, the presence of important enzymes like fat mass and obesity-associated protein (FTO), which modifies N6-methyladenosine (m6A) methylation and acts as a demethylase [
                <xref rid="B10-ijms-25-01506" ref-type="bibr">10</xref>], has been described. However, in RA patients, its function is reduced, leading to higher levels of m6A in peripheral blood [
                <xref rid="B11-ijms-25-01506" ref-type="bibr">11</xref>]. Moreover, authors have found a relationship between altered FTO levels, contents of another m6A demethylase enzyme called AlkB homolog 5 (ALKBH5) [
                <xref rid="B12-ijms-25-01506" ref-type="bibr">12</xref>], the enzyme recognizing m6A modifications called YTH N6-methyladenosine RNA binding protein F2 (YTHFD2) [
                <xref rid="B13-ijms-25-01506" ref-type="bibr">13</xref>], and RA activity [
                <xref rid="B11-ijms-25-01506" ref-type="bibr">11</xref>].
            </p>', b'

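Note that these two XML samples belong to different articles. The full XML is also very long (8k+ lines), so please use these excerpts to get an intuition for the data.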

The XML after cleaning:

{'
<?xm': ['version="1.0" ?>'], '
<!DO': ['YPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">'], '
<pmc': ['rticleset>
    <article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="1.3" xml:lang="en" article-type="review-article">'], '
        <?': ['operties open_access?>'], '
        <p': ['cessing-meta base-tagset="archiving" mathml-version="3.0" table-model="xhtml" tagset-family="jats">'], '': ['estricted-by>pmc
        </restricted-by>', 'ec sec-type="intro" id="sec1-ijms-25-02212">', '
        <title>1. Introduction</title>', '
        <p>Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) has been one of the most significant global threats for almost four years, which has revolutionized medicine and social life&#x2014;in other words, almost everything [
            <xref rid="B1-ijms-25-02212" ref-type="bibr">1</xref>,
            <xref rid="B2-ijms-25-02212" ref-type="bibr">2</xref>]. Even though the number of coronavirus disease 2019 (COVID-19) cases has exceeded 773 million worldwide (according to the World Health Organization&#x2019;s statistics; 24 December 2023) and our knowledge concerning its possible course and accompanying late consequences is more and more detailed, we are still discovering new aspects of this phenomenon [
            <xref rid="B3-ijms-25-02212" ref-type="bibr">3</xref>]. Initially, COVID-19 was perceived as a disease of the respiratory tract. Nonetheless, the list of possible extrapulmonary symptoms and complications became longer and longer [
            <xref rid="B4-ijms-25-02212" ref-type="bibr">4</xref>,
            <xref rid="B5-ijms-25-02212" ref-type="bibr">5</xref>,
            <xref rid="B6-ijms-25-02212" ref-type="bibr">6</xref>,
            <xref rid="B7-ijms-25-02212" ref-type="bibr">7</xref>,
            <xref rid="B8-ijms-25-02212" ref-type="bibr">8</xref>]. Finally, the spectrum of COVID-19-related manifestations turned out to be associated with a broad range of systemic symptoms, e.g., neurological and dermatological syndromes, myocardial dysfunction, hyperglycemia, disorders of the gastrointestinal tract and acute kidney failure [
            <xref rid="B9-ijms-25-02212" ref-type="bibr">9</xref>,
            <xref rid="B10-ijms-25-02212" ref-type="bibr">10</xref>,
            <xref rid="B11-ijms-25-02212" ref-type="bibr">11</xref>,
            <xref rid="B12-ijms-25-02212" ref-type="bibr">12</xref>,
            <xref rid="B13-ijms-25-02212" ref-type="bibr">13</xref>,
            <xref rid="B14-ijms-25-02212" ref-type="bibr">14</xref>]. Concomitantly, a lot of observations also started to suggest an existing relationship between SARS-CoV-2 and chronic liver diseases (CLDs), indicating hepatic manifestations of the infection and the exacerbation of existing liver pathologies in the case of COVID-19 [
            <xref rid="B15-ijms-25-02212" ref-type="bibr">15</xref>,
            <xref rid="B16-ijms-25-02212" ref-type="bibr">16</xref>,
            <xref rid="B17-ijms-25-02212" ref-type="bibr">17</xref>,
            <xref rid="B18-ijms-25-02212" ref-type="bibr">18</xref>,
            <xref rid="B19-ijms-25-02212" ref-type="bibr">19</xref>,
            <xref rid="B20-ijms-25-02212" ref-type="bibr">20</xref>,
            <xref rid="B21-ijms-25-02212" ref-type="bibr">21</xref>]. Further, pre-existing systemic disorders were also proved to modify the natural history of COVID-19, usually exacerbating its course and resulting in a more complex presentation of the infection. Special attention should be paid to patients with cardiovascular pathologies, diabetes and dyslipidemia. Insulin resistance and impaired immune response are usually enumerated in such circumstances as participating cofactors. Of note, an inflammatory background might impair the production of adiponectin within visceral adipose tissue, progressing underlying inflammation. These speculations have already been confirmed by numerous findings of conducted investigations showing that the population of COVID-19 patients with lower expression of adiponectin is more prone to developing respiratory failure [
            <xref rid="B22-ijms-25-02212" ref-type="bibr">22</xref>]. A notable expression of a metalloproteinase, angiotensin-converting enzyme 2 (ACE2), a functional receptor for SARS-CoV-2, in the kidneys might be related to increased risk of acute tubular necrosis. Pre-existing pulmonary disabilities constitute another undeniable condition increasing a potential outcome of ongoing COVID-19. Simultaneously,
            <italic toggle="yes">the gut&#x2013;lung axis</italic> reflects a bidirectional relationship between residents of the bacterial flora within the gastrointestinal and respiratory tracts. Thus, dysbiosis (quite often related to treatment with antibiotics and steroids) might alter the integrity of the intestinal barrier, resulting in an increased risk of secondary infections. Secondly, pulmonary-derived hypoxia can induce necrotic lesions in the cells of the gut. On the other hand, neurological manifestations in the early stages of COVID-19 may be perceived as indicators of poor clinical outcome. Furthermore, coexisting inflammation-induced hypercoagulability predisposes patients to developing strokes. Another crucial aspect of potential implications due to SARS-CoV-2 infection is related to possible autoimmune cross-reactions. As a result, idiopathic thrombocytopenic purpura or autoimmune hemolytic anemia might be developed [
            <xref rid="B23-ijms-25-02212" ref-type="bibr">23</xref>]. A primary pulmonary disorder turned out to be a multiorgan complex pathology&#x2014;this is the major lesson of the pandemic. Due to the notable involvement of the liver in the systemic manifestation of COVID-19, a lot of investigations were conducted to explore this relationship. Simultaneously, the context of alcohol dependence and the progression of ALD during the pandemic were observed. The aforementioned phenomena are essential from clinical and social perspectives. Because the data on these issues are not fully systemized, we decided to explore the available literature and present the current state of knowledge concerning COVID-19, alcohol abuse and possible hepatic complications in the most comprehensive way.
        </p>', 'sec>', 'ec id="sec2-ijms-25-02212">', '
        <title>2. Hepatic Face of Novel Coronavirus Infection</title>', '
        <p>The liver has been described as the second organ after the lungs to be involved in the course of the disease, resulting in hepatobiliary complications among up to 29% of patients [
            <xref rid="B24-ijms-25-02212" ref-type="bibr">24</xref>,
            <xref rid="B25-ijms-25-02212" ref-type="bibr">25</xref>,
            <xref rid="B26-ijms-25-02212" ref-type="bibr">26</xref>,
            <xref rid="B27-ijms-25-02212" ref-type="bibr">27</xref>]. Chen et al. found patients infected with coronavirus and coexisting abnormalities in liver tests to present a greater risk of systemic inflammatory response syndrome (SIRS) and higher overall mortality rates [
            <xref rid="B28-ijms-25-02212" ref-type="bibr">28</xref>]. Another hepatic perspective related to COVID-19 is the risk of adverse events that may occur in the course of treatment; in the majority of cases, they are reflected by mild hypertransaminasemia [
            <xref rid="B29-ijms-25-02212" ref-type="bibr">29</xref>]. In the scope of these speculations, we decided to present gathered data of patients suffering from alcohol-related liver disease (ALD) who were infected with SARS-CoV-2. The manifestation of infection by coronavirus mainly concerns symptoms related to the respiratory tract. However, in the course of the pandemic, a more and more complex nature of COVID-19 started to appear [
            <xref rid="B30-ijms-25-02212" ref-type="bibr">30</xref>]. It turned out that a greater number of coexisting conditions among infected individuals (e.g., cardiovascular disorders, kidney failure, diabetes, cancer, obesity, neurodegenerative diseases and alcohol consumption) may predispose them to a more severe course of the infection [
            <xref rid="B31-ijms-25-02212" ref-type="bibr">31</xref>,
            <xref rid="B32-ijms-25-02212" ref-type="bibr">32</xref>,
            <xref rid="B33-ijms-25-02212" ref-type="bibr">33</xref>,
            <xref rid="B34-ijms-25-02212" ref-type="bibr">34</xref>,
            <xref rid="B35-ijms-25-02212" ref-type="bibr">35</xref>]. From 2020&#x2013;2022, at least 2336 manuscripts focused on coronavirus and its hepatic implications were published, and we are still discovering new aspects in this area [
            <xref rid="B36-ijms-25-02212" ref-type="bibr">36</xref>]. Therefore, the involvement of the liver in the presentation of COVID-19 is still an issue of great importance, requiring further investigations. Simultaneously, the pandemic and lockdowns were related to the increased consumption of alcohol in society. This combination created the background to perceive ALD as a significant underlying factor in the natural history of coronavirus infection. Due to these considerations, we decided to gather already collected data on COVID-19, ethanol and ALD in a single manuscript.
        </p>', 'sec>', 'ec id="sec3-ijms-25-02212">', '
        <title>3. SARS-CoV-2 and Liver&#x2014;Direct or Indirect Implications?</title>', '
        <p>Extrapulmonary manifestation of SARS-CoV-2 infection might even overtake pulmonary presentation [
            <xref rid="B37-ijms-25-02212" ref-type="bibr">37</xref>,
            <xref rid="B38-ijms-25-02212" ref-type="bibr">38</xref>,
            <xref rid="B39-ijms-25-02212" ref-type="bibr">39</xref>]. Usually, the presentation with gastrointestinal symptoms involves the increased prevalence of hypertransaminasemia and liver injury. This is mainly due to the hepatic expression of metalloproteinase&#x2014;angiotensin-converting enzyme 2 (ACE2)&#x2014;a functional receptor for SARS-CoV-2 and its spike-I glycoprotein [
            <xref rid="B40-ijms-25-02212" ref-type="bibr">40</xref>,
            <xref rid="B41-ijms-25-02212" ref-type="bibr">41</xref>,
            <xref rid="B42-ijms-25-02212" ref-type="bibr">42</xref>]. It was described in cholangiocytes and hepatocytes. During this initial stage of the infection, the renin&#x2013;angiotensin system and peroxisome proliferator-activated receptor signaling pathway can be perceived as triggering factors. Thus, the liver cytopathic injury might be developed as a direct result of infection [
            <xref rid="B27-ijms-25-02212" ref-type="bibr">27</xref>,
            <xref rid="B43-ijms-25-02212" ref-type="bibr">43</xref>,
            <xref rid="B44-ijms-25-02212" ref-type="bibr">44</xref>,
            <xref rid="B45-ijms-25-02212" ref-type="bibr">45</xref>,
            <xref rid="B46-ijms-25-02212" ref-type="bibr">46</xref>,
            <xref rid="B47-ijms-25-02212" ref-type="bibr">47</xref>]. Simultaneously, a developing SIRS together with coexisting hepatic anoxia due to SARS-CoV-2-related respiratory failure can indirectly impair liver function The phenomenon of hepatic disorders in the overall picture of COVID-19 concerns from 2.5% up to 45.71% individuals and this wide range can be explained by the presence of diversified subpopulations among CLD patients and different cut-offs of liver test results applied in studies. In the majority of cases, hepatic manifestation of COVID-19 is asymptomatic and its only visible proof is elevation in liver enzymes. According to previous analyses, these disturbances concern mainly aminotransferases (approximately 20% of COVID-19 patients); about 15% of cases might present with an increased level of gamma-glutamyltransferase (GGT), 9.7% of cases with bilirubin and 4% of cases with alkaline phosphatase (ALP) [
            <xref rid="B48-ijms-25-02212" ref-type="bibr">48</xref>,
            <xref rid="B49-ijms-25-02212" ref-type="bibr">49</xref>,
            <xref rid="B50-ijms-25-02212" ref-type="bibr">50</xref>]. The potential relationship between COVID-19 and liver disorders might be considered from at least two perspectives. Infection with SARS-CoV-2 can manifest with liver disorders or exacerbate already existing hepatic problems. Metabolic-associated fatty liver disease (MAFLD) turned out to be a triggering factor for COVID-19 [
            <xref rid="B51-ijms-25-02212" ref-type="bibr">51</xref>,
            <xref rid="B52-ijms-25-02212" ref-type="bibr">52</xref>]. Alterations in immune functions among individuals with liver steatosis are commonly seen and reflected by the increased concentration of interleukin (IL) 6 in their blood [
            <xref rid="B53-ijms-25-02212" ref-type="bibr">53</xref>]. After the lungs, the liver is the second organ most frequently affected by COVID-19 [
            <xref rid="B54-ijms-25-02212" ref-type="bibr">54</xref>]. Concurrently, individuals with chronic or acute liver disorders are prone to developing exacerbation due to coronavirus infection; this scheme was already confirmed [
            <xref rid="B55-ijms-25-02212" ref-type="bibr">55</xref>]. It is assumed that about 3% of all COVID-19 patients suffer from underlying CLD [
            <xref rid="B56-ijms-25-02212" ref-type="bibr">56</xref>]. Furthermore, in the case of autoimmune liver pathologies, the treatment based on immunosuppressants may constitute a triggering factor in liver injury after the development of infection with SARS-CoV-2 [
            <xref rid="B57-ijms-25-02212" ref-type="bibr">57</xref>]. Regardless of the basic status of the patient (with or without previous liver failure), the management of COVID-19 patients should routinely include the assessment of liver function tests (LFTs).
        </p>', 'sec>', 'ec id="sec4-ijms-25-02212">', '

python xml elementtree biopython pubmed

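The error "ParseError: not well-formed (invalid token): line 1, column 6" means that what reaches ET.fromstring is not XML. The "b'" at the start is not part of the data: it is how Python displays a bytes object, so stripping it changes the type but not the content. The underlying problem is Medline.parse, which expects MEDLINE plain-text records rather than XML. Judging by the dump in the question, it cut each XML line into a short key (for example "<?xm") and a value (for example 'version="1.0" ?>'), so the result is a dictionary of fragments, and decoding those fragments does not turn them back into a well-formed document.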
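
To make the two effects above concrete, here is a minimal sketch that needs no network access. The sample line is a hypothetical stand-in for one line of the efetch response, and the fixed-column split only mimics by hand what the dump in the question suggests happened; it is an illustration, not Biopython's actual code.

```python
# Hypothetical stand-in for one line of the efetch response (bytes).
raw_line = b'<?xml version="1.0" ?>'

# repr() of a bytes object shows the b'...' prefix; the prefix is not in the data.
print(repr(raw_line))

# decode() converts bytes to str, so the prefix no longer appears when printed.
text_line = raw_line.decode("utf-8")
print(repr(text_line))

# A MEDLINE-style reader treats the first 4 characters as a field tag and
# the text from column 6 onward as the value. Mimicked here with slicing.
key, value = text_line[:4], text_line[6:]
print(key, "->", value)
```

The key and value produced this way match the first entry of the dictionary shown in the question, which is why the "cleaned" data still cannot be parsed as XML.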

Here is how to fix the code so that it handles the encoding and parses the XML correctly:

import xml.etree.ElementTree as ET
from Bio import Entrez

def fetch_and_parse_pmc_xml(pmcid):
    """Fetches and parses PMC XML for a given PMC ID.

    Args:
      pmcid: The PMC ID of the article to fetch.

    Returns:
      A list of article texts.
    """
    Entrez.email = "your.email@example.com"  # Replace with your own address for Entrez requests
    handle = Entrez.efetch(db="pmc", id=pmcid, retmode="xml")
    data = handle.read().decode("utf-8")  # Decode data as UTF-8
    handle.close()  # Close the handle
    root = ET.fromstring(data.strip())  # Remove leading/trailing whitespace

    articles_text = []
    for article in root.findall('.//article'):
        article_text = ET.tostring(article, method='text', encoding='unicode').strip()
        articles_text.append(article_text)

    return articles_text

# Example usage
pmcid = "10888568"  # Replace with your desired PMC ID
articles_text = fetch_and_parse_pmc_xml(pmcid)

for idx, article_text in enumerate(articles_text, start=1):
    print(f"Article {idx}:\n{article_text}\n")
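
If you want section titles and paragraphs rather than one block of text per article, something along these lines may help. It is only a sketch: SAMPLE is a hypothetical, heavily shortened stand-in for a PMC response, extract_sections is an illustrative helper name, and real articles contain nested sections and many more element types than this handles.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for a PMC efetch response.
SAMPLE = b"""<?xml version="1.0" ?>
<pmc-articleset>
  <article article-type="review-article">
    <body>
      <sec id="sec1">
        <title>1. Introduction</title>
        <p>Rheumatoid arthritis (RA) is an autoimmune condition [<xref rid="B1" ref-type="bibr">1</xref>].</p>
      </sec>
    </body>
  </article>
</pmc-articleset>"""


def extract_sections(xml_bytes):
    """Return (title, [paragraph, ...]) pairs for every <sec> element."""
    root = ET.fromstring(xml_bytes)  # fromstring accepts bytes directly
    sections = []
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="")
        # itertext() also collects text inside nested tags such as <xref>;
        # split/join collapses the line breaks and indentation in the source.
        paragraphs = [
            " ".join("".join(p.itertext()).split()) for p in sec.findall("p")
        ]
        sections.append((title, paragraphs))
    return sections


sections = extract_sections(SAMPLE)
for title, paragraphs in sections:
    print(title)
    for paragraph in paragraphs:
        print("  ", paragraph)
```

With a live response you would pass the bytes returned by handle.read() to extract_sections instead of SAMPLE.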