How to parse a PubMed text file?

Question

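I am working on a project in which I have to build an SVM classifier that predicts MeSH term assignments from the words in an article's title and abstract. We were given a gzip file of 1000 PMIDs, each identifying one article. Below is a sample file: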

PMID- 22997744
OWN - NLM
STAT- MEDLINE
DCOM- 20121113
LR  - 20120924
IS  - 0042-4676 (Print)
IS  - 0042-4676 (Linking)
IP  - 3
DP  - 2012 May-Jun
TI  - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal
      cancer].
PG  - 28-33
AB  - To diagnose recurrent colorectal cancer is an urgent problem of oncoproctology.
      Eighty patients with suspected recurrent colon tumor were examined. All the
      patients underwent irrigoscopy, colonoscopy, magnetic resonance imaging of the
      abdomen and small pelvis. The major magnetic resonance symptoms of recurrent
      colon tumors were studied; a differential diagnosis of recurrent processes and
      postoperative changes at the site of intervention was made.
FAU - Dan'ko, N A

MH  - Aged
MH  - Colon/pathology/surgery
MH  - Colorectal Neoplasms/*diagnosis/pathology/surgery
MH  - Diagnosis, Differential
MH  - Female
MH  - Humans
MH  - Magnetic Resonance Imaging/*methods
MH  - Male
MH  - Middle Aged
MH  - Neoplasm Recurrence, Local/*diagnosis
MH  - Postoperative Complications/*diagnosis
MH  - Rectum/pathology/surgery
MH  - Reproducibility of Results

I am trying to figure out how to create a dictionary with the following structure:

{PMID: {Title (TI): Title words},
       {Abstract (AB): Abstract words},
       {MeSH (MH): MeSH terms}}.

Is there an easy way to do this? So far, the code below comes close, but it does not quite work.

class Node:
    def __init__(self, indented_line):
        self.children = []
        self.level = len(indented_line) - len(indented_line.lstrip())
        self.text = indented_line.strip()

    def add_children(self, nodes):
        childlevel = nodes[0].level
        while nodes:
            node = nodes.pop(0)
            if node.level == childlevel: # add node as a child
                self.children.append(node)
            elif node.level > childlevel: # add nodes as grandchildren of the last child
                nodes.insert(0,node)
                self.children[-1].add_children(nodes)
            elif node.level <= self.level: # this node is a sibling, no more children
                nodes.insert(0,node)
                return

    def as_dict(self):
        if len(self.children) > 1:
            return {self.text: [node.as_dict() for node in self.children]}
        elif len(self.children) == 1:
            return {self.text: self.children[0].as_dict()}
        else:
            return self.text

# Problem A [0 points]
def read_data(filenames):
    data = None
    # Begin CODE
    data = {}
    contents = []
    for filename in filenames:
        with gzip.open(filename,'rt') as f:
            contents.append(f.read())

    root = Node('root')
    root.add_children([Node(line) for line in contents[0].splitlines() if line.strip()])
    d = root.as_dict()['root']
    print(d[:50])
    # End CODE
    return data

Answer 1

Let's reduce the example to something simpler:

content = """
PMID- 22997744
OWN - NLM
TI  - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal
      cancer].
PG  - 28-33
AB  - To diagnose recurrent colorectal cancer is an urgent problem of oncoproctology.
      Eighty patients with suspected recurrent colon tumor were examined.
FAU - Dan'ko, N A

MH  - Aged
MH  - Colon/pathology/surgery"""

You can use regular expressions to match patterns. Regular expressions are a deep and powerful tool:

>>> import re
>>> match = re.search('^PMID- (.*)$', content, re.MULTILINE)

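The pattern ^PMID- (.*)$ matches the start of a line (^), followed by the literal text "PMID- ", then any number of characters (.*), then the end of the line ($). The parentheses in (.*) mean that whatever matches inside them is captured in a group. We need to check that there was a match: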

>>> match is not None
True

We can inspect the match:

>>> match.groups()
('22997744',)

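So we can see that there is one group (because we defined only one group in the pattern), and that it captured the PMID, 22997744.

We can get that value by asking for match group 1. Match group 0 is the entire matched string, PMID- 22997744.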

>>> pmid = match.group(1)
>>> pmid
'22997744'
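Because re.search returns None when a field is absent, it can help to wrap the lookup in a small helper. The sketch below is one possible way to do that; the name get_field and its default argument are illustrative and not part of the answer, and it assumes tags are padded to four characters and followed by "- " as in the sample record.

```python
import re

def get_field(tag, text, default=None):
    """Return the first value of a single-line field such as 'PMID' or 'PG'.

    Assumption: the tag is padded to four characters and followed by '- ',
    as in the sample record ('PMID- ', 'PG  - ', 'OWN - ').
    """
    pattern = '^' + re.escape(tag.ljust(4)) + '- (.*)$'
    match = re.search(pattern, text, re.MULTILINE)
    # Fall back to the default instead of raising AttributeError on None
    return match.group(1) if match is not None else default

sample = "PMID- 22997744\nOWN - NLM\nPG  - 28-33"
print(get_field('PMID', sample))  # 22997744
print(get_field('PG', sample))    # 28-33
print(get_field('AB', sample))    # None
```

Returning a default keeps a record with a missing field from stopping the whole run.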

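Writing a pattern that matches TI and AB across multiple lines is much harder. I am not an expert, and maybe someone else will offer something better. I would simply do a text replacement first, so that each field's text is on one line. For example: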

>>> text = """TI  - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal
...       cancer]."""

>>> print(text)
TI  - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal
      cancer].

>>> print(text.replace('\n      ', ' '))
TI  - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal cancer].

Then we can match TI and AB in a similar way:

>>> content = content.replace('\n      ', ' ')

>>> match = re.search('^TI  - (.*)$', content, re.MULTILINE)
>>> ti = match.group(1)
>>> ti
'[Value of magnetic resonance imaging in the diagnosis of recurrent colorectal cancer].'

>>> match = re.search('^AB  - (.*)$', content, re.MULTILINE)
>>> ab = match.group(1)
>>> ab
'To diagnose recurrent colorectal cancer is an urgent problem of oncoproctology. Eighty patients with suspected recurrent colon tumor were examined.'
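The replace call above relies on continuation lines being indented by exactly six spaces. A slightly more tolerant variant, sketched here under the assumption that any line starting with spaces or tabs continues the previous field, uses re.sub instead. The function name unfold is made up for this example.

```python
import re

def unfold(text):
    """Join continuation lines (lines starting with spaces or tabs) onto the previous line."""
    # A newline, then one or more spaces/tabs, then a non-whitespace character
    # marks a continuation; blank lines are left alone by the lookahead.
    return re.sub(r'\n[ \t]+(?=\S)', ' ', text)

record = (
    "TI  - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal\n"
    "      cancer].\n"
    "PG  - 28-33"
)
flat = unfold(record)
print(flat)
```

After unfolding, the single-line patterns such as '^TI  - (.*)$' can be applied unchanged.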

To match MH, we want to find every match. re.search only finds the first match; re.findall returns all of them:

>>> mh = re.findall('^MH  - (.*)$', content, re.MULTILINE)
>>> mh
['Aged', 'Colon/pathology/surgery']
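Each MH value can carry subheadings after a slash, and an asterisk marks a major topic, as in Colorectal Neoplasms/*diagnosis/pathology/surgery from the sample record. If the classifier should predict main headings only, the values can be split further. This is only a sketch: split_mesh is a hypothetical helper, and whether to keep subheadings is a decision for the project.

```python
def split_mesh(term):
    """Split one MH value into (heading, subheadings, is_major).

    is_major is True when the heading or any subheading is flagged with '*'.
    """
    heading, *subheadings = term.split('/')
    is_major = '*' in term
    heading = heading.replace('*', '').strip()
    subheadings = [s.replace('*', '').strip() for s in subheadings]
    return heading, subheadings, is_major

mh = ['Aged', 'Colorectal Neoplasms/*diagnosis/pathology/surgery']
for term in mh:
    print(split_mesh(term))
# ('Aged', [], False)
# ('Colorectal Neoplasms', ['diagnosis', 'pathology', 'surgery'], True)
```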

Putting all of this together:

data = {}

data[pmid] = {'Title': ti,
              'Abstract': ab,
              'MeSH': mh}

This gives (using pprint to make it look nicer):

>>> from pprint import pprint
>>> pprint(data)
{'22997744': {'Abstract': 'To diagnose recurrent colorectal cancer is an urgent problem of oncoproctology. Eighty patients with suspected recurrent colon tumor were examined.',
              'MeSH': ['Aged', 'Colon/pathology/surgery'],
              'Title': '[Value of magnetic resonance imaging in the diagnosis of recurrent colorectal cancer].'}}
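The steps above handle a single record. For the gzip files in the question, which hold many records, the same idea could be wrapped up roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: it assumes every record starts with a line beginning "PMID- ", that continuation lines start with whitespace, and that the files are UTF-8 text. The helper name parse_records is made up here; read_data follows the signature from the question.

```python
import gzip
import re

def parse_records(text):
    """Build {PMID: {'Title', 'Abstract', 'MeSH'}} from MEDLINE-format text."""
    # Join continuation lines so every field sits on a single line.
    text = re.sub(r'\n[ \t]+(?=\S)', ' ', text)
    data = {}
    # Assumption: each record begins at a line that starts with 'PMID- '.
    starts = [m.start() for m in re.finditer(r'^PMID- ', text, re.MULTILINE)]
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunk = text[begin:end]
        pmid = re.search(r'^PMID- (.*)$', chunk, re.MULTILINE).group(1).strip()
        ti = re.search(r'^TI  - (.*)$', chunk, re.MULTILINE)
        ab = re.search(r'^AB  - (.*)$', chunk, re.MULTILINE)
        data[pmid] = {
            'Title': ti.group(1) if ti else '',      # empty string if no title
            'Abstract': ab.group(1) if ab else '',   # empty string if no abstract
            'MeSH': re.findall(r'^MH  - (.*)$', chunk, re.MULTILINE),
        }
    return data

def read_data(filenames):
    """Read gzip-compressed MEDLINE files and merge their records into one dict."""
    data = {}
    for filename in filenames:
        # Assumption: the files are UTF-8 encoded text.
        with gzip.open(filename, 'rt', encoding='utf-8') as f:
            data.update(parse_records(f.read()))
    return data
```

Applied to the simplified content string from earlier, parse_records(content) should produce the same dictionary as the pprint output above.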

Answer 2

You can try the following code, which I wrote to parse abstracts, titles, and MeSH (MH) terms from a PubMed export:

import re
import pandas as pd

# Load the text file
with open('pubmed-NaturalLan-set.txt', 'r') as file:
    text = file.read()

# Split the text into individual publications
publications = text.split('\n\n')

# Initialize empty lists to store the data
abstracts = []
titles = []
keywords = []

for publication in publications:
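    # Extract abstract: the "AB  - " line plus its 6-space-indented continuation lines.
    # Matching the field itself avoids depending on which tag follows it.
    ab_match = re.search(r'^AB  - (.*(?:\n {6}.*)*)', publication, re.MULTILINE)
    abstract = re.sub(r'\s+', ' ', ab_match.group(1)).strip() if ab_match else ''
    abstracts.append(abstract)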

    # Extract title: the "TI  - " line plus its 6-space-indented continuation lines
    ti_match = re.search(r'^TI  - (.*(?:\n {6}.*)*)', publication, re.MULTILINE)
    title = re.sub(r'\s+', ' ', ti_match.group(1)).strip() if ti_match else ''
    titles.append(title)

    # Extract keywords (MeSH headings); scanning every line means no end marker is needed
    _keywords = []
    for mh_line in publication.split('\n'):
        if mh_line.startswith('MH  - '):
            # Drop the 6-character "MH  - " prefix and the '*' major-topic flag
            keyword = mh_line[6:].replace('*', '').strip()
            _keywords.append(keyword)
    keywords.append(_keywords)

# Create the DataFrame
pubMed_df_keywords = pd.DataFrame({
    'Abstract': abstracts,
    'Article Title': titles,
    'Keywords': keywords
})
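Extracting each field separately means one lookup per field of interest. An alternative, sketched below, reads a record line by line and uses the tag at the start of every line, so all fields are collected in one pass. It assumes the layout of the sample record (tag in columns 1-4, "- " in columns 5-6, continuation lines indented with spaces). The name parse_record is illustrative only.

```python
from collections import defaultdict

def parse_record(publication):
    """Return {tag: [values]} for one MEDLINE-format record."""
    fields = defaultdict(list)
    tag = None
    for line in publication.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[:4].strip() and line[4:6] == '- ':
            # New field: tag in columns 1-4, value from column 7 onward
            tag = line[:4].strip()
            fields[tag].append(line[6:].strip())
        elif tag is not None and line[0].isspace():
            # Continuation line: append to the most recent value of the current tag
            fields[tag][-1] += ' ' + line.strip()
    return dict(fields)

record = """PMID- 22997744
TI  - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal
      cancer].
MH  - Aged
MH  - Colon/pathology/surgery"""
parsed = parse_record(record)
print(parsed['TI'][0])
print(parsed['MH'])
```

Keeping every value in a list means repeated tags such as MH need no special handling, at the cost of indexing [0] for single-valued fields such as TI.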
