Extracting a table from text


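I wrote the code below to extract the table. It partially works, but it has 2 bugs.

  • 1- It does not capture the last row of data.
  • 2- It does not capture the label for a value, when there is one.

In addition,

  • 3- If there are missing values, I would like to capture those rows too.

Can you help me improve the code?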

import pandas as pd
import re

data = {'AG0': ': Age in \n- 2 -Value Label Unweighted\nFrequency%\n42- 367 11.1 %\n43- 421 12.7 %\n44- 416 12.6 %\n45- 389 11.8 %\n46- 400 12.1 %\n47- 392 11.9 %\n48- 299 9.1 %\n49- 255 7.7 %\n50- 168 5.1 %\n51- 115 3.5 %\n52- 71 2.2 %\n53- 40.1 %\n Missing Data   \n.- 50.2 %\n Total 3,302 100%\nBased upon 3,297 valid cases out of 3,302 total cases.\n•Mean: 45.85\n•Median: 46.00\n•Mode: 43.00\n•Minimum: 42.00\n•Maximum: 53.00\n•Standard Deviation: 2.69\nLocation: 9-10 (width: 2; decimal: 0)\nVariable Type:  numeric \n'}

# Extract table from string
pattern = r'(\d+)-\s+(\d+)\s+(\d+\.\d+)\s*%'
table_data = re.findall(pattern, data['AG0'])

# Create DataFrame
df = pd.DataFrame(table_data, columns=['Value', 'Unweighted Frequency',"%"])

# Print DataFrame
print(df)

Output:

   Value Unweighted Frequency     %
0     42                  367  11.1
1     43                  421  12.7
2     44                  416  12.6
3     45                  389  11.8
4     46                  400  12.1
5     47                  392  11.9
6     48                  299   9.1
7     49                  255   7.7
8     50                  168   5.1
9     51                  115   3.5
10    52                   71   2.2

Update: How can I make the code above work for this text, which has labels?

data2 = {'PRE': ': Currently ?\nAre you currently pregnant?\nValue Label Unweighted\nFrequency%\n1No 3295 99.8 %\n2Yes 00.0 %\n Missing Data   \n-9Missing 70.2 %\n Total 3,302 100%\nBased upon 3,295 valid cases out of 3,302 total cases.\n•Minimum: 1.00\n•Maximum: 1.00\nLocation: 11-12 (width: 2; decimal: 0)\nVariable Type:  numeric \n- 3 -(Range of) Missing Values:  -9 , -8 , -7 , -1\n'}

Update: How can I make the code above work for this text, where the labels are longer strings?

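data3 = {'F3': 'Blood draw attempted\nBlood draw attempted?\nValue Label Unweighted\nFrequency%\n1Yes, as per protocol 2745 83.1 %\n2Yes, menses too variable 91 2.8 %\n- 5 -Value Label Unweighted\nFrequency%\n3Yes, Last attempt 405 12.3 %\n4No, Not fasting and/or not in window 10.0 %\n Missing Data \n-9Missing 16 0.5 %\n-1N/A 20.1 %\n.- 42 1.3 %\n Total 3,302 100%\nBased upon 3,242 valid cases out of 3,302 total cases.\n•Minimum: 1.00\n•Maximum: 4.00\nLocation: 21-22 (width: 2; decimal: 0)\nVariable Type: numeric \n(Range of) Missing Values: -9 , -8 , -7 , -1 , .\n'}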

python pandas regex nlp regular-language
1 Answer

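You can try the following regex. It collects named capture groups that are easy to process afterwards: drop the label when it is None, and keep it when it holds a valid value.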

import re

data = {'AG0': ': Age in \n- 2 -Value Label Unweighted\nFrequency%\n42- 367 11.1 %\n43- 421 12.7 %\n44- 416 12.6 %\n45- 389 11.8 %\n46- 400 12.1 %\n47- 392 11.9 %\n48- 299 9.1 %\n49- 255 7.7 %\n50- 168 5.1 %\n51- 115 3.5 %\n52- 71 2.2 %\n53- 40.1 %\n Missing Data   \n.- 50.2 %\n Total 3,302 100%\nBased upon 3,297 valid cases out of 3,302 total cases.\n•Mean: 45.85\n•Median: 46.00\n•Mode: 43.00\n•Minimum: 42.00\n•Maximum: 53.00\n•Standard Deviation: 2.69\nLocation: 9-10 (width: 2; decimal: 0)\nVariable Type:  numeric \n'}

pattern = r'(?P<value>\d+)-?(?P<label>\S+)?\s+(?P<frequency>\d+)\s+(?P<percent>\d+\.\d+)\s*%'

groups = [ m.groupdict() for line in data.get('AG0').split(sep='\n') if (m := re.match(pattern, line)) ]
groups

Result:

[{'value': '42', 'label': None, 'frequency': '367', 'percent': '11.1'},
 {'value': '43', 'label': None, 'frequency': '421', 'percent': '12.7'},
 {'value': '44', 'label': None, 'frequency': '416', 'percent': '12.6'},
 {'value': '45', 'label': None, 'frequency': '389', 'percent': '11.8'},
 {'value': '46', 'label': None, 'frequency': '400', 'percent': '12.1'},
 {'value': '47', 'label': None, 'frequency': '392', 'percent': '11.9'},
 {'value': '48', 'label': None, 'frequency': '299', 'percent': '9.1'},
 {'value': '49', 'label': None, 'frequency': '255', 'percent': '7.7'},
 {'value': '50', 'label': None, 'frequency': '168', 'percent': '5.1'},
 {'value': '51', 'label': None, 'frequency': '115', 'percent': '3.5'},
 {'value': '52', 'label': None, 'frequency': '71', 'percent': '2.2'}]
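
Every captured group is a string (or None), so the values usually need converting before any numeric work. The following is a minimal sketch of that step. The helper name to_record and the two sample rows are illustrative and not part of the original answer.

```python
def to_record(group):
    """Convert one groupdict() result (strings or None) into typed values."""
    frequency = group["frequency"]
    return {
        "value": int(group["value"]),
        "label": group["label"],  # stays None when the row has no label
        "frequency": int(frequency) if frequency is not None else None,
        "percent": float(group["percent"]),
    }

# Two rows in the shape produced by m.groupdict() in this answer
sample = [
    {"value": "42", "label": None, "frequency": "367", "percent": "11.1"},
    {"value": "2", "label": "Yes", "frequency": None, "percent": "00.0"},
]
records = [to_record(g) for g in sample]
```

A list of dictionaries like records can be passed straight to pd.DataFrame(records) to get the table the question asks for.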

Example with the second data dictionary:

import re

pattern = r'(?P<value>\d+)-?(?P<label>\S+)?\s+(?:(?P<frequency>\d+)\s+)?(?P<percent>\d+\.\d+)\s*%'

data2 = {'PRE': ': Currently ?\nAre you currently pregnant?\nValue Label Unweighted\nFrequency%\n1No 3295 99.8 %\n2Yes 00.0 %\n Missing Data   \n-9Missing 70.2 %\n Total 3,302 100%\nBased upon 3,295 valid cases out of 3,302 total cases.\n•Minimum: 1.00\n•Maximum: 1.00\nLocation: 11-12 (width: 2; decimal: 0)\nVariable Type:  numeric \n- 3 -(Range of) Missing Values:  -9 , -8 , -7 , -1\n',}

groups2 = [ m.groupdict() for line in data2.get('PRE').split(sep='\n') if (m := re.match(pattern, line)) ]

groups2

Result:

[{'value': '1', 'label': 'No', 'frequency': '3295', 'percent': '99.8'},
 {'value': '2', 'label': 'Yes', 'frequency': None, 'percent': '00.0'}]

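Note the subtle difference between the first and second regex used in the two examples. The second one makes the frequency group optional:

(?:(?P<frequency>\d+)\s+)?

I left this out of the first example because, as @Swifty mentioned, the record for value 53 looks suspicious: in 53- 40.1 % the frequency and the percentage appear to have been run together.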
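
If the suspicious rows really are a frequency and a percentage fused into one token, the table total can be used to pull them apart. This is an assumption, though the sample's arithmetic is consistent with it: the listed frequencies sum to 3,293, which leaves 4 of the 3,297 valid cases for value 53. The sketch below tries every split point and keeps the splits where the percentage agrees with frequency / total. The function name split_fused and the 0.05 tolerance (one decimal of rounding) are illustrative choices, not part of the original answer.

```python
import re

def split_fused(token, total):
    """Return (frequency, percent) pairs that could explain a fused token.

    token: text such as '40.1' that may be '4' followed by '0.1'
    total: the table total, e.g. 3302 from the 'Total 3,302 100%' row
    """
    candidates = []
    for i in range(1, len(token)):
        freq_text, pct_text = token[:i], token[i:]
        # The left part must be an integer, the right part a decimal number
        if not freq_text.isdigit() or not re.fullmatch(r"\d+\.\d+", pct_text):
            continue
        freq, pct = int(freq_text), float(pct_text)
        # Accept the split when the percent matches to one decimal place
        if abs(100 * freq / total - pct) <= 0.05 + 1e-9:
            candidates.append((freq, pct))
    return candidates

print(split_fused("40.1", 3302))  # row for value 53 in the first example
print(split_fused("11.1", 3302))  # a genuine percentage has no valid split
```

An empty result means the token is best left as a plain percentage. More than one candidate would mean the split is ambiguous and needs a manual check.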

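[Edit] For the third data dictionary you provided, the regex needs one small change: the label group must also accept spaces, so \S+ becomes \D+:

(?P<label>\D+)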

Third example:

data3 ={'F3': 'Blood draw attempted\nBlood draw attempted?\nValue Label Unweighted\nFrequency%\n1Yes, as per protocol 2745 83.1 %\n2Yes, menses too variable 91 2.8 %\n- 5 -Value Label Unweighted\nFrequency%\n3Yes, Last attempt 405 12.3 %\n4No, Not fasting and/or not in window 10.0 %\n Missing Data \n-9Missing 16 0.5 %\n-1N/A 20.1 %\n.- 42 1.3 %\n Total 3,302 100%\nBased upon 3,242 valid cases out of 3,302 total cases.\n•Minimum: 1.00\n•Maximum: 4.00\nLocation: 21-22 (width: 2; decimal: 0)\nVariable Type: numeric \n(Range of) Missing Values: -9 , -8 , -7 , -1 , .\n'}

pattern = r'(?P<value>\d+)-?(?P<label>\D+)?\s+(?:(?P<frequency>\d+)\s+)?(?P<percent>\d+\.\d+)\s*%'

groups3 = [ m.groupdict() for line in data3.get('F3').split(sep='\n') if (m := re.match(pattern, line)) ]

groups3

Output:

[{'value': '1',
  'label': 'Yes, as per protocol',
  'frequency': '2745',
  'percent': '83.1'},
 {'value': '2',
  'label': 'Yes, menses too variable',
  'frequency': '91',
  'percent': '2.8'},
 {'value': '3',
  'label': 'Yes, Last attempt',
  'frequency': '405',
  'percent': '12.3'},
 {'value': '4',
  'label': 'No, Not fasting and/or not in window',
  'frequency': None,
  'percent': '10.0'}]
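
The patterns above use re.match with a value group of \d+, so rows under "Missing Data" (values such as -9, -1 or a lone dot) are skipped. One possible extension, sketched here as a variation and not as part of the original answer, is to let the value group accept an optional leading minus sign or a single dot. The shortened sample text is taken from the third data dictionary.

```python
import re

# Same structure as the third pattern, but the value may be negative or '.'
pattern = re.compile(
    r'(?P<value>-?\d+|\.)-?(?P<label>\D+)?\s+'
    r'(?:(?P<frequency>\d+)\s+)?(?P<percent>\d+\.\d+)\s*%'
)

text = (
    '3Yes, Last attempt 405 12.3 %\n'
    ' Missing Data \n'
    '-9Missing 16 0.5 %\n'
    '-1N/A 20.1 %\n'
    '.- 42 1.3 %\n'
    ' Total 3,302 100%\n'
)

rows = []
for line in text.split('\n'):
    m = pattern.match(line)  # anchored at the start of each line
    if m:
        rows.append(m.groupdict())

for row in rows:
    print(row)
```

Header, "Missing Data" and "Total" lines still fail to match because they start with a space or a letter. Rows whose frequency and percentage are run together (here -1N/A 20.1 %) come back with frequency set to None and need separate handling.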