I wrote the following code to extract the table. It partly works, but it has 2 errors. Also, can you help me optimize the code?
import pandas as pd
import re
data = {'AG0': ': Age in \n- 2 -Value Label Unweighted\nFrequency%\n42- 367 11.1 %\n43- 421 12.7 %\n44- 416 12.6 %\n45- 389 11.8 %\n46- 400 12.1 %\n47- 392 11.9 %\n48- 299 9.1 %\n49- 255 7.7 %\n50- 168 5.1 %\n51- 115 3.5 %\n52- 71 2.2 %\n53- 40.1 %\n Missing Data \n.- 50.2 %\n Total 3,302 100%\nBased upon 3,297 valid cases out of 3,302 total cases.\n•Mean: 45.85\n•Median: 46.00\n•Mode: 43.00\n•Minimum: 42.00\n•Maximum: 53.00\n•Standard Deviation: 2.69\nLocation: 9-10 (width: 2; decimal: 0)\nVariable Type: numeric \n'}
# Extract table from string
pattern = r'(\d+)-\s+(\d+)\s+(\d+\.\d+)\s*%'
table_data = re.findall(pattern, data['AG0'])
# Create DataFrame
df = pd.DataFrame(table_data, columns=['Value', 'Unweighted Frequency',"%"])
# Print DataFrame
print(df)
Output:
Value Unweighted Frequency %
0 42 367 11.1
1 43 421 12.7
2 44 416 12.6
3 45 389 11.8
4 46 400 12.1
5 47 392 11.9
6 48 299 9.1
7 49 255 7.7
8 50 168 5.1
9 51 115 3.5
10 52 71 2.2
Update: How can I make the code above work for this text, which has labels?
data2 = {'PRE': ': Currently ?\nAre you currently pregnant?\nValue Label Unweighted\nFrequency%\n1No 3295 99.8 %\n2Yes 00.0 %\n Missing Data \n-9Missing 70.2 %\n Total 3,302 100%\nBased upon 3,295 valid cases out of 3,302 total cases.\n•Minimum: 1.00\n•Maximum: 1.00\nLocation: 11-12 (width: 2; decimal: 0)\nVariable Type: numeric \n- 3 -(Range of) Missing Values: -9 , -8 , -7 , -1\n',}
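Update: How can I make the code above work for this text, where the labels are multi-word strings?
data3 = {'F3': 'Blood draw attempted\nBlood draw attempted?\nValue Label Unweighted\nFrequency%\n1Yes, as per protocol 2745 83.1 %\n2Yes, menses too variable 91 2.8 %\n- 5 -Value Label Unweighted\nFrequency%\n3Yes, Last attempt 405 12.3 %\n4No, Not fasting and/or not in window 10.0 %\n Missing Data \n-9Missing 16 0.5 %\n-1N/A 20.1 %\n.- 42 1.3 %\n Total 3,302 100%\nBased upon 3,242 valid cases out of 3,302 total cases.\n•Minimum: 1.00\n•Maximum: 4.00\nLocation: 21-22 (width: 2; decimal: 0)\nVariable Type: numeric \n(Range of) Missing Values: -9 , -8 , -7 , -1 , .\n'}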
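You can try the following RegEx, which collects named capture groups that are easy to process afterwards: drop the label if it is `None`, and include it if it holds a valid value.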
import re
data = {'AG0': ': Age in \n- 2 -Value Label Unweighted\nFrequency%\n42- 367 11.1 %\n43- 421 12.7 %\n44- 416 12.6 %\n45- 389 11.8 %\n46- 400 12.1 %\n47- 392 11.9 %\n48- 299 9.1 %\n49- 255 7.7 %\n50- 168 5.1 %\n51- 115 3.5 %\n52- 71 2.2 %\n53- 40.1 %\n Missing Data \n.- 50.2 %\n Total 3,302 100%\nBased upon 3,297 valid cases out of 3,302 total cases.\n•Mean: 45.85\n•Median: 46.00\n•Mode: 43.00\n•Minimum: 42.00\n•Maximum: 53.00\n•Standard Deviation: 2.69\nLocation: 9-10 (width: 2; decimal: 0)\nVariable Type: numeric \n'}
pattern = r'(?P<value>\d+)-?(?P<label>\S+)?\s+(?P<frequency>\d+)\s+(?P<percent>\d+\.\d+)\s*%'
groups = [ m.groupdict() for line in data.get('AG0').split(sep='\n') if (m := re.match(pattern, line)) ]
groups
Result:
[{'value': '42', 'label': None, 'frequency': '367', 'percent': '11.1'},
{'value': '43', 'label': None, 'frequency': '421', 'percent': '12.7'},
{'value': '44', 'label': None, 'frequency': '416', 'percent': '12.6'},
{'value': '45', 'label': None, 'frequency': '389', 'percent': '11.8'},
{'value': '46', 'label': None, 'frequency': '400', 'percent': '12.1'},
{'value': '47', 'label': None, 'frequency': '392', 'percent': '11.9'},
{'value': '48', 'label': None, 'frequency': '299', 'percent': '9.1'},
{'value': '49', 'label': None, 'frequency': '255', 'percent': '7.7'},
{'value': '50', 'label': None, 'frequency': '168', 'percent': '5.1'},
{'value': '51', 'label': None, 'frequency': '115', 'percent': '3.5'},
{'value': '52', 'label': None, 'frequency': '71', 'percent': '2.2'}]
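Every captured field comes back as a string, so a conversion step is usually needed before doing arithmetic on the table. The following is only a sketch of one way to do that: the helper name `to_typed` is made up for illustration, and `sample` is a shortened excerpt of `data['AG0']`.

```python
import re

# Pattern from the first example (frequency is required here).
pattern = r'(?P<value>\d+)-?(?P<label>\S+)?\s+(?P<frequency>\d+)\s+(?P<percent>\d+\.\d+)\s*%'

# Shortened excerpt of data['AG0'], used only to keep the sketch small.
sample = 'Frequency%\n42- 367 11.1 %\n43- 421 12.7 %\n Total 3,302 100%\n'

def to_typed(row):
    """Convert one groupdict() result to numeric fields (hypothetical helper).

    The label is kept only when the pattern actually captured one.
    """
    typed = {
        'value': int(row['value']),
        'frequency': int(row['frequency']),
        'percent': float(row['percent']),
    }
    if row['label'] is not None:
        typed['label'] = row['label']
    return typed

rows = [to_typed(m.groupdict())
        for line in sample.split('\n')
        if (m := re.match(pattern, line))]
print(rows)
```

If a DataFrame is wanted, as in the question, passing `rows` to `pd.DataFrame(rows)` should give numeric columns instead of strings.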
Example for the second data dictionary:
import re
pattern = r'(?P<value>\d+)-?(?P<label>\S+)?\s+(?:(?P<frequency>\d+)\s+)?(?P<percent>\d+\.\d+)\s*%'
data2 = {'PRE': ': Currently ?\nAre you currently pregnant?\nValue Label Unweighted\nFrequency%\n1No 3295 99.8 %\n2Yes 00.0 %\n Missing Data \n-9Missing 70.2 %\n Total 3,302 100%\nBased upon 3,295 valid cases out of 3,302 total cases.\n•Minimum: 1.00\n•Maximum: 1.00\nLocation: 11-12 (width: 2; decimal: 0)\nVariable Type: numeric \n- 3 -(Range of) Missing Values: -9 , -8 , -7 , -1\n',}
groups2 = [ m.groupdict() for line in data2.get('PRE').split(sep='\n') if (m := re.match(pattern, line)) ]
groups2
Result:
[{'value': '1', 'label': 'No', 'frequency': '3295', 'percent': '99.8'},
{'value': '2', 'label': 'Yes', 'frequency': None, 'percent': '00.0'}]
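You may notice a slight difference between the RegEx used in the first example and the one used in the second. The second pattern wraps the frequency in an optional non-capturing group, `(?:(?P<frequency>\d+)\s+)?`, so a row still matches when the frequency and the percent are fused together, and `frequency` is then `None`. I left that group out of the first example because, as @Swifty mentioned, the record for value 53 looks suspicious.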
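To make that difference concrete, here is a small sketch that runs both patterns against the suspicious row for value 53 taken from `data['AG0']`. The variable names `strict` and `lenient` are my own labels for the first and second pattern.

```python
import re

# First pattern: frequency is required.
strict = r'(?P<value>\d+)-?(?P<label>\S+)?\s+(?P<frequency>\d+)\s+(?P<percent>\d+\.\d+)\s*%'
# Second pattern: frequency sits in an optional non-capturing group.
lenient = r'(?P<value>\d+)-?(?P<label>\S+)?\s+(?:(?P<frequency>\d+)\s+)?(?P<percent>\d+\.\d+)\s*%'

# In this row the frequency and the percent have run together in the source text.
fused = '53- 40.1 %'

strict_match = re.match(strict, fused)
lenient_match = re.match(lenient, fused)

print(strict_match)               # no match, so the row is skipped
print(lenient_match.groupdict())  # matches, with frequency set to None
```

Note that the lenient pattern reports `'40.1'` as the percent, which is almost certainly not the real figure. Whether to skip such rows or keep them for manual repair is a judgment call.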
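[Edit] Based on the third data dictionary you provided, the RegEx needs a small modification. The label group becomes `(?P<label>\D+)`, so that labels containing spaces are matched.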
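As a quick check of why the change is needed, this sketch compares `\S+` and `\D+` on one row from `data3`. The names `with_S` and `with_D` are only illustrative.

```python
import re

# A row from data3 whose label contains spaces.
line = '2Yes, menses too variable 91 2.8 %'

# Label as a run of non-whitespace characters (second example's pattern).
with_S = r'(?P<value>\d+)-?(?P<label>\S+)?\s+(?:(?P<frequency>\d+)\s+)?(?P<percent>\d+\.\d+)\s*%'
# Label as a run of non-digit characters (the modified pattern).
with_D = r'(?P<value>\d+)-?(?P<label>\D+)?\s+(?:(?P<frequency>\d+)\s+)?(?P<percent>\d+\.\d+)\s*%'

match_S = re.match(with_S, line)
match_D = re.match(with_D, line)

print(match_S)              # \S+ stops at the first space, so the row fails to match
print(match_D.groupdict())  # \D+ runs up to the first digit and captures the full label
```

One caveat, as far as I can tell: `\D+` cannot capture a label that itself contains a digit, so rows with such labels would need a different approach.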
Third example:
data3 ={'F3': 'Blood draw attempted\nBlood draw attempted?\nValue Label Unweighted\nFrequency%\n1Yes, as per protocol 2745 83.1 %\n2Yes, menses too variable 91 2.8 %\n- 5 -Value Label Unweighted\nFrequency%\n3Yes, Last attempt 405 12.3 %\n4No, Not fasting and/or not in window 10.0 %\n Missing Data \n-9Missing 16 0.5 %\n-1N/A 20.1 %\n.- 42 1.3 %\n Total 3,302 100%\nBased upon 3,242 valid cases out of 3,302 total cases.\n•Minimum: 1.00\n•Maximum: 4.00\nLocation: 21-22 (width: 2; decimal: 0)\nVariable Type: numeric \n(Range of) Missing Values: -9 , -8 , -7 , -1 , .\n'}
pattern = r'(?P<value>\d+)-?(?P<label>\D+)?\s+(?:(?P<frequency>\d+)\s+)?(?P<percent>\d+\.\d+)\s*%'
groups3 = [ m.groupdict() for line in data3.get('F3').split(sep='\n') if (m := re.match(pattern, line)) ]
groups3
Output:
[{'value': '1',
'label': 'Yes, as per protocol',
'frequency': '2745',
'percent': '83.1'},
{'value': '2',
'label': 'Yes, menses too variable',
'frequency': '91',
'percent': '2.8'},
{'value': '3',
'label': 'Yes, Last attempt',
'frequency': '405',
'percent': '12.3'},
{'value': '4',
'label': 'No, Not fasting and/or not in window',
'frequency': None,
'percent': '10.0'}]
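Since the last pattern also handles the earlier layouts, the three snippets could be folded into one reusable function. This is only a sketch: `ROW`, `parse_rows` and `parse_all` are hypothetical names, and `sample` is a shortened excerpt of `data2`.

```python
import re

# Final pattern from the third example, compiled once for reuse.
ROW = re.compile(
    r'(?P<value>\d+)-?(?P<label>\D+)?\s+'
    r'(?:(?P<frequency>\d+)\s+)?(?P<percent>\d+\.\d+)\s*%'
)

def parse_rows(text):
    """Return one dict per line of text that looks like a table row."""
    return [m.groupdict() for line in text.split('\n') if (m := ROW.match(line))]

def parse_all(data):
    """Map each variable name in a data dictionary to its parsed rows."""
    return {name: parse_rows(text) for name, text in data.items()}

# Shortened excerpt of data2, used only to keep the sketch small.
sample = {'PRE': 'Value Label Unweighted\nFrequency%\n1No 3295 99.8 %\n2Yes 00.0 %\n'
                 ' Missing Data \n-9Missing 70.2 %\n Total 3,302 100%\n'}

parsed = parse_all(sample)
print(parsed)
```

Rows under "Missing Data" (for example `-9Missing`) are still skipped, because the value group only accepts a line that starts with a digit.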