I have a really unwieldy nested dictionary that I'm trying to parse into a pandas DataFrame. Here is a smaller version of it:
import datetime
from decimal import Decimal

test_dict = [{
    'record_id': '43bbdfbf',
    'date': datetime.date(2023, 3, 25),
    'person': {
        'id': '123abc',
        'name': 'Person1'
    },
    'venue': {
        'id': '5bd6c74c',
        'name': 'Place1',
        'city': {
            'id': '3448439',
            'name': 'São Paulo',
            'state': 'São Paulo',
            'state_code': 'SP',
            'coords': {'lat': Decimal('-23.5475'), 'long': Decimal('-46.63611111')},
            'country': {'code': 'BR', 'name': 'Brazil'}
        },
    },
    'thing_lists': {'thing_list': [
        {'song': [
            {'name': 'Thing1', 'info': None, 'dup': None},
            {'name': 'Thing2', 'info': None, 'dup': None},
            {'name': 'Thing3', 'info': None, 'dup': None},
            {'name': 'Thing4', 'info': None, 'dup': None}],
         'extra': None},
        {'song': [
            {'name': 'ExtraThing1', 'info': None, 'dup': None},
            {'name': 'ExtraThing2', 'info': None, 'dup': None}],
         'extra': 1
        }]}}]
Here is a function I started building to pull the information out of the dictionary:
def extract_values(dictionary):
    record_id = dictionary[0]['record_id'],
    date = dictionary[0]['date'],
    country = dictionary[0]['venue']['city']['country']['name']
    return record_id, date, venue, city, lat, long, country
And here is the snippet where I try to pull those pieces into a DataFrame:
import pandas as pd

df = pd.DataFrame(extract_values(test_dict)).transpose()
df.rename(
    columns={
        df.columns[0]: 'record_id',
        df.columns[1]: 'date',
        df.columns[3]: 'city',
        df.columns[6]: 'country'
    },
    inplace=True
)
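As you can see, it mostly works, except for the string field, which gets split apart so that each row holds a single character. I don't know how to fix this. However, if the last field I pull is not a string, everything collapses back into place. Is there a way to keep the strings together manually, so that I don't have to depend on the data type of the final field?
Also, the last few fields seem very hard to extract. Ideally, I would like my final DataFrame to look like this: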
RecordID  Date        City       Country  ThingName    Dup   Extra
43bbdfbf  2023-03-25  São Paulo  Brazil   Thing1       None  None
43bbdfbf  2023-03-25  São Paulo  Brazil   Thing2       None  None
43bbdfbf  2023-03-25  São Paulo  Brazil   Thing3       None  None
43bbdfbf  2023-03-25  São Paulo  Brazil   Thing4       None  None
43bbdfbf  2023-03-25  São Paulo  Brazil   ExtraThing1  None  1
43bbdfbf  2023-03-25  São Paulo  Brazil   ExtraThing2  None  1
Can someone point me in the right direction on how to parse this dictionary properly?
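On the split strings: in the question's function, the `record_id` and `date` assignments end with a trailing comma, which in Python turns each value into a one-element tuple, while `country` stays a plain string. That mix is the likely reason the bare string ends up spread over several rows. The sketch below shows the tuple effect and one possible way around it, returning one dict per record so that no transpose or rename is needed. The helper name `extract_row` and the cut-down `sample` record are hypothetical and used only for illustration.

```python
import datetime
import pandas as pd

# Cut-down stand-in for the question's data (hypothetical sample).
sample = [{
    'record_id': '43bbdfbf',
    'date': datetime.date(2023, 3, 25),
    'venue': {'city': {'name': 'São Paulo', 'country': {'name': 'Brazil'}}},
}]

# A trailing comma turns the right-hand side into a one-element tuple.
with_comma = sample[0]['record_id'],
without_comma = sample[0]['record_id']
print(type(with_comma).__name__, type(without_comma).__name__)


def extract_row(record):
    """Return one flat dict per record; the keys become column names."""
    city = record['venue']['city']
    return {
        'record_id': record['record_id'],
        'date': record['date'],
        'city': city['name'],
        'country': city['country']['name'],
    }


# A list of dicts yields one row per dict, whatever the value types are.
df_rows = pd.DataFrame([extract_row(r) for r in sample])
print(df_rows)
```

Because every value is looked up by key, the result no longer depends on the type of the last field.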
I don't see a simple way to solve this other than using a few nested loops to extract all the values:
def extract_values(data):
    records = []
    for record in data:
        # Each record holds a list of "things", each with its own song list.
        for thing in record['thing_lists']['thing_list']:
            for song in thing['song']:
                # One output row per song, repeating the record-level fields.
                records.append({
                    'RecordID': record['record_id'],
                    'Date': record['date'],
                    'City': record['venue']['city']['name'],
                    'Country': record['venue']['city']['country']['name'],
                    'ThingName': song['name'],
                    'Dup': song['dup'],
                    'Extra': thing['extra']
                })
    return records

records = extract_values(test_dict)
df = pd.DataFrame(records)
Output:
   RecordID        Date       City  Country    ThingName   Dup  Extra
0  43bbdfbf  2023-03-25  São Paulo   Brazil       Thing1  None    NaN
1  43bbdfbf  2023-03-25  São Paulo   Brazil       Thing2  None    NaN
2  43bbdfbf  2023-03-25  São Paulo   Brazil       Thing3  None    NaN
3  43bbdfbf  2023-03-25  São Paulo   Brazil       Thing4  None    NaN
4  43bbdfbf  2023-03-25  São Paulo   Brazil  ExtraThing1  None    1.0
5  43bbdfbf  2023-03-25  São Paulo   Brazil  ExtraThing2  None    1.0
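If you would rather let pandas do the innermost flattening, `pd.json_normalize` can expand each `song` list and carry `extra` along as metadata. This is only a sketch of that alternative, not a drop-in replacement for the loops above: it still loops over the top-level records, and the trimmed `sample` data and the names `frames` and `flat` are assumptions made for illustration.

```python
import datetime
import pandas as pd

# Trimmed stand-in for the question's data (hypothetical sample).
sample = [{
    'record_id': '43bbdfbf',
    'date': datetime.date(2023, 3, 25),
    'venue': {'city': {'name': 'São Paulo', 'country': {'name': 'Brazil'}}},
    'thing_lists': {'thing_list': [
        {'song': [{'name': 'Thing1', 'info': None, 'dup': None},
                  {'name': 'Thing2', 'info': None, 'dup': None}],
         'extra': None},
        {'song': [{'name': 'ExtraThing1', 'info': None, 'dup': None}],
         'extra': 1},
    ]},
}]

frames = []
for record in sample:
    # One row per song; 'extra' is copied down from the enclosing thing.
    songs = pd.json_normalize(
        record['thing_lists']['thing_list'],
        record_path='song',
        meta=['extra'],
    )
    city = record['venue']['city']
    # Scalars passed to assign() are broadcast to every row.
    songs = songs.assign(
        RecordID=record['record_id'],
        Date=record['date'],
        City=city['name'],
        Country=city['country']['name'],
    )
    frames.append(songs)

# Stack the per-record frames, then rename and order the columns.
flat = (
    pd.concat(frames, ignore_index=True)
      .rename(columns={'name': 'ThingName', 'dup': 'Dup', 'extra': 'Extra'})
      [['RecordID', 'Date', 'City', 'Country', 'ThingName', 'Dup', 'Extra']]
)
print(flat)
```

Whether this reads better than the nested loops is a matter of taste; the loops are arguably easier to follow when the nesting is irregular.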