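I manage a PostgreSQL database and am developing a tool that gives users access to a subset of it. The table has many regular columns, and in addition we use a large number of hstore keys to store extra information that applies only to certain rows. Here is a basic example: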
A B C hstore
"foo" 1 4 "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"
"bar" 4 6 "Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"
"foobar" 2 8
"baz" 3 1 "Fruit"=>"apple", "Name"=>"David"
The data is usually exported to a CSV file like this:
COPY tableName TO '/filepath/file.csv' DELIMITER ',' CSV HEADER;
I read it into a Pandas dataframe in Python like this:
import pandas as pd
df = pd.read_csv('/filepath/file.csv')
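I then access a subset of the data. Most (but not necessarily all) rows in that subset should share a common set of hstore keys.
I would like to create a separate column for each hstore key. If a key does not exist for a row, the cell should be left empty or filled with a NULL or NaN value, whichever is easiest. What is the most efficient way to do this?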
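Use .str.extractall() to extract the keys and values from the hstore column, then use .pivot() to turn the keys into column labels. Aggregate the entries belonging to each row of the original dataframe with .groupby() and .agg(). Use .replace() to set empty entries to NaN. Finally, use .join() to join the resulting dataframe back to the original dataframe: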
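import numpy as np

# Extract the key/value pairs, pivot the keys into columns, collapse the
# matches back to one row per original row, and join the result onto df.
df.join(df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
                    .reset_index()
                    .pivot(index=['level_0', 'match'], columns=0, values=1)
                    .groupby(level=0)
                    .agg(lambda x: ''.join(x.dropna()))
                    .replace('', np.nan)
        )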
Result:
A B C hstore Country Fruit Name Pet
0 "foo" 1 4 "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway" Norway apple NaN dog
1 "bar" 4 6 "Pet"=>"cat", "Country"=>"Suriname" Suriname NaN NaN cat
2 "foobar" 2 8 None NaN NaN NaN NaN
3 "baz" 3 1 "Fruit"=>"apple", "Name"=>"David" NaN apple David NaN
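To make the chain above easier to follow, here is a rough sketch that runs the same steps one at a time and keeps the intermediate results. The sample dataframe and the names pairs, wide, collapsed and result are my own, made up for illustration; passing a list as the index to .pivot() assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, modelled on the result table above.
df = pd.DataFrame({
    "A": ["foo", "bar", "foobar", "baz"],
    "hstore": [
        '"Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"',
        '"Pet"=>"cat", "Country"=>"Suriname"',
        None,
        '"Fruit"=>"apple", "Name"=>"David"',
    ],
})

# Step 1: one row per key/value pair. The index is (original row, match number);
# column 0 holds the key and column 1 holds the value.
pairs = df["hstore"].str.extractall(r'"(.+?)"=>"(.+?)"')

# Step 2: keys become column labels; each (row, match) line carries one value.
wide = pairs.reset_index().pivot(index=["level_0", "match"], columns=0, values=1)

# Step 3: collapse the lines that belong to the same original row.
collapsed = (wide.groupby(level=0)
                 .agg(lambda x: "".join(x.dropna()))
                 .replace("", np.nan))

# Step 4: row 2 has no hstore data, so the join leaves its new columns as NaN.
result = df.join(collapsed)
print(result.drop(columns="hstore"))
```

Printing pairs and wide along the way shows why the groupby step is needed: after the pivot, each original row is still spread over one line per match.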
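If you want a new dataframe holding only the extracted columns instead of joining back to the original dataframe, you can drop the .join() step and call .reindex() instead, as follows: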
df_out = (df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
.reset_index()
.pivot(index=['level_0', 'match'], columns=0, values=1)
.groupby(level=0)
.agg(lambda x: ''.join(x.dropna()))
.replace('', np.nan)
)
df_out = df_out.reindex(df.index)
Result:
print(df_out)
Country Fruit Name Pet
0 Norway apple NaN dog
1 Suriname NaN NaN cat
2 NaN NaN NaN NaN
3 NaN apple David NaN
If you don't mind installing an extra library (sqlalchemy), as suggested here, you can do the following:
import pandas as pd
import sqlalchemy.dialects.postgresql as postgresql
hstore_to_dict = postgresql.HSTORE().result_processor(None, None)
df = pd.read_csv("/filepath/file.csv")
df["hstore"] = df["hstore"].fillna("")
hstore_dict = df["hstore"].map(hstore_to_dict)
hstore_df = pd.json_normalize(hstore_dict)
# optionally merge the expanded hstore dataframe with the original dataframe
df = pd.concat([df, hstore_df], axis=1)
df.drop("hstore", axis=1, inplace=True)
pd.json_normalize converts a list of dicts into a dataframe, as answered here.
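As a small illustration of that last step on its own, the sketch below feeds pd.json_normalize a hand-written list of dicts (the records list is hypothetical and stands in for the parsed hstore values). A key missing from a dict, or an entirely empty dict, ends up as NaN in the corresponding cells.

```python
import pandas as pd

# Hypothetical parsed hstore values; the empty dict stands for a row
# whose hstore cell was empty.
records = [
    {"Pet": "cat", "Number": "5", "Country": "Suriname"},
    {},
    {"Name": "David", "Fruit": "apple"},
]

# Every key seen in any dict becomes a column, in order of first appearance.
expanded = pd.json_normalize(records)
print(expanded)
```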
I assume your input CSV file looks like this:
A,B,C,hstore
bar,4,6,"""Pet""=>""cat"", ""Number""=>""5"", ""Country""=>""Suriname"""
foobar,2,8,
baz,3,1,"""Name""=>""David"", ""Fruit""=>""apple"""
foo,1,4,"""Pet""=>""dog"", ""Fruit""=>""apple"", ""Country""=>""Norway"""
After running the code, the output will be this dataframe:
A B C Pet Number Country Name Fruit
0 bar 4 6 cat 5 Suriname NaN NaN
1 foobar 2 8 NaN NaN NaN NaN NaN
2 baz 3 1 NaN NaN NaN David apple
3 foo 1 4 dog NaN Norway NaN apple
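If you want to check what pandas makes of a CSV shaped like the one I assumed, without touching the real file, something along these lines should do. The csv_text string is a made-up two-row excerpt of that file, and io.StringIO merely stands in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical excerpt of the assumed CSV; doubled quotes are CSV escaping.
csv_text = '''A,B,C,hstore
bar,4,6,"""Pet""=>""cat"", ""Number""=>""5"", ""Country""=>""Suriname"""
foobar,2,8,
'''

sample = pd.read_csv(io.StringIO(csv_text))

# The doubled quotes are unescaped, so the cell holds plain hstore text again.
print(sample.loc[0, "hstore"])

# An empty hstore cell is read as NaN, which is why the code above calls
# fillna("") before parsing.
print(sample["hstore"].isna().tolist())
```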