I have a 1M-row df where one column is always 5000 characters, A-Z0-9.
I parse the long column into 972 columns with:

def parse_long_string(df):
    df['a001'] = df['long_string'].str[0:2]
    df['a002'] = df['long_string'].str[2:4]
    df['a003'] = df['long_string'].str[4:13]
    df['a004'] = df['long_string'].str[13:22]
    df['a005'] = df['long_string'].str[22:31]
    df['a006'] = df['long_string'].str[31:40]
    # ...
    df['a972'] = df['long_string'].str[4994:]
    return df
When I call the function, I get the following warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling
frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
From what I've read, this warning appears when creating more than 100 columns without specifying a dtype for the new columns; here each new column is automatically a string.
Other than

warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

is there another way to fix this?
I don't know how you ended up with a setup like this, but yes, I can reproduce the PerformanceWarning on a similar DataFrame. So here is one possible way to get rid of it, using pd.concat:
slices = {
    "a001": (0, 2),
    "a002": (2, 4),
    "a003": (4, 13),
    "a004": (13, 22),
    "a005": (22, 31),
    "a006": (31, 40),
    # ... add the rest here
    "a972": (4994, None),
}  # I used a dict but you could use a list as well
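Writing out all 972 entries by hand is tedious. Assuming you have the field widths in order (the widths below are just the first few implied by the question, shown for illustration), the mapping can be generated programmatically:

```python
from itertools import accumulate

# Hypothetical field widths for the first few columns; in practice,
# supply all 972 widths in order.
widths = [2, 2, 9, 9, 9, 9]

# Cumulative boundaries: [0, 2, 4, 13, 22, 31, 40]
bounds = [0, *accumulate(widths)]

# Pair consecutive boundaries into (start, end) slices, named a001, a002, ...
slices = {
    f"a{i:03d}": (start, end)
    for i, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1)
}

print(slices["a003"])  # (4, 13)
```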
def parse_long_string(df, mapper):
    new_cols = pd.concat(
        {
            col: df["long_string"].str[s:e]
            for col, (s, e) in mapper.items()
        },
        axis=1,
    )
    return df.join(new_cols)
out = parse_long_string(df, slices)
Output:
print(out)
long_string a001 a002 a003 a004 ... a968 a969 a970 a971 a972
0 ILR03X... IL R0 3X 3D ... wm xC 95 cZ GT
1 uluF81... ul uF 81 Jl ... 98 RE 80 wc Qk
2 NLRCIh... NL RC Ih t4 ... Xk os KL Ge lp
3 ScrgOj... Sc rg Oj GS ... nM 8T gy Ju 8z
4 saWtdD... sa Wt dD zN ... cf o2 xX hM ze
... ... ... ... ... ... ... ... ... ... ... ...
9995 4FxlzY... 4F xl zY 6b ... fi Mb V9 Vf bK
9996 hsjUFa... hs jU Fa fL ... Io ka SJ 73 hM
9997 Sr4zFU... Sr 4z FU 3c ... yb 6a AF lv P4
9998 q4eon1... q4 eo n1 Kg ... 9g u1 dq sj Wa
9999 5UxVXL... 5U xV XL f2 ... zC 6F 7T kE kt
[10000 rows x 973 columns]
Input used:
import numpy as np
import pandas as pd
import string

np.random.seed(0)
df = pd.DataFrame({
    "long_string": ["".join(np.random.choice(
        [*string.printable[:62]], size=5000)) for _ in range(10000)]
})
slices = {f"a{i+1:03d}": (i*2, (i+1)*2) for i in range(972)}
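As a quick sanity check (a minimal sketch using a smaller frame of 100 rows for speed), escalating the PerformanceWarning to an error confirms that building all 972 columns with one pd.concat never triggers it:

```python
import string
import warnings

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    "long_string": ["".join(np.random.choice(
        [*string.printable[:62]], size=5000)) for _ in range(100)]
})
slices = {f"a{i+1:03d}": (i*2, (i+1)*2) for i in range(972)}

# Raise if pandas would emit a PerformanceWarning anywhere in this block;
# the single concat + join path completes without one.
with warnings.catch_warnings():
    warnings.simplefilter("error", pd.errors.PerformanceWarning)
    new_cols = pd.concat(
        {col: df["long_string"].str[s:e] for col, (s, e) in slices.items()},
        axis=1,
    )
    out = df.join(new_cols)

print(out.shape)  # (100, 973)
```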