PerformanceWarning: DataFrame is highly fragmented when creating new DataFrame columns

Question

I am trying to set new DataFrame columns to simple calculations on existing DataFrames, but when I run the script I get warnings from pandas. This is the main code:

data_join['Ele_total'] = data_ele.sum(axis=1)
data_join['PV_total'] = data_pv.sum(axis=1)
data_join['SC'] = np.where(data_join['PV_total']>data_join['Ele_total'], data_join['Ele_total'], data_join['PV_total'])
data_join['SC%'] = np.where(data_join['PV_total']!= 0,round((data_join['SC']/data_join['PV_total'])*100,0),0)
data_join['SS%'] = np.where(data_join['Ele_total']!= 0,round((data_join['SC']/data_join['Ele_total'])*100,0),0)
data_join['LOLP'] = data_join['Ele_total']>data_join['PV_total']
data_join['E_tg'] = data_join['PV_total']-data_join['SC']
data_join['E_fg'] = data_join['Ele_total']-data_join['SC']
data_join['Ei'] = data_join['E_tg']-data_join['E_fg']
data_join['NGIP'] = data_join['Ei'].abs()<(GRID_LIM*n_build)
data_join['PAL'] = data_join['Ei'].abs()>(PEAK_LIM*n_build)
data_join['CO2'] = data_CO2['GWP']
data_join['CO2_net'] = data_CO2['GWP']*data_join['SC']
data_join['CO2_tot'] = data_CO2['GWP']*(data_join['E_tg']+data_join['SC'])



cash_flow = 0
npv = []
data_join_npv = pd.DataFrame()

for i in range (0,25):
    if i == 0:
        data_join_npv['PV_total_res_{}'.format(i)] = data_join_res['PV_total']
        data_join_npv['PV_total_ind_{}'.format(i)] = data_join_ind['PV_total']
    else:
        data_join_npv['PV_total_res_{}'.format(i)] = data_join_npv['PV_total_res_{}'.format(i-1)]*(1-d)
        data_join_npv['PV_total_ind_{}'.format(i)] = data_join_npv['PV_total_ind_{}'.format(i-1)]*(1-d)
    
    data_join_npv['SC_res_{}'.format(i)] = np.where(data_join_npv['PV_total_res_{}'.format(i)]>data_join_res['Ele_total'], data_join_res['Ele_total'], data_join_npv['PV_total_res_{}'.format(i)])
    data_join_npv['SC_ind_{}'.format(i)] = np.where(data_join_npv['PV_total_ind_{}'.format(i)]>data_join_ind['Ele_total'], data_join_ind['Ele_total'], data_join_npv['PV_total_ind_{}'.format(i)])
    data_join_npv['E_tg_res_{}'.format(i)] = data_join_npv['PV_total_res_{}'.format(i)]-data_join_npv['SC_res_{}'.format(i)]
    data_join_npv['E_tg_ind_{}'.format(i)] = data_join_npv['PV_total_ind_{}'.format(i)]-data_join_npv['SC_ind_{}'.format(i)]
    data_join_npv['E_fg_res_{}'.format(i)] = data_join_res['Ele_total']-data_join_npv['SC_res_{}'.format(i)]
    data_join_npv['E_fg_ind_{}'.format(i)] = data_join_ind['Ele_total']-data_join_npv['SC_ind_{}'.format(i)]
    cash = float(data_join_npv['SC_res_{}'.format(i)].sum())*COST_OF_ENERGY_RES + float(data_join_npv['E_tg_res_{}'.format(i)].sum())*VALUE_OF_ENERGY - float(data_join_npv['E_fg_res_{}'.format(i)].sum())*COST_OF_ENERGY_RES + float(data_join_npv['SC_ind_{}'.format(i)].sum())*COST_OF_ENERGY_IND + float(data_join_npv['E_tg_ind_{}'.format(i)].sum())*VALUE_OF_ENERGY - float(data_join_npv['E_fg_ind_{}'.format(i)].sum())*COST_OF_ENERGY_IND - OM_COST*total_pv
    cash_flow += cash/((1+DISC_RATE)**(i+1))
    npv.append(-in_inv+cash_flow)

These are the warnings I get:

C:\Users\Giacomo\Desktop\150\insert_data.py:342: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['E_tg_res_{}'.format(i)] = data_join_npv['PV_total_res_{}'.format(i)]-data_join_npv['SC_res_{}'.format(i)]
C:\Users\Giacomo\Desktop\150\insert_data.py:343: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['E_tg_ind_{}'.format(i)] = data_join_npv['PV_total_ind_{}'.format(i)]-data_join_npv['SC_ind_{}'.format(i)]
C:\Users\Giacomo\Desktop\150\insert_data.py:344: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['E_fg_res_{}'.format(i)] = data_join_res['Ele_total']-data_join_npv['SC_res_{}'.format(i)]
C:\Users\Giacomo\Desktop\150\insert_data.py:345: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['E_fg_ind_{}'.format(i)] = data_join_ind['Ele_total']-data_join_npv['SC_ind_{}'.format(i)]
C:\Users\Giacomo\Desktop\150\insert_data.py:337: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['PV_total_res_{}'.format(i)] = data_join_npv['PV_total_res_{}'.format(i-1)]*(1-d)
C:\Users\Giacomo\Desktop\150\insert_data.py:338: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['PV_total_ind_{}'.format(i)] = data_join_npv['PV_total_ind_{}'.format(i-1)]*(1-d)
C:\Users\Giacomo\Desktop\150\insert_data.py:340: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['SC_res_{}'.format(i)] = np.where(data_join_npv['PV_total_res_{}'.format(i)]>data_join_res['Ele_total'], data_join_res['Ele_total'], data_join_npv['PV_total_res_{}'.format(i)])
C:\Users\Giacomo\Desktop\150\insert_data.py:341: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  data_join_npv['SC_ind_{}'.format(i)] = np.where(data_join_npv['PV_total_ind_{}'.format(i)]>data_join_ind['Ele_total'], data_join_ind['Ele_total'], data_join_npv['PV_total_ind_{}'.format(i)])
C:\Users\Giacomo\Desktop\150\insert_data.py:342: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()

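I am not calling frame.insert(), which the warning names as the usual cause, so I don't understand why I get this fragmentation warning. The results are correct, but I have to run this code many times inside an optimizer, and I think the large number of warnings stops the optimizer at some point during the analysis, so I would like to resolve them.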

Answer 1

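You get these warnings because you repeatedly insert columns, one at a time, into the DataFrame data_join_npv, instead of concatenating them all together once, after and outside the for loop, which is more memory-efficient.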

For example, run this toy code:

import pandas as pd


df = pd.DataFrame({f"col{i}": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] for i in range(1_000)})

new_df = pd.DataFrame()
for i in range(1_000):  # insert one thousand columns
    new_df[f"new_df_col{i}"] = df[f"col{i}"]+i

print(new_df)

You will get the following output:

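PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  new_df[f"new_df_col{i}"] = df[f"col{i}"]+i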

   new_df_col0  new_df_col1  new_df_col2  ...  new_df_col997  new_df_col998  new_df_col999
0            0            1            2  ...            997            998            999
1            1            2            3  ...            998            999           1000
2            2            3            4  ...            999           1000           1001
3            3            4            5  ...           1000           1001           1002
4            4            5            6  ...           1001           1002           1003
5            5            6            7  ...           1002           1003           1004
6            6            7            8  ...           1003           1004           1005
7            7            8            9  ...           1004           1005           1006
8            8            9           10  ...           1005           1006           1007
9            9           10           11  ...           1006           1007           1008

[10 rows x 1000 columns]
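
As for why the warning mentions frame.insert even though the question never calls it: as far as I can tell, bracket assignment of a new column (frame[name] = value) goes through the same internal insertion path as frame.insert, so both fragment the frame in the same way. The sketch below is only an illustration under that assumption. The helper names (count_fragmentation_warnings, via_setitem, via_insert) and the column count are made up for the demo. It counts how many times PerformanceWarning is emitted for each style, and then checks that adding a column to the result of copy() is quiet again, as the warning text suggests.

```python
import warnings

import pandas as pd
from pandas.errors import PerformanceWarning

N_COLS = 150  # arbitrary; chosen to be large enough to fragment the frame


def count_fragmentation_warnings(add_column):
    """Add N_COLS columns one at a time and count PerformanceWarning emissions."""
    frame = pd.DataFrame(index=range(10))
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every emission, not just the first
        for i in range(N_COLS):
            add_column(frame, f"c{i}", i)
    n = sum(issubclass(w.category, PerformanceWarning) for w in caught)
    return frame, n


def via_setitem(frame, name, value):
    frame[name] = value  # bracket assignment, as in the question


def via_insert(frame, name, value):
    frame.insert(len(frame.columns), name, value)  # explicit insert


frag_a, n_setitem = count_fragmentation_warnings(via_setitem)
frag_b, n_insert = count_fragmentation_warnings(via_insert)
print("warnings via frame[name] = ...:", n_setitem)
print("warnings via frame.insert(...):", n_insert)

# copy() returns a de-fragmented frame, so one more column should not warn
tidy = frag_a.copy()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tidy["one_more"] = 0
n_after_copy = sum(issubclass(w.category, PerformanceWarning) for w in caught)
print("warnings after copy():", n_after_copy)
```

Note that copy() only resets the fragmentation; if you keep adding columns one by one afterwards, the warning will eventually come back.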

To avoid this, initialize an empty dictionary instead of a DataFrame, and use pandas concat:

import pandas as pd


df = pd.DataFrame({f"col{i}": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] for i in range(1_000)})

data = {}
for i in range(1_000):
    data[f"new_col{i}"] = df[f"col{i}"] + i

new_df = pd.concat(data.values(), axis=1, ignore_index=True)
new_df.columns = data.keys()  # since Python 3.7, order of insertion is preserved

print(new_df)

You will get the same DataFrame without any warning:

   new_col0  new_col1  new_col2  new_col3  ...  new_col996  new_col997  new_col998  new_col999
0         0         1         2         3  ...         996         997         998         999
1         1         2         3         4  ...         997         998         999        1000
2         2         3         4         5  ...         998         999        1000        1001
3         3         4         5         6  ...         999        1000        1001        1002
4         4         5         6         7  ...        1000        1001        1002        1003
5         5         6         7         8  ...        1001        1002        1003        1004
6         6         7         8         9  ...        1002        1003        1004        1005
7         7         8         9        10  ...        1003        1004        1005        1006
8         8         9        10        11  ...        1004        1005        1006        1007
9         9        10        11        12  ...        1005        1006        1007        1008

[10 rows x 1000 columns]
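
As a side note, the two-step "concat, then rename columns" pattern can probably be shortened. When pd.concat receives a dictionary, it uses the keys as labels along the concatenation axis, and the pd.DataFrame constructor accepts a dictionary of Series as well. The following is a small sketch, not a benchmark, comparing the three spellings on the same toy data; the variable names two_step, one_step and from_dict are just for this illustration.

```python
import pandas as pd

df = pd.DataFrame({f"col{i}": list(range(10)) for i in range(1_000)})

# Collect the new columns in a plain dict first
data = {f"new_col{i}": df[f"col{i}"] + i for i in range(1_000)}

# Two-step version: concatenate the values, then assign the names
two_step = pd.concat(data.values(), axis=1, ignore_index=True)
two_step.columns = data.keys()

# One-step variants: the dict keys are used as the column names
one_step = pd.concat(data, axis=1)
from_dict = pd.DataFrame(data)

print(list(one_step.columns[:3]))
print(one_step.shape, from_dict.shape)
```

All three build the frame in a single operation, so none of them should trigger the fragmentation warning.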

So, try refactoring your code like this:

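cash_flow = 0
npv = []
data_join_npv = {}  # a plain dict instead of pd.DataFrame()
for i in range(0, 25):
    # Loop body unchanged, except that each np.where(...) result must be wrapped
    # in pd.Series(..., index=...), because pd.concat rejects NumPy arrays.
    ...
# Build the DataFrame once, after the loop
df = pd.concat(data_join_npv.values(), axis=1, ignore_index=True)
df.columns = data_join_npv.keys()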
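
To make the refactoring more concrete, here is a minimal sketch of how the loop from the question might look with a dictionary. The input frames, their values, the degradation rate d and the name columns are made-up stand-ins, since the real data is not shown, and the cash-flow part is left out for brevity. One thing to watch: np.where returns a NumPy array rather than a Series, and pd.concat only accepts Series and DataFrame objects, so the sketch wraps the result in pd.Series with the matching index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's inputs; all values are made up
data_join_res = pd.DataFrame(
    {"PV_total": [0.0, 2.0, 5.0, 1.0], "Ele_total": [1.0, 1.5, 3.0, 2.0]}
)
data_join_ind = pd.DataFrame(
    {"PV_total": [0.0, 4.0, 8.0, 2.0], "Ele_total": [3.0, 3.0, 3.0, 3.0]}
)
d = 0.005  # assumed yearly PV degradation rate
N_YEARS = 25

columns = {}  # plain dict: adding a key never touches a DataFrame

for i in range(N_YEARS):
    for tag, src in (("res", data_join_res), ("ind", data_join_ind)):
        if i == 0:
            pv = src["PV_total"]
        else:
            # Previous year's production, read back from the dict
            pv = columns[f"PV_total_{tag}_{i - 1}"] * (1 - d)
        # np.where gives a NumPy array, so wrap it to keep a Series
        sc = pd.Series(
            np.where(pv > src["Ele_total"], src["Ele_total"], pv),
            index=pv.index,
        )
        columns[f"PV_total_{tag}_{i}"] = pv
        columns[f"SC_{tag}_{i}"] = sc
        columns[f"E_tg_{tag}_{i}"] = pv - sc
        columns[f"E_fg_{tag}_{i}"] = src["Ele_total"] - sc

# One concat after the loop; the dict keys become the column names
data_join_npv = pd.concat(columns, axis=1)
print(data_join_npv.shape)
```

The column order differs slightly from the original script (all res columns for a year come before the ind ones), which should not matter as long as columns are accessed by name. Sums such as columns[f"SC_res_{i}"].sum() can still be computed inside the loop for the cash-flow calculation, because each dictionary entry is an ordinary Series.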
