I'm working on a project where I need to build a new DataFrame from an existing one by randomly assigning, to each row, the name of one of several columns, with probability proportional to the number in that column.
However, my current implementation seems inefficient, especially on large datasets. I'm looking for advice on how to optimize this process for better performance.
Here is a simplified version of what I'm currently doing:
import pandas as pd
import numpy as np
# Sample DataFrame
data = {
    'dog': [1, 2, 3, 4],
    'cat': [5, 6, 7, 8],
    'parrot': [9, 10, 11, 12],
    'owner': ['fred', 'bob', 'jim', 'jannet']
}
df = pd.DataFrame(data)
# List of relevant columns
relevant_col_list = ['dog', 'cat', 'parrot']
# New DataFrame with the same number of rows
new_df = df.copy()
# Create 'iteration_1' column in new_df
new_df['iteration_1'] = ""
# Iterate over rows
for index, row in new_df.iterrows():
    # Copy columns not in relevant_col_list
    for column in new_df.columns:
        if column not in relevant_col_list and column != 'iteration_1':
            new_df.at[index, column] = row[column]
    # Randomly select a column from relevant_col_list with probability proportional to the number in the column
    probabilities = df[relevant_col_list].iloc[index] / df[relevant_col_list].iloc[index].sum()
    chosen_column = np.random.choice(relevant_col_list, p=probabilities)
    # Write the name of the chosen column in the 'iteration_1' column
    new_df.at[index, 'iteration_1'] = chosen_column
print(new_df)
How can I speed this up?
You can start by subsetting the DataFrame to the columns of interest, normalizing the weights per row, and then taking a cumulative sum (the tmp computation in the snippet below):
# cumulative probabilities (one row per DataFrame row)
array([[0.06666667, 0.4 , 1. ],
[0.11111111, 0.44444444, 1. ],
[0.14285714, 0.47619048, 1. ],
[0.16666667, 0.5 , 1. ]])
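Each row of this array splits the interval [0, 1] into bins whose widths are the normalized weights. In row 0, for example, a uniform draw of 0.5 exceeds 0.4 but not 1.0, so it lands in the third bin and 'parrot' (weight 9/15 = 0.6) would be picked.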
Then generate n random numbers between 0 and 1 and run a 2D searchsorted over the cumulative probabilities (the searchsorted2d helper below):
tmp = (df[relevant_col_list]
       .pipe(lambda x: x.div(x.sum(axis=1), axis=0))  # normalize weights per row
       .cumsum(axis=1).to_numpy()                     # cumulative probabilities per row
      )

r = np.random.random(len(df))

def searchsorted2d(a, b):
    # Shift each row of a (and the matching entry of b) by a per-row offset
    # large enough that rows occupy disjoint value ranges; a single flat
    # searchsorted then acts like an independent searchsorted per row.
    s = np.r_[0, (np.maximum(a.max(1) - a.min(1) + 1, b) + 1).cumsum()[:-1]]
    a_scaled = (a + s[:, None]).ravel()
    b_scaled = b + s
    # Map the flat indices back to per-row column indices
    return np.searchsorted(a_scaled, b_scaled) - np.arange(len(s)) * a.shape[1]

df['iteration_1'] = np.array(relevant_col_list)[searchsorted2d(tmp, r)]
Output (the picks are random, so your exact result will differ):

   dog  cat  parrot   owner iteration_1
0    1    5       9    fred      parrot
1    2    6      10     bob         dog
2    3    7      11     jim      parrot
3    4    8      12  jannet         cat
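If you want to avoid the row-offset trick, a broadcasted comparison against the same cumulative-probability array performs the identical per-row lookup. A minimal self-contained sketch (my own variant, not part of the answer above; it builds an n_rows x n_cols boolean array, so it suits a modest number of relevant columns):

import numpy as np
import pandas as pd

data = {
    'dog': [1, 2, 3, 4],
    'cat': [5, 6, 7, 8],
    'parrot': [9, 10, 11, 12],
    'owner': ['fred', 'bob', 'jim', 'jannet'],
}
df = pd.DataFrame(data)
relevant_col_list = ['dog', 'cat', 'parrot']

# Row-wise cumulative probabilities, exactly as in the answer above
tmp = (df[relevant_col_list]
       .pipe(lambda x: x.div(x.sum(axis=1), axis=0))
       .cumsum(axis=1).to_numpy())

r = np.random.random(len(df))

# Count, per row, how many cumulative boundaries lie strictly below the draw;
# that count is the index of the chosen column (a row-wise left searchsorted).
idx = (tmp < r[:, None]).sum(axis=1)
idx = np.minimum(idx, len(relevant_col_list) - 1)  # guard against float rounding in the last cumsum
df['iteration_1'] = np.array(relevant_col_list)[idx]
print(df)

Counting the boundaries strictly below each draw returns the same index a left-side searchsorted would, without needing to flatten and offset the array.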