How can I make random column selection and assignment in a Pandas DataFrame more efficient?

Question (votes: 0, answers: 1)

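I'm working on a project where I need to build a new DataFrame from an existing one: for each row, I randomly pick one of several columns and record its name, with the probability of picking a column proportional to the number stored in that column for that row.

However, my current implementation loops over the rows one at a time and becomes slow on large datasets. I'm looking for suggestions on how to optimize it.

Here is a simplified version of what I'm currently doing: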

import pandas as pd
import numpy as np

# Sample DataFrame
data = {
    'dog': [1, 2, 3, 4],
    'cat': [5, 6, 7, 8],
    'parrot': [9, 10, 11, 12],
    'owner': ['fred', 'bob', 'jim', 'jannet']
}
df = pd.DataFrame(data)

# List of relevant columns
relevant_col_list = ['dog', 'cat', 'parrot']

# New DataFrame with the same number of rows
new_df = df.copy()

# Create 'iteration_1' column in new_df
new_df['iteration_1'] = ""

# Iterate over rows
for index, row in new_df.iterrows():
    # Copy columns not in relevant_col_list 
    for column in new_df.columns:
        if column not in relevant_col_list and column != 'iteration_1':
            new_df.at[index, column] = row[column]
    
    # Randomly select a column from relevant_col_list with probability proportional to the number in the column
    probabilities = df[relevant_col_list].iloc[index] / df[relevant_col_list].iloc[index].sum()
    chosen_column = np.random.choice(relevant_col_list, p=probabilities)
    
    # Write the name of the chosen column in the 'iteration_1' column
    new_df.at[index, 'iteration_1'] = chosen_column

print(new_df)

How can I speed this up?

python pandas performance random
1 answer
2 votes

You can first select the columns of interest, normalize the weights so that each row sums to 1, and then take a cumulative sum along each row.

# cumulated probabilities
array([[0.06666667, 0.4       , 1.        ],
       [0.11111111, 0.44444444, 1.        ],
       [0.14285714, 0.47619048, 1.        ],
       [0.16666667, 0.5       , 1.        ]])
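Why the cumulative sum helps: each row splits the interval from 0 to 1 into segments whose widths equal the column probabilities, so a uniform random number lands in a column's segment with exactly that probability. Here is a minimal sketch for the first row only; the three draw values are hand-picked for illustration and do not come from the original answer.

```python
import numpy as np

columns = ['dog', 'cat', 'parrot']
weights = np.array([1, 5, 9])               # first row of the sample DataFrame
cum = np.cumsum(weights / weights.sum())    # approx. [0.0667, 0.4, 1.0]

# Hand-picked draws, one inside each segment (illustrative values only).
draws = np.array([0.05, 0.30, 0.90])

# searchsorted returns, for each draw, the first position whose cumulative
# probability is >= the draw, i.e. the segment the draw falls into.
picked = [columns[i] for i in np.searchsorted(cum, draws)]
print(picked)
```

With these draws, the first falls below 0.0667 (dog), the second between 0.0667 and 0.4 (cat), and the third above 0.4 (parrot).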

Then generate n random numbers between 0 and 1 (one per row) and perform a 2D searchsorted (using the approach from here):

tmp = (df[relevant_col_list]
       .pipe(lambda x: x.div(x.sum(axis=1), axis=0))
       .cumsum(axis=1).to_numpy()
      )

r = np.random.random(len(df))

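def searchsorted2d(a, b):
    # Row-wise np.searchsorted: a is (n, k) with each row sorted, b is (n,).
    # Each row (and its query value) is shifted by a growing offset s so the
    # rows do not overlap once flattened and one searchsorted call suffices.
    s = np.r_[0, (np.maximum(a.max(1) - a.min(1) + 1, b) + 1).cumsum()[:-1]]
    a_scaled = (a + s[:, None]).ravel()
    b_scaled = b + s
    # Subtract each row's starting position to recover per-row indices.
    return np.searchsorted(a_scaled, b_scaled) - np.arange(len(s)) * a.shape[1]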

df['iteration_1'] = np.array(relevant_col_list)[searchsorted2d(tmp, r)]
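When the number of candidate columns is small, a plain broadcast comparison may be a simpler alternative to a row-wise searchsorted helper: counting how many cumulative probabilities in a row are strictly below the random draw gives the same index as a left-sided searchsorted. The sketch below is not part of the original answer. It assumes the question's sample data, uses `np.random.default_rng` with an arbitrary seed for reproducibility, and adds a clip as a guard against floating-point round-off in the last cumulative value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'dog': [1, 2, 3, 4],
    'cat': [5, 6, 7, 8],
    'parrot': [9, 10, 11, 12],
    'owner': ['fred', 'bob', 'jim', 'jannet'],
})
relevant_col_list = ['dog', 'cat', 'parrot']

# Row-wise normalized weights, then cumulative probabilities per row.
weights = df[relevant_col_list].to_numpy(dtype=float)
cum = (weights / weights.sum(axis=1, keepdims=True)).cumsum(axis=1)

rng = np.random.default_rng(0)   # seed chosen arbitrarily (assumption)
r = rng.random(len(df))          # one uniform draw in [0, 1) per row

# Count the cumulative probabilities strictly below each draw;
# this equals np.searchsorted(row, draw) applied row by row.
idx = (cum < r[:, None]).sum(axis=1)

# Guard: if round-off leaves the last cumulative value slightly below 1,
# keep the index inside the valid range.
idx = np.minimum(idx, len(relevant_col_list) - 1)

df['iteration_1'] = np.array(relevant_col_list)[idx]
print(df)
```

This builds an n-by-k boolean array, so it trades a little memory for simplicity; for many columns a searchsorted-based approach should scale better.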

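Output (the column names col1 to col4 appear to come from the answer's own test frame rather than the question's dog/cat/parrot/owner columns; the chosen columns are random and will differ between runs):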

   col1  col2  col3  col4 iteration_1
0     1     5     9    13        col3
1     2     6    10    14        col1
2     3     7    11    15        col3
3     4     8    12    16        col2