Is there an optimized way to perform repeated sampling in Python, similar to "rep_sample_n" in R/dplyr?

Question

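I am looking for an optimized way to create a sampling distribution in Python, similar to rep_sample_n in dplyr. Currently I call df.sample inside a list comprehension, pd.concat([df.sample(size) for n in range(N)]), which is syntactically intuitive but gets quite slow as the number of samples grows. Most of the options I have seen here involve for loops or list comprehensions.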

For clarity, here is an example that also tracks the replicate number of each sample:

import pandas as pd

df = pd.DataFrame({'value': range(3)})

sample_size = 2
replicates = 5

pd.concat([
    df.sample(sample_size).assign(replicate=rep)
    for rep in range(replicates)
])

Output:

   value  replicate
0      0          0
1      1          0
2      2          1
0      0          1
2      2          2
1      1          2
2      2          3
0      0          3
1      1          4
0      0          4

Answer 1

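Judging by your code's output, which is effectively an N*size x 1 array, I think you can skip concatenating the individual samples in a loop: generate all the sample indices up front, then index your dataframe with them once: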

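import numpy as np

# N is the total number of rows to draw (replicates * sample size)
N = 100
# np.random.choice samples with replacement by default.
# You can use np.random.randint() if df.index is 0, 1, 2...
indexes = np.random.choice(df.index, size=N)
# The drawn values are index labels, so select with .loc rather than .iloc
resampled_df = df.loc[indexes]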
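
Note that drawing all indices in one call samples with replacement across the whole result, whereas df.sample(size) in the question samples without replacement within each replicate. A minimal sketch of one way to keep that per-replicate behavior and the replicate column without a Python-level loop is shown below. The helper name rep_sample_n and its parameters are hypothetical (this is not a pandas or NumPy function), and the random-key argsort trick is just one possible approach:

```python
import numpy as np
import pandas as pd


def rep_sample_n(df, size, reps, replace=False, seed=None):
    """Hypothetical helper: draw `reps` samples of `size` rows each in one shot."""
    rng = np.random.default_rng(seed)
    n = len(df)
    if replace:
        # With replacement every draw is independent, so one call is enough.
        pos = rng.integers(0, n, size=(reps, size))
    else:
        if size > n:
            raise ValueError("size cannot exceed len(df) when replace=False")
        # Sorting random keys row by row yields one random permutation of
        # 0..n-1 per replicate; the first `size` entries are a sample
        # without replacement.
        pos = np.argsort(rng.random((reps, n)), axis=1)[:, :size]
    # Positional indexing, so this works for any index labels.
    replicate = np.repeat(np.arange(reps), size)
    return df.iloc[pos.ravel()].assign(replicate=replicate)


df = pd.DataFrame({"value": range(3)})
out = rep_sample_n(df, size=2, reps=5, seed=0)
print(out)
```

The without-replacement branch builds a reps x len(df) array of random keys, so it trades memory for speed. For a very large dataframe with a small sample size, the original list comprehension may still be the more practical choice.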
