Most efficient way to do conditional sampling on my large df

Question · Votes: 0 · Answers: 1

I have a large DataFrame (~35 million rows) and I'm trying to build a new DataFrame by randomly sampling two rows from each unique cluster ID (~1.8 million unique cluster IDs): one row must have label 0 and one must have label 1 (sometimes only one of the labels is present, so I first have to check that both labels exist in the cluster). For reference, my dataset has 3 main columns: "embeddings", "cluster_ID", "label".

I've found that this process takes much longer than I expected, and I'm wondering whether there's a way to optimize my code.

import pandas as pd
import random

# Create an empty list to store selected rows
selected_rows = []

# Iterate over unique cluster IDs
for cluster_id in result_df_30['cluster_ID'].unique():
    # Filter the DataFrame for the current cluster ID
    df_cluster = result_df_30[result_df_30['cluster_ID'] == cluster_id]
    
    # Filter rows with label 0 and 1
    df_label_0 = df_cluster[df_cluster['label'] == 0]
    df_label_1 = df_cluster[df_cluster['label'] == 1]
    
    # Sample rows if they exist
    if not df_label_0.empty:
        sample_label_0 = df_label_0.sample(n=1, random_state=42)
        selected_rows.append(sample_label_0)
    if not df_label_1.empty:
        sample_label_1 = df_label_1.sample(n=1, random_state=42)
        selected_rows.append(sample_label_1)

# Concatenate the selected rows into a single DataFrame
selected_rows_df = pd.concat(selected_rows)

selected_rows_df

python pandas dataframe numpy sample
1 Answer · 0 votes

This requires your DataFrame to be indexed sequentially (0, 1, ..., ~35M), which is easy enough to ensure:

df = df.reset_index(drop=True)

import numpy as np
import pandas as pd

# Some test data
n = 35_000_000
df = pd.DataFrame(
    {
        "embeddings": np.random.rand(n),
        "cluster_ID": np.random.randint(0, 1_800_000, n),
        "label": np.random.randint(0, 100, n),
    }
)

tmp = (
    # since you only care about rows with label 0 or 1
    df[df["label"].isin([0, 1])]  
    # shuffle the rows
    .sample(frac=1) 
    # reset the index
    .reset_index()  
    # for each Cluster ID and label, get the first row
    .pivot_table(index="cluster_ID", columns="label", values="index", aggfunc="first")
)

# keep only clusters that have both labels, and collect their row positions
idx = tmp[tmp.notna().all(axis=1)].to_numpy("int").flatten()

# and your random sample
sample = df.iloc[idx]
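To sanity-check the approach, the same pipeline can be run on a small synthetic frame and the output verified. The column names mirror the example above; the sizes (`n = 10_000`, 500 clusters) are scaled-down assumptions so it runs instantly:

```python
import numpy as np
import pandas as pd

# Scaled-down test data with the same column layout as the answer's example
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame(
    {
        "embeddings": rng.random(n),
        "cluster_ID": rng.integers(0, 500, n),
        "label": rng.integers(0, 100, n),
    }
)

tmp = (
    df[df["label"].isin([0, 1])]        # only labels 0 and 1 matter
    .sample(frac=1, random_state=42)    # shuffle so "first" is a random pick
    .reset_index()                      # expose row positions as a column
    .pivot_table(index="cluster_ID", columns="label", values="index", aggfunc="first")
)
idx = tmp[tmp.notna().all(axis=1)].to_numpy("int").flatten()
sample = df.iloc[idx]

# Each selected cluster contributes exactly one label-0 and one label-1 row
counts = sample.groupby(["cluster_ID", "label"]).size()
assert (counts == 1).all()
assert set(sample["label"]) <= {0, 1}
assert len(sample) == 2 * sample["cluster_ID"].nunique()
```

Note that `sample(frac=1)` is what makes `aggfunc="first"` a uniform random choice: after the shuffle, "first row per (cluster, label)" is an arbitrary row from that group.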