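I have a very large DataFrame (about 35 million rows), and I am trying to build a new DataFrame by randomly sampling two rows from each unique cluster ID (about 1.8 million unique cluster IDs): one row must have label 0 and the other must have label 1. Some clusters contain only one of the labels, so I first have to check that both labels are present in the cluster. For reference, my dataset has three main columns: "embeddings", "cluster_ID", and "label".
This process is taking much longer than I expected, and I would like to know whether there is a way to optimize my code.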
import pandas as pd
import random

# Create an empty list to store selected rows
selected_rows = []

# Iterate over unique cluster IDs
for cluster_id in result_df_30['cluster_ID'].unique():
    # Filter the DataFrame for the current cluster ID
    df_cluster = result_df_30[result_df_30['cluster_ID'] == cluster_id]

    # Filter rows with label 0 and 1
    df_label_0 = df_cluster[df_cluster['label'] == 0]
    df_label_1 = df_cluster[df_cluster['label'] == 1]

    # Sample rows if they exist
    if not df_label_0.empty:
        sample_label_0 = df_label_0.sample(n=1, random_state=42)
        selected_rows.append(sample_label_0)
    if not df_label_1.empty:
        sample_label_1 = df_label_1.sample(n=1, random_state=42)
        selected_rows.append(sample_label_1)

# Concatenate the selected rows into a single DataFrame
selected_rows_df = pd.concat(selected_rows)
selected_rows_df
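The approach below requires your dataframe to be indexed sequentially (0, 1, ..., n - 1, with n about 35 million), because the sampled index values are later used as row positions. If it is not, that is easy to fix (drop=True discards the old index instead of adding it as an "index" column, which would clash with the reset_index() call further down):
df = df.reset_index(drop=True)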
import numpy as np
import pandas as pd

# Some test data
n = 35_000_000
df = pd.DataFrame(
    {
        "embeddings": np.random.rand(n),
        "cluster_ID": np.random.randint(0, 1_800_000, n),
        "label": np.random.randint(0, 100, n),
    }
)

tmp = (
    # since you only care about rows with label 0 or 1
    df[df["label"].isin([0, 1])]
    # shuffle the rows
    .sample(frac=1)
    # reset the index; the original row numbers are kept in a new "index" column
    .reset_index()
    # for each cluster ID and label, get the original row number of the first row
    .pivot_table(index="cluster_ID", columns="label", values="index", aggfunc="first")
)

# get the row numbers for clusters that have both labels
idx = tmp[tmp.notna().all(axis=1)].to_numpy("int").flatten()

# and your random sample
sample = df.iloc[idx]
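If resetting the index is inconvenient, one possible variation on the same shuffle-then-take-first idea is sketched below. It is not the code above: the helper name sample_pairs and the tiny test frame are made up for illustration, and it assumes the labels contain no missing values. It shuffles once, keeps the first row per (cluster_ID, label) pair with drop_duplicates, and then drops clusters that did not yield both labels.

```python
import numpy as np
import pandas as pd


def sample_pairs(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Hypothetical helper: one random label-0 row and one random label-1 row
    per cluster, keeping only clusters that contain both labels."""
    # keep only the two labels of interest, then shuffle the rows once
    shuffled = df[df["label"].isin([0, 1])].sample(frac=1, random_state=seed)
    # after shuffling, the first row of each (cluster_ID, label) pair is a random pick
    picked = shuffled.drop_duplicates(subset=["cluster_ID", "label"])
    # a cluster that has both labels contributes exactly two rows
    counts = picked.groupby("cluster_ID")["label"].transform("size")
    return picked[counts == 2].sort_values(["cluster_ID", "label"])


# small made-up frame: cluster 3 only has label 1, so it should be dropped
small = pd.DataFrame(
    {
        "embeddings": np.arange(8, dtype=float),
        "cluster_ID": [1, 1, 1, 2, 2, 2, 3, 3],
        "label": [0, 0, 1, 1, 0, 5, 1, 1],
    }
)
result = sample_pairs(small)
print(result)
```

Because rows are selected by boolean mask rather than by position, this sketch works with any index; whether it is faster or slower than the pivot_table version on 35 million rows would need to be measured.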