Sample from each group in a Polars DataFrame?

Question

I'm looking for functionality similar to

df.groupby('column').agg(sample(10))

so that I can pick ten or so random elements from each group.

This is specifically so that I can read in a LazyFrame and work with a small sample of each group rather than the entire dataframe.

Update:

One approximate solution is:

df = lf.groupby('column').agg(
        pl.all().sample(.001)
    )
df = df.explode(df.columns[1:])

Update 2

That approximate solution is equivalent to sampling the whole dataframe and grouping afterwards. No good.

Answer 1

Let's start with some dummy data:

n = 100
seed = 0
df = pl.DataFrame(
    {
        "groups": (pl.int_range(0, n, eager=True) % 5).shuffle(seed=seed),
        "values": pl.int_range(0, n, eager=True).shuffle(seed=seed)
    }
)
df

shape: (100, 2)
┌────────┬────────┐
│ groups ┆ values │
│ ---    ┆ ---    │
│ i64    ┆ i64    │
╞════════╪════════╡
│ 0      ┆ 55     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0      ┆ 40     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 57     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 99     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 87     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1      ┆ 96     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3      ┆ 43     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 44     │
└────────┴────────┘

That is 100 / 5, i.e. 5 groups of 20 elements each. Let's verify that:

df.groupby("groups").agg(pl.count())

shape: (5, 2)
┌────────┬───────┐
│ groups ┆ count │
│ ---    ┆ ---   │
│ i64    ┆ u32   │
╞════════╪═══════╡
│ 1      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2      ┆ 20    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0      ┆ 20    │
└────────┴───────┘

Sampling our data

Now we are going to use a window function to take a sample of our data.

df.filter(
    pl.int_range(0, pl.count()).shuffle().over("groups") < 10
)

shape: (50, 2)
┌────────┬────────┐
│ groups ┆ values │
│ ---    ┆ ---    │
│ i64    ┆ i64    │
╞════════╪════════╡
│ 0      ┆ 85     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0      ┆ 0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 84     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 19     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2      ┆ 87     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1      ┆ 96     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3      ┆ 43     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4      ┆ 44     │
└────────┴────────┘

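For each group in over("groups"), the pl.int_range(0, pl.count()) expression creates a row index. We then shuffle that range so that we take a sample rather than a slice. Finally, we keep only the index values lower than 10. This creates a boolean mask that we can pass to the filter method.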

Answer 2

We can try building our own groupby-like functionality and sampling from the filtered subsets.

samples = []
cats = df.get_column('column').unique().to_list()
for cat in cats:
    samples.append(df.filter(pl.col('column') == cat).sample(10))
samples = pl.concat(samples)

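I found partition_by in the documentation. This should be more efficient, since at least the groups are created through the API and in a single pass over the dataframe. Unfortunately, sampling each group is still linear.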

pl.concat([x.sample(10) for x in df.partition_by("column")])

Third attempt, sampling the indices:

import numpy as np
import random

indices = df.groupby("group").agg(pl.col("value").agg_groups()).get_column("value").to_list()
sampled = np.array([random.sample(x, 10) for x in indices]).flatten()
df[sampled]

Answer 3

This worked better for me:

sampled_df = pl.concat(
    # pass the fraction by keyword; the first positional argument of sample() is n
    group.sample(fraction=0.001)
    for group in df.partition_by(["column"], include_key=True)
)

The problem with .agg(pl.col("column").sample(2)) is that it seems to pick different values for each column. What I need are randomly selected rows.
