How to compute the mode in an aggregation with Python Polars

Question

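I am working on a data mining project and have run into a problem during feature engineering. One of my goals is to aggregate the data by its primary key and generate new columns, so I wrote this: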

df = df.group_by("case_id").agg(date_expr(df, df_base))

def date_expr(df, df_base):
    # Join df and df_base on 'case_id' column
    df = df.join(df_base[['case_id','date_decision']], on="case_id", how="left")

    for col in df.columns:
        if col[-1] in ("D",):
            df = df.with_columns(pl.col(col) - pl.col("date_decision"))
            df = df.with_columns(pl.col(col).dt.total_days())

    cols = [col for col in df.columns if col[-1] in ("D",)]

    # Generate expressions for max, min, mean, mode, and std of date differences
    expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
    expr_min = [pl.min(col).alias(f"min_{col}") for col in cols]
    expr_mean = [pl.mean(col).alias(f"mean_{col}") for col in cols]
    expr_mode = [pl.mode(col).alias(f"mode_{col}") for col in cols]
    expr_std = [pl.std(col).alias(f"std_{col}") for col in cols]

    return expr_max + expr_min + expr_mean + expr_mode + expr_std

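However, it raises an error: AttributeError: module 'polars' has no attribute 'mode'.

I looked through the Polars documentation on GitHub and found that there is no DataFrame.mode(), only Series.mode(). Could that be the cause of the error? I also asked ChatGPT, which did not help, since this faulty code came from it in the first place.

Also, this is only an example that handles float columns. What about string columns? Can I apply your approach to them as well?

I look forward to your help!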

Answer 1

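Your example fails because there is no top-level shorthand for Expr.mode() the way there is for aggregate functions (for example, pl.max() is syntactic sugar for Expr.max()). mode() is not actually an aggregation function but a computation one, which means it simply calculates the most frequently occurring value(s) in the column.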

So, given a DataFrame like this:

df = (
    pl.DataFrame({
        'aD' : [200, 200, 300, 400, 1, 3],
        'bD': [2, 3, 6, 4, 5, 1],
        'case_id': [1,1,1,2,2,2]
    })
)

┌─────┬─────┬─────────┐
│ aD  ┆ bD  ┆ case_id │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64     │
╞═════╪═════╪═════════╡
│ 200 ┆ 2   ┆ 1       │
│ 200 ┆ 3   ┆ 1       │
│ 300 ┆ 6   ┆ 1       │
│ 400 ┆ 4   ┆ 2       │
│ 1   ┆ 5   ┆ 2       │
│ 3   ┆ 1   ┆ 2       │
└─────┴─────┴─────────┘

you can compute mode() with the following code:

df.with_columns(
    pl.col('aD').mode(),
    pl.col('bD').mode()
)

┌─────┬─────┬─────────┐
│ aD  ┆ bD  ┆ case_id │
│ --- ┆ --- ┆ ---     │
│ i64 ┆ i64 ┆ i64     │
╞═════╪═════╪═════════╡
│ 200 ┆ 1   ┆ 1       │
│ 200 ┆ 5   ┆ 1       │
│ 200 ┆ 6   ┆ 1       │
│ 200 ┆ 4   ┆ 2       │
│ 200 ┆ 2   ┆ 2       │
│ 200 ┆ 3   ┆ 2       │
└─────┴─────┴─────────┘

Given that, we can still compute the result you need. I will use selectors and Expr.name.prefix() to simplify your function:

import polars.selectors as cs

def date_expr():
    # Generate expressions for max, min, mean, mode, and std of date differences
    expr_max = cs.ends_with('D').max().name.prefix("max_")
    expr_min = cs.ends_with('D').min().name.prefix("min_")
    expr_mean = cs.ends_with('D').mean().name.prefix("mean_")
    expr_mode = cs.ends_with('D').mode().first().name.prefix("mode_")
    expr_std = cs.ends_with('D').std().name.prefix("std_")

    return expr_max, expr_min, expr_mean, expr_std, expr_mode

df.group_by("case_id").agg(date_expr())

┌─────────┬────────┬────────┬────────┬───┬────────────┬──────────┬─────────┬─────────┐
│ case_id ┆ max_aD ┆ max_bD ┆ min_aD ┆ … ┆ std_aD     ┆ std_bD   ┆ mode_aD ┆ mode_bD │
│ ---     ┆ ---    ┆ ---    ┆ ---    ┆   ┆ ---        ┆ ---      ┆ ---     ┆ ---     │
│ i64     ┆ i64    ┆ i64    ┆ i64    ┆   ┆ f64        ┆ f64      ┆ i64     ┆ i64     │
╞═════════╪════════╪════════╪════════╪═══╪════════════╪══════════╪═════════╪═════════╡
│ 2       ┆ 400    ┆ 5      ┆ 1      ┆ … ┆ 229.787583 ┆ 2.081666 ┆ 3       ┆ 4       │
│ 1       ┆ 300    ┆ 6      ┆ 200    ┆ … ┆ 57.735027  ┆ 2.081666 ┆ 200     ┆ 2       │
└─────────┴────────┴────────┴────────┴───┴────────────┴──────────┴─────────┴─────────┘

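Note that I used Expr.first() to pick one of the values returned by mode, because several distinct values may share the same highest frequency. You can use list expressions to specify which one you would like to get.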
