How do I filter duplicates based on multiple columns in Polars?


I was previously able to filter duplicates based on multiple columns with df.filter(pl.col(['A','C']).is_duplicated()), but after updating to the latest version this no longer works.

import polars as pl


df = pl.DataFrame(
    {
        "A": [1,4,4,7,7,10,10,13,16],
        "B": [2,5,5,8,18,11,11,14,17],
        "C": [3,6,6,9,9,12,12,15,18]        
    }
)
df.filter(pl.col(['A','C']).is_duplicated())

This raises an error.

df.filter(df.select(
    pl.col(['A','C']).is_duplicated()
    )
)

This also raises an error.

python python-polars
2 Answers

1 vote

This behavior was deemed ambiguous as of 0.16.10, which returns this error:

exceptions.ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expressions: 

        col("A").is_duplicated(),
        col("C").is_duplicated(),
This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.

However, 0.19.0 removed the deprecated all/any behavior and replaced it with all_horizontal and any_horizontal. To get the same behavior as in versions before 0.16.10, use:

df.filter(pl.all_horizontal(pl.col(['A','C']).is_duplicated()))

I modified the input slightly to show the difference between any_horizontal and all_horizontal:

import polars as pl

df = pl.DataFrame(
    {
        "A": [1,3,4,7,7,10,10,13,16],
        "B": [2,5,5,8,18,11,11,14,17],
        "C": [3,6,6,9,9,12,12,15,18]        
    }
)

# print("legacy run in 0.16.9: ", df.filter(pl.col(['A','C']).is_duplicated()))
print("all_horizontal: ", df.filter(pl.all_horizontal(pl.col(['A','C']).is_duplicated())))
print("any_horizontal: ", df.filter(pl.any_horizontal(pl.col(['A','C']).is_duplicated())))

legacy run in 0.16.9:  shape: (4, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 7   ┆ 8   ┆ 9   │
│ 7   ┆ 18  ┆ 9   │
│ 10  ┆ 11  ┆ 12  │
│ 10  ┆ 11  ┆ 12  │
└─────┴─────┴─────┘

all_horizontal:  shape: (4, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 7   ┆ 8   ┆ 9   │
│ 7   ┆ 18  ┆ 9   │
│ 10  ┆ 11  ┆ 12  │
│ 10  ┆ 11  ┆ 12  │
└─────┴─────┴─────┘

any_horizontal:  shape: (6, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 3   ┆ 5   ┆ 6   │
│ 4   ┆ 5   ┆ 6   │
│ 7   ┆ 8   ┆ 9   │
│ 7   ┆ 18  ┆ 9   │
│ 10  ┆ 11  ┆ 12  │
│ 10  ┆ 11  ┆ 12  │
└─────┴─────┴─────┘

0 votes

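This does not give the intended result for string columns, because each column is checked for duplicates independently. See below: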

import polars as pl

df = pl.DataFrame(
    {
        "row": range(6),
        "animal": ["dog", "dog", "cat", "cat", "fish", "fish"],
        "color": ["blue", "brown", "blue", "brown", "red", "yellow"],
    }
)

print(df.filter(pl.all_horizontal(pl.col(['animal','color']).is_duplicated()))) 
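shape: (4, 3)
┌─────┬────────┬─────────┐
│ row ┆ animal ┆ color   │
│ --- ┆ ---    ┆ ---     │
│ i64 ┆ str    ┆ str     │
╞═════╪════════╪═════════╡
│ 0   ┆ "dog"  ┆ "blue"  │
│ 1   ┆ "dog"  ┆ "brown" │
│ 2   ┆ "cat"  ┆ "blue"  │
│ 3   ┆ "cat"  ┆ "brown" │
└─────┴────────┴─────────┘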

For this use case, you should use this approach from the Polars team:

df.filter(pl.struct('animal','color').is_duplicated())
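shape: (0, 3)
┌─────┬────────┬───────┐
│ row ┆ animal ┆ color │
│ --- ┆ ---    ┆ ---   │
│ i64 ┆ str    ┆ str   │
╞═════╪════════╪═══════╡
└─────┴────────┴───────┘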

The latter correctly returns no rows, because no (animal, color) combination occurs more than once.
