I used to be able to filter duplicates based on multiple columns with df.filter(pl.col(['A','C']).is_duplicated()), but after the latest version update this no longer works.
import polars as pl
df = pl.DataFrame(
{
"A": [1,4,4,7,7,10,10,13,16],
"B": [2,5,5,8,18,11,11,14,17],
"C": [3,6,6,9,9,12,12,15,18]
}
)
df.filter(pl.col(['A','C']).is_duplicated())
This raises an error.
df.filter(df.select(
pl.col(['A','C']).is_duplicated()
)
)
This also raises an error.
This behavior was deemed ambiguous in 0.16.10, which returns this error:
exceptions.ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expressions:
col("A").is_duplicated(),
col("C").is_duplicated(),
This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.
However, 0.19.0 removed the deprecated all/any behavior and replaced it with all_horizontal and any_horizontal. To get the same behavior as in versions before 0.16.10, use df.filter(pl.all_horizontal(pl.col(['A','C']).is_duplicated()))
I modified the input slightly to show the difference between any_horizontal and all_horizontal:
import polars as pl
df = pl.DataFrame(
{
"A": [1,3,4,7,7,10,10,13,16],
"B": [2,5,5,8,18,11,11,14,17],
"C": [3,6,6,9,9,12,12,15,18]
}
)
# print("legacy run in 0.16.9: ", df.filter(pl.col(['A','C']).is_duplicated()))
print("all_horizontal: ", df.filter(pl.all_horizontal(pl.col(['A','C']).is_duplicated())))
print("any_horizontal: ", df.filter(pl.any_horizontal(pl.col(['A','C']).is_duplicated())))
legacy run in 0.16.9: shape: (4, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 7   ┆ 8   ┆ 9   │
│ 7   ┆ 18  ┆ 9   │
│ 10  ┆ 11  ┆ 12  │
│ 10  ┆ 11  ┆ 12  │
└─────┴─────┴─────┘
all_horizontal: shape: (4, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 7   ┆ 8   ┆ 9   │
│ 7   ┆ 18  ┆ 9   │
│ 10  ┆ 11  ┆ 12  │
│ 10  ┆ 11  ┆ 12  │
└─────┴─────┴─────┘
any_horizontal: shape: (6, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 3   ┆ 5   ┆ 6   │
│ 4   ┆ 5   ┆ 6   │
│ 7   ┆ 8   ┆ 9   │
│ 7   ┆ 18  ┆ 9   │
│ 10  ┆ 11  ┆ 12  │
│ 10  ┆ 11  ┆ 12  │
└─────┴─────┴─────┘
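This does not give the intended result for string columns, because each column is checked separately and a row can match even when its combination of values is unique. See below: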
import polars as pl
df = pl.DataFrame(
{
"row": range(6),
"animal": ["dog", "dog", "cat", "cat", "fish", "fish"],
"color": ["blue", "brown", "blue", "brown", "red", "yellow"],
}
)
print(df.filter(pl.all_horizontal(pl.col(['animal','color']).is_duplicated())))
shape: (4, 3)
row  animal  color
i64  str     str
0    "dog"   "blue"
1    "dog"   "brown"
2    "cat"   "blue"
3    "cat"   "brown"
You should instead use this approach from the polars team, which treats the columns together as a single struct:
df.filter(pl.struct('animal','color').is_duplicated())
shape: (0, 3)
row  animal  color
i64  str     str
The latter correctly returns no rows, because no (animal, color) combination actually occurs more than once.