I am trying to apply a simple value_counts() to multiple columns of a Polars dataframe, but I get an error.
import polars as pl
import pandas as pd
Data:
sample_df = pl.DataFrame({'sub-category': ['tv','mobile','tv','wm','micro','wm'],
'category': ['electronics','mobile','electronics','electronics','kitchen','electronics']})
Failed attempts:
#1
sample_df.apply(value_counts())
#2
sample_df.apply(lambda x: x.value_counts())
#3
sample_df.apply(lambda x: x.to_series().value_counts())
#4
sample_df.select(pl.col(['sub-category','category'])).apply(lambda x: x.value_counts())
#5
sample_df.select(pl.col(['sub-category','category'])).apply(lambda x: x.to_series().value_counts())
But if I convert it to a Pandas dataframe, it works:
sample_df.to_pandas().apply(lambda x: x.value_counts())
You can use .melt + .groupby().count():
sample_df.melt(variable_name="column").groupby(pl.all()).count()
shape: (7, 3)
┌──────────────┬─────────────┬───────┐
│ column ┆ value ┆ count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞══════════════╪═════════════╪═══════╡
│ sub-category ┆ mobile ┆ 1 │
│ category ┆ kitchen ┆ 1 │
│ sub-category ┆ wm ┆ 2 │
│ sub-category ┆ tv ┆ 2 │
│ sub-category ┆ micro ┆ 1 │
│ category ┆ mobile ┆ 1 │
│ category ┆ electronics ┆ 4 │
└──────────────┴─────────────┴───────┘
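To make the two steps concrete, here is a minimal plain-Python sketch of what the melt and group-by-count compute on the sample data. It is only an illustration of the logic using the standard library, not how Polars implements it; the names data, pairs and counts are made up for this example.

```python
from collections import Counter

# Same values as sample_df in the question, as a plain dict of lists.
data = {
    "sub-category": ["tv", "mobile", "tv", "wm", "micro", "wm"],
    "category": ["electronics", "mobile", "electronics",
                 "electronics", "kitchen", "electronics"],
}

# "Melt": flatten the table into (column, value) pairs, one per cell.
pairs = [(column, value) for column, values in data.items() for value in values]

# "Group by all columns and count": count how often each pair occurs.
counts = Counter(pairs)

for (column, value), count in counts.items():
    print(column, value, count)
```

Each distinct (column, value) pair becomes one row of the long result, which is why the output above has 7 rows.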
Then use .pivot to reshape from long to wide.
sample_df.melt(variable_name="column").groupby(pl.all()).count().pivot(
    values = "count",
    index = "value",
    columns = "column",
    aggregate_function = None
)
shape: (6, 3)
┌─────────────┬──────────────┬──────────┐
│ value ┆ sub-category ┆ category │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ u32 │
╞═════════════╪══════════════╪══════════╡
│ mobile ┆ 1 ┆ 1 │
│ wm ┆ 2 ┆ null │
│ micro ┆ 1 ┆ null │
│ kitchen ┆ null ┆ 1 │
│ electronics ┆ null ┆ 4 │
│ tv ┆ 2 ┆ null │
└─────────────┴──────────────┴──────────┘
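So if you want a pandas-like result, you have to do a little work yourself. I think pandas does this implicitly because it is index-based.
So you compute the value counts for each column and then merge them with an outer join.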
def multiple_column_value_counts(df, columns, value_column = "values"):
    # Build one value-counts frame per column, renamed to (value_column, column).
    value_counts_dfs = []
    for column in columns:
        counts = df.get_column(column).value_counts()
        name_map = {k: v for k, v in zip(counts.columns, [value_column, column])}
        value_counts_dfs.append(counts.rename(name_map))
    # Outer-join the per-column frames on the shared value column.
    merge_df = value_counts_dfs.pop(0)
    for value_counts in value_counts_dfs:
        merge_df = merge_df.join(value_counts, on = value_column, how = "outer")
    return merge_df
columns = ['category', 'sub-category']
multiple_column_value_counts(sample_df, columns)
shape: (6, 3)
┌─────────────┬──────────┬──────────────┐
│ values ┆ category ┆ sub-category │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ u32 │
╞═════════════╪══════════╪══════════════╡
│ micro ┆ null ┆ 1 │
│ mobile ┆ 1 ┆ 1 │
│ wm ┆ null ┆ 2 │
│ tv ┆ null ┆ 2 │
│ kitchen ┆ 1 ┆ null │
│ electronics ┆ 4 ┆ null │
└─────────────┴──────────┴──────────────┘
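For comparison, the index-based behaviour of pandas mentioned above can be sketched explicitly. This is only an illustration of the alignment idea, not a claim about how DataFrame.apply is implemented internally: each column's value_counts() is a Series indexed by the distinct values, and concatenating those Series along axis=1 aligns them on the union of their indexes, which is effectively the same outer join. The names pdf and wide are made up for this example.

```python
import pandas as pd

# Same data as sample_df in the question, built directly in pandas.
pdf = pd.DataFrame({
    "sub-category": ["tv", "mobile", "tv", "wm", "micro", "wm"],
    "category": ["electronics", "mobile", "electronics",
                 "electronics", "kitchen", "electronics"],
})

# One value_counts() Series per column; concat(axis=1) aligns them on the
# union of their indexes and fills missing combinations with NaN.
wide = pd.concat({col: pdf[col].value_counts() for col in pdf.columns}, axis=1)
print(wide)
```

Polars has no row index, so the alignment has to be spelled out with a join on an ordinary column, as in the function above.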