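I'm trying to use group_by in Polars to find groups of rows where the data is identical but the IDs differ. The error I get is: Error: 'Expr' object has no attribute 'collect_list'. Here is my code: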
import polars as pl
from datetime import datetime
# Create a sample DataFrame with detailed personal data where all fields are the same except the client ID
df = pl.DataFrame({
    "PER_FirstName": ["John", "John", "John"],
    "PER_LastName": ["Doe", "Doe", "Doe"],
    "PER_DOB": [datetime(1990, 5, 1), datetime(1990, 5, 1), datetime(1990, 5, 1)],
    "PER_StreetAddress": ["123 Elm St", "123 Elm St", "123 Elm St"],
    "PER_ClientID": [101, 102, 103]
})
# Using group_by and trying a potentially available function.
try:
    result = df.group_by(['PER_FirstName', 'PER_LastName', 'PER_DOB', 'PER_StreetAddress']) \
        .agg([
            pl.col('PER_ClientID').n_unique().alias('unique_client_ids'),
            pl.col('PER_ClientID').collect_list().alias('client_ids')  # Trying collect_list
        ])
    print(result)
except AttributeError as e:
    print("Error:", e)
Any ideas about what I'm doing wrong and how to change it so that it works? Thanks!
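I ended up solving this with a combination of Polars and pandas. As far as I could tell, Polars currently has no facility that collects values the way I wanted, so I took a hybrid approach, shown below: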
import polars as pl
from datetime import datetime
import pandas as pd
from itables import show
# Aggregate to collect unique client IDs and count them
result = df.group_by(['PER_FirstName', 'PER_LastName', 'PER_DOB', 'PER_StreetAddress']) \
    .agg([
        pl.col('PER_ClientID').unique().alias('unique_client_ids'),
        pl.col('PER_ClientID').n_unique().alias('count_unique_ids')
    ])
# Keep only groups with more than one unique client ID
filtered_result = result.filter(pl.col('count_unique_ids') > 1)
# Convert Polars DataFrame to Pandas DataFrame
filtered_result_pd = filtered_result.to_pandas()
# Display the result using itables
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
show(filtered_result_pd, classes="display nowrap compact", paging=False, buttons=["copyHtml5", "csvHtml5", "excelHtml5"], scrollX=True)