I tried searching for this, and the closest I found was this. But it doesn't give me what I want: I want to drop every row that is part of a duplicate set. For example, if I have the dataframe
Col1 Col2 Col3
Alice Girl April
Jean Boy Aug
Jean Boy Sept
I want to remove all duplicates based on Col1 and Col2, so that I end up with
Col1 Col2 Col3
Alice Girl April
Is there any way to do this?
from pyspark.sql import functions as F

# Sample dataframe
df = sqlContext.createDataFrame([
    ["Alice", "Girl", "April"],
    ["Jean", "Boy", "Aug"],
    ["Jean", "Boy", "Sept"]
], ["Col1", "Col2", "Col3"])

# Group by the key columns and keep only rows whose group count is 1.
df2 = (df
       .groupBy(["col1", "col2"])
       .agg(F.count(F.lit(1)).alias("count"),
            F.max("col3").alias("col3"))
       .where("count = 1")
       .drop("count"))

df2.show(10, False)
Output:
+-----+----+-----+
|col1 |col2|col3 |
+-----+----+-----+
|Alice|Girl|April|
+-----+----+-----+