I have a table that looks like this:
Timestamp, Name, Value
1577862435, Tom, 0.25
1577915618, Tom, 0.50
1577839734, John, 0.34
1577839734, John, 0.34
1577839734, John, 0.34
1577839734, Eric, 0.34
I want to count the entries for each user:
query = """ SELECT ID,
COUNT(*) AS `num`
FROM
myTable
GROUP BY Name
ORDER BY num DESC
"""
count = spark.sql(query)
count.show()
Name num
John 3
Tom 2
Eric 1
I would then like a query that returns only the rows whose Name has num >= 2. My final table should be:
Timestamp, Name, Value
1577862435, Tom, 0.25
1577915618, Tom, 0.50
1577839734, John, 0.34
1577839734, John, 0.34
1577839734, John, 0.34
from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.table("myTable")
df.withColumn(
    "cnt",
    # count rows per Name without collapsing them, unlike GROUP BY
    F.count("*").over(Window.partitionBy("Name"))
).where("cnt >= 2").drop("cnt").show()
SELECT Timestamp, Name, Value
FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY Name) AS num
      FROM myTable t
     ) t
WHERE num >= 2;
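The logic behind both answers (count per Name, then keep rows whose count is at least 2) can be sanity-checked in plain Python without a Spark session; this toy sketch just mirrors the rows from the question:

```python
from collections import Counter

# Toy rows mirroring the question's table: (Timestamp, Name, Value).
rows = [
    (1577862435, "Tom", 0.25),
    (1577915618, "Tom", 0.50),
    (1577839734, "John", 0.34),
    (1577839734, "John", 0.34),
    (1577839734, "John", 0.34),
    (1577839734, "Eric", 0.34),
]

# Equivalent of COUNT(*) OVER (PARTITION BY Name): entries per Name.
counts = Counter(name for _, name, _ in rows)

# Equivalent of WHERE num >= 2: keep rows whose Name occurs at least twice.
kept = [row for row in rows if counts[row[1]] >= 2]
```

`kept` ends up with the five Tom and John rows; Eric's single row is dropped, matching the expected final table.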