Pyspark: how to write a query that returns only the names with more than one entry?


I have a table that looks like the following:

Timestamp,  Name,    Value  
1577862435, Tom,      0.25  
1577915618, Tom,      0.50  
1577839734, John,     0.34
1577839734, John,     0.34
1577839734, John,     0.34
1577839734, Eric,     0.34

I count the entries for each user:

query = """ SELECT Name,
            COUNT(*) AS `num`
            FROM
            myTable
            GROUP BY Name
            ORDER BY num DESC
"""
count = spark.sql(query)
count.show()

Name    num
John     3
Tom      2
Eric     1

I would like a query that returns only the rows whose Name has num >= 2. My final table should be:

Timestamp,  Name,    Value  
1577862435, Tom,      0.25  
1577915618, Tom,      0.50  
1577839734, John,     0.34
1577839734, John,     0.34
1577839734, John,     0.34
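Outside of Spark, the same filter can also be expressed with a GROUP BY/HAVING subquery plus a semi-join instead of a window function. A minimal, runnable sketch using Python's built-in sqlite3 in place of Spark (the table name and data mirror the question; this is an illustration, not the asker's environment):

```python
import sqlite3

# Sample data from the question.
rows = [
    (1577862435, "Tom", 0.25),
    (1577915618, "Tom", 0.50),
    (1577839734, "John", 0.34),
    (1577839734, "John", 0.34),
    (1577839734, "John", 0.34),
    (1577839734, "Eric", 0.34),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE myTable (Timestamp INTEGER, Name TEXT, Value REAL)")
con.executemany("INSERT INTO myTable VALUES (?, ?, ?)", rows)

# A GROUP BY/HAVING subquery picks the names with at least two entries,
# and the IN clause (a semi-join) keeps all of their original rows.
result = con.execute("""
    SELECT Timestamp, Name, Value
    FROM myTable
    WHERE Name IN (
        SELECT Name FROM myTable GROUP BY Name HAVING COUNT(*) >= 2
    )
""").fetchall()

for row in result:
    print(row)
```

This keeps the five Tom and John rows and drops Eric's single row.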
python sql pyspark pyspark-sql
2 Answers
You should use a window function.

from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.table("myTable")
df.withColumn(
    "cnt", F.count("*").over(Window.partitionBy("Name"))
).where("cnt > 1").drop("cnt").show()


You can write this as SQL:

SELECT Timestamp, Name, Value
FROM (
    SELECT t.*, COUNT(*) OVER (PARTITION BY Name) AS num
    FROM myTable t
) t
WHERE num >= 2;
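The window-function SQL can be verified with any engine that supports COUNT(*) OVER, not just Spark SQL. A minimal sketch using Python's bundled sqlite3 (window functions require SQLite 3.25+; the data mirrors the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE myTable (Timestamp INTEGER, Name TEXT, Value REAL)")
con.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [
        (1577862435, "Tom", 0.25),
        (1577915618, "Tom", 0.50),
        (1577839734, "John", 0.34),
        (1577839734, "John", 0.34),
        (1577839734, "John", 0.34),
        (1577839734, "Eric", 0.34),
    ],
)

# COUNT(*) OVER (PARTITION BY Name) attaches the per-name count to every
# row without collapsing them, so the outer query can filter on it.
kept = con.execute("""
    SELECT Timestamp, Name, Value
    FROM (
        SELECT t.*, COUNT(*) OVER (PARTITION BY Name) AS num
        FROM myTable t
    )
    WHERE num >= 2
""").fetchall()
print(kept)
```

Unlike GROUP BY, the window version preserves each original row, which is exactly what the desired final table needs.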
