My SparkSQL DataFrame looks like this:
+-----------+----------+----------+----------+
|a |b |c |d |
+-----------+----------+----------+----------+
| 123| abc| N| 2|
| 123| abc| N| 4|
| 123| abc| X| 3|
| 456| def| K| 1|
| 456| def| X| 4|
+-----------+----------+----------+----------+
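I want to drop duplicates based on the key columns a and b: rows that do not have the value N in column c should be removed, but only IF at least one row with the value N in column c exists for that composite key.
Expected output: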
+-----------+----------+----------+----------+
|a |b |c |d |
+-----------+----------+----------+----------+
| 123| abc| N| 2|
| 123| abc| N| 4|
| 456| def| K| 1|
| 456| def| X| 4|
+-----------+----------+----------+----------+
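For the composite key a = 123, b = abc, there are rows where c = N (two of them here). We therefore want to drop all rows for that composite key where c != N.
For the composite key a = 456, b = def, there is no row with c = N. We therefore do not want to drop any rows.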
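Before turning to Spark, the rule described above can be sketched in plain Python on the sample rows. This is only an illustration: the function name `drop_non_n_rows` and the tuple layout `(a, b, c, d)` are assumptions made for this example and are not part of the original question.

```python
def drop_non_n_rows(rows):
    """Keep a row if its c is 'N', or if no row sharing its (a, b) key has c == 'N'."""
    # Composite keys (a, b) that have at least one row with c == 'N'
    keys_with_n = {(a, b) for a, b, c, d in rows if c == "N"}
    return [
        (a, b, c, d)
        for a, b, c, d in rows
        if c == "N" or (a, b) not in keys_with_n
    ]


# Sample rows copied from the table in the question
rows = [
    (123, "abc", "N", 2),
    (123, "abc", "N", 4),
    (123, "abc", "X", 3),
    (456, "def", "K", 1),
    (456, "def", "X", 4),
]

result = drop_non_n_rows(rows)
```

Only the row (123, abc, X, 3) is dropped, because its key has rows with c = N while its own c is X. The rows for key (456, def) are all kept.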
You can get the desired result with the following steps:
PySpark
from pyspark.sql.functions import col, count
# df = <read your data here>
# Find the (a, b) groups that contain at least one row with c = 'N'
df_N = df.filter(col('c') == 'N').groupBy('a', 'b').agg(count('c').alias('count_N'))
# Left-join back to the original DataFrame, then keep a row if its group
# has no 'N' row (count_N is null) or if the row itself has c = 'N'
df_out = df.join(df_N, on=['a', 'b'], how='left') \
    .filter((col('count_N').isNull()) | (col('c') == 'N')) \
    .drop('count_N')
df_out.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
|123|abc| N| 2|
|123|abc| N| 4|
|456|def| K| 1|
|456|def| X| 4|
+---+---+---+---+
Spark-SQL
spark.sql("""
    SELECT *
    FROM <table_name>
    WHERE (a, b) NOT IN (
        SELECT a, b
        FROM <table_name>
        WHERE c = 'N'
        GROUP BY a, b
        HAVING COUNT(*) > 0
    )
    OR c = 'N'
""")
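The logic of the query above can be sanity-checked without a Spark session. The following is a minimal sketch that runs the same SQL text against the sample rows using Python's built-in `sqlite3` module. It assumes SQLite 3.15 or newer (for the row-value `(a, b) NOT IN (...)` syntax); the table name `t` is a placeholder chosen for this example. SQLite and Spark SQL are different engines, so this only checks the filtering logic, not Spark's behavior.

```python
import sqlite3

# In-memory database holding the sample rows from the question
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT, c TEXT, d INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (123, "abc", "N", 2),
        (123, "abc", "N", 4),
        (123, "abc", "X", 3),
        (456, "def", "K", 1),
        (456, "def", "X", 4),
    ],
)

# Same query shape as above, with the placeholder table name t
query = """
    SELECT *
    FROM t
    WHERE (a, b) NOT IN (
        SELECT a, b
        FROM t
        WHERE c = 'N'
        GROUP BY a, b
        HAVING COUNT(*) > 0
    )
    OR c = 'N'
"""

# Sort in Python because the query itself does not guarantee row order
kept = sorted(conn.execute(query).fetchall())
conn.close()
```

On the sample data, `kept` contains the four rows from the expected output; the row (123, abc, X, 3) is filtered out.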