Comparison operator in PySpark (not equal / !=)

Question

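I am trying to obtain all rows in a dataframe where two flags are set to '1', and subsequently all rows where only one of the two flags is set to '1' and the other is NOT EQUAL to '1'.

With the following schema (three columns),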

df = sqlContext.createDataFrame([('a',1,'null'),('b',1,1),('c',1,'null'),('d','null',1),('e',1,1)], #,('f',1,'NaN'),('g','bla',1)],
                            schema=('id', 'foo', 'bar')
                            )

I obtain the following dataframe:

+---+----+----+
| id| foo| bar|
+---+----+----+
|  a|   1|null|
|  b|   1|   1|
|  c|   1|null|
|  d|null|   1|
|  e|   1|   1|
+---+----+----+

When I apply the desired filters, the first filter (foo=1 AND bar=1) works, but the other one (foo=1 AND NOT bar=1) does not.

foobar_df = df.filter( (df.foo==1) & (df.bar==1) )

yields:

+---+---+---+
| id|foo|bar|
+---+---+---+
|  b|  1|  1|
|  e|  1|  1|
+---+---+---+

Here is the filter that does not behave as expected:

foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
foo_df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
+---+---+---+

Why is it not filtering? How can I get the rows where only foo is equal to '1'?

Answer 1

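Why is it not filtering

Because this is SQL, and NULL indicates a missing value. Because of that, any comparison to NULL other than IS NULL and IS NOT NULL is undefined. You need either: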

col("bar").isNull() | (col("bar") != 1)

or

coalesce(col("bar") != 1, lit(True))

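or (PySpark >= 2.3):

~col("bar").eqNullSafe(1)

if you want null-safe comparisons in PySpark. Note that eqNullSafe tests for equality, so it is negated here to express "not equal".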
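To make the NULL behavior concrete, here is a minimal plain-Python sketch of SQL three-valued logic applied to the question's data. It does not use PySpark at all: None stands in for NULL, and every helper name (sql_ne, sql_and, null_safe_eq, where, and so on) is hypothetical and only meant to mirror the expressions above.

```python
# Plain-Python model of SQL three-valued logic; None stands in for NULL.
# All helper names are hypothetical and are NOT part of the PySpark API.

def sql_eq(a, b):
    """a = b under SQL rules: unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def sql_ne(a, b):
    """a != b under SQL rules: unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b

def sql_and(x, y):
    """Three-valued AND: False dominates, otherwise unknown propagates."""
    if x is False or y is False:
        return False
    if x is None or y is None:
        return None
    return True

def sql_or(x, y):
    """Three-valued OR: True dominates, otherwise unknown propagates."""
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False

def sql_not(x):
    """Three-valued NOT: negating unknown is still unknown."""
    return None if x is None else not x

def null_safe_eq(a, b):
    """Null-safe equality: always True or False, never unknown."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

def where(rows, predicate):
    """A filter keeps a row only when the predicate is exactly True."""
    return [r for r in rows if predicate(r) is True]

rows = [("a", 1, None), ("b", 1, 1), ("c", 1, None), ("d", None, 1), ("e", 1, 1)]

# Mirrors (foo == 1) & (bar != 1)
naive = where(rows, lambda r: sql_and(sql_eq(r[1], 1), sql_ne(r[2], 1)))
# Mirrors (foo == 1) & (bar IS NULL | (bar != 1))
with_is_null = where(
    rows, lambda r: sql_and(sql_eq(r[1], 1), sql_or(r[2] is None, sql_ne(r[2], 1)))
)
# Mirrors (foo == 1) & NOT(bar <=> 1)
with_null_safe = where(
    rows, lambda r: sql_and(sql_eq(r[1], 1), sql_not(null_safe_eq(r[2], 1)))
)

print([r[0] for r in naive])           # []
print([r[0] for r in with_is_null])    # ['a', 'c']
print([r[0] for r in with_null_safe])  # ['a', 'c']
```

The naive predicate evaluates to unknown for rows a and c, and a filter discards anything that is not strictly true, which is why the original result was empty.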

Also, 'null' is not a valid way to introduce a NULL literal. You should use None to indicate missing objects.
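If the input rows already contain the string 'null' as a placeholder, one option is to normalize them before building the dataframe. The sketch below is plain Python; normalize_row and MISSING_MARKERS are hypothetical names, and it assumes the literal string "null" is the only placeholder in use.

```python
# Hypothetical helper, not a PySpark function. Assumes the literal string
# "null" was used as a stand-in for a missing value.
MISSING_MARKERS = {"null"}

def normalize_row(row):
    """Return a copy of the row with placeholder strings replaced by None."""
    return tuple(
        None if isinstance(value, str) and value in MISSING_MARKERS else value
        for value in row
    )

raw_rows = [("a", 1, "null"), ("b", 1, 1), ("d", "null", 1)]
clean_rows = [normalize_row(r) for r in raw_rows]

print(clean_rows)
# [('a', 1, None), ('b', 1, 1), ('d', None, 1)]
```

The cleaned tuples can then be passed to createDataFrame in place of the original list, so the missing cells arrive as real NULLs rather than strings.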

from pyspark.sql.functions import col, coalesce, lit

df = spark.createDataFrame([
    ('a', 1, 1), ('a',1, None), ('b', 1, 1),
    ('c' ,1, None), ('d', None, 1),('e', 1, 1)
]).toDF('id', 'foo', 'bar')

df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+

df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+

Answer 2

To filter for null values, try:

foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )

https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull
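One caveat worth checking: this predicate selects only the rows where bar is missing. A plain-Python sketch of the difference follows, with None modelling a NULL cell. The row ("f", 1, 0) is a made-up addition for illustration and is not part of the question's data.

```python
# Plain-Python stand-in for the two predicates; None models a NULL cell.
# The row ("f", 1, 0) is invented for illustration only.
rows = [("a", 1, None), ("b", 1, 1), ("c", 1, None), ("f", 1, 0)]

# Mirrors (df.foo == 1) & (df.bar.isNull())
only_null = [r[0] for r in rows if r[1] == 1 and r[2] is None]

# Mirrors (foo == 1) & (bar.isNull() | (bar != 1))
null_or_not_one = [r[0] for r in rows if r[1] == 1 and (r[2] is None or r[2] != 1)]

print(only_null)        # ['a', 'c']
print(null_or_not_one)  # ['a', 'c', 'f']
```

On the question's own data both forms return the same rows; they diverge only if bar can hold a non-null value other than 1.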

Answer 3

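The selected correct answer does not address the question, and the other answers are all wrong for pyspark.

There is no equivalent of the "!=" operator in pyspark for this solution. The correct answer is to use "==" together with the "~" negation operator, like this: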

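# "target_column_here" and SomeType() are placeholders; substitute your own column name and data type.
df = df.withColumn("target_column_here", when(~(col("column_name_here") == "string_value"), lit("update_value").cast(SomeType())))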
