PySpark DataFrame filter with an OR condition

Question

I am trying to filter my PySpark DataFrame on an OR condition, like this:

filtered_df = file_df.filter(file_df.dst_name == "ntp.obspm.fr").filter(file_df.fw == "4940" | file_df.fw == "4960")

I want to return only the rows where file_df.fw == "4940" or file_df.fw == "4960", but when I try this I get the following error:

Py4JError: An error occurred while calling o157.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist

What am I doing wrong?

Without the OR condition, filtering on a single condition (file_df.fw == "4940") works.

Answer 1

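The error message is caused by operator precedence. | (OR) has higher precedence than the comparison operator ==. Python therefore evaluates "4940" | file_df.fw first, so Spark is asked to OR a plain string with a column instead of combining (file_df.fw == "4940") and (file_df.fw == "4960"). That is why the trace says that or([class java.lang.String]) does not exist. You can change the precedence with parentheses. Take a look at the following example: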

columns = ['dst_name','fw']

file_df=spark.createDataFrame([('ntp.obspm.fr','3000'),
                               ('ntp.obspm.fr','4940'),
                               ('ntp.obspm.fr','4960'),
                               ('ntp.obspm.de', '4940' )],
                              columns)

# Parentheses added around each comparison so that | combines the two results
filtered_df = file_df.filter(file_df.dst_name == "ntp.obspm.fr").filter((file_df.fw == "4940") | (file_df.fw == "4960"))
filtered_df.show()

Output:

+------------+----+ 
|    dst_name|  fw| 
+------------+----+ 
|ntp.obspm.fr|4940| 
|ntp.obspm.fr|4960| 
+------------+----+
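
If you want to see the precedence issue without starting Spark, one option is to ask Python how it parses the two expressions. This is only a sketch using the standard-library ast module; the name fw is a placeholder standing in for file_df.fw, and nothing here is evaluated against a real DataFrame.

```python
import ast

# Parse both spellings of the condition without evaluating them.
# `fw` is a placeholder name standing in for file_df.fw.
without_parens = ast.parse('fw == "4940" | fw == "4960"', mode="eval").body
with_parens = ast.parse('(fw == "4940") | (fw == "4960")', mode="eval").body

# Without parentheses the top-level node is a chained comparison,
# fw == ("4940" | fw) == "4960", so `|` is applied to a string and a column.
print(type(without_parens).__name__)
print(ast.dump(without_parens.comparators[0]))

# With parentheses the top-level node is the `|` itself, joining two comparisons.
print(type(with_parens).__name__)
```

The first expression parses as a Compare node whose middle operand is the `|` operation on "4940" and fw, while the second parses as a BinOp joining two comparisons, which is the shape the filter needs.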
