Using a PySpark DataFrame column in another Spark SQL query

Question

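I have a situation where I am trying to query a table and use the result of that query (a data frame) as the IN clause of another query.

The first query gives me the following data frame: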

+-----------------+
|key              |
+-----------------+
|   10000000000004|
|   10000000000003|
|   10000000000008|
|   10000000000009|
|   10000000000007|
|   10000000000006|
|   10000000000010|
|   10000000000002|
+-----------------+

Now I want to run the following query dynamically, using the values in that data frame instead of hard-coding them:

spark.sql("""select country from table1 where key in (10000000000004, 10000000000003, 10000000000008, 10000000000009, 10000000000007, 10000000000006, 10000000000010, 10000000000002)""").show()

I tried the following, but it did not work:

df = spark.sql("""select key from table0 """)
a = df.select("key").collect()
spark.sql("""select country from table1 where key in ({0})""".format(a)).show()

Can anyone help me?

Answer 1

You should use an (inner) join between the two data frames to get the countries you need. See my example:

# Create a list of countries with Id's
countries = [('Netherlands', 1), ('France', 2), ('Germany', 3), ('Belgium', 4)]

# Create a list of Ids
numbers = [(1,), (2,)]  

# Create two data frames
df_countries = spark.createDataFrame(countries, ['CountryName', 'Id'])
df_numbers = spark.createDataFrame(numbers, ['Id'])

The data frames look like this:

df_countries:

+-----------+---+
|CountryName| Id| 
+-----------+---+
|Netherlands|  1|
|     France|  2|
|    Germany|  3|
|    Belgium|  4|
+-----------+---+

df_numbers:
+---+
| Id|
+---+
|  1|
|  2|
+---+

You can join them as follows:

df_countries.join(df_numbers, on='Id', how='inner').show()

Result:

+---+-----------+
| Id|CountryName|
+---+-----------+
|  1|Netherlands|
|  2|     France|
+---+-----------+
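Applied to the tables in the question, the same idea might look like the sketch below. It is not part of the original answer: the function names are hypothetical, `spark` is assumed to be an active SparkSession, and `table0` and `table1` are assumed to be queryable with a `key` column in both. A left semi join is swapped in for the inner join because it keeps each `table1` row at most once even if the key list contains duplicates, which matches how `IN` behaves.

```python
# The same filter as an IN subquery, so no key values are pulled to the driver.
IN_SUBQUERY = "select country from table1 where key in (select key from table0)"


def countries_via_subquery(spark):
    """Run the IN-subquery form; `spark` is an assumed active SparkSession."""
    return spark.sql(IN_SUBQUERY)


def countries_via_semi_join(table1_df, keys_df):
    """DataFrame-API form: keep table1 rows whose key appears in keys_df."""
    # "left_semi" returns only columns from the left side and never
    # duplicates a left row, unlike an inner join on a repeated key.
    return table1_df.join(keys_df, on="key", how="left_semi").select("country")
```

With the question's tables, `countries_via_semi_join(spark.table("table1"), spark.table("table0")).show()` would be one way to call it.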

Hope this solves the problem!
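For completeness, the attempt in the question fails because `collect()` returns a list of `Row` objects, so `format(a)` pastes text such as `[Row(key=10000000000004), ...]` into the SQL instead of a comma-separated list of values. A minimal sketch of a repaired version follows. It assumes the keys are integers and few enough to collect to the driver; `build_in_query` and `countries_for_keys` are hypothetical helper names, not Spark APIs.

```python
def build_in_query(keys):
    """Render the query with an IN list built from plain integer keys."""
    if not keys:
        # "key in ()" is not valid SQL, so refuse an empty list up front.
        raise ValueError("need at least one key")
    # int() rejects anything that is not a number before it reaches the SQL text.
    in_list = ", ".join(str(int(k)) for k in keys)
    return "select country from table1 where key in ({0})".format(in_list)


def countries_for_keys(spark):
    """Assumes `spark` is an active SparkSession and table0/table1 exist."""
    rows = spark.sql("select key from table0").collect()
    # Each element is a Row; pull out the bare value of its 'key' field.
    keys = [row["key"] for row in rows]
    return spark.sql(build_in_query(keys))
```

For large key sets the join is usually the better fit, since the values never have to leave the cluster.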
