Say I have two Spark DataFrames:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, coalesce
# Create a SparkSession
spark = SparkSession.builder \
    .appName("ExampleDataFrames") \
    .getOrCreate()
# Example data for DataFrame 1
data1 = [
("Pool_A", "A", "X", 10),
("Pool_A", "A", "Y", 20),
("Pool_A", "B", "X", 15),
("Pool_B", "A", "X", 5),
("Pool_B", "B", "Y", 25),
]
# Define the schema for DataFrame 1
df1_schema = ["pool", "col1", "col2", "value"]
# Create DataFrame 1
df1 = spark.createDataFrame(data1, df1_schema)
# Example data for DataFrame 2
data2 = [
("A", "X", 100),
("A", "Y", 200),
("B", "X", 150),
("B", "Y", 250),
("C", "X", 300),
]
# Define the schema for DataFrame 2
df2_schema = ["col1", "col2", "default_value"]
# Create DataFrame 2
df2 = spark.createDataFrame(data2, df2_schema)
I want to join the two DataFrames by propagating every possible (col1, col2) combination to each pool, filling in the associated default value wherever a pool has no value of its own. I have a solution using crossJoin, but I'd like to see whether there is a more elegant alternative (and what the performance cost of crossJoin is).
Here is the desired output:
+-------+----+----+-----+
| pool|col1|col2|value|
+-------+----+----+-----+
| Pool_B| A| X| 5|
| Pool_B| B| Y| 25|
| Pool_B| C| X| 300|
| Pool_B| B| X| 150|
| Pool_B| A| Y| 200|
| Pool_A| A| X| 10|
| Pool_A| B| X| 15|
| Pool_A| A| Y| 20|
| Pool_A| B| Y| 250|
| Pool_A| C| X| 300|
+-------+----+----+-----+