xlsx - Incorrect column mapping when reading into a Spark DataFrame with pyspark

Question (votes: 0, answers: 1)

I am trying to read an Excel file using the pyspark code below:

df_data = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("dataAddress", f"'{sheet_name}'!A1") \
    .option("treatEmptyValuesAsNulls", "false") \
    .schema(custom_schema) \
    .load(file_path)

The column names are not mapped in the same order as they appear in the file. For example:

file:
col1 col2 col3
12    23   null

DataFrame output:
  col2 col3 col1
  null 12    23

Please let me know how to fix this so that the columns are mapped in the correct order. Thanks in advance.

python

apache-spark

pyspark

azure-databricks

xlsx

1 Answer

0 votes

I tried the following approach:

from pyspark.sql.types import StringType, StructField, StructType

file_path = "/FileStore/tables/exclk.xlsx"
sheet_name = "Sheet1"

# Declare the schema fields in the same order as the columns in the sheet.
schema = StructType([
    StructField("col1", StringType(), nullable=True),
    StructField("col2", StringType(), nullable=True),
    StructField("col3", StringType(), nullable=True)
])
desired_order = ['col1', 'col2', 'col3']

df_data = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("dataAddress", f"'{sheet_name}'!A1") \
    .option("treatEmptyValuesAsNulls", "false") \
    .schema(schema) \
    .load(file_path)

# select() only rearranges the existing columns by name.
df_data = df_data.select(desired_order)
df_data.show()

Result:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  12|  23|NULL|
|  34|  45|  56|
+----+----+----+

The code above reads the Excel file with the specified schema applied and then selects the columns in the desired order.
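If the output still looks shifted, one possible cause (an assumption on my part, not verified against your file or your spark-excel version) is that a user-supplied schema is applied by column position rather than by header name. In that case a schema whose fields are declared in a different order from the sheet would put values under the wrong names, and a later select would not repair it. Below is a minimal sketch of how to check for that. The helper names find_misaligned and read_excel_in_header_order are hypothetical; only the reader options already used above are assumed to exist.

```python
def find_misaligned(header_names, schema_names):
    """Return (position, header_name, schema_name) for every position
    where the file header and the schema disagree."""
    mismatches = []
    for pos in range(max(len(header_names), len(schema_names))):
        header = header_names[pos] if pos < len(header_names) else None
        field = schema_names[pos] if pos < len(schema_names) else None
        if header != field:
            mismatches.append((pos, header, field))
    return mismatches


def read_excel_in_header_order(spark, file_path, sheet_name, schema):
    """Hypothetical helper: read the sheet once without a schema to learn
    the header order, then read it again with the schema fields rearranged
    to follow that order."""
    from pyspark.sql.types import StructType  # lazy import, needs pyspark

    reader = (
        spark.read.format("com.crealytics.spark.excel")
        .option("header", "true")
        .option("dataAddress", f"'{sheet_name}'!A1")
    )
    # First pass: no schema, so the column names come from the header row.
    header_names = reader.load(file_path).columns

    problems = find_misaligned(header_names, schema.fieldNames())
    if problems:
        print("schema order differs from file header:", problems)

    # Look each field up by name; raises KeyError if the header contains
    # a column that the schema does not declare.
    aligned = StructType([schema[name] for name in header_names])
    return reader.schema(aligned).load(file_path)
```

With the column order from the question, find_misaligned(["col1", "col2", "col3"], ["col2", "col3", "col1"]) reports all three positions as mismatched, which would suggest fixing the field order in custom_schema rather than reordering the columns afterwards.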
