Data is not loaded into the correct columns when reading a CSV file with spark.read and an explicit schema

Question

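I am trying to read a CSV file from a storage location with spark.read, and I pass the schema to the reader explicitly. However, the data does not end up in the correct columns of the DataFrame. Here are the code details: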

from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

# Define the schema
schema = StructType([
    StructField('TRANSACTION', StringType(), True),
    StructField('FROM', StringType(), True),
    StructField('TO', StringType(), True),
    StructField('DA_RATE', DateType(), True),
    StructField('CURNCY_F', StringType(), True),
    StructField('CURNCY_T', StringType(), True)
])

# Read the CSV file with the specified schema
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("delimiter", "|") \
    .schema(schema) \
    .load("abfss://[email protected]/my/2024-04-10/abc_2*.csv")

**Data in the csv file**

DA_RATE|CURNCY_F|CURNCY_T
2024-02-26|AAA|MMM
2024-02-26|AAA|NNN
2024-02-26|BBB|YYY
2024-02-26|CCC|KKK
2024-02-27|DDD|SSS


**Output I am getting**

TRANSACTION  FROM  TO   DA_RATE  CURNCY_F  CURNCY_T
2024-02-26   AAA   MMM  null     null      null
2024-02-26   AAA   NNN  null     null      null
2024-02-26   BBB   YYY  null     null      null
2024-02-26   CCC   KKK  null     null      null

**Output I expect**

TRANSACTION  FROM  TO    DA_RATE     CURNCY_F  CURNCY_T
null         null  null  2024-02-26  AAA       MMM
null         null  null  2024-02-26  AAA       NNN
null         null  null  2024-02-26  BBB       YYY
csv pyspark azure-databricks
1 Answer

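The problem is the order of the fields in your schema. When you pass an explicit schema, Spark's CSV reader assigns columns by position, not by header name. The first three schema fields (TRANSACTION, FROM, TO) therefore receive the three columns in the file, and the remaining fields are filled with null.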
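To make the positional behaviour concrete, here is a small plain-Python sketch that mimics it with the standard `csv` module. It is only a toy model under that assumption, not Spark's actual implementation, and `read_positionally` is a hypothetical helper written for this illustration.

```python
import csv
import io


def read_positionally(text, field_names, delimiter="|"):
    """Toy model: map each value to a schema field by position only."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    data_rows = rows[1:]  # the header line is skipped; its names are never consulted
    result = []
    for values in data_rows:
        # Pad short rows with None, mirroring how missing trailing columns become null
        padded = values + [None] * (len(field_names) - len(values))
        result.append(dict(zip(field_names, padded)))
    return result


sample = "DA_RATE|CURNCY_F|CURNCY_T\n2024-02-26|AAA|MMM\n"

# Field order from the question: the date lands in TRANSACTION
wrong_order = ["TRANSACTION", "FROM", "TO", "DA_RATE", "CURNCY_F", "CURNCY_T"]
# Field order matching the file: the date lands in DA_RATE
right_order = ["DA_RATE", "CURNCY_F", "CURNCY_T", "TRANSACTION", "FROM", "TO"]

print(read_positionally(sample, wrong_order)[0])
print(read_positionally(sample, right_order)[0])
```

With the question's field order the date ends up under TRANSACTION and DA_RATE stays empty; with the file's order each value lands under its own name.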

Simply change your code to:


from pyspark.sql.types import StructType, StructField, StringType, DateType

# Define the schema: fields present in the file come first, in file order
schema = StructType([
    StructField('DA_RATE', DateType(), True),
    StructField('CURNCY_F', StringType(), True),
    StructField('CURNCY_T', StringType(), True),
    # Not present in the file, so these columns are filled with null
    StructField('TRANSACTION', StringType(), True),
    StructField('FROM', StringType(), True),
    StructField('TO', StringType(), True),
])

# Read the CSV file with the specified schema
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "|")
    .schema(schema)
    .load("dbfs:/mnt/dl2-temp-p-chn-1/mycsv.csv")
)

Also, prefer parentheses over backslashes for line continuation, since that is what PEP 8 recommends ;)
