I am trying to read CSV files from a storage location using the spark.read function, and I am explicitly passing a schema to it. However, the data is not loaded into the correct columns of the DataFrame. Here are the code details:
from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType
# Define the schema
schema = StructType([
    StructField('TRANSACTION', StringType(), True),
    StructField('FROM', StringType(), True),
    StructField('TO', StringType(), True),
    StructField('DA_RATE', DateType(), True),
    StructField('CURNCY_F', StringType(), True),
    StructField('CURNCY_T', StringType(), True)
])
# Read the CSV file with the specified schema
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("delimiter", "|") \
    .schema(schema) \
    .load("abfss://[email protected]/my/2024-04-10/abc_2*.csv")
**Data in the csv file**
DA_RATE|CURNCY_F|CURNCY_T
2024-02-26|AAA|MMM
2024-02-26|AAA|NNN
2024-02-26|BBB|YYY
2024-02-26|CCC|KKK
2024-02-27|DDD|SSS
**Output I am getting**
TRANSACTION FROM TO DA_RATE CURNCY_F CURNCY_T
2024-02-26 AAA MMM null null null
2024-02-26 AAA NNN null null null
2024-02-26 BBB YYY null null null
2024-02-26 CCC KKK null null null
**Output I expected**
TRANSACTION FROM TO DA_RATE CURNCY_F CURNCY_T
null null null 2024-02-26 AAA MMM
null null null 2024-02-26 AAA NNN
null null null 2024-02-26 BBB YYY
The problem is the order of the fields in your schema. When you pass a schema to the CSV reader, Spark assigns the file's columns to the schema fields by position, not by matching the header names, so the file's three columns landed in your first three fields (TRANSACTION, FROM, TO) and the rest came back null. List the fields in the same order as the columns in the file.

Just change your code to:
from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType
# Define the schema
schema = StructType([
    StructField('DA_RATE', DateType(), True),
    StructField('CURNCY_F', StringType(), True),
    StructField('CURNCY_T', StringType(), True),
    StructField('TRANSACTION', StringType(), True),
    StructField('FROM', StringType(), True),
    StructField('TO', StringType(), True),
])
# Read the CSV file with the specified schema
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", "|")
    .schema(schema)
    .load("dbfs:/mnt/dl2-temp-p-chn-1/mycsv.csv")
)
Also, please use parentheses instead of backslashes for line continuation, as PEP 8 recommends ;)
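To see why the dates ended up in TRANSACTION, here is a plain-Python sketch (not Spark code, just an illustration) of the positional mapping the CSV reader applies when you supply a schema: the header row is only skipped, never matched against the schema field names, and leftover schema fields become null.

```python
# Hypothetical illustration of Spark's positional column-to-schema mapping.
# These lists mirror the question's schema and one data row from the file.
schema_fields = ['TRANSACTION', 'FROM', 'TO', 'DA_RATE', 'CURNCY_F', 'CURNCY_T']
file_row = ['2024-02-26', 'AAA', 'MMM']  # the file has only three columns

# First file column -> first schema field, and so on; schema fields with
# no corresponding column in the file are filled with null (None here).
padding = [None] * (len(schema_fields) - len(file_row))
mapped = dict(zip(schema_fields, file_row + padding))
print(mapped)
# {'TRANSACTION': '2024-02-26', 'FROM': 'AAA', 'TO': 'MMM',
#  'DA_RATE': None, 'CURNCY_F': None, 'CURNCY_T': None}
```

If you still want the DataFrame's columns displayed in the original order after loading with the corrected schema, you can reorder them afterwards with `df.select('TRANSACTION', 'FROM', 'TO', 'DA_RATE', 'CURNCY_F', 'CURNCY_T')`.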