PySpark schema mismatch problem

Question

I am trying to load a .csv file into Spark using the following code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Window


spark = SparkSession.builder.appName('Demo').master('local').getOrCreate()

pathData = '/home/data/departuredelays.csv'

schema = StructType([
            StructField('date', StringType()),
            StructField('delay', IntegerType()),
            StructField('distance', IntegerType()),
            StructField('origin', StringType()),
            StructField('destination', StringType()),
            ])

flightsDelatDf = (spark
                  .read
                  .format('csv')
                  .option('path', pathData)
                  .option('header', True)
                  .option("schema", schema)
                  .load()
                  )

When I check the schema, the columns delay and distance show up as type string, even though I defined them as integers in the schema:

flightsDelatDf.printSchema()

root
 |-- date: string (nullable = true)
 |-- delay: string (nullable = true)
 |-- distance: string (nullable = true)
 |-- origin: string (nullable = true)
 |-- destination: string (nullable = true)

But if I use .schema(schema) instead of .option('schema', schema) to specify the schema when reading the file:

flightsDelatDf = (spark
                  .read
                  .format('csv')
                  .option('path', pathData)
                  .option('header', True)
                  .schema(schema)
                  .load()
                  )

the column data types match what I specified.

flightsDelatDf.printSchema()

root
 |-- date: string (nullable = true)
 |-- delay: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- destination: string (nullable = true)

Does anyone know why the data types do not match the defined schema in the first case, but do in the second? Thanks in advance.

Answer 1

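The incorrect approach (option("schema", schema)): .option() is designed for simple key-value settings, so it does not parse or apply a StructType object. The value is passed along as a plain string under a key that the CSV reader does not recognize, so it has no effect. With no schema supplied and inferSchema left at its default of false, Spark reads every CSV column as a string, which is why delay and distance come back as string.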

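The correct approach is the .schema() method: it is designed to accept a StructType schema and apply it to the DataFrame when the CSV file is read. This ensures that the DataFrame's columns have the data types listed in the schema.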
