Creating a DataFrame from an RDD while specifying DateType() in the schema

Question

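I am creating a DataFrame from an RDD, and one of the values is a date. I don't know how to specify DateType() in the schema.

Let me illustrate the problem at hand.

One way to load the date into the DataFrame is to first declare it as a string and then convert it to a proper date using the to_date() function.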

from pyspark.sql.types import Row, StructType, StructField, StringType, IntegerType, DateType
from pyspark.sql.functions import col, to_date
values = sc.parallelize([(3, '2012-02-02'), (5, '2018-08-08')])
rdd = values.map(lambda t: Row(A=t[0], date=t[1]))

# Importing date as String in Schema
schema = StructType([StructField('A', IntegerType(), True), StructField('date', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

# Finally converting the string into date using to_date() function.
df = df.withColumn('date',to_date(col('date'), 'yyyy-MM-dd'))
df.show()
+---+----------+
|  A|      date|
+---+----------+
|  3|2012-02-02|
|  5|2018-08-08|
+---+----------+

df.printSchema()
root
 |-- A: integer (nullable = true)
 |-- date: date (nullable = true)

Is there a way to use DateType() in the schema directly and avoid having to convert the string to a date explicitly?

Something like this:

values = sc.parallelize([(3, '2012-02-02'), (5, '2018-08-08')])
rdd = values.map(lambda t: Row(A=t[0], date=t[1]))
# Somewhere we would need to specify date format 'yyyy-MM-dd' too, don't know where though.
schema = StructType([StructField('A', IntegerType(), True), StructField('date', DateType(), True)])

Update: following @user10465355's suggestion, the code below works:

import datetime
schema = StructType([
  StructField('A', IntegerType(), True),
  StructField('date', DateType(), True)
])
rdd = values.map(lambda t: Row(A=t[0], date=datetime.datetime.strptime(t[1], "%Y-%m-%d")))
df = sqlContext.createDataFrame(rdd, schema)
df.show()
+---+----------+
|  A|      date|
+---+----------+
|  3|2012-02-02|
|  5|2018-08-08|
+---+----------+
df.printSchema()
root
 |-- A: integer (nullable = true)
 |-- date: date (nullable = true)

Answer 1

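Long story short, a schema used with an RDD of external objects is not meant to be used this way: the declared types should reflect the actual state of the data, not the desired one.

In other words, to allow: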

schema = StructType([
  StructField('A', IntegerType(), True),
  StructField('date', DateType(), True)
])

the data corresponding to the date field should be of type datetime.date. So, for example, with your RDD[Tuple[int, str]]:

import datetime

spark.createDataFrame(
    # Since values from the question are just two element tuples
    # we can use mapValues to transform the "value"
    # but in general case you'll need map
    values.mapValues(datetime.date.fromisoformat),
    schema
)
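If the date strings are not in ISO format, or the records are not simple key-value pairs, a plain map with an explicit parser is one way to go. Below is a minimal sketch of that general case; the helper name parse_record and its fmt parameter are hypothetical and not part of the original answer, and the Spark call in the trailing comment is shown for context only.

```python
import datetime


def parse_record(record, fmt="%Y-%m-%d"):
    """Convert an (int, str) tuple into an (int, datetime.date) tuple.

    Hypothetical helper: `fmt` is a strptime pattern, not a Spark pattern.
    """
    key, raw_date = record
    # strptime returns a datetime.datetime; .date() drops the time part
    # so the value matches what DateType() expects.
    return key, datetime.datetime.strptime(raw_date, fmt).date()


# Plain-Python check of the conversion on the sample data.
parsed = [parse_record(r) for r in [(3, "2012-02-02"), (5, "2018-08-08")]]

# With the RDD and schema defined above, this would be used roughly as:
# spark.createDataFrame(values.map(parse_record), schema)
```

Note that the pattern here follows Python's strptime syntax (%Y-%m-%d), not the yyyy-MM-dd syntax used by to_date().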

The closest you can get to the desired behavior is to convert the data (RDD[Row]) to dicts and pass it through the JSON reader:

from pyspark.sql import Row

spark.read.schema(schema).json(rdd.map(Row.asDict))

or, better, with an explicit JSON dump:

import json
spark.read.schema(schema).json(rdd.map(Row.asDict).map(json.dumps))
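To see what the JSON reader actually receives, the sketch below serializes the sample records in plain Python. The helper to_json_line is hypothetical and builds plain dicts in place of Row.asDict(). The dateFormat reader option in the trailing comment is, to my knowledge, how the JSON source is told which pattern to parse dates with; that line is an untested assumption rather than part of the original answer.

```python
import json


def to_json_line(record):
    """Serialize an (int, str) tuple as one JSON document per record.

    Hypothetical helper standing in for Row.asDict followed by json.dumps.
    """
    key, raw_date = record
    return json.dumps({"A": key, "date": raw_date})


# The date stays a string here; the reader converts it using the schema.
lines = [to_json_line(r) for r in [(3, "2012-02-02"), (5, "2018-08-08")]]

# Assumed usage with the RDD and schema defined above:
# (spark.read
#     .schema(schema)
#     .option("dateFormat", "yyyy-MM-dd")
#     .json(values.map(to_json_line)))
```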

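The JSON route is of course much more expensive than an explicit cast, which, by the way, is easy to automate in simple cases like the one you describe: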

from pyspark.sql.functions import col

(spark
    .createDataFrame(values, ("a", "date"))
    .select([col(f.name).cast(f.dataType) for f in schema]))
