pyspark: ValueError: Some of types cannot be determined after inferring

Question · Votes: 13 · Answers: 3

I have a pandas data frame my_df, and my_df.dtypes gives us:

ts              int64
fieldA         object
fieldB         object
fieldC         object
fieldD         object
fieldE         object
dtype: object

Then I tried converting the pandas data frame into a Spark data frame by doing:

spark_my_df = sc.createDataFrame(my_df)

However, I got the following error:

ValueError Traceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
      2 spark_my_df.take(20)

/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    384 
    385         if schema is None or isinstance(schema, (list, tuple)):
--> 386             struct = self._inferSchemaFromList(data)
    387             if isinstance(schema, (list, tuple)):
    388                 for i, name in enumerate(schema):

/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
    318         schema = reduce(_merge_type, map(_infer_schema, data))
    319         if _has_nulltype(schema):
--> 320             raise ValueError("Some of types cannot be determined after inferring")
    321         return schema
    322 

ValueError: Some of types cannot be determined after inferring

Does anyone know what the above error means? Thank you!
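As a quick diagnostic (a sketch, not from the original post; it assumes the pandas frame my_df shown above), you can list the columns that are entirely null, since those are the ones PySpark cannot infer a type for:

>>> my_df.isnull().all()  # columns reporting True are all-null

Any column that comes back True will defeat type inference unless a schema is supplied.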

python python-2.7 pandas pyspark spark-dataframe
3 Answers

17 votes

In order to infer the field type, PySpark looks at the non-None records in each field. If a field has only None records, PySpark cannot infer the type and raises this error.
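To illustrate the inference rule (a sketch, not part of the original answer; it assumes an active SparkSession named spark), a single non-None record per field is enough for inference to succeed:

>>> df = spark.createDataFrame([[None, None], ["alice", 1.0]], ["name", "score"])
>>> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- score: double (nullable = true)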

Manually defining a schema will resolve the issue:

>>> df = spark.createDataFrame([[None, None]], ["name", "score"])
ValueError: Some of types cannot be determined after inferring

>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+

3 votes

To fix this problem, you can provide your own defined schema.

For example:

To reproduce the error:

>>> df = spark.createDataFrame([[None]])
ValueError: Some of types cannot be determined after inferring

To fix the error:

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
| foo|
+----+
|null|
+----+
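Applying the same fix to the frame from the question (a sketch based on the dtypes listed there; it assumes ts holds 64-bit integers and the object columns hold strings):

>>> from pyspark.sql.types import StructType, StructField, LongType, StringType
>>> fields = [StructField("ts", LongType(), True)]
>>> fields += [StructField(name, StringType(), True)
...            for name in ["fieldA", "fieldB", "fieldC", "fieldD", "fieldE"]]
>>> spark_my_df = sc.createDataFrame(my_df, schema=StructType(fields))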

0 votes

If you are using the RDD[Row].toDF() monkey-patched method, you can increase the sample ratio to check more than 100 records when inferring types:

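A minimal sketch of that call (my_rdd is a hypothetical RDD of Rows; sampleRatio is a real parameter of the monkey-patched RDD.toDF, and 0.5 is an arbitrary choice):

>>> my_df = my_rdd.toDF(sampleRatio=0.5)  # inspect 50% of rows instead of the first 100
>>> my_df.show()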

Assuming there are non-null rows in all fields of your RDD, it will be more likely to find them when you increase the sampleRatio towards 1.0.
