Length of object (3) does not match with length of fields (1) in PySpark

Question · Votes: 0 · Answers: 2

I'm having a problem with the following code. I want to create a single-column DataFrame. What am I doing wrong here?

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType,ArrayType,StructType,StructField,StringType
data = [ (["James","Jon","Jane"]), (["Miken","Mik","Mike"]), (["John","Johns"])]
cols = StructType([ StructField("Name",ArrayType(StringType()),True)  ])
df = spark.createDataFrame(data=data,schema=cols)
df.printSchema()
df.show()

Expected output:
Name
["James","Jon","Jane"]
["Miken","Mik","Mike"]
["John","Johns"]


Instead, I get the error below:
Length of object (3) does not match with length of fields (1)

python apache-spark pyspark apache-spark-sql
2 Answers
0
votes

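This error occurs because your data ends up as a multiple-column structure, while your schema defines a single column. In Python, (["James","Jon","Jane"]) is not a tuple: without a trailing comma the parentheses only group the expression, so each row is a plain list. Spark then treats every element of that list as a separate column value, which gives an object of length 3 against a schema with 1 field.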
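The point about parentheses can be checked in plain Python, without Spark. The snippet below is only a small sketch; the names `not_a_tuple`, `one_tuple` and `row_lengths` are made up for illustration.

```python
# Parentheses alone only group an expression; the trailing comma makes the tuple.
not_a_tuple = (["James", "Jon", "Jane"])
one_tuple = (["James", "Jon", "Jane"],)

print(type(not_a_tuple).__name__, len(not_a_tuple))  # list 3
print(type(one_tuple).__name__, len(one_tuple))      # tuple 1

# The question's data is therefore a list of lists, not a list of 1-tuples.
data = [(["James", "Jon", "Jane"]), (["Miken", "Mik", "Mike"]), (["John", "Johns"])]
row_lengths = [len(row) for row in data]
print(row_lengths)  # [3, 3, 2]
```

The first row has three elements, which is consistent with the "(3)" in the error message, while the schema declares one field.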

So, to get each list into a single column, wrap every row in a one-element tuple:

[(row,) for row in data]

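from pyspark.sql.types import ArrayType,StructType,StructField,StringType

# each element of data is a plain list (the parentheses do not create tuples)
data = [(["James","Jon","Jane"]), (["Miken","Mik","Mike"]), (["John","Johns"])]
cols = StructType([ StructField("Name",ArrayType(StringType()),True)  ])
# wrap each list in a 1-tuple so it maps to the single "Name" column
df = spark.createDataFrame(data=[(row,) for row in data], schema=cols)
df.printSchema()
df.show()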


0
votes

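PySpark does have this problem. My approach is to introduce an ID column, create the df, and then drop that column. This works because each row is then a real two-element tuple that matches the two-field schema.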

from pyspark.sql import functions as F 
from pyspark.sql.types import IntegerType,ArrayType,StructType,StructField,StringType 

data = [ (1,["James","Jon","Jane"]), (2,["Miken","Mik","Mike"]), (3,["John","Johns"])]

cols = StructType([ StructField("ID",IntegerType(),True), StructField("Name",ArrayType(StringType()),True)  ])

df = spark.createDataFrame(data=data,schema=cols).drop('ID')
df.printSchema()
df.show() 
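If the ID column is not needed for anything else, the same single-column rows can probably be written directly by putting a trailing comma in each tuple. The following is a minimal sketch, not a tested Spark job: `build_names_df` is a hypothetical helper name, and it assumes an existing SparkSession is passed in as `spark`.

```python
# Each row is a 1-tuple: note the trailing comma after the list.
data = [
    (["James", "Jon", "Jane"],),
    (["Miken", "Mik", "Mike"],),
    (["John", "Johns"],),
]

# Every row now holds exactly one value, matching a one-field schema.
assert all(isinstance(row, tuple) and len(row) == 1 for row in data)


def build_names_df(spark):
    """Build the single-column DataFrame; `spark` is assumed to be an existing SparkSession."""
    # Imported here so the data check above runs even without PySpark installed.
    from pyspark.sql.types import ArrayType, StringType, StructField, StructType

    schema = StructType([StructField("Name", ArrayType(StringType()), True)])
    return spark.createDataFrame(data=data, schema=schema)
```

With a live session, `build_names_df(spark).show()` would be the equivalent of the `df.show()` call above, without creating and dropping an extra column.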