PySpark DataFrame type error: field a: DoubleType can not accept object 'a' in type <class 'str'>

Question

I have this code:

from pyspark.sql.types import StructType, StructField, DoubleType

customSchema = StructType([
    StructField("a", DoubleType(), True),
    StructField("b", DoubleType(), True),
    StructField("c", DoubleType(), True),
    StructField("d", DoubleType(), True)])


import csv

n_1 = sc.textFile("/path/*.txt")\
        .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"')).filter(lambda line: len(line) > 1)\
        .toDF(customSchema)

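This is supposed to create a DataFrame. The problem is that `.mapPartitions` with `csv.reader` yields every field as `str` by default, and I need to convert the values to DoubleType before turning the RDD into a DataFrame. Any ideas?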

Sample data

[['0,01', '344,01', '0,00', '0,00']]

Alternatively, working with just the RDD (without `toDF`):

n_1 = sc.textFile("/path/*.txt")\
        .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"')).filter(lambda line: len(line) > 1)

Answer 1

First, collect all the elements, using the second option, to build a matrix (a list of lists).

n_1 = sc.textFile("/path/*.txt")\
        .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"')).filter(lambda line: len(line) > 1)

matrix  = n_1.collect()

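Once we have this, we need to know what type of data is in the sublists (in my case, `str`). The values use a decimal comma, so the comma has to be replaced with a dot before they can be converted to float.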

# Replace ',' with '.' so the strings can be parsed as floats
matrix = [[x.replace(',', '.') for x in i] for i in matrix]

# Convert every sublist element to float
matrix = [[float(x) for x in i] for i in matrix]

df = sc.parallelize(matrix).toDF()
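
Collecting everything to the driver is fine for small inputs, but the same conversion can probably be done row by row inside the RDD, before `toDF`, which avoids `collect()` altogether. A minimal sketch, assuming every field uses a decimal comma as in the sample data; `to_float_row` is a hypothetical helper name, not something from the original post:

```python
def to_float_row(row):
    """Convert a row of decimal-comma strings (e.g. '344,01') to floats."""
    return [float(value.replace(',', '.')) for value in row]


# Quick check on the sample row from the question.
sample = ['0,01', '344,01', '0,00', '0,00']
print(to_float_row(sample))

# With Spark (assumes the `n_1` RDD and `customSchema` from the question):
# df = n_1.map(to_float_row).toDF(customSchema)
```

Because the rows are already floats when they reach `toDF`, the DoubleType schema should accept them. A field that is empty or not numeric would raise a `ValueError` in this sketch, so real data may need extra handling.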
