PySpark Array<double> is not Array<double>

Question  Votes: 5  Answers: 1

I am running a very simple Spark ML script (Spark 2.4.0 on Databricks):

from pyspark.ml.clustering import LDA

lda = LDA(k=10, maxIter=100).setFeaturesCol('features')
model = lda.fit(dataset)

but I get the following error:

IllegalArgumentException: 'requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type array<double>.'

Why is my array<double> not an array<double>?

Here is the schema:

root
 |-- BagOfWords: struct (nullable = true)
 |    |-- indices: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- size: long (nullable = true)
 |    |-- type: long (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = true)
apache-spark pyspark apache-spark-ml
1 Answer
1 vote

You may need to convert it to vector form using VectorAssembler:

from pyspark.ml.feature import VectorAssembler
