Dimension mismatch error in Spark ML

Problem description (3 votes, 2 answers)

I am new to both ML and Spark ML, and I am trying to build a prediction model with a neural network in Spark ML, but I get this error when I call the .transform method on the fitted model. The problem is caused by the use of OneHotEncoder, because without it everything works fine. I have already tried taking the OneHotEncoder out of the pipeline.

My question is: how can I use OneHotEncoder without getting this error?

 java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch!
     at scala.Predef$.require(Predef.scala:224)
     at org.apache.spark.ml.ann.BreezeUtil$.dgemm(BreezeUtil.scala:41)
     at org.apache.spark.ml.ann.AffineLayerModel.eval(Layer.scala:163)
     at org.apache.spark.ml.ann.FeedForwardModel.forward(Layer.scala:482)
     at org.apache.spark.ml.ann.FeedForwardModel.predict(Layer.scala:529)

My code:

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

test_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.test', names=header, skipinitialspace=True)
train_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.data', names=header, skipinitialspace=True)
train_df = sqlContext.createDataFrame(train_pandas_df)
test_df = sqlContext.createDataFrame(test_pandas_df)

joined = train_df.union(test_df)

assembler = VectorAssembler().setInputCols(features).setOutputCol("features")

label_indexer = StringIndexer().setInputCol(
    "label").setOutputCol("label_index")

label_indexer_fit = [label_indexer.fit(joined)]

string_indexers = [StringIndexer().setInputCol(
    name).setOutputCol(name + "_index").fit(joined) for name in categorical_feats]

one_hot_pipeline = Pipeline().setStages([OneHotEncoder().setInputCol(
    name + '_index').setOutputCol(name + '_one_hot') for name in categorical_feats])

mlp = MultilayerPerceptronClassifier().setLabelCol(label_indexer.getOutputCol()).setFeaturesCol(
    assembler.getOutputCol()).setLayers([len(features), 20, 10, 2]).setSeed(42).setBlockSize(1000).setMaxIter(500)
pipeline = Pipeline().setStages(label_indexer_fit
                                + string_indexers + [one_hot_pipeline] + [assembler, mlp])

model = pipeline.fit(train_df)

# compute accuracy on the test set
result = model.transform(test_df)

## FAILS ON RESULT

predictionAndLabels = result.select("prediction", "label_index")

evaluator = MulticlassClassificationEvaluator(labelCol="label_index")
print "-------------------------------"
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
print "-------------------------------"

Thanks!

python apache-spark machine-learning pyspark apache-spark-ml
2 Answers

3 votes

The layers Param in your model is incorrect:

setLayers([len(features), 20, 10, 2])

The first layer should reflect the number of input features, which is generally not the same as the number of raw columns before encoding.

If you don't know the total number of features up front, you can separate feature extraction from model training. Pseudocode:

feature_pipeline_model = (Pipeline()
     .setStages(...)  # Only feature extraction
     .fit(train_df))

train_df_features = feature_pipeline_model.transform(train_df)
layers = [
    train_df_features.schema["features"].metadata["ml_attr"]["num_attrs"],
    20, 10, 2
]
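To make the point concrete, here is a small pure-Python sketch (no Spark required) of why the encoded feature count differs from the raw column count. The column counts below are illustrative, and Spark's OneHotEncoder default of dropLast=True is assumed:

```python
def encoded_feature_count(num_numeric, category_cardinalities, drop_last=True):
    # With dropLast=True (Spark's OneHotEncoder default), each categorical
    # column contributes (cardinality - 1) one-hot dimensions.
    per_column = [c - 1 if drop_last else c for c in category_cardinalities]
    return num_numeric + sum(per_column)

# Illustrative numbers: 6 numeric columns plus two categorical columns
# with 9 and 16 distinct values. The first layer must match this total
# (6 + 8 + 15 = 29), not the raw column count len(features).
layers = [encoded_feature_count(6, [9, 16]), 20, 10, 2]
```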

-1 votes

I had the same problem and took a more manual approach than what user6910411 suggests. So, for example, I had

layers = [100, 100, 100, 100]

but my number of input variables was actually 199, so I changed it to

layers = [199, 100, 100, 100]

That seemed to solve the problem. :-D
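Rather than counting by hand, the input size can also be read from the metadata Spark attaches to the assembled features column, as in the first answer. A minimal sketch with that metadata mocked as a plain dict (in a real job the dict comes from `train_df_features.schema["features"].metadata`, and 199 stands in for the actual attribute count):

```python
# Mocked metadata with the same nested shape Spark stores on an
# assembled vector column; the real value is read from the DataFrame
# schema rather than hard-coded.
metadata = {"ml_attr": {"num_attrs": 199}}

# The first layer of the MLP must equal the feature-vector length.
input_size = metadata["ml_attr"]["num_attrs"]
layers = [input_size, 100, 100, 100]
```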
