Random forest classifier - converting indexed labels back to string values

Problem description

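I am working on text classification and have built a model using the pipeline approach.

I am fitting the model on training data that I created as a DataFrame with "label" and "sentence" columns. The labels are different question types. The DF looks like,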

training = sqlContext.createDataFrame([
("DESC:manner", "How did serfdom develop in and then leave Russia ?"),
("DESC:def", "What does '' extended definition '' mean and how would one a paper on it ? "),
("HUM:ind", " Who was The Pride of the Yankees ?")
], ["label", "sentence"])

The code to create the pipeline is -

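from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(training)  # intermediate result, not used by the pipeline below
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)  # intermediate result, not used by the pipeline below
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="idxlabel")

rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, indexer, rf])
model = pipeline.fit(training)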

The prediction code is -

test = sqlContext.createDataFrame([("What is the highest waterfall in the United States ?" ,)], ["sentence"])
prediction = model.transform(test)
selected = prediction.select("sentence", "prediction")

Now, if I run 'selected.show(truncate = False)', it displays the data in the following format -

+----------------------------------------------------+----------+
|sentence                                            |prediction|
+----------------------------------------------------+----------+
|What is the highest waterfall in the United States ?|2.0       |
+----------------------------------------------------+----------+

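The problem is that I want the predictions in label format, as in the training data, but I get a double value instead. How can I convert the predicted value from a double back to a string?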

python pyspark spark-dataframe apache-spark-ml
1 answer

There is an IndexToString transformer that provides the required functionality. For details, see the Scala example in the Spark source code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala

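from pyspark.ml.feature import IndexToString
labeler = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=model.stages[3].labels)  # labels of the fitted StringIndexerModel (pipeline stage 3); the unfitted indexer has none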
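
To make the round trip concrete, here is a minimal plain-Python sketch of the idea behind the two stages. It is not Spark's implementation, and the helper names `fit_labels` and `index_to_string` are hypothetical. It assumes that labels are ordered by descending frequency with ties broken alphabetically, which is my understanding of StringIndexer's default ordering in recent Spark versions; older versions may order ties differently. The point is that the prediction is just a position in the fitted label list, so converting it back is a lookup.

```python
from collections import Counter


def fit_labels(values):
    """Return distinct labels ordered by descending frequency.

    Ties are broken alphabetically. This ordering is an assumption meant to
    resemble StringIndexer's default; it is not taken from Spark's code.
    """
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))


def index_to_string(prediction, labels):
    """Map a double-valued class index back to its string label."""
    return labels[int(prediction)]


# Labels from the training data in the question.
training_labels = ["DESC:manner", "DESC:def", "HUM:ind"]

# "Fitting" yields the ordered label list; a label's index is its position.
labels = fit_labels(training_labels)
label_to_index = {label: float(i) for i, label in enumerate(labels)}

# Round trip: string -> double index -> string.
restored = [index_to_string(label_to_index[l], labels) for l in training_labels]
```

Under the tie-breaking assumption above, a prediction of 2.0 would look up the third entry of the label list. In Spark itself the list should be read from the fitted model rather than recomputed, since the fitted model is the authority on which index belongs to which label.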