Random forest classifier: converting indexed labels back to string values

Question

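I am doing text classification and have built a model using the pipeline approach.

I fit it on training data that I created as a DataFrame with "label" and "sentence" columns. The labels are different question types. The DataFrame looks like this: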

training = sqlContext.createDataFrame([
("DESC:manner", "How did serfdom develop in and then leave Russia ?"),
("DESC:def", "What does '' extended definition '' mean and how would one a paper on it ? "),
("HUM:ind", " Who was The Pride of the Yankees ?")
], ["label", "sentence"])

The code that creates the pipeline is:

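from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, IDF, StringIndexer, Tokenizer

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")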
wordsData = tokenizer.transform(training) 
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="idxlabel")

rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel") 
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, indexer, rf])
model = pipeline.fit(training)

The prediction code is:

test = sqlContext.createDataFrame([("What is the highest waterfall in the United States ?" ,)], ["sentence"])
prediction = model.transform(test)
selected = prediction.select("sentence", "prediction")

Now, if I run selected.show(truncate=False), it displays the data in the following format:

+----------------------------------------------------+----------+
|sentence                                            |prediction|
+----------------------------------------------------+----------+
|What is the highest waterfall in the United States ?|2.0       |
+----------------------------------------------------+----------+

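The problem is that I want the predictions in label form, as in the training data, but I get a double value instead. How can I convert the predicted value from a double back to a string?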

Answer 1

The IndexToString transformer provides the required functionality. For details, see the Scala example in the Spark source code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala

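indexerModel = model.stages[3]  # fitted StringIndexerModel (4th pipeline stage); the unfitted StringIndexer has no labels
labeler = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=indexerModel.labels)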
