Indexing back probabilities in Spark classification prediction

Question

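I am trying to index the prediction probabilities back to their labels in a Spark classification prediction. My input data for a multiclass classifier has the labels red, green, and blue.

Input DataFrame: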

+-----+---+---+---+---+---+---+---+---+---+----+----+----+----+
|  _c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|
+-----+---+---+---+---+---+---+---+---+---+----+----+----+----+
|  red|  0|  0|  0|  1|  0|  0|  0|  2|  3|   2|   2|   0|   5|
|green|  5|  6|  0| 14|  0|  5|  0| 95|  2| 120|   0|   0|   9|
|green|  6|  1|  0|  3|  0|  4|  0| 21| 22|  11|   0|   0|  23|
|  red|  0|  1|  0|  1|  0|  4|  0|  1|  4|   2|   0|   0|   5|
|green| 37|  9|  0| 19|  0| 31|  0| 87|  9| 108|   0|   0| 170|
+-----+---+---+---+---+---+---+---+---+---+----+----+----+----+
only showing top 5 rows

I use StringIndexer to index the label column and VectorAssembler to create a feature vector from the feature columns.

Parsed DataFrame:

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(13,[3,7,8,9,10,1...|
|  0.0|[5.0,6.0,0.0,14.0...|
|  0.0|[6.0,1.0,0.0,3.0,...|
|  1.0|(13,[1,3,5,7,8,9,...|
|  0.0|[37.0,9.0,0.0,19....|
+-----+--------------------+
only showing top 5 rows

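I use this data to train a random forest classification model. At query time, I supply the feature columns to predict the label along with its probability.

Query DataFrame: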

+---+---+---+---+---+---+---+---+---+---+----+----+----+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|
+---+---+---+---+---+---+---+---+---+---+----+----+----+
| 11| 11|  0| 23|  0|  7|  2| 70| 81| 76|   7|   0|  23|
|  4|  0|  0|  0|  0|  0|  2|  2|  3|  2|   7|   0|   2|
+---+---+---+---+---+---+---+---+---+---+----+----+----+

Parsed query DataFrame:

+--------------------+--------------------+
|          queryValue|            features|
+--------------------+--------------------+
|11,11,0,23,0,7,2,...|[11.0,11.0,0.0,23...|
|4,0,0,0,0,0,2,2,3...|(13,[0,6,7,8,9,10...|
+--------------------+--------------------+

Raw prediction from the RFCModel:

+--------------------+--------------------+--------------------+----------+
|          queryValue|            features|         probability|prediction|
+--------------------+--------------------+--------------------+----------+
|11,11,0,23,0,7,2,...|[11.0,11.0,0.0,23...|        [0.67, 0.32]|       0.0|
|4,0,0,0,0,0,2,2,3...|(13,[0,6,7,8,9,10...|        [0.05, 0.94]|       1.0|
+--------------------+--------------------+--------------------+----------+

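In the raw prediction, the probability column is an array of doubles that holds each class probability at the corresponding class index. For example, if a row in the probability column is [0.67, 0.32], class 0.0 has probability 0.67 and class 1.0 has probability 0.32. The probability column is only meaningful while the labels are 0, 1, 2, and so on. Once I use IndexToString to index the prediction back to its original label, the probability column becomes meaningless.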
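To make the positional correspondence concrete, here is a minimal plain-Scala sketch (no Spark required). It assumes the label order `green`, `red`, taken from the parsed DataFrame above where green was indexed as 0.0 and red as 1.0. The object and method names are hypothetical, and the hard-coded array only stands in for the labels that a fitted StringIndexerModel would provide.

```scala
object LabelProbabilities {
  // Pair each class label with the probability stored at the same position.
  // Assumption: labels(i) is the label that was indexed as class i.
  def labelProbabilities(labels: Array[String],
                         probabilities: Array[Double]): Map[String, Double] = {
    require(labels.length == probabilities.length,
      "expected exactly one probability per label")
    labels.zip(probabilities).toMap
  }

  def main(args: Array[String]): Unit = {
    // Hard-coded stand-in for the indexer's labels: green -> 0.0, red -> 1.0.
    val labels = Array("green", "red")
    println(labelProbabilities(labels, Array(0.67, 0.32)))
    println(labelProbabilities(labels, Array(0.05, 0.94)))
  }
}
```

The pairing depends only on position, so it remains valid after the prediction column has been converted back to a string label, provided the same label order is used.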

Indexed DataFrame:

+--------------------+--------------------+--------------------+----------+
|          queryValue|            features|         probability|prediction|
+--------------------+--------------------+--------------------+----------+
|11,11,0,23,0,7,2,...|[11.0,11.0,0.0,23...|        [0.67, 0.32]|     green|
|4,0,0,0,0,0,2,2,3...|(13,[0,6,7,8,9,10...|        [0.05, 0.94]|       red|
+--------------------+--------------------+--------------------+----------+

I would like to index the probability column as shown below:

+--------------------+--------------------+--------------------------+----------+
|          queryValue|            features|               probability|prediction|
+--------------------+--------------------+--------------------------+----------+
|11,11,0,23,0,7,2,...|[11.0,11.0,0.0,23...| {"red":0.32,"green":0.67}|     green|
|4,0,0,0,0,0,2,2,3...|(13,[0,6,7,8,9,10...| {"red":0.94,"green":0.05}|       red|
+--------------------+--------------------+--------------------------+----------+

Currently I index the probability column by converting the DataFrame to a List. Is there a built-in feature transformer in Spark that can do this?

Answer 1

Try the following approach to solve this.

I used the Iris data to work through the problem.

Sample input (first 5 rows):

+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|      label|
+------------+-----------+------------+-----------+-----------+
|         5.1|        3.5|         1.4|        0.2|Iris-setosa|
|         4.9|        3.0|         1.4|        0.2|Iris-setosa|
|         4.7|        3.2|         1.3|        0.2|Iris-setosa|
|         4.6|        3.1|         1.5|        0.2|Iris-setosa|
|         5.0|        3.6|         1.4|        0.2|Iris-setosa|
+------------+-----------+------------+-----------+-----------+

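Get the labels with their indices from the StringIndexerModel

You mentioned:

"I use StringIndexer to index the label column and VectorAssembler to create a feature vector from the feature columns."

Here we use the StringIndexerModel to obtain a Map[index, label]: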

  // In my case, the StringIndexerModel is referenced as labelIndexer.
  // Note: despite its name, this map goes from index to label.
  val labelToIndex = labelIndexer.labels.zipWithIndex.map(_.swap).toMap
  println(labelToIndex)

Result: Map(0 -> Iris-setosa, 1 -> Iris-versicolor, 2 -> Iris-virginica)
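The steps that build this map can be traced in plain Scala without a Spark session. In the sketch below, the label array is a hard-coded stand-in for `labelIndexer.labels` on the Iris data, and `IndexToLabelSketch` and `indexToLabel` are hypothetical names used only for illustration.

```scala
object IndexToLabelSketch {
  // Build an index -> label lookup from an ordered array of labels.
  def indexToLabel(labels: Array[String]): Map[Int, String] =
    labels.zipWithIndex // Array((label, index), ...)
      .map(_.swap)      // Array((index, label), ...)
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumed stand-in for labelIndexer.labels on the Iris data.
    val labels = Array("Iris-setosa", "Iris-versicolor", "Iris-virginica")
    val lookup = indexToLabel(labels)

    // The prediction column stores the class index as a Double, e.g. 0.0,
    // so it is converted to an Int before the lookup.
    val prediction = 0.0
    println(lookup(prediction.toInt))
  }
}
```

Because the map is keyed by the class index, the same lookup can translate either a predicted class or a position in the probability vector.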

Use this map to generate the probability JSON:
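  import org.apache.spark.ml.linalg.Vector
  import org.apache.spark.sql.functions.{col, to_json, udf}

  // Pair each probability with its class index, then swap the index for its label.
  // Relabel before calling toMap so that equal probabilities cannot collide as keys.
  val mapToLabel = udf((vector: Vector) =>
    vector.toArray.zipWithIndex.map {
      case (prob, index) => labelToIndex(index) -> prob
    }.toMap)

  predictions.select(
      col("features"),
      col("probability"),
      to_json(mapToLabel(col("probability"))).as("probability_json"),
      col("prediction"),
      col("predictedLabel"))
    .show(5, false)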

结果-+-------------------------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+----------+--------------+
|features                             |probability                                                 |probability_json                                                                                             |prediction|predictedLabel|
+-------------------------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+----------+--------------+
|(123,[0,37,82,101],[1.0,1.0,1.0,1.0])|[0.7094347002635046,0.174338768115942,0.11622653162055337]  |{"Iris-setosa":0.7094347002635046,"Iris-versicolor":0.174338768115942,"Iris-virginica":0.11622653162055337}  |0.0       |Iris-setosa   |
|(123,[0,39,58,101],[1.0,1.0,1.0,1.0])|[0.7867074275362319,0.12433876811594202,0.0889538043478261] |{"Iris-setosa":0.7867074275362319,"Iris-versicolor":0.12433876811594202,"Iris-virginica":0.0889538043478261} |0.0       |Iris-setosa   |
|(123,[0,39,62,107],[1.0,1.0,1.0,1.0])|[0.5159492704509036,0.2794443583750028,0.2046063711740936]  |{"Iris-setosa":0.5159492704509036,"Iris-versicolor":0.2794443583750028,"Iris-virginica":0.2046063711740936}  |0.0       |Iris-setosa   |
|(123,[2,39,58,101],[1.0,1.0,1.0,1.0])|[0.7822379507920459,0.12164981462756994,0.09611223458038423]|{"Iris-setosa":0.7822379507920459,"Iris-versicolor":0.12164981462756994,"Iris-virginica":0.09611223458038423}|0.0       |Iris-setosa   |
|(123,[2,43,62,101],[1.0,1.0,1.0,1.0])|[0.7049652235193186,0.17164981462756992,0.1233849618531115] |{"Iris-setosa":0.7049652235193186,"Iris-versicolor":0.17164981462756992,"Iris-virginica":0.1233849618531115} |0.0       |Iris-setosa   |
+-------------------------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+----------+--------------+
only showing top 5 rows
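As a possible alternative that avoids a UDF, the same JSON column can be assembled from built-in column functions. This is only a sketch under stated assumptions: it requires Spark 3.0 or later (where `org.apache.spark.ml.functions.vector_to_array` is available). The `labels` array stands in for `labelIndexer.labels`, the tiny `predictions` DataFrame stands in for real model output, and `ProbabilityJsonSketch` and `withProbabilityJson` are hypothetical names.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, map_from_arrays, to_json}

object ProbabilityJsonSketch {
  // Add a JSON column that maps each label to its probability.
  // Assumption: labels(i) is the label for class index i.
  def withProbabilityJson(df: DataFrame, labels: Array[String]): DataFrame = {
    // Constant array column holding the labels in class-index order.
    val labelsCol = array(labels.map(lit(_)): _*)
    // Convert the ML vector to array<double>, then zip labels and values into a map.
    val probabilityMap = map_from_arrays(labelsCol, vector_to_array(col("probability")))
    df.withColumn("probability_json", to_json(probabilityMap))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("probability-json-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for model output: one probability vector per row.
    val predictions = Seq(
      (1, Vectors.dense(0.67, 0.32)),
      (2, Vectors.dense(0.05, 0.94))
    ).toDF("id", "probability")

    withProbabilityJson(predictions, Array("green", "red")).show(false)
    spark.stop()
  }
}
```

Whether this or the UDF reads better is a matter of taste. The column-function version keeps the whole expression visible to Spark's optimizer, while the UDF version also runs on Spark releases that predate `vector_to_array`.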
