Understanding MLlib's split function in PySpark


I have the following transformed data.

dataframe: rev

+--------+------------------+
|features|             label|
+--------+------------------+
|  [24.0]| 6.382551510879452|
|  [29.0]| 6.233604067150788|
|  [35.0]|15.604956217859785|
+--------+------------------+

When I split it into two sets, I get something very unexpected. Sorry, I'm new to PySpark.

(trainingData, testData) = rev.randomSplit([0.7, 0.3])

Now, when I check the training set, I see:

trainingData.show(3)

+--------+--------------------+
|features|               label|
+--------+--------------------+
|  [22.0]|0.007807592294154144|
|  [22.0]|0.016228017481755445|
|  [22.0]|0.029326273621380787|
+--------+--------------------+

Unfortunately, when I run the model and check the predictions on the test set, I get the following:

+------------------+--------------------+--------+
|        prediction|               label|features|
+------------------+--------------------+--------+
|11.316183853894138|0.023462300065135114|  [22.0]|
|11.316183853894138| 0.02558467547137103|  [22.0]|
|11.316183853894138| 0.03734394063419729|  [22.0]|
|11.316183853894138| 0.07660100900324195|  [22.0]|
|11.316183853894138| 0.08032742812331381|  [22.0]|
+------------------+--------------------+--------+

The predictions and the labels bear almost no relationship to each other.

Thanks in advance.

Update:

The whole dataset:

rev.describe().show()

+-------+--------------------+
|summary|               label|
+-------+--------------------+
|  count|            28755967|
|   mean|  11.326884020257475|
| stddev|  6.0085535870540125|
|    min|5.158072668697356E-4|
|    max|   621.5236222433649|
+-------+--------------------+

And the training set:

+-------+--------------------+
|summary|               label|
+-------+--------------------+
|  count|            20132404|
|   mean|  11.327304652511287|
| stddev|   6.006384709888342|
|    min|5.158072668697356E-4|
|    max|   294.9624797344751|
+-------+--------------------+
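
As a rough sanity check, the two summaries above can be compared directly. The sketch below uses only the figures printed in those tables; the variable names are illustrative and not part of the original code.

```python
# Figures copied from the two describe() summaries above.
full_count = 28755967
full_mean = 11.326884020257475
full_std = 6.0085535870540125
train_count = 20132404
train_mean = 11.327304652511287

# Share of rows that ended up in the training set (0.7 was requested).
train_fraction = train_count / full_count

# How far the training mean moved, in label units and in standard deviations.
mean_shift = train_mean - full_mean
mean_shift_in_std = mean_shift / full_std

print(f"training fraction: {train_fraction:.4f}")
print(f"shift in mean label: {mean_shift:.6f} ({mean_shift_in_std:.6f} std)")
```

The training fraction comes out very close to the requested 0.7 and the mean label barely moves, which suggests the split proportions are as expected and that the three rows printed by `show(3)` are not representative of the training set as a whole. That reading is an inference from the summaries, not something the post confirms.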
pyspark linear-regression apache-spark-mllib
1 Answer

Try setting the seed in pyspark.sql.DataFrame.randomSplit:

(trainingData, testData) = rev.randomSplit([7.0, 3.0], seed=100)
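
To illustrate what the two arguments mean, here is a minimal pure-Python sketch of a seeded, weighted split. It is not Spark's implementation: `seeded_split` is a hypothetical helper written for this example, and it only mirrors two documented behaviours of `randomSplit`, namely that weights are normalized when they do not sum to 1.0 and that a fixed seed makes the assignment repeatable.

```python
import random


def seeded_split(rows, weights, seed):
    """Assign each row to one split, using normalized weights and a fixed seed.

    Hypothetical helper for illustration only; not how Spark does it internally.
    """
    total = sum(weights)
    # Cumulative upper bounds of each split, e.g. [7.0, 3.0] -> about [0.7, 1.0].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)  # same seed -> same sequence of draws
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last split also catches any floating-point leftovers.
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits


rows = list(range(1000))
train_a, test_a = seeded_split(rows, [7.0, 3.0], seed=100)
train_b, test_b = seeded_split(rows, [7.0, 3.0], seed=100)

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)  # True: same seed, same split
```

With a fixed seed, rerunning the split gives the same training and test rows, which makes results comparable between runs. A seed does not change which rows `show()` happens to print first, so summary statistics are a better way to judge a split than the first few rows.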
