I have the following transformed data.
dataframe: rev
+--------+------------------+
|features| label|
+--------+------------------+
| [24.0]| 6.382551510879452|
| [29.0]| 6.233604067150788|
| [35.0]|15.604956217859785|
+--------+------------------+
When I split it into two sets, I get something very unexpected. Sorry, I'm new to PySpark.
(trainingData, testData) = rev.randomSplit([0.7, 0.3])
Now, when I inspect the training set, I see:
trainingData.show(3)
+--------+--------------------+
|features| label|
+--------+--------------------+
| [22.0]|0.007807592294154144|
| [22.0]|0.016228017481755445|
| [22.0]|0.029326273621380787|
+--------+--------------------+
Unfortunately, when I run the model and check the predictions on the test set, I get the following:
+------------------+--------------------+--------+
| prediction| label|features|
+------------------+--------------------+--------+
|11.316183853894138|0.023462300065135114| [22.0]|
|11.316183853894138| 0.02558467547137103| [22.0]|
|11.316183853894138| 0.03734394063419729| [22.0]|
|11.316183853894138| 0.07660100900324195| [22.0]|
|11.316183853894138| 0.08032742812331381| [22.0]|
+------------------+--------------------+--------+
The predictions and labels correlate terribly.
Thanks in advance.
Update:
The whole dataset:
rev.describe().show()
+-------+--------------------+
|summary| label|
+-------+--------------------+
| count| 28755967|
| mean| 11.326884020257475|
| stddev| 6.0085535870540125|
| min|5.158072668697356E-4|
| max| 621.5236222433649|
+-------+--------------------+
And the training set:
+-------+--------------------+
|summary| label|
+-------+--------------------+
| count| 20132404|
| mean| 11.327304652511287|
| stddev| 6.006384709888342|
| min|5.158072668697356E-4|
| max| 294.9624797344751|
+-------+--------------------+
I also tried setting a seed for pyspark.sql.DataFrame.randomSplit:
(trainingData, testData) = rev.randomSplit([7.0, 3.0], 100)
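For context, my understanding is that randomSplit assigns each row independently to a bucket with probability proportional to the weights, and a fixed seed makes that assignment repeatable. A rough pure-Python sketch of that idea (random_split here is a hypothetical helper for illustration, not Spark code):

```python
import random

def random_split(rows, weights, seed=None):
    """Mimic the idea behind DataFrame.randomSplit: each row is
    independently assigned to a bucket with probability proportional
    to the (normalized) weights."""
    total = sum(weights)
    # Cumulative normalized boundaries, e.g. [0.7, 1.0] for weights [7.0, 3.0]
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        r = rng.random()  # uniform in [0, 1)
        for i, b in enumerate(bounds):
            if r < b:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train1, test1 = random_split(rows, [7.0, 3.0], seed=100)
train2, test2 = random_split(rows, [7.0, 3.0], seed=100)
assert train1 == train2  # same seed, same split
```

With a fixed seed the two calls produce identical splits, and the train fraction comes out near 70% of the rows, so in principle the seeded Spark split should also be stable across runs.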