Scaling a dataset with MLlib

Problem description

I am using Spark MLlib to do some scaling on the dataset below:


    +---+--------------+
    | id|      features|
    +---+--------------+
    |  0|[1.0,0.1,-1.0]|
    |  1| [2.0,1.1,1.0]|
    |  0|[1.0,0.1,-1.0]|
    |  1| [2.0,1.1,1.0]|
    |  1|[3.0,10.1,3.0]|
    +---+--------------+

You can find the dataset at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml-scaling/part-00000-cd03406a-cc9b-42b0-9299-1e259fdd9382-c000.gz.parquet

After performing standard scaling, I get the following result:


    +---+--------------+------------------------------------------------------------+
    |id |features      |stdScal_06f7a85f98ef__output                                |
    +---+--------------+------------------------------------------------------------+
    |0  |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
    |1  |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968]   |
    |0  |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
    |1  |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968]   |
    |1  |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902]  |
    +---+--------------+------------------------------------------------------------+

And if I perform min/max scaling (setting `val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")`), I get the following:


    +---+--------------+-------------------------------+
    | id|      features|minMaxScal_21493d63e2bf__output|
    +---+--------------+-------------------------------+
    |  0|[1.0,0.1,-1.0]|                  [5.0,5.0,5.0]|
    |  1| [2.0,1.1,1.0]|                  [7.5,5.5,7.5]|
    |  0|[1.0,0.1,-1.0]|                  [5.0,5.0,5.0]|
    |  1| [2.0,1.1,1.0]|                  [7.5,5.5,7.5]|
    |  1|[3.0,10.1,3.0]|               [10.0,10.0,10.0]|
    +---+--------------+-------------------------------+

Please find the code below:

```
// Load the dataset
val scaleDF = spark.read.parquet("/data/simple-ml-scaling")

// StandardScaler: no output column is set, so Spark uses the default
// name "<uid>__output" (hence "stdScal_06f7a85f98ef__output" above)
import org.apache.spark.ml.feature.StandardScaler
val ss = new StandardScaler().setInputCol("features")
ss.fit(scaleDF).transform(scaleDF).show(false)

// MinMaxScaler: rescale each feature to the target range [5, 10]
import org.apache.spark.ml.feature.MinMaxScaler
val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
val fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()
```

I know the formulas for standardization and min/max scaling, but I cannot understand the values in the third column. Please help me understand the math behind them.


scala apache-spark machine-learning apache-spark-mllib
1 Answer

MinMaxScaler in Spark works on each feature individually. From the documentation:

    Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min
    For the case E_max == E_min, Rescaled(e_i) = 0.5 * (max + min)

Here E_min and E_max are the minimum and maximum of that feature across the dataset, while min and max are the target range you set via setMin(5) and setMax(10).
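The StandardScaler output can be checked the same way. By default Spark's StandardScaler has withStd = true and withMean = false, so each feature is only divided by its sample standard deviation (n − 1 denominator), with no mean subtraction. A quick check against the first feature column, using the values from your question:

```scala
// First feature column: 1.0, 2.0, 1.0, 2.0, 3.0
val xs = Seq(1.0, 2.0, 1.0, 2.0, 3.0)
val mean = xs.sum / xs.size // 1.8
// Sample standard deviation (divide by n - 1, as Spark does)
val std = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1)) // ≈ 0.8367
xs.distinct.foreach(x => println(x / std))
// 1.1952..., 2.3904..., 3.5856... -> matches stdScal_06f7a85f98ef__output
```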