MinMaxScaler in Spark throws java.lang.IllegalArgumentException

Question

I want to apply PySpark's MinMaxScaler to multiple columns of a PySpark DataFrame df. So far I only know how to apply it to a single column, e.g. x:

import pandas as pd
from pyspark.ml.feature import MinMaxScaler

pdf = pd.DataFrame({'x':range(3), 'y':[1,2,5], 'z':[100,200,1000]})
df = spark.createDataFrame(pdf)

scaler = MinMaxScaler(inputCol="x", outputCol="x")
scalerModel = scaler.fit(df)
scaledData = scalerModel.transform(df)

What if I have 100 columns? Is there a way to do min-max scaling on many columns in PySpark?

Update:

Also, how do I apply MinMaxScaler to integer or double values? It throws the following error:

java.lang.IllegalArgumentException: requirement failed: Column length must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually int.
Tags: python, pyspark, apache-spark-sql
1 Answer

You can use a pipeline. Note that MinMaxScaler operates on a Vector column, not on raw numeric columns; that is exactly why your call raises IllegalArgumentException on an int column. Wrap each column in a one-element vector with VectorAssembler first, then scale:

from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

columns_to_scale = ["x", "y", "z"]
# MinMaxScaler only accepts Vector input, so wrap each numeric column
# in a one-element vector with VectorAssembler before scaling.
assemblers = [VectorAssembler(inputCols=[col], outputCol=col + "_vec") for col in columns_to_scale]
scalers = [MinMaxScaler(inputCol=col + "_vec", outputCol=col + "_scaled") for col in columns_to_scale]
pipeline = Pipeline(stages=assemblers + scalers)
scalerModel = pipeline.fit(df)
scaledData = scalerModel.transform(df)
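For reference, on the toy df above each *_scaled column comes out as a one-element vector rescaled to [0, 1]; the expected values below are worked out by hand from the sample data:

scaledData.select("x_scaled", "y_scaled", "z_scaled").show()
# x_scaled: [0.0], [0.5], [1.0]      (from 0, 1, 2)
# y_scaled: [0.0], [0.25], [1.0]     (from 1, 2, 5)
# z_scaled: [0.0], [0.111...], [1.0] (from 100, 200, 1000)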

See this example pipeline in the official documentation.
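If you need plain double columns downstream rather than one-element vectors, Spark 3.0+ ships pyspark.ml.functions.vector_to_array; a minimal sketch, assuming the *_vec/*_scaled column names from the pipeline above:

from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array  # requires Spark >= 3.0

# Unpack each one-element scaled vector back into a plain double
# column, then drop the intermediate vector columns.
for c in columns_to_scale:
    scaledData = scaledData.withColumn(c, vector_to_array(F.col(c + "_scaled"))[0])
scaledData = scaledData.drop(*[c + "_vec" for c in columns_to_scale],
                             *[c + "_scaled" for c in columns_to_scale])
scaledData.show()

On older Spark versions, a small UDF that extracts the first element of each DenseVector achieves the same result.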
