I want to apply PySpark's MinMaxScaler to multiple columns of a PySpark DataFrame df. So far, I only know how to apply it to a single column, e.g. x.
import pandas as pd
from pyspark.ml.feature import MinMaxScaler

pdf = pd.DataFrame({'x': range(3), 'y': [1, 2, 5], 'z': [100, 200, 1000]})
df = spark.createDataFrame(pdf)
scaler = MinMaxScaler(inputCol="x", outputCol="x_scaled")
scalerModel = scaler.fit(df)
scaledData = scalerModel.transform(df)
What if I have 100 columns? Is there a way to min-max scale many columns at once in PySpark?
Update:
Also, how do I apply MinMaxScaler to integer or double columns? It raises the following error:
java.lang.IllegalArgumentException: requirement failed: Column length must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually int.
You can use a Pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

columns_to_scale = ["x", "y", "z"]

# MinMaxScaler only accepts a Vector input column (hence the error in your
# update), so first wrap each numeric column in a one-element vector, then
# scale it. The output column name must also differ from the input's.
assemblers = [VectorAssembler(inputCols=[col], outputCol=col + "_vec")
              for col in columns_to_scale]
scalers = [MinMaxScaler(inputCol=col + "_vec", outputCol=col + "_scaled")
           for col in columns_to_scale]
pipeline = Pipeline(stages=assemblers + scalers)
scalerModel = pipeline.fit(df)
scaledData = scalerModel.transform(df)
Check this example Pipeline in the official documentation.
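For intuition, MinMaxScaler with its default range rescales each column independently to [0, 1] via (x - min) / (max - min). A minimal plain-Python sketch of that arithmetic on the sample columns (no Spark needed, just to check expected values):

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = {'x': [0, 1, 2], 'y': [1, 2, 5], 'z': [100, 200, 1000]}
scaled = {col: min_max_scale(vals) for col, vals in data.items()}
# e.g. scaled['y'] == [0.0, 0.25, 1.0]
```

These are the same numbers you should see inside the `*_scaled` vector columns produced by the pipeline above.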