Relating column names to model parameters in pySpark ML

5 votes · 2 answers

I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a continuous independent variable into a column of sparse vectors.

If my column names are continuous and categorical, where the first is a column of floats and the second is a column of strings denoting (in this case, 8) different categories:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

string_indexer = StringIndexer(inputCol='categorical',
                               outputCol='categorical_index')

encoder = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector')

assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')

# stages must be passed as a list
pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)

Everything works fine to this point, and I run the model:

from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family='gaussian',
                                  link='identity',
                                  labelCol='dep_var',
                                  featuresCol='indep_vars')
model = glm.fit(df)
model.coefficients

Which outputs:

DenseVector([8440.0573, 3729.449, 4388.9042, 2871.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])

Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients back to the original column names, which I need to do (I've simplified this model for SO; there's more involved).

The relationship between column names and coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way:

df[['categorical', 'categorical_index']].distinct()

This gives me a small dataframe relating the string names to the numeric indices, which I think I could then relate back to the keys in the sparse vector? This is very clunky and slow, though, when you consider the scale of the data.
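
For what it's worth, a minimal sketch of that clunky approach might look like the following; the offset of 1 and the dropped level are assumptions that follow from the VectorAssembler input order above ('continuous' first) and OneHotEncoder's default dropLast=True:

# Sketch of the slow approach: collect the string -> index mapping and
# line it up with the coefficient vector by position. Assumes the layout
# produced above: 'continuous' at index 0, then one dummy slot per
# category (the last indexed level is dropped by default).
mapping = df[['categorical', 'categorical_index']].distinct().collect()

coefs = model.coefficients
print('continuous', coefs[0])
for row in sorted(mapping, key=lambda r: r['categorical_index']):
    idx = int(row['categorical_index'])
    if idx < len(coefs) - 1:   # the dropped level has no coefficient
        print(row['categorical'], coefs[1 + idx])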

Is there a better way to do this?

python pyspark apache-spark-ml
2 Answers
2 votes

I ran into this exact problem too, and I've got your solution :)

This is based on the Scala version: How to map variable names to features after pipeline

# transform data
best_model = pipeline.fit(df)
best_pred = best_model.transform(df)

# extract the metadata Spark attaches to the assembled 'features' column
meta = [f.metadata
        for f in best_pred.schema.fields
        if f.name == 'features'][0]

# feature names and indices: numeric inputs plus the binary (one-hot) slots
features_name_ind = meta['ml_attr']['attrs']['numeric'] + \
    meta['ml_attr']['attrs']['binary']

print(features_name_ind[:2])
# [{'name': 'feature_name_1', 'idx': 0}, {'name': 'feature_name_2', 'idx': 1}]
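
To tie this back to the question: the (name, idx) pairs can be matched against the fitted model's coefficient vector by position. A minimal sketch, assuming a GLM fit with featuresCol='features' as in the question:

# Hypothetical follow-up: map each feature name to its coefficient by index.
coefs = model.coefficients
named_coefs = {attr['name']: float(coefs[attr['idx']])
               for attr in features_name_ind}
print(named_coefs)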

0 votes

Sorry, this may seem like a very late answer and maybe you've already figured it out, but here goes anyway. I recently did the same implementation of StringIndexer, OneHotEncoder and VectorAssembler, and as far as I understand, the following code will give you what you are looking for.

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = ["one_categorical_variable"]
stages = []  # stages in the pipeline

for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")

    # Use OneHotEncoder to convert categorical variables into binary
    # SparseVectors
    encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(),
                            outputCol=categoricalCol + "classVec")

    # Add the stages so that they will all be run at once later
    stages += [stringIndexer, encoder]

# Convert the label into label indices using StringIndexer
label_stringIdx = StringIndexer(inputCol="Service_Level", outputCol="label")
stages += [label_stringIdx]

# Transform all features into a vector using VectorAssembler
numericCols = ["continuous_variable"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Create a Pipeline for training
pipeline = Pipeline(stages=stages)

# Run the feature transformations
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
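
As in the first answer, the feature names and their vector positions can then be recovered from the metadata on the assembled 'features' column; a short sketch:

# Read the (name, idx) attributes stored on the 'features' column
meta = df.schema['features'].metadata
attrs = meta['ml_attr']['attrs']
feature_names = sorted(attrs.get('numeric', []) + attrs.get('binary', []),
                       key=lambda a: a['idx'])
print([a['name'] for a in feature_names])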