I'm running a model using GLM (ML in Spark 2.0) on data that has one categorical independent variable. I'm using StringIndexer and OneHotEncoder to convert that column to dummy variables, and then using VectorAssembler to combine it with a continuous independent variable into a column of sparse vectors.

If my column names are continuous and categorical, where the first is a column of floats and the second is a column of strings denoting (in this case, 8) different categories:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

string_indexer = StringIndexer(inputCol='categorical',
                               outputCol='categorical_index')
encoder = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector')
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')

# Pipeline expects a list of stages
pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)
Everything runs fine up to this point, and I fit the model:
from pyspark.ml.regression import GeneralizedLinearRegression

glm = GeneralizedLinearRegression(family='gaussian',
                                  link='identity',
                                  labelCol='dep_var',
                                  featuresCol='indep_vars')
model = glm.fit(df)
model.coefficients
which outputs:
DenseVector([8440.0573, 3729.449, 4388.9042, 2871.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])
Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients back to the original column names, which I need to do (I've simplified this model for SO; there's more going on).
The relationship between the column names and the coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way:
df[['categorical', 'categorical_index']].distinct()
which gives me a small DataFrame relating the string labels to the numeric indices, which I think I can then relate back to the keys in the sparse vector? This is very clunky and slow, though, when you consider the scale of the data.

Is there a better way to do this?
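(For concreteness, that bookkeeping amounts to something like the following pure-Python sketch; `pairs` is a mocked stand-in for the collected distinct rows, and since indep_vars was assembled as ['continuous', 'categorical_vector'], the one-hot block starts one position after the single continuous column:)

```python
# Mocked distinct (categorical, categorical_index) rows, as collected
# from the DataFrame above (labels here are hypothetical).
pairs = [("red", 0.0), ("green", 1.0), ("blue", 2.0)]

# One continuous column precedes the dummy block in 'indep_vars',
# so the label with index i sits at feature position 1 + i.
offset = 1
label_to_position = {label: offset + int(idx) for label, idx in pairs}
```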
I ran into the exact same problem, and I have a solution for you :)

It's based on the Scala version here: How to map variable names to features after pipeline
# transform data
best_model = pipeline.fit(df)
best_pred = best_model.transform(df)

# extract features metadata
meta = [f.metadata
        for f in best_pred.schema.fields
        if f.name == 'features'][0]

# access feature name and index
features_name_ind = meta['ml_attr']['attrs']['numeric'] + \
                    meta['ml_attr']['attrs']['binary']

print(features_name_ind[:2])
# [{'name': 'feature_name_1', 'idx': 0}, {'name': 'feature_name_2', 'idx': 1}]
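Once features_name_ind is in hand, pairing each name with its coefficient is just a lookup by 'idx'. A minimal sketch of that last step, mocking the metadata list and the coefficient values rather than assuming a live Spark session (in practice you'd use list(model.coefficients)):

```python
# Mocked metadata entries, in the shape produced above (order of the
# list is not guaranteed, hence the sort by 'idx').
features_name_ind = [{'name': 'feature_name_2', 'idx': 1},
                     {'name': 'feature_name_1', 'idx': 0}]

# Mocked coefficient values; position in this list matches 'idx'.
coefficients = [8440.0573, 3729.449]

# Pair each feature name with the coefficient at its vector position.
name_coef = {f['name']: coefficients[f['idx']]
             for f in sorted(features_name_ind, key=lambda f: f['idx'])}
```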
Sorry, this may be a very late answer, and maybe you've already figured it out, but here it is anyway. I recently did the same implementation with StringIndexer, OneHotEncoder, and VectorAssembler, and as far as I know, the following code will produce what you're looking for.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = ["one_categorical_variable"]
stages = []  # stages in the pipeline

for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(),
                            outputCol=categoricalCol + "classVec")
    # Add the stages so that they will all be run at once later
    stages += [stringIndexer, encoder]

# Convert the label into label indices using StringIndexer
label_stringIdx = StringIndexer(inputCol="Service_Level", outputCol="label")
stages += [label_stringIdx]

# Transform all features into a vector using VectorAssembler
numericCols = ["continuous_variable"]
# (A list comprehension works in both Python 2 and 3; in Python 3,
# map() returns an iterator that can't be concatenated with a list.)
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Create a pipeline for training
pipeline = Pipeline(stages=stages)

# Run the feature transformations
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
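One caveat when reading the resulting dummy columns back: StringIndexer assigns index 0 to the most frequent label, and OneHotEncoder drops the last category by default (dropLast=True), so one level is absorbed into the intercept and gets no coefficient. A hedged pure-Python sketch of that layout (tie-breaking is alphabetical here for determinism; Spark's own tie order is not guaranteed):

```python
from collections import Counter

def stringindexer_order(values):
    """Mimic StringIndexer's default ordering: labels sorted by
    descending frequency (most frequent label gets index 0)."""
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))

labels = ["b", "a", "a", "c", "a", "b"]
order = stringindexer_order(labels)

# With dropLast=True (the default), OneHotEncoder emits one binary
# slot per label except the last, so the dummy columns cover order[:-1].
dummy_slots = order[:-1]
```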