Spark - Decision tree model with wrong feature importances


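I am running a DecisionTree model and everything looks correct, except when I check the feature importances to see which features matter most in this model. The result looks wrong: the feature scores do not sum to 1, and I know the most important feature is smoking_status (a categorical feature).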

The output of

list(zip(assembler.getInputCols(), decisionTreeModel.featureImportances))

is:

[('age', 0.1717140615500328), ('bmi', 0.001403579349166339), ('hypertension', 0.0), ('heart_disease', 0.0), ('avg_glucose_level', 0.007257486022061398), ('genderVector', 0.0), ('smokingVector', 0.0)]

What is wrong with the feature importance results?

Thanks!

Here is the full code:

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, OneHotEncoder

gender_indexer = StringIndexer(inputCol='gender', outputCol='genderIndexer')
gender_encoder = OneHotEncoder(inputCol='genderIndexer', outputCol='genderVector')

smoking_indexer = StringIndexer(inputCol='smoking_status', outputCol='smokingIndexer')
smoking_encoder = OneHotEncoder(inputCol='smokingIndexer', outputCol='smokingVector')

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['age', 'bmi', 'hypertension', 'heart_disease', 'avg_glucose_level', 'genderVector', 'smokingVector' ], outputCol='features')

classifier = DecisionTreeClassifier(labelCol='stroke', featuresCol='features')

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[gender_indexer, gender_encoder, smoking_indexer, smoking_encoder, assembler, classifier])

train_data, test_data = strokes.randomSplit([0.7, 0.3])
predictStrokeModel = pipeline.fit(train_data)

result = predictStrokeModel.transform(test_data)

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol='stroke', predictionCol='prediction', metricName='accuracy')
accuracy = evaluator.evaluate(result)

decisionTreeModel = predictStrokeModel.stages[5]
decisionTreeModel.depth

decisionTreeModel.toDebugString

list(zip(assembler.getInputCols(), decisionTreeModel.featureImportances))
python apache-spark machine-learning classification decision-tree
1 Answer

This new version of the code solved my problem:

Define the categorical columns

categorical_cols = ['gender', 'smoking_status']

Define the pipeline stages

indexers = [StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c)) for c in categorical_cols]
encoders = [OneHotEncoder(inputCol=indexer.getOutputCol(), outputCol="{0}_encoded".format(indexer.getOutputCol())) for indexer in indexers]
assembler = VectorAssembler(inputCols=['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi', 'gender_indexed', 'smoking_status_indexed'], outputCol='features')

classifier = DecisionTreeClassifier(labelCol='stroke', featuresCol='features')

Define the pipeline

pipeline = Pipeline(stages=indexers + encoders + [assembler, classifier])
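
A likely reason this helps, for anyone who hits the same output: OneHotEncoder turns each categorical column into a vector with several slots, and VectorAssembler flattens those vectors, so featureImportances has one entry per slot rather than one per input column. zip stops at the shorter sequence, so pairing the 7 input column names with a longer importance vector silently drops the trailing slots, which is presumably where the missing mass (and the smokingVector slots) went. With the indexed columns, each categorical feature occupies a single slot and the names line up again. The sketch below shows the effect in plain Python. The slot names, the number of categories, and the importance values are all made up for illustration, and the `attrs` dict only imitates the layout I would expect under `result.schema["features"].metadata["ml_attr"]["attrs"]`; treat that location as an assumption to check on your Spark version.

```python
def slot_importances(attrs, importances):
    """Pair every slot of the assembled vector with its importance."""
    slots = [(a["idx"], a["name"]) for group in attrs.values() for a in group]
    return [(name, importances[idx]) for idx, name in sorted(slots)]


def column_importances(input_cols, named_slots):
    """Sum slot importances back onto the input column each slot came from."""
    totals = {}
    for col in input_cols:
        totals[col] = sum(
            value for name, value in named_slots
            if name == col or name.startswith(col + "_")
        )
    return totals


input_cols = ["age", "bmi", "hypertension", "heart_disease",
              "avg_glucose_level", "genderVector", "smokingVector"]

# Hypothetical slot layout: 5 numeric columns, then 2 gender slots and
# 3 smoking slots produced by one-hot encoding (made-up category names).
attrs = {
    "numeric": [
        {"idx": 0, "name": "age"},
        {"idx": 1, "name": "bmi"},
        {"idx": 2, "name": "hypertension"},
        {"idx": 3, "name": "heart_disease"},
        {"idx": 4, "name": "avg_glucose_level"},
    ],
    "binary": [
        {"idx": 5, "name": "genderVector_a"},
        {"idx": 6, "name": "genderVector_b"},
        {"idx": 7, "name": "smokingVector_a"},
        {"idx": 8, "name": "smokingVector_b"},
        {"idx": 9, "name": "smokingVector_c"},
    ],
}

# Made-up importances, one per slot (10 slots), summing to 1.
importances = [0.25, 0.0, 0.0, 0.0, 0.125, 0.0, 0.0, 0.5, 0.125, 0.0]

# What the question did: 7 names zipped against 10 values, so 3 are dropped.
truncated = list(zip(input_cols, importances))

# Slot-level names, then importances summed per original input column.
named = slot_importances(attrs, importances)
per_column = column_importances(input_cols, named)

print(truncated)
print(named)
print(per_column)
```

In the real pipeline, the importance list would come from something like `decisionTreeModel.featureImportances.toArray()`, and comparing its length with `len(assembler.getInputCols())` is a quick way to confirm whether this mismatch is what you are seeing.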
