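I am running a DecisionTree model, and everything looks correct except when I read featureImportances to check which features matter most in this model. The result looks wrong: the feature scores do not sum to 1, and I know that the most important feature is smoking_status (a categorical feature).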
Output of list(zip(assembler.getInputCols(), decisionTreeModel.featureImportances)):
[('age', 0.1717140615500328), ('bmi', 0.001403579349166339), ('hypertension', 0.0), ('heart_disease', 0.0), ('avg_glucose_level', 0.007257486022061398), ('genderVector', 0.0), ('smokingVector', 0.0)]
What is wrong with the feature importance results?
Thanks!
Here is the full code:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, OneHotEncoder
gender_indexer = StringIndexer(inputCol='gender', outputCol='genderIndexer')
gender_encoder = OneHotEncoder(inputCol='genderIndexer', outputCol='genderVector')
smoking_indexer = StringIndexer(inputCol='smoking_status', outputCol='smokingIndexer')
smoking_encoder = OneHotEncoder(inputCol='smokingIndexer', outputCol='smokingVector')
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['age', 'bmi', 'hypertension', 'heart_disease', 'avg_glucose_level', 'genderVector', 'smokingVector' ], outputCol='features')
classifier = DecisionTreeClassifier(labelCol='stroke', featuresCol='features')
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[gender_indexer, gender_encoder, smoking_indexer, smoking_encoder, assembler, classifier])
train_data, test_data = strokes.randomSplit([0.7, 0.3])
predictStrokeModel = pipeline.fit(train_data)
result = predictStrokeModel.transform(test_data)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol='stroke', predictionCol='prediction', metricName='accuracy')
accuracy = evaluator.evaluate(result)
decisionTreeModel = predictStrokeModel.stages[5]
decisionTreeModel.depth
decisionTreeModel.toDebugString
list(zip(assembler.getInputCols(), decisionTreeModel.featureImportances))
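A likely explanation for the output above: featureImportances holds one value per slot of the assembled features vector, while assembler.getInputCols() holds one name per input column. Each one-hot encoded column expands into several slots, and zip stops at the shorter input, so the trailing importances are dropped and the last names can be paired with the wrong slots. The sketch below is an illustration, not part of the original post. It reads slot names from the ml_attr metadata that VectorAssembler normally attaches to its output column. The helper names and the sample metadata dict are hypothetical.

```python
def feature_names_from_metadata(features_metadata):
    """Return slot names ordered by vector index.

    Expects the metadata dict of an assembled vector column, e.g.
    result.schema["features"].metadata in the pipeline above.
    """
    attrs = features_metadata["ml_attr"]["attrs"]
    indexed = []
    # Attributes are grouped by type ("numeric", "binary", "nominal").
    for group in attrs.values():
        for attr in group:
            # Fall back to a placeholder name if a slot has no name.
            name = attr.get("name", "slot_%d" % attr["idx"])
            indexed.append((attr["idx"], name))
    return [name for _, name in sorted(indexed)]


def pair_importances(names, importances):
    """Pair each slot name with its importance, largest first."""
    values = [float(v) for v in importances]
    if len(names) != len(values):
        # Refuse to truncate silently, unlike a bare zip().
        raise ValueError(
            "got %d names but %d importances" % (len(names), len(values))
        )
    return sorted(zip(names, values), key=lambda p: p[1], reverse=True)


# Hypothetical metadata with made-up slot names, for illustration only.
sample_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}],
            "binary": [
                {"idx": 1, "name": "smokingVector_a"},
                {"idx": 2, "name": "smokingVector_b"},
            ],
        },
        "num_attrs": 3,
    }
}
names = feature_names_from_metadata(sample_metadata)
# Made-up importances, one per slot.
print(pair_importances(names, [0.25, 0.5, 0.25]))
```

With the fitted pipeline, the same helpers could be called as pair_importances(feature_names_from_metadata(result.schema["features"].metadata), decisionTreeModel.featureImportances.toArray()), assuming the metadata is present on the features column.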
This new part of the code solved my problem:
categorical_cols = ['gender', 'smoking_status']
indexers = [StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c)) for c in categorical_cols]
encoders = [OneHotEncoder(inputCol=indexer.getOutputCol(), outputCol="{0}_encoded".format(indexer.getOutputCol())) for indexer in indexers]
assembler = VectorAssembler(inputCols=['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi', 'gender_indexed', 'smoking_status_indexed'], outputCol='features')
classifier = DecisionTreeClassifier(labelCol='stroke', featuresCol='features')
pipeline = Pipeline(stages=indexers + encoders + [assembler, classifier])
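The fix above works because the assembler now receives the indexed columns, so each input column fills exactly one slot and the names line up with featureImportances. If the one-hot columns are kept instead, one option is to sum the per-slot importances back to their source column. The sketch below shows that idea under stated assumptions: the function name, the widths mapping (slots per input column), and the sample numbers are all hypothetical and made up for illustration.

```python
def sum_importances_by_column(input_cols, widths, importances):
    """Sum per-slot importances into one total per input column.

    input_cols: column names in the order given to the assembler.
    widths: dict mapping each column to the number of slots it fills
            (1 for a scalar column, the vector size for an encoded one).
    importances: one value per slot of the assembled vector.
    """
    values = [float(v) for v in importances]
    expected = sum(widths[col] for col in input_cols)
    if expected != len(values):
        raise ValueError(
            "widths cover %d slots but got %d importances"
            % (expected, len(values))
        )
    totals = {}
    start = 0
    for col in input_cols:
        end = start + widths[col]
        # Slots are laid out in the same order as the input columns.
        totals[col] = sum(values[start:end])
        start = end
    return totals


# Made-up example: one scalar column and two encoded columns.
cols = ["age", "genderVector", "smokingVector"]
slot_widths = {"age": 1, "genderVector": 2, "smokingVector": 3}
slot_importances = [0.25, 0.0, 0.125, 0.5, 0.0, 0.125]
print(sum_importances_by_column(cols, slot_widths, slot_importances))
```

The length check is deliberate: it turns the silent truncation seen in the question into an explicit error.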