Spark random forest - Could not convert float to int error

Question

I have numeric features and a binary response. I'm trying to build ensemble decision trees such as random forest and gradient-boosted trees, but I get an error. I reproduced it with the iris data. The error is below, and the full traceback is at the bottom.

TypeError: Could not convert 12.631578947368421 to int

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
y = list(iris.target)
df = pd.read_csv("https://raw.githubusercontent.com/venky14/Machine-Learning-with-Iris-Dataset/master/Iris.csv")
df = df.drop(['Species'], axis = 1)
df['label'] = y
spark_df = spark.createDataFrame(df).drop('Id')
cols = spark_df.drop('label').columns
assembler = VectorAssembler(inputCols = cols, outputCol = 'features')
output_dat = assembler.transform(spark_df).select('label', 'features')

rf = RandomForestClassifier(labelCol = "label", featuresCol = "features")
paramGrid_rf = ParamGridBuilder() \
                     .addGrid(rf.maxDepth, np.linspace(5, 30, 6)) \
                     .addGrid(rf.numTrees, np.linspace(10, 60, 20)).build()

crossval_rf = CrossValidator(estimator = rf,
                         estimatorParamMaps = paramGrid_rf,
                         evaluator = BinaryClassificationEvaluator(),
                         numFolds = 5) 

cvModel_rf = crossval_rf.fit(output_dat)

TypeError                                 Traceback (most recent call last)
<ipython-input-24-44f8f759ed8e> in <module>
      2 paramGrid_rf = ParamGridBuilder() \
      3    .addGrid(rf.maxDepth, np.linspace(5, 30, 6)) \
----> 4    .addGrid(rf.numTrees, np.linspace(10, 60, 20)) \
      5    .build()
      6 

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in build(self)
    120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
--> 122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
    123 
    124 

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in <listcomp>(.0)
    120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
--> 122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
    123 
    124 

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in to_key_value_pairs(keys, values)
    118 
    119         def to_key_value_pairs(keys, values):
--> 120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
    122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/tuning.py in <listcomp>(.0)
    118 
    119         def to_key_value_pairs(keys, values):
--> 120             return [(key, key.typeConverter(value)) for key, value in zip(keys, values)]
    121 
    122         return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]

~/spark-2.4.0-bin-hadoop2.7/python/pyspark/ml/param/__init__.py in toInt(value)
    197             return int(value)
    198         else:
--> 199             raise TypeError("Could not convert %s to int" % value)
    200 
    201     @staticmethod

TypeError: Could not convert 12.631578947368421 to int

numpy machine-learning pyspark random-forest apache-spark-ml
1 Answer

2 votes

Both maxDepth and numTrees are required to be integers; NumPy's linspace produces floats:

import numpy as np
np.linspace(10, 60, 20)

Result:

array([ 10.        ,  12.63157895,  15.26315789,  17.89473684,
        20.52631579,  23.15789474,  25.78947368,  28.42105263,
        31.05263158,  33.68421053,  36.31578947,  38.94736842,
        41.57894737,  44.21052632,  46.84210526,  49.47368421,
        52.10526316,  54.73684211,  57.36842105,  60.        ])

So your code hits the first non-integer value (here 12.63157895) and throws the error.

Use arange instead:

np.arange(10, 60, 20)
# array([10, 30, 50])
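If you want to keep the same number of evenly spaced grid points as the original linspace calls, you can also cast each value to a Python int before handing the lists to addGrid. A minimal sketch (the grid bounds are taken from the question; variable names are illustrative):

```python
import numpy as np

# arange with an integer step stays integral throughout
num_trees_grid = [int(x) for x in np.arange(10, 60, 20)]       # [10, 30, 50]

# linspace keeps the evenly spaced points, but each must be cast to int;
# 5..30 in 6 steps happens to land exactly on integers here
max_depth_grid = [int(x) for x in np.linspace(5, 30, 6)]       # [5, 10, 15, 20, 25, 30]
```

These plain-int lists can then be passed to `.addGrid(rf.numTrees, num_trees_grid)` and `.addGrid(rf.maxDepth, max_depth_grid)` without tripping the int type converter. Note that casting linspace output truncates any non-integer values, so choose bounds and counts that divide evenly.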