TensorForestEstimator with feature_columns throws TypeError


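I intend to use TensorForestEstimator on larger datasets, which will be supplied through an input_fn that operates on Pandas objects.

To verify my understanding of the API, I put together a small example using a dataset from the UC Irvine Machine Learning Repository. The dataset has seven features (six int32 and one float32) and one label (int32).

I can run fit() and evaluate() when the dataset is supplied directly as numpy arrays through the x and y parameters.

When I try to do the same with data from an input_fn produced by tf.estimator.inputs.pandas_input_fn, and pass tf.contrib.layers feature columns to the feature_columns parameter, I get a TypeError raised from tensorflow/contrib/tensor_forest/python/ops/data_ops.py: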

TypeError: '<' not supported between instances of '_RealValuedColumn' and 'str'

This happens because sorted() is called on a list of dictionary keys that mixes str objects with TensorFlow feature column objects.
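The failure mode described above can be reproduced without TensorFlow. The following is a minimal sketch, assuming a hypothetical FakeColumn namedtuple as a stand-in for the real feature column object. It is not the code in data_ops.py, only an illustration of why Python 3 refuses to order a str against an unrelated object type.

```python
from collections import namedtuple

# Hypothetical stand-in for a tf.contrib.layers feature column object.
FakeColumn = namedtuple("FakeColumn", ["name"])

# A key list that mixes a plain string with a column object.
mixed_keys = ["age", FakeColumn("sex")]

try:
    sorted(mixed_keys)
    error_message = None
except TypeError as exc:
    # Python 3 has no default ordering between unrelated types.
    error_message = str(exc)

print(error_message)


def key_name(key):
    """Map every key to a string so that all keys are mutually comparable."""
    return key.name if isinstance(key, FakeColumn) else key


# Sorting succeeds once a key function yields comparable values.
sorted_names = [key_name(k) for k in sorted(mixed_keys, key=key_name)]
print(sorted_names)
```

The key function here is only for illustration. Whether the library should normalize its dictionary keys in a similar way is a question for the TensorFlow maintainers.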

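The code, exported from a Jupyter notebook, is given at the end of this post.

Any insight into why this happens would be greatly appreciated. I have searched the documentation, StackOverflow, and the GitHub issue tracker fairly extensively and have not found the root cause.

Thanks in advance!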

Sample Code for TensorForestEstimator with pandas_input_fn

Python standard library, NumPy, and pandas imports

import csv
import numpy as np
import pandas as pd
import random

TensorFlow library imports

import tensorflow as tf
import tensorflow.contrib.layers as layers
import tensorflow.contrib.tensor_forest as tforest

Aliased TensorFlow library imports

from tensorflow.estimator.inputs import pandas_input_fn
from tensorflow.python.platform import tf_logging as logging

Metadata for the CSV columns

COLUMN_PROPS = {
    'sex' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'sex',
            dtype=tf.int32
        )
    },
    'age' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'age',
            dtype=tf.int32
        )  
    },
    'Time' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.float32,
        'default' : -1.0,
        'feature_column' : layers.real_valued_column(
            'Time',
            dtype=tf.float32
        )
    },
    'Number_of_Warts' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Number_of_Warts',
            dtype=tf.int32
        ),
    },
    'Type' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Type',
            dtype=tf.int32
        )
    },
    'Area' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Area',
            dtype=tf.int32
        )
    },
    'induration_diameter' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype': tf.int32,
        'default': -1,
        'feature_column' : layers.real_valued_column(
            'induration_diameter',
            dtype=tf.int32
        )
    },
    'Result_of_Treatment': {
        'is_feature' : False,
        'is_label' : True,
        'dtype': tf.int32,
        'default': -1,
        'feature_column' : None
    }
}

Ordering of the CSV columns

CSV_COLUMNS = [
    'sex',
    'age',
    'Time',
    'Number_of_Warts',
    'Type',
    'Area',
    'induration_diameter',
    'Result_of_Treatment'
]

Generate the feature list and label from the metadata

FEATURE_COLUMNS = []
LABEL_COLUMN = None

for k in CSV_COLUMNS:
    if COLUMN_PROPS[k]['is_feature']:
        FEATURE_COLUMNS.append(k)
    elif COLUMN_PROPS[k]['is_label']:
        LABEL_COLUMN = k

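Helper function for shuffling and exporting subsets

This function exports the training, evaluation, and test datasets to CSV files, shuffling the rows of each. It relies on the global header variable, which is set when the source CSV is read in the next step.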

def generate_sets(datasets):
    for k, v in datasets.items():
        random.shuffle(v)
        with open(k + '.csv', 'w') as fobj:
            wrtr = csv.writer(fobj)
            wrtr.writerow(header)
            for rec in v:
                wrtr.writerow(rec)

Split the dataset for training, evaluation, and testing

trn = []
evl = []
tst = []

with open('Immunotherapy - ImmunoDataset.csv', 'r') as fobj:
    rdr = csv.reader(fobj)
    header = next(rdr)
    label_key = header[-1]
    feature_keys = header[:-1]

    for rec in rdr:
        # Output of random number generator determines
        # which set the record will be placed.
        rn =  random.random()
        if rn < 0.6:
            trn.append(rec)
        elif rn < 0.8:
            evl.append(rec)
        else:
            tst.append(rec)

datasets = {
    'train' : trn,
    'eval' : evl,
    'test' : tst
}

generate_sets(datasets)
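Because the split above draws from the global random generator, each run writes different train.csv, eval.csv, and test.csv files, which can make an error harder to reproduce. A minimal sketch of a seeded variant follows. The function name split_records, its parameters, and the placeholder rows are hypothetical and not part of the original notebook.

```python
import random


def split_records(records, seed=0, train_frac=0.6, eval_frac=0.2):
    """Assign each record to train/eval/test using a seeded generator."""
    rng = random.Random(seed)  # private generator; global state is untouched
    splits = {"train": [], "eval": [], "test": []}
    for rec in records:
        rn = rng.random()
        if rn < train_frac:
            splits["train"].append(rec)
        elif rn < train_frac + eval_frac:
            splits["eval"].append(rec)
        else:
            splits["test"].append(rec)
    return splits


# Placeholder rows standing in for the records read from the CSV file.
records = [[str(i), str(i % 2)] for i in range(90)]

first = split_records(records, seed=42)
second = split_records(records, seed=42)

# The same seed yields the same assignment on every run.
print(first == second)
print({name: len(rows) for name, rows in first.items()})
```

The returned dictionary has the same shape as the datasets dictionary above, so it could be passed to generate_sets() unchanged.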

Set the TensorForest hyperparameters

fhp = tforest.tensor_forest.ForestHParams(
    num_classes=2,
    num_features=7,
    regression=False
)

Extract the feature columns from the metadata dictionary

fcs = [COLUMN_PROPS[k]['feature_column'] for k in FEATURE_COLUMNS]

Instantiate the TensorForestEstimator object

tfe = tforest.random_forest.TensorForestEstimator(
    fhp,
    feature_columns=fcs,
    report_feature_importances=True
)

Define a wrapper for pandas_input_fn

def get_input_fn(csv_file):

    df = pd.read_csv(csv_file)

    features = df.loc[:,'sex':'induration_diameter']

    # Workaround for this issue:
    #
    # https://stackoverflow.com/questions/48577372/tensorflowusing-pandas-input-fn-with-tensorforestestimator
    # https://github.com/tensorflow/tensorflow/issues/16692

    labels = pd.DataFrame(
        np.expand_dims(
            df.loc[:,'Result_of_Treatment'].values, axis=1
        )
    )

    return pandas_input_fn(x=features, y=labels, shuffle=False)

Train on the data

tfe.fit(
    input_fn=get_input_fn('train.csv')
)
python csv numpy classification tflearn
1 Answer

After further testing, I believe this is a bug in TensorForestEstimator. More details can be found in the GitHub issue at this URL:

https://github.com/tensorflow/tensorflow/issues/26082
