I intend to use TensorForestEstimator on larger data sets that will be supplied through an input_fn operating on Pandas objects.
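To validate my understanding of the API, I put together a small example using a data set from the UC Irvine Machine Learning Repository. The data set has seven features (six int32 and one float32) and one label (int32).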
I can run fit() and evaluate() when the data set is supplied directly as numpy arrays through the x and y arguments.
When I try to do the same with data from an input_fn derived from tf.estimator.inputs.pandas_input_fn, and pass tf.contrib.layers feature columns to the feature_columns parameter, I get a TypeError raised in tensorflow/contrib/tensor_forest/python/ops/data_ops.py:
TypeError: '<' not supported between instances of '_RealValuedColumn' and 'str'
This happens because sorted() is called on a list of dictionary keys that mixes str and TensorFlow objects.
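The failure mode described above can be reproduced without TensorFlow. The following is a minimal sketch in plain Python 3. FakeColumn is a hypothetical stand-in for _RealValuedColumn, and the dictionaries are only assumed to resemble what data_ops.py ends up sorting; they are not copied from the library.

```python
class FakeColumn:
    """Hypothetical stand-in for a feature-column object (not TensorFlow's class)."""

    def __init__(self, name):
        self.name = name


def try_sort_keys(mapping):
    """Return the sorted keys, or the TypeError message if they cannot be ordered."""
    try:
        return sorted(mapping.keys())
    except TypeError as exc:
        return str(exc)


# Keys that are all strings sort without trouble.
string_keys = {'sex': 0, 'age': 1}

# Mixing str keys with column objects makes the '<' comparison fail in Python 3.
mixed_keys = {'sex': 0, FakeColumn('age'): 1}

print(try_sort_keys(string_keys))
print(try_sort_keys(mixed_keys))
```

Python 2 allowed ordering comparisons between unrelated types, which may be why code written this way could have worked there; that is a guess, not something confirmed for this library.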
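The code, exported from a Jupyter notebook, is given at the end of this post.
Any insight into why this happens would be greatly appreciated. I have searched the documentation, StackOverflow, and GitHub issues extensively and have not found the root cause.
Thanks in advance!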
TensorForestEstimator with pandas_input_fn
import csv
import numpy as np
import pandas as pd
import random
import tensorflow as tf
import tensorflow.contrib.layers as layers
import tensorflow.contrib.tensor_forest as tforest
from tensorflow.estimator.inputs import pandas_input_fn
from tensorflow.python.platform import tf_logging as logging
COLUMN_PROPS = {
    'sex' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'sex',
            dtype=tf.int32
        )
    },
    'age' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'age',
            dtype=tf.int32
        )
    },
    'Time' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.float32,
        'default' : -1.0,
        'feature_column' : layers.real_valued_column(
            'Time',
            dtype=tf.float32
        )
    },
    'Number_of_Warts' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Number_of_Warts',
            dtype=tf.int32
        ),
    },
    'Type' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Type',
            dtype=tf.int32
        )
    },
    'Area' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype' : tf.int32,
        'default' : -1,
        'feature_column' : layers.real_valued_column(
            'Area',
            dtype=tf.int32
        )
    },
    'induration_diameter' : {
        'is_feature' : True,
        'is_label' : False,
        'dtype': tf.int32,
        'default': -1,
        'feature_column' : layers.real_valued_column(
            'induration_diameter',
            dtype=tf.int32
        )
    },
    'Result_of_Treatment': {
        'is_feature' : False,
        'is_label' : True,
        'dtype': tf.int32,
        'default': -1,
        'feature_column' : None
    }
}
CSV_COLUMNS = [
    'sex',
    'age',
    'Time',
    'Number_of_Warts',
    'Type',
    'Area',
    'induration_diameter',
    'Result_of_Treatment'
]
FEATURE_COLUMNS = []
LABEL_COLUMN = None
for k in CSV_COLUMNS:
    if COLUMN_PROPS[k]['is_feature']:
        FEATURE_COLUMNS.append(k)
    elif COLUMN_PROPS[k]['is_label']:
        LABEL_COLUMN = k
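As an aside, the same feature/label partition can be written with list comprehensions plus a sanity check that exactly one label column exists. This is only a sketch: column_props and csv_columns are trimmed-down, hypothetical stand-ins that carry no TensorFlow objects, and split_columns is a helper name introduced here for illustration.

```python
# Hypothetical, trimmed-down column table (no TensorFlow objects needed here).
column_props = {
    'sex': {'is_feature': True, 'is_label': False},
    'Time': {'is_feature': True, 'is_label': False},
    'Result_of_Treatment': {'is_feature': False, 'is_label': True},
}
csv_columns = ['sex', 'Time', 'Result_of_Treatment']


def split_columns(columns, props):
    """Return (feature names, label name), preserving the CSV column order."""
    features = [c for c in columns if props[c]['is_feature']]
    labels = [c for c in columns if props[c]['is_label']]
    if len(labels) != 1:
        raise ValueError('expected exactly one label column, got %d' % len(labels))
    return features, labels[0]


feature_columns, label_column = split_columns(csv_columns, column_props)
print(feature_columns, label_column)
```

Unlike the loop above, this version treats the two flags independently, so a column marked as both feature and label would show up in both results.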
This function exports the training, evaluation, and test data sets to CSV, shuffling the rows.
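def generate_sets(datasets):
    # Relies on the module-level `header` read from the source CSV below.
    for k, v in datasets.items():
        random.shuffle(v)
        # newline='' is what the csv module recommends for files it writes.
        with open(k + '.csv', 'w', newline='') as fobj:
            wrtr = csv.writer(fobj)
            wrtr.writerow(header)
            for rec in v:
                wrtr.writerow(rec)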
trn = []
evl = []
tst = []
with open('Immunotherapy - ImmunoDataset.csv', 'r') as fobj:
    rdr = csv.reader(fobj)
    header = next(rdr)
    label_key = header[-1]
    feature_keys = header[:-1]
    for rec in rdr:
        # Output of random number generator determines
        # which set the record will be placed.
        rn = random.random()
        if rn < 0.6:
            trn.append(rec)
        elif rn < 0.8:
            evl.append(rec)
        else:
            tst.append(rec)
datasets = {
    'train' : trn,
    'eval' : evl,
    'test' : tst
}
generate_sets(datasets)
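Because the split above draws from the global random generator, each run produces different train/eval/test files, which makes a bug harder to reproduce. A minimal sketch of the same roughly 60/20/20 split with a fixed seed follows. split_records and its parameters are hypothetical names introduced here, and the records are made-up placeholders rather than rows from the data set.

```python
import random


def split_records(records, seed=0, train_frac=0.6, eval_frac=0.2):
    """Assign each record to train/eval/test using one random draw per record."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    sets = {'train': [], 'eval': [], 'test': []}
    for rec in records:
        rn = rng.random()
        if rn < train_frac:
            sets['train'].append(rec)
        elif rn < train_frac + eval_frac:
            sets['eval'].append(rec)
        else:
            sets['test'].append(rec)
    return sets


# Hypothetical records standing in for the CSV rows.
records = [[i, i * 2] for i in range(100)]
first = split_records(records, seed=42)
second = split_records(records, seed=42)
print({name: len(rows) for name, rows in first.items()})
```

Calling it twice with the same seed yields identical sets, so a failing run can be repeated on exactly the same training data.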
TensorForest Hyperparameters
fhp = tforest.tensor_forest.ForestHParams(
    num_classes=2,
    num_features=7,
    regression=False
)
fcs = [COLUMN_PROPS[k]['feature_column'] for k in FEATURE_COLUMNS]
TensorForestEstimator object
tfe = tforest.random_forest.TensorForestEstimator(
    fhp,
    feature_columns=fcs,
    report_feature_importances=True
)
Define wrapper for pandas_input_fn
def get_input_fn(csv_file):
    df = pd.read_csv(csv_file)
    features = df.loc[:,'sex':'induration_diameter']
    # Workaround for this issue:
    #
    # https://stackoverflow.com/questions/48577372/tensorflowusing-pandas-input-fn-with-tensorforestestimator
    # https://github.com/tensorflow/tensorflow/issues/16692
    #
    # The labels are reshaped into a single-column, 2-D DataFrame.
    labels = pd.DataFrame(
        np.expand_dims(
            df.loc[:,'Result_of_Treatment'].values, axis=1
        )
    )
    return pandas_input_fn(x=features, y=labels, shuffle=False)
tfe.fit(
    input_fn=get_input_fn('train.csv')
)
After further testing, I believe this is a bug in TensorForestEstimator. More details can be found in the GitHub issue at this URL: