I currently have a data.tar.gz archive that I extract using Python's shutil library. From the extracted files I then want to pick the CSVs whose names match some_regex for a given num, drop the NaN values, and build a tf.data.Dataset for train/eval.
Currently I have a working pipeline based on Pandas + tf.data, shown below, but I find the idea of doing all the preprocessing with tf.Transform and pushing the operations into the graph very appealing, with the added benefit of tfdv and tfma integration. So I am looking for a way to do this using only TFX.
import os

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import preprocessing as pp

features = ['fh', 'gh', 'th', 'fl']
label = 'force_cmp'

def _make_training_input_fn(data_dir, num, features, label, batch_size, epochs):
    def input_fn():
        path = os.path.join(data_dir, '**', '*some_regex{}'.format(str(num)))
        filtered_paths = tf.io.gfile.glob(path)
        original_df = pd.concat(map(lambda path: pd.read_csv(path, sep=';'), filtered_paths))
        print("Original dataset has {} rows.".format(len(original_df)))

        df = original_df.copy()
        df = df.dropna()
        df = df[(df['force_np'] > .02) & (df['force_np'] < .96)]
        print("Transformed dataset has {} rows.".format(len(df)))

        df = pd.DataFrame(pp.MinMaxScaler().fit_transform(df.values), columns=list(df.columns))
        print("Normalizing the inputs.")

        data_split = .8
        msk = np.random.rand(len(df)) < data_split
        traindf = df[msk]

        train_features = traindf[features]
        train_labels = traindf[[label]]

        trainds = tf.data.Dataset.from_tensor_slices((train_features.values, train_labels.values))
        trainds = trainds.shuffle(len(train_features))
        trainds = trainds.batch(batch_size)
        trainds = trainds.repeat(epochs)
        trainds = trainds.prefetch(tf.data.experimental.AUTOTUNE)
        return trainds
    return input_fn
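To make the TF-only direction concrete, here is a sketch of how the per-row steps above (read, dropna, band filter, batching) could be expressed with tf.data alone, reading the CSVs via tf.data.experimental.make_csv_dataset. The helper name make_tf_only_dataset is my own, and the min-max scaling is deliberately left out: it needs full-dataset statistics, which is exactly the analyze step that tf.Transform (e.g. tft.scale_to_0_1) would provide.

```python
import tensorflow as tf

FEATURES = ['fh', 'gh', 'th', 'fl']
LABEL = 'force_cmp'

def make_tf_only_dataset(file_pattern, batch_size, epochs):
    # Read the semicolon-delimited CSVs row by row (batch_size=1 so that
    # individual rows can still be filtered afterwards).
    ds = tf.data.experimental.make_csv_dataset(
        file_pattern,
        batch_size=1,
        select_columns=FEATURES + ['force_np', LABEL],
        field_delim=';',
        num_epochs=1,
        shuffle=False)
    ds = ds.unbatch()

    # Equivalent of df.dropna(): drop any row containing a NaN.
    def no_nan(row):
        vals = tf.stack([tf.cast(v, tf.float32) for v in row.values()])
        return tf.reduce_all(tf.logical_not(tf.math.is_nan(vals)))
    ds = ds.filter(no_nan)

    # Equivalent of the force_np band filter on the DataFrame.
    ds = ds.filter(lambda row: (row['force_np'] > .02) & (row['force_np'] < .96))

    # Split the column dict into (features, label) tensors.
    def to_tensors(row):
        feats = tf.stack([row[k] for k in FEATURES])
        return feats, row[LABEL]
    ds = ds.map(to_tensors)

    ds = ds.shuffle(1000).batch(batch_size).repeat(epochs)
    return ds.prefetch(tf.data.experimental.AUTOTUNE)
```

The train/eval split is also omitted here; with files on disk it is usually simpler to split at the file-pattern level than to mask rows the way the NumPy version does.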
I am looking for a solution that relies only on TF functions, with no pd.DataFrame and no NumPy arrays involved. First of all, is that even sensible, or should I keep using NumPy and Pandas here? I have looked into tf.data and TFX + Beam, but I can't seem to find the right approach. Am I missing something, or am I misunderstanding the spirit of these libraries?