How to merge a numerical model and an embedding sequential model to handle categories in an RNN


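I want to build a single-layer LSTM model with embeddings for my categorical features. I currently have numerical features and a few categorical features, such as location, which cannot be one-hot encoded (e.g. with

pd.get_dummies()
) because of the computational cost, even though that is what I originally intended to do.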

Let's imagine an example:

Sample data

import pandas as pd

data = {
    'user_id': [1,1,1,1,2,2,3],
    'time_on_page': [10,20,30,20,15,10,40],
    'location': ['London','New York', 'London', 'New York', 'Hong Kong', 'Tokyo', 'Madrid'],
    'page_id': [5,4,2,1,6,8,2]
}
d = pd.DataFrame(data=data)
print(d)
   user_id  time_on_page   location  page_id
0        1            10     London        5
1        1            20   New York        4
2        1            30     London        2
3        1            20   New York        1
4        2            15  Hong Kong        6
5        2            10      Tokyo        8
6        3            40     Madrid        2

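Consider people visiting a website. I am tracking numerical data such as time on page, and categorical data: location (over 1000 unique values), Page_id (over 1000 unique values), and Author_id (over 100 unique values). The simplest solution would be to one-hot encode everything and feed it into an LSTM with variable sequence lengths, where each timestep corresponds to a different page view.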

The DataFrame above will generate 7 training samples with variable sequence lengths. For example, for

user_id=2
I will have 2 training samples:

[ ROW_INDEX_4 ] and [ ROW_INDEX_4, ROW_INDEX_5 ]
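To make the sample construction concrete, here is a minimal sketch of how the 7 variable-length samples could be derived from the toy data. It assumes that every prefix of a user's visit history is one training sample, which is how the `user_id=2` example reads; the names `rows_by_user` and `samples` are illustrative only.

```python
from collections import defaultdict

# user_id column of the toy DataFrame, in visit order
user_ids = [1, 1, 1, 1, 2, 2, 3]

# Group row indices by user, keeping the original order
rows_by_user = defaultdict(list)
for row_index, user_id in enumerate(user_ids):
    rows_by_user[user_id].append(row_index)

# Assumption: each prefix of a user's history is one training sample
samples = []
for user_id, rows in rows_by_user.items():
    for end in range(1, len(rows) + 1):
        samples.append(rows[:end])

print(len(samples))            # number of training samples
print(samples[4], samples[5])  # the two samples for user_id=2
```

Each entry of `samples` is a list of row indices; the actual `X[i]` would be the corresponding rows of the feature matrix.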

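Let
X
be the training data, and consider the first training sample,
X[0]
.

In the original figure (not preserved here), my categorical features are

X[0][:, n:]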

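Before creating the sequences, I factorized the categorical variables into

[0,1... number_of_cats-1]
using
pd.factorize()
, so the data in
X[0][:, n:]
are numbers corresponding to their category index.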
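For reference, a small sketch of what `pd.factorize()` returns for the `location` column of the toy data. The variable names `codes` and `uniques` are just illustrative.

```python
import pandas as pd

locations = pd.Series(
    ['London', 'New York', 'London', 'New York', 'Hong Kong', 'Tokyo', 'Madrid']
)

# codes: integer index per row, in order of first appearance
# uniques: the distinct categories, position i corresponds to code i
codes, uniques = pd.factorize(locations)

print(codes)
print(list(uniques))
print(len(uniques))  # number of categories for this feature
```

The number of distinct categories, `len(uniques)`, is the value an embedding layer for this feature would need as its vocabulary size.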

Do I need to create a separate

Embedding
for each categorical feature? That is, one embedding for each of
x_*n, x_*n+1, ..., x_*m
?

If so, how do I put this into Keras code?

model = Sequential()

model.add(Embedding(?, ?, input_length=variable)) # How do I feed the data into this embedding? Only the categorical inputs.

model.add(LSTM())
model.add(Dense())
model.add(Activation('sigmoid'))
model.compile()

model.fit_generator() # fits the `X[i]` one by one of variable length sequences.

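My idea for a solution:

Something like this:

I could train a Word2Vec model on each of the (m-n) categorical features to vectorize any given value. For example, with a 3-dimensional embedding, London would be represented by a vector of 3 numbers. I would then put everything back into the X matrix, which would now have n + 3(m-n) columns, and train the LSTM model on that?

I just think there should be an easier/smarter way.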

python tensorflow machine-learning keras lstm
2 Answers

17 votes
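As you mentioned, one solution is to one-hot encode the categorical data (or even use it as is, in index-based format) and feed it to the LSTM layer together with the numerical data. Of course, you can also have two LSTM layers here, one for processing the numerical data and another for processing the categorical data (in one-hot encoded or index-based format), and then merge their outputs.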

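Another solution is to use a separate embedding layer for each categorical feature. Each embedding layer may have its own embedding dimension (and, as suggested above, you may have more than one LSTM layer to process the numerical and categorical features separately):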

from tensorflow.keras.layers import (Input, Embedding, TimeDistributed,
                                     Reshape, LSTM, concatenate)
from tensorflow.keras.models import Model

num_cats = 3                  # number of categorical features
n_steps = 100                 # number of timesteps in each sample
n_numerical_feats = 10        # number of numerical features in each sample
cat_size = [1000, 500, 100]   # number of categories in each categorical feature
cat_embd_dim = [50, 10, 100]  # embedding dimension for each categorical feature

numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')

# one input per categorical feature
cat_inputs = []
for i in range(num_cats):
    cat_inputs.append(Input(shape=(n_steps, 1), name='cat' + str(i+1) + '_input'))

# one embedding per categorical feature, applied at every timestep
cat_embedded = []
for i in range(num_cats):
    embed = TimeDistributed(Embedding(cat_size[i], cat_embd_dim[i]))(cat_inputs[i])
    cat_embedded.append(embed)

cat_merged = concatenate(cat_embedded)
cat_merged = Reshape((n_steps, -1))(cat_merged)
merged = concatenate([numerical_input, cat_merged])
lstm_out = LSTM(64)(merged)

model = Model([numerical_input] + cat_inputs, lstm_out)
model.summary()
Here is the model summary:

Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
cat1_input (InputLayer)         (None, 100, 1)       0
__________________________________________________________________________________________________
cat2_input (InputLayer)         (None, 100, 1)       0
__________________________________________________________________________________________________
cat3_input (InputLayer)         (None, 100, 1)       0
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, 100, 1, 50)   50000       cat1_input[0][0]
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, 100, 1, 10)   5000        cat2_input[0][0]
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, 100, 1, 100)  10000       cat3_input[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 100, 1, 160)  0           time_distributed_1[0][0]
                                                                 time_distributed_2[0][0]
                                                                 time_distributed_3[0][0]
__________________________________________________________________________________________________
numeric_input (InputLayer)      (None, 100, 10)      0
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 100, 160)     0           concatenate_1[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 100, 170)     0           numeric_input[0][0]
                                                                 reshape_1[0][0]
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 64)           60160       concatenate_2[0][0]
==================================================================================================
Total params: 125,160
Trainable params: 125,160
Non-trainable params: 0
__________________________________________________________________________________________________
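There is yet another solution you can try: use just one embedding layer for all the categorical features. It involves some preprocessing, though: you need to re-index all the categories so that they are distinct from each other. For example, the categories in the first categorical feature would be numbered from 1 to

size_first_cat

, then the categories in the second categorical feature would be numbered from
size_first_cat + 1
to
size_first_cat + size_second_cat
, and so on. However, in this solution all the categorical features would have the same embedding dimension, since only one embedding layer is used.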
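The re-indexing step can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the batch `X_cat` is random, hypothetical data holding 0-based codes (such as those from `pd.factorize()`), one column per categorical feature, and the names `offsets`, `X_shared`, and `vocab_size` are not from the original answer.

```python
import numpy as np

cat_size = [1000, 500, 100]  # categories per feature, as in the example above

# offsets[i] = total number of categories in all features before feature i
offsets = np.concatenate([[0], np.cumsum(cat_size)[:-1]])

# Hypothetical batch: 2 samples, 4 timesteps, 3 categorical features,
# each column holding 0-based codes for its own feature
rng = np.random.default_rng(0)
X_cat = np.stack(
    [rng.integers(0, size, size=(2, 4)) for size in cat_size], axis=-1
)

# Shift each feature into its own ID range; +1 so numbering starts at 1
X_shared = X_cat + offsets + 1

# Size a single shared embedding would need (index 0 is left unused)
vocab_size = sum(cat_size) + 1

print(offsets)
print(X_shared.shape, vocab_size)
```

With this numbering the first feature occupies IDs 1 to 1000, the second 1001 to 1500, and the third 1501 to 1600. A single `Embedding(vocab_size, dim)` applied to an input of shape `(n_steps, num_cats)` would then produce one `dim`-sized vector per feature per timestep, which would still need reshaping to `(n_steps, num_cats * dim)` before being concatenated with the numerical input.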


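Update: Now that I think of it, you can also reshape the categorical features in the data preprocessing stage, or even in the model, to get rid of the TimeDistributed layers and the Reshape layer (this may also increase training speed):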

numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')

# categorical inputs are now 2D per batch: (batch, n_steps)
cat_inputs = []
for i in range(num_cats):
    cat_inputs.append(Input(shape=(n_steps,), name='cat' + str(i+1) + '_input'))

# Embedding maps (batch, n_steps) -> (batch, n_steps, embd_dim) directly
cat_embedded = []
for i in range(num_cats):
    embed = Embedding(cat_size[i], cat_embd_dim[i])(cat_inputs[i])
    cat_embedded.append(embed)

cat_merged = concatenate(cat_embedded)
merged = concatenate([numerical_input, cat_merged])
lstm_out = LSTM(64)(merged)

model = Model([numerical_input] + cat_inputs, lstm_out)
Model summary:

__________________________________________________________________________________________________
 Layer (type)                Output Shape          Param #   Connected to
==================================================================================================
 cat1_input (InputLayer)     [(None, 100)]         0         []
 cat2_input (InputLayer)     [(None, 100)]         0         []
 cat3_input (InputLayer)     [(None, 100)]         0         []
 embedding_14 (Embedding)    (None, 100, 50)       50000     ['cat1_input[0][0]']
 embedding_15 (Embedding)    (None, 100, 10)       5000      ['cat2_input[0][0]']
 embedding_16 (Embedding)    (None, 100, 100)      10000     ['cat3_input[0][0]']
 numeric_input (InputLayer)  [(None, 100, 10)]     0         []
 concatenate_26 (Concatenat  (None, 100, 160)      0         ['embedding_14[0][0]',
 e)                                                           'embedding_15[0][0]',
                                                              'embedding_16[0][0]']
 concatenate_27 (Concatenat  (None, 100, 170)      0         ['numeric_input[0][0]',
 e)                                                           'concatenate_26[0][0]']
 lstm_5 (LSTM)               (None, 64)            60160     ['concatenate_27[0][0]']
==================================================================================================
Total params: 125160 (488.91 KB)
Trainable params: 125160 (488.91 KB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________
As for fitting the model, you need to feed each input layer separately with its own corresponding numpy array, for example:

X_tr_numerical = X_train[:,:,:n_numerical_feats]

# extract categorical features: you can use a for loop to do this as well.
# note that we reshape categorical features to make them consistent with the updated solution
X_tr_cat1 = X_train[:,:,cat1_idx].reshape(-1, n_steps)
X_tr_cat2 = X_train[:,:,cat2_idx].reshape(-1, n_steps)
X_tr_cat3 = X_train[:,:,cat3_idx].reshape(-1, n_steps)

# don't forget to compile the model ...

# fit the model
model.fit([X_tr_numerical, X_tr_cat1, X_tr_cat2, X_tr_cat3], y_train, ...)

# or you can use input layer names instead
model.fit({'numeric_input': X_tr_numerical,
           'cat1_input': X_tr_cat1,
           'cat2_input': X_tr_cat2,
           'cat3_input': X_tr_cat3}, y_train, ...)
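The slicing step can be checked without Keras. Below is a small sketch on synthetic data; it assumes a hypothetical layout in which the categorical columns come right after the numerical ones in `X_train` (matching the `X[0][:, n:]` layout described in the question), and the loop replaces the three explicit `cat*_idx` slices.

```python
import numpy as np

n_samples, n_steps, n_numerical_feats = 8, 100, 10
cat_size = [1000, 500, 100]

# Synthetic stand-in for X_train: numeric columns first, then category codes
rng = np.random.default_rng(0)
numeric = rng.normal(size=(n_samples, n_steps, n_numerical_feats))
cats = np.stack(
    [rng.integers(0, s, size=(n_samples, n_steps)) for s in cat_size], axis=-1
)
X_train = np.concatenate([numeric, cats], axis=-1)  # shape (8, 100, 13)

# Numerical block: (n_samples, n_steps, n_numerical_feats)
X_tr_numerical = X_train[:, :, :n_numerical_feats]

# One (n_samples, n_steps) integer array per categorical feature
X_tr_cats = [
    X_train[:, :, n_numerical_feats + i].astype('int32')
    for i in range(len(cat_size))
]

# Same order as the model's inputs: numeric first, then cat1, cat2, cat3
inputs = [X_tr_numerical] + X_tr_cats
for array in inputs:
    print(array.shape, array.dtype)
```

Casting the category columns back to an integer dtype is a precaution here, because storing them next to float features in one array turns the codes into floats.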
If you would like to use

fit_generator()

, there is no difference:

from tensorflow.keras.utils import Sequence

# if you are using a generator
def my_generator(...):
    # prep the data ...

    yield [batch_tr_numerical, batch_tr_cat1, batch_tr_cat2, batch_tr_cat3], batch_tr_y

    # or use the names
    yield {'numeric_input': batch_tr_numerical,
           'cat1_input': batch_tr_cat1,
           'cat2_input': batch_tr_cat2,
           'cat3_input': batch_tr_cat3}, batch_tr_y

# note: in recent TensorFlow versions fit_generator() is deprecated
# and model.fit() accepts generators and Sequence objects directly
model.fit_generator(my_generator(...), ...)

# or if you are subclassing the Sequence class
class MySequence(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        # initialize the data
        ...

    def __getitem__(self, idx):
        # fetch data for the given batch index (i.e. idx)
        # same as the generator above but use `return` instead of `yield`
        ...

model.fit_generator(MySequence(...), ...)
    

0 votes
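Another solution I can think of is to concatenate the numerical features (after normalizing them) and the categorical features together before feeding them to the LSTM.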

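During backpropagation, allow the gradient to flow only through the embedding layer, since by default the gradient will flow through both branches.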
