I am working with a dataset of about 920,614 rows and several columns, including "orig_item_title", "sub_item_title", "is_brand_same", and "is_flavor_same". The goal is to build a model that predicts item similarity or relatedness, specifically whether a substitute item is similar to the original item. I am implementing a learning-to-rank (LTR) framework that combines features such as brand and flavor matches with embeddings from a BERT encoder over the text columns.
Here is the code snippet I use to preprocess the features and create a TensorFlow dataset:
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

# Load a pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = TFBertModel.from_pretrained(model_name)  # Embedding size is 768
# Define a function for BERT encoding
def bert_encoder(text_column):
    input_ids = tokenizer(text_column, return_tensors="tf", truncation=True, padding=True)["input_ids"]
    outputs = bert_model(input_ids)
    pooled_output = outputs.pooler_output
    return pooled_output
def preprocess_features(df):
    # Extract features and labels
    text_columns = ["orig_item_title", "sub_item_title"]
    numerical_columns = ["is_brand_same", "is_flavor_same"]
    label_column = "acc_rate"
    features = {
        "is_brand_same": df["is_brand_same"],
        "is_flavor_same": df["is_flavor_same"],
    }
    # Replace the raw text with BERT embeddings (this is where the OOM occurs)
    for col in text_columns:
        features[col] = bert_encoder(df[col].tolist())
    return features, df[label_column]
dataset = tf.data.Dataset.from_tensor_slices(preprocess_features(df))
However, preprocessing hits an out-of-memory (OOM) error because it materializes a tensor of shape [920614, 55, 768]. I am looking for advice on reducing the embedding dimension (perhaps to 256 or 128) and on alternative approaches that let the preprocessing finish without exhausting memory. Any suggestions or code guidance would be very helpful.
Also, could someone help with the model code, including a BERT embedding layer for the text features, concatenation with the numerical features, and a neural-network output layer with a sigmoid prediction?
Thank you.
I suggest you process the data in batches, as shown in the code below. Instead of loading the entire dataset into memory, load and process it in chunks. You can use TensorFlow's Dataset API for this; it is designed to handle large datasets efficiently.
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = TFBertModel.from_pretrained(model_name)
# Define a function for BERT encoding with dimensionality reduction
def bert_encoder(text_column, batch_size=32, embed_dim=256):
    # Create the reduction layer once, outside the loop, so every batch is
    # projected with the same weights (note: the weights are randomly
    # initialized, i.e. a random projection, unless trained with the model)
    dense_layer = tf.keras.layers.Dense(embed_dim, activation="relu")
    # Dataset for efficient batch processing
    dataset = tf.data.Dataset.from_tensor_slices(text_column).batch(batch_size)
    embeddings = []
    for batch in dataset:
        # Decode the tf.string tensors back to Python strings for the tokenizer
        texts = [t.decode("utf-8") for t in batch.numpy()]
        input_ids = tokenizer(texts, return_tensors="tf", padding=True, truncation=True)["input_ids"]
        outputs = bert_model(input_ids)
        pooled_output = outputs.pooler_output  # shape: (batch, 768)
        # Dimensionality reduction from 768 to embed_dim
        embeddings.append(dense_layer(pooled_output))
    return tf.concat(embeddings, axis=0)
def preprocess_features(df, batch_size=32, embed_dim=256):
    # Encode the text columns in batches; keep the embeddings as tensors
    # rather than assigning 2-D arrays to DataFrame columns, which would fail
    orig_emb = bert_encoder(df["orig_item_title"].tolist(), batch_size, embed_dim)
    sub_emb = bert_encoder(df["sub_item_title"].tolist(), batch_size, embed_dim)
    # Combine the embeddings with the numerical features from the question
    features = {
        "orig_item_title_emb": orig_emb,
        "sub_item_title_emb": sub_emb,
        "is_brand_same": df["is_brand_same"].values.astype("float32").reshape(-1, 1),
        "is_flavor_same": df["is_flavor_same"].values.astype("float32").reshape(-1, 1),
    }
    return features, df["acc_rate"].values.astype("float32")
# Assuming df is your DataFrame
features, labels = preprocess_features(df)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)
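For the model itself (the second part of your question), here is a minimal sketch: the precomputed text embeddings are concatenated with the numerical match features and passed through a small dense head with a sigmoid output. The layer sizes and the choice of binary cross-entropy are assumptions on my part, not a definitive design; if acc_rate is better treated as a plain regression target, swap in mean squared error.
def build_model(embed_dim=256):
    # Input names match the keys of the feature dict from preprocess_features,
    # so Keras can route the dataset's feature dictionary automatically
    orig_emb = tf.keras.Input(shape=(embed_dim,), name="orig_item_title_emb")
    sub_emb = tf.keras.Input(shape=(embed_dim,), name="sub_item_title_emb")
    brand = tf.keras.Input(shape=(1,), name="is_brand_same")
    flavor = tf.keras.Input(shape=(1,), name="is_flavor_same")

    # Concatenate the text embeddings with the numerical features
    x = tf.keras.layers.Concatenate()([orig_emb, sub_emb, brand, flavor])
    x = tf.keras.layers.Dense(128, activation="relu")(x)  # sizes are illustrative
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    # Sigmoid output for the predicted similarity/acceptance rate in [0, 1]
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs=[orig_emb, sub_emb, brand, flavor], outputs=output)
    # Binary cross-entropy also handles soft targets in [0, 1]
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["mae"])
    return model

model = build_model()
model.fit(dataset, epochs=3)
One caveat: with this setup the Dense reduction layer inside bert_encoder is randomly initialized and frozen once the embeddings are precomputed. If memory allows, you may get better results by putting the projection (or BERT itself) inside the model so it is trained jointly with the rest of the network.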