Memory-efficient BERT text embeddings for preprocessing a large dataset in TensorFlow


I'm working with a dataset of roughly 920,614 rows and several columns, including "orig_item_title", "sub_item_title", "is_brand_same", and "is_flavor_same". The goal is to build a model that predicts item similarity or relatedness, specifically whether a substitute item is similar to the original item. I'm implementing a learning-to-rank (LTR) framework that combines features such as brand and flavor matches with embeddings from a BERT encoder over the text columns.

Here is the code snippet used to preprocess the features and create a TensorFlow dataset:

from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

# Load a pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = TFBertModel.from_pretrained(model_name)  # Embedding size is 768

# Define a function for BERT encoding
def bert_encoder(text_column):
    input_ids = tokenizer(text_column, return_tensors="tf", truncation=True, padding=True)["input_ids"]
    outputs = bert_model(input_ids)
    pooled_output = outputs.pooler_output
    return pooled_output

def preprocess_features(df):
    # Extract features and labels
    text_columns = ["orig_item_title", "sub_item_title"]
    numerical_columns = ["is_brand_same", "is_flavor_same"]
    label_column = "acc_rate"

    features = {
        "orig_item_title": df["orig_item_title"],
        "sub_item_title": df["sub_item_title"],
        "is_brand_same": df["is_brand_same"],
        "is_flavor_same": df["is_flavor_same"],
    }

    for col in text_columns:
        features[col] = bert_encoder(df[col].tolist())  

    # Numerical columns
    numerical_features = [tf.feature_column.numeric_column(col) for col in numerical_columns]
    features.update({col: df[col] for col in numerical_columns})

    return features, df[label_column]

dataset = tf.data.Dataset.from_tensor_slices(preprocess_features(df))

However, I run into an out-of-memory (OOM) error during preprocessing, because the whole column goes through BERT at once and a tensor of shape [920614, 55, 768] has to be allocated (roughly 920,614 × 55 × 768 × 4 bytes ≈ 155 GB in float32). I'm looking for advice on reducing the embedding dimension (perhaps to 256 or 128) and on alternative approaches that would let the preprocessing complete without running out of memory. Any suggestions or code guidance would be very helpful.

Also, could someone help with the model code as well: integrating the BERT embeddings for the text features, concatenating them with the numerical features, and adding neural network layers with a sigmoid prediction in the output layer?

Thank you.

tensorflow nlp huggingface-transformers bert-language-model embedding
1 Answer

I'd suggest processing in batches, as in the code below. Instead of loading the whole dataset into memory, load and process the data in chunks. You can use TensorFlow's tf.data Dataset API for this; it is designed for handling large datasets efficiently.

from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = TFBertModel.from_pretrained(model_name)

# Define a function for BERT encoding with dimensionality reduction
def bert_encoder(text_column, batch_size=32, embed_dim=256):
    # Dataset for efficient batch processing
    dataset = tf.data.Dataset.from_tensor_slices(text_column).batch(batch_size)

    # Create the projection layer once, outside the loop, so every batch is
    # reduced with the same weights (a layer created inside the loop would
    # apply a different random projection to each batch)
    dense_layer = tf.keras.layers.Dense(embed_dim, activation="relu")

    embeddings = []
    for batch in dataset:
        # tf.data yields byte strings; decode them back to Python strings
        texts = [t.decode("utf-8") for t in batch.numpy()]
        inputs = tokenizer(texts, return_tensors="tf", padding=True, truncation=True)
        # Pass the attention mask along with the input IDs so padded tokens are ignored
        outputs = bert_model(**inputs)
        pooled_output = outputs.pooler_output
        # Dimensionality reduction: 768 -> embed_dim
        embeddings.append(dense_layer(pooled_output))

    return tf.concat(embeddings, axis=0)

def preprocess_features(df, batch_size=32, embed_dim=256):
    # Process the text columns in batches; each call returns a
    # [num_rows, embed_dim] tensor
    orig_emb = bert_encoder(df["orig_item_title"].tolist(), batch_size, embed_dim)
    sub_emb = bert_encoder(df["sub_item_title"].tolist(), batch_size, embed_dim)

    # Combine all features in a dict (2-D tensors cannot be assigned
    # directly to DataFrame columns); reshape the binary flags to (n, 1)
    features = {
        "orig_item_title_emb": orig_emb,
        "sub_item_title_emb": sub_emb,
        "is_brand_same": df["is_brand_same"].values.astype("float32").reshape(-1, 1),
        "is_flavor_same": df["is_flavor_same"].values.astype("float32").reshape(-1, 1),
    }

    return features, df["acc_rate"].values.astype("float32")

# Assuming df is your DataFrame
features, labels = preprocess_features(df)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)  # choose a batch size that fits your memory
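
The question also asked how to wire up the model itself. Here is a minimal sketch under the assumptions above: it consumes the 256-dimensional reduced embeddings produced by preprocess_features together with the two binary flags, concatenates everything, and ends in a sigmoid output. The name build_similarity_model and the layer sizes are only illustrative.

def build_similarity_model(embed_dim=256):
    # One input per key in the features dict returned by preprocess_features
    orig_emb = tf.keras.Input(shape=(embed_dim,), name="orig_item_title_emb")
    sub_emb = tf.keras.Input(shape=(embed_dim,), name="sub_item_title_emb")
    is_brand_same = tf.keras.Input(shape=(1,), name="is_brand_same")
    is_flavor_same = tf.keras.Input(shape=(1,), name="is_flavor_same")

    # Concatenate the text embeddings with the numerical features
    x = tf.keras.layers.Concatenate()([orig_emb, sub_emb, is_brand_same, is_flavor_same])

    # Small feed-forward head with a sigmoid prediction in the output layer
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(
        inputs=[orig_emb, sub_emb, is_brand_same, is_flavor_same],
        outputs=output,
    )
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC()],
    )
    return model

model = build_similarity_model()
model.fit(dataset, epochs=3)

Note that with this setup the BERT embeddings are precomputed once in preprocess_features, so the encoder itself is not fine-tuned; if you want to fine-tune BERT end to end, you would instead place the TFBertModel inside the Keras model and feed it token IDs rather than precomputed embeddings.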
