I'm running a Python 3.11 instance on Google Colab to fine-tune a GPT-2 model. I install keras-nlp as follows, and my package versions come out as:
!pip install keras_nlp
import tensorflow as tf
import keras
import keras_nlp
print(tf.__version__)
print(keras.__version__)
print(keras_nlp.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
>>2.16.1
>>3.3.3
>>0.11.1
>>1
Below I've prepared a minimal reproducible example that produces the error I'm getting. Incidentally, if I set up the environment as follows (from the original Google Colab notebook), the code below runs fine; the problem with that version is that it doesn't recognize the GPU:
!pip install -q git+https://github.com/keras-team/keras-nlp.git@google-io-2023 tensorflow-text==2.12
import tensorflow as tf
import keras
import keras_nlp
print(tf.__version__)
print(keras.__version__)
print(keras_nlp.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
>>2.12.1
>>2.12.0
>>0.5.0
>>0
So I think this is a version-compatibility issue. The top configuration (the one that recognizes the GPU) runs fine until model fitting, at which point it throws the error below.
import numpy as np
import keras_nlp
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_text as tf_text
from tensorflow import keras
from tensorflow.lite.python import interpreter
import time
from google.colab import files
from google.colab import runtime
gpt2_tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset("gpt2_base_en")
gpt2_preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=512,
    add_end_token=True,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en", preprocessor=gpt2_preprocessor)
# Create dummy train data to highlight the error
training_list_manual = ['This is a sentence that I like', 'I went to school today.', 'I have a bike, you can ride it if you like.']
tf_train_ds = tf.data.Dataset.from_tensor_slices(training_list_manual)
processed_ds = tf_train_ds.map(gpt2_preprocessor, num_parallel_calls=tf.data.AUTOTUNE).batch(64).cache().prefetch(tf.data.AUTOTUNE)
# Attempt fine-tune
gpt2_lm.include_preprocessing = False
num_epochs = 1
lr = tf.keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=processed_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(lr),
    loss=loss,
    weighted_metrics=["accuracy"])
gpt2_lm.fit(processed_ds, epochs=num_epochs)
On the final .fit() line I get the ValueError below. I understand it doesn't like the `x` value I'm trying to feed it, but I don't know what to pass instead. Any ideas?
ValueError: Exception encountered when calling GPT2CausalLMPreprocessor.call().
Unsupported input for `x`. `x` should be a string, a list of strings, or a list of tensors. If passing multiple segments which should packed together, please convert your inputs to a list of tensors. Received `x={'token_ids': <tf.Tensor 'args_1:0' shape=(None, None, 1024) dtype=int32>, 'padding_mask': <tf.Tensor 'args_0:0' shape=(None, None, 1024) dtype=bool>}`
Arguments received by GPT2CausalLMPreprocessor.call():
• x={'token_ids': 'tf.Tensor(shape=(None, None, 1024), dtype=int32)', 'padding_mask': 'tf.Tensor(shape=(None, None, 1024), dtype=bool)'}
• y=tf.Tensor(shape=(None, None, 1024), dtype=int32)
• sample_weight=tf.Tensor(shape=(None, None, 1024), dtype=bool)
• sequence_length=None
The contents of a single data item look like this. Is it expecting me to feed only the `token_ids` values to .fit()?
for example in tf_train_ds.take(1):
    print(gpt2_preprocessor(example))
({'token_ids': <tf.Tensor: shape=(1, 1024), dtype=int32, numpy=array([[50256, 1212, 318, ..., 0, 0, 0]], dtype=int32)>, 'padding_mask': <tf.Tensor: shape=(1, 1024), dtype=bool, numpy=array([[ True, True, True, ..., False, False, False]])>}, <tf.Tensor: shape=(1, 1024), dtype=int32, numpy=array([[1212, 318, 257, ..., 0, 0, 0]], dtype=int32)>, <tf.Tensor: shape=(1, 1024), dtype=bool, numpy=array([[ True, True, True, ..., False, False, False]])>)
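The rank-3 shapes in the error can be reproduced without KerasNLP at all. Here is a minimal sketch (with a toy tokenizer; `toy_preprocess` is a hypothetical stand-in, not the real preprocessor) of why mapping the preprocessor over unbatched strings and then batching yields tensors like `(None, None, 1024)`:

```python
import numpy as np

SEQ_LEN = 1024  # matches the last dimension reported in the ValueError

def toy_preprocess(text):
    """Stand-in for GPT2CausalLMPreprocessor applied to ONE unbatched string.

    Like the real preprocessor, it treats its input as a batch of size 1,
    so token_ids comes back with shape (1, SEQ_LEN), not (SEQ_LEN,).
    """
    ids = np.zeros((1, SEQ_LEN), dtype=np.int32)
    tokens = text.split()
    ids[0, :len(tokens)] = np.arange(1, len(tokens) + 1)  # fake token ids
    return {"token_ids": ids, "padding_mask": ids != 0}

examples = [
    "This is a sentence that I like",
    "I went to school today.",
    "I have a bike, you can ride it if you like.",
]

# dataset.map(preprocessor) runs per example -> each output is (1, 1024)
per_example = [toy_preprocess(t) for t in examples]
print(per_example[0]["token_ids"].shape)  # (1, 1024)

# .batch(64) then stacks those outputs, adding ANOTHER leading axis
batched = np.stack([e["token_ids"] for e in per_example])
print(batched.shape)  # (3, 1, 1024), i.e. (None, None, 1024) in graph mode
```

Because the model's own preprocessor is still attached, at fit time it receives this rank-3 dict instead of the raw strings it expects, which is exactly the "Unsupported input for `x`" complaint.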
It does work when I don't preprocess the data beforehand (gpt2_lm.include_preprocessing = True) and instead batch tf_train_ds at fit time (gpt2_lm.fit(tf_train_ds.batch(64), epochs=num_epochs)):
# Attempt fine-tune
gpt2_lm.include_preprocessing = True
num_epochs = 1
lr = tf.keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=processed_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(lr),
    loss=loss,
    weighted_metrics=["accuracy"])
gpt2_lm.fit(tf_train_ds.batch(64), epochs=num_epochs)
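Why this second configuration works can be sketched with the same toy setup (again hypothetical names, not the real KerasNLP internals): batching the raw strings first hands the model's attached preprocessor a shape-(batch,) string input, and a batch-aware preprocessor then emits rank-2 `(batch, seq_len)` tensors, which is what the model can consume:

```python
import numpy as np

SEQ_LEN = 1024

def toy_preprocess_batch(texts):
    """Stand-in for the model's attached preprocessor on a BATCH of raw strings."""
    ids = np.zeros((len(texts), SEQ_LEN), dtype=np.int32)
    for row, text in enumerate(texts):
        tokens = text.split()
        ids[row, :len(tokens)] = np.arange(1, len(tokens) + 1)  # fake token ids
    return {"token_ids": ids, "padding_mask": ids != 0}

examples = [
    "This is a sentence that I like",
    "I went to school today.",
    "I have a bike, you can ride it if you like.",
]

# tf_train_ds.batch(64) hands the model a batch of raw strings, shape (3,)
features = toy_preprocess_batch(examples)
print(features["token_ids"].shape)  # (3, 1024): rank 2, as the model expects
```

In other words, preprocess once (either ahead of time in the tf.data pipeline, or inside the model) but not both; doing both stacks an extra batch axis and breaks the attached preprocessor's input contract.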