Fine-tuning GPT2 - attention mask and pad token id error


I have been trying to fine-tune GPT2 on the wikitext-2 dataset (just to help myself learn the process), but I am getting a warning message I have never seen before:

"The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:50256 for open-end generation."

This seems odd, because I explicitly specify the EOS token in my code when instantiating the tokenizer:

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>')

Training completes without crashing and my loss improves every epoch, but when I run inference with the model it outputs absolute gibberish, sometimes generating only a single word and nothing else. I suspect there is a connection between this warning message and the model's poor performance.
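For reference, here is a minimal sketch (assuming the same transformers setup as my full code below) of the kind of check that shows the pad token added to the tokenizer is not automatically known to the model config, which is what generate() falls back on when no pad_token_id is passed:

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.resize_token_embeddings(len(tokenizer))  # make room for the newly added special tokens

# the tokenizer knows about the new pad token...
print(tokenizer.pad_token, tokenizer.pad_token_id)  # '<|pad|>' plus a newly assigned id above 50256
# ...but the model config, which generate() consults, does not, hence the fallback to eos_token_id
print(model.config.pad_token_id)  # None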

I got my train, validation, and test data from here (I used the .raw files) - https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/

I manually added <|startoftext|> and <|endoftext|> to the dataset's raw txt files. This produces training data that looks like these two examples (taken from the middle of the text file):

...
<|startoftext|>
= Perfect Dark ( 2010 video game ) = 
 
 Perfect Dark is a remastered release of the first @-@ person shooter video game by the same name . Developed by 4J Studios and published by Microsoft Game Studios a decade after the original 's 2000 release , the remaster features several technical improvements , including higher resolution textures and models , a higher frame rate , and a multiplayer mode that supports the Xbox Live online service . It was released for the Xbox 360 video game console in March 2010 , through the Xbox Live Arcade download service . The story of the game follows Joanna Dark , an agent of the Carrington Institute organization , as she attempts to stop a conspiracy by rival corporation dataDyne . 
 Perfect Dark was under development for nearly a year and its game engine was completely re @-@ written from scratch to support several Xbox 360 features . Therefore , although the game plays exactly the same as the original , the code and renderer is different . The game received generally favorable reviews . Some critics considered the relatively unchanged game to be outdated , but most agreed that the title was a solid revival of a classic . As of the end of 2011 , the game had sold nearly 410 @,@ 000 units . 
 
 = = Gameplay = = 
 
 Perfect Dark is a first @-@ person shooter with elements of stealth games . In the game 's campaign mode , the player controls Joanna Dark through a series of nonlinear levels collected together into missions . Each level requires the player to complete a certain number of objectives , ranging from disguising oneself to hacking computers , collecting objects , and defeating enemies , among others . Players can carry an unlimited number of weapons and almost all of the weapons have two firing modes . The levels in Perfect Dark have no checkpoints , meaning that if Joanna is killed or fails an objective , the player has to start the level from the beginning . Every level can be played on three difficulty settings and several aspects , such as the enemies aggressiveness and the number of objectives that must be completed , among others , can vary in function of the chosen difficulty . Two players can also play the campaign co @-@ operatively or through a " counter @-@ operative " mode , in which one player controls the protagonist , while the other controls enemies throughout the level , attempting to stop the first player from completing objectives . 
 
 = = = Enhancements = = = 
 
 The remaster offers several improvements over the original Perfect Dark that was released for the Nintendo 64 in 2000 . The most remarkable change is that any of the multiplayer modes , including co @-@ operative and counter @-@ operative , can now be played in either splitscreen or through the Xbox Live online service . Combat Simulator matches are still capped at 12 entities , but the game can now comprise eight players online simultaneously , an improvement to the original 's cap of four players and eight Simulants . Players can also play against more than eight Simulants as long as there are enough slots available in a match ; for example , a single player can play against 11 Simulants ; such a feature was not possible in the original game . Unlike the original game , all the multiplayer content is unlocked from the beginning , and weapons from the game 's predecessor , which were originally only available in the missions , are now available to use in multiplayer . The game features an online leaderboard system and players can earn achievements and in @-@ game crowns by accomplishing certain tasks . The game also includes two new control set @-@ ups , entitled " Spartan " and " Duty Calls " , which are based on the popular first @-@ person shooter franchises Halo and Call of Duty respectively . 
 
 <|endoftext|>
<|startoftext|>
 = First Ostend Raid = 
 
 The First Ostend Raid ( part of Operation ZO ) was the first of two attacks by the Royal Navy on the German @-@ held port of Ostend during the late spring of 1918 during the First World War . Ostend was attacked in conjunction with the neighbouring harbour of Zeebrugge on 23 April in order to block the vital strategic port of Bruges , situated 6 mi ( 5 @.@ 2 nmi ; 9 @.@ 7 km ) inland and ideally sited to conduct raiding operations on the British coastline and shipping lanes . Bruges and its satellite ports were a vital part of the German plans in their war on Allied commerce ( Handelskrieg ) because Bruges was close to the troopship lanes across the English Channel and allowed much quicker access to the Western Approaches for the U @-@ boat fleet than their bases in Germany . 
 The plan of attack was for the British raiding force to sink two obsolete cruisers in the canal mouth at Ostend and three at Zeebrugge , thus preventing raiding ships leaving Bruges . The Ostend canal was the smaller and narrower of the two channels giving access to Bruges and so was considered a secondary target behind the Zeebrugge Raid . Consequently , fewer resources were provided to the force assaulting Ostend . While the attack at Zeebrugge garnered some limited success , the assault on Ostend was a complete failure . The German marines who defended the port had taken careful preparations and drove the British assault ships astray , forcing the abortion of the operation at the final stage . 
 Three weeks after the failure of the operation , a second attack was launched which proved more successful in sinking a blockship at the entrance to the canal but ultimately did not close off Bruges completely . Further plans to attack Ostend came to nothing during the summer of 1918 , and the threat from Bruges would not be finally stopped until the last days of the war , when the town was liberated by Allied land forces . 
 
 = = Bruges = = 
 
 Bruges had been captured by the advancing German divisions during the Race for the Sea and had been rapidly identified as an important strategic asset by the German Navy . Bruges was situated 6 mi ( 5 @.@ 2 nmi ; 9 @.@ 7 km ) inland at the centre of a network of canals which emptied into the sea at the small coastal towns of Zeebrugge and Ostend . This land barrier protected Bruges from bombardment by land or sea by all but the very largest calibre artillery and also secured it against raiding parties from the Royal Navy . Capitalising on the natural advantages of the port , the German Navy constructed extensive training and repair facilities at Bruges , equipped to provide support for several flotillas of destroyers , torpedo boats and U @-@ boats . 
 By 1916 , these raiding forces were causing serious concern in the Admiralty as the proximity of Bruges to the British coast , to the troopship lanes across the English Channel and for the U @-@ boats , to the Western Approaches ; the heaviest shipping lanes in the World at the time . In the late spring of 1915 , Admiral Reginald Bacon had attempted without success to destroy the lock gates at Ostend with monitors . This effort failed , and Bruges became increasingly important in the Atlantic Campaign , which reached its height in 1917 . By early 1918 , the Admiralty was seeking ever more radical solutions to the problems raised by unrestricted submarine warfare , including instructing the " Allied Naval and Marine Forces " department to plan attacks on U @-@ boat bases in Belgium . 
 The " Allied Naval and Marine Forces " was a newly formed department created with the purpose of conducting raids and operations along the coastline of German @-@ held territory . The organisation was able to command extensive resources from both the Royal and French navies and was commanded by Admiral Roger Keyes and his deputy , Commodore Hubert Lynes . Keyes , Lynes and their staff began planning methods of neutralising Bruges in late 1917 and by April 1918 were ready to put their plans into operation . 
 
 = = Planning = = 
 
 To block Bruges , Keyes and Lynes decided to conduct two raids on the ports through which Bruges had access to the sea . Zeebrugge was to be attacked by a large force consisting of three blockships and numerous supporting warships . Ostend was faced by a similar but smaller force under immediate command of Lynes . The plan was for two obsolete cruisers — HMS Sirius and Brilliant — to be expended in blocking the canal which emptied at Ostend . These ships would be stripped to essential fittings and their lower holds and ballast filled with rubble and concrete . This would make them ideal barriers to access if sunk in the correct channel at the correct angle . 
 When the weather was right , the force would cross the English Channel in darkness and attack shortly after midnight to coincide with the Zeebrugge Raid a few miles up the coast . By coordinating their operations , the assault forces would stretch the German defenders and hopefully gain the element of surprise . Covering the Inshore Squadron would be heavy bombardment from an offshore squadron of monitors and destroyers as well as artillery support from Royal Marine artillery near Ypres in Allied @-@ held Flanders . Closer support would be offered by several flotillas of motor launches , small torpedo boats and Coastal Motor Boats which would lay smoke screens to obscure the advancing blockships as well as evacuate the crews of the cruisers after they had blocked the channel . 

<|endoftext|> ...

I followed this tutorial very closely - https://colab.research.google.com/drive/13dZVYEOMhXhkXWfvSMVM1TTtUDrT6Aeh?usp=sharing#scrollTo=pBEVY2PYSTXJ

Here is my full code:

import random
import time
import datetime
import torch
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AdamW, get_linear_schedule_with_warmup, GPT2Config

smallest_gpt2 = 'gpt2'  # 124M weights (parameters)

# load training texts
with open('wikitext-2-raw/wiki.train.raw', 'r') as o:
    raw_train_text = o.read()  # read() returns the whole file as a single string (readlines() would return a list of lines)
with open('wikitext-2-raw/wiki.valid.raw', 'r') as o:
    raw_validation_text = o.read()
with open('wikitext-2-raw/wiki.test.raw', 'r') as o:
    raw_test_text = o.read()

# PRE-PROCESSING TRAINING, VALIDATION, AND TEST TEXTS
preprocessed_train = raw_train_text.split('<|startoftext|>')
preprocessed_train = [i for i in preprocessed_train if i]  # removes empty list entries
preprocessed_train = ['<|startoftext|>' + '\n' + entry for entry in preprocessed_train]  # adds <|startoftext|> to start
preprocessed_valid = raw_validation_text.split('<|startoftext|>')
preprocessed_valid = [i for i in preprocessed_valid if i]
preprocessed_valid = ['<|startoftext|>' + '\n' + entry for entry in preprocessed_valid]
preprocessed_test = raw_test_text.split('<|startoftext|>')
preprocessed_test = [i for i in preprocessed_test if i]
preprocessed_test = ['<|startoftext|>' + '\n' + entry for entry in preprocessed_test]

# HYPER PARAMETERS
EPOCHS = 5
BATCH_SIZE = 2  # GPT2 is a large model, so higher batch sizes can lead to memory problems
WARMUP_STEPS = 100
LEARNING_RATE = 5e-4
DECAY = 0
EPSILON = 1e-8


class GPT2Dataset(Dataset):

    def __init__(self, txt_list, _tokenizer, gpt2_type=smallest_gpt2, max_length=768):
        self.tokenizer = _tokenizer
        self.input_ids = []
        self.attn_masks = []

        # this loop will wrap all training data examples in BOS and EOS tokens (beginning/end of sequence)
        # this, again, helps the model understand the "format" of what you're training it for
        # note however, that if a training example is longer than the max length, the EOS token will be truncated, and
        #   this is not a problem for the model's training process
        for txt in txt_list:
            # pre_processed_text = '<|startoftext|>' + txt + '<|endoftext|>'  # i did this manually, so I skip it here
            # print(txt)

            # i handled most of the pre-processing for the training data further up in the code
            encodings_dict = _tokenizer(txt, truncation=True, max_length=max_length, padding="max_length")

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]


# loading tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>',
                                          pad_token='<|pad|>')  # gpt2-medium

print("The max model length is {} for this model, although the actual embedding size for GPT small is 768".format(tokenizer.model_max_length))
print("The beginning of sequence token {} token has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
print("The padding token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.pad_token_id), tokenizer.pad_token_id))

# create dataset objects
train_dataset = GPT2Dataset(preprocessed_train, tokenizer, max_length=768)
valid_dataset = GPT2Dataset(preprocessed_valid, tokenizer, max_length=768)
test_dataset = GPT2Dataset(preprocessed_test, tokenizer, max_length=768)

# getting size of datasets
train_size = len(train_dataset)
val_size = len(valid_dataset)

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

# Create the DataLoaders for our training and validation datasets.
# We'll take training samples in random order.
train_dataloader = DataLoader(  # todo learn how dataloader creates targets
            train_dataset,  # The training samples.
            sampler=RandomSampler(train_dataset),  # Select batches randomly
            batch_size=BATCH_SIZE  # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            valid_dataset,  # The validation samples.
            sampler=SequentialSampler(valid_dataset),  # Pull out batches sequentially.
            batch_size=BATCH_SIZE  # Evaluate with this batch size.
        )

# config
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

# instantiate model
model = GPT2LMHeadModel.from_pretrained(smallest_gpt2, config=configuration)

# this step is necessary because I've added some tokens (bos_token, etc) to the embeddings
# otherwise the tokenizer and model tensors won't match up. NOTE these tokens are already added to tokenizer above
model.resize_token_embeddings(len(tokenizer))

# this produces sample output every 50 steps
sample_every = 50

# Note: AdamW is a class from the huggingface library (as opposed to pytorch)
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=EPSILON)

# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * EPOCHS

# Create the learning rate scheduler.
# This changes the learning rate as the training loop progresses
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=total_steps)

training_stats = []
total_t0 = time.time()

# device config
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)


def format_time(_elapsed):
    return str(datetime.timedelta(seconds=int(round(_elapsed))))


for epoch_i in range(0, EPOCHS):

    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, EPOCHS))
    print('Training...')

    t0 = time.time()

    total_train_loss = 0

    model.train()  # puts model in training mode

    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)  # training targets
        b_masks = batch[1].to(device)

        model.zero_grad()

        # feeding the input to the model
        outputs = model(b_input_ids,
                        labels=b_labels,
                        attention_mask=b_masks,
                        token_type_ids=None
                        )

        loss = outputs[0]  # how "wrong" was the model?

        batch_loss = loss.item()
        total_train_loss += batch_loss

        # Get sample every x batches. This is just a check to see how the model is doing.
        if step % sample_every == 0 and not step == 0:

            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}. Loss: {:>5,}.   Elapsed: {:}.'.format(step, len(train_dataloader),
                                                                                     batch_loss, elapsed))

            model.eval()  # puts model in evaluation mode, where the necessary layers are turned off for inference

            # normally you would wrap this inference in torch.no_grad() so gradient tracking is disabled. However, the tutorial I follow does not do this.
            # with torch.no_grad():
            # ... do inference eval ...

            # Here we are simply using the model to get an output. This is called inference.
            sample_outputs = model.generate(
                bos_token_id=random.randint(1, 30000),  # todo why do we do this line?
                do_sample=True,  # switches on sampling, where model will randomly select next word from the sample pool
                top_k=50,  # only 50 words will be considered for the next word in the sequence
                max_length=200,  # max tokens for total generation
                top_p=0.95,  # smallest set of words whose probabilities summed together reach/exceed top_p value
                num_return_sequences=1  # we only want model to generate one complete response (sequence of words)
                # temperature=1
            )

            # temperature is another parameter we can use when running inference
            # temperature of 0 will choose the highest-probability word each time
            # temperature of 1 is default, and uses the model's base confidence to choose the next word
            # temperature above 1 will make the model choose less-likely words. More creative, but more risk of nonsense

            # we only sample for one return sequence so this for is sort of unnecessary, but whatever
            for i, sample_output in enumerate(sample_outputs):
                print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

            model.train()  # we have to put model back in train mode after eval mode

        loss.backward()  # change weights with backprop

        optimizer.step()

        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))

    # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    t0 = time.time()

    model.eval()

    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        with torch.no_grad():  # weights are not updated
            outputs = model(b_input_ids,
                            # token_type_ids=None,
                            attention_mask=b_masks,
                            labels=b_labels)

            loss = outputs[0]

        batch_loss = loss.item()
        total_eval_loss += batch_loss

    avg_val_loss = total_eval_loss / len(validation_dataloader)

    validation_time = format_time(time.time() - t0)

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time() - total_t0)))

machine-learning tokenize training-data gpt-2 fine-tune
1 Answer

I don't think this is related to your model performing poorly, but to answer your question: the warning comes from the generation routine.

As described here, it can be fixed by simply setting pad_token_id to the tokenizer's eos_token_id in the call to generate. It worked for me.
