I want to do author classification on the Reuters 50/50 dataset, where the maximum document length is 1600+ tokens and there are 50 classes/authors in total. With max_length=1700 and batch_size=1, I get RuntimeError: CUDA out of memory. The error can be avoided by setting max_length=512, but that has the unwanted side effect of truncating the texts.
Tokenizing and encoding:
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 1700

def get_encodings(texts):
    token_ids = []
    for text in texts:
        token_id = tokenizer.encode(text, add_special_tokens=True, max_length=MAX_LEN)
        token_ids.append(token_id)
    return token_ids

def pad_encodings(encodings):
    return pad_sequences(encodings, maxlen=MAX_LEN, dtype="long",
                         value=0, truncating="post", padding="post")

def get_attention_masks(padded_encodings):
    attention_masks = []
    for encoding in padded_encodings:
        attention_mask = [int(token_id > 0) for token_id in encoding]
        attention_masks.append(attention_mask)
    return attention_masks
train_encodings = get_encodings(train_df.text.values)
train_encodings = pad_encodings(train_encodings)
train_attention_masks = get_attention_masks(train_encodings)
test_encodings = get_encodings(test_df.text.values)
test_encodings = pad_encodings(test_encodings)
test_attention_masks = get_attention_masks(test_encodings)
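For reference, the padding and masking logic above does not need Keras at all; a minimal pure-Python sketch of the same "post" truncation/padding and mask construction (assuming pad id 0, as in the pad_sequences call, and using made-up example token ids):

```python
def pad_post(token_ids, max_len, pad_id=0):
    # "post" truncate, then "post" pad, mirroring the pad_sequences call above
    ids = token_ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

def mask_for(padded, pad_id=0):
    # 1 for real tokens, 0 for padding; equivalent to int(token_id > 0) when pad_id is 0
    return [int(t != pad_id) for t in padded]

ids = pad_post([101, 7592, 2088, 102], 6)
print(ids)            # [101, 7592, 2088, 102, 0, 0]
print(mask_for(ids))  # [1, 1, 1, 1, 0, 0]
```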
Packing into Dataset and DataLoader:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

X_train = torch.tensor(train_encodings)
y_train = torch.tensor(train_df.author_id.values)
train_masks = torch.tensor(train_attention_masks)

X_test = torch.tensor(test_encodings)
y_test = torch.tensor(test_df.author_id.values)
test_masks = torch.tensor(test_attention_masks)
batch_size = 1
# Create the DataLoader for our training set.
train_data = TensorDataset(X_train, train_masks, y_train)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(X_test, test_masks, y_test)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
Model setup:
from transformers import BertConfig, BertForSequenceClassification, AdamW

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

config = BertConfig.from_pretrained(
    'bert-base-uncased',
    num_labels=50,
    output_attentions=False,
    output_hidden_states=False,
    max_position_embeddings=MAX_LEN
)

model = BertForSequenceClassification(config)
model.to(device)

optimizer = AdamW(model.parameters(),
                  lr=2e-5,
                  eps=1e-8)
Training:
for epoch_i in range(0, epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        b_texts = batch[0].to(device)
        b_attention_masks = batch[1].to(device)
        b_authors = batch[2].to(device)
        model.zero_grad()
        outputs = model(b_texts,
                        token_type_ids=None,
                        attention_mask=b_attention_masks,
                        labels=b_authors)  # <------- ERROR HERE
The error:
RuntimeError: CUDA out of memory. Tried to allocate 6.00 GiB (GPU 0; 7.93 GiB total capacity; 1.96 GiB already allocated; 5.43 GiB free; 536.50 KiB cached)
Unless you are training on a TPU, your chances of having enough GPU RAM with any currently available GPU are extremely low. For some BERT models, the model alone takes well over 10 GB of RAM, and doubling the sequence length beyond 512 tokens requires even more than double the memory. For reference, a Titan RTX with 24 GB of GPU RAM (the most currently available in a single GPU) can barely fit 24 samples of 512 tokens at a time.
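The worse-than-linear growth comes from self-attention, whose score matrices are seq_len × seq_len per head. A rough back-of-the-envelope sketch (counting only fp32 attention probabilities for BERT-base's 12 layers × 12 heads, ignoring gradients, weights, and all other activations, so real usage is far higher):

```python
def attn_probs_bytes(seq_len, layers=12, heads=12, bytes_per_float=4):
    # Each layer/head materializes a seq_len x seq_len attention matrix
    return layers * heads * seq_len * seq_len * bytes_per_float

for n in (512, 1700):
    print(n, attn_probs_bytes(n) / 2**30)  # ~0.14 GiB vs ~1.55 GiB
```

Going from 512 to 1700 tokens multiplies this term by (1700/512)² ≈ 11x, which is why the allocation fails even at batch_size=1.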
Fortunately, most networks still deliver very decent performance on truncated samples, though this is of course task-specific. Also keep in mind that, unless you are training from scratch, all of the pretrained models are generally trained with the 512-token limit. To my knowledge, the only model currently supporting longer sequences is Bart, which allows up to 1024 tokens.
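If plain truncation loses too much signal for this task, one common workaround (not tied to any particular library, and sketched here only as an illustration) is to split each long document into overlapping 512-token windows, classify each window, and aggregate the per-window predictions. A minimal sketch of the windowing step:

```python
def sliding_windows(token_ids, window=512, stride=256):
    # Split a long token sequence into overlapping windows of at most `window` tokens
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return chunks

chunks = sliding_windows(list(range(1000)))
print([len(c) for c in chunks])  # [512, 512, 488]
```

Each window then fits within both the 512-token pretraining limit and GPU memory, at the cost of running the model several times per document.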