I need to compute word embeddings for a set of documents using different language models. That part is not the problem, the script works fine; the problem is that I am working on a laptop with no GPU, and each text takes about 1.5 s to process, which is far too long (I have tens of thousands of texts to process).
Here is what I do with PyTorch and the transformers library.
import torch
from transformers import CamembertModel, CamembertTokenizer
docs = [text1, text2, ..., text20000]
tok = CamembertTokenizer.from_pretrained('camembert-base')
model = CamembertModel.from_pretrained('camembert-base', output_hidden_states=True)
# let's try with a batch size of 64 documents
docids = [tok.encode(doc, max_length=512, truncation=True,
                     padding='max_length', return_tensors='pt')
          for doc in docs[:64]]
ids = torch.cat(docids)
# the padded positions have to be masked, otherwise the model attends to them
mask = (ids != tok.pad_token_id).long()
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # cpu in my case...
model = model.to(device)
ids = ids.to(device)
mask = mask.to(device)
model.eval()
with torch.no_grad():
    out = model(input_ids=ids, attention_mask=mask)
# 103 s later...
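
For context, this is roughly how I intend to loop over the whole corpus. The embed_batch helper and the mean pooling are my own choices, not anything imposed by CamemBERT, and I am assuming a transformers version where the tokenizer is callable on a list of texts (which also lets each batch pad only to its longest text instead of always to 512):

def embed_batch(texts):
    # reuses tok, model and device from above; tokenize the whole batch at
    # once, padding only to the longest text in this batch
    enc = tok(texts, max_length=512, truncation=True,
              padding='longest', return_tensors='pt')
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        out = model(**enc)
    # mean-pool the last hidden state over the non-padding tokens
    m = enc['attention_mask'].unsqueeze(-1).float()
    return (out[0] * m).sum(dim=1) / m.sum(dim=1)

embeddings = []
for i in range(0, len(docs), 64):
    embeddings.append(embed_batch(docs[i:i + 64]))
embeddings = torch.cat(embeddings)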
Does anyone have ideas or suggestions for speeding this up?
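
For what it's worth, one direction I have been reading about is PyTorch's dynamic quantization for CPU inference. A minimal sketch of what I mean (I have not verified what it does to embedding quality):

import torch

# dynamic quantization swaps the Linear layers for int8 versions;
# it is CPU-only, which matches my setup
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
qmodel.eval()
with torch.no_grad():
    out = qmodel(input_ids=ids, attention_mask=mask)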