I'm currently trying to tokenize a large text, but I want to tokenize many files in a directory, since doing them one by one is very time-consuming.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch
import os
from nltk.tokenize import word_tokenize  # "word_tokensize" in the original post; assuming NLTK's word_tokenize

tokenizer = AutoTokenizer.from_pretrained("joeddav/distilbert-base-uncased-go-emotions-student")
model = AutoModelForSequenceClassification.from_pretrained("joeddav/distilbert-base-uncased-go-emotions-student")

txt = "...."
words_input_dir = "/content/sample_data/"

for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        with open(os.path.join(words_input_dir, filename), "r") as input_file:
            text = input_file.read()  # read once; a second read() on the same handle returns ""
            input_tokens = word_tokenize(text)
            tokens = tokenizer.encode_plus(text, add_special_tokens=False, return_tensors='pt')
            print(len(tokens))
This is how the tokens were read from the raw content before I added the loop:
tokens = tokenizer.encode_plus(txt, add_special_tokens = False, return_tensors = 'pt')
I tried putting the tokenizer call in a loop, but it seems to only take the one specific text that gets printed.

Regards,
I can't test it, but it looks to me like inside the loop you should append input_file.read() to txt, and only call tokenizer.encode_plus() and word_tokenize() after the loop:
# --- before loop ---
txt = ""  # start from an empty string

# --- loop ---
for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        with open(os.path.join(words_input_dir, filename), "r") as input_file:
            txt += input_file.read() + "\n"  # append this file's text

# --- after loop ---
input_tokens = word_tokenize(txt)
tokens = tokenizer.encode_plus(txt, add_special_tokens=False, return_tensors='pt')
print(tokens['input_ids'].shape[1])  # token count; len(tokens) would only count the dict fields
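One caveat with concatenating everything into a single string: DistilBERT-based models like this one can only attend to 512 tokens per sequence, so one huge txt will be truncated when it is fed to the model. If the goal is instead to classify each file on its own, a minimal sketch like the following (assuming the same words_input_dir and tokenizer as above) tokenizes all files as a single padded batch:

# a minimal sketch: tokenize every file as its own sequence in one batch
texts = []
for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        with open(os.path.join(words_input_dir, filename), "r") as input_file:
            texts.append(input_file.read())

# one padded/truncated batch; DistilBERT takes at most 512 tokens per sequence
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
print(batch['input_ids'].shape)  # (number_of_files, longest_sequence_in_tokens)

Passing the whole list to the tokenizer in one call also avoids the per-file encode_plus calls, which is usually where the one-by-one slowdown comes from.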