Tokenizing multiple files in Python


I am currently trying to tokenize a large text, but I want to tokenize many files in a directory, because doing it one by one is very time-consuming.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch
import os
from nltk.tokenize import word_tokenize


tokenizer = AutoTokenizer.from_pretrained("joeddav/distilbert-base-uncased-go-emotions-student")
model = AutoModelForSequenceClassification.from_pretrained("joeddav/distilbert-base-uncased-go-emotions-student")

txt="...."

words_input_dir = "/content/sample_data/"

for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        with open(filename, "r") as input_file:
            input_tokens = word_tokenize(input_file.read())

tokens = tokenizer.encode_plus(input_file.read(), add_special_tokens = False, return_tensors = 'pt')

print(len(tokens))

This is how tokens was read originally, before I added the loop:

tokens = tokenizer.encode_plus(txt, add_special_tokens = False, return_tensors = 'pt')

I tried looping the tokenization function, but it seems to only pick up one specific piece of printed text.

Regards,

python nlp token huggingface-transformers huggingface-tokenizers
1 Answer

I can't test it, but it looks to me like inside the loop you should append input_file.read() to txt, and only after the loop run word_tokenize() and tokenizer.encode_plus() on the accumulated text.

# --- before loop ---

txt = ""  # empty string at start

# --- loop ---

for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        with open(os.path.join(words_input_dir, filename), "r") as input_file:  # full path, not just the file name
            txt += input_file.read() + "\n"  # append text from file

# --- after loop ---

input_tokens = word_tokenize(txt)

tokens = tokenizer.encode_plus(txt, add_special_tokens=False, return_tensors='pt')

print(len(tokens))                   # number of fields in the encoding (input_ids, attention_mask)
print(tokens['input_ids'].shape[1])  # actual number of tokens
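
If you actually need a separate encoding per file rather than one concatenated text, a batched call may be more convenient, since the tokenizer also accepts a list of strings. The following is an untested sketch, not part of the answer above: it reuses the question's model and directory, and the padding/truncation settings are my own additions (DistilBERT accepts at most 512 tokens per sequence).

import os
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("joeddav/distilbert-base-uncased-go-emotions-student")

words_input_dir = "/content/sample_data/"  # same directory as in the question

texts = []
for filename in os.listdir(words_input_dir):
    if filename.endswith(".txt"):
        with open(os.path.join(words_input_dir, filename), "r") as input_file:
            texts.append(input_file.read())  # one string per file

# Tokenize all files in one call; each row of input_ids corresponds to one file.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

print(batch['input_ids'].shape)  # (number_of_files, longest_sequence_length)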