Tensor size error when generating document embeddings with a pretrained HuggingFace model


I am trying to get document embeddings using a pretrained model from the HuggingFace Transformers library. The input is a document, and the output is that document's embedding from the pretrained model. However, I get the error below and don't know how to fix it.

Code:

from transformers import pipeline, AutoTokenizer, AutoModel
from transformers import RobertaTokenizer, RobertaModel
import fitz  # PyMuPDF, used for PDF text extraction
from openpyxl import load_workbook
import os
from tqdm import tqdm

PRETRAIN_MODEL = 'distilbert-base-cased'  # max input length: 512 tokens
DIR = "dataset"

# Load and process the text
all_files = os.listdir(DIR)
pdf_texts = {}
for filename in all_files:
    if filename.lower().endswith('.pdf'):
        pdf_path = os.path.join(DIR, filename)
        with fitz.open(pdf_path) as doc:
            text_content = ""
            for page in doc:
                text_content += page.get_text()
            # Keep only the text before the "PUBLIC CONSULTATION" section
            text = text_content.split("PUBLIC CONSULTATION")[0]
            project_code = os.path.splitext(filename)[0]
            pdf_texts[project_code] = text 

# Generate embeddings for the documents
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)
pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

embeddings = {}
for project_code, text in tqdm(pdf_texts.items(), desc="Generating embeddings", unit="doc"):
    embedding = pipe(text, return_tensors="pt")
    # Take the first ([CLS]) token's hidden state as the document embedding
    embeddings[project_code] = embedding[0][0].numpy()

Error:

The error occurs on the line

embedding = pipe(text, return_tensors="pt")

The output is as follows:

Generating embeddings:   0%|          | 0/58 [00:00<?, ?doc/s]Token indices sequence length is longer than the specified maximum sequence length for this model (3619 > 512). Running this sequence through the model will result in indexing errors
RuntimeError: The size of tensor a (3619) must match the size of tensor b (512) at non-singleton dimension 1

Input document: https://drive.google.com/file/d/17yFOR0dQ8UMbefFed5QPZUXqU0vzifUw/view?usp=sharing

Thanks!

huggingface-transformers large-language-model word-embedding huggingface pre-trained-model
1 Answer

Your text variable tokenizes to 3619 tokens, but the model behind the pipeline accepts at most 512. You can work around this either by splitting the text into chunks of at most 512 tokens and combining the chunk embeddings, or by using a model that accepts longer sequences; a sketch of the chunking approach follows.
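A minimal sketch of the chunking approach, reusing the tokenizer and model from your code (the mean-pooling strategy and the helper name embed_long_text are my own choices, not something prescribed by the library):

import torch

def embed_long_text(text, tokenizer, model, max_length=512):
    # Tokenize the whole document once, without special tokens or truncation
    input_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    stride = max_length - 2  # leave room for [CLS] and [SEP] in each chunk
    chunk_embeddings = []
    for start in range(0, len(input_ids), stride):
        chunk = input_ids[start:start + stride]
        # Re-add the special tokens the model expects
        chunk = tokenizer.build_inputs_with_special_tokens(chunk)
        with torch.no_grad():
            out = model(input_ids=torch.tensor([chunk]))
        # Mean-pool the token embeddings of this chunk
        chunk_embeddings.append(out.last_hidden_state[0].mean(dim=0))
    # Average the chunk embeddings into a single document vector
    return torch.stack(chunk_embeddings).mean(dim=0).numpy()

embeddings = {}
for project_code, text in pdf_texts.items():
    embeddings[project_code] = embed_long_text(text, tokenizer, model)

For the second option, a model such as allenai/longformer-base-4096 accepts sequences of up to 4096 tokens. Depending on your transformers version, you may also be able to pass tokenize_kwargs={"truncation": True} when constructing the feature-extraction pipeline, which avoids the error by simply discarding everything past the first 512 tokens.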
