Tensor size error when generating document embeddings with a pretrained HuggingFace model


I am trying to get document embeddings using a pretrained model from the HuggingFace Transformers library. The input is a document, and the output is that document's embedding from the pretrained model. However, I get the error below and don't know how to fix it.

Code:

from transformers import pipeline, AutoTokenizer, AutoModel
from transformers import RobertaTokenizer, RobertaModel
import fitz  # PyMuPDF, used for PDF text extraction
from openpyxl import load_workbook
import os
from tqdm import tqdm

PRETRAIN_MODEL = 'distilbert-base-cased'  # max input length: 512 tokens
DIR = "dataset"

# Load and process the text
all_files = os.listdir(DIR)
pdf_texts = {}
for filename in all_files:
    if filename.lower().endswith('.pdf'):
        pdf_path = os.path.join(DIR, filename)
        with fitz.open(pdf_path) as doc:
            text_content = ""
            for page in doc:
                text_content += page.get_text()
            # Keep only the text before the "PUBLIC CONSULTATION" section
            text = text_content.split("PUBLIC CONSULTATION")[0]
            project_code = os.path.splitext(filename)[0]
            pdf_texts[project_code] = text 

# Generate embeddings for the documents
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)
pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

embeddings = {}
for project_code, text in tqdm(pdf_texts.items(), desc="Generating embeddings", unit="doc"):
    embedding = pipe(text, return_tensors="pt")
    # Take the first ([CLS]) token's hidden state as the document embedding
    embeddings[project_code] = embedding[0][0].numpy()

Error:

The error occurs on the line

embedding = pipe(text, return_tensors="pt")

The output is as follows:

Generating embeddings:   0%|          | 0/58 [00:00<?, ?doc/s]Token indices sequence length is longer than the specified maximum sequence length for this model (3619 > 512). Running this sequence through the model will result in indexing errors
RuntimeError: The size of tensor a (3619) must match the size of tensor b (512) at non-singleton dimension 1

Input document: https://drive.google.com/file/d/17yFOR0dQ8UMbefFed5QPZUXqU0vzifUw/view?usp=sharing

Thanks!

huggingface-transformers large-language-model word-embedding huggingface pre-trained-model
1 Answer

Your text variable tokenizes to 3619 tokens, but the model behind the pipeline accepts at most 512. You can work around this either by splitting the text into chunks of at most 512 tokens and combining the chunk embeddings, or by using a model that accepts longer sequences; a sketch of the chunking approach follows.
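A minimal sketch of the chunking approach, reusing the tokenizer and model from your code (the mean-pooling strategy and the helper name embed_long_text are my own choices, not something prescribed by the library):

import torch

def embed_long_text(text, tokenizer, model, max_length=512):
    # Tokenize the whole document once, without special tokens or truncation
    input_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    stride = max_length - 2  # leave room for [CLS] and [SEP] in each chunk
    chunk_embeddings = []
    for start in range(0, len(input_ids), stride):
        chunk = input_ids[start:start + stride]
        # Re-add the special tokens the model expects
        chunk = tokenizer.build_inputs_with_special_tokens(chunk)
        with torch.no_grad():
            out = model(input_ids=torch.tensor([chunk]))
        # Mean-pool the token embeddings of this chunk
        chunk_embeddings.append(out.last_hidden_state[0].mean(dim=0))
    # Average the chunk embeddings into a single document vector
    return torch.stack(chunk_embeddings).mean(dim=0).numpy()

embeddings = {}
for project_code, text in pdf_texts.items():
    embeddings[project_code] = embed_long_text(text, tokenizer, model)

For the second option, a model such as allenai/longformer-base-4096 accepts sequences of up to 4096 tokens. Depending on your transformers version, you may also be able to pass tokenize_kwargs={"truncation": True} when constructing the feature-extraction pipeline, which avoids the error by simply discarding everything past the first 512 tokens.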
