Pegasus tokenizer with batch processing (TypeError: TextEncodeInput must be Union)

Question · 0 votes · 1 answer

I am trying to play around with some transformer models to learn the basics of this amazing world, but for the last few days I have been stuck on the Pegasus model. I am trying to summarize the text feature of my dataset with Pegasus, so that the summaries are short enough to be used as input for the BERT tokenizer. When I run the code with map and batched = False, everything works fine, but if I switch to batched = True I get:

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

I have already:

  • checked for None/NaN values and removed them
  • checked that I am giving a list of strings as input to the tokenizer
  • tried building a list of strings from batch_samples and passing that as input

The check function (completed here with its missing definition and result_list initialization):

    def process_dictionary(input_dict):
        result_list = []
        for value in input_dict["text"]:
            if value is None:  # check for None
                result_list.append("is None")
            elif value != value:  # NaN is the only value not equal to itself
                result_list.append("is NaN")
            elif isinstance(value, str):
                result_list.append("is String")
            else:
                result_list.append(False)

        print(result_list)


    process_dictionary(batch_samples)
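For reference, the check can be exercised on a small toy batch covering every branch (the sample values below are made up for illustration; the `value != value` test works because NaN is the only value not equal to itself):

```python
def process_dictionary(input_dict):
    result_list = []
    for value in input_dict["text"]:
        if value is None:                 # missing entry
            result_list.append("is None")
        elif value != value:              # NaN is the only value != itself
            result_list.append("is NaN")
        elif isinstance(value, str):
            result_list.append("is String")
        else:
            result_list.append(False)
    return result_list

# toy batch hitting every branch
batch_samples = {"text": ["some case text", None, float("nan"), 42]}
print(process_dictionary(batch_samples))
# ['is String', 'is None', 'is NaN', False]
```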

Here is my code; you can find the dataset here

from datasets import Dataset, DatasetDict, concatenate_datasets
import pandas as pd
from transformers import AutoModel, AutoTokenizer, AutoModelForSeq2SeqLM
import torch

def preprocess_data(df):
    # adding the token-length column (d_len is computed at module level before this function is called)
    df["lb_num_token"] = d_len
    
    # Dropping NaN values
    df = df.dropna(subset=['case_text'])

    # Dropping unused features and renaming columns
    df = df.drop(columns =['case_id', 'case_title'])
    df.rename(columns={"case_text":"text", "case_outcome":"label"}, inplace= True)

    # Get the number of unique labels
    labels_list = df["label"].unique().tolist()
    
    # Splitting Dataset
    df = Dataset.from_pandas(df)
    df = df.map(lambda example: {'text': str(example['text'])})
    train_valid = df.train_test_split(test_size= 0.2, seed= 42)
    valid_test  = train_valid["test"].train_test_split(test_size= 0.5, seed= 42)
    
    df_split = DatasetDict({
    'train': train_valid['train'],
    'valid': valid_test['train'],
    'test': valid_test['test']
    })
    
    return df_split, labels_list

#Loading Dataset
df = pd.read_csv("./datasets/legal_text_classification.csv")

# number of bert token for each sample
model_ckpt = "nlpaueb/legal-bert-small-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)  
d_len = [len(tokenizer.encode(str(s))) for s in df["case_text"]]

# preprocessing dataset
df, labels_list = preprocess_data(df)

train = Dataset.from_dict(df["train"][0:9])

def pegasus_summary(batch_samples, model, tokenizer):
    # This function takes a batch of samples as input and returns the summary of each sample.
    # The summary length is capped at 400 tokens, because the output summary will be used as BERT tokenizer input.
    # LLM used: legal-pegasus
    # It is better to call this function with the model and tokenizer already defined in the main code.

    summary = ""
    # summary
    input_tokenized = tokenizer.encode(batch_samples["text"], return_tensors='pt', max_length=1024, truncation=True).to(device)
    with torch.no_grad():
        summary_ids = model.generate(input_tokenized,
                                     num_beams=9,
                                     no_repeat_ngram_size=3,
                                     length_penalty=2.0,
                                     min_length=150,
                                     max_length=400,
                                     early_stopping=True)

    summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
    return {"text": summary}

def summarizing_samples(df):
    model_ckpt_sum = "nsi319/legal-pegasus"
    tokenizer_sum = AutoTokenizer.from_pretrained(model_ckpt_sum)
    model_sum = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt_sum).to(device)
    
    df_long = df.filter(lambda example: example["lb_num_token"] > 512)
    df_short= df.filter(lambda example: example["lb_num_token"] <= 512)

    df_long = df_long.map(lambda example: pegasus_summary(example, model_sum, tokenizer_sum), batched = True)
                                                                                          
    df = concatenate_datasets([df_long, df_short])
    return df

device = "cuda" if torch.cuda.is_available() else "cpu"
train = summarizing_samples(train)

for it in train["text"]:
    print(it, "\n\n\n")

Thank you very much for your time, and I hope my English is understandable.

python pytorch nlp batch-processing huggingface-transformers
1 Answer

0 votes

The error

The code raises the error because the .encode() function requires its text argument to be of a text type (str, List[str], or List[int]), where all three represent a single sequence. In this particular case, the List[str] being passed is not interpreted as a batch of independent texts; instead, it is treated as one sentence that has already been split into words (a pretokenized sequence). Likewise, a List[int] is interpreted as the token ids of a single tokenized sequence.
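This also explains why only batched = True fails: with batched = False, map hands the function one row at a time, so batch_samples["text"] is a single str; with batched = True it hands over a whole batch, so the same expression becomes a list of strings, which .encode() rejects. A library-free sketch of the difference (map_rows below only imitates datasets.Dataset.map, it is not the real API):

```python
def map_rows(rows, fn, batched):
    """Toy imitation of datasets.Dataset.map (not the real API)."""
    if batched:
        # columns are handed to fn as lists of values
        return fn({"text": [row["text"] for row in rows]})
    # one row at a time: "text" is a single string
    return [fn(row) for row in rows]

def inspect(sample):
    # report the type that the mapped function actually receives
    return type(sample["text"]).__name__

rows = [{"text": "first case"}, {"text": "second case"}]
print(map_rows(rows, inspect, batched=False))  # ['str', 'str']
print(map_rows(rows, inspect, batched=True))   # 'list'
```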

解决方案

I simply changed the tokenization call (from tokenizer.encode, which expects a single sequence, to calling the tokenizer itself, which accepts a batch of strings) to:

 input_tokenized = tokenizer(batch_samples["text"], return_tensors='pt', max_length=1024, truncation=True, padding=True).to(device)

Then I changed the generate call to:

summary_ids = model.generate(input_ids=input_tokenized["input_ids"].to(device),
                             attention_mask=input_tokenized["attention_mask"].to(device),
                             num_beams=7,
                             no_repeat_ngram_size=3,
                             length_penalty=2.0,
                             max_length=128,
                             early_stopping=True)