I am playing around with some Transformers to learn the basics of this amazing world, but for the last few days I have been stuck on the Pegasus model. I am trying to summarize the text feature of a dataset, using Pegasus as the summarization model, so that the summaries are short enough to feed to a BERT tokenizer. When I run the code mapping the function with batched=False everything works fine, but if I switch to batched=True I get:

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
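As far as I understand, the shape of what map() passes to the function changes with the batched flag; here is a minimal sketch with a toy two-row dataset (the dataset itself is just an illustration, not from my code):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["first sample", "second sample"]})

def show(batch_samples):
    # batched=False -> batch_samples["text"] is a str (one example at a time)
    # batched=True  -> batch_samples["text"] is a List[str] (the whole batch)
    print(type(batch_samples["text"]))
    return batch_samples

ds.map(show, batched=False)  # prints <class 'str'> once per row
ds.map(show, batched=True)   # prints <class 'list'> once per batch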
I have already checked the feature with this function:
def process_dictionary(input_dict):
    result_list = []
    for value in input_dict["text"]:
        if value is None:  # Check for None
            result_list.append("is None")
        elif value != value:  # NaN is the only value that is not equal to itself
            result_list.append("is NaN")
        elif isinstance(value, str):
            result_list.append("is String")
        else:
            result_list.append(False)
    print(result_list)

process_dictionary(batch_samples)
Here is my code; you can find the dataset here:
from datasets import Dataset, DatasetDict, concatenate_datasets
import pandas as pd
from transformers import AutoModel, AutoTokenizer, AutoModelForSeq2SeqLM
import torch
def preprocess_data(df):
    # Adding token_length column (d_len is computed in the main code below)
    df["lb_num_token"] = d_len
    # Dropping NaN values
    df = df.dropna(subset=['case_text'])
    # Dropping unused features and renaming columns
    df = df.drop(columns=['case_id', 'case_title'])
    df.rename(columns={"case_text": "text", "case_outcome": "label"}, inplace=True)
    # Get the list of unique labels
    labels_list = df["label"].unique().tolist()
    # Splitting dataset
    df = Dataset.from_pandas(df)
    df = df.map(lambda example: {'text': str(example['text'])})
    train_valid = df.train_test_split(test_size=0.2, seed=42)
    valid_test = train_valid["test"].train_test_split(test_size=0.5, seed=42)
    df_split = DatasetDict({
        'train': train_valid['train'],
        'valid': valid_test['train'],
        'test': valid_test['test']
    })
    return df_split, labels_list
# Loading dataset
df = pd.read_csv("./datasets/legal_text_classification.csv")
# Number of BERT tokens for each sample
model_ckpt = "nlpaueb/legal-bert-small-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
d_len = [len(tokenizer.encode(str(s))) for s in df["case_text"]]
# Preprocessing dataset
df, labels_list = preprocess_data(df)
train = Dataset.from_dict(df["train"][0:9])
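As a side note on the d_len loop above: the same per-sample lengths can be obtained with a single batched tokenizer call, which is usually much faster. A sketch, assuming it replaces the list comprehension while df is still the raw DataFrame:

# Hypothetical speed-up, not part of the original script:
encodings = tokenizer([str(s) for s in df["case_text"]])  # one call for all samples
d_len = [len(ids) for ids in encodings["input_ids"]]      # same lengths as the loop above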
def pegasus_summary(batch_samples, model, tokenizer):
    # This function takes a batch of samples as input and returns the summary of each sample.
    # The summary length is capped at 400 tokens, because the output summary will be used as BERT tokenizer input.
    # LLM used: legal-pegasus
    # It is better to call this function with the model and tokenizer already defined in the main code.
    summary = ""
    # Summary
    input_tokenized = tokenizer.encode(batch_samples["text"], return_tensors='pt', max_length=1024, truncation=True).to(device)
    with torch.no_grad():
        summary_ids = model.generate(input_tokenized,
                                     num_beams=9,
                                     no_repeat_ngram_size=3,
                                     length_penalty=2.0,
                                     min_length=150,
                                     max_length=400,
                                     early_stopping=True)
    summary = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
    return {"text": summary}
def summarizing_samples(df):
    model_ckpt_sum = "nsi319/legal-pegasus"
    tokenizer_sum = AutoTokenizer.from_pretrained(model_ckpt_sum)
    model_sum = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt_sum).to(device)
    df_long = df.filter(lambda example: example["lb_num_token"] > 512)
    df_short = df.filter(lambda example: example["lb_num_token"] <= 512)
    df_long = df_long.map(lambda example: pegasus_summary(example, model_sum, tokenizer_sum), batched=True)
    df = concatenate_datasets([df_long, df_short])
    return df
device = "cuda" if torch.cuda.is_available() else "cpu"
train = summarizing_samples(train)
for it in train["text"]:
    print(it, "\n\n\n")
Thank you very much for your time, and I hope my English is understandable.
Error

The code produces the error because the .encode() function requires its text argument to be of text type (str, List[str] or List[int]). In this specific case, the input provided as a List[str] is not interpreted as a list of text values; instead, it is treated as a single, already-tokenized sentence. Likewise, when a List[int] is used, the function expects a sequence of tokenized string ids.
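A minimal repro of the difference (the model name is just the one from my code; any fast tokenizer behaves the same):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nsi319/legal-pegasus")

tok.encode("a single sentence")  # OK: a plain str
# tok.encode(["sentence one", "sentence two"])
# -> TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
tok(["sentence one", "sentence two"], padding=True)  # OK: calling the tokenizer accepts a batch of sentences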
Solution

I just changed the "tokenizing" line to call the tokenizer itself instead of .encode(), adding padding=True so the whole batch can be returned as one rectangular tensor:

input_tokenized = tokenizer(batch_samples["text"], return_tensors='pt', padding=True, max_length=1024, truncation=True).to(device)
Then I changed the generate call to:

summary_ids = model.generate(input_ids=input_tokenized["input_ids"].to(device),
                             attention_mask=input_tokenized["attention_mask"].to(device),
                             num_beams=7,
                             no_repeat_ngram_size=3,
                             length_penalty=2.0,
                             max_length=128,
                             early_stopping=True)
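Putting both fixes together, the batched pegasus_summary becomes something like the sketch below. Note one extra change that batched=True requires: the function must return one summary per input sample, so the full list from batch_decode is returned instead of taking element [0]:

def pegasus_summary(batch_samples, model, tokenizer):
    # batch_samples["text"] is a List[str] when map() is called with batched=True
    input_tokenized = tokenizer(batch_samples["text"],
                                return_tensors='pt',
                                padding=True,
                                max_length=1024,
                                truncation=True).to(device)
    with torch.no_grad():
        summary_ids = model.generate(input_ids=input_tokenized["input_ids"],
                                     attention_mask=input_tokenized["attention_mask"],
                                     num_beams=7,
                                     no_repeat_ngram_size=3,
                                     length_penalty=2.0,
                                     max_length=128,
                                     early_stopping=True)
    # One decoded summary per sample in the batch
    summaries = tokenizer.batch_decode(summary_ids,
                                       skip_special_tokens=True,
                                       clean_up_tokenization_spaces=False)
    return {"text": summaries}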