How does `enforce_stop_tokens` work in LangChain with Huggingface models?

votes: 0, answers: 1

When we look at the HuggingFaceHub model usage in `langchain`, there is a part where the author could not figure out how to stop the generation, https://github.com/hwchase17/langchain/blob/master/langchain/llms/huggingface_pipeline.py#L182:

class HuggingFacePipeline(LLM):
        ...
    def _call(
        ...
        if stop is not None:
            # This is a bit hacky, but I can't figure out a better way to enforce
            # stop tokens when making calls to huggingface_hub.
            text = enforce_stop_tokens(text, stop)
        return text

What should I use to add stop tokens to the end of my template?


If we look at https://github.com/hwchase17/langchain/blob/master/langchain/llms/utils.py, it is just a regex split that splits the input string on the list of stop words and takes the first partition of `re.split`:

re.split("|".join(stop), text)[0]
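Note that this joins the stop strings directly into a regex pattern, so a stop word containing a metacharacter (e.g. `?` or `.`) would be misinterpreted or even raise an error. A minimal sketch of an escaped variant (the helper name `enforce_stop_tokens_escaped` is mine, not LangChain's):

```python
import re

def enforce_stop_tokens_escaped(text, stop):
    """Cut off the text at the first stop string, escaping regex metacharacters."""
    pattern = "|".join(re.escape(s) for s in stop)
    return re.split(pattern, text)[0]

# "?" on its own is an invalid regex ("nothing to repeat"), but works once escaped
print(enforce_stop_tokens_escaped("Answer: 42. Next question?", ["?", "."]))  # Answer: 42
```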

Let's try to get a generation output from a Huggingface model, e.g.

from transformers import pipeline
from transformers import GPT2LMHeadModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
output = generator("Hey Pizza! ")
output

[out]:

[{'generated_text': 'Hey Pizza! 」\n\n「Hurry up, leave the place! 」\n\n「Oi! 」\n\nWhile eating pizza and then, Yuigahama came in contact with Ruriko in the middle of the'}]
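(As an aside, GPT-2's text-generation pipeline samples by default, so your continuation will differ from the one above on every run. A sketch using `set_seed` from transformers if you want reproducible outputs:)

```python
from transformers import pipeline, set_seed

set_seed(42)  # fix the sampling RNG so repeated runs give the same continuation
generator = pipeline('text-generation', model='gpt2')
output = generator("Hey Pizza! ", max_length=30)
print(output[0]['generated_text'])  # starts with the prompt "Hey Pizza! "
```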

If we apply `re.split`:

import re
def enforce_stop_tokens(text, stop):
    """Cut off the text as soon as any stop words occur."""
    return re.split("|".join(stop), text)[0]

stop = ["up", "then"]
text = output[0]['generated_text']

re.split("|".join(stop), text)

[out]:

['Hey Pizza! 」\n\n「Hurry ',
 ', leave the place! 」\n\n「Oi! 」\n\nWhile eating pizza and ',
 ', Yuigahama came in contact with Ruriko in the middle of the']

But that's not what I want; I want to split at the end of the generation. What tokens do I use for `enforce_stop_tokens`?

huggingface-transformers stop-words langchain llm text-generation
1 Answer
0
votes

You can do this by setting the `eos_token_id` to your stop terms -- in my testing, it seemed to work with a list. See below: the regex cuts off the stop word itself, while `eos_token_id` cuts off just after the stop word ("Once upon a time" vs. "Once upon a").


from transformers import GPT2LMHeadModel, GPT2Tokenizer
import re

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Define your custom stop terms
stop_terms = [ "right", "time"]

# Ensure the stop terms are in the tokenizer's vocabulary
for term in stop_terms:
    if term not in tokenizer.get_vocab():
        tokenizer.add_tokens([term])
        model.resize_token_embeddings(len(tokenizer))

def enforce_stop_tokens(text, stop):
    """Cut off the text as soon as any stop words occur."""
    return re.split("|".join(stop), text)[0]

# Get the token IDs for your custom stop terms
eos_token_ids_custom = [tokenizer.encode(term, add_prefix_space=True)[0] for term in stop_terms]

# Generate text
input_text = "Once upon "
input_ids = tokenizer.encode(input_text, return_tensors='pt')
output_ids = model.generate(input_ids, eos_token_id=eos_token_ids_custom, max_length=50)

# Decode the output IDs to text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generated_text) # Once upon a time

print("ENFORCE STOP TOKENS")

truncated_text = enforce_stop_tokens(generated_text, stop_terms)

print(truncated_text) # Once upon a 
