LangChain text splitter and document saving question

Question

I'm experimenting with the langchain text splitter library to "chunk" (split) a large str containing a sci-fi book. I want to split it into n chunks, with an overlap of n characters between consecutive chunks.

这是我的代码:

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=30,
    chunk_overlap=5
)

text_raw = """
Water is life's matter and matrix, mother, and medium. There is no life without water.
Save water, secure the future.
Conservation is the key to a sustainable water supply.
Every drop saved today is a resource for tomorrow.
Let's work together to keep our rivers flowing and our oceans blue.
"""

chunks = text_splitter.split_text(text_raw)

print(chunks)

print(f'\n\n {len(chunks)}')

But this is my output:

["Water is life's matter and matrix, mother, and medium. There is no life without water.\nSave water, secure the future.\nConservation is the key to a sustainable water supply.\nEvery drop saved today is a resource for tomorrow.\nLet's work together to keep our rivers flowing and our oceans blue."]


 1

My intention is to split every 30 characters, with the last/leading 5 characters overlapping.

For example, if this is one chunk:

'This is one Chunk after text splitting ABC'

then I'd expect the following chunk to look something like:

'splitting ABC This is my Second Chunk ---'

Notice how the beginning of the next chunk overlaps with the last characters of the previous one?

That's what I'm looking for, but clearly that isn't how this function works. I'd really appreciate any help; I'm new to langchain, and I've checked the official documentation but haven't found an example or tutorial that does what I'm looking for.
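For reference, the overlap behaviour I'm after can be sketched in plain Python with a sliding window (this is just my own illustration of the desired output, not langchain code):

```python
def chunk_text(text: str, chunk_size: int = 30, chunk_overlap: int = 5) -> list[str]:
    """Split text into fixed-size character chunks, repeating the last
    `chunk_overlap` characters of each chunk at the start of the next."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last 3 characters of the previous one, which is exactly the overlap I want.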

It would also be very helpful if you could point me to a good way to save LangChain chunks locally, or tell me whether we have to stick with basic Python for that.
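In case it helps to show what I mean by "saving locally": since `split_text` returns a plain `List[str]`, I assume something as simple as dumping the list to a JSON file would work (the file name here is just my example):

```python
import json
import tempfile
from pathlib import Path

chunks = ["Water is life's matter and matrix,", "matrix, mother, and medium."]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "chunks.json"
    # One JSON array keeps chunk order and is trivial to reload later.
    out.write_text(json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8")
    loaded = json.loads(out.read_text(encoding="utf-8"))

print(loaded == chunks)  # True
```

But if LangChain ships something nicer for persisting chunks, I'd like to know.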

Tags: python, split, langchain, py-langchain
1 Answer

I don't think the CharacterTextSplitter subclass will quite solve this, since it splits long strings on a manually specified separator such as "\n\n". A better option seems to be TokenTextSplitter, which splits long strings based on tokens. Here is what I tried with your example:

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    chunk_size=20,
    chunk_overlap=4
)

text_raw = """Water is life's matter and matrix, mother, and medium. There is no life without water.
Save water, secure the future.
Conservation is the key to a sustainable water supply.
Every drop saved today is a resource for tomorrow.
Let's work together to keep our rivers flowing and our oceans blue.
"""

chunks = text_splitter.split_text(text_raw)

# chunks is a List[str]
for line in chunks:
    print(line.replace("\n", " "))

I noticed that for TokenTextSplitter, chunk_size and chunk_overlap refer to a number of tokens, unlike some other subclasses where they refer to a number of characters. So I think this would be a better fit. You will have to replace any mechanical separators such as "\n" with something more natural, though.
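If it helps to see the overlap behaviour concretely, here's a plain-Python sketch using whitespace-separated words. Note this is only an approximation: TokenTextSplitter counts real BPE tokens via tiktoken, not words.

```python
def chunk_words(text: str, chunk_size: int = 20, chunk_overlap: int = 4) -> list[str]:
    """Sliding window over whitespace-separated words; each chunk repeats
    the previous chunk's last `chunk_overlap` words. This only illustrates
    the overlap idea; TokenTextSplitter works on BPE tokens, not words."""
    words = text.split()
    step = chunk_size - chunk_overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - chunk_overlap, 1), step)]

text = " ".join(f"w{i}" for i in range(30))
chunks = chunk_words(text, chunk_size=10, chunk_overlap=2)
print(len(chunks))            # 4
print(chunks[1].split()[:2])  # ['w8', 'w9']
```

The second chunk starts with the last two words of the first, which is the same overlap pattern chunk_overlap produces in the splitter.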
