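I'm trying to use a function from the langchain text splitters library to "chunk", or split, a large str containing a sci-fi book. I want to split it into n_chunks with an overlap of n_length.
Here is my code: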
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    chunk_size=30,
    chunk_overlap=5
)
text_raw = """
Water is life's matter and matrix, mother, and medium. There is no life without water.
Save water, secure the future.
Conservation is the key to a sustainable water supply.
Every drop saved today is a resource for tomorrow.
Let's work together to keep our rivers flowing and our oceans blue.
"""
chunks=text_splitter.split_text(text_raw)
print(chunks)
print(f'\n\n {len(chunks)}')
But this is my output:
["Water is life's matter and matrix, mother, and medium. There is no life without water.\nSave water, secure the future.\nConservation is the key to a sustainable water supply.\nEvery drop saved today is a resource for tomorrow.\nLet's work together to keep our rivers flowing and our oceans blue."]
1
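My goal is to split every 30 characters, with each chunk overlapping the last 5 characters of the previous one.
For example, if this is one chunk: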
'This is one Chunk after text splitting ABC'
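then I want the following chunk to look something like:
'splitting ABC This is my Second Chunk ---'
Notice how the start of the next chunk overlaps the last characters of the previous chunk?
That's what I'm after, but evidently this is not how the function works. I'd really appreciate any help. I'm new to langchain and have checked the official documentation, but I haven't found an example or tutorial like what I'm looking for.
It would also be very helpful if you could point to a good way of saving LangChain chunks locally, or tell me whether we have to stick with basic Python for that.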
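On saving chunks locally: since split_text returns a plain list of strings, basic Python seems sufficient. Below is a minimal sketch using the standard json module. The helper names save_chunks and load_chunks, the sample chunks, and the chunks.json filename are all hypothetical and only for illustration; they are not part of LangChain.

```python
import json
import tempfile
from pathlib import Path


def save_chunks(chunks, path):
    """Write a list of text chunks to a JSON file (hypothetical helper)."""
    Path(path).write_text(
        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_chunks(path):
    """Read the list of text chunks back from the JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Placeholder data standing in for the output of text_splitter.split_text(...)
sample_chunks = ["Water is life's matter", "matter and matrix, mother"]

# A temporary directory keeps the example self-contained; use any path you like.
out_path = Path(tempfile.mkdtemp()) / "chunks.json"
save_chunks(sample_chunks, out_path)
restored_chunks = load_chunks(out_path)
print(restored_chunks)
```

JSON keeps the chunk order and any embedded newlines intact, which a one-chunk-per-line text file would not.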
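I don't think the CharacterTextSplitter subclass quite solves this, because it splits a long string on a manually specified separator (the default is "\n\n"). Your sample text contains only single newlines, so it is never split and comes back as one chunk. A better option seems to be TokenTextSplitter, which splits a long string by tokens. Here is what I tried, based on your example: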
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    chunk_size=20,
    chunk_overlap=4
)
text_raw = """Water is life's matter and matrix, mother, and medium. There is no life without water.
Save water, secure the future.
Conservation is the key to a sustainable water supply.
Every drop saved today is a resource for tomorrow.
Let's work together to keep our rivers flowing and our oceans blue.
"""
chunks = text_splitter.split_text(text_raw)
# chunks is a List[str]
for line in chunks:
    # Replace newlines with spaces so each chunk prints on one line
    print(line.replace("\n", " "))
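I noticed that for TokenTextSplitter, chunk_size and chunk_overlap count tokens (roughly words or word pieces), unlike some other subclasses, where they count characters. So I think this is the better choice. You do have to replace any mechanical separators such as "\n" with something more natural, though.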
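If you need exact character counts rather than tokens, basic Python may be enough. Here is a minimal sketch of a sliding window over characters, where each chunk starts chunk_size - chunk_overlap characters after the previous one. Note that chunk_by_chars is a hypothetical helper written for this answer, not a LangChain function, and it cuts words in half without regard to meaning.

```python
def chunk_by_chars(text: str, chunk_size: int = 30, chunk_overlap: int = 5) -> list[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so no trailing chunk made only of overlap is produced.
        if start + chunk_size >= len(text):
            break
    return chunks


text_raw = (
    "Water is life's matter and matrix, mother, and medium. "
    "There is no life without water."
)
chunks = chunk_by_chars(text_raw, chunk_size=30, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Every chunk except possibly the last is exactly 30 characters long, and the first 5 characters of each chunk repeat the last 5 of the one before it.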