LangChain text splitter and document saving question

Question

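I am trying to use the langchain text splitters library to "chunk", or split, a very large str that contains sci-fi books. I want to split it into n_chunks, with an overlap of n_length characters between consecutive chunks.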

Here is my code:

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=30,
    chunk_overlap=5
)

text_raw = """
Water is life's matter and matrix, mother, and medium. There is no life without water.
Save water, secure the future.
Conservation is the key to a sustainable water supply.
Every drop saved today is a resource for tomorrow.
Let's work together to keep our rivers flowing and our oceans blue.
"""

chunks=text_splitter.split_text(text_raw)

print(chunks)

print(f'\n\n {len(chunks)}')

But this is my output:

["Water is life's matter and matrix, mother, and medium. There is no life without water.\nSave water, secure the future.\nConservation is the key to a sustainable water supply.\nEvery drop saved today is a resource for tomorrow.\nLet's work together to keep our rivers flowing and our oceans blue."]


 1

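My intention is to split every 30 characters, with each chunk overlapping the last 5 characters of the previous one.

For example, if this is one chunk: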

'This is one Chunk after text splitting ABC'

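then I would like the following chunk to look something like this:

'splitting ABC This is my Second Chunk ---'

Notice how the start of the next chunk overlaps the last characters of the previous chunk?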

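That is what I am looking for, but clearly that is not how this function works. I would be very grateful if you could help me out. I am new to langchain, and I have checked the official documentation, but I have not found an example or tutorial like the one I am looking for.

It would also be very helpful if you could point me to a good way to save the chunks from LangChain locally, or tell me whether we have to stick to basic Python for that.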

Answer 1

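I don't think the CharacterTextSplitter subclass quite solves this problem, because it splits long strings on a manually specified separator, such as " ". A better option seems to be TokenTextSplitter, which splits long strings by tokens. Here is what I tried, based on your example: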

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter.from_tiktoken_encoder(
    chunk_size=20,
    chunk_overlap=4
)

text_raw = """Water is life's matter and matrix, mother, and medium. There is no life without water.
Save water, secure the future.
Conservation is the key to a sustainable water supply.
Every drop saved today is a resource for tomorrow.
Let's work together to keep our rivers flowing and our oceans blue.
"""

chunks = text_splitter.split_text(text_raw)

# chunks is a List[str]
for line in chunks:
    print(line.replace("\n", " "))

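I noticed that for TokenTextSplitter, chunk_size and chunk_overlap are measured in tokens (roughly word-sized pieces), whereas for some of the other subclasses they are measured in characters. So I think this would be a better choice. You do, however, have to replace any mechanical separator, such as " ", with something more natural.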
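If you need exactly 30 characters per chunk with a 5-character overlap, and want to keep the chunks on disk, basic Python may be enough. For context, the single chunk in the question is most likely explained by CharacterTextSplitter's default separator, "\n\n", which never occurs in that text. The sketch below is not a LangChain API: the names chunk_text, save_chunks and load_chunks are hypothetical helpers, and JSON is just one assumed storage format.

```python
import json
from pathlib import Path


def chunk_text(text: str, chunk_size: int = 30, chunk_overlap: int = 5) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper (not part of LangChain): each chunk starts
    chunk_size - chunk_overlap characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so no
        # trailing chunk made only of already-seen overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


def save_chunks(chunks: list[str], path: str) -> None:
    """Write the chunks to a JSON file (assumed format: a list of strings)."""
    Path(path).write_text(
        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_chunks(path: str) -> list[str]:
    """Read back a list of chunks written by save_chunks."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


text_raw = """Water is life's matter and matrix, mother, and medium. There is no life without water.
Save water, secure the future.
Conservation is the key to a sustainable water supply.
Every drop saved today is a resource for tomorrow.
Let's work together to keep our rivers flowing and our oceans blue.
"""

chunks = chunk_text(text_raw, chunk_size=30, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Calling save_chunks(chunks, "chunks.json") and later load_chunks("chunks.json") round-trips the list; the file name is only an example. Note that this windowing is purely character-based, so it will cut through words and newlines, unlike the token-based splitter above.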
