My end goal is to read the contents of a file and create a vector store of the data that I can query later.
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
loader = TextLoader("elon_musk.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
There seems to be some problem with my data file, so its contents cannot be read. Is it possible to load the file as UTF-8? My assumption is that with UTF-8 encoding I should not run into this problem.
Here is the error my code produces:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
File ~\anaconda3\envs\langchain-test\lib\site-packages\langchain\document_loaders\text.py:41, in TextLoader.load(self)
40 with open(self.file_path, encoding=self.encoding) as f:
---> 41 text = f.read()
42 except UnicodeDecodeError as e:
File ~\anaconda3\envs\langchain-test\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1897: character maps to <undefined>
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
Cell In[1], line 8
4 from langchain.document_loaders import TextLoader
7 loader = TextLoader("elon_musk.txt")
----> 8 documents = loader.load()
9 text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
10 docs = text_splitter.split_documents(documents)
File ~\anaconda3\envs\langchain-test\lib\site-packages\langchain\document_loaders\text.py:54, in TextLoader.load(self)
52 continue
53 else:
---> 54 raise RuntimeError(f"Error loading {self.file_path}") from e
55 except Exception as e:
56 raise RuntimeError(f"Error loading {self.file_path}") from e
RuntimeError: Error loading elon_musk.txt
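The last frame of the traceback tells the story: with no `encoding` argument, the file was opened with the Windows default cp1252 codec, and byte 0x9d has no mapping in that code page. This can be reproduced in isolation (a stdlib-only sketch, independent of LangChain):

```python
# 0x9d is one of the few bytes cp1252 leaves undefined. It is also the
# last byte of the UTF-8 sequence for a right curly quote, a likely
# culprit in a text file copied from the web.
def try_decode(data, encoding):
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

print(try_decode(b"\x9d", "cp1252"))         # None: same failure as the traceback
print(try_decode(b"\xe2\x80\x9d", "utf-8"))  # a right curly quote, ending in 0x9d
```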
How can I solve this problem?
I ran into the same issue. The code runs fine in Colab (Unix) but not in VS Code. I tried Mark's suggestion with no luck, checked that VS Code's encoding preference is UTF-8, verified the file is byte-for-byte identical on both machines, and even made sure they run the same Python version!
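A quick way to see why identical code behaves differently on the two machines is to print the encoding that `open()` falls back to when none is given (stdlib only, nothing LangChain-specific; on most Unix systems it is UTF-8, while many Windows installs use a legacy code page such as cp1252):

```python
import locale
import sys

# The str<->bytes default: utf-8 on every modern Python, on any OS.
print(sys.getdefaultencoding())

# What open() actually uses when you pass no encoding argument;
# this is the value that differs between Colab and a Windows VS Code setup.
print(locale.getpreferredencoding(False))
```

If the second line prints something like `cp1252`, that machine needs an explicit `encoding="utf-8"` (or the autodetect option below).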
This worked for me. When using TextLoader, do this:
loader = TextLoader("elon_musk.txt", encoding = 'UTF-8')
When using DirectoryLoader, instead of this:
loader = DirectoryLoader("./new_articles/", glob="./*.txt", loader_cls=TextLoader)
do this:
text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader("./new_articles/", glob="./*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
This does not look like a LangChain problem; the input file's encoding simply isn't valid Unicode for the codec being used.
Following separation of concerns, I would first re-encode the file to UTF-8 and only then hand it to LangChain:
# Read the file using the correct encoding
with open("elon_musk.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Write the text back to a new file, ensuring it's in UTF-8 encoding
with open("elon_musk_utf8.txt", "w", encoding="utf-8") as f:
    f.write(text)

loader = TextLoader("elon_musk_utf8.txt")
documents = loader.load()
[Optional] If that first read with UTF-8 fails (because of some unexpected foreign character encoding in the input file), I would let Python work out the file's actual encoding and pass that to `open`. To detect the encoding, I would use the `chardet` library like this:
import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding']

encoding = detect_encoding("elon_musk.txt")
with open("elon_musk.txt", 'r', encoding=encoding) as f:
    text = f.read()
with open("elon_musk_utf8.txt", 'w', encoding='utf-8') as f:
    f.write(text)

loader = TextLoader("elon_musk_utf8.txt")
documents = loader.load()
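If even chardet guesses wrong, a hypothetical last resort is to decode with `errors="replace"`, which turns each undecodable byte into U+FFFD instead of raising. It is lossy, so only use it when a few mangled characters are acceptable:

```python
# Simulate a file containing a stray non-UTF-8 byte (assumed file name).
raw = "café".encode("utf-8") + b"\x9d"
with open("messy.txt", "wb") as f:
    f.write(raw)

# errors="replace" swaps undecodable bytes for the U+FFFD replacement
# character rather than raising UnicodeDecodeError.
with open("messy.txt", "r", encoding="utf-8", errors="replace") as f:
    text = f.read()

print(text)  # the stray 0x9d shows up as the replacement character
```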
You can load and split the document with the following code (here `doc` is assumed to hold the raw bytes of your document):
from transformers import GPT2TokenizerFast
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open('test.txt', 'w') as f:
    f.write(doc.decode('utf-8'))
with open('test.txt', 'r') as f:
    text = f.read()

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=64,
    chunk_overlap=24,
    length_function=count_tokens,
)
chunks = text_splitter.create_documents([text])
Try DirectoryLoader; it worked for me.