This module is supposed to filter and clean text, but I haven't been able to get it to work.
> import re
>
> chat_export_file = "C:\\Users\\User\\OneDrive\\Documents\\chatty.txt"
>
> def remove_chat_metadata(chat_export_file):
>     pattern = r"(\d+\/\d+\/\d+,\s\d+:\d+)\s-\s([\w\s]+):\s"
>
>     with open(chat_export_file, "r") as corpus_file:
>         content = corpus_file.read()
>     cleaned_corpus = re.sub(pattern, "", content)
>     return tuple(cleaned_corpus.split("\n"))
>
> def clean_corpus(chat_export_file):
>     message_corpus = remove_chat_metadata(chat_export_file)
>     # remove_non_message_text is not shown in this post
>     cleaned_corpus = remove_non_message_text(message_corpus)
>     return cleaned_corpus
>
> cleaned_corpus = clean_corpus(chat_export_file)
> print(cleaned_corpus)
I expected it to clean and filter the text, but it just gives me this error:
> Traceback (most recent call last):
>   File "C:\Users\User\AppData\Local\Programs\Python\Python310\cleaner.py", line 25, in <module>
>     cleaned_corpus = clean_corpus(chat_export_file)
>   File "C:\Users\User\AppData\Local\Programs\Python\Python310\cleaner.py", line 22, in clean_corpus
>     message_corpus = remove_chat_metadata(chat_export_file)
>   File "C:\Users\User\AppData\Local\Programs\Python\Python310\cleaner.py", line 17, in remove_chat_metadata
>     content = corpus_file.read()
>   File "C:\Users\User\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
>     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 8545: character maps to <undefined>
I don't know what this error means or what might be causing it. Any help would be greatly appreciated, thanks!
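
For reference, the traceback shows the file being decoded with `cp1252` (the last frame is in `encodings\cp1252.py`), and byte `0x8d` has no character assigned in that code page. Below is a minimal sketch of one possible fix. It assumes the chat export is actually UTF-8 encoded, which is not confirmed anywhere above. The `encoding` parameter on the function, the sample messages, and the temporary file are hypothetical and only there to make the example runnable on its own.

```python
import os
import re
import tempfile

# Same pattern as in the question: strips "date, time - name: " prefixes.
PATTERN = r"(\d+\/\d+\/\d+,\s\d+:\d+)\s-\s([\w\s]+):\s"


def remove_chat_metadata(chat_export_file, encoding="utf-8"):
    # Assumption: the export is UTF-8. Passing `encoding` explicitly stops
    # open() from falling back to the locale default (cp1252 in the traceback).
    with open(chat_export_file, "r", encoding=encoding) as corpus_file:
        content = corpus_file.read()
    cleaned_corpus = re.sub(PATTERN, "", content)
    return tuple(cleaned_corpus.split("\n"))


# Hypothetical two-message export. "\u010d" encodes to bytes C4 8D in UTF-8,
# so the file contains the same 0x8d byte that cp1252 cannot decode.
sample = "9/15/22, 14:50 - Alice: hello \u010d\n9/15/22, 14:51 - Bob: ok"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    sample_path = tmp.name

try:
    result = remove_chat_metadata(sample_path)
    print(result)
finally:
    os.remove(sample_path)  # clean up the temporary sample file
```

If the export turns out not to be UTF-8, the same parameter can carry a different codec name, and `open()` also accepts `errors="replace"` to substitute undecodable bytes instead of raising, at the cost of losing those characters.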