Question

This is my new code. For some reason it gives the error shown below. Does anyone know why this happens, or is there any way I can fix it?

New code:

import glob
import re

folder_path = "/home/"
file_pattern = "/**/*"
folder_contents = glob.glob(folder_path + file_pattern, recursive=True)

#Search for Emails
regex1= re.compile(r'\S+@\S+')
#Search for Phone Numbers
regex2 = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')

match_list=[]

for file in folder_contents:
    read_file = open(file, 'rt').read()
    if regex1.findall(read_file) or regex2.findall(read_file):

        email = regex1.findall(read_file)
        phone=regex2.findall(read_file)

        match_list.append(file)
        print (file)
        print (email)

Here is the error I get:

/home//sample.txt
['[email protected]', '[email protected]']
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-44-6281ab1fc0ff> in <module>()
     15 
     16 for file in folder_contents:
---> 17     read_file = open(file, 'rt').read()
     18     if regex1.findall(read_file) or regex2.findall(read_file):
     19 

/jupyterhub_env/lib/python3.5/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte

Do I need to add an if/else statement to specify the file type, or ...?

Answer 1

The glob module can do this if you pass recursive=True:

import glob

folder_path = "/home/e136320"
file_pattern = "/**/*"
# With recursive=True, "**" matches any number of nested directories
folder_contents = glob.glob(folder_path + file_pattern, recursive=True)
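One thing to watch for: the `/**/*` pattern matches directories as well as files, and calling `open()` on a directory raises an error. A minimal sketch of filtering the results down to regular files might look like this; the helper name `list_files` is made up for illustration and is not part of the original answer.

```python
import glob
import os


def list_files(folder_path):
    """Return every regular file under folder_path, searched recursively.

    Hypothetical helper: glob's "**" pattern also returns directories,
    so they are filtered out with os.path.isfile().
    """
    pattern = os.path.join(folder_path, "**", "*")
    return [p for p in glob.glob(pattern, recursive=True) if os.path.isfile(p)]
```

The loop in the question could then iterate over `list_files(folder_path)` instead of the raw glob result. Note that glob skips names starting with a dot by default.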

Answer 2

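Your file is evidently not in the encoding that Python picks up from your locale. Your locale expects UTF-8 data, but the file appears to use some other encoding. Assuming you work mostly in an English-language locale, good guesses are cp1252 and latin-1; try passing encoding='cp1252' to the open call and see if it works. latin-1 will never fail, but it may produce gibberish, whereas Windows machines frequently generate cp1252 data, which makes it a good guess.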
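The guessing strategy described above can be sketched as a small helper that tries each encoding in turn. This is only an illustration under the answer's assumptions (UTF-8 first, then cp1252, with latin-1 as the last resort); the function name `read_text` and its `encodings` parameter are hypothetical.

```python
def read_text(path, encodings=("utf-8", "cp1252")):
    """Read a text file, trying each candidate encoding in order.

    Hypothetical helper: if every candidate raises UnicodeDecodeError,
    fall back to latin-1, which maps every byte to a character and so
    never fails (though the result may be gibberish).
    """
    for enc in encodings:
        try:
            with open(path, "rt", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    with open(path, "rt", encoding="latin-1") as f:
        return f.read()
```

In the question's loop, `read_file = read_text(file)` would replace the bare `open(file, 'rt').read()` call. A successful decode does not prove the guess was right, so it is worth spot-checking the output for files that fall through to cp1252 or latin-1.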

Open and read only the files in directories within a directory
