Trying to get the encoding from a web page with Python and BeautifulSoup

Question

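I'm trying to retrieve the charset from a web page (this will change all the time). At the moment I'm using BeautifulSoup to parse the page and then extract the charset from the header. This was working fine until I ran into a page that had: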

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
The code I have so far, which works on other pages, is:

    def get_encoding(soup):
        # Look at the first <meta> tag only
        encod = soup.meta.get('charset')
        if encod is None:
            encod = soup.meta.get('content-type')
            if encod is None:
                # Returns the whole value, e.g. "text/html; charset=utf-8"
                encod = soup.meta.get('content')
        return encod
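Does anyone have a good idea of how to extend this code so it retrieves the charset from the example above? Would tokenizing the value and pulling the charset out that way be a reasonable approach? And how would I do it without changing the whole function? Right now the code above returns "text/html; charset=utf-8", which causes a LookupError because that is not a known encoding.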
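
One possible way to isolate the charset from a `content` value like the one above is a small regular expression. This is only a minimal sketch: the helper name `charset_from_content` is hypothetical (it is not part of BeautifulSoup), and it assumes the value follows the usual `type/subtype; charset=name` shape.

```python
import re

# Matches "charset=utf-8", "charset = utf-8" and charset="utf-8" (case-insensitive).
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)

def charset_from_content(content):
    """Return the charset named in a Content-Type style string, or None."""
    if not content:
        # Covers both a missing attribute (None) and an empty string
        return None
    match = _CHARSET_RE.search(content)
    return match.group(1) if match else None

print(charset_from_content("text/html; charset=utf-8"))
print(charset_from_content("text/html"))
```

Capturing only codec-name characters, rather than `(.*)`, keeps anything that follows the charset out of the result.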

Thanks

The final code I ended up using:

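    import re
    import chardet

    def get_encoding(soup):
        encod = soup.meta.get('charset')
        if encod is None:
            encod = soup.meta.get('content-type')
            if encod is None:
                # 'content' may be missing, so fall back to an empty string
                content = soup.meta.get('content') or ''
                match = re.search(r'charset=(.*)', content)
                if match:
                    encod = match.group(1)
                else:
                    # Last resort: let chardet guess.
                    # unicode() is Python 2; on Python 3 chardet.detect expects bytes.
                    dic_of_possible_encodings = chardet.detect(unicode(soup))
                    encod = dic_of_possible_encodings['encoding']
        return encod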


Answer 1

    import re

    def get_encoding(soup):
        if soup and soup.meta:
            encod = soup.meta.get('charset')
            if encod is None:
                encod = soup.meta.get('content-type')
                if encod is None:
                    # 'content' may be missing, so fall back to an empty string
                    content = soup.meta.get('content') or ''
                    # Pull the charset out of e.g. "text/html; charset=utf-8"
                    match = re.search(r'charset=(.*)', content)
                    if match:
                        encod = match.group(1)
                    else:
                        raise ValueError('unable to find encoding')
        else:
            raise ValueError('unable to find encoding')
        return encod
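
Since the question mentions a LookupError for an unknown encoding, it may also help to check the extracted name before using it. The following is a rough sketch using the standard library's `codecs.lookup`; the helper name `normalize_encoding` and the `utf-8` default are assumptions made for illustration, not part of the answer above.

```python
import codecs

def normalize_encoding(name, default="utf-8"):
    """Return Python's canonical codec name for `name`, or `default` if unknown."""
    try:
        return codecs.lookup(name.strip()).name
    except (LookupError, AttributeError):
        # LookupError: Python has no codec by that name
        # AttributeError: name was None
        return default

print(normalize_encoding("UTF-8"))
print(normalize_encoding("text/html; charset=utf-8", default="latin-1"))
```

With a check like this, a malformed value falls back to the default instead of raising later when the page is decoded.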
