`.find('li')` returns None even though an `<li>` tag exists in the soup

Question

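I am trying to parse the content of a URL with BeautifulSoup after calling `requests.get()` (not shown in the code). The parser used is `"html.parser"`. I have the following snippet in a larger script.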

print(f"subheading : {subheading}")
print(f"type : {type(subheading)}")
print(f"dir : {dir(subheading)}")
if subheading.find('ul'):
    print(f"Going for next level subheading search")
else:
    c2 = subheading.find("li")
    print(f"c2 : {c2}")

The first print statement gives me this on stdout:

subheading : <li><a href="/handbook/PRIN/1/1.html?date=2022-10-14&amp;timeline=True">PRIN 1.1 Application and purpose</a></li>

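I added the type check and the attribute listing just to confirm whether I was doing something wrong. The second and third print statements give me this: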

type : <class 'bs4.element.Tag'>
dir : ['DEFAULT_INTERESTING_STRING_TYPES', '__bool__', '__call__', '__class__', '__contains__', '__copy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_strings', '_find_all', '_find_one', '_is_xml', '_lastRecursiveChild', '_last_descendant', '_should_pretty_print', 'append', 'attrs', 'can_be_empty_element', 'cdata_list_attributes', 'childGenerator', 'children', 'clear', 'contents', 'decode', 'decode_contents', 'decompose', 'decomposed', 'default', 'descendants', 'encode', 'encode_contents', 'extend', 'extract', 'fetchNextSiblings', 'fetchParents', 'fetchPrevious', 'fetchPreviousSiblings', 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings', 'find_all', 'find_all_next', 'find_all_previous', 'find_next', 'find_next_sibling', 'find_next_siblings', 'find_parent', 'find_parents', 'find_previous', 'find_previous_sibling', 'find_previous_siblings', 'format_string', 'formatter_for_name', 'get', 'getText', 'get_attribute_list', 'get_text', 'has_attr', 'has_key', 'hidden', 'index', 'insert', 'insert_after', 'insert_before', 'interesting_string_types', 'isSelfClosing', 'is_empty_element', 'known_xml', 'name', 'namespace', 'next', 'nextGenerator', 'nextSibling', 'nextSiblingGenerator', 'next_element', 'next_elements', 'next_sibling', 'next_siblings', 'parent', 'parentGenerator', 'parents', 'parserClass', 'parser_class', 'prefix', 'preserve_whitespace_tags', 'prettify', 'previous', 'previousGenerator', 'previousSibling', 
'previousSiblingGenerator', 'previous_element', 'previous_elements', 'previous_sibling', 'previous_siblings', 'recursiveChildGenerator', 'renderContents', 'replaceWith', 'replaceWithChildren', 'replace_with', 'replace_with_children', 'select', 'select_one', 'setup', 'smooth', 'sourceline', 'sourcepos', 'string', 'strings', 'stripped_strings', 'text', 'unwrap', 'wrap']

However, the `.find('li')` call in the else branch never succeeds. `c2` is always `None`.

I also tried this:

c2 = subheading.a

but that is also `None`.

I have also tried

c2 = subheading.find_all("li")

but then `c2` is an empty list.

My end goal is to first check whether the `li` tag exists, then find the `<a>` tag and, if it exists, access the `href` link and the text of that `<a>` tag.

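Note: I tried recreating the same thing in a terminal, and there it returned the correct `li` tag. I kept `subheading` in a string and then ran `bs(h, 'html.parser')`, on which `.find('li')` works, but when running the script it gives me `None`. The two objects have different types, though: the one in the script is `<class 'bs4.element.Tag'>`, while the one recreated in the terminal is `<class 'bs4.BeautifulSoup'>`. Do the different object types somehow get in the way of attribute access or something similar?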

Why does `.find('li')`, or any of the other approaches, give me `None` or fail even though the tag exists? What am I doing wrong?

Answer 1

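I found a funny way to get around the NoneType error I was facing. Since the `subheading` variable is of type `bs4.element.Tag`, whereas a `bs4.BeautifulSoup` object returned the correct `li` tag, I thought of converting `subheading` to a string and parsing it again with BeautifulSoup so that its type becomes `bs4.BeautifulSoup`. After that, `.find('li')` works perfectly.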

I changed the code to:

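from bs4 import BeautifulSoup as bs  # alias assumed from the bs(...) call in the question

# Re-parse the Tag so that the <li> becomes a child of a BeautifulSoup object
subheading_str = str(subheading)
subheading_soup = bs(subheading_str, "html.parser")
if subheading_soup.find("ul"):
    print("Going for next level subheading search")
else:
    # Search the re-parsed soup, not the original Tag
    c2 = subheading_soup.find("li")
    print(f"c2 : {c2}")  # Not None this time, gives the correct result
    if c2:
        pass  # next code part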

Note: this may not be the correct or technically proper way to solve the problem, but it worked for me.
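A possible explanation for the `find` and `find_all` results, sketched under the assumption that `subheading` is exactly the `<li>` tag printed in the question: `Tag.find()` and `Tag.find_all()` search only the descendants of the tag they are called on, never the tag itself. An `<li>` that contains no nested `<li>` therefore finds nothing, while re-parsing its string makes the `<li>` a child of the new `BeautifulSoup` object. The surrounding `<ul>` and the variable names below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical wrapper markup; only the <li> itself comes from the question.
html = (
    '<ul><li><a href="/handbook/PRIN/1/1.html?date=2022-10-14&amp;timeline=True">'
    "PRIN 1.1 Application and purpose</a></li></ul>"
)
soup = BeautifulSoup(html, "html.parser")
subheading = soup.find("li")  # a bs4.element.Tag that *is* the <li>

# find()/find_all() look only at descendants, so the <li> does not find itself.
inner = subheading.find("li")
inner_all = subheading.find_all("li")

# Re-parsing the string wraps the <li> in a BeautifulSoup object,
# so the <li> is now a descendant and find() returns it.
reparsed = BeautifulSoup(str(subheading), "html.parser")
outer = reparsed.find("li")

# Checking the tag's own name avoids the re-parse entirely.
is_li = subheading.name == "li"

print(inner, inner_all, outer is not None, is_li)
```

If this is what is happening, testing `subheading.name == "li"` is enough to confirm that the element is an `<li>`, and the string round trip is not needed.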

Answer 2

I think you may need to clarify your code, because I get a different result from yours.

pip3 install bs4

Then:

from bs4 import BeautifulSoup
s = """<li><a href="/handbook/PRIN/1/1.html?date=2022-10-14&amp;timeline=True">PRIN 1.1 Application and purpose</a></li>"""
soup = BeautifulSoup(s, "html.parser")  # name the parser explicitly, as in the question
soup.find("li")
# Returns the correct <li>.

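If this does not solve your problem, there may be an issue with the content you are actually searching. Take another look at your string to confirm that it is what you expect.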

The BeautifulSoup documentation is at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ and may help you get the data into the right form for querying.
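For the stated goal of reading the `href` and text of the `<a>` inside the `<li>`, a minimal sketch might look like the following. The helper name `link_info` is hypothetical and not part of BeautifulSoup; it only combines `find`, `get`, and `get_text`, and returns `None` rather than raising when no `<a>` is present.

```python
from bs4 import BeautifulSoup

html = (
    '<li><a href="/handbook/PRIN/1/1.html?date=2022-10-14&amp;timeline=True">'
    "PRIN 1.1 Application and purpose</a></li>"
)
subheading = BeautifulSoup(html, "html.parser").li


def link_info(tag):
    """Return (href, text) for the first <a> at or below `tag`, else None.

    Hypothetical helper written for this example.
    """
    if tag is None:
        return None
    # The tag may itself be the <a>; otherwise search its descendants.
    anchor = tag if tag.name == "a" else tag.find("a")
    if anchor is None:
        return None
    # .get() returns None instead of raising if the attribute is missing.
    return anchor.get("href"), anchor.get_text(strip=True)


result = link_info(subheading)
print(result)
```

Returning `None` for a missing tag keeps the caller's check to a single `if result:` before unpacking the pair.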
