Scraping meta tags in Python with BS4 or Newspaper3k

Question

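After an exhaustive search and many variations, I'm at a loss. I know BS4 (I also tried version 3) should be able to scrape meta tags, but I can't get it to work. The meta tags in question are properly closed with />, so that isn't it. They are always present (even though I added a fallback just in case), so that isn't it either. I've tried loops and I've tried different formats. I even tried Newspaper and Newspaper3k. Finally, I tried the lxml, html5lib, and html.parser parsers, all to no avail.

Any suggestions would be helpful... please.

My HTML source looks like this: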

<meta name="description" content="Here is an exclusive we just got in regarding toda...." />
<meta property="og:description" content="Here is an exclusive we just got in regarding toda...." />
<meta property="article:section" content="Breaking News" />

My Python code looks like this:

# Import requisite libraries
from bs4 import BeautifulSoup


# Start it up (and note I have also tried lxml and html.parser)
soup = BeautifulSoup(corpus, 'html5lib')
# corpus is holding data from Newspaper3k. This aspect works.


# Following is just me trying different ways to find the same 2 things:

# Retrieve description AKA summary
description = soup.find("meta",  property="og:description")  # 1st way
summary = soup.find("meta",  attrs={'name': "description"})  # 2nd way

# Retrieve category AKA section
category = soup.find("meta",  property='article:section')  # 1st way
section = soup.find("meta",  attrs={'article': "section"})  # 2nd way


# Test and return result
print(description["content"] if description else "No description given")
print(summary["content"] if summary else "No summary given")
print(category["content"] if category else "No category given")
print(section["content"] if section else "No section given")

It always returns:

No description given
No summary given
No category given
No section given

Answer 1

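Okay... I solved it. The problem was that I was using the corpus I had extracted with Newspaper3k as the input. Don't get me wrong, it does what it says on the tin, but the meta tags won't be in there because it only pulls in the article body and author.

However, when I fetch the page and pass it to BS4 directly, it gets the underlying page source as well (not just the article body), which means the meta tags are now there.

We can close this. Thanks for your patience.

The correct code should look like this: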
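To illustrate the difference, here is a minimal sketch that runs the same lookup against a full page and against body text only. The sample page is built from the meta tags in the question; `body_text` and the `get_meta` helper are hypothetical names used only for this illustration, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Full page source, assembled from the meta tags shown in the question.
full_html = """<html><head>
<meta name="description" content="Here is an exclusive we just got in regarding toda...." />
<meta property="og:description" content="Here is an exclusive we just got in regarding toda...." />
<meta property="article:section" content="Breaking News" />
</head><body><p>Article body text.</p></body></html>"""

# Hypothetical stand-in for extracted article text: no <head>, no meta tags.
body_text = "Article body text."


def get_meta(markup, **attrs):
    """Return the content of the first <meta> tag matching attrs, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    tag = soup.find("meta", attrs=attrs)
    return tag.get("content") if tag else None


# The full page contains the tag; the body text cannot.
print(get_meta(full_html, property="article:section"))
print(get_meta(body_text, property="article:section"))
```

The second call returns None for the same reason my original code printed the fallback messages: the input never contained a `<meta>` element to begin with.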

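    import urllib.request
    from bs4 import BeautifulSoup

    # Fetch the raw page HTML instead of reusing Newspaper3k's extracted text
    response = urllib.request.urlopen('https://www.someurl.com/breakingnews/this-just-in/')
    content = response.read()
    soup = BeautifulSoup(content, 'lxml')

    # Look up each meta tag by its property or name attribute
    description = soup.find("meta", property="og:description")
    summary = soup.find("meta", attrs={'name': "description"})
    category = soup.find("meta", property='article:section')

    # Print the content attribute, or a fallback if the tag is missing
    print(description["content"] if description else "No description given")
    print(summary["content"] if summary else "No summary given")
    print(category["content"] if category else "No category given")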

The rest then stays the same as before.
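If several meta tags are needed, one option is to collect them all into a dictionary in a single pass. This is only a sketch: `collect_meta` and `sample` are hypothetical names, and keying each tag on `property` first and `name` second is an assumption about what is convenient, not something the libraries require. It also shows why the question's `attrs={'article': "section"}` attempt would fail even on the full page: it searches for an attribute literally named `article`, which these tags do not have.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the meta tags in the question.
sample = """<head>
<meta name="description" content="Here is an exclusive we just got in regarding toda...." />
<meta property="og:description" content="Here is an exclusive we just got in regarding toda...." />
<meta property="article:section" content="Breaking News" />
</head>"""


def collect_meta(html):
    """Map each meta tag's property (or name) to its content value."""
    soup = BeautifulSoup(html, "html.parser")
    meta = {}
    for tag in soup.find_all("meta"):
        key = tag.get("property") or tag.get("name")
        content = tag.get("content")
        if key and content is not None:
            meta[key] = content
    return meta


meta = collect_meta(sample)
print(meta.get("article:section", "No category given"))
print(meta.get("og:title", "No title given"))

# No tag has an attribute called "article", so this lookup finds nothing.
soup = BeautifulSoup(sample, "html.parser")
print(soup.find("meta", attrs={"article": "section"}))
```

Using `dict.get` with a default keeps the same fallback behaviour as the conditional expressions above.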
