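I'm using requests and BS4 to scrape data from a web page. I have a string containing a few sentences from a paragraph on that page, and I'd like to know how to extract the full paragraph that contains it. If anyone knows how to do this, please let me know! Thanks :)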
The most obvious approach is to simply iterate over all the paragraphs and find the one that contains your few words:
for p in soup.find_all('p'):
    if few_words in p.text:
        # found it, do something
        pass
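A minimal, self-contained sketch of the loop above. The HTML, the search phrase, and the helper names `normalize` and `find_paragraph` are made up for illustration. Whitespace is collapsed on both sides because text on a real page is often wrapped across lines, which would otherwise make a plain `in` test fail.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, for illustration only.
html = """
<div>
  <p>First paragraph, nothing interesting here.</p>
  <p>The quick brown fox
     jumped over the lazy dog. It then ran away.</p>
</div>
"""

def normalize(text):
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(text.split())

def find_paragraph(soup, few_words):
    """Return the normalized text of the first <p> containing few_words, or None."""
    needle = normalize(few_words)
    for p in soup.find_all("p"):
        paragraph = normalize(p.get_text())
        if needle in paragraph:
            return paragraph
    return None

soup = BeautifulSoup(html, "html.parser")
result = find_paragraph(soup, "fox jumped over")
print(result)
```

If nothing matches, the helper returns `None`, so the caller can tell "not found" apart from an empty paragraph.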
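Here are a few very simple cases that are handy to know when web scraping. They also partly answer your question, but since you haven't given more information, my data and approach are hypothetical at best.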
from bs4 import BeautifulSoup as bsoup
import re
html = """
<span>
<div id="foo">
The quick brown fox jumped
</div>
<p id="bar">
over the lazy dog.
</p>
</span>
"""
# Name the parser explicitly so the result does not depend on what is installed.
soup = bsoup(html, "html.parser")
soup.prettify()
# Find the div with id "foo" and get
# its inner text and print it.
foo = soup.find_all(id="foo")
f = foo[0].get_text()
print(f)
print("-" * 50)
# Find the p with id "bar", get its
# inner text, strip all whitespace,
# and print it out.
bar = soup.find_all(id="bar")
b = bar[0].get_text().strip()
print(b)
print("-" * 50)
# Find the word "lazy". Get its parent
# tag. If it's a p tag, get that p tag's
# parent, then get all the text inside that
# parent, strip all extra spaces, and print.
# string= is the current name of the older text= argument.
lazy = soup.find_all(string=re.compile("lazy"))
lazy_tag = lazy[0].parent
if lazy_tag.name == "p":
    lazy_grandparent = lazy_tag.parent
    all_text = lazy_grandparent.get_text()
    all_text = " ".join(all_text.split())
    print(all_text)
Result:
The quick brown fox jumped
--------------------------------------------------
over the lazy dog.
--------------------------------------------------
The quick brown fox jumped over the lazy dog.
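The parent-walking step in the last case can be generalized with `find_parent`, which climbs to the nearest enclosing tag of a given name. The sketch below uses made-up HTML in which the matched word sits inside a nested `<b>` tag, so checking only the direct parent would not be enough. It assumes the search phrase falls within a single text node; a phrase that spans several tags would need the paragraph-by-paragraph comparison instead.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page content, for illustration only.
html = """
<div>
  <p>Nothing to see here.</p>
  <p>The quick brown fox jumped over the <b>lazy dog</b> today.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the first text node that contains the word "lazy".
match = soup.find(string=re.compile("lazy"))

full_paragraph = None
if match is not None:
    # Climb to the nearest enclosing <p>, however deeply the text is nested.
    p_tag = match.find_parent("p")
    if p_tag is not None:
        # Collapse whitespace so the paragraph reads as one line.
        full_paragraph = " ".join(p_tag.get_text().split())

print(full_paragraph)
```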
for para in request_soup.p.find_all(text=True,recursive=True):
You can use this to extract the paragraph's text even when the <p> tag comes before other tags.
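A small sketch of what that loop collects, using made-up HTML and the newer `string=True` spelling of `text=True`. The variable names are illustrative only. Every text node under the first `<p>` is returned, including those inside nested tags, and joining them rebuilds the paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with tags nested inside the paragraph.
html = "<p>Click <a href='#'>this <b>link</b></a> to continue.</p>"
soup = BeautifulSoup(html, "html.parser")

# string=True matches every text node; recursive=True (the default)
# descends into nested tags instead of stopping at direct children.
pieces = soup.p.find_all(string=True, recursive=True)

# Join the text nodes back into the paragraph's full text.
paragraph = "".join(pieces)

print(pieces)
print(paragraph)
```

For this simple markup the joined result is the same as calling `soup.p.get_text()`; the loop form is mainly useful when each piece needs to be inspected or filtered on its own.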