Web scraping with Python 3.4: extracting a paragraph

Question  Votes: 0  Answers: 3

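I am using requests and BS4 to scrape data from a web page. I have a string containing a few words from a paragraph on that page, and I would like to know how to extract the full paragraph that contains it. If anyone knows how to do this, please tell me! Thanks :)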

python python-3.4
3 Answers
3
votes

The most obvious way is to just iterate over all the paragraphs and find the one that contains your words:

for p in soup.find_all('p'):
    if few_words in p.text:
        # found it, do something
        print(p.text)
        break
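A self-contained sketch of the loop above, in case it helps. The `find_paragraph` helper and the `sample_html` string are made up for illustration and are not from the question. It also collapses whitespace on both sides, on the assumption that the paragraph may be wrapped across several lines in the page source.

```python
from bs4 import BeautifulSoup


def find_paragraph(html, few_words):
    """Return the full text of the first <p> containing few_words, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Collapse runs of whitespace so line breaks in the HTML source
    # do not prevent a match.
    needle = " ".join(few_words.split())
    for p in soup.find_all("p"):
        text = " ".join(p.get_text().split())
        if needle in text:
            return text
    return None


# Hypothetical page content, used only for illustration.
sample_html = """
<div>
  <p>First paragraph, not the one we want.</p>
  <p>The quick brown fox
     jumped over the <b>lazy</b> dog.</p>
</div>
"""

print(find_paragraph(sample_html, "over the lazy dog"))
```

Because the comparison is done on `get_text()` of the whole paragraph, the words can still be found when part of the phrase sits inside an inline tag such as `<b>`.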

0
votes

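Here are a few very simple cases that are handy to have when web scraping. They also partly answer your question, but since you did not give more information, my data and approach are hypothetical at best.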

from bs4 import BeautifulSoup as bsoup
import re

html = """
<span>
    <div id="foo">
        The quick brown fox jumped
    </div>
    <p id="bar">
        over the lazy dog.
    </p>
</span>
"""

soup = bsoup(html, "html.parser")
soup.prettify()

# Find the element with id "foo", get
# its inner text, and print it.

foo = soup.find_all(id="foo")
f = foo[0].get_text()
print(f)

print("-" * 50)

# Find the element with id "bar", get its
# inner text, strip leading and trailing
# whitespace, and print it.

bar = soup.find_all(id="bar")
b = bar[0].get_text().strip()
print(b)

print("-" * 50)

# Find the text node containing "lazy" and get
# its parent tag. If that is a p tag, get the p
# tag's parent, collect all the text inside that
# parent, collapse extra whitespace, and print.
lazy = soup.find_all(text=re.compile("lazy"))
lazy_tag = lazy[0].parent

if lazy_tag.name == "p":
    lazy_grandparent = lazy_tag.parent
    all_text = lazy_grandparent.get_text()
    all_text = " ".join(all_text.split())
    print(all_text)

Result:

        The quick brown fox jumped

--------------------------------------------------
over the lazy dog.
--------------------------------------------------
The quick brown fox jumped over the lazy dog.
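Applied to the original question, the same parent-walking idea might look roughly like the sketch below. The `paragraph_for` helper and the sample markup are assumptions for illustration. It uses `find_parent("p")` rather than checking `.parent` by hand, and the `string=` argument, which older Beautiful Soup releases (before 4.4.0) spell `text=`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only.
sample_html = """
<div id="content">
  <p id="intro">An unrelated opening paragraph.</p>
  <p id="target">The quick brown fox jumped over the lazy dog.</p>
</div>
"""


def paragraph_for(soup, few_words):
    """Return the <p> tag enclosing the first text node that contains few_words."""
    node = soup.find(string=re.compile(re.escape(few_words)))
    if node is None:
        return None
    # find_parent walks upward until it reaches a <p>; it returns None if there is none.
    return node.find_parent("p")


soup = BeautifulSoup(sample_html, "html.parser")
tag = paragraph_for(soup, "lazy dog")
if tag is not None:
    print(tag.get_text(strip=True))
```

One caveat: this only matches when the whole phrase lies inside a single text node. If the words span an inline tag such as `<b>`, comparing against each paragraph's `get_text()` is likely the safer route.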

0
votes
for para in request_soup.p.find_all(text=True, recursive=True):
    print(para)

You can use this to extract the paragraph's text even when the <p> tag comes before other tags.
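A small runnable sketch of what that loop collects, assuming a made-up `sample_html` whose first `<p>` has inline tags nested in it. Note that `request_soup.p` is only the first `<p>` in the document, so this does not search other paragraphs.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> contains nested inline tags.
sample_html = '<div><p>The <b>quick</b> brown <a href="#">fox</a> jumped.</p></div>'

request_soup = BeautifulSoup(sample_html, "html.parser")

# request_soup.p is the first <p>; string=True matches every text node beneath it.
pieces = request_soup.p.find_all(string=True)
paragraph = "".join(pieces)
print(paragraph)
```

For markup like this, joining the pieces gives the same string as `request_soup.p.get_text()`, which is probably the simpler call if only the combined text is needed.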

© www.soinside.com 2019 - 2024. All rights reserved.