How do I extract text from multiple URLs with BeautifulSoup?


I'm working on lead generation and want to extract the text from a handful of URLs. Here is my code for extracting a single URL. What should I do if I want to extract several URLs and save the results to a dataframe?

from urllib.request import urlopen
from bs4 import BeautifulSoup


url = 'https://www.wdtl.com/'
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # specify a parser explicitly


# remove script and style elements so get_text() returns only visible text
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# collapse whitespace: strip each line, split on double spaces, drop empty chunks
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
web-scraping
1 Answer

If I understand you correctly, you can get there with this simplified approach. Let's see if it works for you:

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.wdtl.com/'

resp = requests.get(url, headers=headers)
soup = bs(resp.content, "lxml")

#first, find the links
links = soup.find_all('link', href=True)

#create a list to house the links
all_links= []

#find each link and add it to the list
for link in links:
    if 'http' in link['href']:  # the soup contains many non-http links; this skips them
        all_links.append(link['href'])

#finally, load the list into a dataframe
df = pd.DataFrame(all_links)
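To get back to the asker's original goal, extracting the visible text of several URLs into one dataframe, the question's whitespace-stripping logic can be wrapped in a helper and applied once per page. A minimal sketch, assuming the question's approach; the URLs and inline HTML below are placeholders, and in practice you would fetch each page with `requests.get(url).text`:

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_text(html):
    """Strip script/style tags and collapse whitespace, as in the question."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()  # drop non-visible content
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# hypothetical pages; in practice: {url: requests.get(url).text for url in urls}
pages = {
    "https://example.com/a": "<html><body><h1>Page A</h1><script>x=1</script></body></html>",
    "https://example.com/b": "<html><body><p>Page B text</p></body></html>",
}

# one row per URL, so each page's text stays attached to its source
rows = [{"url": url, "text": extract_text(html)} for url, html in pages.items()]
df = pd.DataFrame(rows)
print(df)
```

Each row keeps the URL alongside its extracted text, which makes it easy to filter or export the results later with the usual pandas tools.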