I'm building a web scraper for a recipe site. I want to collect the links to the recipes and then use each link to fetch its ingredients. I can do this, but only by manually pasting in the link for each recipe. Is there a way to get the links and then use them to look up the ingredients automatically? I'll also take any suggestions on how to make this code better!
import requests
from bs4 import BeautifulSoup

def trade_spider():
    url = 'https://tasty.co/topic/best-vegetarian'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'}):
        test = link.get('href')
        print(test)

def ingredient_spider():
    url1 = 'https://tasty.co/recipe/peanut-butter-keto-cookies'
    source_code1 = requests.get(url1)
    new_text = source_code1.text
    soup1 = BeautifulSoup(new_text, 'lxml')
    for ingredients in soup1.find_all("li", {"class": "ingredient xs-mb1 xs-mt0"}):
        print(ingredients.text)
To do this, make sure your functions return their output instead of print-ing it (to understand the difference, try reading the top answer to the question "What is the formal difference between 'print' and 'return'?"). Then you can use a function's output as a variable, or feed the output straight into the next function. For example:
x = trade_spider()

or

newFunction(trade_spider())
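Here is a minimal sketch of that difference; the function names scrape_with_print and scrape_with_return are made up for illustration:

def scrape_with_print():
    print('https://example.com')   # the value is displayed, then lost

def scrape_with_return():
    return 'https://example.com'   # the value is handed back to the caller

link = scrape_with_return()   # link now holds the URL string
# link = scrape_with_print()  # would display the URL but leave link set to None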
You need to call the ingredient_spider function for every link you get from the recipe listing. Using your example, it would look like this:
def trade_spider():
    url = 'https://tasty.co/topic/best-vegetarian'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'lxml')
    for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'}):
        test = link.get('href')
        ingredient_spider(test)  # hand each scraped link straight to the next function

def ingredient_spider(url):
    source_code1 = requests.get(url)  # receive url from trade_spider function
    new_text = source_code1.text
    soup1 = BeautifulSoup(new_text, 'lxml')
    for ingredients in soup1.find_all("li", {"class": "ingredient xs-mb1 xs-mt0"}):
        print(ingredients.text)
For each link you get from test = link.get('href'), you call ingredient_spider(), passing the test variable as the argument.
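One caveat, and this is an assumption on my part since I haven't checked what tasty.co actually returns: if any href comes back site-relative (e.g. /recipe/...) rather than absolute, requests.get() will fail on it. urllib.parse.urljoin handles both cases, so you could run each link through it before fetching:

from urllib.parse import urljoin

base = 'https://tasty.co'
# resolves a relative path against the base...
print(urljoin(base, '/recipe/peanut-butter-keto-cookies'))
# ...and passes an already-absolute URL through unchanged
print(urljoin(base, 'https://tasty.co/recipe/peanut-butter-keto-cookies'))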
Honestly, I'm not sure I've understood what you're asking, but if I have, you could do it like this:
def first():
    URLs = []
    ...
    for link in soup.find_all('a', {'class': 'feed-item analyt-internal-link-subunit'}):
        URLs.append(link.get('href'))
    return URLs
def second(url):
    source_code1 = requests.get(url)
    new_text = source_code1.text
    soup1 = BeautifulSoup(new_text, 'lxml')
    # collect every match; a bare return inside the loop would exit after
    # the first <li> and silently drop the rest of the ingredients
    return [ingredients.text for ingredients in
            soup1.find_all("li", {"class": "ingredient xs-mb1 xs-mt0"})]
def third(URL_LIST):
    for URL in URL_LIST:
        tmp = second(URL)
        print(tmp)

URL_LIST = first()
third(URL_LIST)
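If you want to tighten this up further, here is a sketch of one possible refactor (the helper name fetch_soup is mine, and the 10-second timeout is an arbitrary choice): share a single fetch-and-parse helper so the scraping functions stop repeating the requests/BeautifulSoup boilerplate, and fail loudly on HTTP errors:

import requests
from bs4 import BeautifulSoup

def fetch_soup(url):
    # fetch a page and parse it, raising an exception on 4xx/5xx responses
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'lxml')

Both first() and second() can then call fetch_soup(...) on their respective URLs and work with the returned soup directly.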