I've been assigned to build a web scraper that extracts links. So far I can successfully extract this data:
/
/users/sign_up
/topics
/smarties
/posts
/users/sign_in
/users/sign_up
/posts/installing-anaconda-python-data-science-platform
/topics/python
/topics/anaconda-python
/topics/machine-learning
/jordan
/posts/python-libraries-to-import-for-data-science-programs
/topics/python
/topics/data-science
/topics/machine-learning
/jordan
/posts/shortcut-for-opening-the-object-inspector-in-python-spyder
/topics/python
/topics/anaconda-python
/topics/spyder-python
/topics/machine-learning
/jordan
/posts/python-script-for-replacing-missing-data-in-a-machine-learning-algorithm
/topics/machine-learning
/topics/python
/jordan
/posts/python-script-for-pulling-in-the-same-column-from-an-array-of-arrays
/topics/python
/jordan
/posts/how-to-implement-fizzbuzz-in-python
/topics/fizzbuzz
/topics/python
/jordan
/posts/how-to-think-like-a-computer-scientist
/topics/computer-science
/topics/python
/topics/programming
/jordan
/posts/base-case-example-for-how-to-test-a-python-class
/topics/python
/topics/tdd
/jordan
/posts/installing-and-working-with-pipenv
/topics/pipenv
/topics/python
/jordan
/posts/steps-for-building-a-flask-api-application-with-python-3
/topics/flask
/topics/tutorial
/topics/python
/jordan
None
/topics/python?page=2
/topics/python?page=3
/topics/python?page=4
/topics/python?page=2
/topics/python?page=4
after running this code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.dailysmarty.com/topics/python')
soup = bs(r.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
But when I run this generator function that I'm building:
def generator(web):
    titles = []
    for link in web:
        if 'posts' in link.get('href'):
            print(link.get('href'))
        else:
            pass
data = soup.find_all('a')
generator(data)
I get this data followed by this traceback:
/posts
/posts/installing-anaconda-python-data-science-platform
/posts/python-libraries-to-import-for-data-science-programs
/posts/shortcut-for-opening-the-object-inspector-in-python-spyder
/posts/python-script-for-replacing-missing-data-in-a-machine-learning-algorithm
/posts/python-script-for-pulling-in-the-same-column-from-an-array-of-arrays
/posts/how-to-implement-fizzbuzz-in-python
/posts/how-to-think-like-a-computer-scientist
/posts/base-case-example-for-how-to-test-a-python-class
/posts/installing-and-working-with-pipenv
/posts/steps-for-building-a-flask-api-application-with-python-3
Traceback (most recent call last):
  File "C:\Users\joshu\AppData\Local\Programs\Python\Python38\classes.py", line 18, in <module>
    generator(data)
  File "C:\Users\joshu\AppData\Local\Programs\Python\Python38\classes.py", line 13, in generator
    if 'posts' in link.get('href'):
TypeError: argument of type 'NoneType' is not iterable
How can I make the for loop skip over None when I run the generator, so that it doesn't raise an error?
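You have to check whether the link actually has an "href" attribute before testing its value. For an anchor without one, link.get('href') returns None, and the 'in' test then fails with the TypeError shown above: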
if link.has_attr('href') and 'posts' in link.get('href'):
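As a rough sketch of how that check fits into the whole function, the version below collects the matching links in a list instead of printing them. The sample markup and the names `SAMPLE_HTML` and `post_links` are made up for illustration; they stand in for the real page so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page. The second anchor has
# no href attribute, which is the case that makes link.get('href') return None.
SAMPLE_HTML = """
<a href="/topics/python">Python</a>
<a name="top">No href here</a>
<a href="/posts/how-to-implement-fizzbuzz-in-python">FizzBuzz</a>
<a href="/posts">All posts</a>
"""


def post_links(anchors):
    """Return the href of every anchor whose href contains 'posts'."""
    found = []
    for link in anchors:
        href = link.get('href')  # None when the tag has no href attribute
        if href is not None and 'posts' in href:
            found.append(href)
    return found


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
links = post_links(soup.find_all('a'))
print(links)

# Alternative: ask BeautifulSoup to return only anchors that have an href,
# so the loop never sees a None value in the first place.
links_with_href = [a['href'] for a in soup.find_all('a', href=True)
                   if 'posts' in a['href']]
```

Both approaches should give the same list here. Filtering with `href=True` keeps the loop body simpler, while the explicit `is not None` check works on any list of tags you already have.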