Trying to extract some data from a webpage (scraping beginner)

Question (votes: 1, answers: 1)

I am trying to extract some data from a webpage using Requests and BeautifulSoup. I start by fetching the HTML with Requests and then "put it" into BeautifulSoup:

from bs4 import BeautifulSoup
import requests


result = requests.get("https://XXXXX")
#print(result.status_code)
#print(result.headers)
src = result.content
soup = BeautifulSoup(src, 'lxml')

Then I singled out part of the markup:

tags = soup.findAll('ol',{'class':'activity-popup-users'})

print(tags)

Here is part of what I get:

<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">

What I want now is to extract the value that follows data-user-id=, i.e. the digits between the double quotes. I would then like to put that data into some kind of spreadsheet. I am an absolute beginner, and I am pasting together code I found in tutorials and documentation elsewhere. Thank you very much for your time.

EDIT: So this is what I tried:

from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
tags = soup.findAll('ol',{'class':'activity-popup-users'})
print(tags['data-user-id'])

This is what I got:

TypeError: list indices must be integers or slices, not str

So I tried this:

from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
#tags = soup.findAll('a',{'class':'account-group js-user-profile-link'})
tags = soup.findAll('ol',{'class':'activity-popup-users'})
tags.attrs
#print(tags['data-user-id'])

And got:

File "C:\Users\XXXX\element.py", line 1884, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key

AttributeError: ResultSet object has no attribute 'attrs'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

python web-scraping beautifulsoup python-requests html-parsing

1 Answer (votes: 1)

You can get any attribute value of a tag by treating the tag like a dictionary of attribute values. See the BeautifulSoup documentation on attributes.

tag['data-user-id']

For example:

html="""
<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
tag=soup.find('div')
print(tag['data-user-id'])

Output:

3787869561
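Both errors in the question have the same cause: find_all() (findAll is its older alias) returns a ResultSet, which is a list of tags, so attribute lookups have to be made on each element rather than on the list itself. A minimal sketch of the difference, using a shortened stand-in for the page markup (the second entry and its values are invented for illustration):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page; the second <div> is made up.
html = """
<ol class="activity-popup-users">
  <li><div class="account" data-screen-name="TheUNTimes" data-user-id="3787869561"></div></li>
  <li><div class="account" data-screen-name="example_user" data-user-id="12345"></div></li>
</ol>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() always returns a ResultSet (a list subclass), even for one match.
lists = soup.find_all('ol', class_='activity-popup-users')
print(isinstance(lists, list))

# Indexing that list with a string raises the TypeError seen in the question.
try:
    lists['data-user-id']
except TypeError as err:
    print('TypeError:', err)

# find() returns a single Tag (or None if nothing matches),
# and a Tag does support dictionary-style attribute access.
first_div = soup.find('div', class_='account')
print(first_div['data-user-id'])

# To read the attribute from every match, loop over the ResultSet.
user_ids = [div['data-user-id'] for div in soup.find_all('div', class_='account')]
print(user_ids)
```

Note that the <ol> element itself has no data-user-id attribute in the markup shown in the question; the attribute sits on the <div class="account ..."> elements inside it, which is why the lookup targets those.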

Edit to include the OP's question change:

from bs4 import BeautifulSoup
import requests

result = requests.get("http://twitter.com/RussiaUN/media")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
# select every <div> whose class list includes "account"
divs = soup.find_all('div', class_='account')
# just print
for div in divs:
    print(div['data-user-id'])
# write to a file, one ID per line
with open('file.txt', 'w') as f:
    for div in divs:
        f.write(div['data-user-id'] + '\n')

Output:

255471924
2154112404
408696260
1267887043
475954041
3787869561
796979978
261711504
398068796
1174451010
...
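The question also asks about getting the data into a spreadsheet. One option, sketched here under assumptions, is to write a CSV file with Python's standard csv module, since spreadsheet programs can open CSV. The file name users.csv, the choice of columns, and the sample markup (everything after the first entry is invented) are illustrative only. The attrs={'data-user-id': True} filter keeps only tags that have the attribute, and tag.get() returns a default instead of raising KeyError when another attribute is missing.

```python
import csv
from bs4 import BeautifulSoup

# Stand-in markup: the first <div> is from the question, the rest is made up.
html = """
<div class="account" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561"></div>
<div class="account" data-screen-name="example_user" data-user-id="12345"></div>
<div class="account">no user id on this one</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Attributes to export, in column order.
fields = ['data-user-id', 'data-screen-name', 'data-name']

# Passing True as the attribute value matches any tag that has the attribute,
# so the third <div> above is skipped.
rows = []
for div in soup.find_all('div', attrs={'data-user-id': True}):
    # .get() yields '' for a missing attribute instead of raising KeyError.
    rows.append([div.get(field, '') for field in fields])

# newline='' is what the csv module documentation recommends for output files.
with open('users.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['user_id', 'screen_name', 'name'])  # header row
    writer.writerows(rows)
```

To run this against a live page, the html string would be replaced by the content fetched with Requests, as in the answer's code.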
