Web scraping

Question (votes: 0, answers: 2)

Hi, I have written the following code to extract property details.

At the moment I am trying to extract the area (square footage).

import requests
from bs4 import BeautifulSoup

#Loads the webpage
r = requests.get("https://www.century21.com/for-sale-homes/Westport-CT-20647c", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
#grabs the content of this page
c=r.content

if "blocked" in r.text:
    print ("we've been blocked")



#makes the content more readable
soup=BeautifulSoup(c,"html.parser")

#Finds the property cards listed
all=soup.find_all("div", {"class":"sr-card js-safe-link"})

x=all[0]

for li in x.find_all("li"):
    print(li)

The above code prints the following:

<li class="test-beds">6 beds</li>
<li class="test-baths">9 baths</li>
<li>8,511 sq ft</li>
<li>$370 / sq ft</li>
<li>On Site 2 days</li>
<li>Single Family Residence</li>

My question is: how do I extract the "8,511 sq ft" value?

I tried print(li[2]), but unfortunately it did not work.

Could someone point out where I went wrong and point me in the right direction to fix it?

Thanks

python web-scraping
2 Answers
0 votes

You need to use .text to get the content without the tags. I have also made it print both li[2] and li[2].text to show the difference.

import requests
from bs4 import BeautifulSoup

#Loads the webpage
r = requests.get("https://www.century21.com/for-sale-homes/Westport-CT-20647c", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
#grabs the content of this page
c=r.content

if "blocked" in r.text:
    print ("we've been blocked")



#makes the content more readable
soup=BeautifulSoup(c,"html.parser")

#Finds the property cards listed
all=soup.find_all("div", {"class":"sr-card js-safe-link"})

x=all[0]

# Store all elements with tag <li> in li
li = x.find_all("li")

# Print the element in index position 2
print(li[2])
print(li[2].text)
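
With the output shown in the question, li[2] happens to be the square-footage item. If you would rather not rely on its position in the list, here is a small sketch (assuming the "sq ft" wording from the question's output stays the same) that picks the <li> by its text instead of its index:

# Sketch: take the <li> whose text mentions "sq ft", skipping the "$ / sq ft" price line
sq_ft = next((item.text for item in li
              if "sq ft" in item.text and "/" not in item.text), None)
print(sq_ft)  # e.g. "8,511 sq ft"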

0 votes

Just find it with a CSS selector:

data = r.text

soup = BeautifulSoup(data, "html.parser")
# Both classes sit on the same div, so chain them without a space
number_li = soup.select('.sr-card.js-safe-link ul li:nth-child(3)')
print(number_li[0].text)
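
select() returns a list of tags, so you still need .text (or get_text()) to pull out the string. Here is a minimal sketch that prints the square footage for every card on the page, assuming each card uses the same layout as in the question:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.century21.com/for-sale-homes/Westport-CT-20647c",
                 headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
soup = BeautifulSoup(r.text, "html.parser")

# Loop over every listing card and grab its third <li>, which holds the square footage
for card in soup.select('.sr-card.js-safe-link'):
    sq_ft = card.select_one('ul li:nth-child(3)')
    if sq_ft is not None:
        print(sq_ft.get_text(strip=True))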
