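I'm still learning web scraping with Python and BeautifulSoup, and I've run into a formatting problem.
The code pulls the correct data from the website, but it doesn't land in the correct columns.
For example:
Column "unit_size"
should contain ==> 5' x 8' x 10'
but instead the dimensions are written on every other row (along with other information that belongs in the following columns).
Column "unit_type"
should contain ==> "Drive Up 1st Floor Outside Level No Climate"
Column "online_price"
should contain ==> "$74.95"
Column "street_address"
should contain ==> "1224 N Tryon St Charlotte NC 28206"
You guys/gals have been a great help.
Here is the Python code: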
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

urls = ['https://www.uhaul.com/Locations/Self-Storage-near-Charlotte-NC-28206/780052/'
        , 'https://www.uhaul.com/Locations/Self-Storage-near-Charlotte-NC-28212/780063/']

filename = "u_haul.csv"
open(filename, 'w').close()
f = open(filename, "a")
num = 0
headers = "unit_size, unit_type, online_price, street_address\n"
f.write(headers)

for my_url in urls:
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")

    street_address = page_soup.find("div", {"class": "address"}).text
    #store_city = page_soup.find("span", {"": ""}).text
    #store_postalcode = page_soup.find("span", {"": ""}).text

    containers = page_soup.findAll("div", {"class": "row"})
    for container in containers:
        title_container = container.findAll("div", {"class": "medium-4 medium-offset-2 small-7 columns"})
        unit_type = container.findAll("p", {"class": "collapse"})
        online_price = container.findAll("div", {"class": "medium-3 column"})

        for item in zip(title_container, unit_type, online_price ):
            csv = item[0].text + "," + item[1].text + "," + item[2].text + "," + street_address + "\n"
            f.write(csv)
            num += 1

f.close()
Here is the HTML of the container:
<div class="row">
<div class="medium-6 columns">
<button class="pull-left toggle-trigger no-toggle-icon show-for-small-only" data-keep-events="" data-toggle-id="mainMenu" id="menuToggle">
<i class="fa fa-bars"></i>
</button>
<!-- mp_trans_remove_start -->
<button class="pull-right toggle-trigger no-toggle-icon show-for-small-only" data-keep-events="" data-toggle-id="searchBox" id="searchToggle">
<i class="fa fa-search"></i>
</button>
<!-- mp_trans_remove_end -->
<a aria-label="Shopping Cart" class="pull-right button show-for-small-only" href="/Cart.aspx" id="header_cart_mobilie">
<i class="fa fa-shopping-cart"></i>
</a>
<div class="logo">
<a class="show-for-medium-up" href="/" id="header_logo_desktop">
<img alt="U-Haul" src="/Images/uhaul-logo.png?v=1290732713"/>
<img alt="Your moving and storage resource." src="/Images/uhaul_tagline.png?v=629728584"/>
</a>
<a class="show-for-small-only" href="/" id="header_logo_mobile">
<img alt="U-Haul" src="/Images/uhaul_logo_white.png?v=291560867"/>
</a>
</div>
</div>
<div class="medium-6 columns">
<ul class="inline text-right show-for-medium-up">
<li>
<a href="/Cart.aspx" id="header_cart">
<i class="fa fa-shopping-cart"></i>
Cart
</a>
</li>
<li>
<a href="/Orders/" id="header_signinlookup">
<i class="fa fa-sign-in"></i>
Sign in / look up order
</a>
</li>
<li>
<a href="/Locations/" id="header_locations">
<i class="fa fa-map-marker"></i>
Locations
</a>
</li>
</ul>
</div>
</div>
Here is the HTML for the address:
<div class="address">
<p class="collapse">
<span>1224 N Tryon St</span>
<br/>
<span>Charlotte</span>,
<span>NC</span>
<span>28206</span><br/>
</p>
Here is the HTML for the "unit_size" and "unit_type" columns:
<div class="medium-4 medium-offset-2 small-7 columns">
<h4 class="">
5' x 8' x 10'
</h4>
<p class="collapse">
Drive Up 1st Floor Outside Level No Climate <br/> Miscellaneous Storage (up to 2 rooms) <br/>
<em></em>
</p>
</div>
And finally, the HTML for the "online_price" column:
<div class="medium-3 column">
<p>
<strong class="text-large ">
$74.95
</strong>
<br/> per month
</p>
</div>
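Web browsers ignore extra spaces and tabs: they always render a run of whitespace as a single space. In scraped text that whitespace is still there, so you have to remove it yourself with standard string functions such as strip(), split(), join(), and replace().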
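A minimal sketch of that whitespace cleanup. The sample strings are made up to resemble what `.text` returns for the elements in the question, and `normalize_whitespace` is a hypothetical helper name, not part of BeautifulSoup:

```python
def normalize_whitespace(text):
    """Collapse every run of spaces, tabs, and newlines into one space."""
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace; join() glues the pieces back together.
    return ' '.join(text.split())


# Sample text shaped like the contents of the <h4> element (assumed sample).
raw_size = "\n        5' x 8' x 10'\n    "
raw_type = "Drive Up 1st Floor \n   Outside Level\tNo Climate"

print(repr(raw_size.strip()))              # strip() only trims the two ends
print(repr(normalize_whitespace(raw_type)))  # inner runs are collapsed too
```

strip() is enough when the only stray whitespace is at the ends; the split()/join() pair is needed when newlines or tabs sit in the middle of the text.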
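You should also use the csv module, because a string may sometimes contain a comma (for example, the street address) or a return/enter (newline), and such text has to be wrapped in " " to be saved correctly in a CSV file. The csv module does this automatically.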
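A small sketch of that automatic quoting. It writes to an in-memory `io.StringIO` buffer instead of a real file so the result can be inspected; the row values are sample data taken from the question:

```python
import csv
import io

# An in-memory buffer stands in for the real file (an assumption for the demo).
buffer = io.StringIO()
writer = csv.writer(buffer)

writer.writerow(["unit_size", "online_price", "street_address"])
# The address contains a comma, so the writer wraps that field in quotes.
writer.writerow(["5' x 8' x 10'", "$74.95", "1224 N Tryon St Charlotte, NC 28206"])

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the original three fields, comma intact.
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1])
```

Building the line by hand with `+ "," +` would instead split the address into two columns at the comma.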
You can also chain the functions find, findAll, find_all, select, and select_one:
page_soup.find('div', {'id': 'roomTypes'}).findAll("div", {"class": "row"})
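To illustrate why the chained call helps, here is a sketch on a tiny hand-written HTML snippet (an assumed stand-in for the real page, which is much larger). Searching the whole document for `row` also picks up the page header, while searching inside `roomTypes` first returns only the unit rows:

```python
from bs4 import BeautifulSoup

# Hand-written sample: one header row outside #roomTypes, two unit rows inside.
html = """
<div class="row"><div class="logo">U-Haul</div></div>
<div id="roomTypes">
  <div class="row"><h4>5' x 8' x 10'</h4></div>
  <div class="row"><h4>5' x 10' x 10'</h4></div>
</div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# Every div.row in the document, including the header row.
all_rows = page_soup.find_all("div", {"class": "row"})

# Chained: narrow to #roomTypes first, then look for rows inside it.
unit_rows = page_soup.find("div", {"id": "roomTypes"}).find_all("div", {"class": "row"})

# The same narrowing written as a single CSS selector.
selected_rows = page_soup.select("#roomTypes div.row")

sizes = [row.find("h4").text for row in unit_rows]
print(len(all_rows), len(unit_rows), sizes)
```

Note that `find` returns `None` when nothing matches, so on a page without a `roomTypes` element the chained call would raise an `AttributeError`.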
Full code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

urls = [
    'https://www.uhaul.com/Locations/Self-Storage-near-Charlotte-NC-28206/780052/',
    'https://www.uhaul.com/Locations/Self-Storage-near-Charlotte-NC-28212/780063/'
]

filename = 'u_haul.csv'

# 'w' creates the file (or empties an old one, so the header is written once);
# newline='' is what the csv module expects for files it writes to
f = open(filename, 'w', newline='')
csv_writer = csv.writer(f)  # use csv module because some data may contain a comma or newline

headers = ['title', 'unit_size', 'unit_type', 'online_price', 'street_address']
csv_writer.writerow(headers)

for my_url in urls:
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, 'html.parser')

    street_address = page_soup.find("div", {"class": "address"}).text
    street_address = ' '.join(street_address.split())  # collapse whitespace
    print('street_address>', street_address, '<')
    print('---------------------------------------------------')

    #store_city = page_soup.find("span", {"": ""}).text
    #store_postalcode = page_soup.find("span", {"": ""}).text

    # only the rows inside the unit list, not every "row" on the page
    containers = page_soup.find('div', {'id': 'roomTypes'}).findAll("div", {"class": "row"}) # <-- changed

    for container in containers:
        title_container = container.find("div", {"class": "medium-4 medium-offset-2 small-7 columns"})
        unit_size = container.find("h4") # <-- changed
        unit_type = container.find("p", {"class": "collapse"})
        online_price = container.find("strong", {"class": "text-large "}) # <-- changed

        if title_container: # some rows don't have data
            title = ' '.join(title_container.text.split())
            size = ' '.join(unit_size.text.split())
            unit = ' '.join(unit_type.text.split())
            price = online_price.text.strip()

            print('title>', title, '<')
            print('size>', size, '<')
            print('unit>', unit, '<')
            print('price>', price, '<')
            print('-----')

            csv_writer.writerow([title, size, unit, price, street_address])

f.close()
Result:
street_address> 1224 N Tryon St Charlotte, NC 28206 <
---------------------------------------------------
title> 5' x 8' x 10' Drive Up 1st Floor Outside Level No Climate Miscellaneous Storage (up to 2 rooms) <
size> 5' x 8' x 10' <
unit> Drive Up 1st Floor Outside Level No Climate Miscellaneous Storage (up to 2 rooms) <
price> $74.95 <
-----
title> 4' x 12' x 10' Interior 1st Floor Street Level No Climate 1-2 Bedroom Home (up to 1,200 sq. ft.) <
size> 4' x 12' x 10' <
unit> Interior 1st Floor Street Level No Climate 1-2 Bedroom Home (up to 1,200 sq. ft.) <
price> $79.95 <
-----
title> 5' x 10' x 10' Interior 1st Floor Street Level No Climate 1-2 Bedroom Home (up to 1,200 sq. ft.) <
size> 5' x 10' x 10' <
unit> Interior 1st Floor Street Level No Climate 1-2 Bedroom Home (up to 1,200 sq. ft.) <
price> $84.95 <
-----
title> 5' x 14' x 10' Interior 1st Floor Street Level No Climate 1-2 Bedroom Home (up to 1,200 sq. ft.) <
size> 5' x 14' x 10' <
unit> Interior 1st Floor Street Level No Climate 1-2 Bedroom Home (up to 1,200 sq. ft.) <
price> $89.95 <