How do I extract addresses from markup like this? I have some markup containing company data, shown below. How can I extract each address from it?
# importing the libraries
from bs4 import BeautifulSoup
import requests
import re
url = "https://www.nwctcoc.org/current_members_iframe.asp"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "html.parser")
business_types = []
for bt in soup.find_all(string=re.compile(r"Business Type:.*")):
    # Strip the line break and the label, keeping only the value
    business_types.append(str(bt).replace("\r\n", " ").replace("Business Type:", " ").strip())
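Below is the markup containing the data for a single company. I tried extracting the fields by counting lines, but the address can span 2 or 3 lines...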
<b>YMCA Camp Mohawk</b><br/>
Business Type: Non-Profit Organizations, Camps <br/>
P.O. Box 1209 <br/>
246 Great Hill Road, Cornwall <br/>
Litchfield, CT 06759 <br/>
Phone
: 860-672-6655 <br/>
Fax: 860-482-3878 <br/>
Contact: Patrick Marchand <br/>
Email: <a href="mailto:[email protected]">[email protected]</a> <br/>
Website: <a href="http://www.campmohawk.org" target="_blank">www.campmohawk.org</a><br/>
<br/> <br/>
</p>
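If the structure is fixed, for example the address always comes after the Business Type field and its <br/> tag and before Phone, you can use a (brittle) regular expression to get it: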
import re
target_string = '''
<b>YMCA Camp Mohawk</b><br/>
Business Type: Non-Profit Organizations, Camps <br/>
P.O. Box 1209 <br/>
246 Great Hill Road, Cornwall <br/>
Litchfield, CT 06759 <br/>
Phone
: 860-672-6655 <br/>
'''
# Group 1 captures everything between the Business Type line and the Phone label
result = re.search(r"Business Type.*<br/>((.|\n)*)Phone( |\n)*:", target_string)
# Extract match value of first capture group
print(result.group(1))
Output:
'''
P.O. Box 1209 <br/>
246 Great Hill Road, Cornwall <br/>
Litchfield, CT 06759 <br/>
'''
(I kept the newlines in the regex, but you should remove them before applying the regex to make this simpler.) For example:
import re
target_string = '''
<b>YMCA Camp Mohawk</b><br/> Business Type: Non-Profit Organizations, Camps <br/> P.O. Box 1209 <br/> 246 Great Hill Road, Cornwall <br/> Litchfield, CT 06759 <br/> Phone: 860-672-6655 <br/>
'''
# Lazy .*? stops at the first <br/> after Business Type, so group 1 keeps the whole address
result = re.search(r"Business Type.*?<br/>(.*?)Phone:", target_string)
# Extract match value of first capture group
print(result.group(1))
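The advice about removing newlines before applying the regex can be sketched as below. This is a minimal illustration, not a definitive implementation: the helper name `extract_address_lines` is hypothetical, and it assumes the address always sits between the Business Type line and the Phone label, as in the sample from the question.

```python
import re

# Sample markup copied from the question; the "Phone" label is split across two lines.
raw = """<b>YMCA Camp Mohawk</b><br/>
Business Type: Non-Profit Organizations, Camps <br/>
P.O. Box 1209 <br/>
246 Great Hill Road, Cornwall <br/>
Litchfield, CT 06759 <br/>
Phone
: 860-672-6655 <br/>"""


def extract_address_lines(markup):
    """Hypothetical helper: return the lines between 'Business Type' and 'Phone'."""
    # Collapse every run of whitespace (including newlines) into a single space.
    flat = re.sub(r"\s+", " ", markup)
    # Lazy quantifiers stop at the first <br/> and at the first Phone label.
    match = re.search(r"Business Type.*?<br/>(.*?)Phone\s*:", flat)
    if match is None:
        return []
    # Split the captured block on <br/> and drop empty pieces.
    parts = re.split(r"<br\s*/?>", match.group(1))
    return [part.strip() for part in parts if part.strip()]


address_lines = extract_address_lines(raw)
print(address_lines)
```

Returning a list keeps 2-line and 3-line addresses uniform, so no line counting is needed.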
In general, it is better to structure the data with a dedicated Address field, which makes parsing easier and safer.
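To illustrate why an explicit Address field helps, here is a small sketch of what a structured record could look like. The `Member` class and its field names are assumptions made for illustration only; the site in the question does not publish its data this way.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Member:
    """Hypothetical record layout with a dedicated address field."""
    name: str
    business_type: str
    address: list = field(default_factory=list)  # one entry per address line
    phone: str = ""


member = Member(
    name="YMCA Camp Mohawk",
    business_type="Non-Profit Organizations, Camps",
    address=["P.O. Box 1209", "246 Great Hill Road, Cornwall", "Litchfield, CT 06759"],
    phone="860-672-6655",
)

# Round-trip through JSON: the address comes back by key, with no regex involved.
payload = json.dumps(asdict(member))
restored = json.loads(payload)
print(restored["address"])
```

With a layout like this, the number of address lines no longer matters, because the consumer reads the field by name.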
What am I doing wrong? My "p_body" comes back as a list rather than a string. Sorry for the trouble, I'm new to this.
# importing the libraries
from bs4 import BeautifulSoup
import requests
import re
url = "https://www.nwctcoc.org/current_members_iframe.asp"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "html.parser")
business_types = []
for bt in soup.find_all(string=re.compile(r"Business Type:.*")):
    # Strip the line break and the label, keeping only the value
    business_types.append(str(bt).replace("\r\n", " ").replace("Business Type:", " ").strip())
p_body = soup.find('p')
# two groups enclosed in separate ( and ) bracket
result = re.search(r"Business Type.*<br/>(.*)Phone:", p_body)
# Extract match value of first capture group
print(result.group(1))
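A likely cause, assuming the snippet above is what was run: `soup.find('p')` returns a bs4 `Tag` (and `soup.find_all('p')` returns a list-like `ResultSet`), while `re.search` expects a `str`. The sketch below serializes each tag with `str()` before matching. The inline `html` sample stands in for the downloaded page, and the assumption that each company sits in its own `<p>` element is not confirmed by the thread.

```python
import re
from bs4 import BeautifulSoup

# Inline sample standing in for the downloaded page (assumption: one <p> per company).
html = """<p><b>YMCA Camp Mohawk</b><br/>
Business Type: Non-Profit Organizations, Camps <br/>
P.O. Box 1209 <br/>
246 Great Hill Road, Cornwall <br/>
Litchfield, CT 06759 <br/>
Phone
: 860-672-6655 <br/>
</p>"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"Business Type.*?<br\s*/?>(.*?)Phone\s*:")

addresses = []
for p_body in soup.find_all("p"):  # each item is a Tag, not a str
    # Serialize the Tag back to markup, then flatten all whitespace.
    flat = re.sub(r"\s+", " ", str(p_body))
    match = pattern.search(flat)
    if match:
        # Split the captured block on <br/> and keep the non-empty lines.
        pieces = re.split(r"<br\s*/?>", match.group(1))
        addresses.append([piece.strip() for piece in pieces if piece.strip()])

print(addresses)
```

Looping over `find_all` yields one address list per company, and paragraphs without a Business Type line are skipped.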