How to extract address lines from markup in Python

Question (votes: 0, answers: 2)

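How can I extract the address from markup like the one below? I have HTML containing company data, shown below. How do I extract each company's address from it?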

# importing the libraries
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.nwctcoc.org/current_members_iframe.asp"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "html.parser")
business_types = []
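# Raw string without the stray backslash: "\B" is a regex non-word-boundary assertion
for bt in soup.find_all(string=re.compile(r"Business Type:.*")):
    # `'\r\n' and 'Business Type:'` evaluates to just 'Business Type:', so only the label
    # was ever replaced; strip() already removes the surrounding line breaks
    business_types.append(bt.replace("Business Type:", "").strip())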

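Below is the markup containing the data for a single company. I tried copying the lines by counting them, but the address can span 2 or 3 lines...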

<b>YMCA Camp Mohawk</b><br/>

                   

                  Business Type: Non-Profit Organizations, Camps <br/>

                   P.O. Box 1209 <br/>

                   246 Great Hill Road, Cornwall <br/>

                   Litchfield, CT  06759 <br/>

                  Phone

                  

                  : 860-672-6655 <br/>

                   

                  Fax: 860-482-3878 <br/>

                   

                  Contact: Patrick Marchand <br/>

                   

                  Email: <a href="mailto:[email protected]">[email protected]</a> <br/>

                   

                  Website: <a href="http://www.campmohawk.org" target="_blank">www.campmohawk.org</a><br/>
<br/> <br/>
</p>
python
2 Answers
0 votes

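If the structure is fixed, i.e. the address always comes after the Business Type field and its <br/> tag, and before Phone, you can use a (fragile) regular expression to extract it: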

import re

target_string = '''
<b>YMCA Camp Mohawk</b><br/>

                   

                  Business Type: Non-Profit Organizations, Camps <br/>

                   P.O. Box 1209 <br/>

                   246 Great Hill Road, Cornwall <br/>

                   Litchfield, CT  06759 <br/>

                  Phone

                  

                  : 860-672-6655 <br/>
'''

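# Group 1 captures everything between the <br/> that ends the Business Type line
# and the "Phone" label (which may be separated from its colon by spaces or newlines)
result = re.search(r"Business Type.*<br/>((.|\n)*)Phone( |\n)*:", target_string)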

# Extract match value of first capture group
print(result.group(1))

Output:

'''

                   P.O. Box 1209 <br/>

                   246 Great Hill Road, Cornwall <br/>

                   Litchfield, CT  06759 <br/>

                  '''

(I kept the line breaks in the regex, but you should remove them before applying the regex to make this simpler.) For example:

import re

target_string = '''
<b>YMCA Camp Mohawk</b><br/> Business Type: Non-Profit Organizations, Camps <br/> P.O. Box 1209 <br/> 246 Great Hill Road, Cornwall <br/> Litchfield, CT  06759 <br/> Phone: 860-672-6655 <br/>
'''

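# Non-greedy quantifiers: with a greedy ".*" the first part would swallow the address
# lines up to the last <br/> before "Phone:", leaving group 1 with only a space
result = re.search(r"Business Type.*?<br/>(.*?)Phone:", target_string)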

# Extract match value of first capture group
print(result.group(1))
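
The advice above about removing line breaks first could be sketched roughly as follows. This is only a minimal illustration: the helper name `extract_address_lines` and the `sample` string are made up for this example, and it assumes every record has a "Business Type" line before the address and a "Phone" label after it.

```python
import re

def extract_address_lines(record_html):
    """Return the address lines found between 'Business Type' and 'Phone' (hypothetical helper)."""
    # Collapse every run of whitespace, including line breaks, into a single space
    flat = re.sub(r"\s+", " ", record_html)
    # Non-greedy groups: start after the first <br/> following "Business Type",
    # stop at the first "Phone" label that is followed by a colon
    match = re.search(r"Business Type.*?<br\s*/?>(.*?)Phone\s*:", flat)
    if match is None:
        return []
    # Each address line ends with a <br/>, so split on it and drop empty pieces
    parts = re.split(r"<br\s*/?>", match.group(1))
    return [part.strip() for part in parts if part.strip()]

# Sample record copied from the question (shortened)
sample = '''
<b>YMCA Camp Mohawk</b><br/>

                  Business Type: Non-Profit Organizations, Camps <br/>

                   P.O. Box 1209 <br/>

                   246 Great Hill Road, Cornwall <br/>

                   Litchfield, CT  06759 <br/>

                  Phone

                  : 860-672-6655 <br/>
'''

address = extract_address_lines(sample)
print(address)
```

Because the address is returned as a list, it does not matter whether a company has two or three address lines. Note that collapsing whitespace also turns the double space in "CT  06759" into a single one.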

In general, it would be better for the data to be structured with a dedicated Address field, which makes parsing easier and safer.
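
Since the page itself has no such field, one possible workaround is to build a structured record on the scraping side. The sketch below is an assumption-laden illustration, not a definitive parser: `LABELS`, `parse_record` and the `sample` string are invented for this example, and it assumes that every non-address line starts with one of the known labels, so anything unlabelled after the company name is treated as part of the address.

```python
import re

# Labels seen in the sample record from the question (assumed to be the complete set)
LABELS = ("Business Type", "Phone", "Fax", "Contact", "Email", "Website")
LABEL_RE = re.compile(r"^(%s)\s*:\s*(.*)$" % "|".join(LABELS))

def parse_record(record_html):
    """Turn one company's markup into a dict with an explicit 'address' list (hypothetical helper)."""
    flat = re.sub(r"\s+", " ", record_html)           # remove line breaks first
    parts = re.split(r"<br\s*/?>", flat)              # one piece per <br/>-terminated line
    lines = [re.sub(r"<[^>]+>", "", part).strip() for part in parts]  # drop <b>, <a>, ...
    lines = [line for line in lines if line]
    record = {"name": lines[0] if lines else "", "address": []}
    for line in lines[1:]:
        labelled = LABEL_RE.match(line)
        if labelled:
            record[labelled.group(1)] = labelled.group(2)
        else:
            # No known label: assume this line belongs to the address
            record["address"].append(line)
    return record

sample = '''
<b>YMCA Camp Mohawk</b><br/>
  Business Type: Non-Profit Organizations, Camps <br/>
  P.O. Box 1209 <br/>
  246 Great Hill Road, Cornwall <br/>
  Litchfield, CT  06759 <br/>
  Phone

  : 860-672-6655 <br/>
  Fax: 860-482-3878 <br/>
'''

record = parse_record(sample)
print(record["address"])
```

This avoids depending on the exact order of the fields, at the cost of depending on the list of labels being complete.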


-1 votes

What am I doing wrong? I get "p_body" as a list instead of a string. I'm new to this, sorry for the trouble.

# importing the libraries
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.nwctcoc.org/current_members_iframe.asp"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "html.parser")
business_types = []
# The keyword is `string`; the misspelled `sting` is treated as an HTML attribute filter
for bt in soup.find_all(string=re.compile(r"Business Type:.*")):
    business_types.append(bt.replace("Business Type:", "").strip())


p_body = soup.find('p')


# two groups enclosed in separate ( and ) bracket
result = re.search(r"Business Type.*<br/>(.*)Phone:", p_body)

# Extract match value of first capture group
print(result.group(1))
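
A likely cause of the problem in the code above: `soup.find('p')` returns a BeautifulSoup `Tag` object (or `None` when nothing matches), not a string, and `re.search` only accepts strings or bytes. One way around it, sketched below under the assumption that a single `<p>` holds one record shaped like the sample in the question, is to convert the tag with `str()` before searching. The inline `html` string stands in for the downloaded page so the example runs without network access.

```python
import re
from bs4 import BeautifulSoup

# Stand-in for the downloaded page (assumed shape, based on the sample in the question)
html = """<p><b>YMCA Camp Mohawk</b><br/>
   Business Type: Non-Profit Organizations, Camps <br/>
   P.O. Box 1209 <br/>
   246 Great Hill Road, Cornwall <br/>
   Litchfield, CT  06759 <br/>
   Phone

   : 860-672-6655 <br/>
</p>"""

soup = BeautifulSoup(html, "html.parser")
p_body = soup.find("p")  # a bs4 Tag, or None if the page has no <p>

address = []
if p_body is not None:
    # str() serializes the Tag back to markup; then collapse whitespace and line breaks
    markup = re.sub(r"\s+", " ", str(p_body))
    result = re.search(r"Business Type.*?<br\s*/?>(.*?)Phone\s*:", markup)
    if result is not None:
        # Split the captured block into individual address lines
        address = [part.strip()
                   for part in re.split(r"<br\s*/?>", result.group(1))
                   if part.strip()]

print(address)
```

If the real page puts several companies inside one `<p>`, `re.finditer` with the same pattern would be needed instead of `re.search`; that has not been checked against the actual site.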