How to extract address lines in Python

Question

I am scraping company data with the code shown below. How can I extract each company's address from it?

# importing the libraries
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.nwctcoc.org/current_members_iframe.asp"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "html.parser")
business_types = []
for bt in soup.find_all(string=re.compile(r"Business Type:.*")):
  # Replace the label with a space; strip() then removes surrounding whitespace and line breaks
  business_types.append(bt.replace('Business Type:', ' ').strip())

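Below is the markup containing the data for a single company. I tried to pick the addresses out by counting lines, but an address can span 2 or 3 lines...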

<b>YMCA Camp Mohawk</b><br/>

                   

                  Business Type: Non-Profit Organizations, Camps <br/>

                   P.O. Box 1209 <br/>

                   246 Great Hill Road, Cornwall <br/>

                   Litchfield, CT  06759 <br/>

                  Phone

                  

                  : 860-672-6655 <br/>

                   

                  Fax: 860-482-3878 <br/>

                   

                  Contact: Patrick Marchand <br/>

                   

                  Email: <a href="mailto:[email protected]">[email protected]</a> <br/>

                   

                  Website: <a href="http://www.campmohawk.org" target="_blank">www.campmohawk.org</a><br/>
<br/> <br/>
</p>

Answer 1

If the structure is fixed, e.g. the address always follows the Business Type field and its <br/> tag and comes before Phone, you can use a (fragile) regular expression to get it:

import re

target_string = '''
<b>YMCA Camp Mohawk</b><br/>

                   

                  Business Type: Non-Profit Organizations, Camps <br/>

                   P.O. Box 1209 <br/>

                   246 Great Hill Road, Cornwall <br/>

                   Litchfield, CT  06759 <br/>

                  Phone

                  

                  : 860-672-6655 <br/>
'''

# Group 1 captures everything between the <br/> ending the Business Type line and the Phone label
result = re.search(r"Business Type.*<br/>((.|\n)*)Phone( |\n)*:", target_string)

# Extract match value of first capture group
print(result.group(1))

Output:

'''

                   P.O. Box 1209 <br/>

                   246 Great Hill Road, Cornwall <br/>

                   Litchfield, CT  06759 <br/>

                  '''

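(I kept the newlines in the input for the regex above, but you should remove them before applying the regex to make this simpler.) For example: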

import re

target_string = '''
<b>YMCA Camp Mohawk</b><br/> Business Type: Non-Profit Organizations, Camps <br/> P.O. Box 1209 <br/> 246 Great Hill Road, Cornwall <br/> Litchfield, CT  06759 <br/> Phone: 860-672-6655 <br/>
'''

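# Non-greedy quantifiers: stop at the first <br/> after Business Type and at the next "Phone:"
# (with greedy .* the first part would swallow the address and group 1 would be nearly empty)
result = re.search(r"Business Type.*?<br/>(.*?)Phone:", target_string)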

# Extract match value of first capture group
print(result.group(1))

In general, it would be better to have the data structured with a dedicated Address field, which would make parsing easier and safer.
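The two steps above (flatten the whitespace, then match between Business Type and Phone) can be combined into one helper. This is only a minimal sketch under the same fixed-structure assumption; the function name extract_address_lines is made up for illustration, and the sample text is the snippet from the question.

```python
import re

raw = """
<b>YMCA Camp Mohawk</b><br/>

                  Business Type: Non-Profit Organizations, Camps <br/>

                   P.O. Box 1209 <br/>

                   246 Great Hill Road, Cornwall <br/>

                   Litchfield, CT  06759 <br/>

                  Phone

                  : 860-672-6655 <br/>
"""

def extract_address_lines(html):
    """Return the address lines found between Business Type and Phone (hypothetical helper)."""
    # Collapse every run of whitespace, including newlines, into a single space.
    flat = re.sub(r"\s+", " ", html)
    # Non-greedy quantifiers stop at the first <br/> after "Business Type"
    # and at the first "Phone" label that follows it.
    match = re.search(r"Business Type.*?<br/>(.*?)Phone\s*:", flat)
    if match is None:
        return []
    # Split the captured block on <br/> and drop the empty pieces.
    parts = (part.strip() for part in match.group(1).split("<br/>"))
    return [part for part in parts if part]

print(extract_address_lines(raw))
```

Because the address comes back as a list, it does not matter whether it has 2 or 3 lines. Note that collapsing whitespace also turns the double space in "CT  06759" into a single one.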

Answer 2

What am I doing wrong? I end up with "p_body" as a list rather than a string. I'm new to this, sorry for the trouble.

# importing the libraries
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.nwctcoc.org/current_members_iframe.asp"

# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text

# Parse the html content
soup = BeautifulSoup(html_content, "html.parser")
business_types = []
for bt in soup.find_all(string=re.compile(r"Business Type:.*")):
  # Replace the label with a space; strip() then removes surrounding whitespace and line breaks
  business_types.append(bt.replace('Business Type:', ' ').strip())


p_body = soup.find('p')


# two groups enclosed in separate ( and ) bracket
result = re.search(r"Business Type.*<br/>(.*)Phone:", p_body)

# Extract match value of first capture group
print(result.group(1))
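A likely cause of the error: soup.find('p') returns a single Tag object (and soup.find_all('p') returns a list-like ResultSet), while re.search only accepts a string, so the element has to be converted with str() first. Below is a rough sketch of that idea. Plain strings stand in for the Tag objects so that it runs without the network; address_from_block and the sample blocks are hypothetical, and it assumes each company sits in its own <p> with breaks serialized as <br/>, as in the sample.

```python
import re

ADDRESS_RE = re.compile(r"Business Type.*?<br/>(.*?)Phone\s*:")

def address_from_block(block):
    """Return the address in one company block as a single string, or None."""
    # re.search needs a str, so convert first; for a BeautifulSoup Tag,
    # str(tag) gives its HTML markup. Then flatten all whitespace.
    html = re.sub(r"\s+", " ", str(block))
    match = ADDRESS_RE.search(html)
    if match is None:
        return None
    lines = [part.strip() for part in match.group(1).split("<br/>")]
    return ", ".join(line for line in lines if line)

# Hypothetical sample data standing in for soup.find_all('p').
blocks = [
    "<p><b>YMCA Camp Mohawk</b><br/> Business Type: Camps <br/>\n"
    " P.O. Box 1209 <br/>\n Litchfield, CT 06759 <br/>\n"
    " Phone\n : 860-672-6655 <br/></p>",
    "<p>No business data here</p>",
]

for block in blocks:
    print(address_from_block(block))
```

With the real page you would loop over soup.find_all('p') instead of the sample list; blocks without a Business Type entry simply yield None.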
