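I am trying to extract the domain name from a list of URLs, as in
https://stackoverflow.com/questions/18331948/extract-domain-name-from-the-url
My problem is that the URLs can contain just about anything. To name a few examples: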
m.google.com => google
m.docs.google.com => google
www.someisotericdomain.innersite.mall.co.uk => mall
www.ouruniversity.department.mit.ac.us => mit
www.somestrangeurl.shops.relevantdomain.net => relevantdomain
www.example.info => example
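Use tldextract. Unlike urlparse, which only splits a URL into its components, tldextract relies on the Public Suffix List to accurately separate the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and the subdomains of a URL.
>>> import tldextract
>>> ext = tldextract.extract('http://forums.news.cnn.com/')
>>> ext
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
>>> ext.domain
'cnn'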
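It looks like you can use urlparse (https://docs.python.org/3/library/urllib.parse.html) to parse the URL and then take its netloc.
From the netloc, you can easily extract the domain name with split.
To extract the domain from a URL: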
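The two steps just described (parse, then split the netloc) might look like the sketch below. One assumption worth flagging: urlparse only fills in netloc when the string has a scheme or starts with //, so the hypothetical helper hostname_of (not part of urllib) adds the // for bare hosts such as m.google.com.

```python
from urllib.parse import urlparse


def hostname_of(url: str) -> str:
    """Return the lowercased host of a URL, with or without a scheme.

    hostname_of is an illustrative helper, not a standard-library function.
    """
    # Without a scheme, urlparse treats the whole string as a path,
    # so prefix "//" to mark the start of the network location.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    # .hostname drops any port and userinfo and lowercases the host.
    return urlparse(url).hostname or ""


host = hostname_of("http://forums.news.cnn.com/")
labels = host.split(".")
print(host)    # forums.news.cnn.com
print(labels)  # ['forums', 'news', 'cnn', 'com']
```

Picking labels[-2] works for hosts ending in .com or .net, but it returns co for a host such as mall.co.uk, which is why the suffix needs separate handling.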
from urllib.parse import urlparse

url = "https://stackoverflow.com/questions/44021846/extract-domain-name-from-url-in-python"
domain = urlparse(url).netloc
print(domain)  # stackoverflow.com
To check whether the URL's domain is one of several known domains:
if urlparse(url).netloc in ["domain1", "domain2", "domain3"]:
    ...  # do something
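Comparing netloc against a list only succeeds on an exact match, so www.domain1.com or domain1.com:443 would not match domain1.com. A hedged sketch of a looser check follows; host_matches and the allowed tuple are hypothetical names chosen for illustration.

```python
from urllib.parse import urlparse


def host_matches(url: str, allowed: tuple) -> bool:
    """Return True if the URL's host equals, or is a subdomain of, an allowed domain."""
    # .hostname ignores the port and lowercases the host.
    host = urlparse(url).hostname or ""
    # Require a leading dot for subdomains so "notexample.com" does not match "example.com".
    return any(host == d or host.endswith("." + d) for d in allowed)


allowed = ("stackoverflow.com", "example.info")
print(host_matches("https://www.stackoverflow.com:443/questions", allowed))
print(host_matches("https://notstackoverflow.com/", allowed))
```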
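A simple solution using string splitting (no regular expression is actually needed). Note that it returns the first label after an optional www., so it only works for hosts without any other subdomain:
def domain_name(url):
    return url.split("www.")[-1].split("//")[-1].split(".")[0]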
With a regular expression, you can use something like this:
(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))
https://regex101.com/r/WQXFy6/5
Note that you have to watch out for special cases such as co.uk.
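To try the pattern from Python, a minimal sketch could wrap it with re.search. The names DOMAIN_RE and domain_from_host are assumptions made for this example; only the pattern itself comes from the answer.

```python
import re

# The pattern from the answer: capture the label that precedes either a
# known two-part suffix (co.uk, ac.us) or the final label of the host.
DOMAIN_RE = re.compile(r"(?<=\.)([^.]+)(?:\.(?:co\.uk|ac\.us|[^.]+(?:$|\n)))")


def domain_from_host(host: str):
    """Return the captured label, or None when the pattern does not match."""
    match = DOMAIN_RE.search(host)
    return match.group(1) if match else None


for host in ("m.google.com",
             "www.someisotericdomain.innersite.mall.co.uk",
             "github.com"):
    print(host, "=>", domain_from_host(host))
```

The lookbehind requires a dot before the captured label, so a host with no subdomain at all, such as github.com, yields None and would need a fallback.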
import re

def getDomain(url: str) -> str:
    '''
    Return the domain (the last two labels of the host) from any url
    '''
    # copy the original url text
    clean_url = url
    # take out the port number
    reg = re.findall(':[0-9]+', url)
    if len(reg) > 0:
        url = url.replace(reg[0], '')
    # take out paths routes
    if '/' in url:
        url = url.split('/')
        # select only the host: the third piece after "scheme://", else the first
        url = url[2] if '://' in clean_url else url[0]
    # select only domain (suffixes such as co.uk are not handled)
    url = '.'.join(url.split('.')[-2:])
    return url
from urllib.parse import urlparse

import validators

# `rows` is assumed to be an iterable of text lines whose second
# space-separated field is a URL or a host name.
hostnames = []
counter = 0
errors = 0
for row_orig in rows:
    try:
        row = row_orig.strip().split(' ')[1].strip()
        if len(row) < 5:
            print(f"Empty row {row_orig}")
            errors += 1
            continue
        if row.startswith('http'):
            domain = urlparse(row).netloc  # works for https and http
        else:
            domain = row
        if ':' in domain:
            domain = domain.split(':')[0]  # split at port after clearing http/https protocol
        # Finally validate it
        if validators.domain(domain):
            pass
        elif validators.ipv4(domain):
            pass
        else:
            print(f"Invalid domain/IP {domain}. RAW: {row}")
            errors += 1
            continue
        hostnames.append(domain)
        if counter % 10000 == 1:
            print(f"Added {counter}. Errors {errors}")
        counter += 1
    except Exception:
        # e.g. a row with no second field raises IndexError
        print("Error in extraction")
        errors += 1
tests = {
    "m.google.com": 'google',
    "m.docs.google.com": 'google',
    "www.someisotericdomain.innersite.mall.co.uk": 'mall',
    "www.ouruniversity.department.mit.ac.us": 'mit',
    "www.somestrangeurl.shops.relevantdomain.net": 'relevantdomain',
    "www.example.info": 'example',
    "github.com": 'github',
}

def get_domain(url, loop=0, data=None):
    # data defaults to None to avoid sharing a mutable default between calls
    dot_count = url.count('.')
    if not dot_count:
        raise ValueError("Invalid URL")
    # basic: one or two dots, pick the label by position
    if not loop:
        if dot_count < 3:
            data = {
                'main': url.split('.')[0 if dot_count == 1 else 1]
            }
    # advanced: strip the outer labels and recurse
    if not data and '.' in url:
        if dot_count > 1:
            loop += 1
            start = url.find('.') + 1
            end = url.rfind('.') if dot_count != 2 else None
            return get_domain(url[start:end], loop, data)
        else:
            data = {
                'main': url.split('.')[-1]
            }
    return data

for u, v in tests.items():
    print(get_domain(u), 'expected:', v)
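The recursive function above infers the position of the domain from the number of dots. An alternative sketch, under the assumption that the multi-label suffixes are known in advance, looks them up explicitly. MULTI_LABEL_SUFFIXES and registered_label are hypothetical names, and the two-entry set is purely illustrative; a real list such as the Public Suffix List is far longer.

```python
# Illustrative only: a tiny, hand-written set of two-label suffixes.
MULTI_LABEL_SUFFIXES = {"co.uk", "ac.us"}


def registered_label(host: str, suffixes=MULTI_LABEL_SUFFIXES) -> str:
    """Return the label immediately to the left of the suffix."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2:
        raise ValueError(f"not a dotted host name: {host!r}")
    # A two-label suffix pushes the registered label one position further left.
    if len(labels) >= 3 and ".".join(labels[-2:]) in suffixes:
        return labels[-3]
    return labels[-2]


# Same cases as the test table in the answer above.
cases = {
    "m.google.com": "google",
    "m.docs.google.com": "google",
    "www.someisotericdomain.innersite.mall.co.uk": "mall",
    "www.ouruniversity.department.mit.ac.us": "mit",
    "www.somestrangeurl.shops.relevantdomain.net": "relevantdomain",
    "www.example.info": "example",
    "github.com": "github",
}
for host, expected in cases.items():
    print(host, "=>", registered_label(host), "expected:", expected)
```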