Get protocol and domain (without subdomain) from a URL

Question

This is an extension of "Get protocol + host name from URL", with the added requirement that I want only the domain name, not the subdomain.

So, for example:

Input: classes.usc.edu/xxx/yy/zz
Output: usc.edu

Input: mail.google.com
Output: google.com

Input: google.co.uk
Output: google.co.uk

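For more context: I accept one or more seed URLs from the user and then run a Scrapy crawler on the links. I need the domain name (without the subdomain) to set the spider's allowed_domains attribute.

I also looked at "Python urlparse -- extract domain name without subdomain", but the answers there seem outdated.

My current code uses urlparse, but it also returns the subdomain, which I don't want: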

from urllib.parse import urlparse

uri = urlparse('https://classes.usc.edu/term-20191/classes/csci/')
f'{uri.scheme}://{uri.netloc}/'
# 'https://classes.usc.edu/'

Is there a (hopefully stdlib) way to get only the domain in python-3.x?

Answer 1

I use tldextract whenever I need to parse domains.

In your case, you just need to combine domain + suffix:

import tldextract
tldextract.extract('mail.google.com')
Out[756]: ExtractResult(subdomain='mail', domain='google', suffix='com')
tldextract.extract('classes.usc.edu/xxx/yy/zz')
Out[757]: ExtractResult(subdomain='classes', domain='usc', suffix='edu')
tldextract.extract('google.co.uk')
Out[758]: ExtractResult(subdomain='', domain='google', suffix='co.uk')
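
To get the protocol as well, the domain + suffix pair can be joined with the scheme that urlparse reports. The following is only a rough sketch: the helper name base_url, the extract parameter, and the fallback to https for inputs without a scheme are all assumptions of mine, not part of tldextract or the original question.

```python
from urllib.parse import urlparse


def base_url(url, extract=None, default_scheme="https"):
    """Return 'scheme://domain.suffix/' with any subdomain dropped.

    base_url, extract and default_scheme are illustrative names,
    not part of tldextract or the standard library.
    """
    if extract is None:
        # Third-party dependency: pip install tldextract
        import tldextract
        extract = tldextract.extract

    parts = extract(url)
    if not parts.domain or not parts.suffix:
        # e.g. 'localhost' or a bare IP address has no public suffix
        raise ValueError(f"no registrable domain found in {url!r}")

    # Only trust urlparse's scheme when the input really contains one;
    # inputs like 'mail.google.com' fall back to default_scheme (an assumption).
    scheme = urlparse(url).scheme if "://" in url else default_scheme
    return f"{scheme}://{parts.domain}.{parts.suffix}/"
```

With the extraction results shown above, base_url('https://classes.usc.edu/term-20191/classes/csci/') should give 'https://usc.edu/'. For the Scrapy setting, the bare f"{parts.domain}.{parts.suffix}" string without scheme or slashes is probably what you want. The extract parameter is there so that a preconfigured extractor, for example tldextract.TLDExtract(suffix_list_urls=()) to skip the live suffix-list download, can be passed in.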
