Regex to extract domains from text, excluding domains that are part of an email address

Question

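I'm new here, and any ideas would be greatly appreciated. I need a regular expression to extract domains from text. The text contains both email addresses and domain names. Extracting the emails is no problem, but extracting the domains is trickier.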

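For example, take a text made up of emails and domains: " [email protected], google.com, www.msn.com, [email protected], [email protected], somesite.com, bbc.co.uk "

Using the regex "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,4}" I extract the emails: [email protected], [email protected], [email protected].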

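That works, but I can't find or build a regex that extracts only the domains. I found some older questions here, but their answers also match domains that are part of an email address, and I need a domain only when it stands alone, not as part of an email address. It should also handle domains such as "co.uk", i.e. a second-level domain plus a root domain. The goal is to extract just " google.com, www.msn.com, bbc.co.uk " from the list of domains and emails given above.

Any ideas are appreciated.

Thanks

I looked for an answer in other threads, but none addressed my specific problem. I also tried ChatGPT, but it wasn't smart enough to understand exactly what I need.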

Answer 1

import re

# Input text
text = " [email protected], google.com, www.msn.com, [email protected], [email protected], somesite.com, bbc.co.uk "

# Split the text on commas
items = [item.strip() for item in text.split(',')]

# Regex pattern for a bare domain name: two labels plus up to two further dot-separated parts
# (covers www.msn.com and bbc.co.uk). Because re.fullmatch is used below, any item containing
# '@' (an email address) is rejected as a whole; the (?<![@]) lookbehind is only an extra guard.
pattern = r'\b(?<![@])((?:\w+|\w[\w\-]*\w)\.(?:\w+|\w[\w\-]*\w)(?:\.\w+){0,2})\b'

# Filter items and apply regex
domains = [item for item in items if re.fullmatch(pattern, item)]

print(domains)
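
If the text is not reliably comma-separated, one alternative is to skip the splitting step and let a single pattern match either an email or a standalone domain, keeping only the domain matches. The following is a minimal sketch of that idea, not a full validator. The names EMAIL, DOMAIN, COMBINED and extract_standalone_domains are hypothetical, and the sample addresses are made up because the ones in the question are obfuscated.

```python
import re

# Email pattern in the spirit of the one in the question, with the dot before
# the TLD escaped so that it matches a literal '.'.
EMAIL = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# One or more dot-terminated labels followed by an alphabetic TLD.
# This is an illustrative approximation, not a complete hostname grammar.
DOMAIN = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}'

# The email alternative is tried first and consumes the whole address, so the
# domain inside an email can never be matched by the capturing group.
COMBINED = re.compile(rf'{EMAIL}|\b({DOMAIN})\b')


def extract_standalone_domains(text):
    """Return domains that appear on their own, skipping email addresses."""
    return [m.group(1) for m in COMBINED.finditer(text) if m.group(1)]


# Made-up sample data (assumption: the real addresses are not known).
sample = ("alice@example.com, google.com, www.msn.com, "
          "bob.smith@mail.example.org, somesite.com, bbc.co.uk")

print(extract_standalone_domains(sample))
```

Because emails are matched and discarded rather than excluded with a lookbehind, this also works when the separators are spaces, semicolons, or line breaks.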
