I have to check thousands of strings, and I need to extract the full URLs that contain instagram.com/p/.
So far I am using this approach:
import re

msg = 'hello there http://instagram.com/p/BvluRHRhN16/'
msg = re.findall(
    r'http[s]?://?[\w/\-?=%.]+instagram.com/p/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
    msg)
print(msg)
However, there are some URLs it does not find.
I want to get all URLs of the following forms:
https://instagram.com/p/BvluRHRhN16/
https://www.instagram.com/p/BvluRHRhN16/
http://instagram.com/p/BvluRHRhN16/
https://www.instagram.com/p/BvluRHRhN16/
www.instagram.com/p/BvluRHRhN16/
What is the fastest way to get this result?
url = '''
'hello there http://google.com/p/BvluRHRhN16/ this is a test',
'hello there https://www.instagram.com/p/BvluRHRhN16/',
'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test'
'''
from urlextract import URLExtract
extractor = URLExtract()
urls = extractor.find_urls(url)
print(urls)
Output: ['http://google.com/p/BvluRHRhN16/', 'https://www.instagram.com/p/BvluRHRhN16/', 'www.instagram.com/p/BvluRHRhN16/', 'https://www.instagram.net/p/BvluRHRhN16/']
Edit: filtering the URLs
filtered = ([item for item in urls if "instagram.com/p/" in item])
print(filtered)
Output: ['https://www.instagram.com/p/BvluRHRhN16/', 'www.instagram.com/p/BvluRHRhN16/']
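Note that the substring test "instagram.com/p/" in item would also accept a host such as notinstagram.com. If that matters, a stricter filter could compare the host name exactly. The following is only a sketch using urllib.parse from the standard library; the helper name is_instagram_post and the list of accepted hosts are assumptions made for illustration, not part of urlextract.

```python
from urllib.parse import urlsplit

def is_instagram_post(url):
    """Hypothetical helper: True if url is an instagram.com/p/... link."""
    # urlsplit only recognizes the host when a scheme or '//' is present,
    # so prepend '//' to scheme-less URLs such as 'www.instagram.com/p/...'.
    if "://" not in url:
        url = "//" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: only the bare domain and the www subdomain should count.
    return host in ("instagram.com", "www.instagram.com") and parts.path.startswith("/p/")

urls = [
    "http://google.com/p/BvluRHRhN16/",
    "https://www.instagram.com/p/BvluRHRhN16/",
    "www.instagram.com/p/BvluRHRhN16/",
    "https://www.instagram.net/p/BvluRHRhN16/",
    "http://notinstagram.com/p/BvluRHRhN16/",
]
filtered = [u for u in urls if is_instagram_post(u)]
print(filtered)
```

With this filter the notinstagram.com entry is dropped, whereas the plain substring test would keep it.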
I assume the input is a list of sentences containing URLs. Hope this helps.
import re

msg = ['hello there http://google.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.com/p/BvluRHRhN16/',
       'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test'
       ]
for m in msg:
    # Match text starting with http or www that mentions instagram and /p
    ms = re.findall(r'(http.*instagram.+\/p.+|www.*instagram.+\/p.+)', m)
    print(ms)
Edited regex:
ms = re.findall(r'(http.*instagram\.com\/p.+\/|www.*instagram\.com\/p.+\/)', m)
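Since the question involves thousands of strings, it may be worth compiling one tighter pattern and reusing it. The sketch below is one possible variant, not the only correct one. It assumes that the scheme and the www. prefix are both optional and that a post ID consists only of letters, digits, underscores and hyphens; the names INSTAGRAM_POST and messages are made up for the example.

```python
import re

# Assumed shape: optional scheme, optional "www.", then instagram.com/p/<id>.
# The lookbehind rejects matches that start in the middle of another host
# name, e.g. "notinstagram.com" or "foo.instagram.com".
INSTAGRAM_POST = re.compile(
    r"(?<![\w.-])(?:https?://)?(?:www\.)?instagram\.com/p/[\w-]+/?",
    re.IGNORECASE,
)

messages = [
    "hello there http://google.com/p/BvluRHRhN16/ this is a test",
    "hello there https://www.instagram.com/p/BvluRHRhN16/",
    "hello there www.instagram.com/p/BvluRHRhN16/ this is a test",
    "hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test",
]

# The pattern has no capturing groups, so findall returns the full matches.
found = [url for m in messages for url in INSTAGRAM_POST.findall(m)]
print(found)
```

Unlike the greedy .+ patterns, the character class [\w-]+ stops at the first space, so trailing text such as " this is a test" cannot end up in the match.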