Fastest way to get a URL from inside a string

Question (votes: 0, answers: 2)

I have to check thousands of strings, and from each one I need to extract the full URL that contains instagram.com/p/.

So far I am using this approach:

import re

msg = 'hello there http://instagram.com/p/BvluRHRhN16/'
# Raw string so the backslashes reach the regex engine unchanged
msg = re.findall(
    r'http[s]?://?[\w/\-?=%.]+instagram.com/p/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
    msg)
print(msg)

However, there are some URLs it does not find.

I would like to get all URLs like the following:

https://instagram.com/p/BvluRHRhN16/
https://www.instagram.com/p/BvluRHRhN16/
http://instagram.com/p/BvluRHRhN16/
https://www.instagram.com/p/BvluRHRhN16/
www.instagram.com/p/BvluRHRhN16/

How can I get this result in the fastest way?

python regex findall
2 answers

1 vote
from urlextract import URLExtract  # third-party package: pip install urlextract

# Sample text containing several URLs
url = '''
'hello there http://google.com/p/BvluRHRhN16/ this is a test',
'hello there https://www.instagram.com/p/BvluRHRhN16/',
'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test'
'''

extractor = URLExtract()
urls = extractor.find_urls(url)  # every URL found in the text
print(urls)

Output: ['http://google.com/p/BvluRHRhN16/', 'https://www.instagram.com/p/BvluRHRhN16/', 'www.instagram.com/p/BvluRHRhN16/', 'https://www.instagram.net/p/BvluRHRhN16/']

Edit: filtering the URLs

# Keep only the URLs that contain the Instagram post prefix
filtered = [item for item in urls if "instagram.com/p/" in item]

print(filtered)

Output: ['https://www.instagram.com/p/BvluRHRhN16/', 'www.instagram.com/p/BvluRHRhN16/']
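
A substring test like the one above also accepts URLs that only mention instagram.com/p/ somewhere else, for example in a query string. If that matters, one possible variation is to check the host and path separately with the standard library's urllib.parse. This is only a sketch: the helper name is_instagram_post and the accepted host list are assumptions, not part of the answer.

```python
from urllib.parse import urlsplit

def is_instagram_post(url):
    """Hypothetical helper: True if url points at instagram.com/p/..."""
    # urlsplit only recognises the host when a scheme or "//" is present,
    # so prefix scheme-less URLs such as "www.instagram.com/..." first.
    if '://' not in url:
        url = '//' + url
    parts = urlsplit(url)
    host = (parts.hostname or '').lower()
    # Assumption: only these two host names count as Instagram.
    return host in ('instagram.com', 'www.instagram.com') and parts.path.startswith('/p/')

urls = ['http://google.com/p/BvluRHRhN16/',
        'https://www.instagram.com/p/BvluRHRhN16/',
        'www.instagram.com/p/BvluRHRhN16/',
        'https://www.instagram.net/p/BvluRHRhN16/']

filtered = [u for u in urls if is_instagram_post(u)]
print(filtered)
```

On the sample list this keeps the same two URLs as the substring filter, while rejecting something like http://example.com/?u=instagram.com/p/x.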


1 vote

I assume the input is a list of sentences that contain URLs. Hope this helps.

import re

msg = ['hello there http://google.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.com/p/BvluRHRhN16/',
       'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test'
      ]

for m in msg:
    # Loose first attempt: also matches instagram.net and runs to the end of the line
    ms = re.findall(r'(http.*instagram.+\/p.+|www.*instagram.+\/p.+)', m)
    print(ms)

Edit: a regex restricted to instagram.com that stops at the last slash:

ms = re.findall(r'(http.*instagram\.com\/p.+\/|www.*instagram\.com\/p.+\/)', m)
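
Since the question asks about speed over thousands of strings, a tighter, precompiled pattern may be worth trying. The sketch below is not from either answer. It assumes a post URL is an optional http/https scheme, an optional "www.", then instagram.com/p/ followed by a shortcode made of letters, digits, "_" and "-". The names INSTAGRAM_POST and find_instagram_posts are made up for illustration.

```python
import re

# Assumed URL shape: [http(s)://][www.]instagram.com/p/<shortcode>[/]
# Compiled once so the pattern is not re-parsed for every string.
INSTAGRAM_POST = re.compile(r'(?:https?://)?(?:www\.)?instagram\.com/p/[\w-]+/?')

def find_instagram_posts(text):
    """Return every Instagram post URL found in text (hypothetical helper)."""
    # Cheap substring test first, so strings without the prefix skip the regex.
    if 'instagram.com/p/' not in text:
        return []
    return INSTAGRAM_POST.findall(text)

msg = ['hello there http://google.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.com/p/BvluRHRhN16/',
       'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test']

for m in msg:
    print(find_instagram_posts(m))
```

Because the pattern has no greedy ".*" or ".+", each match ends at the shortcode instead of depending on where the last slash in the line falls. One caveat: it would also pick up the tail of a host such as notinstagram.com, so a stricter check is needed if such inputs are possible.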