Fastest way to get a URL inside a string

Question

I have to check thousands of strings, and I need to extract the full URL from each one that contains instagram.com/p/.

So far I am using this approach:

import re

msg = 'hello there http://instagram.com/p/BvluRHRhN16/'
# Raw string so the regex escapes are not treated as string escapes
urls = re.findall(
    r'http[s]?://?[\w/\-?=%.]+instagram.com/p/(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
    msg)
print(urls)

But there are some URLs it does not find.

I want to get all URLs like the following:

https://instagram.com/p/BvluRHRhN16/
https://www.instagram.com/p/BvluRHRhN16/
http://instagram.com/p/BvluRHRhN16/
https://www.instagram.com/p/BvluRHRhN16/
www.instagram.com/p/BvluRHRhN16/

What is the fastest way to get this result?

Answer 1

from urlextract import URLExtract

# Sample text containing several URLs, only some of them Instagram posts
text = '''
'hello there http://google.com/p/BvluRHRhN16/ this is a test',
      'hello there https://www.instagram.com/p/BvluRHRhN16/',
      'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
      'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test'
'''

# Extract every URL found in the text
extractor = URLExtract()
urls = extractor.find_urls(text)
print(urls)

Output: ['http://google.com/p/BvluRHRhN16/', 'https://www.instagram.com/p/BvluRHRhN16/', 'www.instagram.com/p/BvluRHRhN16/', 'https://www.instagram.net/p/BvluRHRhN16/']

Edit: filtering the URLs

filtered = [item for item in urls if "instagram.com/p/" in item]

print(filtered)

Output: ['https://www.instagram.com/p/BvluRHRhN16/', 'www.instagram.com/p/BvluRHRhN16/']
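One caveat with a plain substring test is that it also accepts hosts that merely end in the same text (for example a made-up notinstagram.com) or URLs that carry the text in a query string. If that matters, a stricter check could parse each URL with the standard library. This is only a sketch: the helper name is_instagram_post and the list of accepted hosts are my assumptions, not part of the answer above.

```python
from urllib.parse import urlsplit

# Hosts treated as Instagram here (an assumption; extend if needed)
ALLOWED_HOSTS = ("instagram.com", "www.instagram.com")

def is_instagram_post(url):
    """Return True if url points at instagram.com/p/... (hypothetical helper)."""
    # urlsplit only recognises the host when a scheme or "//" is present,
    # so prefix bare "www.instagram.com/..." style URLs with "//".
    if "://" not in url:
        url = "//" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return host in ALLOWED_HOSTS and parts.path.startswith("/p/")

candidates = [
    "https://www.instagram.com/p/BvluRHRhN16/",
    "www.instagram.com/p/BvluRHRhN16/",
    "https://www.instagram.net/p/BvluRHRhN16/",
    "http://notinstagram.com/p/BvluRHRhN16/",
]
filtered = [u for u in candidates if is_instagram_post(u)]
print(filtered)
```

With these sample inputs only the first two URLs pass the check.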

Answer 2

I'm assuming the input is a list of sentences containing URLs. Hope this helps.

import re

msg = ['hello there http://google.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.com/p/BvluRHRhN16/',
       'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test'
      ]

for m in msg:
    # Match from "http" or "www" through "instagram" and "/p" to the end of the line
    ms = re.findall(r'(http.*instagram.+\/p.+|www.*instagram.+\/p.+)', m)
    print(ms)

Edit: a stricter regex that only matches instagram.com and stops at the last slash:

ms = re.findall(r'(http.*instagram\.com\/p.+\/|www.*instagram\.com\/p.+\/)', m)
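Because the patterns above use greedy `.*` and `.+`, a line that contains more than one URL, or text after the URL, can produce a match that is longer than the URL itself. As an alternative, here is a rough sketch of a tighter pattern, compiled once and reused for every string. The names INSTAGRAM_POST and find_instagram_posts are mine, and I am assuming the post code consists only of letters, digits, "_" and "-". The substring pre-check is there to skip the regex for strings that cannot match; whether it actually helps on your data is something to measure (for example with timeit).

```python
import re

# Optional scheme, optional "www.", then instagram.com/p/<code> and an
# optional trailing slash. The lookbehind stops a match from starting in the
# middle of a longer host name such as "notinstagram.com".
INSTAGRAM_POST = re.compile(
    r"(?<![\w.-])(?:https?://)?(?:www\.)?instagram\.com/p/[\w-]+/?",
    re.IGNORECASE,
)

def find_instagram_posts(text):
    """Return all instagram.com/p/ URLs found in text (hypothetical helper)."""
    # Cheap substring test first; skip the regex when it cannot match
    if "instagram.com/p/" not in text.lower():
        return []
    return INSTAGRAM_POST.findall(text)

msg = ['hello there http://google.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.com/p/BvluRHRhN16/',
       'hello there www.instagram.com/p/BvluRHRhN16/ this is a test',
       'hello there https://www.instagram.net/p/BvluRHRhN16/ this is a test',
       'two links: http://instagram.com/p/AAA/ and https://instagram.com/p/BBB/']

results = [find_instagram_posts(m) for m in msg]
for r in results:
    print(r)
```

Note that this sketch does not match other subdomains (such as m.instagram.com); the pattern would need adjusting if those should count.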
