Detecting URLs with spaCy

import spacy
text = "Schedule time with me Google<https://nam02.safelinks.protection.outlook.com/?url=http://calendly.com/helloKitty>"
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer.explain(text)

Output

[('TOKEN', 'Schedule'),
 ('TOKEN', 'time'),
 ('TOKEN', 'with'),
 ('TOKEN', 'me'),
 ('TOKEN', 'Google'),
 ('INFIX', '<'),
 ('TOKEN', 'https://nam02.safelinks.protection.outlook.com/?url'),
 ('INFIX', '='),
 ('TOKEN', 'http://calendly.com'),
 ('INFIX', '/'),
 ('TOKEN', 'helloKitty'),
 ('SUFFIX', '>')]

I know the data I'm passing in is badly formatted, but my data will always come in the form

alphaNumericCharacters<https

One workaround would be to insert a space before the `<`, but since I care about the exact indices at which matches occur, that would compromise the integrity of the data.

What is the best way to extract the URL as a single token without compromising data integrity?


Update:

The actual URLs I'm trying to detect are much messier than what I originally posted (they have sensitive data embedded in them, so I was hesitant to post them here, sorry about that; I've made more representative dummy URLs below).

A regex that matches the URLs I want to catch is

(?<=<)https?.*?(?=>)

A working example can be found here.
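The regex above can be exercised directly with Python's `re` module; `finditer` yields both the matched URL and the exact character offsets, so the original string never has to be modified:

```python
import re

# The lookaround regex from above: a URL enclosed in <...>, matched non-greedily.
URL_RE = re.compile(r"(?<=<)https?.*?(?=>)")

text = (
    "Schedule time with me Google"
    "<https://nam02.safelinks.protection.outlook.com/?url=http://calendly.com/helloKitty>"
)

for m in URL_RE.finditer(text):
    # m.start() / m.end() are offsets into the untouched original text
    print(m.start(), m.end(), m.group())
```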

Note: I need the normal

token.like_url

behaviour to keep working as expected, since this is just one edge case of URLs that I'm trying to add to the detection process.
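For reference, the default behaviour mentioned here is easy to check with a blank pipeline: spaCy's tokenizer keeps a whitespace-delimited URL together, and `token.like_url` flags it (a minimal check, no trained model required):

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained pipeline needed
doc = nlp("Schedule time with me at https://calendly.com/helloKitty today")
print([t.text for t in doc if t.like_url])
```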

John Smith
CEO| Github
Website<https://asia04.safelinks.protection.outlook.com/?url=https://github.com/?utm_source=Email&utm_medium=Signature&data=04|01|[email protected]|e96d7cd187574884454708f8772e91g2|cd1e96k8h3da442a930b235cac54cd5c|0|0|756390346193879231|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLDJZVCD6Mn0=|1010&sdata=JULP0n6z7CuNKTqTm1uNje6LV2VyPZ2h1C43m5gHWTs=&reserved=0> | Twitter<https://asia04.safelinks.protection.outlook.com/?url=https://twitter.com/Github&data=04|01|[email protected]|d07d7Cb187574996654709d8772e92f3|cd6a8a3da442a930b235cac24db5c|0|0|66789642193359231|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLDJZVCD6Mn0=|1000&sdata=lZ+LJJkALu4aI1mq6xllSFGGscukF1tc2bLJi4Ys9LE=&reserved=0> | LinkedIn<https://asia04.safelinks.protection.outlook.com/?url=https://www.linkedin.com/company/github/&data=04|01|[email protected]|d06d7ab187574664454709d8662e81f3|cd1e96a8a3da442a904b235cac24db5c|0|0|657480399193999229|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLDMBTiI6Ik1haWwiLDJZVCD6Mn0=|1000&sdata=F1CDkxQBSHCKyG9oosromvwaLqmaB/KrgyysnSGy/SM=&reserved=0>
import spacy
from spacy.lang.tokenizer_exceptions import URL_PATTERN
text = "John Smith\nCEO| Github\nWebsite<https://asia04.safelinks.protection.outlook.com/?url=https://github.com/?utm_source=Email&utm_medium=Signature&data=04|01|[email protected]|e96d7cd187574884454708f8772e91g2|cd1e96k8h3da442a930b235cac54cd5c|0|0|756390346193879231|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLDJZVCD6Mn0=|1010&sdata=JULP0n6z7CuNKTqTm1uNje6LV2VyPZ2h1C43m5gHWTs=&reserved=0> | Twitter<https://asia04.safelinks.protection.outlook.com/?url=https://twitter.com/Github&data=04|01|[email protected]|d07d7Cb187574996654709d8772e92f3|cd6a8a3da442a930b235cac24db5c|0|0|66789642193359231|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLDJZVCD6Mn0=|1000&sdata=lZ+LJJkALu4aI1mq6xllSFGGscukF1tc2bLJi4Ys9LE=&reserved=0> | LinkedIn<https://asia04.safelinks.protection.outlook.com/?url=https://www.linkedin.com/company/github/&data=04|01|[email protected]|d06d7ab187574664454709d8662e81f3|cd1e96a8a3da442a904b235cac24db5c|0|0|657480399193999229|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLDMBTiI6Ik1haWwiLDJZVCD6Mn0=|1000&sdata=F1CDkxQBSHCKyG9oosromvwaLqmaB/KrgyysnSGy/SM=&reserved=0>"
nlp = spacy.load("en_core_web_sm")
custom_infixes = [URL_PATTERN[1:-1]] + list(nlp.Defaults.infixes)
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(custom_infixes).finditer
for t in nlp.tokenizer.explain(nlp(text).text):
    print(t)

Output

('TOKEN', 'John')
('TOKEN', 'Smith')
('TOKEN', 'CEO|')
('TOKEN', 'Github')
('INFIX', 'Website<https://asia04.safelinks.protection.outlook.com/?url=https://github.com/?utm_source=Email&utm_medium=Signature&data=04|01|[email protected]')
('TOKEN', '|e96d7cd187574884454708f8772e91g2|cd1e96k8h3da442a930b235cac54cd5c|0|0|756390346193879231|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLDJZVCD6Mn0=|1010&sdata')
('INFIX', '=')
('TOKEN', 'JULP0n6z7CuNKTqTm1uNje6LV2VyPZ2h1C43m5gHWTs=&reserved=0')
('SUFFIX', '>')
('TOKEN', '|')
('INFIX', 'Twitter<https://asia04.safelinks.protection.outlook.com/?url=https://twitter.com/Github&data=04|01|[email protected]')
('TOKEN', '|d07d7Cb187574996654709d8772e92f3|cd6a8a3da442a930b235cac24db5c|0|0|66789642193359231|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLDJZVCD6Mn0=|1000&sdata')
('INFIX', '=')
('TOKEN', 'lZ+LJJkALu4aI1mq6xllSFGGscukF1tc2bLJi4Ys9LE=&reserved=0')
('SUFFIX', '>')
('TOKEN', '|')
('INFIX', 'LinkedIn<https://asia04.safelinks.protection.outlook.com/?url=https://www.linkedin.com/company/github/&data=04|01|[email protected]')
('TOKEN', '|d06d7ab187574664454709d8662e81f3|cd1e96a8a3da442a904b235cac24db5c|0|0|657480399193999229|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLDMBTiI6Ik1haWwiLDJZVCD6Mn0=|1000&sdata')
('INFIX', '=')
('TOKEN', 'F1CDkxQBSHCKyG9oosromvwaLqmaB')
('INFIX', '/')
('TOKEN', 'KrgyysnSGy')
('INFIX', '/')
('TOKEN', 'SM=&reserved=0')
('SUFFIX', '>')

I tried adding a custom URL regex for this case to the infixes, but that didn't work either.

import spacy
from spacy.lang.tokenizer_exceptions import URL_PATTERN
text = "John Smith\nCEO| Github\nWebsite<https://asia04.safelinks.protection.outlook.com/?url=https://github.com/?utm_source=Email&utm_medium=Signature&data=04|01|[email protected]|e96d7cd187574884454708f8772e91g2|cd1e96k8h3da442a930b235cac54cd5c|0|0|756390346193879231|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLDJZVCD6Mn0=|1010&sdata=JULP0n6z7CuNKTqTm1uNje6LV2VyPZ2h1C43m5gHWTs=&reserved=0> | Twitter<https://asia04.safelinks.protection.outlook.com/?url=https://twitter.com/Github&data=04|01|[email protected]|d07d7Cb187574996654709d8772e92f3|cd6a8a3da442a930b235cac24db5c|0|0|66789642193359231|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLDJZVCD6Mn0=|1000&sdata=lZ+LJJkALu4aI1mq6xllSFGGscukF1tc2bLJi4Ys9LE=&reserved=0> | LinkedIn<https://asia04.safelinks.protection.outlook.com/?url=https://www.linkedin.com/company/github/&data=04|01|[email protected]|d06d7ab187574664454709d8662e81f3|cd1e96a8a3da442a904b235cac24db5c|0|0|657480399193999229|Unknown|WTFpbZGsb3d8eyQWIkoiDC4wLPjJmMDAiLCJQIjoiV2luMzIiLDMBTiI6Ik1haWwiLDJZVCD6Mn0=|1000&sdata=F1CDkxQBSHCKyG9oosromvwaLqmaB/KrgyysnSGy/SM=&reserved=0>"
nlp = spacy.load("en_core_web_sm")
custom_infixes = [r'(?<=<)https?.*?(?=>)']  + list(nlp.Defaults.infixes)
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(custom_infixes).finditer
for t in nlp.tokenizer.explain(nlp(text).text):
    print(t)
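One approach not shown in the question (a sketch, assuming spaCy v3): leave both the text and the default tokenization untouched, find the `<...>`-wrapped URLs with the regex, and merge each matched region back into a single token using `Doc.char_span` plus the retokenizer. Character offsets (`token.idx`) are preserved because the string itself is never edited:

```python
import re
import spacy

URL_RE = re.compile(r"(?<=<)https?.*?(?=>)")

nlp = spacy.blank("en")  # the same merge works with en_core_web_sm
text = (
    "Schedule time with me Google"
    "<https://nam02.safelinks.protection.outlook.com/?url=http://calendly.com/helloKitty>"
)
doc = nlp(text)

with doc.retokenize() as retokenizer:
    for m in URL_RE.finditer(doc.text):
        # "expand" snaps the character span outward to token boundaries
        span = doc.char_span(m.start(), m.end(), alignment_mode="expand")
        if span is not None and len(span) > 1:
            retokenizer.merge(span)

print([(t.text, t.idx) for t in doc if t.text.startswith("http")])
```

Because nothing is inserted into the text, the merged token's `idx` still points at the original character position of the URL.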
1 Answer
import spacy

nlp = spacy.blank("en")
doc = nlp(text)  # `text` is the input string from the question
# like_url flags tokens that look like URLs
url_list = [token.text for token in doc if token.like_url]
print(url_list)