How do I detect and "fix" URLs that contain spaces in a body of text?

Question

Suppose I am given the following text. Note the spaces in most of the URLs:

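According to NASA (htt ps://www.nasa.gov) and the New York Times (https://www.nytimes.com/topic/organization/national-aeronautics- and-space -administration), scientists are making lots of new discoveries! There are all kinds of exciting new findings. The astro-ph category of ArXiv (https://arxiv.org/list/astro-ph.GA/new) lists a bunch of new research that is going on. Lucky me! A Google search (https://www.google.com/) turned up more new discoveries!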

I want to detect the URLs and replace them with the corrected URLs using Python.

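The corrected URLs for this text are:

https://www.nasa.gov
https://www.nytimes.com/topic/organization/national-aeronautics-and-space-administration
https://arxiv.org/list/astro-ph.GA/new
https://www.google.com/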

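Not all URLs contain spaces. Some URLs contain several. A space can appear in any part of the URL. I can assume the URLs are web URLs (http/https). I think I can assume only spaces occur (no tabs or newlines), and that there is never more than one consecutive space. I think I can also assume that tokens/words are not broken by a space; in other words, the spaces will sit next to punctuation.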

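Note: My question is similar to this question, except that the URLs I want to fix are in written text, the spaces may be in any part of the URL, and I am limiting myself to web URLs.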

Note: I am currently using the excellent (if somewhat overkill) Liberal Regex Pattern for Web URLs, but it does not seem to be up to this job.

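Note: I need to both detect and replace the URLs. For my own use, I scan the text and convert it to LaTeX. URLs are converted to hyperlinks via the \href{}{} command. To do that, I need to detect both good and bad URLs, fix any bad ones, create the hyperlink using the corrected URL, and then replace the original good or bad URL in the body text with the corrected one.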

Answer 1

I assume the URLs are between ( and ). Then you can try using the re module and urlparse() to check the URLs:

import re
from urllib.parse import urlparse

text = """\
According to NASA (htt ps://www.nasa. gov) and the New York Times (https://www.nytimes.com/topic/organization/national-aeronautics- and-space -administration), scientists are making lots of new discoveries! There are all kinds of exciting new findings. The astro-ph category of ArXiv (https:// arxiv.org /list/astro-ph.GA/new) lists a bunch of new research that is going on. Lucky me! A Google search (https://www.google.com/) turned up more new discoveries!
"""

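# Match "(" followed by something that starts like "http" (the letters may
# be separated by spaces), then capture everything up to the closing ")".
pat = r"\(\s*(h\s*t\s*t\s*p[^)]+)"

for url in re.findall(pat, text):
    # Drop the stray spaces inside the captured URL.
    url = url.replace(" ", "")

    # try to parse the URL; urlparse() is lenient and only raises
    # ValueError for a few malformed inputs (e.g. a bad IPv6 netloc):
    try:
        urlparse(url)
    except ValueError:
        continue

    print(url)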

Prints:

https://www.nasa.gov
https://www.nytimes.com/topic/organization/national-aeronautics-and-space-administration
https://arxiv.org/list/astro-ph.GA/new
https://www.google.com/
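
The question also asks for the URLs to be replaced in the text and wrapped in \href{}{}. A minimal sketch of that step, under the same assumption that every URL sits between ( and ), could use re.sub() with a replacement function. The names PAT, fix_url, to_href and link_urls are hypothetical helpers introduced here for illustration, and the extra scheme/netloc check is an added assumption, since urlparse() on its own accepts almost any string:

```python
import re
from urllib.parse import urlparse

# Assumption carried over from the answer: each URL is wrapped in "(" and ")".
PAT = re.compile(r"\(\s*(h\s*t\s*t\s*p[^)]+)\)")


def fix_url(raw):
    """Return raw with spaces removed, or None if it is not an http(s) URL."""
    url = raw.replace(" ", "")
    try:
        parts = urlparse(url)
    except ValueError:
        return None
    # urlparse() is lenient, so also require a web scheme and a host.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return url


def to_href(match):
    """Build the replacement text for one parenthesized URL match."""
    url = fix_url(match.group(1))
    if url is None:
        return match.group(0)  # leave unrecognized text untouched
    return r"(\href{%s}{%s})" % (url, url)


def link_urls(text):
    """Replace every parenthesized URL in text with a cleaned \\href link."""
    return PAT.sub(to_href, text)


if __name__ == "__main__":
    sample = "NASA (htt ps://www.nasa. gov) and Google (https://www.google.com/)."
    print(link_urls(sample))
```

Because the replacement is a function, its return value is inserted literally, so the backslash in \href needs no extra escaping for re. The sketch does not escape LaTeX special characters such as %, # or _ that may occur in a URL; that would need handling separately.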
