当我尝试使用 urllib 发送请求时,出现 InvalidURL: URL can't contains control characters

问题描述 投票:0回答:8

我试图从用作 urllib 请求参数的链接获取 JSON 响应。但它给了我一个错误,它不能包含控制字符。

我该如何解决这个问题?

start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
    
source = urllib.request.urlopen(start_url).read()

我得到的错误是:

URL can't contain control characters. '/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq=' (found at least ' ')
python web-scraping beautifulsoup urllib
8个回答
24
投票

将空格替换为:

url = url.replace(" ", "%20")

如果问题出在空格上。


5
投票

URL 中不允许有空格,我删除了它们,现在似乎可以正常工作了:

import urllib.request
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
url = start_url.replace(" ","")
source = urllib.request.urlopen(url).read()

4
投票

Solr 搜索字符串可能会变得非常奇怪。在发出请求之前最好使用“quote”方法对字符进行编码。请参阅下面的示例:

from urllib.parse import quote

start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
    
source = urllib.request.urlopen(quote(start_url)).read()

迟做总比不做好...


2
投票

你现在可能已经发现了,但让我们把它写在这里。

URL 中不能有任何 space 字符,bundle_fq e dm_field_deadlineTo_fq

之后有 2 个

删除这些就可以了


1
投票

先解析 url,然后对 url 元素进行编码即可。

import urllib.request
from urllib.parse import urlparse, quote

def make_safe_url(url: str) -> str:
    """
    Returns a parsed and quoted url
    """
    _url = urlparse(url)
    url = _url.scheme + "://" + _url.netloc + quote(_url.path) + "?" + quote(_url.query)
    return url

start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project_name_fq=&label_fq=&sm_field_db_ref_no__fq=&sm_field_loan_no__fq=&dm_field_deadlineFrom_fq=&dm_field_deadlineTo_fq =&ds_field_future_posting_dateFrom_fq=&ds_field_future_posting_dateTo_fq=&bm_field_individual_consulting_fq="
start_url = make_safe_url(start_url)
source = urllib.request.urlopen(start_url).read()

尽管 url 中存在双斜杠和空格,代码仍返回 JSON 文档。


0
投票

就像错误消息所说,您的网址中有一些 控制字符,顺便说一句,这似乎不是有效的。


0
投票

您需要对 URL 内的控制字符进行编码。特别是空格需要编码为 %20。


-2
投票
请检查代理地址是否有

空白,

由于空格,我们收到此错误。

© www.soinside.com 2019 - 2024. All rights reserved.