I am trying to download some files from a free dataset using BeautifulSoup. I repeat the same process for two similar links on the page.
This is the page address.
import requests
from bs4 import BeautifulSoup
first_url = "http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.region_xyz_centers_file.bcf53cd53a90f374.55434c415f43434e5f41504f455f4454495f41504f452d335f355f726567696f6e5f78797a5f63656e746572732e747874.txt"
second_url="http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.connectivity_matrix_file.bfcc4fb8da90e7eb.55434c415f43434e5f41504f455f4454495f41504f452d335f355f636f6e6e6563746d61742e747874.txt"
def download_file(url, file_name):
    # Fetch the URL and write the raw response body to disk
    myfile = requests.get(url)
    with open(file_name, 'wb') as f:
        f.write(myfile.content)
download_file(first_url, "file1.txt")
download_file(second_url, "file2.txt")
Output files:
file1.txt:
50.118248 53.451775 39.279296
51.417612 67.443649 41.009074
...
file2.txt:
<html><body><h1>Internal error</h1>Ticket issued: <a href="/admin/default/ticket/umcd/89.41.15.124.2020-04-30.01-59-18.c02951d4-2e85-4934-b2c1-28bce003d562" target="_blank">umcd/89.41.15.124.2020-04-30.01-59-18.c02951d4-2e85-4934-b2c1-28bce003d562</a></body><!-- this is junk text else IE does not display the page: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx //--></html>
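So the request itself "succeeds", but the body is an HTML error page, and download_file writes it to disk unconditionally. A variant that fails loudly instead of silently saving the error page might look like this (a sketch; the Content-Type check is my assumption about how to tell data apart from error pages, since the dataset files are plain text):

import requests

def download_file_checked(url, file_name):
    r = requests.get(url)
    r.raise_for_status()  # fail on 4xx/5xx status codes
    # The dataset files are plain text, so an HTML body signals an
    # error page even if the status code is 200.
    if "text/html" in r.headers.get("Content-Type", ""):
        raise RuntimeError("Got an HTML error page instead of data: " + url)
    with open(file_name, 'wb') as f:
        f.write(r.content)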
However, I can download second_url correctly from the Chrome browser (the file contains some numbers, as expected). I tried setting the user agent:
headers = {'User-Agent': "Chrome/6.0.472.63 Safari/534.3"}
r = requests.get(url, headers=headers)
but it did not work.
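For completeness, a fuller browser-like header set of the kind Chrome actually sends would look like this, with second_url as defined above (the values are illustrative; I don't know whether the server checks any of them):

import requests

# Illustrative browser-like headers; the server is not known to
# require any of these.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/81.0.4044.122 Safari/537.36"),
    "Accept": "text/plain,text/html;q=0.9,*/*;q=0.8",
}
r = requests.get(second_url, headers=headers)
print(r.status_code, r.headers.get("Content-Type"))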
EDIT: The site does not require a login to get the data. I opened the page in a private-browsing window and could download the file behind second_url from there. Pasting second_url directly into the address bar produces the error:
Internal error
Ticket issued: umcd/89.41.15.124.2020-04-30.03-18-34.49c8cb58-7202-4f05-9706-3309b581af76
Do you have any ideas? Thanks in advance for any guidance.
This is not a Python problem. The second URL gives the same error in curl and in my browser.
It strikes me as odd that the second URL is shorter. Are you sure you copied it correctly?