Error downloading files with BeautifulSoup

Question

I am trying to download some files from a free dataset using BeautifulSoup. I repeat the same process for two similar links on the web page.

This is the page address.

import requests
from bs4 import BeautifulSoup

first_url = "http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.region_xyz_centers_file.bcf53cd53a90f374.55434c415f43434e5f41504f455f4454495f41504f452d335f355f726567696f6e5f78797a5f63656e746572732e747874.txt" 
second_url="http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.connectivity_matrix_file.bfcc4fb8da90e7eb.55434c415f43434e5f41504f455f4454495f41504f452d335f355f636f6e6e6563746d61742e747874.txt"

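def download_file(url, file_name):
    # Fetch the URL and write the raw response body to disk.
    myfile = requests.get(url)
    with open(file_name, 'wb') as f:  # the with-block closes the file handle
        f.write(myfile.content)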

download_file(first_url, "file1.txt")
download_file(second_url, "file2.txt")

Output files:

file1.txt:
50.118248 53.451775 39.279296 
51.417612 67.443649 41.009074 
...
file2.txt:
<html><body><h1>Internal error</h1>Ticket issued: <a href="/admin/default/ticket/umcd/89.41.15.124.2020-04-30.01-59-18.c02951d4-2e85-4934-b2c1-28bce003d562" target="_blank">umcd/89.41.15.124.2020-04-30.01-59-18.c02951d4-2e85-4934-b2c1-28bce003d562</a></body><!-- this is junk text else IE does not display the page: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx //--></html>

However, I can download second_url correctly from the Chrome browser (the file contains some numbers). I tried setting the user agent:

headers = {'User-Agent': "Chrome/6.0.472.63 Safari/534.3"}
r = requests.get(url, headers=headers)

But it did not help.

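Edit: The site does not require a login to get the data. I opened the page in a private-mode browser window and was able to download the file behind second_url from there. Entering second_url directly in the address bar produced an error: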

Internal error
Ticket issued: umcd/89.41.15.124.2020-04-30.03-18-34.49c8cb58-7202-4f05-9706-3309b581af76

Do you have any ideas? Thanks in advance for any guidance.

python beautifulsoup
1 answer

This is not a Python problem. The second URL gives the same error in curl and in my browser.
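One way to confirm this from Python is to inspect the response before writing it to disk. The following is a minimal sketch, not a definitive fix. It assumes the server marks the failure with a non-2xx status code or an HTML content type, which the question does not show. The names is_error_page and the get parameter are hypothetical; get stands in for requests.get so the function can be exercised without network access.

```python
from typing import Callable


def is_error_page(status_code: int, content_type: str, body: bytes) -> bool:
    """Heuristic: does this response look like an HTML error page, not data?"""
    if status_code >= 400:
        return True
    if "html" in content_type.lower():
        return True
    # Fall back to sniffing the body for an opening <html> tag.
    return body.lstrip().lower().startswith(b"<html")


def download_file(url: str, file_name: str, get: Callable) -> bool:
    """Save the body at `url` to `file_name`; return False if it looks like an error.

    `get` is any callable shaped like requests.get (an assumption made
    for illustration, so a fake can be passed in for testing).
    """
    response = get(url)
    content_type = response.headers.get("Content-Type", "")
    if is_error_page(response.status_code, content_type, response.content):
        print(f"Not saving {file_name}: HTTP {response.status_code}, {content_type!r}")
        return False
    with open(file_name, "wb") as f:
        f.write(response.content)
    return True
```

With requests installed, it could be called as download_file(second_url, "file2.txt", requests.get). If the status check alone is enough, calling response.raise_for_status() right after requests.get(url) is a shorter alternative.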

I find it odd that the second URL is shorter. Are you sure you copied it correctly?
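The length difference can be checked rather than guessed at. In both URLs from the question, the long component just before the .txt extension decodes cleanly as hex-encoded ASCII. The sketch below assumes the last path segment has the shape table.field.key.hexname.ext, which is inferred from these two URLs only. The helper name decode_upload_name is hypothetical.

```python
def decode_upload_name(url: str) -> str:
    """Decode the hex-encoded component before the extension in a download URL.

    Assumes the last path segment looks like
    <table>.<field>.<key>.<hex of original name>.<ext>, as in the two
    URLs from the question.
    """
    last_segment = url.rsplit("/", 1)[-1]
    hex_part = last_segment.split(".")[-2]
    return bytes.fromhex(hex_part).decode("ascii")


first_url = "http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.region_xyz_centers_file.bcf53cd53a90f374.55434c415f43434e5f41504f455f4454495f41504f452d335f355f726567696f6e5f78797a5f63656e746572732e747874.txt"
second_url = "http://umcd.humanconnectomeproject.org/umcd/default/download/upload_data.connectivity_matrix_file.bfcc4fb8da90e7eb.55434c415f43434e5f41504f455f4454495f41504f452d335f355f636f6e6e6563746d61742e747874.txt"

print(decode_upload_name(first_url))   # UCLA_CCN_APOE_DTI_APOE-3_5_region_xyz_centers.txt
print(decode_upload_name(second_url))  # UCLA_CCN_APOE_DTI_APOE-3_5_connectmat.txt
```

The second name decodes to a complete file name ending in .txt, so the shorter URL appears to come from a shorter original file name rather than from truncation. This does not rule out a mistake elsewhere in the URL, such as in the key component.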
