I am trying to get information from this site: http://cheese.formice.com/maps/@5865339, but when I request it with urllib.urlopen it tells me I need to log in. I am using this code:
import urllib
data = {
    'login':'Cfmaccount',
    'password':'tfmdev321',
    'submit':'Login',
}
url = 'http://cheese.formice.com/login'
data = urllib.urlencode(data)
response = urllib.urlopen(url, data)
What am I doing wrong?
This isn't directly with urllib, but you may find the requests package easier to use. requests has a session object, see this answer. This will get you logged in to the site. You can verify it like so:
import requests
url = 'http://cheese.formice.com/forum/login/login'
login_data = dict(login='Cfmaccount', password='tfmdev321')
session = requests.session()
r = session.post(url, data=login_data)
print r.text #prints the <html> response.
Once logged in, you can then request the specific URL you want:
r2 = session.get('http://cheese.formice.com/maps/@5865339')
print r2.content #prints the raw html you can now parse and scrape
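The same flow works on Python 3; here is a minimal sketch (just a sketch, reusing the login URL and form fields from the answer above), with print() as a function and requests.Session(), which keeps the login cookies across requests:

import requests

login_url = 'http://cheese.formice.com/forum/login/login'
login_data = {'login': 'Cfmaccount', 'password': 'tfmdev321'}

with requests.Session() as session:
    r = session.post(login_url, data=login_data)   # log in; the Session stores the cookies
    r.raise_for_status()                           # bail out early if the POST itself failed
    page = session.get('http://cheese.formice.com/maps/@5865339')
    print(page.text)                               # decoded HTML, ready to parse and scrape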
I'm a new Python 3 "student" who went looking for this, and like @ArthurQ I wanted to do it with "just urllib", or at least with only the tools available in the standard library, no dependencies. It took me a whole day to finally crack it, and since this question was the top result for my various searches (several times!), I hope my answer helps the next person looking for this solution. My point is that everyone pushes the requests library, and I couldn't find a single answer that solved the whole problem with the standard library alone. Enough chat, here is the code; I added plenty of comments, so let the code speak for itself:
# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar
def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes the URL that's found inside form action='.....'
    # adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at the end
    check_string = 'Logout'

    ####### rest of the script is logic
    # but you may need to tweak a couple of things regarding the "token" logic
    # (can be _token or token or _token_ or secret ... etc.)

    # big thing! you need a Referer for most pages! and correct headers are the key
    headers = {"Content-Type": "application/x-www-form-urlencoded",
               "User-agent": "Mozilla/5.0 Chrome/81.0.4044.92",  # Chrome 80+ as per web search
               "Host": base_url,
               "Origin": https_base_url,
               "Referer": https_base_url}

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get the login page and parse out the token
    # (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page, we look for the token, e.g. on my page it was something like this:
    # <input type="hidden" name="_token" value="random1234567890qwertzstring">
    # this can probably be done better with regex and similar
    # but I'm a newb, so bear with me
    html = contents.decode("utf-8")

    # text just before the start and just after the end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'

    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)

    # and the text between them is our token, store it for the second step, the actual login
    token = html[start_index:end_index]

    # here we craft our payload, it's all the form fields, including HIDDEN fields!
    # that includes the token we scraped earlier, as that's usually in hidden fields
    # make sure the left side comes from the "name" attributes of the form,
    # and the right side is what you want to post as the "value"
    # and for hidden fields make sure you replicate the expected answer,
    # e.g. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token': token,
        # 'name':'value',  # make sure this is the format of all additional fields !
        'login': username,
        'password': password
    }

    # now we prepare all we need for login
    # data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')

    # and put the URL + encoded data + correct headers into our POST request
    # btw, despite what I thought, it is automatically treated as POST
    # I guess because of the byte-encoded data field, so you don't need to spell it out like this:
    # urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element on the page that sits behind the login
    # we use a particular string we know only occurs after login,
    # like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)

    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")
scraper_login()
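As the comments above admit, the find()-based slicing of the token "can probably be done better". If you want to stay standard-library-only, one sturdier option is html.parser; here is a small sketch (it assumes the hidden field really is named _token, exactly as in the code above, but it no longer depends on the exact attribute order or quoting of the <input> tag):

from html.parser import HTMLParser

class _TokenParser(HTMLParser):
    # remembers the value of the first <input ... name="_token" value="..."> it sees
    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('name') == '_token' and self.token is None:
            self.token = attrs.get('value')

def extract_token(html):
    parser = _TokenParser()
    parser.feed(html)
    return parser.token

# inside scraper_login() you could then replace the mark_start/mark_end slicing with:
# token = extract_token(html)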