
I need to log in to [a website] and use the session to scrape some data. However, when I send the POST request, I always get status 404.

Here is what I have already tried:

import requests

PW = "password"
UN = "username"
payload = {"Login": UN, "Password": PW, "submit": "Kirjaudu+sisään"}
url = "[a website]"
s = requests.session()
data = s.post(url, data=payload)
print(data)

The output is:

<Response [404]>

I have also tried supplying a Firefox user agent for the site:

s.post(url, data=payload, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"})

It did not make a difference.

KJKalle
  • I think the URL is wrong. I just tried it, and I see the form being sent to `https://wilma-lukiot.gradia.fi/login`, not `https://wilma-lukiot.gradia.fi` – dgan Apr 03 '19 at 13:48

1 Answer


Firstly, the POST request should go to https://wilma-lukiot.gradia.fi/login

Secondly, the form has a fourth field, SESSIONID, and you need to send that too.
Probably the best way to get it is to first load https://wilma-lukiot.gradia.fi, parse the page to extract the SESSIONID, and only then send the POST (in the same session) to the login endpoint.
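
The two steps above could look roughly like the sketch below. It is not the exact code that ended up working for the asker, and it makes several assumptions: that SESSIONID is served as an `<input type="hidden">` on the front page, that the `+` in the original submit value stood for a URL-encoded space, and that `session` is a `requests.Session`. The names `HiddenInputParser`, `extract_hidden_fields` and `login` are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def login(session, base_url, username, password):
    """Load the front page, copy its hidden fields, then POST to /login.

    `session` is assumed to be a requests.Session, so that cookies set by
    the first request are sent along with the login request.
    """
    front_page = session.get(base_url)
    front_page.raise_for_status()

    # Assumption: SESSIONID is among the hidden inputs of the login form.
    payload = extract_hidden_fields(front_page.text)
    payload.update({
        "Login": username,
        "Password": password,
        # Assumption: the "+" in the question was a URL-encoded space;
        # requests encodes form values itself.
        "submit": "Kirjaudu sisään",
    })
    return session.post(base_url.rstrip("/") + "/login", data=payload)
```

With `requests` installed, a call would be something like `login(requests.Session(), "https://wilma-lukiot.gradia.fi", UN, PW)`. Checking that `"SESSIONID"` is actually in the extracted fields before posting is a cheap sanity check if the page layout ever changes.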

edinho
  • Now I got it working, thank you very much! I thought requests.session() managed session ids, but that SESSIONID seems to be something unique to this site? Also how did you know the login path should be used? – KJKalle Apr 03 '19 at 14:16
  • Indeed, `requests.Session` manages session ids. But this field has nothing to do with the session; it is just a hidden field in the form (it works like Django's [CSRF](https://docs.djangoproject.com/en/2.2/ref/csrf/) token, if you want to know more). The login path is where the browser sends the form, and you can see it when you look at the network tab [(nice explanation here)](https://stackoverflow.com/questions/15603561/how-can-i-debug-a-http-post-in-chrome) – edinho Apr 03 '19 at 14:22