Colgate Schedule

Post by Saul Shanabrook

When I got to Colgate, I signed up for all my classes and got a nice schedule overview on their student portal.

[Screenshot: class schedule overview in the Colgate student portal]

I have become a little addicted to Google Calendar, so I obviously wanted to add all of my classes. However, it seemed a little boring to read off all those days of the week, times, and class names by hand and add them as repeating events. Why not let a computer do it? The schedules are in a machine-readable format, obviously sitting in some database somewhere to begin with, so it shouldn't be too hard to scrape that and generate an iCalendar (.ics) file with all of my classes, which is importable into any calendar app.

You can see the app live at—

I just tried to access the site and got:

[Screenshot: phishing warning from Colgate IT when visiting colgate-schedule.herokuapp.com]

I am impressed with their IT. They are sort of right: it is horribly insecure, as I will explain later, so it is probably good they blocked it. Anyway, you can probably see it outside the Colgate network at colgate-schedule.herokuapp.com, and the code is up at saulshanabrook/colgate-schedule.

Process

It is so insecure because it has a Python backend. I originally intended to write a client-side-only app, where you input your Colgate username and password and it requests their site, scrapes your schedule, parses it, and creates a .ics file on the fly to download. That way it wouldn't really be any more insecure than logging into the portal in the first place. However, I got lazy, so I reverted back to my good old friend Python (Requests and BS4 for scraping).

Oh also, you can just create your own Heroku app from my template and run it through your own account. Then at least you won't be giving me your username and password.

So obviously it needs to be rewritten in JavaScript, but you know how these things go: it works for me right now, so if you want it better, send a pull request!

(Not So Simple) Scraping

It was much more annoying to scrape my schedule than I thought it would be. I considered two ways of scraping a webpage in Python.

  1. Use Requests plus Beautiful Soup 4 to get each page as you need it, building up any cookies you need on the way, so that the server thinks you are human enough and authenticated enough to get your final page.
  2. Use Selenium in Python to emulate a full-blown browser. Instead of making specific GET and POST requests, act more like a user by filling in forms and pressing buttons.

I went for option 1, because although option 2 seems like it would be less code, it has a couple of disadvantages. Emulating user input can be more opaque than sending specific requests. For example, you aren't really sure what you are doing that is required and what is just extra. Did you need to download all those style and JavaScript files just to fill in that form? Did you need to make that extra request? I chose instead to try to figure out exactly how the server knows someone is authenticated and emulate that, which was harder than I thought it would be.
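For contrast, here is a rough sketch of what option 2 might have looked like with Selenium. I never wrote this version; the username and password field names match the login form I describe below, but everything else here is an assumption.

from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_schedule_with_selenium(username, password):
    # drive a real browser instead of crafting requests by hand
    driver = webdriver.Firefox()
    try:
        driver.get('https://cas.colgate.edu/cas/login')
        driver.find_element(By.NAME, 'username').send_keys(username)
        driver.find_element(By.NAME, 'password').send_keys(password)
        driver.find_element(By.NAME, 'password').submit()
        # ...then click through to the schedule page and grab its HTML
        return driver.page_source
    finally:
        driver.quit()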

So this method breaks down into a few steps.

  1. Authenticate with the portal, so you seem like a nice logged-in user.
  2. Get the HTML of the page with the schedule on it.
  3. Parse that HTML for the events.
  4. Turn that parsed event data into an .ics file.
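Glued together, the whole thing might look roughly like this. Only step one's code actually appears in this post; the other helper names are stand-ins I made up for the sketch.

import requests

def schedule_to_ics(username, password):
    # hypothetical glue for the four steps; only login_to_portal is real,
    # the rest are placeholders for steps 2-4
    session = requests.Session()
    login_to_portal(session, username, password)   # step 1
    html = get_schedule_html(session)              # step 2
    events = parse_events(html)                    # step 3
    return build_calendar(events)                  # step 4: the .ics text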

For the first step, I was hoping to send a simple POST with a username and password to the login URL, but no. First off, you need to send a GET to the URL first, to get a session cookie. Not that bad. But I also realized that the form page has a hidden field with a value that is generated server side. So unless your POST request includes that value that was included in the form on the page, it won't authenticate. So I had to scrape the value (called lt) out of the input element in the form page, and then include it in my POST request. Here is my code (also linked to the GitHub source):

import bs4

def login_to_portal(session, username, password):
    # `session` is a requests.Session, so cookies persist between requests
    LOGIN_URL = 'https://cas.colgate.edu/cas/login'

    # GET the login page first to pick up the `JSESSIONID` session cookie
    r = session.get(LOGIN_URL)
    session_id = session.cookies['JSESSIONID']

    # scrape the hidden `lt` value that has to be sent back with the form
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    lt = soup.find('input', attrs={'name': "lt"})['value']

    # POST the credentials plus the hidden fields to log in
    post_url = LOGIN_URL + ';jsessionid=' + session_id
    data = {
        'username': username,
        'password': password,
        'lt': lt,
        'execution': 'e1s1',
        '_eventId': 'submit',
    }
    r = session.post(post_url, data=data)

For the second step, another weirdness was that I couldn't just then submit a GET to the schedule page. First I had to get some other page, to pick up another session cookie, and only then would it let me get the schedule page.
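In code, that dance might look something like this sketch. The post doesn't give the portal's actual paths, so the URLs here are placeholders.

# placeholder URLs, not the portal's real paths
PORTAL_HOME_URL = 'https://portal.example.edu/home'
SCHEDULE_URL = 'https://portal.example.edu/schedule'

def get_schedule_html(session):
    # hitting some other portal page first sets the extra session cookie;
    # only then does the schedule page come back properly
    session.get(PORTAL_HOME_URL)
    return session.get(SCHEDULE_URL).text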

On step three, parsing the HTML was not very hard, but there were a few hiccups. First, the class times didn't list an AM or PM, so I had to guess based on what time of day it was, which has already led to some wrong guesses.
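The guessing code isn't shown here, but the heuristic it describes could be as simple as this; the cutoff hour is my assumption.

def to_24_hour(hour):
    # the portal shows times like "2:20" with no AM/PM, so guess:
    # anything earlier than 8 is treated as an afternoon class
    # (which, as noted above, sometimes guesses wrong)
    return hour + 12 if hour < 8 else hour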

Step four was a lot easier than I thought it would be, using the ics.py library.
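Here is a minimal sketch of that step with ics.py; the field names on the parsed events are assumptions, not necessarily what my parser actually produces.

from ics import Calendar, Event

def build_calendar(parsed_events):
    cal = Calendar()
    for item in parsed_events:
        ev = Event()
        ev.name = item['name']     # e.g. the course title
        ev.begin = item['start']   # a datetime for the class start
        ev.end = item['end']       # a datetime for the class end
        cal.events.add(ev)
    # with recent versions of ics.py, serialize() returns the .ics text
    return cal.serialize()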

Conclusion

All in all, pretty fun. I love the Heroku app deploy button, which lets anyone run their own version for free. That is such a great example of something that is technologically very hard but makes such a great difference in terms of workflow. Even someone without any coding knowledge can deploy their own version, without touching the command line.

I think web scraping plus API generation can often be a killer combo for getting more utility out of pre-existing data. The Requests, BS4, and Flask combo seems to work very well for that.
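As a concrete (and hypothetical) sketch of that combo, the whole app more or less boils down to one Flask route that takes credentials and hands back a calendar. The helper names are the same made-up ones from the sketches above, not the actual functions in the repo.

import requests
from flask import Flask, Response, request

app = Flask(__name__)

@app.route('/calendar', methods=['POST'])
def calendar():
    # credentials in, .ics text out; this is also exactly why the app is
    # insecure: your password passes through the server
    session = requests.Session()
    login_to_portal(session, request.form['username'], request.form['password'])
    html = get_schedule_html(session)
    ics_text = build_calendar(parse_events(html))
    return Response(ics_text, mimetype='text/calendar')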