Webscraping Job Listings
(Note that this post is going to be pretty technical. We’ll return to our regular programming of opinionated backpacking nonsense shortly.)
There are very few activities that I can properly describe myself as “hating”. My normal level of dislike tops out at something like “meh”. But looking for a job is something that I really, really do not enjoy doing. As a card-carrying member of the nerd class, the obvious solution is to make my computer do the searching for me. To that end, I spent a couple of weeks building a script called Jobhunt that would scrape the job listing pages of sites that were of interest to me, and send me an email if there were any new listings in my area. Since some of the information I used/learned is somewhat obscure, I’m archiving my process here, for the benefit of future generations.
Note that if you want me to shut up and just give you the code, here it is.
Some Background
Currently I work as a university lecturer, but that’s part-time, and doesn’t provide benefits. I’m looking for a full-time position at a community college, teaching computer science. (I have a background in IT support, so I wouldn’t mind teaching a bit of that, but in no way do I want to teach IT full time. No.) So to a first approximation, my “target” was going to be the job listing pages for community colleges.
Originally I aimed to target only CCs in California, because most of my family is still in the Central Valley. But after I started to look at how far it would be to some of those schools, it made sense to expand my search. Once you get beyond (say) a four-hour drive home, any trip home becomes an Event, something that has to be planned out in advance. And once you’re past that limit, it really doesn’t matter how much farther you go. So I added Oregon’s and Washington’s CCs to the list. Note that I am only looking at fully-fledged community colleges, not technical schools, although I am including schools that do not currently have a fully-fledged computer science program; they might add one, and then they’ll need someone to run it, right?
California, Oregon, and Washington all have websites listing the CCs in their states.
Note that California and Washington both have central listings of all CC jobs. I opted to scrape California’s listing, as it seemed to be kept relatively up-to-date, and also because California has quite a few CCs. As far as I know, Oregon does not have such a listing. I wasn’t sure how accurate or fresh Washington’s list was, so for Washington I opted to scrape the individual colleges’ pages instead.
Job Listing Pages
Some colleges roll their own job-listing pages. These are the easiest to deal with, as they are typically quite un-fancy: just an HTML table or maybe an unordered list with some styling. It’s relatively easy to use CSS selectors to pull out the relevant information, which is exactly what Jobhunt does: in the configuration for those colleges, there is a set of CSS selectors that tells the system how to extract each individual listing, and then for each listing, how to extract the relevant information from it (job title, location, salary, and closing date). This was generally the easiest part of the process, even though it had to be done manually for each school.
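As a sketch of what that looks like (the URL, selectors, and configuration shape here are invented for illustration, not Jobhunt’s actual config format):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical configuration for one school; the real Jobhunt config format,
# URL, and selectors differ -- this just shows the shape of the idea.
SCHOOL = {
    "name": "Example Community College",
    "url": "https://jobs.example.edu/listings",
    "row_selector": "table#jobs tr.listing",   # one element per job listing
    "fields": {                                # selectors applied within each row
        "title": "td.title a",
        "location": "td.location",
        "salary": "td.salary",
        "closes": "td.closing-date",
    },
}

def scrape_custom(school):
    html = requests.get(school["url"]).text
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for row in soup.select(school["row_selector"]):
        job = {}
        for field, selector in school["fields"].items():
            cell = row.select_one(selector)
            job[field] = cell.get_text(strip=True) if cell else None
        jobs.append(job)
    return jobs
```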
Most schools, however, go with an off-the-shelf solution for running their job-listing pages (and sometimes their entire application process). There were three such systems that I found:
NeoGOV: The NeoGOV people run http://governmentjobs.com, which apparently just collects all the listings that individual organizations submit through the system. Although I could have just scraped the central site, I decided not to, partly because the individual listings often had custom columns in their datasets with extra information. (And also because it was more fun that way.) Scraping NeoGOV is not particularly tricky, although they do split their job listings across multiple pages, and later pages can only be retrieved by sending a POST request. But no information is persisted between requests, so you can just grab the first page, scrape the number of pages, and then blindly POST for all the subsequent pages. (There’s a sketch of this pagination dance after the list.)
HRSuite: The easiest system to deal with, HRSuite returns all its results as a list of divs, all with nice, consistent classes identifying the various fields. And it takes a single GET to retrieve the initial listing.
PeopleAdmin: PeopleAdmin is a crime against humanity, and the developers responsible for it should be on trial at the Hague. It’s a JSP application that maintains some state on the client (some as cookies, some as URL parameters, some as JS variables embedded in the page) and some (invisibly) on the server. The server-side state is vital to the functioning of the application; screw up any request and every request after that will fail. Pages are timestamped, and subsequent requests must contain a consistent timestamp. The “protocol” is a mixture of GET and POST, and includes such beauties as following redirects that are defined in Javascript. That is, you issue a request for a page and get back an HTTP 200 response containing a script that redirects you to the page you actually wanted. So you have to scrape the JS of the page to find the URL and parameters you are being redirected to. (A sketch of dealing with these redirects appears below.) The entire thing is just a hateful system, and I would have strong doubts about the competency of any IT department that would deploy it. (But knowing how these things work, it was most likely chosen by bid by the HR department, with little to no actual input from IT until the process was done.)
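To make the NeoGOV pagination concrete, here is a rough sketch of the idea. The pager selector and the POST form field are guesses for illustration only; the real page structure (and Jobhunt’s actual plugin) differ in the details.

```python
import requests
from bs4 import BeautifulSoup

def scrape_neogov_pages(listing_url):
    """Fetch every page of a paginated NeoGOV-style listing.

    The "pager" selector and the POST form field below are placeholders;
    the real markup has to be inspected per site.
    """
    session = requests.Session()
    soup = BeautifulSoup(session.get(listing_url).text, "html.parser")
    pages = [soup]

    # Hypothetical: the page count appears in text like "Page 1 of 7".
    pager = soup.select_one("span.pager-summary")
    num_pages = int(pager.get_text().rsplit(" ", 1)[-1]) if pager else 1

    # No state is persisted between requests, so blindly POST for pages 2..N.
    for page_num in range(2, num_pages + 1):
        resp = session.post(listing_url, data={"page": page_num})
        pages.append(BeautifulSoup(resp.text, "html.parser"))

    return pages
```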
For my full notes on the different systems, see this document.
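To give one concrete taste of the PeopleAdmin unpleasantness: following a JavaScript “redirect” means pulling the target URL out of the script in the 200 response and requesting it yourself. Something along these lines might do it; the regex is a guess at what such a script could look like, not PeopleAdmin’s actual markup.

```python
import re
import requests

# Crude sketch: look for a window.location assignment in the returned page and
# follow it by hand, keeping cookies in the session. The exact script the
# server emits varies, so this pattern is illustrative only.
JS_REDIRECT_RE = re.compile(r"""window\.location(?:\.href)?\s*=\s*['"]([^'"]+)['"]""")

def follow_js_redirect(session, response):
    match = JS_REDIRECT_RE.search(response.text)
    if match:
        target = requests.compat.urljoin(response.url, match.group(1))
        return session.get(target)
    return response
```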
Jobhunt Architecture
Jobhunt consists of a main Python script together with a collection of “plugins” which are responsible for scraping the various kinds of job-listing sites (custom, HRSuite, NeoGOV, and PeopleAdmin). There’s a configuration file that lists all the schools, which plugin to use for each, and some other required information, along with details about the user’s email address and SMTP server (the script sends you an email report after it runs if there are any new listings). When the script runs, it loads the list of school websites and fires off an instance of the appropriate plugin for each. It collects all the results and saves them to a JSON file; if there are any new listings, a report is generated and emailed.
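The diff-and-report step is the only part with any real logic; in outline it looks something like the sketch below. The file name, configuration keys, and listing fields here are made up for illustration and don’t reflect Jobhunt’s actual layout.

```python
import json
import smtplib
from email.message import EmailMessage
from pathlib import Path

STATE_FILE = Path("jobs.json")  # hypothetical location for the saved results

def diff_and_report(new_listings, email_cfg):
    """Compare freshly scraped listings to the saved JSON, then email any new ones.

    new_listings maps school name -> list of job dicts; email_cfg holds
    made-up keys ("from", "to", "smtp_host") standing in for the real config.
    """
    old = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

    fresh = {}
    for school, jobs in new_listings.items():
        added = [job for job in jobs if job not in old.get(school, [])]
        if added:
            fresh[school] = added

    # Persist everything we saw this run, for comparison next time.
    STATE_FILE.write_text(json.dumps(new_listings, indent=2))

    if fresh:
        body = "\n\n".join(
            school + "\n" + "\n".join("  " + job["title"] for job in jobs)
            for school, jobs in fresh.items()
        )
        msg = EmailMessage()
        msg["Subject"] = "Jobhunt: new listings"
        msg["From"] = email_cfg["from"]
        msg["To"] = email_cfg["to"]
        msg.set_content(body)
        with smtplib.SMTP(email_cfg["smtp_host"]) as smtp:
            smtp.send_message(msg)
```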
Jobhunt relies on BeautifulSoup for its HTML parsing voodoo. One thing to be aware of when using BeautifulSoup is that it will use the “best” available HTML parser, which may be different on different machines. I was confused as to why I was getting different results on my home server than on my development machine, and it turned out to be because I had different parser modules installed.
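The fix is to name the parser explicitly instead of letting BeautifulSoup choose; a minimal example:

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>Instructor, Computer Science</table>"

# With no parser argument, BeautifulSoup picks the "best" parser it finds
# installed (lxml, html5lib, or html.parser), and they repair malformed HTML
# differently. Naming one makes results reproducible across machines.
soup = BeautifulSoup(html, "html.parser")
print(soup.td.get_text())  # -> Instructor, Computer Science
```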
Results
You might ask whether all my effort has paid off, in terms of actually getting me a job. Well, I’m not actually looking yet! I’m committed to teaching in the spring semester, so I won’t be available until after that. (It would be rather hypocritical of me to go interview at another school, claiming that “I’ll make your students’ success my top priority!” while knowing full well that I was thwarting the success of the students at the school I teach at by doing so.) So once springtime rolls around and positions start appearing for the following fall, I’ll start taking them seriously.
(Actually, I have seen a number of interesting positions pop up. So interesting that I was rather disappointed they appeared now, instead of later when I can actually apply for them.)