Data Dailies
💾 Updated on May 31, 2020

For many problems/tasks, even if you already have access to a rich data set, additional data sources can augment and add context to your analysis. And given that it is 2020[1], the wonderful WWW is likely where you will find the most interesting data. As a data scientist, feeling confident programmatically searching and downloading new sources of data can go a long way and hopefully will start to feel like a minor superpower...

[1] I think 🧐
DISCLAIMER: Always be responsible, ethical, and polite when web scraping. If it feels like you are doing something questionable, you should probably stop.
  1. Downloading vs. Parsing
  2. HTTP Requests
  3. HTTP.jl
    1. GET vs. POST
    2. Responses
  4. References and Extras

Downloading vs. Parsing

I like to conceptualize a web scraping task in two distinct phases: downloading and parsing. In the downloading phase we are really just concerned with having a semi-automated, programmatic way to acquire raw data (whether that be HTML, JSON, plain text, etc.). In this phase we should treat the data as a binary blob – just some bytes we need to get from a server somewhere to our computer.

Once we have this raw data, however, it is in the parsing phase that we add meaning to it by imposing some structure (and human semantics – i.e. column 3 corresponds to the number of total COVID tests). Even if some abstraction leakage inevitably happens (and might be necessary) between these two phases, you should still think of the web scraping process as these two distinct tasks.

Abstracting each phase not only makes things much easier to debug/troubleshoot but also makes the code more extensible if you want to download additional sources (but parse every source identically).
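As a rough sketch of how that separation might look in code (the function names and the CSV-style parsing here are hypothetical placeholders, not anything we have built yet):

# Phase 1: downloading – just get the raw bytes/text, no interpretation
function fetch_raw(url::AbstractString)
    path = download(url)        # Base download writes to a temporary file
    return read(path, String)   # hand back the raw text as-is
end

# Phase 2: parsing – impose structure and meaning on the raw text
function parse_total_tests(raw::AbstractString)
    rows = split(strip(raw), '\n')
    # e.g. "column 3 corresponds to the number of total COVID tests"
    return [split(row, ',')[3] for row in rows[2:end]]   # skip the header row
end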

HTTP Requests

We already saw yesterday how to use the download function in Julia Base to download a single file from a url to our local machine. And thankfully for us yesterday, the url we needed was predictable and the files well formatted (CSV and JSON). Since we were working with a well-structured API (the COVID Tracking Project), the site was designed to facilitate data dissemination. Other times, however, you might want data that the host/owner has not made so easy to access programmatically.

And in these more difficult situations, knowing how to scrape a data source on the Internet can be invaluable.
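As a point of reference, the kind of call we made yesterday looks roughly like this (pointed here at the CDC page we will scrape below):

# Base's download: a url in, a file out – and that is about all you can specify
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
local_path = download(url, "cases-in-us.html")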

While download() can actually get us pretty far (it just calls out to the OS's curl, wget, or fetch), the abstraction Julia provides on top of these utilities hides all of their options and only gives you the ability to specify a url. The Julia package ecosystem never disappoints though 🙌

The HTTP.jl package is a fairly well-worn library that lets us use Julia to make (and receive) HTTP requests. Without getting into the nitty gritty of internet protocols, all you really need to know for now is that an HTTP request is what your web browser sends to a remote server when you want to view a web site[2]. In the parlance of our downloading vs. parsing section, the web browser does double duty:

[2] A more comprehensive treatment of HTTP, the web, and HTML would probably be good but unfortunately we will have to wait until another day.
  1. It first downloads the raw HTML text data from the server (this is the HTTP request)

  2. Once it receives the HTML text, the web browser software (i.e. Firefox, Chrome, Safari, etc.) parses the HTML text and converts it into a graphical display to show you.

We will use HTTP.jl to do step 1 above, and tomorrow we will programmatically traverse the HTML text to accomplish step 2[3].

[3] but instead of displaying a whole page graphically we just want to extract data...

HTTP.jl

As an example, let's say we want to cross-check the official CDC case numbers with the NYT's data. While the NYT data is structured in a GitHub repository, the CDC website linked above simply displays the total case numbers in an HTML table. While this is optimized for human consumption (i.e. someone visiting the CDC site in their web browser), it is a little cumbersome for computer consumption...

The first step in programmatically getting the CDC case numbers[4] is to get the raw HTML of the web page. As we will hopefully start to get accustomed to, let's activate our environment and install any new packages:

[4] say if we wanted to automatically perform this check or update a dashboard every day without having to visit the web page in our browser and update our code manually...
julia> ] # enter Pkg REPL
(@v1.4) pkg> activate . # activate our environment
(data-dailies) pkg> add HTTP, HttpCommon
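
Equivalently, if you would rather manage the environment from a script than from the Pkg REPL, something like this should do the same thing:

using Pkg

Pkg.activate(".")                  # activate the project in the current directory
Pkg.add(["HTTP", "HttpCommon"])    # add the packages we will need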

GET vs. POST

The HTTP.jl interface should feel reminiscent of the download() function we used before:

using HTTP
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
response = HTTP.request("GET", url)

There are a few small but important differences, however. Instead of just giving the HTTP.request() function a url, we also specify an HTTP verb. There are a lot of intricacies to HTTP methods, but the two main methods you (as a web scraper) will use are GET and POST. A GET basically (as the name implies) requests[5] some data from the server and a POST sends[6] data to a server.

[5] what happens when you type a url in your browser address bar
The download function we used before behaves quite similarly to typing an address in your browser and hitting enter (it makes a GET request). Already you might be starting to notice the limits of download...

[6] what happens when you enter data in a form on a web page
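
To make the contrast concrete, here is a minimal sketch of a POST request (the endpoint is httpbin.org, a public testing service that just echoes back what you send – the payload itself is made up):

using HTTP

# a POST sends data *to* the server in the body of the request
headers = ["Content-Type" => "application/json"]
payload = """{"state": "NY"}"""
r = HTTP.request("POST", "https://httpbin.org/post", headers, payload)
println(r.status)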

Responses

What if the url endpoint we are scraping expects input (like an API)? Or what if you are interested in more than just the HTML body of the response (like the status code or headers)?

This is exactly what HTTP.jl exposes for us, but what it gives us in flexibility it trades for convenience (and it is a little lower level than download()[7]).

[7] like a manual vs. automatic car...
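For the first case – an endpoint that expects input – one simple option is to bake the query parameters directly into the url (the endpoint and parameters here are made up for illustration):

using HTTP

# a GET request with query parameters appended to the url
base = "https://httpbin.org/get"       # public echo service standing in for a real API
query = "state=NY&date=20200531"       # hypothetical parameters
r = HTTP.request("GET", "$base?$query")
println(r.status)

And for the second case, the response object itself carries the status code and headers right alongside the body: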
# print HTTP status code
println(response.status)

# inspect the response headers
response.headers
200
8-element Array{Pair{SubString{String},SubString{String}},1}:
                "Content-Type" => "text/html"
                         "SRV" => "3"
 "Access-Control-Allow-Origin" => "*"
             "X-UA-Compatible" => "IE=edge"
                        "Date" => "Tue, 16 Jun 2020 06:12:29 GMT"
           "Transfer-Encoding" => "chunked"
                  "Connection" => "keep-alive, Transfer-Encoding"
   "Strict-Transport-Security" => "max-age=31536000 ; includeSubDomains ; preload"

From the status code of 200 we can see that the HTTP response was returned successfully, and the headers provide additional context on the response[8].

[8] The Content-Type lets us know how to parse the body, and the Date lets us know when the response was returned.
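If you only care about one particular header, the vector of pairs above is easy to turn into a lookup table (reusing the response from before):

# headers come back as a Vector of Pairs, so a Dict makes lookups easy
header_lookup = Dict(response.headers)
println(header_lookup["Content-Type"])   # => "text/html"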
# peek at first 5 lines of HTML response
split(String(response.body), '\n')[1:5]
5-element Array{SubString{String},1}:
 "\r"
 "<!DOCTYPE html>\r"
 "<html lang=\"en-us\" class=\"theme-cyan\" >\r"
 "<head>\r"
 "\t\r"

The raw HTML text returned from a programmatic request like this can often be quite messy[9] and here we will just glance at the first few lines. One quirk of the HTTP response is that converting the body with String(response.body) consumes the underlying byte vector (much like reading from File IO exhausts the stream), so the body appears empty if you try to read it again.

[9] browsers (thankfully) hide so many details from us.
# body is empty if we try to re-read the same response....
response.body
0-element Array{UInt8,1}

So if you want to repeatedly read/traverse the body[10], you should read it into a variable first.

[10] the body is only "exhausted" if you read from it or use a method that does (like coercing it into a String).
# HTTP Helper functions by JuliaWeb
using HttpCommon

url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
r = HTTP.request("GET", url)

# read the body into a String
status, headers, body = r.status, r.headers, String(r.body)

# escape HTML so `this` webpage doesn't format it
show("$(escapeHTML(body)[1:22])....")
r.body
"\r\n<!DOCTYPE html>...."
0-element Array{UInt8,1}
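
In keeping with the download vs. parse split, it can also be handy to persist the raw body to disk so you never have to re-request the page while iterating on the parsing code (the filename here is arbitrary):

# save the raw HTML so the parsing phase can re-read it at will
write("cdc-cases-raw.html", body)

# later on, read it back without touching the network
raw = read("cdc-cases-raw.html", String)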

That covers many of the mechanics of downloading content programmatically, but the data is not too useful to us (or anyone for that matter) in its raw form. Tomorrow we will get into the specifics of parsing and traversing the raw HTML with Julia to find (and extract) the relevant information from the web page.

References and Extras

CC0
To the extent possible under law, Jonathan Dinu has waived all copyright and related or neighboring rights to Scraping the Web with Julia 🏄 Making HTTP Requests.

This work is published from: United States.

🔮 A production of the hyphaebeast.club 🔮