Data Dailies
💾 Updated on June 01, 2020

So it looks like there are now 3 parts to this web scraping thing... Part 1 was on making HTTP requests, Part 2 was a primer on HTML, and in today's post we will see how to actually extract useful information from the raw HTML text that we downloaded from the CDC. While Julia unfortunately doesn't have the most mature HTML/XML parsing packages[1] (and it isn't really a core use case for the language), its parallel computing is much more friendly and performant than that of other scripting languages (that might have good parsers...).

[1] parsing HTML correctly (and quickly) is more difficult than it may seem...
  1. HTML as XML
  2. CSS Selectors
  3. References and Extras

HTML as XML

To recap, we will be parsing the CDC web page that we downloaded in part 1[2].

[2] With code reproduced below for reference.
# HTTP Helper functions by JuliaWeb
using HTTP, HttpCommon

url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
r = HTTP.request("GET", url)

# read the body into a String
status, headers, body = r.status, r.headers, String(r.body)

# escape HTML so `this` webpage doesn't format it
show("$(escapeHTML(body)[1:22])....")
"\r\n<!DOCTYPE html>...."

If we were properly building a scraper to run as a script on a recurring basis, we would probably download the raw HTML (and store it) and then read it in and parse it separately (remember downloading vs. parsing). For the sake of this tutorial, however, I will just pass the HTTP response directly into a parsing package.
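
For reference, a minimal sketch of what that separation might look like (the filename is just a made-up example):

using Dates

# download step: persist the raw HTML so parsing can be re-run without re-fetching
filename = "cdc-cases-$(Dates.today()).html"
open(filename, "w") do io
    write(io, body)
end

# parse step (potentially a separate script run later): read the stored HTML back in
raw_html = read(filename, String)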

For our HTML parsing, we will be using EzXML.jl and XPath selectors (to extract elements).

julia> ] # enter Pkg REPL
(@v1.4) pkg> activate .
(data-dailies) pkg> add EzXML
using EzXML

# use `readhtml(filename)` if reading from a file
doc = parsehtml(body)
doc
EzXML.Document(EzXML.Node(Ptr{EzXML._Node} @0x00007fd86830e1d0, EzXML.Node(#= circular reference @-1 =#)))

We can see here that we now have an EzXML.Document (basically the parsed HTML document represented as a Julia struct) that we can either traverse manually, node by node, or query with XPath expressions.
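
To make those two options concrete, here is an illustrative sketch (the XPath expression is just an example):

html = root(doc)                # the root <html> element of the document

# 1. traverse manually: walk the child elements node by node
for el in elements(html)
    println(nodename(el))       # e.g. "head", "body"
end

# 2. query with XPath: jump straight to the matching nodes
findall("//span[@class=\"count\"]", html)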

In this case, since we are only interested in the cases and deaths from the CDC page, we will use XPath to get as close as possible to the relevant HTML element using its class or id. Shown in the margin is the CDC page with the web inspector open (right click on the element). Usually, once I find the element with the content I want, I work upstream from it to the closest uniquely identifiable parent element. In this case, since we are trying to get the total cases, new cases, total deaths, and new deaths, the closest uniquely identifiable element is likely <section class="cases-header"> (since there may be other elements on the page with callout classes).

using HttpCommon

html = root(doc)
xpath = "//section[@class=\"cases-header\"]"
header = findfirst(xpath, html)

# make sure to escape the HTML so it shows on my blog
print(escapeHTML("$header"))
<section class="cases-header">
                            <div class="cases-callouts">
                                <div class="callouts-container">
                                    <div class="callout">
                                        Total Cases
                                        <span class="count">2,085,769</span>
                                        <span class="new-cases">21,957 New Cases*</span>
                                    </div>
                                    <div class="callout">
                                        Total Deaths
                                        <span class="count">115,644</span>
                                        <span class="new-cases">373 New Deaths*</span>
                                    </div>
                                </div>
                                <footer>
                                    <ul>
                                        <li>
                                            *Compared to yesterday's data   
                                        </li>
                                        <li>
                                            <a href="#accordion-1-collapse-2">About the Data</a>   
                                        </li>
                                    </ul>
                                </footer>

                            </div>
                            <a href="https://www.cdc.gov/covid-data-tracker/index.html" style="background-image: url(/coronavirus/2019-ncov/covid-data/images/corona-interactive.png)">
                                <section>
                                    <span class="heading">Want More Data?</span>
                                    <p>CDC COVID Data Tracker</p>
                                </section>
                            </a>
                        </section>

While it is a little messy in how we printed it out, it does indeed look like we got the right element containing the information we need. Now that we have isolated the relevant element, we can get a little more specific in how we traverse it:

# initialize empty dictionary to store content
data = Dict()

# convenience function to parse strings with commas
parse_number(x) = parse(Int, replace(x, "," => ""))

# get the nested <div class="callouts-container"> that holds the case and death callouts
callouts = findfirst("div/div", header)
# extract the first callout <div>, which corresponds to the cases
cases = firstelement(callouts)

# extract the inner <span> elements that contain the numbers
total, new = map(nodecontent, findall("span", cases))

data["total_cases"] = parse_number(total)

# use a regex to pull out just the new cases number
data["new_cases"] = parse_number(match(r"([\d,]+)", new)[1])
data
Dict{Any,Any} with 2 entries:
  "new_cases" => 21957
  "total_cases" => 2085769

Since the deaths callout has the same structure, I won't walk through the code for it, but it will be nearly identical to what is above (just starting from the last callout <div> instead of the first):

deaths = lastelement(callouts)
...
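
Filled in, that would look something like the following (a sketch, assuming the deaths callout really does mirror the cases one):

# the deaths callout is the last <div class="callout"> in the container
deaths = lastelement(callouts)
total, new = map(nodecontent, findall("span", deaths))

data["total_deaths"] = parse_number(total)

# same regex trick to pull out just the new deaths number
data["new_deaths"] = parse_number(match(r"([\d,]+)", new)[1])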

CSS Selectors

I usually prefer to parse/traverse HTML using CSS selectors since (I think) they map a little more naturally to the structure of HTML, but if you need to programmatically traverse an entire document[3], XPath is a bit more powerful/flexible than CSS selectors.

[3] instead of say just extracting the text of a single tag
Also, for Julia, XML packages like EzXML.jl and LightXML.jl seem to be more actively developed than Cascadia.jl (the only CSS selector library). And in general, XML libraries are likely to be more predictable since XPath (and XML) is stricter than HTML and CSS selectors.
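
That said, for comparison, here is roughly what the header selection might look like with CSS selectors, assuming Cascadia.jl on top of Gumbo.jl's parser (treat this as a sketch rather than tested code):

using Gumbo, Cascadia  # Gumbo parses the HTML, Cascadia adds CSS selector queries

gdoc = Gumbo.parsehtml(body)                               # a Gumbo.HTMLDocument
header = eachmatch(Selector("section.cases-header"), gdoc.root)[1]
counts = eachmatch(Selector("span.count"), header)         # the two count <span>s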

References and Extras
