Data Dailies
💾 Updated on May 31, 2020

Welcome back! This is the second post in the series on web scraping with Julia. Yesterday's post covered making HTTP requests programmatically in Julia with HTTP.jl, so if you haven't seen it yet, head over there (if you want to). Today, I will give a very brief primer on how HTML is structured so we have some context when traversing it.

HTML Primer

Most of us know HTML as a somewhat removed concept[1] since web browsers take care of all the messy work of formatting, styling, and displaying HTML in a pleasing graphic form. But since we are interested in programmatically extracting data from a web page, we have to perform the tasks typically relegated to the web browser.

[1] even though we interact with it on a daily basis (you are doing it right now 😱)

Additionally, since the information we want is usually nested deep in the raw HTML text, we need to traverse the hierarchical HTML structure to find it. Without getting into the gory details of the HTML specification, there are just two key points to note for our parsing purposes:

  1. HTML descends from SGML, as does XML, so it is an XML-like markup format

  2. HTML is hierarchical

Point 1 is why many XML libraries (like EzXML.jl) can be used with HTML. Point 2 is how we can (somewhat) efficiently find the elements we are interested in without needing to traverse the entire web page.
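
To make the hierarchy concrete, here is a minimal sketch that prints the nesting of a small page as an indented outline. It is written in Python using only the standard library's html.parser, purely as a neutral illustration rather than the Julia tooling this series uses. The OutlineParser class and the SAMPLE page are made up for this example.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they never add a nesting level.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


# Hypothetical sample page, used only for illustration.
SAMPLE = """
<html>
  <body>
    <div class="note">
      <p>First <span>note</span></p>
      <br>
    </div>
  </body>
</html>
"""

parser = OutlineParser()
parser.feed(SAMPLE)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

Each level of indentation in the output corresponds to one level of nesting, which is the structure a scraper walks to reach the element it wants. The sketch assumes well-formed markup; real pages often omit closing tags, which a dedicated HTML parsing library handles for you.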


All HTML documents are composed of HTML elements, which are themselves delimited by HTML tags. Most (but not all) HTML elements you will encounter have an opening tag, some content, and a closing tag. When web scraping, you are often trying to get the content, but occasionally you need to extract information from the tag itself (like a hyperlink URL).

All of the key=value pairs inside the opening tag itself (between the < >) are called attributes. Some examples relevant for scraping are:

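| Attribute | Applies to | Use | Example |
| --- | --- | --- | --- |
| class | any | CSS selector | `<div class="note"></div>` |
| id | any | CSS selector | `<div id="two" class="note"></div>` |
| href | `<a>` | specify URL | `<a href="">Blog</a>` |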
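
As a rough sketch of reading attributes and content programmatically, the snippet below records every tag's attributes and pairs each link's href with its text. It uses Python's standard-library html.parser as a neutral illustration, not the Julia packages this series relies on. The LinkParser class, the SAMPLE markup, and the example.com URL are assumptions made up for this example.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Record every tag's attributes and each link's (href, text) pair."""

    def __init__(self):
        super().__init__()
        self.tags = []      # (tag, attribute dict) for every opening tag
        self.links = []     # (href, link text) for every <a> with an href
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # attrs arrives as a list of (key, value) pairs
        self.tags.append((tag, attributes))
        if tag == "a":
            self._href = attributes.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical markup and URL, used only for illustration.
SAMPLE = '<div id="two" class="note"><a href="https://example.com/blog">Blog</a></div>'

parser = LinkParser()
parser.feed(SAMPLE)
print(parser.tags)
print(parser.links)
```

The first list holds information that lives in the tags themselves (class, id, href), while the second pairs an attribute with the element's content, which covers the two kinds of extraction described above.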

CSS Selectors

While there are many ways to specify which element of the page you want, some are much nicer than others (but as with all things you usually trade convenience for power). CSS is a style sheet language that allows developers to specify how HTML content should be presented.

Since most HTML you will encounter is styled with CSS, and since CSS uses selectors to specify which elements should be styled how, CSS selectors are often the most natural and straightforward way to traverse an HTML document. While CSS selectors can be combined into very complicated match rules of their own (much like regexes), the basic types of selectors can be grouped into the following:

| Selector type | Example |
| --- | --- |
| tag name | `h1 { color: yellow; }` |
| class | `.note { color: yellow; }` |
| id | `#two { color: yellow; }` |
| attribute | `a[href=""] { color: yellow; }` |
| combinators | `.note > span { color: yellow; }` |
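
To show how a selector picks out elements, here is a deliberately tiny matcher for the three simplest selector types: tag name, class, and id. It is a sketch in Python's standard library, not a real CSS engine and not the selector library a Julia scraper would use. Attribute selectors and combinators are left out, and the names ElementCollector, matches, select, and SAMPLE are made up for this example.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect every opening tag as a (tag, attributes) pair."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


def matches(element, selector):
    """Check one element against a simple tag, .class, or #id selector."""
    tag, attrs = element
    if selector.startswith("."):
        # The class attribute holds a space-separated list of class names.
        return selector[1:] in (attrs.get("class") or "").split()
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    return tag == selector


def select(html, selector):
    """Return every element in the document that matches the selector."""
    collector = ElementCollector()
    collector.feed(html)
    return [element for element in collector.elements if matches(element, selector)]


# Hypothetical markup, used only for illustration.
SAMPLE = """
<h1>Notes</h1>
<div class="note">one</div>
<div id="two" class="note important">two</div>
"""

print(select(SAMPLE, "h1"))
print(select(SAMPLE, ".note"))
print(select(SAMPLE, "#two"))
```

Here `.note` matches both divs, while `#two` matches only the second one, mirroring how the same selectors would target elements in a stylesheet. A full selector engine also handles attribute selectors, combinators, and compound rules, so in practice you would reach for an existing library rather than writing your own.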

To the extent possible under law, Jonathan Dinu has waived all copyright and related or neighboring rights to Scraping the Web with Julia 🏄 HTML Primer.

This work is published from: United States.
