Data Dailies
💾 Updated on May 28, 2020

Now that we have Julia set up on our local machine, it is time to really start cooking! And by cooking I mean analyzing some interesting data...

In (maybe?) contradiction to the vastly more common bottom-up approach to teaching computational topics, these tutorials will (possibly to a fault) be very top down. What do I mean by that? Usually with computer science topics, and even more so with computational approaches to analytical topics, teaching starts from the most basic/atomic concepts: i.e. learn data science by writing algorithms (and to write algorithms you first have to learn about variables and for loops).

While this often seems like a perfectly logical approach for an expert trying to teach novices (teacher/student), this method of curriculum development really optimizes for the teacher's experience rather than the learner's[1]. Much like building a house: for the builder it is vastly more efficient to start by pouring the concrete for the foundation (and pretty impossible to do otherwise), but for the homeowner it is much more rewarding and exciting to see windows, a façade, and a siiiiiic door knocker.

[1] In an innocuously perverse kind of way.
  1. Bootstrapping Curricula
    1. An example with regression
  2. Finding interesting problems (and data)
    1. Inside Airbnb
    2. The COVID Tracking Project
  3. My Case Study

Bootstrapping Curricula

Thankfully for me/us we are not building a house... we are learning how to do data science in Julia! And also I am not a Julia expert. So how does a novice teach a novice? Especially when that novice is themself?

I personally find the quickest way to learn a new topic is to set up an engaging problem you are curious about that has a semi-definite answer (i.e. you'll know the answer when you see it), and then develop a sequence of steps to get you to that answer.

It can be a little mind-bendy to think about learning in this way, but it is similar to the steps of a proof by (backwards) induction (or recursion). Just like with the mathematical proof, however, proving the base case is a non-trivial task (and may not even be possible).

The base case in our curriculum bootstrapping is that an answer exists for the problem you set up for yourself. But since you haven't actually solved this problem, you somehow need to determine whether or not you scoped the problem correctly. And once you establish the base case—there does in fact exist a solution to this problem[2]—you then work backwards on a (slightly) smaller problem (the N-1 case). And repeat:

[2] You can definitely learn from an unsolvable (or unsolvable-in-the-amount-of-time-you-have) problem, but when learning a new subject it is often much more rewarding to feel a sense of accomplishment.
  1. Does a solution exist for (sub)problem N?

  2. Am I at N = 0 (i.e. a subproblem I know the solution to)?

    • If no, go back to step 1 and set N := N - 1[3]

    • If yes, celebrate, you did it!

[3] Implicit in this step is determining a suitable subproblem that corresponds to N-1, ideally within a comfortable zone of proximal development (ZPD) for the learner (i.e. yourself).
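The loop above can be sketched as a tiny recursive function. This is just a playful sketch of mine (none of these names come from the post): `known` is a hypothetical stand-in for "subproblems I already know how to solve" (the N = 0 base case), and `shrink` stands in for "pick a suitable N - 1 subproblem."

```julia
# A playful sketch of curriculum bootstrapping as recursion.
# `known` holds subproblems you can already solve (the N = 0 base case);
# `shrink` picks a suitable N - 1 subproblem.
function bootstrap(problem, known, shrink)
    problem in known && return [problem, "🎉 celebrate, you did it!"]
    [problem; bootstrap(shrink(problem), known, shrink)]  # N := N - 1
end

# Hypothetical example: shrinking the regression question down to a
# subproblem we already know how to solve.
steps = bootstrap(
    "are higher-rated listings pricier?",
    Set(["download a CSV"]),
    p -> p == "are higher-rated listings pricier?" ? "get listings data" : "download a CSV",
)
# steps == ["are higher-rated listings pricier?", "get listings data",
#           "download a CSV", "🎉 celebrate, you did it!"]
```

The curriculum is just the list of steps read bottom to top: start from what you can already do, and each recursive call up the stack is the next lesson.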
🆒 🆒 🆒, that all sounds great in theory but it is still a little abstract... What might this look like in practice? With an example 🤞?

An example with regression

  1. Brainstorm a motivating problem you are curious about

    • Are higher-rated Airbnb listings more expensive per night?

  2. Research if it can be solved[4]

    • Has anyone written a blog post/research paper/conference talk showing this?

    • Or can I imagine a way it could be solved given what I currently know?

  3. Decompose it into manageable subproblems (that you assume can be solved)

    • How can I get a structured dataset of Airbnb listings that contains both price and ratings?

[4] Often the easiest way is to find out whether anyone has done it before.

Finding interesting problems (and data)

Many data science learning resources often use well structured and "canned" datasets. This has the benefit of circumventing step 1 and being much more predictable/reliable. But what one gains in convenience, one loses in motivation. Now maybe some people really do care about the fuel economy of cars from the 1974 Motor Trend issue[5].

[5] These canonical datasets are canonical for good reason however. And are actually decent for learning machine learning/modeling (among other things).
A more pedagogical reason to avoid these types of datasets/problems is inextricably linked to the nature of data science. The science of data science (and what IMO distinguishes it from statistics) is actually in determining the right question and in choosing a suitable answer given the context. And often folks teaching data science "punt" on this, saying that you can't teach these things, only experience can turn an apprentice into a journeyman. But having generations of data science students work through identifying Iris species certainly doesn't help.

If I can coin a term here[6], let's call it a simulated experience: a structured exercise that emulates the challenges and uncertainties of a real-world task.

[6] I mean I will, it is my blog after all.
By making a learning experience personal/individualized, learners are more likely to stay motivated/interested, and they might also find novel challenges. And by picking an authentic task, it can aid in transfer of learning and problem solving when a student is placed in a real-world scenario. Both of these things can help with motivation, metacognition, and the unseen factors of learning (beyond just questions, answers, and abstract concepts). Concretely, I look for two criteria in a dataset:

  1. The data can be faceted by region (city/county), so everyone can find a facet that they have a particular affinity towards.

  2. These data are generated from real-world processes and contain authentic problems (that Airbnb and epidemiologists have likely solved).

Two of my favorite data sources that fit these criteria are 👇

Inside Airbnb

The COVID Tracking Project

My Case Study

Now that we have learned how to learn, we are metacognitively[7] set up for success! For the foreseeable future, the long-running (and timely) case study we will be working with is analyzing The COVID Tracking Project data.

[7] Roger Azevedo. Computer environments as metacognitive tools for enhancing learning. Educational Psychologist. 2005.
DISCLAIMER: I am not an epidemiologist nor a public health professional. Always take care when making inferences from any data, especially if it is unfamiliar, high-stakes data (like the COVID testing data is for me).

I promise tomorrow we will write at least one line of Julia code 😅

To the extent possible under law, Jonathan Dinu has waived all copyright and related or neighboring rights to Bootstrapping Curricula (for yourself).

This work is published from: United States.

🔮 A production of the 🔮