Data Dailies
💾 Updated on June 07, 2020

We are finally doing it! Our very first real data set and our first line of Julia 🙌

As I alluded to in the previous post, we will be working with the data from the COVID Tracking Project (and maybe eventually the NYT COVID case data). While we won't get to doing any data processing or statistics with the data (that's for another day), today is all about the various ways to get data into Julia. And by association... the various data formats you might encounter in the wild.

  1. Downloading Files
  2. CSV
  3. JSON
  4. References and Extras

Downloading Files

First things first, let's actually get the files. You can download files with a web browser if you want, use an existing file on your computer, or maybe even use a built-in dataset in some Julia package. Since we will eventually want to programmatically download files (and possibly automate this process), I wanted to see how to do this purely in Julia. Thankfully Julia Base has a convenient function to do exactly this.

url = "https://covidtracking.com/api/v1/states/current.csv"
download(url,  "data/covid-current.csv")
"data/covid-current.csv"
In Julia, single quotes (') represent a single character while double quotes (") correspond to strings[1].
[1] If you use single quotes for the URL, the download() function will complain.
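A quick way to see the difference is to ask Julia for the type of each literal. This is only a minimal sketch, and the variable names are made up for illustration.

```julia
# Single quotes build a Char, double quotes build a String.
c = 'a'
s = "a"

println(typeof(c))  # Char
println(typeof(s))  # String

# A one-character String is still not a Char, so the two do not compare equal.
println(c == s)     # false

# Indexing a String yields a Char, and string() turns a Char into a String.
println(s[1] == c)       # true
println(string(c) == s)  # true
```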
Now this just downloads a file from a url to a given location on your computer ("data/covid-current.csv"). We still need to read and parse the file.
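One thing to watch for: download() writes to the path you give it but does not create missing folders, so the data/ directory has to exist first. Below is a minimal sketch of a wrapper that takes care of this. The helper name fetch_file and its downloader keyword are hypothetical, added only so the logic can be tried without a network connection.

```julia
# Hypothetical helper: make sure the destination folder exists, then download.
# `downloader` defaults to Base's download and can be swapped out for testing.
function fetch_file(url::AbstractString, dest::AbstractString; downloader=download)
    dir = dirname(dest)
    isempty(dir) || mkpath(dir)   # create e.g. "data/" if it is missing
    downloader(url, dest)
    return dest
end

# Usage (requires network access):
# fetch_file("https://covidtracking.com/api/v1/states/current.csv",
#            joinpath("data", "covid-current.csv"))
```

Returning the destination path mirrors what download() itself prints in the REPL above.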

If we want a quick and dirty method to just inspect the file, we can use the Julia REPL's shell mode. If you type a ; in the REPL, you should notice the julia> prompt turns into a shell> prompt. And in the shell> prompt we can run any command we could on the command line.

julia> ; # break into a command line shell
shell> head -n 2 data/covid-current.csv
state,positive,positiveScore,negativeScore,negativeRegularScore,commercialScore,grade,score,notes,dataQualityGrade,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,onVentilatorCumulative,recovered,lastUpdateEt,checkTimeEt,death,hospitalized,total,totalTestResults,posNeg,fips,dateModified,dateChecked,hash
AK,412,1,1,1,1,A,4,"Please stop using the ""total"" field. Use ""totalTestResults"" instead. As of 4/24/20, ""grade"" is deprecated. Please use ""dataQualityGrade"" instead.",B,45951,,14,,,,,,364,5/27 00:00,5/27 14:50,10,,46363,46363,46363,02,2020-05-27T04:00:00Z,2020-05-27T18:50:00Z,3caed233dafae54569923347849005e47cc21c02

Now this is a little unreadable since we haven't done any parsing or formatting yet (that's what our Julia packages are for after all), but at least we can peek at the header names and verify that the columns are indeed comma-separated[2].

[2] You never can trust the file extensions of files you download off the Internet....
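
If you would rather stay in Julia than drop into the shell, the same peek can be sketched with Base functions alone. The helper name head_lines is made up for this example.

```julia
# Hypothetical helper: return the first `n` lines of a text file,
# similar in spirit to `head -n`.
function head_lines(path::AbstractString, n::Integer)
    open(path) do io
        # eachline is lazy, so reading stops once n lines have been taken
        collect(Iterators.take(eachline(io), n))
    end
end

# Usage:
# foreach(println, head_lines("data/covid-current.csv", 2))
```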

CSV

There are a myriad of possible file types/formats out there (and an equally multiplicitous set of Julia packages to handle them). But while there are many, many possibilities for the types of files you can encounter, chances are the file you want to work with is one of a few common types. Possibly the most common of these is the humble DSV (delimiter-separated values).

The most popular DSV of course uses a comma (i.e. CSV). In Julia, the CSV.jl package can support any type of delimiter, the default being a comma.
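
As a small sketch of the delim keyword, the snippet below parses the same two rows written once with commas and once with tabs. It assumes CSV.jl is already installed, and the tiny inline sample stands in for a real file.

```julia
using CSV

comma_text = "state,positive\nAK,412\nAL,26272\n"
tab_text   = "state\tpositive\nAK\t412\nAL\t26272\n"

# CSV.File accepts any IO source; IOBuffer lets us parse a string in memory.
comma_file = CSV.File(IOBuffer(comma_text))            # no delim needed for commas
tab_file   = CSV.File(IOBuffer(tab_text); delim='\t')  # explicit tab delimiter

# Columns are available as properties on the parsed file.
println(comma_file.positive)
println(tab_file.positive)
```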

Before we can use CSV.jl we have to add the package (optionally in our environment):

julia> ] # enter Pkg REPL
(@v1.4) pkg> activate . # create an environment
(data-dailies) pkg> add CSV DataFrames
   Updating registry at `~/.julia/registries/General`
   ...
using CSV, DataFrames
# Note: newer CSV.jl releases require an explicit sink: CSV.read(path, DataFrame; delim=',')
data = CSV.read(joinpath("data", "covid-current.csv"); delim=',')
print(first(data, 5))
5×39 DataFrames.DataFrame
│ Row │ date     │ state  │ positive │ negative │ pending │ hospitalizedCurrently │ hospitalizedCumulative │ inIcuCurrently │ inIcuCumulative │ onVentilatorCurrently │ onVentilatorCumulative │ recovered │ dataQualityGrade │ lastUpdateEt    │ dateModified         │ checkTimeEt │ death │ hospitalized │ dateChecked          │ totalTestsViral │ positiveTestsViral │ negativeTestsViral │ positiveCasesViral │ fips  │ positiveIncrease │ negativeIncrease │ total  │ totalTestResults │ totalTestResultsIncrease │ posNeg │ deathIncrease │ hospitalizedIncrease │ hash                                     │ commercialScore │ negativeRegularScore │ negativeScore │ positiveScore │ score │ grade   │
│     │ Int64    │ String │ Int64    │ Int64⍰   │ Int64⍰  │ Union{Missing, Int64} │ Union{Missing, Int64}  │ Int64⍰         │ Int64⍰          │ Union{Missing, Int64} │ Union{Missing, Int64}  │ Int64⍰    │ String           │ String          │ String               │ String      │ Int64 │ Int64⍰       │ String               │ Int64⍰          │ Int64⍰             │ Int64⍰             │ Int64⍰             │ Int64 │ Int64            │ Int64            │ Int64  │ Int64            │ Int64                    │ Int64  │ Int64         │ Int64                │ String                                   │ Int64           │ Int64                │ Int64         │ Int64         │ Int64 │ Missing │
├─────┼──────────┼────────┼──────────┼──────────┼─────────┼───────────────────────┼────────────────────────┼────────────────┼─────────────────┼───────────────────────┼────────────────────────┼───────────┼──────────────────┼─────────────────┼──────────────────────┼─────────────┼───────┼──────────────┼──────────────────────┼─────────────────┼────────────────────┼────────────────────┼────────────────────┼───────┼──────────────────┼──────────────────┼────────┼──────────────────┼──────────────────────────┼────────┼───────────────┼──────────────────────┼──────────────────────────────────────────┼─────────────────┼──────────────────────┼───────────────┼───────────────┼───────┼─────────┤
│ 1   │ 20200615 │ AK     │ 664      │ 73773    │ missing │ 21                    │ missing                │ missing        │ missing         │ 3                     │ missing                │ 417       │ A                │ 6/15/2020 00:00 │ 2020-06-15T00:00:00Z │ 06/14 20:00 │ 12    │ missing      │ 2020-06-15T00:00:00Z │ 74437           │ missing            │ missing            │ missing            │ 2     │ 3                │ 967              │ 74437  │ 74437            │ 970                      │ 74437  │ 0             │ 0                    │ 6b08035ecccc3d7c158bd1ebff8a325714b92a03 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 2   │ 20200615 │ AL     │ 26272    │ 276402   │ missing │ 546                   │ 2259                   │ missing        │ 676             │ missing               │ 395                    │ 13508     │ B                │ 6/15/2020 11:00 │ 2020-06-15T11:00:00Z │ 06/15 07:00 │ 774   │ 2259         │ 2020-06-15T11:00:00Z │ missing         │ missing            │ missing            │ 25892              │ 1     │ 657              │ 4562             │ 302674 │ 302674           │ 5219                     │ 302674 │ 1             │ 4                    │ 5a2e19c16964661d50217b2850723b167c2255b8 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 3   │ 20200615 │ AR     │ 12917    │ 191221   │ missing │ 206                   │ 1003                   │ missing        │ missing         │ 45                    │ 163                    │ 8352      │ B                │ 6/15/2020 00:00 │ 2020-06-15T00:00:00Z │ 06/14 20:00 │ 182   │ 1003         │ 2020-06-15T00:00:00Z │ missing         │ missing            │ missing            │ 12917              │ 5     │ 416              │ 6733             │ 204138 │ 204138           │ 7149                     │ 204138 │ 3             │ 5                    │ b09ef2ffe500407eb08f0c06674367d8b4eb4d37 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 4   │ 20200615 │ AS     │ 0        │ 174      │ missing │ missing               │ missing                │ missing        │ missing         │ missing               │ missing                │ missing   │ C                │ 6/1/2020 00:00  │ 2020-06-01T00:00:00Z │ 05/31 20:00 │ 0     │ missing      │ 2020-06-01T00:00:00Z │ missing         │ missing            │ missing            │ missing            │ 60    │ 0                │ 0                │ 174    │ 174              │ 0                        │ 174    │ 0             │ 0                    │ 9fbf373597d3b20bdf748d1567bfb5daf0a1ade9 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 5   │ 20200615 │ AZ     │ 36705    │ 308552   │ missing │ 1449                  │ 3750                   │ 464            │ missing         │ 307                   │ missing                │ 6462      │ A+               │ 6/15/2020 00:00 │ 2020-06-15T00:00:00Z │ 06/14 20:00 │ 1194  │ 3750         │ 2020-06-15T00:00:00Z │ 344929          │ missing            │ missing            │ 36377              │ 4     │ 1014             │ 6198             │ 345257 │ 345257           │ 7212                     │ 345257 │ 8             │ 24                   │ 63be08bad71d36297e4a1aaeb41b67b953a669ad │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │

Here CSV.read returns a DataFrame, which behaves much like the dataframe libraries in other languages (like R's data.frame or Python's pandas). We won't get too much into the specifics of Julia's DataFrames here, but the TL;DR is that they behave like a table (or two-dimensional matrix) with row and column indices.

CSV.jl does its best to infer the type of each column, but you might notice some columns with Int64⍰ (and other types with a ⍰). The ⍰ is shorthand for Union{Missing, T}: for columns with missing values (and no user-supplied type), the library guesses the most appropriate type and allows missing alongside it[3].

[3] Also, since a non-decimal numeric value can be represented either as an Int or as a Float, CSV.jl picks the most specific type (i.e. all Ints can be represented by Floats but not all Floats can be represented as an Int).
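
To make the Union{Missing, Int64} columns and the Int-before-Float idea concrete, here is a rough sketch using only Base Julia. The infer_type function is a hypothetical toy written for this post, not how CSV.jl is actually implemented.

```julia
# A column with a hole in it gets a Union element type.
col = [664, missing, 12917]
println(eltype(col))              # Union{Missing, Int64}

# Most reductions propagate missing, so skip the holes explicitly.
println(sum(col))                 # missing
println(sum(skipmissing(col)))    # 13581

# Toy inference: try the most specific type first, then fall back.
function infer_type(raw::AbstractString)
    isempty(raw) && return Missing
    tryparse(Int, raw) !== nothing && return Int
    tryparse(Float64, raw) !== nothing && return Float64
    return String
end

println(infer_type.(["412", "4.12", "AK", ""]))
```

Trying Int before Float64 is what keeps a value like 412 from being read as 412.0.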

JSON

A nicety of the COVID Tracking Project is that the same data is published in both CSV and JSON formats. We can use the same download function as before and just use the JSON url instead.

url = "https://covidtracking.com/api/v1/states/current.json"
download(url,  "data/covid-current.json")
"data/covid-current.json"

And analogous to CSV.jl, we have the JSON.jl package. The only difference here is that the JSON.jl package parses this JSON file (a top-level array of objects) into an Array of Dicts:

julia> ]
(@v1.4) pkg> activate . # load previously created environment
(data-dailies) pkg> add JSON
using JSON
data = JSON.parsefile("data/covid-current.json")
print("The loaded JSON is of type: $(typeof(data))")
data[1]
The loaded JSON is of type: Array{Any,1}
Dict{String,Any} with 39 entries:
  "negativeIncrease" => 967
  "totalTestResultsIncrease" => 970
  "negativeScore" => 0
  "inIcuCumulative" => nothing
  "negativeTestsViral" => nothing
  "checkTimeEt" => "06/14 20:00"
  "inIcuCurrently" => nothing
  "recovered" => 417
  "state" => "AK"
  "positiveCasesViral" => nothing
  "positiveIncrease" => 3
  "negativeRegularScore" => 0
  "posNeg" => 74437
  "totalTestsViral" => 74437
  "hospitalizedCumulative" => nothing
  "deathIncrease" => 0
  "positiveScore" => 0
  "grade" => ""
  "lastUpdateEt" => "6/15/2020 00:00"
  "hospitalizedIncrease" => 0
  "fips" => "02"
  "pending" => nothing
  "total" => 74437
  "totalTestResults" => 74437
  "hash" => "6b08035ecccc3d7c158bd1ebff8a325714b92a03"
  "hospitalized" => nothing
  "date" => 20200615
  "dateChecked" => "2020-06-15T00:00:00Z"
  "score" => 0
  "positiveTestsViral" => nothing
  "onVentilatorCumulative" => nothing
  "dataQualityGrade" => "A"
  "hospitalizedCurrently" => 21
  "commercialScore" => 0
  "negative" => 73773
  "positive" => 664
  "dateModified" => "2020-06-15T00:00:00Z"
  "death" => 12
  "onVentilatorCurrently" => 3

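Because the result is just an Array of Dicts, ordinary indexing and comprehensions are enough to pull fields out. The sketch below assumes JSON.jl is installed and parses a small inline string (trimmed down from the records above) rather than the downloaded file. Note that JSON null arrives as nothing, not missing; converting it is shown as one possible approach.

```julia
using JSON

text = """
[
  {"state": "AK", "positive": 664, "pending": null},
  {"state": "AL", "positive": 26272, "pending": null}
]
"""

data = JSON.parse(text)

# Each element can be indexed by field name, like a Dict.
states    = [row["state"] for row in data]
positives = [row["positive"] for row in data]
println(states)
println(sum(positives))

# JSON null arrives as `nothing`; swap it for `missing` if you want
# the same convention the CSV route uses.
pending = [something(row["pending"], missing) for row in data]
println(pending)
```
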
Now, we only saw how to deal with CSV and JSON files, but the JuliaIO GitHub organization has a (seemingly) exhaustive list of packages for various file formats. I have also listed some common packages in the extras 👇

References and Extras

CC0
To the extent possible under law, Jonathan Dinu has waived all copyright and related or neighboring rights to Getting Data into Julia.

This work is published from: United States.

🔮 A production of the hyphaebeast.club 🔮