Data Dailies
💾 Updated on June 05, 2020

R made them cool, pandas brough them to Python, and Julia has DataFrames.jl...

  1. A Data what?
  2. DataFrame data types
  3. Indexing
  4. References and Extras

A Data what?

DataFrames are perhaps the most ubiquitous data structure in data science, mainly because at their core they simply represent tabular data[1]. And while they may seem like a matrix or multidimensional array on the surface, it is what is inside that makes them really special.

[1] and it is hard to throw a data science stone without hitting tables...
Like a vector/matrix[2] in Julia, DataFrames support mutable constant time row and column operations (i.e. get me column 2 of row 12) except that the columns in a DataFrame usually have names. So instead of column 2 of row 12 you can say get me the total_cases of row 12 (unfortunately DataFrames.jl doesn't support row indices like in pandas).

[2] Vector and Matrix are really just syntactic sugar for a one- and two- dimensional Array.
But besides clarity, DataFrame's lookup mechanism is actually much more powerful since the row/column indices don't have to be sequential (as in [1, 2, 3, etc.]) and function like a dictionary of dictionaries with key lookups.

DataFrame data types

Because DataFrame's can hold heterogeneous types[3] we can group on certain columns and compute on others (like get the mean). A DataFrame is actually one of the most flexible data structures (while maintaining modest performance). Because of this it is very well suited to data manipulation and exploratory data analysis tasks. Or any time when you may not know the exact type/structure of your data a priori.

[3] each column can have it's own type.
tupl = ("a", 1, [3,4])
arr = ["a", 1, [3,4]]

println("$(typeof(tupl))")
println("$(typeof(arr))")
Tuple{String,Int64,Array{Int64,1}}
Array{Any,1}

In the above code, notice that while an Array can hold multiple types, the Array itself can only have a single type (in this case Any). This is the most specific type that can describe all the elements. But for the Tuple, the individual elements retain their own types (like a DataFrame). But unlike a DataFrame, a Tuple is immutable, for better or worse.

julia> ] # enter Pkg REPL
(@v1.4) pkg> activate .
(data-dailies) pkg> add DataFrames
using DataFrames

df = DataFrame(letter = 'a':'d',
               num = 1:4,
               other = [(1,3), 20, [4,5,6], "Jon"])
print(df)
4×3 DataFrames.DataFrame
│ Row │ letter │ num   │ other     │
│     │ Char   │ Int64 │ Any       │
├─────┼────────┼───────┼───────────┤
│ 1   │ 'a'    │ 1     │ (1, 3)    │
│ 2   │ 'b'    │ 2     │ 20        │
│ 3   │ 'c'    │ 3     │ [4, 5, 6] │
│ 4   │ 'd'    │ 4     │ Jon       │

As you can see above, each column of a DataFrame functions similarly to a single Array w.r.t. types, but from one column to the next they can have very different types. So in this sense it almost functions like a Tuple of Arrays....

If you remember back alllll the way to last Thursday we actually have already encountered DataFrames when we read in data with the CSV.jl package.
letter = ['a', 'b', 'c', 'd']
num = [1, 2, 3, 4]
other = [(1,3), 20, [4,5,6], "Jon"]

typeof(tuple(letter, num, other))
Tuple{Array{Char,1},Array{Int64,1},Array{Any,1}}

Indexing

In additional to being a very flexible data structure, DataFrame's also allow us to index into them in very flexible ways.

Julia has a convenient built-in pipe operator (|>) that allows you to chain function calls as well as the compose function ().
using CSV

# load in the COVID data set from last week
data = CSV.read(joinpath("data", "covid-current.csv"); delim=',')

# inspect first 5 rows
first(data, 5) |> println
println()

# inspect last 5 rows
last(data, 5)
5×35 DataFrames.DataFrame
│ Row │ date     │ state  │ positive │ negative │ pending │ hospitalizedCurrently │ hospitalizedCumulative │ inIcuCurrently │ inIcuCumulative │ onVentilatorCurrently │ onVentilatorCumulative │ recovered │ dataQualityGrade │ lastUpdateEt   │ dateModified         │ checkTimeEt │ death │ hospitalized │ dateChecked          │ fips  │ positiveIncrease │ negativeIncrease │ total  │ totalTestResults │ totalTestResultsIncrease │ posNeg │ deathIncrease │ hospitalizedIncrease │ hash                                     │ commercialScore │ negativeRegularScore │ negativeScore │ positiveScore │ score │ grade   │
│     │ Int64    │ String │ Int64    │ Int64⍰   │ Int64⍰  │ Union{Missing, Int64} │ Union{Missing, Int64}  │ Int64⍰         │ Int64⍰          │ Union{Missing, Int64} │ Union{Missing, Int64}  │ Int64⍰    │ String           │ String         │ String               │ String      │ Int64 │ Int64⍰       │ String               │ Int64 │ Int64            │ Int64            │ Int64  │ Int64            │ Int64                    │ Int64  │ Int64         │ Int64                │ String                                   │ Int64           │ Int64                │ Int64         │ Int64         │ Int64 │ Missing │
├─────┼──────────┼────────┼──────────┼──────────┼─────────┼───────────────────────┼────────────────────────┼────────────────┼─────────────────┼───────────────────────┼────────────────────────┼───────────┼──────────────────┼────────────────┼──────────────────────┼─────────────┼───────┼──────────────┼──────────────────────┼───────┼──────────────────┼──────────────────┼────────┼──────────────────┼──────────────────────────┼────────┼───────────────┼──────────────────────┼──────────────────────────────────────────┼─────────────────┼──────────────────────┼───────────────┼───────────────┼───────┼─────────┤
│ 1   │ 20200607 │ AK     │ 544      │ 64360    │ missing │ 7                     │ missing                │ missing        │ missing         │ 1                     │ missing                │ 382       │ B                │ 6/7/2020 00:00 │ 2020-06-07T00:00:00Z │ 06/06 20:00 │ 10    │ missing      │ 2020-06-07T00:00:00Z │ 2     │ 8                │ 995              │ 64904  │ 64904            │ 1003                     │ 64904  │ 0             │ -48                  │ 62adbd451838656b7df7519e830d6439be0b5877 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 2   │ 20200607 │ AL     │ 20500    │ 239066   │ missing │ missing               │ 2022                   │ missing        │ 615             │ missing               │ 364                    │ 11395     │ B                │ 6/7/2020 11:00 │ 2020-06-07T11:00:00Z │ 06/07 07:00 │ 692   │ 2022         │ 2020-06-07T11:00:00Z │ 1     │ 457              │ 13465            │ 259566 │ 259566           │ 13922                    │ 259566 │ 3             │ 29                   │ 9040674078ce6afca363f8e95943845a032ab5d6 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 3   │ 20200607 │ AR     │ 9426     │ 150847   │ missing │ 145                   │ 844                    │ missing        │ missing         │ 35                    │ 143                    │ 6424      │ A                │ 6/7/2020 16:10 │ 2020-06-07T16:10:00Z │ 06/07 12:10 │ 154   │ 844          │ 2020-06-07T16:10:00Z │ 5     │ 325              │ 3191             │ 160273 │ 160273           │ 3516                     │ 160273 │ 0             │ 6                    │ ef23d4d3f9e232bb5f58a59d79a27d2cb0797e2a │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 4   │ 20200607 │ AS     │ 0        │ 174      │ missing │ missing               │ missing                │ missing        │ missing         │ missing               │ missing                │ missing   │ C                │ 6/1/2020 00:00 │ 2020-06-01T00:00:00Z │ 05/31 20:00 │ 0     │ missing      │ 2020-06-01T00:00:00Z │ 60    │ 0                │ 0                │ 174    │ 174              │ 0                        │ 174    │ 0             │ 0                    │ 893135d0d7a9340a91aca139f4e3bb289f418f71 │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │
│ 5   │ 20200607 │ AZ     │ 26889    │ 254732   │ missing │ 1252                  │ 3352                   │ 392            │ missing         │ 248                   │ missing                │ 5517      │ A+               │ 6/7/2020 00:00 │ 2020-06-07T00:00:00Z │ 06/06 20:00 │ 1044  │ 3352         │ 2020-06-07T00:00:00Z │ 4     │ 1438             │ 8537             │ 281621 │ 281621           │ 9975                     │ 281621 │ 2             │ 32                   │ 505a05efa5a9b912644a7ad16b2ab6f37330806b │ 0               │ 0                    │ 0             │ 0             │ 0     │ missing │

DataFrames.DataFrame(AbstractArray{T,1} where T[[20200607, 20200607, 20200607, 20200607, 20200607], ["VT", "WA", "WI", "WV", "WY"], [1063, 23442, 20835, 2144, 947], Union{Missing, Int64}[40155, 377146, 325867, 109663, 27804], Union{Missing, Int64}[missing, missing, 183, missing, missing], Union{Missing, Int64}[14, 205, 491, 28, 4], Union{Missing, Int64}[missing, 3652, 2848, missing, 90], Union{Missing, Int64}[missing, missing, 107, 9, missing], Union{Missing, Int64}[missing, missing, 638, missing, missing], Union{Missing, Int64}[missing, 57, missing, 4, missing], Union{Missing, Int64}[missing, missing, missing, missing, missing], Union{Missing, Int64}[890, missing, 14047, 1451, 757], ["B", "B", "A+", "B", "C"], ["6/7/2020 00:00", "6/6/2020 02:59", "6/7/2020 00:00", "6/7/2020 10:00", "6/7/2020 11:52"], ["2020-06-07T00:00:00Z", "2020-06-06T02:59:00Z", "2020-06-07T00:00:00Z", "2020-06-07T10:00:00Z", "2020-06-07T11:52:00Z"], ["06/06 20:00", "06/05 22:59", "06/06 20:00", "06/07 06:00", "06/07 07:52"], [55, 1153, 647, 84, 17], Union{Missing, Int64}[missing, 3652, 2848, missing, 90], ["2020-06-07T00:00:00Z", "2020-06-06T02:59:00Z", "2020-06-07T00:00:00Z", "2020-06-07T10:00:00Z", "2020-06-07T11:52:00Z"], [50, 53, 55, 54, 56], [17, 273, 264, 13, 8], [1118, -202, 11065, 2071, 235], [41218, 400588, 346885, 111807, 28751], [41218, 400588, 346702, 111807, 28751], [1135, 71, 11329, 2084, 243], [41218, 400588, 346702, 111807, 28751], [0, 4, 2, 0, 0], [0, 13, 16, 0, 0], ["6821fdfc6d44be667ac620f4b4706080fa3de624", "66df353c7076869706b2034f0611d2958f4b0b1c", "b90a15ee8222ed1d7dc36819c145dfb915e1351e", "bd865dabcfb54d72bf2b6198165ad1207a8d78c1", "6b568085b4cc2c7490d4ced37e03f883c5e1dd34"], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [missing, missing, missing, missing, missing]], DataFrames.Index(Dict(:hospitalizedCurrently => 6,:recovered => 12,:lastUpdateEt => 14,:state => 2,:dataQualityGrade => 13,:hospitalized => 18,:posNeg => 26,:commercialScore => 30,:negativeRegularScore => 31,:positiveScore => 33,:score => 34,:death => 17,:fips => 20,:deathIncrease => 27,:onVentilatorCurrently => 10,:pending => 5,:totalTestResults => 24,:inIcuCurrently => 8,:onVentilatorCumulative => 11,:dateModified => 15,:positiveIncrease => 21,:negative => 4,:positive => 3,:negativeScore => 32,:hospitalizedIncrease => 28,:checkTimeEt => 16,:dateChecked => 19,:grade => 35,:totalTestResultsIncrease => 25,:hospitalizedCumulative => 7,:hash => 29,:date => 1,:inIcuCumulative => 9,:negativeIncrease => 22,:total => 23), [:date, :state, :positive, :negative, :pending, :hospitalizedCurrently, :hospitalizedCumulative, :inIcuCurrently, :inIcuCumulative, :onVentilatorCurrently, :onVentilatorCumulative, :recovered, :dataQualityGrade, :lastUpdateEt, :dateModified, :checkTimeEt, :death, :hospitalized, :dateChecked, :fips, :positiveIncrease, :negativeIncrease, :total, :totalTestResults, :totalTestResultsIncrease, :posNeg, :deathIncrease, :hospitalizedIncrease, :hash, :commercialScore, :negativeRegularScore, :negativeScore, :positiveScore, :score, :grade]))

He we are loading in the COVID data set from the COVID Tracking Project that we downloaded last week. And if you remember, with this data the columns in that data set actually have many different types.

If we want to index specific ranges of columns AND rows we can do that too!

# for rows [3, 17, 20] show me columns [1,2,3]
data[[3, 17, 20], 1:3]
DataFrames.DataFrame(AbstractArray{T,1} where T[[20200607, 20200607, 20200607], ["AR", "IL", "KY"], [9426, 127757, 11287]], DataFrames.Index(Dict(:state => 2,:positive => 3,:date => 1), [:date, :state, :positive]))
# we can also get columns by their name
data[1:5, [:recovered, :death]]
DataFrames.DataFrame(AbstractArray{T,1} where T[Union{Missing, Int64}[382, 11395, 6424, missing, 5517], [10, 692, 154, 0, 1044]], DataFrames.Index(Dict(:death => 2,:recovered => 1), [:recovered, :death]))
names(data)
35-element Array{Symbol,1}:
 :date
 :state
 :positive
 :negative
 :pending
 :hospitalizedCurrently
 :hospitalizedCumulative
 :inIcuCurrently
 :inIcuCumulative
 :onVentilatorCurrently
 :onVentilatorCumulative
 :recovered
 :dataQualityGrade
 :lastUpdateEt
 :dateModified
 :checkTimeEt
 :death
 :hospitalized
 :dateChecked
 :fips
 :positiveIncrease
 :negativeIncrease
 :total
 :totalTestResults
 :totalTestResultsIncrease
 :posNeg
 :deathIncrease
 :hospitalizedIncrease
 :hash
 :commercialScore
 :negativeRegularScore
 :negativeScore
 :positiveScore
 :score
 :grade

One nicety that DataFrames.jl has over pandas is built-in conditional logic for column selections:

# select columns based on regular expression
data[!, r"(pos|neg)"] |> head
DataFrames.DataFrame(AbstractArray{T,1} where T[[544, 20500, 9426, 0, 26889, 128812], Union{Missing, Int64}[64360, 239066, 150847, 174, 254732, 2233406], [8, 457, 325, 0, 1438, 2796], [995, 13465, 3191, 0, 8537, 51122], [64904, 259566, 160273, 174, 281621, 2362218], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], DataFrames.Index(Dict(:positiveIncrease => 3,:negative => 2,:negativeRegularScore => 6,:posNeg => 5,:positiveScore => 8,:positive => 1,:negativeIncrease => 4,:negativeScore => 7), [:positive, :negative, :positiveIncrease, :negativeIncrease, :posNeg, :negativeRegularScore, :negativeScore, :positiveScore]))
# invert column selection with case-insensitive regex
data[!, Not(r"(score|pos|neg|grade|notes|total|date)"i)] |> head
DataFrames.DataFrame(AbstractArray{T,1} where T[["AK", "AL", "AR", "AS", "AZ", "CA"], Union{Missing, Int64}[missing, missing, missing, missing, missing, missing], Union{Missing, Int64}[7, missing, 145, missing, 1252, 4525], Union{Missing, Int64}[missing, 2022, 844, missing, 3352, missing], Union{Missing, Int64}[missing, missing, missing, missing, 392, 1288], Union{Missing, Int64}[missing, 615, missing, missing, missing, missing], Union{Missing, Int64}[1, missing, 35, missing, 248, missing], Union{Missing, Int64}[missing, 364, 143, missing, missing, missing], Union{Missing, Int64}[382, 11395, 6424, missing, 5517, missing], ["06/06 20:00", "06/07 07:00", "06/07 12:10", "05/31 20:00", "06/06 20:00", "06/06 20:00"], [10, 692, 154, 0, 1044, 4626], Union{Missing, Int64}[missing, 2022, 844, missing, 3352, missing], [2, 1, 5, 60, 4, 6], [0, 3, 0, 0, 2, 67], [-48, 29, 6, 0, 32, 0], ["62adbd451838656b7df7519e830d6439be0b5877", "9040674078ce6afca363f8e95943845a032ab5d6", "ef23d4d3f9e232bb5f58a59d79a27d2cb0797e2a", "893135d0d7a9340a91aca139f4e3bb289f418f71", "505a05efa5a9b912644a7ad16b2ab6f37330806b", "c488c7b5e4d006a61717e637d445fcb324f8e1ee"]], DataFrames.Index(Dict(:hospitalizedCurrently => 3,:recovered => 9,:hospitalized => 12,:state => 1,:death => 11,:fips => 13,:deathIncrease => 14,:onVentilatorCurrently => 7,:pending => 2,:inIcuCurrently => 5,:onVentilatorCumulative => 8,:hospitalizedIncrease => 15,:checkTimeEt => 10,:hospitalizedCumulative => 4,:hash => 16,:inIcuCumulative => 6), [:state, :pending, :hospitalizedCurrently, :hospitalizedCumulative, :inIcuCurrently, :inIcuCumulative, :onVentilatorCurrently, :onVentilatorCumulative, :recovered, :checkTimeEt, :death, :hospitalized, :fips, :deathIncrease, :hospitalizedIncrease, :hash]))

And finally we can also index based on specific values of certain columns:

# which states have had more than 50,000 positive cases
data[data.positive .> 50000, [:state, :positive]]
DataFrames.DataFrame(AbstractArray{T,1} where T[["CA", "FL", "GA", "IL", "MA", "MD", "MI", "NJ", "NY", "PA", "TX", "VA"], [128812, 63938, 51898, 127757, 103436, 57973, 64413, 164164, 378097, 76212, 74978, 50681]], DataFrames.Index(Dict(:state => 1,:positive => 2), [:state, :positive]))

While we have only scratched the surface of the power and potential[4] of DataFrames.jl, this should be enough to get us started slicing and dicing our data. As we progress through more interesting analysis in these posts I will be sure to call out the additional functionality of DataFrames.jl relevant to the problem at hand.

[4] and haven't even looked at the rich ecosystem of querying frameworks for it...

References and Extras

CC0
To the extent possible under law, Jonathan Dinu has waived all copyright and related or neighboring rights to DataFrames, oh my!.

This work is published from: United States.

🔮 A production of the hyphaebeast.club 🔮