§ Tidy Data
- Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
While the order of variables and observations does not affect analysis, a good ordering makes
it easier to scan the raw values. One way of organising variables is by their role in the analysis:
are values fixed by the design of the data collection, or are they measured during the course of
the experiment? Fixed variables describe the experimental design and are known in advance.
Computer scientists often call fixed variables dimensions, and statisticians usually denote them
with subscripts on random variables. Measured variables are what we actually measure in the
study. Fixed variables should come first, followed by measured variables, each ordered so that
related variables are contiguous. Rows can then be ordered by the first variable, breaking
ties with the second and subsequent (fixed) variables. This is the convention adopted by all
tabular displays in the tidy data paper.
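The ordering convention above can be sketched in plain Python. This is only an illustration: the helper `order_tidy` and the `fixed`/`measured` lists are hypothetical names, and the two rows are taken from the weather table later in these notes.

```python
# Two tidy rows, deliberately out of order and with scrambled columns.
rows = [
    {"tmax": 27.3, "date": "2010-02-02", "id": "MX17004", "tmin": 14.4},
    {"tmax": 27.8, "date": "2010-01-30", "id": "MX17004", "tmin": 14.5},
]

# Assumed roles: 'id' and 'date' are fixed by design, 'tmax'/'tmin' are measured.
fixed = ["id", "date"]
measured = ["tmax", "tmin"]


def order_tidy(rows, fixed, measured):
    """Put fixed variables first, then sort rows by the fixed variables in order."""
    columns = fixed + measured
    reordered = [{c: r[c] for c in columns} for r in rows]
    # Sorting on a tuple breaks ties with the second and later fixed variables.
    return sorted(reordered, key=lambda r: tuple(r[c] for c in fixed))


ordered = order_tidy(rows, fixed, measured)
for r in ordered:
    print(r)
```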
§ Messy 1: Column headers are values, not variable names
- e.g. columns are `religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k`.
- Melt the dataset to get molten (stacked) data.
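A minimal sketch of melting, written in plain Python so it needs no libraries. The `melt` helper, the output column names `income` and `freq`, and the counts are all assumptions made for illustration; the counts are placeholders, not survey figures.

```python
# Hypothetical wide table: one column per income bracket (only two shown).
wide = [
    {"religion": "A", "<$10k": 1, "$10-20k": 2},
    {"religion": "B", "<$10k": 3, "$10-20k": 4},
]


def melt(rows, id_vars, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    molten = []
    for row in rows:
        for col, val in row.items():
            if col in id_vars:
                continue
            out = {k: row[k] for k in id_vars}
            out[var_name] = col      # the old column header becomes a value
            out[value_name] = val    # the old cell becomes the measurement
            molten.append(out)
    return molten


molten = melt(wide, id_vars=["religion"], var_name="income", value_name="freq")
for r in molten:
    print(r)
```

Each wide row with two bracket columns yields two molten rows, so the column headers end up as values of a single `income` variable.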
§ Messy 2: Multiple variables stored in one column
- This often manifests after melting.
- e.g. columns are `country | year | m014 | m1524 | .. | f014 | f1524 ...`
- The columns represent both sex and age ranges. After melting, we get a single column `sexage` with entries like `m014` or `f1524`.
- The data is still molten, so we should reshape it before it sets into tidy columnar data. We do this by splitting the column into two, one for `age` and one for `sex`.
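The split could look roughly like this in plain Python. The helper `split_sexage`, the `cases` column, the country code `XX`, and the `AGE_LABELS` lookup are hypothetical; the lookup covers only the two age codes mentioned in these notes, and the counts are placeholders.

```python
# Molten rows that still carry the combined 'sexage' column.
molten = [
    {"country": "XX", "year": 2000, "sexage": "m014", "cases": 0},
    {"country": "XX", "year": 2000, "sexage": "f1524", "cases": 1},
]

# Assumed mapping from age code to a readable range.
AGE_LABELS = {"014": "0-14", "1524": "15-24"}


def split_sexage(row):
    """Split 'sexage' into 'sex' (first character) and 'age' (the rest)."""
    code = row["sexage"]
    sex, age_code = code[0], code[1:]
    out = {k: v for k, v in row.items() if k != "sexage"}
    out["sex"] = sex
    # Unknown codes are passed through unchanged rather than guessed.
    out["age"] = AGE_LABELS.get(age_code, age_code)
    return out


tidy = [split_sexage(r) for r in molten]
for r in tidy:
    print(r)
```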
§ Messy 3: Variables are stored in both rows and columns
id year month element d1 d2 d3 d4 d5 ...
MX17004 2010 1 tmax — — — — — — — —
MX17004 2010 1 tmin — — — — — — — —
MX17004 2010 2 tmax — 27.3 24.1 — — — — —
MX17004 2010 2 tmin — 14.4 14.4 — — — — —
MX17004 2010 3 tmax — — — — 32.1 — — —
MX17004 2010 3 tmin — — — — 14.2 — — —
MX17004 2010 4 tmax — — — — — — — —
MX17004 2010 4 tmin — — — — — — — —
MX17004 2010 5 tmax — — — — — — — —
MX17004 2010 5 tmin — — — — — — — —
- Some variables are in individual columns (id, year, month)
- Some variables are spread across columns (day is spread as d1–d31)
- Some variables are smeared across rows (e.g. `tmax`/`tmin`). TODO: what does this mean, really?
- First, tidy by melting the day columns and collating year, month and day into `date`:
id date element value
MX17004 2010-01-30 tmax 27.8
MX17004 2010-01-30 tmin 14.5
MX17004 2010-02-02 tmax 27.3
MX17004 2010-02-02 tmin 14.4
MX17004 2010-02-03 tmax 24.1
MX17004 2010-02-03 tmin 14.4
MX17004 2010-02-11 tmax 29.7
MX17004 2010-02-11 tmin 13.4
MX17004 2010-02-23 tmax 29.9
MX17004 2010-02-23 tmin 10.7
- The dataset above is still molten. We must reshape along `element` to get two columns, one for the max and one for the min (`tmax` and `tmin`). This gives:
id date tmax tmin
MX17004 2010-01-30 27.8 14.5
MX17004 2010-02-02 27.3 14.4
MX17004 2010-02-03 24.1 14.4
MX17004 2010-02-11 29.7 13.4
MX17004 2010-02-23 29.9 10.7
MX17004 2010-03-05 32.1 14.2
MX17004 2010-03-10 34.5 16.8
MX17004 2010-03-16 31.1 17.6
MX17004 2010-04-27 36.3 16.7
MX17004 2010-05-27 33.2 18.2
- Months with fewer than 31 days have structural missing values for the last day(s) of the month.
- The `element` column is not a variable; it stores the names of variables.
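Both steps for the weather data can be sketched in plain Python. The input uses the February `d2` and `d3` values from the raw table above, with only three day columns shown. Dropping missing cells during the melt is an assumption of this sketch, and the variable names are illustrative.

```python
from datetime import date

# Raw layout: days spread across columns, tmax/tmin spread across rows.
raw = [
    {"id": "MX17004", "year": 2010, "month": 2, "element": "tmax",
     "d1": None, "d2": 27.3, "d3": 24.1},
    {"id": "MX17004", "year": 2010, "month": 2, "element": "tmin",
     "d1": None, "d2": 14.4, "d3": 14.4},
]

# Step 1: melt the day columns and collate year, month, day into a date.
molten = []
for row in raw:
    for col, value in row.items():
        is_day_column = col[0] == "d" and col[1:].isdigit()
        if not is_day_column or value is None:
            continue  # skip identifier columns and missing measurements
        day = int(col[1:])
        molten.append({
            "id": row["id"],
            "date": date(row["year"], row["month"], day).isoformat(),
            "element": row["element"],
            "value": value,
        })

# Step 2: reshape along 'element' so that tmax and tmin become columns.
by_key = {}
for m in molten:
    key = (m["id"], m["date"])
    entry = by_key.setdefault(key, {"id": m["id"], "date": m["date"]})
    entry[m["element"]] = m["value"]

tidy_rows = [by_key[k] for k in sorted(by_key)]
for r in tidy_rows:
    print(r)
```

After step 2 each row is one station-day observation, and `element` has disappeared because its entries were variable names rather than values.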
§ Multiple types in one table:
§ Data manipulation, relationship to `dplyr`:
- Data transformation in R for Data Science:
  - `mutate()` adds new variables that are functions of existing variables.
  - `select()` picks variables based on their names.
  - `filter()` picks cases based on their values.
  - `summarise()` reduces multiple values down to a single summary.
  - `arrange()` changes the ordering of the rows.
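To make the five verbs concrete, here are rough plain-Python analogues operating on a list of tidy rows. These are not dplyr's API, only a sketch of what each verb does; the rows come from the tidy weather table above, and the derived column `spread` is a hypothetical name.

```python
from statistics import mean

rows = [
    {"id": "MX17004", "date": "2010-02-02", "tmax": 27.3, "tmin": 14.4},
    {"id": "MX17004", "date": "2010-02-03", "tmax": 24.1, "tmin": 14.4},
    {"id": "MX17004", "date": "2010-01-30", "tmax": 27.8, "tmin": 14.5},
]

# mutate(): add a variable that is a function of existing variables.
mutated = [{**r, "spread": round(r["tmax"] - r["tmin"], 1)} for r in rows]

# select(): keep variables by name.
selected = [{k: r[k] for k in ("date", "spread")} for r in mutated]

# filter(): keep cases based on their values.
filtered = [r for r in mutated if r["tmax"] > 25]

# summarise(): reduce many values to a single summary.
summary = {"mean_tmax": mean(r["tmax"] for r in rows)}

# arrange(): change the ordering of the rows.
arranged = sorted(rows, key=lambda r: r["date"])

print(selected)
print(len(filtered), summary, arranged[0]["date"])
```

Each verb takes tidy rows in and gives tidy rows (or a summary) out, which is why they compose so easily once the data has one variable per column.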
§ Visualization
- Most of R's visualization ecosystem is tidy by default.
- base `plot`, `lattice`, and `ggplot` are all tidy.
§ Modelling
- Most modelling tools work best with tidy datasets.
§ Questions about performance benchmarking in terms of tidy data
- Are runs of a program at different optimization levels such as `O1`, `O2`, `O3` to be stored as separate columns? Or as a categorical column called "optimization level", with the entries `O1`, `O2`, `O3` stored in separate rows?
- If we go by the tidy rule "Each variable forms a column", then this suggests that `optimization level` is a variable.
- Then the tidy rule "Each observation forms a row" makes us use rows like `[foo.test | opt-level=O1 | ]` and `[foo.test | opt-level=O2 | ]`.
- Broader question: what is the tidy rule for a categorical column?
- However, Table 12 of the tidy data paper advocates two columns, `tmin` and `tmax`, instead of a column called `element` with the choices `tmin` and `tmax`. So it seems to be preferred that, if one has a categorical variable, its values become columns.
- This suggests that I lay out my bench data as `[foo.test | O1-runtime=_ | O2-runtime=_ | O3-runtime=_ ]`.
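The two candidate layouts carry the same information, so converting between them is mechanical. The sketch below shows both directions in plain Python; it does not settle which layout is the tidy one. The helpers `to_wide` and `to_long`, the column names `opt_level` and `runtime`, and the runtimes are all hypothetical placeholders, not measurements.

```python
# Long layout: one row per (test, optimization level) observation.
long_rows = [
    {"test": "foo.test", "opt_level": "O1", "runtime": 3.0},
    {"test": "foo.test", "opt_level": "O2", "runtime": 2.0},
    {"test": "foo.test", "opt_level": "O3", "runtime": 1.0},
]

SUFFIX = "-runtime"


def to_wide(rows):
    """One row per test, one '<level>-runtime' column per optimization level."""
    wide = {}
    for r in rows:
        entry = wide.setdefault(r["test"], {"test": r["test"]})
        entry[r["opt_level"] + SUFFIX] = r["runtime"]
    return list(wide.values())


def to_long(rows):
    """Inverse of to_wide: melt the '<level>-runtime' columns back into rows."""
    out = []
    for r in rows:
        for col, val in r.items():
            if col.endswith(SUFFIX):
                out.append({"test": r["test"],
                            "opt_level": col[:-len(SUFFIX)],
                            "runtime": val})
    return out


wide_rows = to_wide(long_rows)
print(wide_rows)
print(to_long(wide_rows) == long_rows)
```

One thing to weigh when choosing: in the long layout, adding an `O0` run only adds rows, while in the wide layout it adds a column.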