§ Tidy Data




In tidy data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

While the order of variables and observations does not affect analysis, a good ordering makes
it easier to scan the raw values. One way of organising variables is by their role in the analysis:
are values fixed by the design of the data collection, or are they measured during the course of
the experiment? Fixed variables describe the experimental design and are known in advance.
Computer scientists often call fixed variables dimensions, and statisticians usually denote them
with subscripts on random variables. Measured variables are what we actually measure in the
study. Fixed variables should come first, followed by measured variables, each ordered so that
related variables are contiguous. Rows can then be ordered by the first variable, breaking
ties with the second and subsequent (fixed) variables. This is the convention adopted by all
tabular displays in this paper.
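
The ordering convention above can be sketched in a few lines of plain Python. This is only an illustration: the split into `FIXED` and `MEASURED` and the helper name `order_table` are assumptions made for the example, and the values are borrowed from the weather table shown later in this section.

```python
# Hypothetical records with columns in arbitrary order.
# "id" and "date" are fixed by the design; "tmax" and "tmin" are measured.
rows = [
    {"tmin": 14.4, "date": "2010-02-03", "tmax": 24.1, "id": "MX17004"},
    {"tmin": 14.5, "date": "2010-01-30", "tmax": 27.8, "id": "MX17004"},
    {"tmin": 14.4, "date": "2010-02-02", "tmax": 27.3, "id": "MX17004"},
]

FIXED = ["id", "date"]        # assumed fixed variables
MEASURED = ["tmax", "tmin"]   # assumed measured variables


def order_table(rows, fixed, measured):
    """Put fixed columns first, then measured ones, and sort rows by the fixed columns."""
    columns = fixed + measured
    # Sort by the first fixed variable, breaking ties with the subsequent ones.
    sorted_rows = sorted(rows, key=lambda r: tuple(r[c] for c in fixed))
    # Rebuild each record so its keys follow the chosen column order.
    return [{c: r[c] for c in columns} for r in sorted_rows]


ordered = order_table(rows, FIXED, MEASURED)
for record in ordered:
    print(*record.values())
```

Sorting on ISO-formatted date strings works here because they compare in chronological order; other date formats would need parsing first.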

§ Messy 1: Column headers are values, not variable names



§ Messy 2: Multiple variables stored in one column



§ Messy 3: Variables are stored in both rows and columns



id      year month element d1 d2   d3   d4 d5   ...
MX17004 2010 1     tmax    —  —    —    —  —    — — —
MX17004 2010 1     tmin    —  —    —    —  —    — — —
MX17004 2010 2     tmax    —  27.3 24.1 —  —    — — —
MX17004 2010 2     tmin    —  14.4 14.4 —  —    — — —
MX17004 2010 3     tmax    —  —    —    —  32.1 — — —
MX17004 2010 3     tmin    —  —    —    —  14.2 — — —
MX17004 2010 4     tmax    —  —    —    —  —    — — —
MX17004 2010 4     tmin    —  —    —    —  —    — — —
MX17004 2010 5     tmax    —  —    —    —  —    — — —
MX17004 2010 5     tmin    —  —    —    —  —    — — —


id      date       element value
MX17004 2010-01-30 tmax    27.8
MX17004 2010-01-30 tmin    14.5
MX17004 2010-02-02 tmax    27.3
MX17004 2010-02-02 tmin    14.4
MX17004 2010-02-03 tmax    24.1
MX17004 2010-02-03 tmin    14.4
MX17004 2010-02-11 tmax    29.7
MX17004 2010-02-11 tmin    13.4
MX17004 2010-02-23 tmax    29.9
MX17004 2010-02-23 tmin    10.7
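
The step from the wide layout (one column per day) to the long table above can be sketched as follows. This is a minimal standard-library illustration, not the method used to produce the tables: the helper name `melt_days` is hypothetical, only the day columns `d1` to `d5` named in the header are included, and missing values are represented as `None`.

```python
from datetime import date

DAY_COLUMNS = ["d1", "d2", "d3", "d4", "d5"]  # only the day columns named in the header

# February and March rows from the wide table; None marks a missing value.
wide = [
    {"id": "MX17004", "year": 2010, "month": 2, "element": "tmax",
     "d1": None, "d2": 27.3, "d3": 24.1, "d4": None, "d5": None},
    {"id": "MX17004", "year": 2010, "month": 2, "element": "tmin",
     "d1": None, "d2": 14.4, "d3": 14.4, "d4": None, "d5": None},
    {"id": "MX17004", "year": 2010, "month": 3, "element": "tmax",
     "d1": None, "d2": None, "d3": None, "d4": None, "d5": 32.1},
    {"id": "MX17004", "year": 2010, "month": 3, "element": "tmin",
     "d1": None, "d2": None, "d3": None, "d4": None, "d5": 14.2},
]


def melt_days(wide_rows, day_columns):
    """Turn each non-missing day cell into its own row with a full date."""
    long_rows = []
    for row in wide_rows:
        for col in day_columns:
            value = row[col]
            if value is None:
                continue  # drop missing measurements
            day = int(col[1:])  # "d5" -> 5
            long_rows.append({
                "id": row["id"],
                "date": date(row["year"], row["month"], day).isoformat(),
                "element": row["element"],
                "value": value,
            })
    # Order rows by the fixed variables, as in the tables of this section.
    long_rows.sort(key=lambda r: (r["id"], r["date"], r["element"]))
    return long_rows


long_rows = melt_days(wide, DAY_COLUMNS)
for r in long_rows:
    print(r["id"], r["date"], r["element"], r["value"])
```

Combining `year`, `month`, and the day encoded in the column name into a single `date` removes the variable that was hidden in the column headers.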

id      date       tmax tmin
MX17004 2010-01-30 27.8 14.5
MX17004 2010-02-02 27.3 14.4
MX17004 2010-02-03 24.1 14.4
MX17004 2010-02-11 29.7 13.4
MX17004 2010-02-23 29.9 10.7
MX17004 2010-03-05 32.1 14.2
MX17004 2010-03-10 34.5 16.8
MX17004 2010-03-16 31.1 17.6
MX17004 2010-04-27 36.3 16.7
MX17004 2010-05-27 33.2 18.2
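
The second step, turning the values of `element` into the columns `tmax` and `tmin`, can be sketched in the same style. Again this is an illustrative assumption rather than the original procedure: `spread_elements` is a hypothetical helper, and the input is the first four rows of the long table.

```python
# First four rows of the long table (one row per id, date, and element).
long_rows = [
    {"id": "MX17004", "date": "2010-01-30", "element": "tmax", "value": 27.8},
    {"id": "MX17004", "date": "2010-01-30", "element": "tmin", "value": 14.5},
    {"id": "MX17004", "date": "2010-02-02", "element": "tmax", "value": 27.3},
    {"id": "MX17004", "date": "2010-02-02", "element": "tmin", "value": 14.4},
]


def spread_elements(rows):
    """Collect the rows sharing (id, date) into one record with a column per element."""
    by_key = {}
    for row in rows:
        key = (row["id"], row["date"])
        record = by_key.setdefault(key, {"id": row["id"], "date": row["date"]})
        # The element name ("tmax" or "tmin") becomes a column of its own.
        record[row["element"]] = row["value"]
    # Return one record per observation, ordered by the fixed variables.
    return [by_key[key] for key in sorted(by_key)]


tidy = spread_elements(long_rows)
for r in tidy:
    print(r["id"], r["date"], r["tmax"], r["tmin"])
```

Each output record now corresponds to one observation (one station on one day), with each measured variable in its own column.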


§ Messy 4: Multiple types in one table


§ Data manipulation and its relationship to dplyr



§ Visualization



§ Modelling



§ Questions about performance benchmarking of tidy data