## § Tidy Data

• Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

While the order of variables and observations does not affect analysis, a good ordering makes it easier to scan the raw values. One way of organising variables is by their role in the analysis: are values fixed by the design of the data collection, or are they measured during the course of the experiment? Fixed variables describe the experimental design and are known in advance. Computer scientists often call fixed variables dimensions, and statisticians usually denote them with subscripts on random variables. Measured variables are what we actually measure in the study. Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous. Rows can then be ordered by the first variable, breaking ties with the second and subsequent (fixed) variables. This is the convention adopted by all tabular displays in this paper.
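
The ordering convention can be sketched in plain Python. This is only an illustration: the `FIXED` / `MEASURED` split and the list-of-dicts representation are assumptions of this sketch, not something the paper prescribes.

```python
# Assumed split of columns by role; the names come from the weather example.
FIXED = ["id", "date"]        # known in advance, fixed by the design
MEASURED = ["tmax", "tmin"]   # measured during the study

# Rows with columns and observations in arbitrary order.
rows = [
    {"tmin": 14.4, "date": "2010-02-03", "id": "MX17004", "tmax": 24.1},
    {"tmax": 27.3, "id": "MX17004", "date": "2010-02-02", "tmin": 14.4},
]

# Put fixed columns first, then measured ones.
reordered = [{c: r[c] for c in FIXED + MEASURED} for r in rows]

# Sort rows by the fixed variables, first variable first.
ordered = sorted(reordered, key=lambda r: tuple(r[c] for c in FIXED))

for r in ordered:
    print(r)
```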

#### § Messy 1: Column headers are values, not variable names

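• e.g. the columns are religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k. The income brackets are values of an income variable, not variable names.
• Melt the dataset to get molten (stacked) data: one row per (religion, income) pair, with the count in its own column.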
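
A minimal sketch of melting, in plain Python. The `melt` helper, the religion labels and the counts are all made up for illustration; `income` and `freq` are assumed names for the two new columns.

```python
# Hypothetical wide table: one column per income bracket.
# The counts are placeholders, not figures from the paper.
wide = [
    {"religion": "A", "<$10k": 1, "$10-20k": 2},
    {"religion": "B", "<$10k": 3, "$10-20k": 4},
]

def melt(rows, id_vars, var_name, value_name):
    """Stack every non-id column into (var_name, value_name) pairs."""
    molten = []
    for row in rows:
        for col, val in row.items():
            if col in id_vars:
                continue
            out = {k: row[k] for k in id_vars}
            out[var_name] = col      # the old column header becomes a value
            out[value_name] = val    # the old cell becomes the measurement
            molten.append(out)
    return molten

tidy = melt(wide, id_vars=["religion"], var_name="income", value_name="freq")
for r in tidy:
    print(r)
```

Each wide row with k bracket columns becomes k molten rows, so the 2 x 2 table above yields 4 rows.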

#### § Messy 2: Multiple variables stored in one column

• This often manifests after melting.
• eg. columns are country | year | m014 | m1524 | .. | f014 | f1524...
• The columns represent both sex and age ranges. After melting, we get a single column sexage with entries like m014 or f1524.
• The data is still molten, so we should reshape it before it sets into tidy columnar data. We do this by splitting the column into two, one for age and one for sex.
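
The split can be sketched as below. This assumes the codes always have the form "one letter for sex, then digits for the age range"; `split_sexage` and `AGE_LABELS` are hypothetical names, and the label table only covers the two codes that appear in these notes.

```python
import re

# Assumed readable labels for the age codes seen in these notes.
AGE_LABELS = {"014": "0-14", "1524": "15-24"}

def split_sexage(code):
    """Split a code like 'm014' into ('m', '0-14')."""
    m = re.fullmatch(r"([mf])(\d+)", code)
    if m is None:
        raise ValueError(f"unexpected sexage code: {code!r}")
    sex, age = m.groups()
    # Fall back to the raw digits for codes not in the label table.
    return sex, AGE_LABELS.get(age, age)

# Molten rows; the case counts are made-up placeholders.
molten = [
    {"country": "XX", "year": 2000, "sexage": "m014", "cases": 1},
    {"country": "XX", "year": 2000, "sexage": "f1524", "cases": 2},
]

tidy = []
for row in molten:
    sex, age = split_sexage(row["sexage"])
    tidy.append({"country": row["country"], "year": row["year"],
                 "sex": sex, "age": age, "cases": row["cases"]})

for r in tidy:
    print(r)
```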

#### § Messy 3: Variables are stored in both rows and columns

• Original data:
id      year month element d1 d2 d3 d4 d5 ...
MX17004 2010 1     tmax    — — — — — — — —
MX17004 2010 1     tmin    — — — — — — — —
MX17004 2010 2     tmax    — 27.3 24.1 — — — — —
MX17004 2010 2     tmin    — 14.4 14.4 — — — — —
MX17004 2010 3     tmax    — — — — 32.1 — — —
MX17004 2010 3     tmin    — — — — 14.2 — — —
MX17004 2010 4     tmax    — — — — — — — —
MX17004 2010 4     tmin    — — — — — — — —
MX17004 2010 5     tmax    — — — — — — — —
MX17004 2010 5     tmin    — — — — — — — —

• Some variables are in individual columns (id, year, month)
• Some variables are spread across columns (day is spread as d1–d31)
• Some variables are smeared across rows (eg. tmax/tmin). TODO: what does this mean, really?
• First, tidy by collating into date:
id      date       element value
MX17004 2010-01-30 tmax 27.8
MX17004 2010-01-30 tmin 14.5
MX17004 2010-02-02 tmax 27.3
MX17004 2010-02-02 tmin 14.4
MX17004 2010-02-03 tmax 24.1
MX17004 2010-02-03 tmin 14.4
MX17004 2010-02-11 tmax 29.7
MX17004 2010-02-11 tmin 13.4
MX17004 2010-02-23 tmax 29.9
MX17004 2010-02-23 tmin 10.7
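
A sketch of this first step in plain Python, assuming missing cells are represented as `None`. Only the February rows and the first three day columns of the original table are reproduced; `melt_days` is a hypothetical helper.

```python
from datetime import date

wide = [
    {"id": "MX17004", "year": 2010, "month": 2, "element": "tmax",
     "d1": None, "d2": 27.3, "d3": 24.1},
    {"id": "MX17004", "year": 2010, "month": 2, "element": "tmin",
     "d1": None, "d2": 14.4, "d3": 14.4},
]

def melt_days(rows):
    """Fold year, month and the dN columns into one date column."""
    molten = []
    for row in rows:
        for col, val in row.items():
            if not (col.startswith("d") and col[1:].isdigit()):
                continue  # not a day column
            if val is None:
                continue  # drop missing values
            molten.append({
                "id": row["id"],
                "date": date(row["year"], row["month"], int(col[1:])),
                "element": row["element"],
                "value": val,
            })
    molten.sort(key=lambda r: (r["id"], r["date"], r["element"]))
    return molten

molten = melt_days(wide)
for r in molten:
    print(r["id"], r["date"].isoformat(), r["element"], r["value"])
```

Dropping `None` before building the date also skips days that do not exist in a given month, as long as those cells are empty.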

• The dataset above is still molten. We must reshape along element to get two columns, tmax and tmin. This gives:
id      date       tmax tmin
MX17004 2010-01-30 27.8 14.5
MX17004 2010-02-02 27.3 14.4
MX17004 2010-02-03 24.1 14.4
MX17004 2010-02-11 29.7 13.4
MX17004 2010-02-23 29.9 10.7
MX17004 2010-03-05 32.1 14.2
MX17004 2010-03-10 34.5 16.8
MX17004 2010-03-16 31.1 17.6
MX17004 2010-04-27 36.3 16.7
MX17004 2010-05-27 33.2 18.2
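
The reshape along element can be sketched as the inverse of melting. `cast` is a hypothetical helper written for these notes, and only two dates from the table are used.

```python
molten = [
    {"id": "MX17004", "date": "2010-02-02", "element": "tmax", "value": 27.3},
    {"id": "MX17004", "date": "2010-02-02", "element": "tmin", "value": 14.4},
    {"id": "MX17004", "date": "2010-02-03", "element": "tmax", "value": 24.1},
    {"id": "MX17004", "date": "2010-02-03", "element": "tmin", "value": 14.4},
]

def cast(rows, id_vars, names_from, values_from):
    """Turn the values of one column into column names."""
    table = {}
    for row in rows:
        key = tuple(row[k] for k in id_vars)
        # One output row per distinct combination of id variables.
        out = table.setdefault(key, dict(zip(id_vars, key)))
        out[row[names_from]] = row[values_from]
    return [table[k] for k in sorted(table)]

tidy = cast(molten, id_vars=["id", "date"],
            names_from="element", values_from="value")
for r in tidy:
    print(r)
```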

• Months with fewer than 31 days have structural missing values for the last day(s) of the month.
• The element column is not a variable; it stores the names of variables.

#### § Data manipulation, relationship to dplyr

• mutate() adds new variables that are functions of existing variables.
• select() picks variables based on their names.
• filter() picks cases based on their values.
• summarise() reduces multiple values down to a single summary.
• arrange() changes the ordering of the rows.
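
To make the five verbs concrete, here are rough plain-Python analogues on a list of dicts. These are not dplyr's API, only a sketch of what each verb does to a tidy table; the `trange` column is a made-up derived variable.

```python
from statistics import mean

tidy = [
    {"date": "2010-02-02", "tmax": 27.3, "tmin": 14.4},
    {"date": "2010-02-03", "tmax": 24.1, "tmin": 14.4},
    {"date": "2010-03-05", "tmax": 32.1, "tmin": 14.2},
]

# mutate: add a variable computed from existing ones.
with_range = [{**r, "trange": round(r["tmax"] - r["tmin"], 1)} for r in tidy]

# select: keep variables by name.
selected = [{k: r[k] for k in ("date", "trange")} for r in with_range]

# filter: keep cases by value.
warm = [r for r in with_range if r["tmax"] > 25]

# summarise: reduce many values to one summary.
mean_tmax = mean(r["tmax"] for r in tidy)

# arrange: change the order of the rows.
by_tmax = sorted(tidy, key=lambda r: r["tmax"])

print(selected)
print([r["date"] for r in warm])
print([r["date"] for r in by_tmax])
```

Each verb is a one-liner here because every row is an observation and every key is a variable.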

#### § Visualization

• Most of R's visualization tools expect tidy input by default.
• Base plot, lattice and ggplot all take tidy data.

#### § Modelling

• Most modelling tools work best with tidy datasets.

#### § Questions about performance benchmarking in terms of tidy data

• Should runs of a program at different optimization levels (O1, O2, O3) be stored as separate columns, or as a categorical column called "optimization level", with O1, O2 and O3 stored in separate rows?
• If we go by the tidy rule "Each variable forms a column", then this suggests that optimization level is a variable.
• Then the tidy rule "Each observation forms a row" makes us use rows like [foo.test | opt-level=O1 | runtime=_ ] and [foo.test | opt-level=O2 | runtime=_ ].
• Broader question: what is the tidy rule for a categorical column?
• However, Table 12 of the tidy data paper advocates two columns, tmin and tmax, instead of a column called element with the values tmin and tmax. So it seems to be preferred that, if one has a categorical variable, its values become columns.
• This suggests that I lay out my bench data as [foo.test | O1-runtime=_ | O2-runtime=_ | O3-runtime=_ ].
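
Whichever layout is chosen, the two are mechanically interconvertible, so the choice is not a one-way door. A sketch with made-up runtimes; `to_wide`, `to_long` and the column names `opt_level` / `runtime` are hypothetical.

```python
# Long layout: optimization level is a column, one row per run.
# The runtimes are made-up placeholders.
long_rows = [
    {"test": "foo.test", "opt_level": "O1", "runtime": 3.0},
    {"test": "foo.test", "opt_level": "O2", "runtime": 2.0},
    {"test": "foo.test", "opt_level": "O3", "runtime": 1.5},
]

SUFFIX = "-runtime"

def to_wide(rows):
    """One row per test, one '<level>-runtime' column per level."""
    wide = {}
    for r in rows:
        out = wide.setdefault(r["test"], {"test": r["test"]})
        out[r["opt_level"] + SUFFIX] = r["runtime"]
    return list(wide.values())

def to_long(rows):
    """Inverse of to_wide: one row per (test, level) pair."""
    out = []
    for r in rows:
        for col, val in r.items():
            if col.endswith(SUFFIX):
                out.append({"test": r["test"],
                            "opt_level": col[:-len(SUFFIX)],
                            "runtime": val})
    return out

wide_rows = to_wide(long_rows)
print(wide_rows)
print(to_long(wide_rows))
```

The long layout makes it easy to group or filter by optimization level; the wide layout makes it easy to compare levels for one test side by side.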