Journal of Statistical Software
A huge amount of effort is spent cleaning data to get it ready for analysis, but there
has been little research on how to make data cleaning as easy and effective as possible.
This paper tackles a small, but important, component of data cleaning: data tidying.
Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:
each variable is a column, each observation is a row, and each type of observational unit
is a table. This framework makes it easy to tidy messy datasets because only a small
set of tools is needed to deal with a wide range of untidy datasets. This structure
also makes it easier to develop tidy tools for data analysis, tools that both input and
output tidy datasets. The advantages of a consistent data structure and matching tools
are demonstrated with a case study free from mundane data manipulation chores.
Keywords: data cleaning, data tidying, relational databases, R.
It is often said that 80% of data analysis is spent on the process of cleaning and preparing
the data (Dasu and Johnson 2003). Data preparation is not just a first step, but must be
repeated many times over the course of analysis as new problems come to light or new data is
collected. Despite the amount of time it takes, there has been surprisingly little research
on how to clean data well. Part of the challenge is the breadth of activities it encompasses:
from outlier checking, to date parsing, to missing value imputation. To get a handle on the
problem, this paper focusses on a small, but important, aspect of data cleaning that I call
data tidying: structuring datasets to facilitate analysis.
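As a rough illustration of what tidying means in practice, the following sketch restructures a small table so that each variable is a column and each observation is a row. The table, its values and the column names (person, treatment_a, treatment_b, result) are made up for this example, and the base R approach shown is only one possible way to do the restructuring.

```r
# Hypothetical messy table (values invented for illustration):
# one row per person, one column per treatment.
messy <- data.frame(
  person = c("A", "B", "C"),
  treatment_a = c(NA, 16, 3),
  treatment_b = c(2, 11, 1),
  stringsAsFactors = FALSE
)

# These column headers are really values of a variable (treatment),
# and the cells under them all hold values of one variable (result).
value_cols <- c("treatment_a", "treatment_b")

# Stack the value columns so that each row is one observation:
# a single result for one person under one treatment.
tidy <- data.frame(
  person = rep(messy$person, times = length(value_cols)),
  treatment = rep(sub("^treatment_", "", value_cols), each = nrow(messy)),
  result = unlist(messy[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(tidy)
```

In the restructured table every column is a variable (person, treatment, result) and every row is a single observation, so filtering, summarising or modelling by treatment no longer depends on how the columns happened to be laid out.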
The principles of tidy data provide a standard way to organise data values within a dataset.
A standard makes initial data cleaning easier because you don’t need to start from scratch
and reinvent the wheel every time. The tidy data standard has been designed to facilitate
initial exploration and analysis of the data, and to simplify the development of data analysis
tools that work well together. Current tools often require translation. You have to spend time