15 Data Exploration
15.1 Introduction
Consider the following scenario. Being an expert in jackalope analytics, you are asked (by some really awesome colleagues) to analyze a dataset that you have never seen. You meet with your colleagues; they describe the data, their goals, and their timeline; you send a few appropriate follow-up questions. They answer. All seems good. You jump right into analysis. Two weeks (!) into a complex workflow, you discover two small details they “forgot” to mention: (1) half the data are questionable because, unbeknownst to them, a jackalope infiltrated the field crew and collected data for a month, and (2) several samples were contaminated with jackalope saliva.
Because you skipped the critical data exploration step, you just injected a suite of bad assumptions into everything downstream. The lesson is oddly simple: take your time and be systematic in data exploration. In this course, the goal is not to rush into running statistical models or jump to drawing strong inference, but rather to catch problems early – before they derail your modeling. Even if you collected the data yourself and feel like you know them inside and out, do not assume that you can or should bypass this critical step. Data Demons have great timing and even greater motivation to see you fail, especially near the end of a project or a degree.
This week, we purposefully slow the workflow down and practice exploring datasets so that problems like these show up early – long before they shape your model decisions or waste weeks of work. In the next pages, we will walk through core data exploration steps that should always come before formal modeling.
In simple terms, what is data exploration? Data exploration is where you learn what your data can say – before you ask them (not tell them) to say it.
15.2 Goals for this section (Week 3)
By the end of this week, you should be able to:
- Describe some of the advantages and disadvantages of several types of data exploration approaches (before and after modeling)
- Examine your own dataset for outliers
- Justify the removal of outliers
- Realize the importance of taking your good ol’ sweet time with data exploration
15.4 Core Benefits of Data Exploration
Let us look at the core advantages of a formal data exploration phase. Data exploration helps you determine whether your data, assumptions, and planned analysis are scientifically defensible. Each section below covers a distinct use of the exploration phase.
15.5 Summary: What data exploration is and is not
What is data exploration used for?
Data exploration exists to decide whether your data are fit for an analysis and whether that analysis is defensible. Use this phase to ask:
- Do these data behave like they were measured the way I think they were?
- Are there obvious data biases or limits to precision or accuracy?
- Do variables mean what their names suggest (i.e., do they follow good nomenclature rules)?
- Do groups, measurement units, and temporal and spatial scales make scientific sense?
- Do any of the data already cause model failure?
If you cannot clearly explain what you learned during data exploration, you are not ready to move on to the data modeling phase.
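The questions above can be turned into a quick first-pass check. The following is a minimal sketch in Python using only the standard library, not a prescribed workflow for this course. The toy dataset, its column names (`site`, `collector`, `mass_g`, `date`), and the helper `summarize` are all hypothetical and exist only for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical toy dataset: every column name and value is invented for illustration.
RAW = """site,collector,mass_g,date
A,kim,512.0,2024-05-01
A,kim,498.5,2024-05-02
B,lee,0.51,2024-05-02
B,lee,,2024-05-03
B,lee,505.2,2024-05-04
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def summarize(rows, numeric_col, group_col):
    """Count missing values, report the observed range, and tally group sizes."""
    present = [float(r[numeric_col]) for r in rows if r[numeric_col] != ""]
    return {
        "n_rows": len(rows),
        "n_missing": len(rows) - len(present),
        "min": min(present),
        "max": max(present),
        "group_sizes": dict(Counter(r[group_col] for r in rows)),
    }


rows = load_rows(RAW)
summary = summarize(rows, numeric_col="mass_g", group_col="site")
print(summary)
```

In this invented example, the minimum of `mass_g` is roughly a thousand times smaller than the other values, which is the kind of pattern that might point to a unit mismatch (say, kilograms entered in a grams column). The summary does not tell you what happened; it only tells you which question to take back to the people who collected the data.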
What data exploration is not
Data exploration should not be used for:
- Fishing for significant results (e.g., omitting extreme values to change your model results)
- Generating pretty plots that do not reflect model results
- Running PCA just because there are many variables
- Deleting outliers without understanding why they exist (e.g., ignoring the data-generating process)
- A formality you rush through to get to “the real analysis”
If exploration feels boring or unnecessary, you are either skipping it or not doing it well.
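One way to respect the point about outliers is to flag extreme values for follow-up rather than delete them. The sketch below uses the common 1.5 × IQR convention (Tukey's fences) as an assumption; it is one of several possible rules and not necessarily the one used later in this course. The function `flag_outliers` and the `masses` values are hypothetical.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside [Q1 - k*IQR, Q3 + k*IQR].

    The input list is never modified: values are flagged, not removed.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Hypothetical measurements, invented for illustration.
masses = [498.5, 501.0, 505.2, 509.8, 512.0, 0.51]
flagged = flag_outliers(masses)

for index, value in flagged:
    # Each flagged value is a prompt to investigate, not a decision to delete.
    print(f"Row {index}: value {value} is outside the fences; check how it was recorded")
```

The design choice here is that the function returns positions and values and leaves the original data intact. Whether a flagged value is a recording error, a unit problem, or a real observation is a question about the data-generating process, and it has to be answered before any removal can be justified.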
In the next section, we go over some approaches from a standard data exploration toolkit that will allow you to understand the structure and limits of your dataset.