13 Metadata

13.1 What metadata are (and are not)

Metadata are data about data.

This definition is technically correct, but not very useful on its own. A more practical way to think about metadata is this:

Metadata describe the context in which data were generated.

Metadata record the information needed to interpret measurements correctly. They tell us:

who collected the data
when and where they were collected
how measurements were made
what files belong together
what assumptions, constraints, or limitations existed upstream

Metadata are not bureaucracy. They are scientific memory. Without metadata, data quickly lose meaning—especially once they leave the hands of the person who collected them.

13.2 Why metadata are essential

Raw data without metadata are often difficult—or impossible—to interpret, even by the original researcher. In practice, many irreproducible analyses fail before modeling begins, because key contextual information was never recorded. A column of numbers without units, provenance, or spatial and temporal context is just a sequence of values. It is not reusable scientific data.

At a broad level, good metadata allow others (and future you) to:

understand what the data actually represent, and what their limitations are
judge whether the data are appropriate for a new purpose or analysis
reconstruct and evaluate analytical decisions made downstream

Metadata do not guarantee good science, but their absence almost guarantees future confusion.

13.3 Metadata within a reproducible analytical workflow

Reproducibility is not just about being able to rerun code indefinitely. It also requires being able to answer core questions about the data themselves, such as:

What exactly was measured?
Under what conditions were the measurements taken?
What instruments or protocols were used?
What were the data’s limitations or sources of error?

Metadata provide the necessary bridge between raw data and scientific inference. They explain how measurements came to exist and clarify which kinds of estimands the data can, and cannot, support.

Within the broader analytical workflow, raw data and metadata together form a complete dataset. Complete datasets can be verified, versioned, shared, and reused. All subsequent processing, modeling, and inference depend on this foundation.

To emphasize, metadata are not an afterthought; they are part of the data.

13.4 Types of metadata: what different layers do

In practice, metadata operate at several conceptual layers. Each layer answers a different scientific question and constrains how hypotheses and models can be formulated.

Together:
Structural metadata define what exists.
Measurement metadata define what it represents.
Contextual metadata define what it can be used to infer.

Structural Metadata

What is in the dataset

Structural metadata describe the literal structure of the data file: variable names, units, data types, allowed values, and how missing values are encoded. This layer tells a reader (and software) what each column represents and how it should be interpreted at a basic level.

Structural metadata directly constrain hypotheses and models by determining:

what variables exist
what scales and transformations are meaningful
what distributional assumptions are even plausible

Structural metadata define what questions can be expressed with the data at all.
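
As a minimal sketch of how structural metadata can be made machine-checkable, the example below stores a small data dictionary and uses it to check raw rows. The column names, allowed values, and the `NA` missing-value code are all hypothetical; substitute whatever your own dataset declares.

```python
# Hypothetical data dictionary: one entry per column in the data file.
DATA_DICTIONARY = {
    "site_id": {"type": str, "units": None, "allowed": {"A", "B", "C"}},
    "temp_c": {"type": float, "units": "degrees Celsius", "allowed": None},
    "count": {"type": int, "units": "individuals", "allowed": None},
}
MISSING_CODE = "NA"  # assumed encoding of missing values in the raw file


def check_row(row):
    """Return a list of problems found in one raw row (a dict of strings)."""
    problems = []
    for name, spec in DATA_DICTIONARY.items():
        if name not in row:
            problems.append(f"{name}: column is absent")
            continue
        raw = row[name]
        if raw == MISSING_CODE:
            continue  # a declared missing value is not an error
        try:
            value = spec["type"](raw)  # convert using the declared data type
        except ValueError:
            problems.append(f"{name}: {raw!r} is not a valid {spec['type'].__name__}")
            continue
        if spec["allowed"] is not None and value not in spec["allowed"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
    return problems


good_row = {"site_id": "A", "temp_c": "12.5", "count": "NA"}
bad_row = {"site_id": "Z", "temp_c": "warm", "count": "3"}
print(check_row(good_row))
print(check_row(bad_row))
```

The point is not the checker itself but that the declared types, allowed values, and missing-value code are written down once, where both people and software can read them.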

Measurement Metadata

How values were generated

Measurement metadata describe the process by which values came into existence: instruments or observers, protocols, calibration or standardization, accuracy, resolution, and the role of human judgment versus sensors.

This layer determines whether variables are:

direct measurements or proxies
comparable across space, time, or observers
subject to detection error or measurement error

Measurement metadata shape hypotheses by clarifying what the variables actually represent in the real world, and they constrain model choice by determining whether additional structure (e.g., detection submodels) is required.

Measurement metadata define what your variables mean.

Contextual Metadata

Why the dataset looks the way it does

Contextual metadata describe the study design and constraints that shaped the dataset: sampling units, spatial and temporal scope, missingness mechanisms, access limitations, and known biases.

This layer determines:

which comparisons are valid
whether observations are independent or repeated
whether certain hypotheses are testable at all

Contextual metadata are where limitations become explicit and where interpretive boundaries are drawn.

Contextual metadata define what inferences the data can support.

13.5 What good metadata enable (FAIR, in plain language)

Much of the modern emphasis on metadata comes from the FAIR principles (Wilkinson et al. 2016). At their core, these principles recognize that data are only useful if their context travels with them, and metadata are what make this possible. The FAIR principles aim to ensure that scientific data are:

Findable: Metadata help other people find that the data exist.
Accessible: Metadata explain how to get the data and what the files mean.
Interoperable: Metadata make it clear how the data are formatted and measured, so they work with other data.
Reusable: Metadata explain how the data were collected and what their limits are, so others can use them correctly.

Crucially, FAIR is not about ensuring perfection; it is about increasing the probability of future usability.

Note

You are not expected to memorize the FAIR acronym. What matters most is understanding what FAIR is trying to do (specifically, what it is trying to protect)!

13.6 A low-friction approach to metadata

You have probably encountered the long list of formal metadata standards used across scientific disciplines. For this course, you do not need to master any of them, and, in fact, you barely need to know them at all.

Our approach to metadata is intentionally low friction. That means it fits naturally into how you (and others) already work: without extra tools or unnecessary steps, and without specialized expertise. The goal is not blind compliance with reporting standards; first and foremost, it is scientific clarity. We therefore adopt a simple guiding principle:

Let repositories handle standardization. Your job is to record clear, honest, human-readable metadata.

Repositories such as Dataverse, Zenodo, Dryad, GBIF, and institutional archives are designed to translate user-supplied documentation into formal metadata schemas. That translation only works, however, if the essential information about your data exists in the first place. What matters most at this stage is clarity and completeness, not adherence to a specific (and often over-specified) standard.

13.7 What “good enough” metadata must accomplish

At a minimum, metadata should allow another scientifically literate person—someone outside your project—to answer the following questions:

What are these data?
Who created them?
When and where were they collected?
How were the measurements made?
What files belong together, and how?

If these questions cannot be answered from the metadata alone, the dataset is not reusable—no matter how sophisticated the analysis or modeling may be.
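
One way to make this checklist concrete is to tie each question to a metadata field and report which questions remain unanswered. The sketch below assumes hypothetical field names (`description`, `creators`, and so on) and a made-up draft record; it is an illustration of the idea, not a required schema.

```python
# Hypothetical field names; each one maps to a question in the checklist above.
REQUIRED_FIELDS = {
    "description": "What are these data?",
    "creators": "Who created them?",
    "temporal_coverage": "When were they collected?",
    "spatial_coverage": "Where were they collected?",
    "methods": "How were the measurements made?",
    "files": "What files belong together, and how?",
}


def unanswered_questions(metadata):
    """Return the questions whose field is absent or empty."""
    return [
        question
        for field, question in REQUIRED_FIELDS.items()
        if not metadata.get(field)
    ]


# A made-up draft record with some gaps left in it.
draft = {
    "description": "Hourly stream temperature at three sites",
    "creators": ["A. Researcher"],
    "temporal_coverage": "2023-05-01/2023-09-30",
    "methods": "",
}
for question in unanswered_questions(draft):
    print("Unanswered:", question)
```

A check like this only confirms that something was written for each question. Whether the answer is clear to an outsider still needs a human reader.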

13.8 Practical formats for early-stage metadata

In everyday research workflows, metadata often begin life in simple, familiar formats such as:

Google Sheet (good for coordinating field or lab work)
plain-text or Markdown file (good for small or exploratory projects)

These formats are perfectly acceptable as starting points. They lower the barrier to documentation and encourage metadata to be written early in the analysis life-cycle rather than postponed. However, these formats are usually best treated as temporary representations. As projects grow, metadata need to become more structured, more explicit, and easier to validate and reuse. That said, if you are more comfortable beginning your journey on a spreadsheet platform that keeps a version history (like Google Sheets), you are welcome to do so.
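
Moving from a spreadsheet to something more structured can be a small step. The sketch below assumes a hypothetical data dictionary kept in a sheet with `variable`, `units`, and `description` columns, exported as CSV (inlined here as a string so the example is self-contained), and reads it into one record per variable.

```python
import csv
import io

# Stand-in for a CSV export of a hypothetical spreadsheet data dictionary.
SHEET_EXPORT = """variable,units,description
temp_c,degrees Celsius,Water temperature at the logger
depth_m,metres,Logger depth below the surface
notes,,Free-text field notes
"""


def load_data_dictionary(csv_text):
    """Turn spreadsheet rows into one structured record per variable."""
    records = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        records[row["variable"]] = {
            "units": row["units"] or None,  # an empty cell means "no units"
            "description": row["description"],
        }
    return records


dictionary = load_data_dictionary(SHEET_EXPORT)
undocumented = [name for name, rec in dictionary.items() if not rec["description"]]
print(dictionary["temp_c"])
print("Variables missing a description:", undocumented)
```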

13.9 Why we use YAML

To support that transition from your mind (or Google Sheets), this course adopts YAML as the canonical format for project-level metadata.

YAML: What’s in a name?

Originally, YAML was an abbreviation for “Yet Another Markup Language”. Later, it became “YAML Ain’t Markup Language”.

What the heck is canonical in this context?

Canonical simply means that there is exactly one place where your dataset is formally defined; it does not mean that your metadata must exist as exactly one file (though they could).

YAML has several advantages. It is:

human-readable and easy to edit
structured and explicit about relationships
easy to version-control
simple to validate and extend (e.g., Quarto markdown files already use YAML headers)
straightforward to convert into repository-specific schemas later

Most importantly, YAML encourages you to think carefully about what varies, what stays constant, and how different pieces of a dataset relate to one another.
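
To give a feel for what such a file looks like, the sketch below builds a small metadata record and prints it as YAML. The field names and values are hypothetical, and `to_yaml` is a deliberately tiny helper written for this illustration: it handles only nested dictionaries, lists, and simple scalars, with no quoting or escaping. In practice a YAML library (for example PyYAML's `yaml.safe_dump`) would do this job.

```python
def to_yaml(value, indent=0):
    """Emit a small YAML subset as a list of lines (illustration only)."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.extend(to_yaml(item, indent + 1))
            else:
                lines.append(f"{pad}{key}: {item}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, dict):
                nested = to_yaml(item, indent + 1)
                # The first key of a mapping shares its line with the dash.
                lines.append(f"{pad}- {nested[0].strip()}")
                lines.extend(nested[1:])
            else:
                lines.append(f"{pad}- {item}")
    return lines


# Hypothetical project-level metadata.
metadata = {
    "dataset": {
        "title": "Stream temperature survey",
        "license": "CC-BY-4.0",
    },
    "instruments": [
        {"id": "logger_01", "resolution_c": 0.1},
        {"id": "logger_02", "resolution_c": 0.1},
    ],
}
print("\n".join(to_yaml(metadata)))
```

The printed result places the dataset-wide fields under a single `dataset` key and lists each instrument as its own entry, so what is shared and what differs is visible at a glance.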

In the next section, we will walk through the structure of a well-designed YAML metadata file, using concrete examples and expandable templates to show how common research scenarios (such as multiple instruments, deployments, or sampling rates) can be documented clearly and correctly.

13.10 Levels of metadata: where information belongs (and where it does not)

When you document metadata, your main job is to put information at the right level of scope. This keeps metadata clear, avoids duplication, and prevents the most common failure mode: collapsing variation.

A useful way to think about this is a hierarchy of levels, from broad context to individual files.

Dataset-level metadata (the whole project)

This level answers: What is this dataset, broadly, and how should it be cited, discovered, and reused?

What belongs at this level

title, description, and keywords
creators, affiliations, persistent identifiers (e.g., ORCID)
overall spatial and temporal scope (broad bounds, if applicable)
license and usage rights
related identifiers (publications, code repositories, DOIs)
high-level description of how the data were generated
provenance summary (e.g., “data collection + processing workflow”)

What does not belong at this level

file-specific settings or parameters
values that vary across observations or files
derived results (means, medians, model outputs)
vague summaries such as “most values were…”

Instrument- or system-level metadata (the measurement system)

This level answers: What system, instrument, or process produced the measurements, and what are its stable properties?

What belongs at this level

system or instrument type, manufacturer, and model
serial number, asset ID, or logical identifier
software or firmware version (if relevant)
properties that are stable across use (e.g., resolution, precision, bit depth)

What does not belong at this level

settings that vary across observations unless explicitly declared as defaults
time- or location-specific information

Deployment- or configuration-level metadata (a consistent setup)

This level answers: When and under what conditions was the system used with a consistent configuration?

What belongs at this level

configuration or deployment identifier
contextual identifiers (e.g., site, batch, experiment, run)
start and end dates or times
parameters that were constant during this configuration
calibration notes, protocols, or procedural references

What does not belong at this level

individual file or observation exceptions unless explicitly mapped
values that vary within the configuration without structure

File- or observation-level metadata (individual data units)

This level answers: What is true about this specific file, record, or observation?

What belongs at this level

filename or observation identifier
date and time of acquisition or creation
parameter values that vary (e.g., sampling rate, resolution, settings)
contextual attributes that differ across observations
location information, if variable

What does not belong at this level

dataset-wide descriptions
interpretive judgments (e.g., “high quality”) unless defined by a controlled scheme
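
The four levels described in this section can be sketched as a nested structure in which every value lives at exactly one level, together with a lookup that walks from the most specific level outward. All identifiers, filenames, and values below are hypothetical, and the layout is one possible arrangement rather than a prescribed schema.

```python
# Hypothetical project metadata, nested from broad context to individual files.
PROJECT = {
    "dataset": {"title": "Acoustic monitoring pilot", "license": "CC-BY-4.0"},
    "instruments": {
        "rec_01": {"model": "example recorder", "bit_depth": 16},
    },
    "deployments": {
        "dep_A": {
            "instrument": "rec_01",
            "site": "north_pond",
            "sampling_rate_hz": 48000,  # constant for the whole deployment
        },
    },
    "files": {
        "dep_A_0001.wav": {"deployment": "dep_A", "gain_db": 12},
        "dep_A_0002.wav": {"deployment": "dep_A", "gain_db": 18},  # varies by file
    },
}


def resolve(filename, key):
    """Look up `key` for one file, from the most specific level outward."""
    file_meta = PROJECT["files"][filename]
    deployment = PROJECT["deployments"][file_meta["deployment"]]
    instrument = PROJECT["instruments"][deployment["instrument"]]
    for level in (file_meta, deployment, instrument, PROJECT["dataset"]):
        if key in level:
            return level[key]
    raise KeyError(f"{key!r} is not recorded at any level for {filename}")


print(resolve("dep_A_0002.wav", "gain_db"))           # stored with the file
print(resolve("dep_A_0002.wav", "sampling_rate_hz"))  # stored with the deployment
print(resolve("dep_A_0002.wav", "bit_depth"))         # stored with the instrument
```

Because each file points to its deployment and each deployment points to its instrument, nothing needs to be repeated, yet every file can still be traced back to its full context.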

Two rules that prevent most metadata mistakes

1. Place metadata at the highest level where they are constant. If a sampling rate never changes across a deployment, store it at the deployment level (or at the instrument level, if it is fixed for the instrument itself).
2. If a metadata value varies, do not summarize it; map it explicitly by adding another metadata entry. Never write “sampling_rate_hz: 48000” plus “some files differ.” Instead, record which files have which values.
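
Both rules can be applied mechanically. The sketch below assumes you start with the settings observed for each file in one deployment (the filenames and parameters are hypothetical) and splits them into values that are constant, which are stored once, and values that vary, which are mapped file by file.

```python
def place_parameters(file_settings):
    """Split per-file settings into shared constants and a per-file map.

    `file_settings` maps each filename to a dict of parameter values.
    """
    names = set()
    for settings in file_settings.values():
        names.update(settings)

    constants = {}
    per_file = {filename: {} for filename in file_settings}
    for name in sorted(names):
        values = [settings.get(name) for settings in file_settings.values()]
        if all(value == values[0] for value in values):
            constants[name] = values[0]  # rule 1: constant, so store it once
        else:
            for filename, settings in file_settings.items():
                # rule 2: the value varies, so record it for every file
                per_file[filename][name] = settings.get(name)
    return constants, per_file


# Hypothetical settings observed for three files from one deployment.
observed = {
    "a.wav": {"sampling_rate_hz": 48000, "gain_db": 12},
    "b.wav": {"sampling_rate_hz": 48000, "gain_db": 18},
    "c.wav": {"sampling_rate_hz": 48000, "gain_db": 12},
}
deployment_level, file_level = place_parameters(observed)
print(deployment_level)
print(file_level)
```

Here the sampling rate moves up to the deployment level because it never changes, while the gain stays with each file because it does. A parameter that is missing for some files is treated as varying and recorded as `None` for those files, which keeps the gap visible instead of hiding it.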

13.11 Wrapping this up

Clear and clean metadata describe what your data are and how they were generated. Just as importantly, the names you give to files, variables, and folders determine whether that information remains interpretable as projects grow. In the next section, we turn to the topic of naming conventions and coding style: small and seemingly trivial decisions that play an outsized role in scientific reproducibility. Let’s proceed!