1 Motivation
Researchers grapple with many obstacles in their everyday work; one of them is the availability of research data, for instance from other labs, or even from their own. Publicly available data facilitates re-analysis by other researchers: it allows for the generation of new hypotheses the original authors possibly never had in mind, for test-driving new or different analytical methods on existing real-world data, or for estimating an effect size the original authors did not report, perhaps because it was not their focal outcome.
Although many journals require authors to make available the data accompanying a publication, this good intention is often derailed by a sentence at the end of the publication reading something like Data can be obtained from the authors at reasonable request. This has two consequences: (a) it is left to the authors’ discretion to decide which requests are reasonable, and (b) it requires the authors to keep the data available, that is, to provide storage space for years to come and to keep the data visible and accessible to interested parties. Over the years, several organizations have striven to provide and publicize publicly accessible storage space for scientific data to take this burden away from labs1. Many journals nowadays encourage authors to publish their data in such repositories, or even make it mandatory. With the growing number of public repositories, it is easier than ever before for authors to share their data.
1 Here is but a small selection of research data repositories (with country of origin) for our field: Göttingen Research Online.Data (D), Zenodo (CH), Open Science Framework (US)
Re-analyzing somebody else’s data requires knowledge about the data structure: What was the research question? Which experimental conditions were tested? Which population were the participants sampled from, and how? And so on. Therefore, it does not suffice to publish one’s data; the data also needs to be annotated well enough that someone else can re-run the analyses and arrive at the same results as published in the corresponding paper. Nowadays, most analyses are carried out by software, so for others to be able to re-run our calculations we need to make clear which programs were used, and in which versions. If the analysis involved self-penned code, the commented code needs to be included in the data repository, along with the name and version of the programming language used.
We fully support this idea of sharing data with the public. Therefore, we require you to store your research data at the IMMM so that future researchers can easily understand and re-use them. Data stored in this structured manner can then easily be managed in a searchable database, both for in-house and external use. To help you with this task, this document walks you through an example to give you an idea of how to achieve this feat.
2 Organizing Principles
In the following, we show possible ways to organize your data in a fashion fitting the ideas developed above. The examples shown are mere suggestions, but ones that have proven useful for many people over the years (Spreckelsen et al., 2020). Regardless of the final structure you choose, please ensure that it is clear and comprehensible to outsiders. Additionally, avoid uncritically imitating the practices of your colleagues. Always remember: Your results need to be reproducible by someone else without your help—because you will most likely not be around.
The main principle is to organize your project’s data and files in one single parent folder with a ‘speaking name’ that makes it unique and somehow describes your project. A good name may be an acronym under which the project is known at the IMMM; a short descriptive name would, of course, also do. Hint: “Dystonia” is not a good name for a project, as we have a lot of projects where Dystonia is involved in one way or another. Also, ‘My_dissertation’ is not a good project name for obvious reasons. The title should, as already mentioned, be unique and explanatory.
Once your project folder gets cluttered with too many files consider placing files belonging together in a few sub folders which again should have speaking names. In the parent folder, those sub folders might subsume files into categories, like Planning, Methods, Data, etc. (Figure 1).
Please keep in mind that too many sub folders will again make a folder look cluttered. We therefore suggest keeping the number of items in a sub folder below eight (Miller, 1956).
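If you like, such a skeleton can even be created programmatically. The following sketch (in Python; the project name ASRDME and the category names are merely examples taken from this document) sets up a parent folder with a few numbered sub folders:

```python
from pathlib import Path

# Hypothetical project name and categories -- adapt them to your project.
project = Path("ASRDME")
subfolders = ["01_Research_plan", "02_Methods", "03_Data", "04_Analyses"]

for name in subfolders:
    # parents=True creates the project folder on first use;
    # exist_ok=True makes the script safe to re-run.
    (project / name).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in project.iterdir()))
```

Because the numbering is part of the folder names, any file manager will list the categories in the intended reading order.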
3 Example
All project-related files should be organized within a single project folder, ideally named after the project itself, to facilitate easy recognition and identification. The following paragraphs present suggestions as to what should (at least) be contained in this project folder, and how things could be organized therein. Every project is different, as are the researchers conducting them, so it is not surprising that each project manages its data in a more or less idiosyncratic way; there are no strict guidelines on how to approach it. Barring a couple of exceptions noted below, you just need to use your common sense, combined with the ideas introduced here, and everything will be fine.
3.1 Top-level
It is generally deemed good practice to include a README file on the first level of your project folder that concisely describes the project. You might also want to include in it the institution where the project was located, and the researchers involved, along with their contact details (Figure 2).
The README file is preferably a plain text file to make it universally readable, and should have the suffix .md (for Markdown) to allow for easy formatting (.txt is fine, too). In the example, this file is called ‘00_ASRDME_A_Stupid_Research_Data_Management_Example.md’, bearing in its name the project’s title, ASRDME, and the acronym’s meaning.
Most of the remaining files of the example project are stored in several folders to organize stuff according to relevant topics.
The README file provides a quick glimpse of a project’s objective(s).
3.2 Research Plan
A research project starts with a planning phase; therefore, a sub folder named 01_Research_plan was created to accommodate documents detailing the research question(s), hypotheses, population of interest, experimental design, sampling technique, randomization procedure, a list of pertinent literature, the estimated effect size(s) (between groups and/or conditions), and the theoretical and/or practical reasoning behind said effect size(s). It might be a good idea to further divide the contents of 01_Research_plan to avoid clutter and organize the files. In the example, 01_Research_plan contains the sub folders 01_Ethics and 02_Forms_Questionnaires.
As shown in Figure 3, in some cases it might make more sense to prefix file names with the date they were generated on instead of numbering them in ascending order2. In the case of non-binary files, instead of saving changed files with names differing only by date, using a version control system like git is a good way to avoid cluttering a folder with too many similarly named files3. Actually, sub folders below level one, and file names in general, can go without a leading number (or date) if the order one reads them in is not of particular importance.
2 Whenever calendar dates are used, make sure to use the ISO 8601 date format YYYY-MM-DD, as in 2025-12-31, because this way the date is internationally unambiguous, and computer file managing software shows files with such dates as leading part of their names in chronological (ascending) order.
3 git preserves the changes of a file it controls with time stamps so that their state at any point in time can be reconstructed.
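The chronological-sorting property of ISO 8601 dates mentioned in footnote 2 is easy to verify; here is a small Python sketch (the file names are invented):

```python
from datetime import date

# Build a date-stamped file name (ISO 8601, YYYY-MM-DD) so that plain
# alphabetical sorting equals chronological sorting.
stamp = date(2025, 12, 31).isoformat()          # -> "2025-12-31"
filename = f"{stamp}_Ethics_approval.pdf"       # hypothetical document name

names = [
    "2025-12-31_Ethics_approval.pdf",
    "2024-02-01_Ethics_application.pdf",
    "2025-01-15_Ethics_amendment.pdf",
]
# Lexicographic sort == chronological order for ISO 8601 date prefixes.
print(sorted(names))
```

With a format like DD.MM.YYYY, by contrast, alphabetical and chronological order would diverge.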
Things that were decided upon before data acquisition started belong in the Research Plan folder.
3.3 Methods
The folder 02_Methods encloses all the methodological aspects relevant to the project that have not been dealt with before. Likely contents for the example project may include considerations regarding treatment options, as well as details on data acquisition and the planned analysis strategy (Figure 4).
Data acquisition refers to anything concerning data recording devices, although, technically, obtaining data using questionnaires is also data acquisition. Here, a list of the devices used to measure physiological and behavioral data, including manufacturer names, device names, model identifiers, serial numbers, and possibly firmware versions, is key for others to replicate your experiments. If recording software is used, its vendor, name, and version should also be noted. Schematic diagrams of device wiring and data flow can be very helpful in clearly illustrating how data was collected (Figure 5).
The diagram in Figure 5 is defined by the following Mermaid source:

```mermaid
block-beta
  columns 5
  monitor("Monitor") space space space space
  stimpc["Stimulus<br>PC"] space space space space
  space space space space space
  finger["👆🏼"] space tapp["Tapping<br>circuitry"] space space
  head((("EEG cap"))) space amp["EEG amp"] space pc["Data<br>recording<br>PC"]
  monitor --> stimpc
  monitor --> pc
  finger --> tapp
  tapp --> pc
  head --> amp
  amp --> pc
  style finger fill:#fff,stroke-width:1px
```
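Such a device list can also be kept in machine-readable form. The following Python sketch writes a small inventory as CSV; all entries are invented placeholders, not actual IMMM equipment:

```python
import csv
from io import StringIO

# Hypothetical device inventory -- replace with your actual equipment.
devices = [
    {"device": "EEG amplifier", "manufacturer": "ExampleCorp",
     "model": "EC-64", "serial": "SN-0042", "firmware": "1.3.7"},
    {"device": "Tapping circuitry", "manufacturer": "In-house",
     "model": "rev. B", "serial": "n/a", "firmware": "n/a"},
]

buffer = StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(devices[0].keys()))
writer.writeheader()   # first line: column names
writer.writerows(devices)
print(buffer.getvalue())
```

A plain CSV file like this is readable by humans and by virtually any analysis software, which fits the open-format recommendation below.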
As a rule of thumb, the Methods folder must contain all the information needed to reproduce the results. Statistically speaking, all the known aspects possibly influencing the data generating process need to be included.
In the Methods folder, include everything that—deliberately or inadvertently—may have influenced the data generating process and is not already in the Research Plan folder.
3.4 Data
In the course of a scientific research project, a lot of data is collected. It is the researchers’ duty to ensure that this data is stored securely, protecting it from tampering or accidental deletion, and to guarantee that access is restricted to authorized individuals only. Since this is quite the feat, this task is usually handed over to qualified personnel operating a tailored infrastructure. Although this takes away one burden, the responsibility to store and comment their research data remains with the researchers.
The data folder must at least contain the raw data (Figure 6). If the data is sufficiently commented and described, and the code to wrangle and/or clean it and to run the analyses on it is enclosed in the project folder, the requirements are fulfilled. But if some of these requirements are not met, or for whatever reason not adequately so, rendering the project irreproducible, intermediary data must be included in the data folder, and the missing intermediate steps need to be described as thoroughly as possible.
The data folder must at least contain the raw data! The better the documentation the less additional data needs to be stored for a project.
Most of the data researchers collect is sooner or later stored electronically. To that end, software is used: either self-penned, open source, or commercially available. In-house developed software (including source code) must be included in the project repository. Open-source software needs to be named (including the version used) along with the website of origin. Commercial software, however, must not be shared. This legal restriction on sharing is one reason why data should not (only) be stored in proprietary file formats but also in universally readable formats such as comma-separated value (CSV) files or plain text files. For this purpose, the protocol describing the exact procedure to export the data from the software must be included in the repository.
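As an illustration of keeping a universally readable copy alongside a proprietary file, here is a small Python sketch that writes tabular measurements to a plain CSV file; the column names and values are invented:

```python
import csv

# Hypothetical measurements, e.g. exported from a proprietary
# acquisition file -- placeholders, not real project data.
rows = [
    {"subject": "S01", "condition": "A", "reaction_time_ms": 412},
    {"subject": "S01", "condition": "B", "reaction_time_ms": 389},
]

with open("measurements.csv", "w", newline="") as fh:
    writer = csv.DictWriter(
        fh, fieldnames=["subject", "condition", "reaction_time_ms"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

The resulting file can be opened decades from now with nothing more than a text editor.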
Data must not (only) be stored in proprietary file formats.
3.5 Analyses
Any deliberate action changing the raw data in any way must be considered ‘analysis’: wrangling and cleansing, for instance, undoubtedly change the data, and do so with a certain goal in mind (Figure 7). This goal most often is to facilitate the main analysis, but such changes also influence the outcome of the main analyses. For example, one part of the data cleaning process is often the removal of outliers, which certainly influences the outcome; otherwise there would be no need to consider removing them. The outliers are thus removed in the hope of arriving at a ‘cleaner’ outcome. This reasoning underlines the obligation to include the raw data, which necessarily includes the outliers, in the Data folder, because future scientists may have differing opinions on how to remove outliers4, and which ones; we always need to remember that science evolves and, accordingly, methods change over time5.
4 Or whether to remove them at all.
5 In other words: today’s commonly accepted methods will not necessarily stand the test of time.
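To make a cleaning step like outlier removal transparent and reproducible, it helps to keep it as a small, commented script that leaves the raw data untouched. A minimal Python sketch follows; the data and the threshold are invented for illustration:

```python
# Raw data stays untouched; the cleaning rule lives in code, so future
# researchers can inspect, rerun, or replace it with their own criterion.
raw = [402, 415, 398, 2950, 407, 391]   # hypothetical reaction times (ms)

def remove_outliers(values, n_sd=2.0):
    """Drop values more than n_sd standard deviations from the mean.

    The 2-SD cutoff is an arbitrary example, not a recommendation.
    """
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [v for v in values if abs(v - mean) <= n_sd * sd]

clean = remove_outliers(raw)
print(clean)
```

Because the rule is code rather than a manual edit, anyone re-analyzing the data can apply a different criterion to the unchanged raw values.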
The most important part of the Analyses folder is, of course, the code to build and test statistical models. Should anyone still be using a point-and-click statistics package such as SPSS, they need to store the syntax files here.
The Analyses folder includes every bit of code that changes the raw data in any way, including any aggregating and modeling routines.
3.6 Publications
As publicly funded researchers we are obliged to make our research findings public. For the sake of transparency it is advisable to include publication drafts (Figure 8) in a project’s data repository, ideally versioned with a versioning tool like git.
The Publication folder is not part of the to-be-shared data set, but it is good practice to leave a paper trail of the publications you produced at the IMMM for future scientists at the institute.
4 Conclusions
Research data management is a prerequisite to share your data with fellow and future scientists, enabling them to build on your contribution to humanity’s knowledge.
To make data sharing efficient, data should be stored following the FAIR principle: they are findable and accessible in a public repository, avoiding proprietary file formats makes the data interoperable, and well-documented data is reusable by others (Wilkinson et al., 2016).
Researchers can help make their data FAIR by following the guidelines given in this report. Before the data is finally shared with the public, meta-data will be added to make it easily findable by humans and machines. Both this and the actual storage in a publicly accessible data repository will be taken care of for you6.
6 Data will be published with its own DOI after your paper has been published.