1 Motivation
Researchers grapple with many obstacles in their everyday work; one of them is the availability of research data, for instance from other labs, or even from their own. Publicly available data facilitates re-analysis by other researchers: it allows for the generation of new hypotheses the original authors possibly never had in mind, for test-driving new or different analytical methods on existing real-world data, or for estimating an effect size the original authors did not report, perhaps because it was not their focal outcome.
Although many journals require authors to make available the data accompanying a publication, this good intention is often derailed by a sentence at the end of the publication reading something like Data can be obtained from the authors at reasonable request. This has two consequences: (a) it is left to the authors’ discretion to decide which requests are reasonable, and (b) it requires the authors to keep the data available, that is, to provide storage space for years to come and to keep the data visible and accessible to interested parties. Over the years, several organizations have striven to provide and publicize publicly accessible storage space for scientific data to take this burden away from labs1. Many journals nowadays encourage authors to publish their data in such repositories, or even make it mandatory. With the growing number of public repositories, it is easier than ever before for authors to share their data.
1 Here is but a small selection of research data repositories (with country of origin) for our field: Göttingen Research Online.Data (D), Zenodo (CH), Open Science Framework (US)
Re-analyzing somebody else’s data requires knowledge about the data structure: What was the research question? Which experimental conditions were tested? Which population were the participants sampled from, and how? And so on. Therefore, it does not suffice to publish one’s data; the data also needs to be annotated well enough that someone else can re-run the analyses and arrive at the same results as published in the corresponding paper. Nowadays, most analyses are carried out by software, so for others to be able to re-run our calculations we need to make clear which programs were used, and in which versions. If the analysis involved self-penned code, the commented code needs to be included in the data repository, along with the name and version of the programming language used.
We fully support this idea of sharing data with the public. Therefore, we require you to store your research data at the IMMM so that future researchers can easily understand and re-use them. Data stored in this structured manner can then easily be managed in a searchable database, both for in-house and external use. To help you with this task, this document walks you through an example to give you an idea of how to achieve this feat.
2 Organizing Principles
In the following, we show possible ways to organize your data in a fashion fitting the ideas developed above. The examples shown are mere suggestions, but ones that have proven useful for many people over the years (Spreckelsen et al., 2020). Regardless of the final structure you choose, please ensure that it is clear and comprehensible to outsiders. Additionally, avoid uncritically imitating the practices of your colleagues. Always remember: Your results need to be reproducible by someone else without your help—because you will most likely not be around.
The main principle is to organize your project’s data and files in one single parent folder with a ‘speaking name’ that makes it unique and somehow describes your project. A good name may be an acronym under which the project is known at the IMMM; a short descriptive name would, of course, also do. Hint: “Dystonia” is not a good name for a project, as we have a lot of projects where Dystonia is involved in one way or another. Also, ‘My_dissertation’ is not a good project name for obvious reasons. The title should, as already mentioned, be unique and explanatory.
Once your project folder gets cluttered with too many files consider placing files belonging together in a few sub folders which again should have speaking names. In the parent folder, those sub folders might subsume files into categories, like Planning, Methods, Data, etc. (Figure 1).
Please keep in mind that too many sub folders will again make a folder look cluttered. We therefore suggest keeping the number of items in a sub folder below eight (Miller, 1956).
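If you like, such a skeleton can even be created programmatically. The following sketch (in Python; the project name ASRDME and the category names are merely examples taken from this document) sets up a parent folder with a few numbered sub folders:

```python
from pathlib import Path

# Hypothetical project name and categories -- adapt them to your project.
project = Path("ASRDME")
subfolders = ["01_Research_plan", "02_Methods", "03_Data", "04_Analyses"]

for name in subfolders:
    # parents=True creates the project folder on first use;
    # exist_ok=True makes the script safe to re-run.
    (project / name).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in project.iterdir()))
```

Because the numbering is part of the folder names, any file manager will list the categories in the intended reading order.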
3 Example
All project-related files should be organized within a single project folder, ideally named after the project itself, to facilitate easy recognition and identification. The following paragraphs present suggestions as to what should (at least) be contained in this project folder, and how things could be organized therein. Every project is different, as are the researchers conducting them, so it is not surprising that each project manages its data in a more or less idiosyncratic way; there are no strict guidelines on how to approach it. Barring a couple of exceptions noted below, you just need to use your common sense, combined with the ideas introduced here, and everything will be fine.
3.1 Top-level
It is generally deemed good practice to include a README file on the first level of your project folder that concisely describes the project. You might also want to include in it the institution where the project was located, and the researchers involved, along with their contact details (Figure 2).
The README file is preferably a plain text file to make it universally readable, and should have the suffix .md (for Markdown) to allow for easy formatting (.txt is fine, too). In the example, this file is called ‘00_ASRDME_A_Stupid_Research_Data_Management_Example.md’, bearing in its name the project’s title, ASRDME, and the acronym’s meaning.
Most of the remaining files of the example project are stored in several folders to organize stuff according to relevant topics.
The README file provides a quick glimpse of a project’s objective(s).
3.2 Research Plan
A research project starts with a planning phase; therefore, a sub folder named 01_Research_plan was created to accommodate documents detailing the research question(s), hypotheses, population of interest, experimental design, sampling technique, randomization procedure, a list of pertinent literature, the estimated effect size(s) (between groups and/or conditions), and the theoretical and/or practical reasoning behind said effect size(s). It might be a good idea to further divide the contents of 01_Research_plan to avoid clutter and organize the files. In the example, 01_Research_plan contains the sub folders 01_Ethics and 02_Forms_Questionnaires.
As shown in Figure 3, in some cases it might make more sense to prefix file names with the date they were generated on instead of numbering them in ascending order2. In the case of non-binary files, instead of saving changed files with names differing only by date, using a version control system like git is a good way to avoid cluttering a folder with too many similarly named files3. Actually, sub folders below level one, and file names in general, can go without a leading number (or date) if the order one reads them in is not of particular importance.
2 Whenever calendar dates are used, make sure to use the ISO 8601 date format YYYY-MM-DD, as in 2025-12-31, because this way the date is internationally unambiguous, and computer file managing software shows files with such dates as leading part of their names in chronological (ascending) order.
3 git preserves the changes of a file it controls with time stamps so that their state at any point in time can be reconstructed.
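The chronological-sorting property of ISO 8601 dates mentioned in footnote 2 is easy to verify; here is a small Python sketch (the file names are invented):

```python
from datetime import date

# Build a date-stamped file name (ISO 8601, YYYY-MM-DD) so that plain
# alphabetical sorting equals chronological sorting.
stamp = date(2025, 12, 31).isoformat()          # -> "2025-12-31"
filename = f"{stamp}_Ethics_approval.pdf"       # hypothetical document name

names = [
    "2025-12-31_Ethics_approval.pdf",
    "2024-02-01_Ethics_application.pdf",
    "2025-01-15_Ethics_amendment.pdf",
]
# Lexicographic sort == chronological order for ISO 8601 date prefixes.
print(sorted(names))
```

With a format like DD.MM.YYYY, by contrast, alphabetical and chronological order would diverge.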
Things that were decided upon before data acquisition started belong in the Research Plan folder.
3.3 Methods
The folder 02_Methods encloses all the methodological aspects relevant to the project that have not been dealt with before. Likely contents for the example project may include considerations regarding treatment options, as well as details on data acquisition and the planned analysis strategy (Figure 4).
Data acquisition refers to anything concerning data recording devices, although, technically, obtaining data using questionnaires is also data acquisition. Here, a list of the devices used to measure physiological and behavioral data, including manufacturer names, device names, model identifiers, serial numbers, and possibly firmware versions, is key for others to replicate your experiments. If recording software is used, its vendor, name, and version should also be noted. Schematic diagrams of device wiring and data flow can be very helpful in clearly illustrating how data was collected (Figure 5).
The diagram in Figure 5 is defined by the following Mermaid source:

```mermaid
block-beta
  columns 5
  monitor("Monitor") space space space space
  stimpc["Stimulus<br>PC"] space space space space
  space space space space space
  finger["👆🏼"] space tapp["Tapping<br>circuitry"] space space
  head((("EEG cap"))) space amp["EEG amp"] space pc["Data<br>recording<br>PC"]
  monitor --> stimpc
  monitor --> pc
  finger --> tapp
  tapp --> pc
  head --> amp
  amp --> pc
  style finger fill:#fff,stroke-width:1px
```
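Such a device list can also be kept in machine-readable form. The following Python sketch writes a small inventory as CSV; all entries are invented placeholders, not actual IMMM equipment:

```python
import csv
from io import StringIO

# Hypothetical device inventory -- replace with your actual equipment.
devices = [
    {"device": "EEG amplifier", "manufacturer": "ExampleCorp",
     "model": "EC-64", "serial": "SN-0042", "firmware": "1.3.7"},
    {"device": "Tapping circuitry", "manufacturer": "In-house",
     "model": "rev. B", "serial": "n/a", "firmware": "n/a"},
]

buffer = StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(devices[0].keys()))
writer.writeheader()   # first line: column names
writer.writerows(devices)
print(buffer.getvalue())
```

A plain CSV file like this is readable by humans and by virtually any analysis software, which fits the open-format recommendation below.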
As a rule of thumb, the Methods folder must contain all the information needed to reproduce the results. Statistically speaking, all the known aspects possibly influencing the data generating process need to be included.
In the Methods folder, include everything that—deliberately or inadvertently—may have influenced the data generating process and is not already in the Research Plan folder.
3.4 Data
In the course of a scientific research project, a lot of data is collected. It is the researchers’ duty to ensure that this data is stored securely, protecting it from tampering or accidental deletion, and to guarantee that access is restricted to authorized individuals only. Since this is quite the feat, this task is usually handed over to qualified personnel operating a tailored infrastructure. Although this takes away one burden, the responsibility to store and comment their research data remains with the researchers.
The data folder must at least contain the raw data (Figure 6). If the data is sufficiently commented and described, and the code to wrangle and/or clean it and to run the analyses on it is enclosed in the project folder, the requirements are fulfilled. But if some of these requirements are not met, or for whatever reason not adequately so, rendering the project irreproducible, intermediary data must be included in the data folder, and the missing intermediate steps need to be described as thoroughly as possible.
The data folder must at least contain the raw data! The better the documentation the less additional data needs to be stored for a project.
Most of the data researchers collect is sooner or later stored electronically. To that end, software is used: either self-penned, open source, or commercially available. In-house developed software (including source code) must be included in the project repository. Open-source software needs to be named (including the version used) along with the website of origin. Commercial software, however, must not be shared. This legal restriction on sharing is one reason why data should not (only) be stored in proprietary file formats but also in universally readable formats such as comma-separated value (CSV) files or plain text files. For this purpose, the protocol describing the exact procedure to export the data from the software must be included in the repository.
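As an illustration of keeping a universally readable copy alongside a proprietary file, here is a small Python sketch that writes tabular measurements to a plain CSV file; the column names and values are invented:

```python
import csv

# Hypothetical measurements, e.g. exported from a proprietary
# acquisition file -- placeholders, not real project data.
rows = [
    {"subject": "S01", "condition": "A", "reaction_time_ms": 412},
    {"subject": "S01", "condition": "B", "reaction_time_ms": 389},
]

with open("measurements.csv", "w", newline="") as fh:
    writer = csv.DictWriter(
        fh, fieldnames=["subject", "condition", "reaction_time_ms"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

The resulting file can be opened decades from now with nothing more than a text editor.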
Data must not (only) be stored in proprietary file formats.
3.5 Analyses
Any deliberate action changing the raw data in any way must be considered ‘analysis’: wrangling and cleansing, for instance, undoubtedly change the data, and do so with a certain goal in mind (Figure 7). This goal most often is to facilitate the main analysis, but such changes also influence the outcome of the main analyses. For example, one part of the data cleaning process is often the removal of outliers, which certainly influences the outcome; otherwise there would be no need to consider removing them. The outliers are thus removed in the hope of arriving at a ‘cleaner’ outcome. This reasoning underlines the obligation to include the raw data, which necessarily includes the outliers, in the Data folder, because future scientists may have differing opinions on how to remove outliers4, and which ones; we always need to remember that science evolves and, accordingly, methods change over time5.
4 Or whether to remove them at all.
5 In other words: today’s commonly accepted methods will not necessarily stand the test of time.
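To make a cleaning step like outlier removal transparent and reproducible, it helps to keep it as a small, commented script that leaves the raw data untouched. A minimal Python sketch follows; the data and the threshold are invented for illustration:

```python
# Raw data stays untouched; the cleaning rule lives in code, so future
# researchers can inspect, rerun, or replace it with their own criterion.
raw = [402, 415, 398, 2950, 407, 391]   # hypothetical reaction times (ms)

def remove_outliers(values, n_sd=2.0):
    """Drop values more than n_sd standard deviations from the mean.

    The 2-SD cutoff is an arbitrary example, not a recommendation.
    """
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [v for v in values if abs(v - mean) <= n_sd * sd]

clean = remove_outliers(raw)
print(clean)
```

Because the rule is code rather than a manual edit, anyone re-analyzing the data can apply a different criterion to the unchanged raw values.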
The most important part of the Analyses folder is, of course, the code to build and test statistical models. Should anyone still be using a point-and-click statistics package such as SPSS, they need to store the syntax files here.
The Analyses folder includes every bit of code that changes the raw data in any way, including any aggregating and modeling routines.
3.6 Publications
As publicly funded researchers we are obliged to make our research findings public. For the sake of transparency it is advisable to include publication drafts (Figure 8) in a project’s data repository, ideally versioned with a versioning tool like git.
The Publication folder is not part of the to-be-shared data set, but it is good practice to leave a paper trail of the publications you produced at the IMMM for future scientists at the institute.
4 Conclusions
Research data management is a prerequisite to share your data with fellow and future scientists, enabling them to build on your contribution to humanity’s knowledge.
To make data sharing efficient, data should be stored following the FAIR principle: they are findable and accessible in a public repository, avoiding proprietary file formats makes the data interoperable, and well-documented data is reusable by others (Wilkinson et al., 2016).
Researchers can help make their data FAIR by following the guidelines given in this report. Before the data is finally shared with the public, meta-data will be added to make it easily findable by humans and machines. Both this and the actual storage in a publicly accessible data repository will be taken care of for you6.
6 Data will be published with its own DOI after your paper has been published.