Data selection

The two major use cases and drivers for what to keep are Research Integrity and Reproducibility (availability of the data supporting the findings in research); and the Potential for Reuse (availability of data for sharing with other users). | Beagrie, 2019

Can we afford not to preserve certain research data? That is the question that is central to the selection of research data for long-term archiving and data publication. Which research data do we archive for verification purposes only? And which datasets do we really make findable and reusable by publishing the (meta)data in a data archive? The criteria are discussed in this section.

Reasons for the retention of research data

There may be several reasons for retaining research data:

The importance of the research data
The data have potential value for reuse and for (inter)national positioning. Relevant factors include the quality, originality, size, scale and production costs of the data or, for example, the innovative nature of the research.
The uniqueness of the data
The data include non-repeatable observations.
The importance of data for historical research
The data is important for historical research, especially scientific-historical research.
Other reasons
The research data is important for non-scientific purposes (cultural heritage, museums or presentations).

In addition to these general considerations, research funders such as the Netherlands Organisation for Scientific Research (NWO, n.d.) are increasingly making it compulsory to retain research data so that reuse is possible. The Netherlands Code of Conduct for Research Integrity (VSNU, 2018) also obliges researchers to retain both raw and processed data for a period appropriate to the discipline and methodologies used.

Preconditions

Research data is not selected on substantive arguments alone. A number of practical considerations and preconditions also feed into the final decision. Consider, for example, the following:

Format

In which formats are the data available? Are the data format and the software format usable? For (re)usability, data should preferably be stored in sustainable data formats.

Processing phase

What is the processing phase of the data? Raw/unprocessed, semi-processed or published?

Metadata and data documentation

Are enough metadata and data documentation available? Is the information of sufficient quality to understand what the data is about?

Legal demands

Is there clarity about intellectual property rights, such as copyright or database rights? Are personal data involved? Can the data be archived or published as they are, or are additional measures required?

Sustainable infrastructure

Is there a sustainable infrastructure available for archiving or publishing the data? Think of a data archive or an institutional or thematic repository.

Costs

Are the costs of selecting, archiving, converting, storing and making data available for reuse taken into account? Whether data is archived for the long term remains a consideration of costs and benefits. How do the costs of archiving or publishing relate to the costs of reproducing the research data?
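
The cost question can be made concrete with a rough comparison. The sketch below is only an illustration: the cost categories follow the ones named in this section, but the function names, the flat yearly costs and all figures are assumptions, not values taken from any of the cited reports.

```python
from typing import Optional


def archiving_cost(selection: float, conversion: float,
                   storage_per_year: float, access_per_year: float,
                   years: int) -> float:
    """Total cost of keeping a dataset for `years` years (hypothetical model)."""
    one_off = selection + conversion
    recurring = (storage_per_year + access_per_year) * years
    return one_off + recurring


def cheaper_to_archive(total_archiving: float,
                       reproduction: Optional[float]) -> bool:
    """Compare the cost of archiving with the cost of reproducing the data.

    `reproduction` is None when the data cannot be reproduced at all
    (non-repeatable observations); archiving then always wins.
    """
    if reproduction is None:
        return True
    return total_archiving < reproduction


# Illustrative figures only (assumed, in euros).
total = archiving_cost(selection=500, conversion=1500,
                       storage_per_year=200, access_per_year=100, years=10)
print(total)                             # 5000
print(cheaper_to_archive(total, 20000))  # True
print(cheaper_to_archive(total, None))   # True
```

A comparison like this covers only the financial side. The substantive reasons for retention listed earlier, such as uniqueness or heritage value, still have to be weighed separately.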

Archive or publish

If the preconditions are met, it is important to decide whether you will:

Archive the data for verification purposes, or to keep open the possibility of using the data again in future research.
Publish the data for reuse by (future) others in a data archive or institutional repository.

In the flow chart below, the arguments for making an informed choice are visualised in a simplified manner.

Storage period

Once it has been established that a dataset will be archived or included in a data archive, it is important to determine how long it should be kept. The retention period will depend on developments in the discipline, the costs of storing the data and making it accessible, and the expected (re)use potential. Datasets that are regarded as heritage, such as the results of archaeological research, are generally kept indefinitely.

If the retention period has not been determined, a decision on permanent archiving will have to be taken after a certain period of time. The report 'Selection of Research Data' (Tjalsma & Rombouts, 2011) mentions a period of 10 years, after which it should be reconsidered whether the research data are retained or destroyed.
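
A repository or data steward could track this reconsideration moment with a simple date calculation. The following is a minimal sketch: the 10-year default follows the period mentioned above, while the function names and the handling of 29 February are assumptions made for illustration.

```python
from datetime import date


def review_date(archived_on: date, years: int = 10) -> date:
    """Date on which retention of a dataset should be reconsidered."""
    try:
        return archived_on.replace(year=archived_on.year + years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return archived_on.replace(year=archived_on.year + years, day=28)


def due_for_review(archived_on: date, today: date, years: int = 10) -> bool:
    """True once the review date has been reached."""
    return today >= review_date(archived_on, years)


print(review_date(date(2015, 3, 1)))                       # 2025-03-01
print(due_for_review(date(2015, 3, 1), date(2024, 1, 1)))  # False
```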

In the spotlight

Studies on selecting research data

More information about selecting research data can be found in the reports:

'What to Keep: A Jisc research data study' (Beagrie, 2019)
'Selection of research data - Guidelines for appraising and selecting research data' (Tjalsma & Rombouts, 2011)

Cases with data that meet the selection requirements

The Cabauw radar data in 4TU.ResearchData (n.d.) is a clear example of data that meets the selection criteria. The radar data contains information about the influence of dust particles on cloud formation. These measurements can only be made once and provide valuable information about climate change. In addition to the processed data, the raw data is also stored. The argument for storing raw data is that it may contain information that we cannot yet extract from it.
Interview projects count as research that is difficult to repeat. Recording personal experiences of, for example, the Second World War is often a matter of "now or never" because of the age of the interviewees. DANS has a lot of interview data in its collections Oral History (DANS, 2012) and World War II (DANS, n.d.), which are a valuable source for historical research, now and in the future. The interviews are stored behind the scenes in a large file format, which can be regarded as the "raw data", and are made available as MP4 via EASY.
Likewise, we cannot afford to lose the data that is now being collected at the Large Hadron Collider, a particle accelerator (CERN, n.d.).

Case from a student of Essentials 4 Data Support

A student left the following comment, which makes clear that the considerations for retention are not always straightforward.

"I have an example that does not fit in with the mentioned cases, and for which is difficult to find the optimal solution. We perform experiments that produce massive amounts of data. The experiments are difficult and expensive, suggesting that it is a good idea to store this raw data. However, the data is not usable in the original format and needs to be preprocessed, which greatly reduces its quantity. The preprocessed data is used for our analyses and publications, so if colleagues want to verify our data, they would also need our preprocessed data sets. It therefore seems more sensible to store this data for the long term, also with respect to the costs of storage. In addition, we expect the data acquisition to continuously improve in quality. So, in five years or less the raw data we now have may be very inferior to what we can record in the future. However, the preprocessing algorithms are also developing, and other researchers might be more interested in applying these to our datasets. Moreover the experiments we have performed are unlikely to be redone in the future because of the costs involved". | Chris van der Togt, 2018

Guides for archiving and publishing data

RDM Support at Utrecht University in the Netherlands offers two how-to guides for:

Archiving data (Utrecht University, n.d.a.)
Publishing data (Utrecht University, n.d.b.)

RDNL services for archiving and publishing data

The next section contains an infographic with RDNL services for archiving and publishing data.

Sources

4TU.ResearchData (n.d.). Atmospheric Observation Collection Cabauw. https://data.4tu.nl/collections/Atmospheric_observations_IDRA_Cabauw/5065367

Beagrie, N. (2019). What to Keep: A Jisc research data study. http://repository.jisc.ac.uk/7262/1/JR0100_WHAT_RESEARCH_DATA_TO_KEEP_FEB2019_v5_WEB.pdf

CERN (n.d.). CERN Open data portal. http://opendata.cern.ch/

DANS (2012). Thematische collectie: Oral History. https://doi.org/10.17026/dans-z3c-f26d

DANS (n.d.). Collectie Tweede Wereldoorlog. https://easy.dans.knaw.nl/ui/?wicket:bookmarkablePage=:nl.knaw.dans.easy.web.search.pages.PublicSearchResultPage&q=collectie+tweede+wereldoorlog

Gibney, E. (2013, November 26). LHC Plans for open data future. Nature News. http://www.nature.com/news/lhc-plans-for-open-data-future-1.14244

NASA (2011). Astronomers find elusive planets in decade-old Hubble data. http://www.nasa.gov/mission_pages/hubble/science/elusive-planets.html

NWO (n.d.). Open science. https://www.nwo.nl/en/policies/open+science

Tjalsma, H.; Rombouts, J. (2011). Selection of research data - Guidelines for appraising and selecting research data. Retrieved from http://www.dans.knaw.nl/nl/over/organisatie-beleid/publicaties/DANSselectionofresearchdata.pdf

Utrecht University (n.d.a.). Storing and preserving data. RDM Support. [Guide]. https://www.uu.nl/en/research/research-data-management/guides/storing-and-preserving-data

Utrecht University (n.d.b.). Publishing and sharing data. RDM Support. [Guide]. https://www.uu.nl/en/research/research-data-management/guides/publishing-and-sharing-data

VSNU (2018). The Netherlands Code of Conduct for Research Integrity. https://doi.org/10.17026/dans-2cj-nvwu
