"Storage media, software, hardware and operating systems are aligned. For a correct representation of a digital object, the combination of all four must be correct. However, new versions of all variables appear at a rapid pace and there is always a risk of information being lost." (Netwerk Digitaal Erfgoed, n.d.)
To keep reading, displaying and processing research data, you need to know which data format was used and which software can recognise and open it. This section looks at both.
Diversity in file formats
Research data comes in many forms and sizes: text, numerical data, models, software, code, multimedia, photos, etc. In addition, there are discipline-specific research data or data that are characteristic of the instrument with which they are measured.
Data is stored in a specific file format. The information in a file format is encoded in such a way that a certain type of software can recognise, read and use the file. A file format is often referred to by its three- or four-letter file extension, which usually indicates the software used to create the file.
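An extension is only a label on the file name, so software that needs to recognise a format often also looks at the first bytes of the file. The sketch below illustrates that idea under simple assumptions: the function name sniff_format and the two-entry signature table are hypothetical and chosen for illustration, and real identification tools cover far more formats.

```python
from pathlib import Path

# Well-known leading byte signatures. This table is deliberately tiny and
# illustrative (an assumption of this sketch), not a complete registry.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP container (used by e.g. .xlsx, .docx, .odt)",
}


def sniff_format(path):
    """Return (extension, guess based on the leading bytes or None)."""
    path = Path(path)
    with path.open("rb") as handle:
        head = handle.read(8)  # the signatures above fit in 8 bytes
    guess = None
    for signature, label in SIGNATURES.items():
        if head.startswith(signature):
            guess = label
            break
    return path.suffix.lower(), guess
```

If the extension and the guess disagree, for example a file named report.csv that starts with the PDF signature, the extension alone would have pointed to the wrong software.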
Choosing future-proof data formats
Not all file formats are equally well prepared for the future. The following considerations are important when choosing a particular type of software and file format:
Is an open, standardised data format available?
Can open source software be used?
Can the data be stored in an open, standardised data format?
The documentation of open, standardised data formats is freely available, so that anyone who wants to can write or rewrite the software needed to read the data. Such formats can also be used free of licence fees and are supported by several software vendors and/or open source initiatives. An open file format can also originate from a commercial company. An example is PDF/A: it is based on PDF, which was developed by the commercial company Adobe, but its specifications are published as an ISO standard (ISO 19005). That is why it is an open file format.
Please note that open file formats may lack some functions that exist only in a commercial software product whose specifications have not been published. In that case, it makes sense to store the data in both formats: the open data format and the closed commercial file format.
Examples of open data formats are PDF/A, .csv, .odf, .xml, .rstat, .mxf, etc. PDF/A has been specially developed for long-term accessibility (British Library, 2019).
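As a minimal sketch of keeping a copy in an open format, tabular data can be written to .csv with Python's standard library. The function names export_rows_to_csv and read_csv are hypothetical and only serve this illustration. Reading data out of a specific closed format would need that format's own reader, which is not shown here.

```python
import csv


def export_rows_to_csv(rows, header, target):
    """Write tabular data to a UTF-8 encoded CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(target, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(target):
    """Read the file back as a list of rows (each a list of strings)."""
    with open(target, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))
```

Reading the file back returns every value as text. A plain .csv file does not record data types or units, so these need to be documented alongside the data.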
Can the data format be opened by other software?
New operating systems, software and hardware appear in rapid succession, and new versions do not automatically support files created with earlier versions. At the same time, a lot of software can open file formats of other software: OpenOffice opens Microsoft formats, and QGIS opens GIS data from ArcGIS. It is a major concern that not every format is stable within its own software environment. A data format that is well supported outside its own software environment is therefore considerably more future-proof.
The interoperability of data is greater if researchers use the same format, or if one person's software can also read and edit the other person's format. This applies both during research and for a long time afterwards.
When is it legitimate to choose a closed data format?
For many areas and applications, such as Computer Aided Design (CAD) or Virtual Reality, no open data formats are available at all. In this case, try to choose a commonly used data format.
Open file formats may also lack some functions that only their closed counterpart offers (the commercial software product whose specifications have not been published). In that case, it may be useful to store the data in both formats: the open data format and the closed commercial file format.
Examples of data formats that are commonly treated as closed, because they are tied to a specific commercial software product, are .xlsx, .pptx, .stata, .mov and .dxf.
Unsure about a future-proof data format? Take a look at the preferred formats of data archives
Data archives are committed to long-term accessibility of research data. The file format in which data is stored is of great importance for its future-proof potential. This is why data archives often publish lists of preferred formats. These are data formats that are preferably independent of specific software and whose specifications are well documented and openly available. Visit the page 'Preferred formats' for examples.
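Checking a dataset against such a list can be sketched in a few lines. The set PREFERRED_EXTENSIONS below is a hypothetical example and not the list of any particular archive, and the function name split_by_preference is likewise an assumption made for this illustration.

```python
from pathlib import Path

# Hypothetical preferred-format list, for illustration only. Replace it with
# the list published by the data archive you intend to deposit with.
PREFERRED_EXTENSIONS = {".csv", ".xml", ".pdf", ".txt"}


def split_by_preference(filenames, preferred=PREFERRED_EXTENSIONS):
    """Split file names into (preferred, needs_review) by extension."""
    preferred_files, needs_review = [], []
    for name in filenames:
        # Compare case-insensitively, so "data.CSV" counts as ".csv".
        suffix = Path(name).suffix.lower()
        if suffix in preferred:
            preferred_files.append(name)
        else:
            needs_review.append(name)
    return preferred_files, needs_review
```

A check by extension is only a first pass. It cannot tell a PDF/A file from an ordinary PDF, for example, because both use the .pdf extension.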
4TU.ResearchData (n.d.). 4TU SEARCH. https://data.4tu.nl/portal
British Library (2019). PDF Format Preservation Assessment Part 2: PDF/A Profile. http://wiki.dpconline.org/images/2/22/PDFA_Assessment_v1.0.pdf
Netwerk Digitaal Erfgoed (n.d.). Leren Preserveren. Beperkt Houdbaar [Cursus]. https://lerenpreserveren.nl/topic/beperkthoudbaar/
Netherlands eScience Center (n.d.). Research Software Directory. https://www.research-software.nl/
UK Data Archive (2011). Managing and sharing data. http://www.data-archive.ac.uk/media/2894/managingsharing.pdf