Data formats

"Storage media, software, hardware and operating systems are attuned to one another. For a correct representation of a digital object, the combination of all four must be correct. However, new versions of all four appear at a rapid pace, so there is always a risk of information being lost." (Netwerk Digitaal Erfgoed, n.d.)

Being able to keep reading, displaying and processing research data requires knowledge of the data format used and of the software that can recognise and open it. This section zooms in on both.

Diversity in file formats

Research data comes in many forms and sizes: text, numerical data, models, software, code, multimedia, photos, etc. In addition, there are discipline-specific research data or data that are characteristic of the instrument with which they are measured.

Data is stored in a specific file format. The information in a file format is encoded in such a way that a certain type of software can recognise, read and use the file. A particular file format is often indicated by a three- or four-letter file extension, which hints at the software used to create or open the file.
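
As an illustration of how software can recognise a format from the encoded content rather than from the file name, the sketch below compares the first bytes of a file with a few well-known signatures. The names SIGNATURES, sniff_format and sniff_file are assumptions made for this example, and the table is deliberately tiny; real identification tools rely on much larger signature registries.

```python
# Minimal sketch: recognise a file format from its leading bytes
# (often called a "magic number"). Illustrative only.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP container"),
]


def sniff_format(first_bytes: bytes) -> str:
    """Return a label for the first matching signature, or 'unknown'."""
    for signature, label in SIGNATURES:
        if first_bytes.startswith(signature):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))


print(sniff_format(b"%PDF-1.7\n"))
print(sniff_format(b"just some text"))
```

Because the check looks at the content, it still works when a file has been renamed or has no extension at all.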

Choosing future-proof data formats

Not all file formats are equally well prepared for the future. The following considerations are important when choosing a particular type of software and file format:

Is an open, standardised data format available?

Can open source software be used?
Can the data be stored in an open, standardised data format?

The documentation of open, standardised data formats is freely available, so anyone who wants to can write (or rewrite) the software needed to read the data. Such formats can also be used free of licence fees and are supported by several software vendors and/or open source initiatives. An open file format can also originate from a commercial company. An example is PDF/A: although it is based on PDF, which was developed by the commercial company Adobe, its specifications are open and published as an ISO standard (ISO 19005). That is why it is an open file format.

Please note that an open file format may lack certain functions that are available only in a commercial software product whose format specifications have not been published. In that case, it makes sense to store the data in both formats: the open data format and the closed commercial file format.

Examples of open data formats are PDF/A, .csv, .odf, .xml, .rstat, .mxf, etc. PDF/A has been specially developed for long-term accessibility (British Library, 2019).
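
To make the idea of an open format concrete, the following sketch writes a small table to CSV and reads it back using only the Python standard library. The sample records and the function names to_csv_text and from_csv_text are invented for this illustration; the point is merely that no vendor-specific software is needed for either step.

```python
import csv
import io

# Hypothetical measurements, used only to demonstrate the round trip.
rows = [
    {"sample": "A1", "temperature_c": "21.5"},
    {"sample": "A2", "temperature_c": "22.1"},
]


def to_csv_text(records):
    """Serialise a list of dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv_text(text):
    """Parse CSV text back into a list of dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


csv_text = to_csv_text(rows)
print(csv_text)
print(from_csv_text(csv_text) == rows)
```

Note that CSV stores every value as text, so units and data types should be documented alongside the file.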

Can the data format be opened by other software?

New operating systems, software and hardware appear in rapid succession. It is not self-evident that the new versions support the use of files created with earlier versions. At the same time, a lot of software is able to open file formats of other software. OpenOffice opens Microsoft formats; QGIS opens GIS data from ArcGIS. The fact that not every format is stable within its own software environment is a major concern. But if a data format receives good support outside its own software environment, this will obviously make a huge difference to its future-proof potential.

The interoperability of data is greater if researchers use the same format, or if one person's software can also read and edit the other person's format. This applies both during research and for a long time afterwards.

When is it legitimate to choose a closed data format?

For many areas and applications, such as Computer Aided Design (CAD) or Virtual Reality, no open data formats are available at all. In this case, try to choose a commonly used data format.

It is also possible that an open file format lacks certain functions that are available only in its closed counterpart (the format of a commercial software product whose specifications have not been published). In that case, it may be useful to store the data in both formats: the open data format and the closed commercial file format.

Examples of closed data formats are .xlsx, .pptx, .stata, .mov and .dxf.

Unsure about a future-proof data format? Take a look at the preferred formats of data archives

Data archives are committed to long-term accessibility of research data. The file format in which data is stored is of great importance for its future-proof potential. This is why data archives often publish lists of preferred formats. These are data formats that are preferably independent of specific software and whose specifications are well documented and openly available. Visit the page 'Preferred formats' for examples.

In the spotlight

Research Software Directory with Open Source software/code

The Research Software Directory (Netherlands eScience Center, n.d.) was developed as a tool to find existing open source software/code that can be used during research.

For data format geeks: a dive into MIME types

Data formats are often indicated by their MIME type. MIME stands for Multipurpose Internet Mail Extensions. A more modern name for this is media type, but MIME type is still more established. A MIME type tells the operating system or browser how to handle a file.

A MIME type is written down as two indications separated by a slash (MIME type/subtype). Example: text/plain is the MIME type for plain text.
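
The notation can be taken apart mechanically, as the sketch below shows. The MimeType tuple and the function split_mime_type are names assumed for this example. It also separates the optional structured suffix that some subtypes carry after a plus sign, such as '+xml'.

```python
from typing import NamedTuple, Optional


class MimeType(NamedTuple):
    main_type: str
    subtype: str
    suffix: Optional[str]  # e.g. "xml" for "application/gml+xml"


def split_mime_type(value: str) -> MimeType:
    """Split 'type/subtype' into its parts; raise ValueError if malformed."""
    main_type, slash, subtype = value.strip().lower().partition("/")
    if not slash or not main_type or not subtype:
        raise ValueError(f"not a MIME type: {value!r}")
    # A structured suffix, if present, follows the last "+" in the subtype.
    _, plus, suffix = subtype.rpartition("+")
    return MimeType(main_type, subtype, suffix if plus else None)


print(split_mime_type("text/plain"))
print(split_mime_type("application/gml+xml"))
```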

Many people recognise data formats by their extension. These are usually the three or four letters after the last dot in the file name. For example, a movie on your computer may have the extension .avi. A commonly used MIME type for it is video/x-msvideo. If the .avi movie is on a website, the URL doesn't have to end in .avi even when the file is in fact an .avi. An extension is also not always correct, for example when the file has been renamed: someone might change a file extension to .CH1 for 'chapter 1'. In addition, several formats can share the same extension: .mid, for example, is used for MIDI sound files as well as for the geographical map file Mapinfo Interchange Drawing.
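
The link between extensions and MIME types, and its limits, can be tried out with Python's standard mimetypes module. This is a sketch only: the file names are made up, guess_from_name is a helper name assumed for this example, and the exact mapping for less common extensions can differ between Python versions and operating systems.

```python
import mimetypes


def guess_from_name(file_name: str):
    """Guess a MIME type from the file name alone; None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(file_name)
    return mime_type


# The last name imitates a file renamed to '.CH1' for 'chapter 1'.
for name in ["notes.txt", "paper.pdf", "index.html", "thesis.CH1"]:
    print(name, "->", guess_from_name(name))
```

The guess is based purely on the name, so a renamed file yields no result or a wrong one; the content of the file is never inspected.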

The advantage of MIME types is that a web server sends them along in the HTTP header of its response (in the Content-Type field, which is invisible to the user), so the browser and the operating system know with which program to open a clicked file.
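
As a rough sketch of what a program does with such a header, the example below parses two Content-Type values using the standard email.message module, which understands this header syntax. The header values are invented examples and parse_content_type is a name assumed here; no network request is made.

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Return (MIME type, charset) for a Content-Type header value."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type(), message.get_content_charset()


print(parse_content_type("text/html; charset=UTF-8"))
print(parse_content_type("application/pdf"))
```

The charset parameter is optional, which is why the second call reports None for it.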

MIME types in 4TU.Centre for Research Data

If you go to 4TU.ResearchData (n.d.), you will see MIME types in the left column under the heading Data format (click on 'more'). The type 'application' means that the file is related to a certain type of application or program. Strictly speaking, applications are data formats that are read by a particular application.

application/pdf
When documents are exchanged between programs, the layout can sometimes be lost or shifted. To prevent this, formats exist that provide a universal view of the document. An example is PDF (Portable Document Format). This is an open and universal file format for the electronic exchange of documents in which the layout is retained.
application/vnd.google-earth.kml+xml
The geographical data for this MIME type are encoded in such a way that they are readable in a so-called earth browser such as Google Earth, Google Maps, and Google Maps on your mobile. The suffix '+xml' indicates that this is an XML file. You can view the content of these files with a regular text editor.
application/gml+xml
GML stands for Geography Markup Language: a standard way of describing geographical information. Geographical data describe the world in spatial terms, simply in plain text. It is a language that is independent of any form of visualisation of that data. In an earth browser the data is visualised. The suffix '+xml' indicates an XML file. You can view the content of these files with a regular text editor.
application/x-java-archive
A dataset associated with this MIME type is related to the Java programming language. "Archive" means that it is a container format to pack files and directories, such as zip.
application/octet-stream
This is a generic type for binary data that is not further specified. It is a residual category for all datasets for which it is not clear what they are or for which there is simply no separate MIME type.
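
Because the '+xml' types are plain XML, any generic XML parser can read them. The sketch below extracts the name and coordinates of a placemark from a tiny, made-up KML fragment with Python's standard xml.etree.ElementTree module. SAMPLE_KML and read_placemarks are assumptions for this illustration, and only simple Point placemarks are handled.

```python
import xml.etree.ElementTree as ET

# Namespace of KML 2.2 documents.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Hypothetical sample document, small enough to read in a text editor.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Example point</name>
    <Point><coordinates>4.3731,52.0022,0</coordinates></Point>
  </Placemark>
</kml>"""


def read_placemarks(kml_text: str):
    """Return (name, longitude, latitude) for each Point placemark."""
    root = ET.fromstring(kml_text)
    placemarks = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = placemark.findtext(
            "kml:Point/kml:coordinates", default="", namespaces=KML_NS
        )
        if not coords.strip():
            continue  # skip placemarks without a simple point geometry
        # KML lists coordinates as longitude,latitude[,altitude].
        lon, lat, *_ = (float(part) for part in coords.strip().split(","))
        placemarks.append((name, lon, lat))
    return placemarks


print(read_placemarks(SAMPLE_KML))
```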

The text formats plain, html and xml:

text/plain means that the file consists of text. Often this is simply text without formatting, but any file that can be read with a text editor (such as Notepad) is basically text/plain. However, you can sometimes give a more specific MIME type, such as text/xml or text/csv. In such a case, more specific is always better, because it gives more information.
text/html (HyperText Markup Language) is a format that indicates what the information will look like on a website. You can use code to indicate what the text should look like: for example, bold or italic. This formatting is not present in plain text.
text/xml (or application/xml). In XML (eXtensible Markup Language) you do not specify the formatting, but you can provide information about the content of the file, for example by adding metadata such as <title> for a title and <creator> for the person who created the document.
Note that xml uses two different MIME types. This is not uncommon. Sometimes a text-based datatype uses application/... and not text/... because it is not meant to be viewed in a text editor, although it is possible.

Numerical data:

application/x-matlab data is an example of numerical data: MATLAB is a numerical computing environment and programming language.
HDF5 (application/x-hdf5) and NetCDF (application/x-netcdf) are both data formats that are frequently used to store large amounts of numerical data.

Sources


4TU.ResearchData (n.d.). 4TU SEARCH. https://data.4tu.nl/portal

British Library (2019). PDF Format Preservation Assessment Part 2: PDF/A Profile. http://wiki.dpconline.org/images/2/22/PDFA_Assessment_v1.0.pdf

Netwerk Digitaal Erfgoed (n.d.). Leren Preserveren. Beperkt Houdbaar [Cursus]. https://lerenpreserveren.nl/topic/beperkthoudbaar/

Netherlands eScience Center (n.d.). Research Software Directory. https://www.research-software.nl/

UK Data Archive (2011). Managing and sharing data. http://www.data-archive.ac.uk/media/2894/managingsharing.pdf
