Data formats

  Main points

Research data comes in many shapes and sizes(1): text, numerical data, models, software and multimedia. There is also discipline-specific research data, and data that is characteristic of the instrument used to measure it.

A data format, or file format, is the way in which data is encoded in a file. The information is encoded in such a way that a program or application can recognize, read and use the data.

The history of digital storage(2) provides a wonderful insight into the limitations of information carriers. If software or hardware is no longer in use, data can become unreadable. To prevent this, it is vital to choose an open format: a format that is not tied to a particular software supplier, unlike a proprietary format. For open formats all format details are public, so anyone can write (or rewrite) the software needed to read the data. If it is a standard open format, someone else has probably already written that software for you.

Data archives often use a list of preferred formats in which researchers should supply their data. Data archives prefer open formats because these enable them to guarantee a longer life for the research data.

Loss of data

Research data can become unusable in three different ways:   

  • Loss of bits.
    The carrier is damaged, is lost, or its quality deteriorates in such a way that it affects the bits, popularly called 'bit rot'.

  • Loss of documentation.
    It is unclear how one file is tied to another, for instance when different versions of a file or the metadata are no longer available, leaving the meaning of the data unclear. 

  • Loss of display possibilities.
    The operating system, the hardware or the application is no longer present or can no longer be used. This can also be caused by external factors such as a computer virus, fire or accidental deletion of files.
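
The first of these, loss of bits, can at least be detected by recording a checksum when a file is stored and comparing it again later. The following is a minimal sketch of that idea, not a complete preservation workflow. The function names file_checksum and is_unchanged are hypothetical, and SHA-256 is an assumed choice of algorithm.

```python
import hashlib


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded earlier."""
    return file_checksum(path) == recorded_checksum
```

A checksum only reveals that bits have changed. It cannot repair them, so it complements backups rather than replacing them.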

To prevent research data from becoming unusable, having a data storage strategy is essential. Read more about it in the next section.

   An in-depth look: MIME types

Data formats are often indicated by their MIME type. MIME stands for Multipurpose Internet Mail Extensions. The MIME type tells a web browser how to deal with a file.

A MIME type is noted as two indications separated by a slash (MIME type/subtype). Example: text/plain is the MIME type for plain text.
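
As an illustration of this notation, the short sketch below splits a MIME type string into its two parts. The helper name split_mime_type is hypothetical. The handling of an optional parameter after a semicolon (such as a character set) is an addition that goes slightly beyond the description above.

```python
def split_mime_type(mime_type):
    """Split a string such as 'text/plain' into (type, subtype)."""
    # Drop optional parameters such as '; charset=utf-8' and normalize case.
    essence = mime_type.split(";", 1)[0].strip().lower()
    main_type, slash, subtype = essence.partition("/")
    if not slash or not main_type or not subtype:
        raise ValueError(f"not a valid MIME type: {mime_type!r}")
    return main_type, subtype


print(split_mime_type("text/plain"))
```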

Many people recognize data formats by their extension: the three or four letters following the dot at the end of the file name. A video on your computer, for instance, may have the extension .avi. The corresponding MIME type is video/x-msvideo. If the .avi video is on a website, the URL does not have to end in .avi. An extension is also not always reliable: a file can be renamed, for instance, so that the extension no longer refers to its data format. Someone can decide to use the extension .CH1 for 'chapter 1'. It is also possible for several formats to use the same extension, for instance .mid for MIDI audio files and for data files of the geographic MapInfo Interchange Format.
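
The limits of extensions can be seen with Python's standard mimetypes module, which looks up a MIME type purely from the file name and never inspects the file contents. This is a small sketch. The function name guess_from_name and the file names are made up for illustration, and the exact type reported for a given extension may differ between systems.

```python
import mimetypes


def guess_from_name(filename):
    """Return the MIME type suggested by the file name alone, or None."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# A well-known extension is mapped to a MIME type.
print(guess_from_name("notes.txt"))
print(guess_from_name("film.avi"))

# A home-made extension such as .CH1 says nothing about the format.
print(guess_from_name("thesis.CH1"))
```

Because only the name is consulted, renaming a file changes the answer even though the data inside stays the same.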

The advantage of MIME types is that the format of a file on a website can always be traced. The MIME type is transferred 'under water', that is, behind the scenes in the Content-Type header that the web server sends along with the file, so it can also be read by computers.
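
When a file is served over HTTP, its MIME type travels in the Content-Type response header. The sketch below reads that header from a block of header text using Python's standard email parser. The function name content_type_from_headers and the sample headers are invented for illustration and do not come from a real server.

```python
from email.parser import Parser


def content_type_from_headers(raw_headers):
    """Return the MIME type from raw header text, or None if it is absent."""
    message = Parser().parsestr(raw_headers, headersonly=True)
    if message["Content-Type"] is None:
        return None
    # get_content_type() returns 'type/subtype' in lower case, without parameters.
    return message.get_content_type()


# Hypothetical headers for a download whose URL does not end in .avi.
sample = "Content-Type: video/x-msvideo\nContent-Length: 1024\n\n"
print(content_type_from_headers(sample))
```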


What do I have to know about data formats?  

It is not necessary to know all the technical ins and outs of data formats. It is important, however, to have an idea of all the factors involved so that you can give a researcher general advice on the best data format in which to store datasets. You will be able to explain that long-term, sustainable storage requires a certain data format. If researchers still want to submit their data in a different format, there is no guarantee that the research data will remain usable for a long time. When in doubt, refer to an expert in this field.


  1. UK Data Archive. (2011). Managing and sharing data. Retrieved from
  2. Mashable. (2011). The history of digital storage. Mashable Infographics. Retrieved from