In the spotlight
Processing of datasets at DANS
The data managers of DANS follow a protocol described in the Provenance document (DANS, n.d.a.). A dataset in EASY can only be published once a data manager has ticked off all required actions in the workflow. An extensive internal document describes all these actions in more detail. Information about the exact working method can always be requested via email@example.com. The sections below contain a number of examples of the actions that are taken.
Checks on data ingest
Datasets are deposited by researchers or organisations themselves in DANS EASY (DANS, n.d.b.). RDNL made a video about this (RDNL, 2016). The dataset used as an example in the video can be found in DANS EASY (Gemeente Dordrecht, 2011).
The depositor takes care of the description of the datasets in metadata fields according to the international standard Dublin Core (DCMI, n.d.). Files belonging to the dataset can be uploaded during ingest. The depositor can compress several files into one ZIP file. When this ZIP file is uploaded, the system unpacks it automatically. In consultation with DANS, large datasets can also be delivered using a file transfer service.
After receiving the dataset, the data manager's procedure starts with checks. The dataset is checked for the presence of privacy-sensitive information. The data manager further checks whether the dataset is complete: whether any files are missing and whether the dataset can be fully understood by other researchers. For example, there may be tables in the dataset that use variables, codes and/or abbreviations. In this case, a codebook must be present in which these variables are explained.
If the dataset contains files whose contents cannot be deduced from the file name or folder structure, the data manager may ask the depositor to send a summary document or a file list in a spreadsheet. A file list contains an explanation of what the deposited files contain. At DANS, such a file list is mandatory for archaeological datasets.
The data manager also checks the Dublin Core project description for incompleteness, ambiguities and (typographical) errors. An archivist will make minor adjustments to the metadata if this improves the clarity of the dataset.
The data will not be assessed or adjusted in terms of content. If information is missing, the data manager contacts the depositor.
Converting and restructuring files
DANS has drawn up a list of preferred formats (DANS, n.d.c.). These are file formats, sorted by file type, which DANS believes offer the best long-term guarantees in terms of usability, accessibility and sustainability. The list also contains an overview of 'non-preferred' formats: frequently used file formats that can often easily be converted into a preferred format.
Depositors are requested to deliver their files in a preferred format where possible.
For file formats that are not included in the list of preferred formats, DANS assesses the possibilities for each dataset separately: can the data be delivered in a different format? Can the files be converted?
For some exceptional formats, conversion may not be possible. Since there is no other way to archive this data, DANS may still accept these files. In this case, however, DANS cannot guarantee the durability and accessibility of the dataset.
The original files are always stored with the dataset in an unaltered form, in a separate subfolder labeled "original". In the case of format conversion, the converted files are added outside this subfolder.
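The storage rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the file names and the use of a temporary folder are hypothetical and do not describe the actual DANS tooling.

```python
import shutil
import tempfile
from pathlib import Path


def archive_with_original(dataset_dir: Path, source: Path, converted: Path) -> None:
    """Store `source` unaltered under 'original/' and `converted` outside that subfolder."""
    original_dir = dataset_dir / "original"
    original_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, original_dir / source.name)      # unaltered copy of the deposit
    shutil.copy2(converted, dataset_dir / converted.name)  # converted file next to 'original/'


def demo_original_folder() -> list[str]:
    """Build a tiny dataset in a temporary folder and list the resulting files."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        source = tmp_path / "measurements.xlsx"    # hypothetical deposited file
        converted = tmp_path / "measurements.csv"  # hypothetical converted file
        source.write_bytes(b"placeholder spreadsheet bytes")
        converted.write_text("id,value\n1,3.5\n", encoding="utf-8")
        dataset_dir = tmp_path / "dataset"
        archive_with_original(dataset_dir, source, converted)
        return sorted(
            p.relative_to(dataset_dir).as_posix()
            for p in dataset_dir.rglob("*")
            if p.is_file()
        )
```

Calling `demo_original_folder()` lists the converted file at the top level of the dataset and the untouched deposit inside the "original" subfolder.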
DANS takes care of the logical arrangement and structure of the dataset. For this reason, the data manager may consider it necessary to restructure the files within a dataset into folders and subfolders. In this picture you see an example of data processing after the dataset has been deposited and before it is offered to users. On the left side you can see the files that a data depositor submitted. On the right side you can see how a data manager of DANS reordered the files before making them available to EASY users:
- The photos are no longer separate but in a folder called 'Foto's';
- The Excel file is converted to .csv. This preferred format can easily be opened as text and as a table.
Finally, the data manager converts the depositor's file list into an XML file that allows the system to automatically add the information from this list to the file details in EASY.
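As a rough illustration of this last step, the sketch below turns a file list with two columns into XML. The column names (`path`, `description`), the element names and the example rows are assumptions made for this example; they are not the schema that EASY actually uses.

```python
import csv
import io
import xml.etree.ElementTree as ET


def file_list_to_xml(csv_text: str) -> str:
    """Turn a file list with 'path' and 'description' columns into an XML string."""
    root = ET.Element("files")  # hypothetical element names, not the EASY schema
    for row in csv.DictReader(io.StringIO(csv_text)):
        file_el = ET.SubElement(root, "file", {"path": row["path"]})
        ET.SubElement(file_el, "description").text = row["description"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical file list, as a depositor might export it from a spreadsheet.
example_list = (
    "path,description\n"
    "Foto's/put1.jpg,Photo of pit 1\n"
    "data/finds.csv,Table of finds\n"
)
xml_text = file_list_to_xml(example_list)
```

Each row of the file list becomes one `file` element, so the explanation written by the depositor travels with the file it describes.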
Ways to download data
The depositor selects the access rights under which the files of the dataset are made available, for example open access - unlimited (CC0 Waiver - No rights reserved). The data manager ensures that the correct files are made available under these rights. With file conversions, only the converted files are published; the original files are archived with the dataset, but they are invisible to users.
By agreement with the depositor, the data manager can ensure that different files are granted different access rights.
Datasets in EASY are presented in three tabs: the overview (front page with summary), the description (project description in Dublin Core) and the data files (files). The data manager can choose to build the overview page as an HTML page with images, for example, in order to improve the presentation of the dataset. The Dublin Core project descriptions of datasets and any metadata that have been added are always visible to users, regardless of the access rights.
Each dataset in EASY is provided with a persistent identifier, a durable shortcut that can be used as a reference to the dataset. The persistent identifier is automatically assigned to the dataset when the depositor submits the dataset. The persistent identifier becomes active as soon as the data manager completes the ingest process and publishes the dataset.
Users who meet the conditions for accessing the files can make their own selection of the files they want to download. The selection is packaged as a ZIP file. The download package also includes a PDF with the general provisions of DANS. In addition, the ZIP file contains an XML file with the metadata that are linked to the files.
EASY has a download limit of 400 files and/or 1000 MB (1 GB) per download. For larger datasets, an alternative way of sending the data can be agreed with DANS.
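The limit can be read as a simple check on a selection of files. The sketch below assumes that a selection is acceptable only if it stays within both bounds, and that 1 MB means 10**6 bytes; neither detail is stated in the text, and the function name is hypothetical.

```python
MAX_FILES = 400
MAX_BYTES = 1000 * 10**6  # 1000 MB, assuming 1 MB = 10**6 bytes


def within_download_limit(file_sizes: list[int]) -> bool:
    """Return True if a selection (sizes in bytes) stays within both limits."""
    return len(file_sizes) <= MAX_FILES and sum(file_sizes) <= MAX_BYTES
```

A selection of 401 very small files fails this check, and so does a selection of two files of 600 MB each.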
Conversion of data formats
At 4TU.Centre for Research Data there are three moments when data can be converted:
- Before upload
In principle, it is the responsibility of the supplier of the dataset to deliver the data in a sustainable format. If data are supplied in a non-sustainable format, they must first be converted into a sustainable format. For example, the IDRA weather radar data collection consists of a large number of numerical files stored as NetCDF (Otto, 2010). The party that supplied the data did not have the data in NetCDF, but converted them on the instructions of 4TU.Centre for Research Data.
Many researchers' datasets consist of numbers separated by commas. These are home-made formats, which are converted into NetCDF. Although a simple table could also be converted into .csv (comma-separated values), 4TU.Centre for Research Data prefers NetCDF because it adds standard internal metadata and the possibilities for use are greater.
- After upload
Over time, certain data formats will become unusable. Once a dataset is managed by 4TU.Centre for Research Data, it is the responsibility of 4TU.Centre for Research Data to carry out the conversions needed to guarantee a long lifetime for the research data. For Darelux (4TU.Centre for Research Data, n.d.a.), an old dataset, a lot of conversion has already been done. The dataset was first converted into its own XML format. Next, the NcML (Unidata, n.d.), the XML version of NetCDF, was made and then the dataset was converted to NetCDF. Then the whole dataset was moved from 4TU.Centre for Research Data's own server (Fedora) to OPeNDAP. The dataset was converted to XML because this adds standard metadata: you can provide information about the content, which keeps the dataset readable and understandable. NetCDF was chosen because of the possibilities for use (see the tab 'data interaction').
In addition, 4TU.Centre for Research Data may decide to convert the format due to storage capacity. The helicopter data set with aerial photographs of traffic routes (Hoogendoorn, 2010) was delivered in .tiff, a format that takes up a lot of storage space. This set was converted to .png. Whether such a conversion is possible depends on the application of the research in question.
- At download
If a user wants to download a dataset (DIP), he or she may prefer a certain data format. Various formats are available for a number of datasets, such as NcML alongside NetCDF. These can be used to generate additional formats such as CDL and .csv. Moreover, data on the OPeNDAP server (almost all NetCDF and HDF5 data) can be accessed in different ways, i.e. not just as a download. This is further explained in the tab 'data interaction'.
Data compression means that you reduce the amount of space that research data take up. You represent the digital information with fewer bits than the original data. This is useful if you want to store or transport large amounts of data.
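A small sketch can make this concrete. It uses the `gzip` module from Python's standard library on made-up, highly repetitive sample data; real research data will usually compress less well than this example.

```python
import gzip

# Made-up, repetitive sample data standing in for a table of measurements.
raw = ("2010-01-01,12.5,0.0\n" * 10_000).encode("utf-8")

packed = gzip.compress(raw)         # fewer bytes than the original
restored = gzip.decompress(packed)  # lossless: the original bytes come back

print(f"original: {len(raw)} bytes, compressed: {len(packed)} bytes")
```

The compression is lossless: after decompression, `restored` is identical to `raw`.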
After a dataset (SIP) has been uploaded, a BagIt package is first made of it. This is a kind of inventory: what is in this dataset? BagIt is a format in which basic metadata and a so-called checksum are recorded for every file in a dataset. The checksum is the fingerprint of a file in the dataset. A checksum is made by combining all the bits of the file in a fixed way into one short value. When you calculate the checksum of a downloaded file, it must correspond to the specified value that was calculated on the server. If not, something must have gone wrong. After bagging, the whole dataset is stored (zipped) in a compressed package.
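The checksum comparison can be sketched as follows. The example assumes SHA-256 as the checksum algorithm and writes a manifest in the BagIt style (one line per file: checksum, then the path under `data/`); the text does not say which algorithm 4TU.Centre for Research Data uses, and a complete bag also needs a `bagit.txt` declaration, which is left out here. The function and file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir: Path) -> Path:
    """Write manifest-sha256.txt with '<checksum>  <path>' for every file under data/."""
    lines = [
        f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}"
        for p in sorted((bag_dir / "data").rglob("*"))
        if p.is_file()
    ]
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify(bag_dir: Path) -> bool:
    """Recompute every checksum and compare it with the value in the manifest."""
    manifest = bag_dir / "manifest-sha256.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        expected, rel_path = line.split(maxsplit=1)
        if sha256_of(bag_dir / rel_path) != expected:
            return False
    return True


def demo_checksum() -> tuple[bool, bool]:
    """Return (result for an intact file, result after the file was changed)."""
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp)
        (bag / "data").mkdir()
        payload = bag / "data" / "radar.txt"  # hypothetical payload file
        payload.write_text("1,2,3\n", encoding="utf-8")
        write_manifest(bag)
        intact = verify(bag)
        payload.write_text("1,2,4\n", encoding="utf-8")  # simulate a damaged file
        return intact, verify(bag)
```

In `demo_checksum()`, the first check succeeds and the second fails, because changing a single character in the file changes its checksum.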
The .zip file format (application/zip) is probably the best known. Another is application/x-gzip, the format produced by gzip. The name gzip is short for GNU zip, a data compression program for Unix and Linux.
Unix and Linux are so-called operating systems (OS). Linux is open source: you are free to use, study and modify it. Maybe you know the names of other operating systems better. Windows is the OS from Microsoft and Mac OS X is the OS from Apple. These are commercial operating systems. An operating system ensures that all applications on your PC can be run properly.
Datasets in the formats NetCDF and HDF5 are not on the server of 4TU.Centre for Research Data itself, but on another server called 'OPeNDAP'. Datasets on the OPeNDAP server are directly accessible from programming languages. OPeNDAP is a protocol for requesting data over a network, which makes data stored locally on the server available to remote locations.
If you make NetCDF or HDF5 data available through OPeNDAP, it is easier to ask the dataset a question with a so-called query, which returns a precise selection from the data.
An example: the Heavy particles in turbulent flows (Lanotte, 2011a) dataset is stored in HDF5. The dataset contains about 30 billion numbers in five dimensions. The format in which it is stored makes it possible to view a selected part of it. A single data file (Lanotte, 2011b) is 103.2 GB in size. Viewing only a cut-out of the dataset saves a lot of download time.
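Such a cut-out is typically requested by adding a constraint to the URL of the dataset. The sketch below builds a URL in the DAP2 style, in which each dimension gets a `[start:stride:stop]` range. The server address, file name and variable name are hypothetical, and the code only builds the URL; it does not contact any server.

```python
def opendap_subset_url(
    dataset_url: str,
    variable: str,
    ranges: list[tuple[int, int, int]],
    response: str = "ascii",
) -> str:
    """Build a DAP2-style URL asking for variable[start:stride:stop] per dimension."""
    hyperslab = "".join(f"[{start}:{stride}:{stop}]" for start, stride, stop in ranges)
    return f"{dataset_url}.{response}?{variable}{hyperslab}"


# Hypothetical server, file and variable: the first ten entries along the first
# dimension and only index 0 along the second dimension.
url = opendap_subset_url(
    "https://opendap.example.org/data/particles.h5",
    "velocity",
    [(0, 1, 9), (0, 1, 0)],
)
print(url)
```

Only the numbers inside the requested ranges are sent back, instead of the whole file.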
More about OPeNDAP and NetCDF at 4TU.Centre for Research Data can be found on the website of 4TU.Centre for Research Data (n.d.b.).