An in-depth look: data processing at DANS

The processing of datasets by DANS occurs in three distinct phases:

  • Ingest (deposition)
  • Processing (archiving)
  • Access (dissemination, presentation, accessibility)

The data managers at DANS follow a data processing workflow. This workflow prescribes the steps to be taken in each of the three phases listed above and is stored in EASY as a document attached to the dataset. In this document, the data manager records the actions taken, flagging each step with a name and date. The dataset can only be published once every step has been flagged.
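The publication rule described above can be sketched as a small checklist model. This is only an illustration of the logic, not the format of the actual workflow document in EASY; the names `Step`, `Workflow`, and `can_publish` are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Step:
    """One workflow step (hypothetical model, not the EASY document format)."""
    phase: str                        # "ingest", "processing" or "access"
    description: str
    flagged_by: Optional[str] = None  # name of the data manager
    flagged_on: Optional[date] = None

    def flag(self, name: str, when: date) -> None:
        # Flagging records who completed the step and on which date.
        self.flagged_by = name
        self.flagged_on = when

    @property
    def done(self) -> bool:
        return self.flagged_by is not None and self.flagged_on is not None


@dataclass
class Workflow:
    steps: List[Step] = field(default_factory=list)

    def can_publish(self) -> bool:
        # Publication is allowed only once every step has been flagged.
        return bool(self.steps) and all(step.done for step in self.steps)
```

A workflow with even one unflagged step reports `can_publish()` as `False`, which mirrors the rule that a dataset cannot be published until the whole checklist is complete.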

The data managers act according to a protocol, which is included in the Provenance document(1). In addition, an internal document describes these actions in more detail. Information on the exact procedures may be requested freely by writing to info@dans.knaw.nl.

The sections below describe every step of the data processing trajectory in detail.

Ingest


Datasets are deposited into EASY by the researchers or organizations themselves. Research Data Netherlands has made a video of this process. The depositor is also responsible for supplying the descriptive information accompanying a dataset and entering it into metadata fields, according to the (Qualified) Dublin Core international standard. The data files forming the set can be uploaded during deposition. The depositor may compress multiple files into a ZIP package, which the system unpacks automatically. Larger datasets may also be delivered through a file transfer program; contact DANS to arrange this.
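To illustrate what unpacking a deposited ZIP package involves, the following minimal sketch lists the files inside an uploaded archive using Python's standard `zipfile` module. It is not the EASY implementation; the function name `list_deposited_files` is hypothetical.

```python
import io
import zipfile
from typing import List


def list_deposited_files(zip_bytes: bytes) -> List[str]:
    """Return the names of the files contained in a ZIP package.

    Directory entries are skipped, so only actual data files are listed.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [info.filename for info in archive.infolist() if not info.is_dir()]
```

A listing like this is also a natural starting point for a completeness check, because it can be compared against the files the depositor says the dataset should contain.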

Once the dataset has been received, the data manager performs a series of checks. First, the data manager verifies that the dataset is complete and that no files are missing. Second, the data manager checks that the set is fully intelligible to other researchers. For example, tables within the dataset may use variable codes or abbreviations; in that case, a code book explaining these variables must be present. Third, the data manager checks the set for privacy-sensitive information.

In some cases, the data manager may ask the depositor to include a file list in a spreadsheet containing an explanation of the contents of the deposited datasets. At DANS, such a list is mandatory for archaeological datasets. The file list is also reviewed by the data manager.

Finally, the data manager checks the Dublin Core project description for omissions, ambiguities, and (typing) errors.

The data content is not assessed or modified. If any information is missing, the data manager will contact the depositor.

Processing


DANS has prepared a list of preferred file formats(2). These are file formats, sorted by file type, that DANS is confident will offer the best long-term guarantees in terms of usability, accessibility and sustainability. The list also includes accepted formats: widely used formats that are moderately to reasonably usable, accessible and robust in the long term, and that can often be converted easily into a preferred format.

Depositors are requested to deliver their files in a preferred format whenever possible.

For file formats not included in the preferred format list, DANS individually assesses the possibilities for each dataset: can the data be delivered in a different format? Can the files be converted?

For some exceptional formats, conversion may not be possible. As there is no other way to archive these data, DANS may still accept these files. In this case, however, DANS cannot guarantee the dataset's long-term sustainability and accessibility.
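The three-way distinction above (preferred, accepted, everything else) can be sketched as a simple lookup by file extension. The extensions below are placeholders chosen for illustration only; they are not taken from the actual DANS list of preferred formats, and the function name `classify_format` is hypothetical.

```python
from pathlib import PurePath

# Placeholder entries for illustration; NOT the actual DANS format list.
PREFERRED = {".csv", ".xml"}
ACCEPTED = {".xlsx"}


def classify_format(filename: str) -> str:
    """Classify a file by extension into one of three handling categories."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in PREFERRED:
        return "preferred"
    if suffix in ACCEPTED:
        return "accepted"
    # Formats on neither list are assessed individually for each dataset:
    # can the data be delivered in another format, or can it be converted?
    return "assess individually"
```

A lookup like this only automates the first triage step. The decision for formats on neither list remains a judgement made per dataset, as described above.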

The original files are always saved with the dataset in their unaltered form, in a separate subdirectory labelled “original”. In the case of format conversion, the converted files are attached outside this subdirectory.

DANS sees to the logical unity and structure of the dataset. For this reason, the data manager may deem it necessary to restructure the files within a dataset into directories and subdirectories.

Finally, the data manager converts the depositor's file list (see the “Ingest” section) into an XML file. The system automatically scans this file and adds the information it contains to the file details in EASY.
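The conversion of a file list into XML might look roughly like the sketch below. It assumes the spreadsheet has been exported as CSV with the columns `file_name` and `description`; these column names, the XML element names, and the function name `file_list_to_xml` are all hypothetical, as the actual schema used by EASY is not described here.

```python
import csv
import io
import xml.etree.ElementTree as ET


def file_list_to_xml(csv_text: str) -> str:
    """Convert a CSV file list into a simple XML document (assumed layout)."""
    root = ET.Element("files")
    for row in csv.DictReader(io.StringIO(csv_text)):
        # One <file> element per row, carrying the file name as an attribute.
        item = ET.SubElement(root, "file", name=row["file_name"])
        ET.SubElement(item, "description").text = row["description"]
    return ET.tostring(root, encoding="unicode")
```

Each row of the file list becomes one `<file>` element, so a system reading the XML can match entries to archived files by name and attach the description to the corresponding file details.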

Access


The depositor personally selects the access rights settings for the dataset. For example, the dataset may be set to open access - unrestricted (CC0 Waiver - No rights reserved). The data manager then ensures that the right files are made available according to these settings. In the case of format conversion, only the converted files are made public; the original files are archived with the dataset, but remain invisible to users.

On request by the depositor, the data manager can attach different access rights settings to different files.
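The interaction between the dataset-wide setting, per-file settings, and hidden originals can be sketched as follows. The access labels (`"open"`, `"restricted"`, `"hidden"`) and the names `ArchivedFile` and `effective_access` are hypothetical and do not reflect the actual settings available in EASY.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArchivedFile:
    path: str
    access: Optional[str] = None         # per-file setting; None = use dataset default
    replaced_by_conversion: bool = False  # True for an original that was converted


def effective_access(archived: ArchivedFile, dataset_default: str) -> str:
    """Resolve which access setting applies to a single file."""
    if archived.replaced_by_conversion:
        # Originals of converted files stay archived but invisible to users.
        return "hidden"
    # A per-file setting, if present, takes precedence over the dataset default.
    return archived.access or dataset_default
```

In this sketch the hidden status of a converted original always wins, the per-file setting comes second, and the depositor's dataset-wide choice applies to everything else.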

Datasets in EASY consist of three tabs: the overview (front page and summary), the description (Dublin Core project description), and the actual data files. The data manager may choose to add a more sophisticated layout or pictures to the overview in order to improve the dataset's presentation. The Dublin Core project description and any added metadata are always visible to users, regardless of access rights.

Every dataset in EASY is given a persistent identifier: a sustainable hyperlink that can be used as a source reference. The identifier is generated automatically when a dataset is deposited and is activated when the data manager publishes the dataset.

Users with access rights to the files may make their own selection of files they wish to download. A ZIP package is then created, containing the selected files, a PDF file listing DANS' general provisions, and another PDF containing the metadata attached to the selected files.

EASY has a limit of 400 files and/or 800 MB per download. For larger datasets, an alternative arrangement can be made with DANS.
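The download limit can be expressed as a simple check on a user's file selection. This sketch makes two assumptions that the text does not state: that 1 MB means 1,000,000 bytes, and that a selection of exactly 400 files or exactly 800 MB is still allowed. The function name `within_download_limit` is hypothetical.

```python
from typing import Sequence

MAX_FILES = 400
MAX_BYTES = 800 * 1000 * 1000  # assumption: 1 MB = 1,000,000 bytes


def within_download_limit(file_sizes: Sequence[int]) -> bool:
    """Check a selection of files (sizes in bytes) against both limits."""
    return len(file_sizes) <= MAX_FILES and sum(file_sizes) <= MAX_BYTES
```

A selection that fails this check is the case in which the user would need to contact DANS for an alternative arrangement.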

  Sources
