An in-depth look: data processing at 4TU.Centre for Research Data

Conversion

There are three moments at which data may be converted:

Before uploading. In principle, it is the responsibility of the dataset provider to deliver the data in a sustainable format. If data are delivered in a non-sustainable format, they must first be converted.
As an example, the IDRA data collection(1) of weather radar measurements consists of a large set of numerical files stored in NetCDF format. The party supplying this dataset did not, in the first instance, deliver the data in NetCDF, but converted the files at the direction of 4TU.Centre for Research Data.
Researchers' datasets often consist of little more than lists of numbers separated by commas. Such ad hoc formats are usually converted into NetCDF. Although a simple table could be stored as a CSV (comma-separated values) file, 4TU.Centre for Research Data prefers the NetCDF format even in these cases, because it stores metadata inside the file by default and offers more functionality.
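To illustrate what "internal metadata" means, here is a minimal sketch that renders a small CSV table as CDL, the plain-text notation for NetCDF mentioned further down this page. It is not the conversion procedure used by 4TU.Centre for Research Data. The function name `csv_to_cdl`, the sample columns, and the unit strings are assumptions made for the example, and the sketch assumes that every column is numeric and that column names are valid CDL identifiers.

```python
import csv
import io


def csv_to_cdl(csv_text, name, units):
    """Render a simple, all-numeric CSV table as CDL text.

    `units` maps column names to unit strings (hypothetical metadata
    supplied by the researcher; it is not present in the CSV itself).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames

    lines = [f"netcdf {name} {{", "dimensions:", f"    row = {len(rows)} ;", "variables:"]
    for col in columns:
        lines.append(f"    double {col}(row) ;")
        if col in units:
            # Internal metadata: the unit is stored alongside the variable.
            lines.append(f'        {col}:units = "{units[col]}" ;')

    lines.append("data:")
    for col in columns:
        values = ", ".join(str(float(row[col])) for row in rows)
        lines.append(f"    {col} = {values} ;")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample data: three measurements, one per minute.
sample_csv = "time,temperature\n0,12.5\n60,12.7\n120,13.1\n"
cdl_text = csv_to_cdl(sample_csv, "example", {"time": "s", "temperature": "degC"})
print(cdl_text)
```

Text like this can be turned into a binary NetCDF file with Unidata's ncgen utility. Unlike the bare CSV, the result records what each column is measured in.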

After uploading. As time passes, certain data formats will become unusable. Once a dataset has been deposited, it is the responsibility of 4TU.Centre for Research Data to make the necessary conversions to guarantee its sustainability.
On the old Darelux(2) dataset, for example, several conversions have been performed since it was uploaded. First, the dataset was converted into a particular XML format. It was subsequently turned into NcML(3) (NetCDF's version of XML), and then converted into NetCDF itself. Finally, the set was moved from 4TU.Centre for Research Data's own server (Fedora) to OPeNDAP. The conversion into XML was performed primarily to facilitate the inclusion of metadata, allowing for the addition of information keeping the set readable and intelligible for future use. The choice to convert the set once more into NetCDF was made because of that format's additional functionality (see the “Data interaction” tab).
In addition, 4TU.Centre for Research Data may decide to convert a dataset into a different format to save storage capacity. The set of helicopter data(4) (aerial images of highways) was delivered in TIFF, a format that produces very large files, and was later converted into PNG. Whether a conversion like this is possible or advisable depends on the specific aims of the research in question.

When downloading. When users want to download a dataset (DIP), they may prefer a specific data format. Some datasets are available in several formats: NcML as well as NetCDF, for example. From these, additional formats, such as CDL or CSV, can then be generated. Moreover, data on the OPeNDAP server (virtually all of the NetCDF and HDF5 data) can be accessed in several ways, not just as a download. For more information, see the “Data interaction” tab.

Compression

Data compression means reducing the storage space the research data is taking up, or, in other words, representing the same digital information using fewer bits than the original data did. This is useful when storing or transferring large amounts of data.

After you upload a dataset (SIP), a BagIt bag is created. This is a kind of inventory, recording the contents of the dataset. In the BagIt format, basic metadata and a so-called checksum are recorded for every file in the set. This checksum is a short value computed from every bit in the file in a fixed way, and can be seen as the file's “fingerprint”. If you calculate the checksum of a downloaded file, its value should be identical to the value calculated by the server. If it is not, something has gone wrong during the download. After the dataset is “bagged”, the whole thing is stored as a single compressed (zipped) package.
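The checksum comparison described above can be sketched as follows. This is an illustration only: the page does not say which checksum algorithm the repository uses, so SHA-256 is an assumption here, and the function names `file_checksum` and `verify_download` and the temporary file are invented for the example.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected, algorithm="sha256"):
    """True if the local file's checksum matches the published value."""
    return file_checksum(path, algorithm) == expected.lower()


# Demonstration with a throwaway file standing in for a downloaded dataset.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data.txt")
    with open(path, "wb") as handle:
        handle.write(b"hello")

    published = file_checksum(path)            # value the server would list
    intact = verify_download(path, published)

    # Simulate a damaged download by appending one stray byte.
    with open(path, "ab") as handle:
        handle.write(b"!")
    damaged_matches = verify_download(path, published)

print(intact, damaged_matches)
```

Even a one-byte change produces a different checksum, which is why a mismatch reliably signals a faulty download.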

ZIP (application/zip) is probably the best-known compression format. The MIME type application/x-gzip refers to GNU Zip, a data compression program for Unix and Linux. Unix and Linux are so-called open operating systems (OS). They are free to use, study, and edit. The names of other operating systems, like Microsoft's Windows and Apple's Mac OS X, are perhaps more familiar. An operating system ensures that your PC properly runs your applications.
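As a small illustration of both formats, the sketch below compresses the same bytes with Python's standard gzip and zipfile modules and checks that nothing is lost. The sample data and the file name `measurements.csv` are invented for the example. Repetitive data like this compresses unusually well, so the size reduction should not be read as typical.

```python
import gzip
import io
import zipfile

# Hypothetical, highly repetitive measurement data.
original = b"12.5,12.7,13.1\n" * 1000

# gzip: compresses a single stream of bytes.
gzipped = gzip.compress(original)
restored = gzip.decompress(gzipped)

# ZIP: an archive that can hold several named files, each compressed.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("measurements.csv", original)
with zipfile.ZipFile(buffer) as archive:
    unzipped = archive.read("measurements.csv")

print(len(original), len(gzipped), len(buffer.getvalue()))
```

Both round trips return exactly the original bytes: the compression is lossless, so the same information is represented with fewer bits.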

Data interaction

Datasets in the NetCDF and HDF5 formats are not stored on 4TU.Centre for Research Data's own server (Fedora), but on a separate server called OPeNDAP. Datasets on this server can be accessed directly from a programming language: the server makes the data stored on it available to remote users.

If you integrate your NetCDF or HDF5 data in OPeNDAP, it will be easier to input a query for a precisely defined selection of data. In this case, the DIP returned is a part of the AIP.

As an example, the Heavy particles in turbulent flows(5) dataset is stored in HDF5 and contains approximately 30 billion numbers arranged in five dimensions. As you can see, a single data file(6) is as large as 103.2 GB. The format in which the set has been stored, however, allows you to view a smaller selection of the data, which will save a lot of downloading time.
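A selection like this is usually requested by appending a constraint to the dataset's address. The sketch below builds such a request in the DAP2 style, where each dimension gets a [start:stride:stop] range with an inclusive stop. The server address, the variable name `velocity`, and the function `opendap_subset_url` are hypothetical and do not refer to the actual dataset above.

```python
def opendap_subset_url(dataset_url, variable, slices, response="ascii"):
    """Build a DAP2-style URL that requests only part of one variable.

    `slices` holds one (start, stride, stop) tuple per dimension;
    indices are zero-based and `stop` is inclusive.
    """
    hyperslab = "".join(f"[{start}:{stride}:{stop}]" for start, stride, stop in slices)
    return f"{dataset_url}.{response}?{variable}{hyperslab}"


# Hypothetical address and variable: the first 10 entries along the first
# dimension and every 10th entry among the first 100 along the second.
url = opendap_subset_url(
    "https://opendap.example.org/data/particles.h5",
    "velocity",
    [(0, 1, 9), (0, 10, 99)],
)
print(url)
```

Only the requested values travel over the network, rather than the whole file. Some clients require the square brackets to be percent-encoded before the request is sent.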

A large part of the OPeNDAP server is currently accessible through the 4TU.Databrowser(7), meaning you can view the contents of the OPeNDAP server through 4TU.Centre for Research Data's interface.

If you're particularly interested, the Deltares Public Wiki has notes(8) on working with OPeNDAP data.

 

 

Sources

  1. Otto, T.; Russchenberg, H.W.J.; Reinoso Rondinel, R.R.; Unal, C.M.H. (2010). IDRA weather radar measurements - all data. TU Delft. [dataset]. https://data.4tu.nl/repository/uuid:5f3bcaa2-a456-4a66-a67b-1eec928cae6d
  2. 4TU.Centre for Research Data. Collection: Darelux - River Environment Luxemburg. Retrieved from https://data.4tu.nl/repository/collection:darelux
  3. Unidata. The NetCDF Markup Language (NcML). Retrieved from http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/
  4. 4TU.Centre for Research Data. Collection: Traffic flow observations. Retrieved from https://data.4tu.nl/repository/collection:traffic_flow_obs
  5. Lanotte, A.; Calzavarini, E.; Toschi, F.; Bec, J.; Biferale, L.; Cencini, M. (2011). Heavy particles in turbulent flows RM-2007-GRAD-2048. iCFDdatabase. [dataset]. https://data.4tu.nl/repository/uuid:f7cd7b9d-ae4e-498e-92b4-7efe2d350d86
  6. Datafile. Retrieved from https://data.4tu.nl/repository/uuid:607a19d6-32c0-4b33-a8c1-95293637c2ac
  7. 4TU.Databrowser. Retrieved from http://data.4tu.nl/repository/http://data.3tu.nl/repository/resource:repository/object/search?q=
  8. Deltares. Tech Notes. Retrieved from http://publicwiki.deltares.nl/display/OET/Tech+Notes
