PIDs and data citation

Data citation is the practice of providing a reference to data in the same way as researchers routinely provide a bibliographic reference to other scholarly resources. | ANDS, 2017

The correct citation of research data - data citation - is seen as one of the most important ways in which research data can be counted as 'first-class research output'. In this section, we will show what other advantages data citation offers, what role persistent identifiers (PIDs) play and what a data citation looks like.

Working on a culture of data citation

Publishing a dataset increasingly counts as a citable contribution to a researcher's track record. DataCite (n.d.a.) is an important player in building the technical infrastructure that enables data citation. In addition, the research community itself has published two manifestos to point the way: one with data citation principles (FORCE 11, 2014) and one with software citation principles (Smith, 2016). These initiatives form the basis for building a culture of data citation (ANDS, n.d.).

Citing research data is part of the Altmetrics (2010) - alternative metrics - movement that states that the impact of your research is determined by (the references to) a wide range of research output such as datasets, software, blog posts, presentations, etc. 

Data citation: 

  • Makes data easier to find;
  • Promotes reproducibility;
  • Promotes reuse of data; 
  • Makes it possible to track the impact of the research data;
  • Creates a publication structure that enables long-term availability of data;
  • Provides a structure in which the impact of the data can be traced back to the researchers who created the data.

Persistent identifiers and data citation

To be citable, a dataset needs a persistent identifier (PID): a unique label that is linked to a digital object. This means that the object can always be found, even if its name or location changes. A PID thus prevents broken links and 'page not found' errors.

When you publish data in a data archive, a PID is automatically assigned to the data. A PID is a precondition for the F (Findable) of FAIR data: without a PID, a dataset cannot be found in a sustainable way. A PID is therefore necessary, but not sufficient, for FAIRness. If a dataset is assigned only a PID and no machine-readable metadata, it remains difficult to find unless the PID is already known. It is via the metadata that a dataset is found, and via the PID that it is then located.

In the video below we explain the role of a PID - in this case the DOI (n.d.) - in data citation.  

RDNL video concerning data citation.

Connecting PIDs

Persistent identifiers describe a kind of endpoint. To be really useful, these endpoints must be connected to each other (Haak et al., 2018). Creating a so-called 'research graph', in which the relationships between data, researchers, publications, research funders, organisational resources, etc. can be seen at a glance, requires more PIDs than those for the research data alone. A well-known PID for uniquely identifying a researcher is the ORCID iD (n.d.).

PIDs act as both unique identifiers and, critically, as connectors. By unambiguously identifying and connecting an individual researcher with their research organisations, professional activities and other contributions, we can be confident that we understand – and can assert – the relationships between each of them. And, by doing so using resolvable PIDs that incorporate FAIR metadata, we also make researchers, their affiliations and their contributions more easily discoverable. | Meadows, 2019

In the spotlight


Different PID systems and the PID guide

Different persistent identifier systems exist (DPC, 2017), for example URN, Handle, PURL, ARK and DOI. Depending on the purpose, an object can be assigned one of these persistent identifiers. The PID guide (Netwerk Digitaal Erfgoed, n.d.) takes you through about 25 questions, after which it suggests the PID system that seems best suited to your organisation and goals.

DOIs are increasingly accepted as the persistent identifier of choice when it comes to data citation. One sign of this is that systems offering other persistent identifiers are starting to offer DOIs as well. Dataverse Network at first offered only Handle and then switched to DOIs. In addition to URNs, DANS now also offers DOIs. We will therefore zoom in on the DOI below.

Zooming in on the DOI (Digital Object Identifier) as a PID for data citation

Here, we repeat what was said in the video about data citation.

A DOI (Digital Object Identifier) is ideally suited to make a digital object citable and is only assigned to objects that should remain managed and accessible for the long term. DOIs are already widely used in scientific literature to link to journal articles. By assigning a DOI to a dataset, you make the origin traceable and citable.

Structure of a DOI 

A DOI consists of two parts:

  • a prefix consisting of '10.' followed by 4 or more digits;
  • and a suffix;
  • separated by a slash.

The identification code in the prefix stands for the organisation that registered the DOI. The slash is followed by the identifier for the dataset.

Example of a DOI: 10.4121/uuid:c1ac7344-1419-4398-ba13-c757551c303f.
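The prefix/suffix structure can be illustrated with a few lines of Python. This is a minimal sketch: the function name `split_doi` and the regular expression are assumptions made for illustration, and the pattern only checks the structure described above, not the full DOI syntax.

```python
import re

# Illustrative pattern: "10." plus four or more digits, a slash, and a
# non-empty suffix. This is an assumption for the sketch, not the official
# DOI syntax definition.
DOI_PATTERN = re.compile(r"^(10\.\d{4,})/(\S+)$")


def split_doi(doi: str) -> tuple[str, str]:
    """Return (prefix, suffix) for a DOI string, or raise ValueError."""
    cleaned = doi.strip()
    # Accept the common "doi:" label in front of the identifier.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:]
    match = DOI_PATTERN.match(cleaned)
    if match is None:
        raise ValueError(f"not a DOI of the form 10.NNNN/suffix: {doi!r}")
    return match.group(1), match.group(2)


prefix, suffix = split_doi("10.4121/uuid:c1ac7344-1419-4398-ba13-c757551c303f")
print(prefix)  # 10.4121
print(suffix)  # uuid:c1ac7344-1419-4398-ba13-c757551c303f
```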

Registration

DOIs are registered via DataCite (n.d.a.) and, in the Netherlands, via DataCite Netherlands (Delft University of Technology, n.d.). A researcher receives a DOI for a dataset as soon as the data are deposited in a data archive that is a customer of DataCite. The institution then registers the DOI for the dataset that it archives. As an individual researcher, you cannot register a DOI yourself; this is the general policy of DataCite.

When a DOI is registered, it is mandatory to provide a minimum set of metadata. All mandatory, recommended and optional metadata are described in the DataCite Metadata Schema (DataCite, 2019). All registered metadata are stored in the so-called DataCite Metadata Store (DataCite, n.d.b.) and are therefore searchable.

Citation

DataCite advises how to cite a dataset if you mention it in a publication (DataCite, n.d.c.). The recommended citation style is:

Creator (PublicationYear): Title. Publisher. Identifier

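For this dataset (Keen, 2011) in 4TU.Centre for Research Data, for example, this looks as follows:

Keen, A.S (2011): Erosive Bar Migration Using Density and Diameter Scaled Sediment Erosive Profile Set-Prototype Scale (Actual Scal 1:10). TU Delft. doi:10.4121/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f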
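The recommended style can be assembled mechanically from five metadata fields. The sketch below is illustrative only: the class `DatasetRecord` and the function `format_citation` are hypothetical names, and the example values are taken from the Keen (2011) entry in the source list.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    # Hypothetical container; the field names mirror the recommended style
    # "Creator (PublicationYear): Title. Publisher. Identifier".
    creator: str
    publication_year: int
    title: str
    publisher: str
    identifier: str


def format_citation(record: DatasetRecord) -> str:
    """Build a citation string in the recommended DataCite style."""
    return (
        f"{record.creator} ({record.publication_year}): "
        f"{record.title}. {record.publisher}. {record.identifier}"
    )


example = DatasetRecord(
    creator="Keen, A.S",
    publication_year=2011,
    title=(
        "Erosive Bar Migration Using Density and Diameter Scaled Sediment "
        "Erosive Profile Set-Prototype Scale (Actual Scal 1:10)"
    ),
    publisher="TU Delft",
    identifier="doi:10.4121/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f",
)
print(format_citation(example))
```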

Persistent identifiers at DANS, 4TU.Centre for Research Data and SURF

DANS, 4TU.Centre for Research Data and SURF handle persistent identifiers differently. 4TU.Centre for Research Data uses DOIs, DANS uses DOIs and URN:NBNs, and SURF uses the Handle system. DataCite DOIs are suitable and intended for citing data, whereas the URN:NBN is primarily aimed at identification and is less used as a citation tool. Handle is an 'all-purpose' PID system and is especially useful for assigning PIDs to large quantities of objects (Netwerk Digitaal Erfgoed, n.d.).

PIDs at 4TU.Centre for Research Data

4TU.Centre for Research Data registers DOIs via DataCite Netherlands (Delft University of Technology, n.d.). Within 4TU.Centre for Research Data, all datasets that are provided with the required metadata have a DOI. They also all have a UUID (Universally Unique IDentifier). A UUID consists of 36 characters (32 hexadecimal digits and 4 dashes) in the form 8-4-4-4-12. For example: uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f. The total number of possible unique UUIDs is so large that it is unlikely that two identical UUIDs will be created.
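The 8-4-4-4-12 layout can be checked with Python's standard `uuid` module. This is a sketch under stated assumptions: the helper name `parse_prefixed_uuid` is hypothetical, and the explicit grouping check is added because `uuid.UUID()` on its own also accepts other spellings, such as a UUID written without dashes.

```python
import uuid


def parse_prefixed_uuid(value: str) -> uuid.UUID:
    """Parse a string such as 'uuid:32c53005-...' into a uuid.UUID object."""
    text = value.strip()
    # Drop the 'uuid:' label used in the example above, if present.
    if text.lower().startswith("uuid:"):
        text = text[5:]
    # Check the 8-4-4-4-12 grouping explicitly.
    groups = text.split("-")
    if [len(group) for group in groups] != [8, 4, 4, 4, 12]:
        raise ValueError(f"not in 8-4-4-4-12 form: {value!r}")
    # uuid.UUID raises ValueError if the characters are not hexadecimal.
    return uuid.UUID(text)


parsed = parse_prefixed_uuid("uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f")
print(len(str(parsed)))  # 36 characters: 32 hexadecimal digits and 4 dashes
```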

The landing page of each dataset in 4TU.Centre for Research Data has an address that consists of the URL of the data centre followed by the UUID, for example https://data.4tu.nl/repository/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f. The DOI uses the same UUID as its suffix. The landing page of the dataset says: 'please cite/link this dataset as doi:10.4121/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f'. The code 4121 in the prefix stands for 4TU.Centre for Research Data.

If you want to look up a DOI, put doi.org or dx.doi.org in front of it; you will then always arrive at the right place. You can also use a DOI resolver to resolve a DOI. The resolver must, of course, also be maintained for the long term. This is done by the International DOI Foundation. There are no concerns about maintaining the resolver: "It's too big to fail".
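Putting the resolver address in front of a DOI, as described above, can be sketched as follows. The function name `doi_to_url` is hypothetical and the code only builds the link; following it requires a network request, which is deliberately left out here.

```python
from urllib.parse import quote


def doi_to_url(doi: str, resolver: str = "https://doi.org/") -> str:
    """Turn a DOI such as 'doi:10.4121/uuid:...' into a resolvable URL."""
    cleaned = doi.strip()
    # Drop the optional 'doi:' label used on dataset landing pages.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:]
    # Percent-encode characters that are unsafe in a URL path, while keeping
    # the slash and colon that are part of the DOI itself.
    return resolver + quote(cleaned, safe="/:")


print(doi_to_url("doi:10.4121/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f"))
# https://doi.org/10.4121/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f
```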

PIDs at DANS

At DANS, all datasets have two persistent identifiers: a DOI and a URN:NBN. Both are automatically assigned when the data manager approves and publishes a deposited dataset. The DOI can be used by researchers to refer permanently to the dataset. DANS has long been using the URN:NBN persistent identifier for sustainable access to all the material in the archive. DANS manages the Dutch resolver for the URN:NBN (DANS, n.d.).

At DANS, a dataset gets two PIDs. This looks as follows:

Schöpfel, Dr. J. (University of Lille, GERiiCO laboratory) (2019): Data Papers as a New Form of Knowledge Organization in the Field of Research Data. DANS. https://doi.org/10.17026/dans-zk3-jkyb 
DOI: 10.17026/dans-zk3-jkyb
URN: urn:nbn:nl:ui:13-iy-02u8

A URN:NBN is structured as follows:

  • URN as the identifier scheme;
  • NBN as namespace for so-called National Bibliographic Numbers;
  • NL:UI to indicate that these are identifiers that have been assigned within the Netherlands;
  • A unique code for the dataset within DANS.
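The four parts listed above can be separated with a short Python function. This is a minimal sketch: the function name `parse_urn_nbn` and the dictionary keys are hypothetical, and the code assumes the five-part colon-separated layout of the DANS example rather than covering every possible URN:NBN form.

```python
def parse_urn_nbn(urn: str) -> dict[str, str]:
    """Split a URN:NBN such as 'urn:nbn:nl:ui:13-iy-02u8' into its parts."""
    # Split on the first four colons only, so the code part stays intact.
    parts = urn.strip().split(":", 4)
    if len(parts) != 5 or not all(parts):
        raise ValueError(f"expected urn:nbn:<country>:<sub>:<code>, got {urn!r}")
    scheme, namespace, country, sub_namespace, code = parts
    if scheme.lower() != "urn" or namespace.lower() != "nbn":
        raise ValueError(f"not a URN:NBN: {urn!r}")
    return {
        "scheme": scheme.upper(),                                 # URN
        "namespace": namespace.upper(),                           # NBN
        "national_prefix": f"{country}:{sub_namespace}".upper(),  # e.g. NL:UI
        "code": code,                                             # dataset code
    }


print(parse_urn_nbn("urn:nbn:nl:ui:13-iy-02u8"))
```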

PIDs at SURF

SURF has two flavours of PIDs:

  • For the SURF Data Archive (SURF, n.d.a.) SURF uses the Handle system (SURF, n.d.b.). The SURF Data Archive is suitable for storing larger quantities of data for a longer period of time. 
  • For the SURF Data Repository (SURF, n.d.c.), SURF offers DOIs in addition to Handle. To see what this looks like, take a look at the metadata of this dataset by Ishiyama (2011).

Researchers and research institutions can also register their data collections via SURF and make them accessible with the aid of PIDs (SURF, n.d.b.).

The PID forum

The PID Forum (n.d.) is a stand-alone discussion forum on 'all things PID' that originated from the FREYA project (n.d.).

PIDs for software, posters, presentations and other research output

It makes sense to not only make research data available but also software code, posters and other research output. Some examples: 

  • Software code
    You can make software code citable by archiving the code from GitHub in Zenodo. GitHub has a DIY guide available (GitHub, 2016).
  • Posters and other research output
    Posters and presentations are often shared on Figshare (n.d.) or Zenodo (n.d.a). Within Zenodo you can also create a community (Zenodo, n.d.b.) where you curate the collection of output with a group of people.  

Each upload gets its own PID in this way (both Figshare and Zenodo use the DOI as PID). For research output that does not automatically get a PID, this is an easy way to make that output findable, citable and more visible.


Sources

Altmetrics (2010). Altmetrics: a manifesto. http://altmetrics.org/manifesto/

ANDS (n.d.). Building a culture of data citation. https://www.ands.org.au/__data/assets/pdf_file/0003/383025/data_citation_poster.pdf

ANDS (2017). Data citation. ANDS Guide. https://www.ands.org.au/__data/assets/pdf_file/0005/724334/Data-citation.pdf

DANS (n.d.). Resolve identifier. http://www.persistent-identifier.nl/

DataCite (n.d.a.). https://datacite.org/

DataCite (n.d.b.). DataCite MDS API. https://mds.datacite.org/

DataCite (n.d.c.). DataCite - Cite Your Data. http://www.datacite.org.s3-website-eu-west-1.amazonaws.com/cite-your-data.html

DataCite (2019, August 16). DataCite Metadata Schema. Metadata Schema 4.4. https://schema.datacite.org/

DPC (n.d.). Persistent identifiers. https://dpconline.org/handbook/technical-solutions-and-tools/persistent-identifiers

Delft University of Technology (n.d.). DataCite Netherlands. https://www.tudelft.nl/en/library/support/datacite-netherlands/

DOI (n.d.) https://www.doi.org/

Figshare (n.d.). https://figshare.com/ 

FORCE 11 (2014). Joint Declaration of Data Citation Principles - Final. https://www.force11.org/datacitationprinciples

FREYA (n.d.). The FREYA project. https://www.project-freya.eu/en/about/mission

GitHub (2016). Making your code citable. https://guides.github.com/activities/citable-code/

Haak, L., Meadows, A., Brown, J. (2018). Using ORCID, DOI, and Other Open Identifiers in Research Evaluation. Front. Res. Metr. Anal, vol 3, p28. https://doi.org/10.3389/frma.2018.00028

Ishiyama, T., Rieder, S., Makino, J., Zwart, S.P., Groen, D., Nitadori, K., Laat, C. de, McMillan, S., Hiraki, K., Harfst, S. (2011). The Cosmogrid Simulation: Statistical Properties of Small Dark Matter Halos (2048-103). Leiden University. 10.25606/SURF.578c6039-0bf84511

Keen, A.S (2011): Erosive Bar Migration Using Density and Diameter Scaled Sediment Erosive Profile Set-Prototype Scale (Actual Scal 1:10). TU Delft. doi:10.4121/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f

Meadows, Alice, Laurel L. Haak, and Josh Brown. 2019. “Persistent Identifiers: The Building Blocks of the Research Information Infrastructure”. Insights 32 (1): 9. http://doi.org/10.1629/uksg.457

Netwerk Digitaal Erfgoed (n.d.). PID guide. https://www.pidwijzer.nl/en/pid_results/new

ORCID. (n.d.). Register for an ORCID iD. Retrieved from https://orcid.org/register

PID Forum. (n.d.) https://www.pidforum.org/

Smith, A.M., Katz, D.S., Niemeyer, K.E., FORCE11 Software Citation Working Group. (2016) Software Citation Principles. PeerJ Computer Science 2:e86. https://doi.org/10.7717/peerj-cs.86

SURF (n.d.a.). SURF Data Archive. https://www.surf.nl/en/secure-long-term-storage-with-data-archive

SURF (n.d.b.). Data Persistent Identifier: data always findable by permanent references. https://www.surf.nl/en/data-persistent-identifier-data-always-findable-by-permanent-references

SURF (n.d.c.). SURF Data Repository. https://repository.surfsara.nl/

Zenodo (n.d.a.). https://zenodo.org/

Zenodo (n.d.b.). Zenodo Communities. https://zenodo.org/communities/