PIDs and data citation

Data citation is the practice of providing a reference to data in the same way as researchers routinely provide a bibliographic reference to other scholarly resources. | ANDS, 2017

The correct citation of research data - data citation - is seen as one of the most important ways in which research data can be counted as 'first-class research output'. In this section, we will show what other advantages data citation offers, what role persistent identifiers (PIDs) play and what a data citation looks like.

Working on a culture of data citation

The publication of datasets increasingly counts as a citable contribution to a researcher's curriculum vitae. DataCite (n.d.a.) is an important player in building the technical infrastructure that enables data citation. In addition, the research community itself has published two manifestos to point the way: one with data citation principles (FORCE 11, 2014) and one with software citation principles (Smith et al., 2016). These initiatives form the basis for building a culture of data citation (ANDS, n.d.).

Citing research data is part of the Altmetrics (2010) - alternative metrics - movement, which holds that the impact of your research is determined by (references to) a wide range of research output, such as datasets, software, blog posts and presentations.

Data citation:

Makes data easier to find;
Promotes reproducibility;
Promotes reuse of data;
Makes it possible to track the impact of the research data;
Creates a publication structure that enables long-term availability of data;
Provides a structure in which the impact of the data can be traced back to the researchers who created the data.

Persistent identifiers and data citation

To be citable, a dataset needs a persistent identifier (PID): a unique label that is linked to a digital object. This means that the object can always be found, even if its name or location changes. A PID thus prevents broken links and 'page not found' errors.

When data are published in a data archive, a PID is automatically assigned to them. A PID is a precondition for the F (Findable) of FAIR data: without a PID, a dataset cannot be found in a sustainable way. A PID is therefore necessary, but not sufficient, for FAIRness. If a dataset is assigned only a PID and no machine-readable metadata, it will still be difficult to find unless the PID is known. It is via the metadata that a dataset is found and via the PID that it is then located.

In the video below we explain the role of a PID - in this case the DOI (n.d.) - in data citation.

Video: RDNL video concerning data citation.

Connecting PIDs

Persistent identifiers describe a kind of endpoint. To be really useful, these endpoints must be connected to each other (Haak et al., 2018). To create a so-called 'research graph', in which the relationships between data, researchers, publications, research funders, organisational resources, etc. can be seen at a glance, more PIDs are needed than those for the research data alone. A well-known PID for uniquely identifying a researcher is the ORCID iD (n.d.).

PIDs act as both unique identifiers and, critically, as connectors. By unambiguously identifying and connecting an individual researcher with their research organisations, professional activities and other contributions, we can be confident that we understand – and can assert – the relationships between each of them. And, by doing so using resolvable PIDs that incorporate FAIR metadata, we also make researchers, their affiliations and their contributions more easily discoverable. | Meadows, 2019
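
The idea of a research graph can be illustrated with a small sketch. The Python class below is only an illustration under stated assumptions: the class name ResearchGraph, the relation labels and all identifiers are hypothetical placeholders, not real PIDs and not an existing API.

```python
from collections import defaultdict


class ResearchGraph:
    """Toy graph in which every node is a PID and every edge a named relation."""

    def __init__(self) -> None:
        # Maps a PID to the set of (relation, other PID) pairs that start at it.
        self._edges = defaultdict(set)

    def connect(self, source_pid: str, relation: str, target_pid: str) -> None:
        """Record a relation and its inverse, so it can be followed both ways."""
        self._edges[source_pid].add((relation, target_pid))
        self._edges[target_pid].add((f"inverse of {relation}", source_pid))

    def neighbours(self, pid: str) -> list:
        """Return all (relation, PID) pairs directly connected to a PID."""
        return sorted(self._edges.get(pid, set()))


# All identifiers below are placeholders, not real PIDs.
graph = ResearchGraph()
graph.connect("orcid:PLACEHOLDER-RESEARCHER", "created", "doi:PLACEHOLDER-DATASET")
graph.connect("doi:PLACEHOLDER-ARTICLE", "cites", "doi:PLACEHOLDER-DATASET")
graph.connect("orcid:PLACEHOLDER-RESEARCHER", "affiliated with", "org:PLACEHOLDER-INSTITUTE")

# Starting from the dataset, both the citing article and the creator are reachable.
for relation, pid in graph.neighbours("doi:PLACEHOLDER-DATASET"):
    print(relation, pid)
```

Because every relation is stored in both directions, the dataset leads back to the researcher, and the researcher in turn leads on to the institute. This only works if each node has an unambiguous identifier, which is the role the PIDs play.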

In the spotlight

Different PID systems and the PID guide

Different persistent identifier systems exist (DPC, 2017), for example the URN, Handle, PURL, ARK and DOI. Depending on the purpose, an object can be assigned one of these persistent identifiers. The PID guide (Netwerk Digitaal Erfgoed, n.d.) takes you through about 25 questions, after which it suggests the PID system that seems the best fit for your organisation and goals.

DOIs are increasingly accepted as the persistent identifier of choice when it comes to data citation. This is noticeable, among other things, because systems that offer other persistent identifiers have also started to offer DOIs. At first, Dataverse Network only offered Handle and then switched to DOIs. In addition to URNs, DANS now also offers DOIs. We will therefore zoom in on the DOI below.

Persistent identifiers at DANS, 4TU.ResearchData and SURF

DANS, 4TU.ResearchData and SURF handle persistent identifiers differently. 4TU.ResearchData uses DOIs, DANS uses DOIs and URN:NBNs, and SURF uses the Handle system. DataCite DOIs are suitable and intended for citing data, whereas the URN:NBN is primarily aimed at identification and is used less as a citation tool. Handle is an 'all purpose' PID system and is especially useful for assigning PIDs to large quantities of objects (Netwerk Digitaal Erfgoed, n.d.).

PIDs at 4TU.Centre for Research Data

4TU.ResearchData registers DOIs via DataCite Netherlands. Within 4TU.ResearchData, all datasets that are provided with the required metadata have a DOI. Each of them also has a UUID (Universally Unique IDentifier). A UUID consists of 36 characters (32 hexadecimal digits and 4 dashes) in the form of 8-4-4-4-12 characters. For example: uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f. The total number of possible unique UUIDs is so large that it is very unlikely that two identical UUIDs will be created.
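
The 8-4-4-4-12 layout can be checked mechanically. The Python sketch below is a minimal illustration, not part of any 4TU.ResearchData tooling; the helper names strip_uuid_label and is_valid_uuid are hypothetical. It also prints 16 to the power 32, which is only an upper bound on the number of distinct values, since some bits of a standard UUID are reserved for version information.

```python
import re
import uuid

# Pattern for the 8-4-4-4-12 layout of hexadecimal digits described above.
UUID_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def strip_uuid_label(identifier: str) -> str:
    """Remove a leading 'uuid:' label, if present (hypothetical helper)."""
    if identifier.lower().startswith("uuid:"):
        return identifier[len("uuid:"):]
    return identifier


def is_valid_uuid(identifier: str) -> bool:
    """Return True if the identifier has the 36-character 8-4-4-4-12 form."""
    return UUID_PATTERN.match(strip_uuid_label(identifier)) is not None


example = "uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f"
bare = strip_uuid_label(example)

print(len(bare))                                # total number of characters
print([len(part) for part in bare.split("-")])  # length of each group
print(is_valid_uuid(example))
print(16 ** 32)                                 # upper bound on distinct values

# The standard library can also generate a fresh random UUID.
print(uuid.uuid4())
```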

The DOIs of 4TU.ResearchData are prefixed with the URL of the data centre and have the UUID as a suffix, for example https://data.4tu.nl/repository/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f. On the landing page of the dataset it says: 'please cite/link this dataset as doi:10.4121/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f'. The code 4121 in the DOI prefix 10.4121 stands for 4TU.Centre for Research Data.

If you want to look up a DOI, put doi.org or dx.doi.org in front of it. You will then always arrive at the right place. You can also enter the DOI in a DOI resolver. The resolver must, of course, also be maintained for the long term. This is done by the International DOI Foundation. There are no concerns about the continued maintenance of the resolver: "It's too big to fail".
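
Turning a DOI into a resolvable link can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names normalise_doi, split_doi and doi_to_url are hypothetical, the validity check is deliberately loose, and no network request is made, so the sketch only builds the link and does not verify that the DOI exists.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# Labels and resolver addresses that may already be in front of a DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(doi: str) -> str:
    """Strip a leading label or resolver address so only '10.xxxx/suffix' remains."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            return doi[len(prefix):]
    return doi


def split_doi(doi: str) -> tuple:
    """Split a DOI into its prefix (for example '10.4121') and its suffix."""
    bare = normalise_doi(doi)
    if not bare.startswith("10.") or "/" not in bare:
        raise ValueError(f"Does not look like a DOI: {doi!r}")
    prefix, suffix = bare.split("/", 1)
    return prefix, suffix


def doi_to_url(doi: str) -> str:
    """Put the resolver address in front of the DOI."""
    prefix, suffix = split_doi(doi)
    # Keep '/' and ':' readable; other special characters are percent-encoded.
    return DOI_RESOLVER + quote(f"{prefix}/{suffix}", safe="/:")


print(split_doi("doi:10.4121/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f"))
print(doi_to_url("doi:10.4121/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f"))
print(doi_to_url("10.17026/dans-zk3-jkyb"))
```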

PIDs at DANS

At DANS, all datasets have two persistent identifiers: a DOI and a URN:NBN. Both are automatically assigned when the data manager approves and publishes a deposited dataset. The DOI can be used by researchers to refer permanently to the dataset. DANS has been using the URN:NBN persistent identifier for sustainable access to all the material in the archive for a long time. DANS manages the Dutch resolver for the URN:NBN (DANS, n.d.).

A DANS dataset gets two PIDs. This looks as follows:

Schöpfel, Dr. J. (University of Lille, GERiiCO laboratory) (2019): Data Papers as a New Form of Knowledge Organization in the Field of Research Data. DANS. https://doi.org/10.17026/dans-zk3-jkyb
DOI: 10.17026/dans-zk3-jkyb
URN: urn:nbn:nl:ui:13-iy-02u8
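
The citation above follows a simple pattern: creator, year, title, publisher and the DOI as a link. The Python sketch below reproduces that pattern. The DatasetRecord class and the format_citation function are hypothetical names used for illustration only; they are not part of any DANS service.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """The minimal metadata needed for the citation pattern shown above."""
    creator: str
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(record: DatasetRecord) -> str:
    """Build 'Creator (Year): Title. Publisher. DOI link' from a record."""
    return (
        f"{record.creator} ({record.year}): {record.title}. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


record = DatasetRecord(
    creator="Schöpfel, Dr. J. (University of Lille, GERiiCO laboratory)",
    year=2019,
    title=(
        "Data Papers as a New Form of Knowledge Organization "
        "in the Field of Research Data"
    ),
    publisher="DANS",
    doi="10.17026/dans-zk3-jkyb",
)

print(format_citation(record))
```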

A URN:NBN is structured as follows:

URN as the identifier scheme;
NBN as namespace for so-called National Bibliographic Numbers;
NL:UI to indicate that these are identifiers that have been assigned within the Netherlands;
A unique code for the dataset within DANS.
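
The four parts listed above can be separated with a short Python sketch. It assumes the Dutch layout shown in the example (urn, nbn, a two-part country section such as nl:ui, then the code); the class and function names are hypothetical, and URN:NBNs issued in other countries may be structured differently.

```python
from typing import NamedTuple


class UrnNbn(NamedTuple):
    scheme: str      # 'urn', the identifier scheme
    namespace: str   # 'nbn', National Bibliographic Numbers
    country: str     # 'nl:ui', assigned within the Netherlands
    code: str        # unique code for the dataset


def parse_urn_nbn(identifier: str) -> UrnNbn:
    """Split a Dutch-style URN:NBN into the four parts listed above."""
    parts = identifier.strip().split(":", 4)
    if len(parts) != 5:
        raise ValueError(f"Expected five colon-separated parts: {identifier!r}")
    scheme, namespace, country, sub_namespace, code = parts
    if scheme.lower() != "urn" or namespace.lower() != "nbn":
        raise ValueError(f"Not a URN:NBN: {identifier!r}")
    return UrnNbn(
        scheme=scheme.lower(),
        namespace=namespace.lower(),
        country=f"{country}:{sub_namespace}".lower(),
        code=code,
    )


parsed = parse_urn_nbn("urn:nbn:nl:ui:13-iy-02u8")
print(parsed)
```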

PIDs at SURF

SURF has two flavours of PIDs:

For the SURF Data Archive (SURF, n.d.a.) SURF uses the Handle system (SURF, n.d.b.). The SURF Data Archive is suitable for storing larger quantities of data for a longer period of time.
For the SURF Data Repository (SURF, n.d.c.), SURF offers DOIs in addition to Handle. To see what this looks like, take a look at the metadata of this dataset by Ishiyama et al. (2011).

Researchers and research institutions can also register their data collections via SURF and make them accessible with the aid of PIDs (SURF, n.d.b.).

The PID forum

The PID Forum (n.d.) is a stand-alone discussion forum on 'all things PID' that originated from the FREYA project (n.d.).

PIDs for software, posters, presentations and other research output

It makes sense to make available not only research data, but also software code, posters and other research output. Some examples:

Software code
You can make software code citable by publishing the code from GitHub to Zenodo. GitHub has a DIY guide available (GitHub, 2016).
Posters and other research output
Posters and presentations are often shared on Figshare (n.d.) or Zenodo (n.d.a.). Within Zenodo you can also create a community (Zenodo, n.d.b.) in which you curate the collection of output together with a group of people.

Each upload gets its own PID in this way (both Figshare and Zenodo have the DOI as PID). For research output that does not automatically get a PID, this is an easy way to make that output findable, citable and more visible.

Sources

Altmetrics (2010). Altmetrics: a manifesto. http://altmetrics.org/manifesto/

ANDS (n.d.). Building a culture of data citation. https://www.ands.org.au/__data/assets/pdf_file/0003/383025/data_citation_poster.pdf

ANDS (2017). Data citation. ANDS Guide, awareness level. https://www.ands.org.au/__data/assets/pdf_file/0005/724334/Data-citation.pdf

DANS (n.d.). Resolve identifier. http://www.persistent-identifier.nl/

DataCite (n.d.a.). https://datacite.org/

DataCite (n.d.b.). DataCite MDS API. https://mds.datacite.org/

DataCite (n.d.c.). DataCite - Cite Your Data. http://www.datacite.org.s3-website-eu-west-1.amazonaws.com/cite-your-data.html

DataCite (2019, August 16). DataCite Metadata Schema. Metadata Schema 4.4. https://schema.datacite.org/

DPC (n.d.). Persistent identifiers. https://dpconline.org/handbook/technical-solutions-and-tools/persistent-identifiers

Delft University of Technology (n.d.). DataCite Netherlands. https://www.tudelft.nl/en/library/support/datacite-netherlands/

DOI (n.d.). https://www.doi.org/

Figshare (n.d.). https://figshare.com/

FORCE 11 (2014). Joint Declaration of Data Citation Principles - Final. https://www.force11.org/datacitationprinciples

FREYA (n.d.). The FREYA project. https://www.project-freya.eu/en/about/mission

GitHub (2016). Making your code citable. https://guides.github.com/activities/citable-code/

Haak, L., Meadows, A., Brown, J. (2018). Using ORCID, DOI, and Other Open Identifiers in Research Evaluation. Front. Res. Metr. Anal, vol 3, p28. https://doi.org/10.3389/frma.2018.00028

Ishiyama, T., Rieder, S., Makino, J., Zwart, S.P., Groen, D., Nitadori, K., Laat, C. de, McMillan, S., Hiraki, K., Harfst, S. (2011). The Cosmogrid Simulation: Statistical Properties of Small Dark Matter Halos (2048-103). Leiden University. 10.25606/SURF.578c6039-0bf84511

Keen, A.S. (2011). Erosive Bar Migration Using Density and Diameter Scaled Sediment Erosive Profile Set-Prototype Scale (Actual Scal 1:10). TU Delft. doi:10.4121/uuid:32c53005-a4f2-447c-b231-6cdb7dcdd17f

Meadows, A., Haak, L.L., Brown, J. (2019). Persistent Identifiers: The Building Blocks of the Research Information Infrastructure. Insights 32 (1): 9. http://doi.org/10.1629/uksg.457

Netwerk Digitaal Erfgoed (n.d.). PID guide. https://www.pidwijzer.nl/en/pid_results/new

ORCID (n.d.). Register for an ORCID iD. https://orcid.org/register

PID Forum (n.d.). https://www.pidforum.org/

Smith, A.M., Katz, D.S., Niemeyer, K.E., FORCE11 Software Citation Working Group (2016). Software Citation Principles. PeerJ Computer Science 2:e86. https://doi.org/10.7717/peerj-cs.86

SURF (n.d.a.). SURF Data Archive. https://www.surf.nl/en/secure-long-term-storage-with-data-archive

SURF (n.d.b.). Data Persistent Identifier: data always findable by permanent references. https://www.surf.nl/en/data-persistent-identifier-data-always-findable-by-permanent-references

SURF (n.d.c.). SURF Data Repository. https://repository.surfsara.nl/

Zenodo (n.d.a.). https://zenodo.org/

Zenodo (n.d.b.). Zenodo Communities. https://zenodo.org/communities/
