People say jargon is a bad thing, but it's really a shortcut vocabulary professionals use to understand one another | Erin McKean
In this section you will find a list of terms that are often used by data supporters. In addition, in three underlying sections we pay extra attention to a number of basic concepts that form the visible and invisible thread of this course.
List of terms
Below you will find a list of terms that you can refer to during the course. Alternatively, you can consult the list of terms by CASRAI (n.d.), the LCRDM glossary (n.d.), the glossary from DCC (n.d.) or the Data Glossary from Science Europe (n.d.).
A data archive with a CoreTrustSeal (CTS) certification complies with requirements ensuring that in the future, research data can still be processed in a high-quality and reliable manner.
A data archive is a facility which moves data to an environment for long-term retention. A data archive is indexed and has search facilities, enabling data to be retrieved.
The way in which data or information is coded and stored. A data format (or file format) gives information on how to process the data.
A copy of the data for the purpose of creating a duplicate dataset.
Data management plan (DMP)
A DMP is a written agreement describing the research project, the type and volume of data produced and stating which data will be saved, how they will be saved (file format, version control, metadata), whether and when data will be submitted to a repository and under which terms. If necessary, it describes the tools (hardware and software) that are required to (re)use the data.
Data Protection Impact Assessment (DPIA)
If data processing poses a high privacy risk for the participants in a study, then it is necessary, in accordance with Article 35 of the GDPR, to perform a Data Protection Impact Assessment (DPIA). A DPIA is carried out to assess "the origin, nature, particularity and seriousness of the risk to the rights and freedoms of natural persons". The result of the assessment must be taken into account when determining the correct measures to process the personal data in order to reduce privacy risks.
Erasmus University offers a decision tree to help you decide when you need a DPIA (Erasmus University, 2018).
Data provenance is a historical record of the data and its origins. It refers to the process of tracing and recording the origins of data and its movement between databases (Buneman, 2000).
A general term for a location to store data. A data repository with a policy for long-term preservation is called a data archive.
People tweeting about data.
The DOI (Digital Object Identifier) is a unique and stable identifier that ensures that a digital object can be permanently found on the World Wide Web, regardless of changes in the URL. A central registry ensures that the user of a DOI will be referred to its current location.
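As a minimal sketch of how this works in practice: a DOI becomes a clickable link when it is prefixed with the address of the central resolver, https://doi.org/. The helper name `doi_to_url` is hypothetical and only illustrates the idea. The DOI in the usage line is the one listed for Wilkinson et al. (2016) in the sources of this section.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# Prefixes that people commonly put in front of a bare DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a DOI (hypothetical helper)."""
    doi = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10.".
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are not safe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.1038/sdata.2016.18"))
# prints: https://doi.org/10.1038/sdata.2016.18
```

The link keeps working when the publisher moves the object, because the resolver redirects to whatever location is currently registered for the DOI.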
FAIR data is data that is Findable, Accessible, Interoperable and Reusable (GO FAIR, n.d.). The ‘FAIR Guiding Principles for scientific data management and stewardship’ (Wilkinson et al., 2016) provide guidelines to improve the findability, accessibility, interoperability, and reuse of digital assets.
The General Data Protection Regulation (GDPR, European Union, 2016) protects the privacy rights of individuals and assigns responsibilities to those who process the personal data of others. The GDPR has been in force since May 2018. See the corresponding paragraph in this course.
The integrity of research is based on adherence to core values— objectivity, honesty, openness, fairness, accountability, and stewardship. These core values help to ensure that the research enterprise advances knowledge | Fostering integrity in research, 2017
Linked data (n.d.) is a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information and knowledge on the Semantic Web using RDF. Linked data refers to data published on the web in such a way that it is machine-readable, that its meaning is explicitly defined, that it is linked to other external data sets, and that in turn it can be linked to from external data sets.
A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike. (Open Knowledge Foundation, n.d.)
According to Foster Open Science (n.d.), open science represents a new approach to the scientific process based on cooperative work and new ways of diffusing knowledge by using digital technologies and new collaborative tools (European Commission, 2016b:33). The OECD defines Open Science as: “to make the primary outputs of publicly funded research results – publications and the research data – publicly accessible in digital format with no or minimal restriction” (OECD, 2015:7), but it is more than that. Open Science is about extending the principles of openness to the whole research cycle (see figure 1), fostering sharing and collaboration as early as possible, thus entailing a systemic change to the way science and research is done.
Persistent identifier (PID)
A unique code that is coupled to a digital object. With this code, the object can be identified even when the object is moved to a different location. The DOI (Digital Object Identifier) is an example of a persistent identifier.
A file format that, to the best of our current knowledge, has the best chance of being usable in the (far) future.
Preregistration of an analysis plan is committing to analytic steps without advance knowledge of the research outcomes. That commitment is usually accomplished by posting the analysis plan to an independent registry such as https://clinicaltrials.gov/ or https://osf.io (Nosek, 2018).
Two perspectives on preservation are distinguished:
- Short-term preservation: Keeping data available in its present shape. This is also called data archiving.
- Long-term preservation: Keeping data available in a usable shape for future users.
Keeping data in its present shape means protecting data from incidental loss and making data findable through proper metadata. Long-term preservation adds the task of changing the data format in a reliable way and being accountable for all manipulations, in order to keep the data in a shape that is demanded by future software or future working practices of the designated community (the intended audience for the data).
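One common way to notice incidental loss or corruption is a fixity check: record a checksum when a file is stored and compare it again later. The text above does not prescribe this technique; the sketch below is an illustration only, and the function names `sha256_of` and `is_unchanged` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large data files do not fill memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_checksum: str) -> bool:
    """Compare the current checksum with the one recorded at deposit."""
    return sha256_of(path) == recorded_checksum
```

A checksum recorded at deposit and stored with the metadata lets an archive show later that a file is still bit-for-bit the same. After a format migration the new file gets a new checksum, and that change is logged as one of the manipulations.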
RDF (Resource Description Framework) is a standard model for data interchange on the Web (W3C, n.d.). With RDF, relationships between digital objects are defined.
First, every digital object receives a URI: a Uniform Resource Identifier that uniquely identifies a particular resource. In many cases the URI is a URL, which also states where and how the resource can be accessed. Each digital object is then linked to other objects by means of so-called RDF triples. Simply put, an RDF triple says: object X has relationship Y with object Z. This way of representing relationships is also known as linked data.
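The "X has relationship Y with Z" pattern can be sketched with plain Python tuples. This is a simplified illustration, not an RDF library or an official serialisation. The `https://example.org/` namespace, the relationship names and the helper functions are all hypothetical.

```python
# Hypothetical namespace used to build example URIs.
EX = "https://example.org/"

# Each triple is (subject X, relationship Y, object Z).
triples = [
    (EX + "dataset/1", EX + "createdBy", EX + "person/alice"),
    (EX + "dataset/1", EX + "describedIn", EX + "article/7"),
    (EX + "article/7", EX + "writtenBy", EX + "person/alice"),
]


def objects(subject: str, relationship: str) -> list[str]:
    """Return every Z for which (subject, relationship, Z) is stated."""
    return [o for s, p, o in triples if s == subject and p == relationship]


def linked_from(target: str) -> list[str]:
    """Return every X that points at the target, whatever the relationship."""
    return sorted({s for s, p, o in triples if o == target})


print(objects(EX + "dataset/1", EX + "createdBy"))
# prints: ['https://example.org/person/alice']
print(linked_from(EX + "person/alice"))
# prints: ['https://example.org/article/7', 'https://example.org/dataset/1']
```

Because subjects, relationships and objects are all URIs, triples published by different parties can refer to the same object and so be combined.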
Reproducibility means that research data and code are made available so that others are able to reach the same results as are claimed in scientific outputs. Closely related is the concept of replicability, the act of repeating a scientific methodology to reach similar conclusions. These concepts are core elements of empirical research (The Open Science Training Book, 2018).
Research data are facts, observations or experiences on which an argument or theory is based. (ANDS, 2017)
The research lifecycle is the process that a researcher follows to complete a project or study from its inception to its completion. Research data management is involved in each step of the research process (NNLM, n.d.).
Text and data mining (TDM)
Text and data mining (TDM) is the process of deriving information from machine-read material. It works by copying large quantities of material, extracting the data, and recombining it to identify patterns (UK Government, n.d.).
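The three steps named above (copy material, extract data, recombine it to find patterns) can be sketched on a toy scale. The three sentences below are invented for illustration, and the function names are hypothetical. Real TDM works on far larger collections and with more refined extraction.

```python
import re
from collections import Counter

# Step 1: a (tiny, invented) copy of the material to be mined.
documents = {
    "doc1": "Research data are archived in a repository.",
    "doc2": "The repository assigns a persistent identifier to the data.",
    "doc3": "Metadata make research data findable.",
}


def extract_terms(text: str) -> list[str]:
    """Step 2: extract lower-case words from the raw text."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(docs: dict[str, str]) -> Counter:
    """Step 3: recombine the terms and count in how many documents each occurs."""
    counts = Counter()
    for text in docs.values():
        counts.update(set(extract_terms(text)))
    return counts


frequency = document_frequency(documents)
print(frequency["data"], frequency["repository"], frequency["findable"])
# prints: 3 2 1
```

Terms that occur in many documents, such as "data" here, are a very simple example of a pattern that only becomes visible once the texts are processed together.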
Trusted digital repository (TDR)
A certified digital repository that has been set up to provide reliable, sustainable access to the data deposited. TDRs may be certified at three levels:
- A Basic Certification with CoreTrustSeal certification;
- An Extended Certification after the repository performs a self-audit in accordance with ISO 16363 (or DIN 31644);
- A Formal Certification granted on top of an Extended Certification after an additional external audit and certification according to ISO 16363 or DIN 31644.
Virtual Research Environment (VRE)
A Virtual Research Environment (VRE) is a virtual working environment for researchers. A VRE combines various tools for data management in one environment, thus supporting the researcher's workflow and providing a safe working environment.
ANDS (2017). ANDS Guides and Resources. What is research data. https://www.ands.org.au/guides/what-is-research-data
Buneman, P., Khanna, S., Tan, W.C. (2000). Data Provenance: Some Basic Issues. In: Kapoor S., Prasad S. (eds) FST TCS 2000: Foundations of Software Technology and Theoretical Computer Science. FSTTCS 2000. Lecture Notes in Computer Science, vol 1974. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44450-5_6 (Available at http://db.cis.upenn.edu/DL/fsttcs.pdf)
CASRAI (n.d.). Category:Terms. https://dictionary.casrai.org/Category:Terms
DCC (n.d.). Glossary. http://www.dcc.ac.uk/digital-curation/glossary
Erasmus University (2018). Personal Data and Privacy Impact Assessment in Research. https://www.erim.eur.nl/fileadmin/user_upload/Privacy_assessment.pdf
Force 11. (2014). The FAIR data principles. Retrieved from https://www.force11.org/group/fairgroup/fairprinciples
Foster Open Science (n.d.). What is open science. Introduction. https://www.fosteropenscience.eu/content/what-open-science-introduction
Foster Open Science (2018). Open Science Training book. 4. Reproducible Research and Data Analysis. https://book.fosteropenscience.eu/en/02OpenScienceBasics/04ReproducibleResearchAndDataAnalysis.html
GO FAIR (n.d.). FAIR principles. Retrieved from https://www.go-fair.org/fair-principles/
LCRDM (n.d.). LCRDM Glossary. https://www.lcrdm.nl/en/glossary
NNLM (n.d.). Research Lifecycle. https://nnlm.gov/data/thesaurus/research-lifecycle
Nosek, B.A., Ebersole, C.R, DeHaven, A.C., Mellor, D.T. (2018). The preregistration revolution. PNAS March 13, 2018 115 (11) 2600-2606. Retrieved from https://doi.org/10.1073/pnas.1708274114
National Academies of Sciences, Engineering, and Medicine (2017). Fostering Integrity in Research. Washington, DC: The National Academies Press. https://doi.org/10.17226/21896
Open Knowledge Foundation (n.d.). The open definition. https://opendefinition.org/
Science Europe (n.d.). Science Europe Data Glossary. http://sedataglossary.shoutwiki.com/wiki/Main_Page
UK Government (n.d.). Text Mining and Data Analytics in Call for Evidence Responses. https://webarchive.nationalarchives.gov.uk/20140603125140/http://www.ipo.gov.uk/ipreview-doc-t.pdf
W3C (n.d.). RDF. https://www.w3.org/RDF/
Wilkinson, M.D. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3. Retrieved from https://dx.doi.org/10.1038/sdata.2016.18