"If you don’t think you have a quality problem with your data, you haven’t looked at it yet." - Jeni Tennison(1)
How do researchers collect data? It is done in one of two ways(2):
By using existing data.
Existing data ('secondary data') is:
- Data collected by others, such as data collected by large organizations, e.g. CBS (Central Bureau of Statistics, Netherlands), the Kadaster and ministries. Centerdata(3), for instance, is an institute that collects and analyses (panel) data and makes it available for scientific research.
- Research data that have been stored in data archives by other researchers and can be reused.
By collecting their own research data.
Created data ('primary data') are the primary results of all types of research. These data are produced by the researcher. A distinction can be made between raw data and processed data, but in practice the divide is not clear-cut. Measuring equipment is becoming more and more advanced, and part of the processing may already have been done by the equipment before the data are even available. In that case, the 'raw' data have in fact already been partly processed.
There are roughly five different ways to create research data:
- Through observation.
This type of data can generally be collected only once and is, therefore, unique. Examples are climate data, astronomical observations, archaeological excavations, opinion polls and surveys.
- By experimenting.
Data collected through experiments (with the aid of lab equipment). For example, the synthesis of new molecules, gene sequence analysis and psychological tests. In general, these experiments can be repeated.
- By simulation (test models).
Climate models and economic models, for example. The results of simulations can usually be reproduced. It is therefore more useful to store the model itself and its metadata than the data resulting from the simulations.
- By data processing.
Combining, reprocessing, (re)grouping, etc. of previously created data.
- By researching sources.
For example, data derived from archival and literature research in order to compose texts, or series of 'measurable' data from archived material, manuscripts and (professional) publications. Specialist queries of large linguistic databases are another example.
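The remark about simulations in the list above (store the model and its metadata rather than the output) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy random-walk model, the function name `run_simulation` and the metadata fields are hypothetical and stand in for a real climate or economic model.

```python
import json
import random


def run_simulation(params):
    """Toy random-walk model (hypothetical stand-in for a real model)."""
    rng = random.Random(params["seed"])  # a fixed seed makes the run repeatable
    value = params["start"]
    series = []
    for _ in range(params["steps"]):
        value += rng.gauss(0.0, params["sigma"])
        series.append(value)
    return series


# Hypothetical metadata record: the parameters needed to regenerate the output.
metadata = {
    "model": "toy_random_walk",
    "seed": 42,
    "start": 0.0,
    "steps": 1000,
    "sigma": 0.5,
}

# Only this small JSON document (plus the model code) is archived.
stored = json.dumps(metadata)

# A later run from the archived metadata reproduces the original results.
first_run = run_simulation(metadata)
second_run = run_simulation(json.loads(stored))
print(first_run == second_run)  # prints True
```

The stored metadata is a few lines of text, while the simulated series can be arbitrarily large. Note that this only holds if the model code and the software environment are preserved along with the metadata.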
Researchers collecting their own research data have to take a variety of factors into account. For instance, has the measuring equipment been properly calibrated? Does it measure what it is supposed to measure? Are the survey questions not too leading? Questions like these are essential to data quality, but are not covered in this course.