Research Data Management: Data Description
Data description & collection/reuse of existing data
Describe how new data will be collected or produced and/or how existing data will be re-used
Points to consider:
- Explain which methodologies or software will be used if new data are collected or produced.
- State any constraints on re-use of existing data if there are any.
- Explain how data provenance will be documented.
- Briefly state the reasons if the re-use of any existing data sources has been considered but discarded.
Describe what data (for example the kind, formats, and volumes) will be collected or produced
Points to consider:
- Your DMP can be used as an inventory of datasets across a project.
- Give details on the kind of data, for example numeric (databases, spreadsheets), textual (documents), image, audio, video, and/or mixed media.
- Give details on the data format, the way in which the data is encoded for storage, often reflected by the filename extension (for example pdf, xls, doc, txt, or rdf).
- Justify the use of certain formats. For example decisions may be based on staff expertise within the host organisation, a preference for open formats, standards accepted by data repositories, widespread usage within the research community, or on the software or equipment that will be used.
- Give preference to open and standard formats as they facilitate sharing and long-term reuse of data (several repositories provide lists of such ‘preferred formats’).
- Give details on the volumes (they can be expressed in storage space required (bytes), and/or in numbers of objects, files, rows and columns) - the volume of data you anticipate generating will have an impact on the storage solution needed for the project.
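One way to estimate the volumes described above is to take an inventory of an existing or pilot data folder. The following is a minimal sketch in Python, not a prescribed method: the function name `inventory_by_extension` is hypothetical and the folder you pass to it is your own. It counts files and sums their sizes in bytes, grouped by filename extension.

```python
from collections import defaultdict
from pathlib import Path


def inventory_by_extension(root):
    """Return {extension: (file_count, total_bytes)} for all files under root."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Group ".CSV" and ".csv" together; label files without an extension.
            ext = path.suffix.lower() or "(no extension)"
            totals[ext][0] += 1
            totals[ext][1] += path.stat().st_size
    return {ext: (count, size) for ext, (count, size) in totals.items()}
```

Calling `inventory_by_extension("my_project_data")`, where `my_project_data` is a placeholder for your own folder, gives figures that can be copied into the volume section of a DMP and scaled up to the expected size of the full project.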
Advantages of using existing data
- Datasets may be impossible to create within the scope of your research project
- It can be cost effective & time saving to use data which has already been collected
- Ethical issues about data collection have already been dealt with
- You can spend the bulk of your time analysing data
- There is a huge breadth of data available, even in an Irish context
Disadvantages of using existing data
- The data were not collected to answer your specific research questions.
- Particular information may not have been collected
- The data may refer to a different geographic region than you are interested in studying
- The data may refer to a different time period than you are interested in studying
- Variables may have been defined or categorised differently than you would have liked
- You were not directly involved in the data collection process
- There may have been a low response rate
- Anonymisation may be quite extensive, so variables you are interested in may not be available
When using existing data it is essential to:
- Ensure you have permission to use/remix/publish the data
- Check out the associated documentation for collection procedures, data cleaning procedures and other technical information.
- Spend time getting to know and understanding the data.
- Be practical about whether data are suitable (good enough) for your research.
Documentation helps you to understand the meaning of the data & to evaluate suitability for your research question. It can help you understand exactly what information was collected, from whom, where & when, as well as what was done to the resulting data before it was archived.
Documentation can include:
- Study description (metadata)
- User guide
- Codebook or data dictionary
- Survey questions
- Official reports
Citing existing data used in your research
When using existing data, be sure to cite the dataset and acknowledge the data authors and the repository/archive used to obtain the data.
Data citations should include the following components:
Data author(s), Full Title of the Dataset, Persistent Identifier, Data Repository or Archive, Version
Some authors or publishers may require more components in their data citations. Be sure to check before citing.
See some of the citation guides provided by different repositories below.
- Dataverse Data Citation Guide
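The five citation components listed above can be assembled programmatically, which helps when a project cites many datasets. The sketch below is only an illustration: the function name `format_data_citation`, the comma-separated layout and the placeholder values are assumptions, and the exact order and punctuation required by a given repository or publisher may differ.

```python
def format_data_citation(authors, title, identifier, repository, version):
    """Join the five citation components into one comma-separated line."""
    components = {
        "authors": "; ".join(authors),
        "title": title,
        "identifier": identifier,
        "repository": repository,
        "version": version,
    }
    # Refuse to build a citation if any component is empty.
    missing = [name for name, value in components.items() if not str(value).strip()]
    if missing:
        raise ValueError(f"Missing citation component(s): {', '.join(missing)}")
    return ", ".join(str(value).strip() for value in components.values())


# Placeholder values only; substitute the details of the dataset you used.
print(format_data_citation(
    ["Surname, A.", "Other, B."],
    "Placeholder Dataset Title",
    "doi:10.0000/placeholder",
    "Placeholder Repository",
    "V1",
))
```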
Things to consider when choosing a file format:
- How you plan to analyse your data
- Which software and file formats you and your colleagues have used in the past
- Any discipline specific norms or technical standards
- Whether file formats are at risk of obsolescence because of their dependence on a particular technology.
- Which formats are best to use for the long-term preservation of data
- Whether important information might be lost by converting between different formats
- The possibility of embedding metadata that describes content within the file itself, e.g. creator information, variable names and labels
Sometimes it is useful to store your data using one format for data collection and analysis and also in a more open or accessible format for sharing or archiving once your project is complete. If you intend to share your data, your chosen archive or repository will likely have recommended file formats based on best practice within the disciplines they support.
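As an illustration of keeping a second, more open copy for sharing, the sketch below writes tabular data to a CSV file and writes the variable names and labels to a separate CSV data dictionary, so that the descriptive information is not lost in the conversion. It uses only the Python standard library. The function name `export_for_sharing` and the two-file layout are assumptions made for this example; your repository may recommend a different arrangement.

```python
import csv


def export_for_sharing(rows, variables, csv_path, dictionary_path):
    """Write rows to a CSV file and variable labels to a CSV data dictionary.

    rows: list of dicts keyed by variable name.
    variables: dict mapping each variable name to a human-readable label.
    """
    fieldnames = list(variables)
    # The data file: one header row, then one row per record.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    # The data dictionary: one row per variable with its label.
    with open(dictionary_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["variable", "label"])
        writer.writerows(variables.items())
```

Writing both files as UTF-8 plain text keeps them readable without the software that produced the original data.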
Choosing file formats
When choosing file formats for research data it's important to consider whether the format is:
- Open & non-proprietary
- Ubiquitous
- Uncompressed or lossless
File formats that are open or non-proprietary will tend to retain a good chance of remaining accessible, even if the software that created them is no longer available. Specialised proprietary formats used only by a niche set of users may present problems for future use. Formats which are ubiquitous or have become the default standard within a discipline, whether proprietary or not, are also more likely to be maintained into the future. This is important whether you plan on sharing and archiving your data at the end of your research project or whether you simply want the data to remain accessible to yourself and other researchers in your department.
- Proprietary format: Photoshop .psd file
- Open format: .tiff image file
Formats that use 'lossy' compression are often smaller in file size, but some of the data are essentially thrown away as part of the encoding process.
- Lossy formats: .mp3 audio file, .jpeg image file
- Lossless formats: .wav audio file, .tiff image file
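The distinctions above can be turned into a simple check on the files in a project. The following sketch records only the example formats named in this guide (with `.tif` and `.jpg` added as alternative extensions for TIFF and JPEG); the set names and the function `describe_format` are hypothetical, and the sets would need to be extended for real use.

```python
from pathlib import Path

# Only the example formats named in this guide; extend for your discipline.
PROPRIETARY = {".psd"}
OPEN = {".tiff", ".tif"}
LOSSY = {".mp3", ".jpeg", ".jpg"}
LOSSLESS = {".wav", ".tiff", ".tif"}


def describe_format(filename):
    """Return the labels that apply to a file, based on its extension."""
    ext = Path(filename).suffix.lower()
    labels = []
    if ext in PROPRIETARY:
        labels.append("proprietary")
    if ext in OPEN:
        labels.append("open")
    if ext in LOSSY:
        labels.append("lossy")
    if ext in LOSSLESS:
        labels.append("lossless")
    # Extensions not listed above are reported as unknown rather than guessed.
    return labels or ["unknown"]
```

For example, `describe_format("scan.tiff")` returns `["open", "lossless"]`, while an extension that is not in any set returns `["unknown"]` and should be checked by hand.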
Choosing a file format
If you aren't aware of any standards within your discipline the following is a good reference point:
- Textual data: eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml), Plain text data, ASCII (.txt), PDF/A (.pdf, Archival PDF)
- Tabular data with extensive metadata: Delimited text and command ('setup') file (SPSS, Stata, SAS, etc.) containing metadata information
- Tabular data with minimal metadata (including spreadsheets): Comma-separated values (CSV) file (.csv)
- Databases: eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml), Comma-separated values (CSV) file (.csv)
- Images: TIFF version 6 uncompressed (.tif), JPEG (.jpeg, .jpg) (note: JPEG is a 'lossy' format that loses information each time a file is re-saved, so only use it if you are not concerned about image quality)
- Audio: Free Lossless Audio Codec (FLAC) (.flac), Waveform Audio Format (WAV) (.wav), MPEG-1 Audio Layer 3 (.mp3) but only if created in this format
Examples of research data:
- Interviews
- Diaries
- Anthropological field notes
- Focus groups
- Answers to survey questions
- Transcribed test responses
- Coded numerical responses to surveys
- Digital audio or video recordings
- Digital images
- Database contents
- Digital models, algorithms or scripts
- Maps & geospatial data
- Ephemera
- Archival material
- Text documents, notes
- Numerical data
- Questionnaires, surveys, survey results
- Audio and video recordings, photos
- Database content (video, audio, text, images)
- Mathematical models, algorithms
- Software (scripts, input files ...)
- Results of computer simulations
- Laboratory protocols
- Methodological descriptions
- Sequence data
Research records that are important to manage throughout the research lifecycle and beyond, e.g.:
- Correspondence (electronic & paper)
- Project files
- Grant applications
- Ethics applications
- Technical reports
- Research reports
- Master lists
- Signed consent forms
File Format Policy Examples
- Library of Congress Recommended Formats Statement: The Library of Congress identified preferred and acceptable file formats for textual works and musical compositions, still image works, audio works, moving image works, software and electronic gaming and learning, datasets/databases and websites.
- UK Data Service Recommended Formats: This table contains guidance on file formats recommended and accepted by the UK Data Service for data sharing, reuse and preservation.
- ISSDA File Format Policy [pdf]: ISSDA preferred and acceptable file formats for quantitative data in the Social Sciences.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License