Research Data Management: File Formats

Bringing together University resources and services to help researchers produce high-quality data

File Formats

File formats that are in widespread use, or are non-proprietary (e.g. open-source formats that no one owns), are more likely to remain readable in the future. Specialised proprietary formats used only by a niche set of users may present problems for future use. However, there are often trade-offs. Open formats may not support all the functionality of a proprietary format, or they may produce larger files because they offer less efficient compression. Sometimes you will want to store your data in its original format and also in a more open or accessible format for sharing or archiving.

When choosing file formats you will need to consider the following:

  • How you plan to analyse, sort, and store your data
  • Which software and file formats you and your colleagues have used in the past
  • Any discipline-specific norms or technical standards (and the associated peer-to-peer support that comes with them) 
  • Whether file formats are at risk of obsolescence, because of new versions or their dependence on particular technology. ‘Open’ formats are better because they can be used by anyone and supported by any software developer, free of charge. This makes their rapid obsolescence very unlikely
  • Which formats are best to use for the long-term preservation of data
  • Whether important information might be lost by converting between different formats
  • The possibility of embedding metadata that describes content within the file itself, e.g. creator information, row/column descriptors
  • Whether you would be better off using one format for data collection and analysis and converting your data to another format for sharing once your project is complete

Converting or Migrating Files

At some time during your research you may need to convert or migrate your data files from one format to another. This may be due to a new computer, new software, sharing with someone who has different software, working on a shared platform instead of your own PC, or simply in order to ensure that your data can be read and used in the future.

Some “lossiness” (i.e. reduction in quality) may occur when migrating from one file format to another. It is important for you to understand what is at risk for the type of data you are working with.

Potential risks of loss or corruption on conversion or migration to new media include the following:

  • Textual data: formatting such as highlighting, bold text or headers/footers may be lost
  • Data held in statistical packages, spreadsheets or databases: some data or internal metadata such as missing value definitions, decimal numbers, formulae or variable labels may be lost during conversion to another format, or data may be truncated
  • Image files: loss of layers, colour fidelity, resolution etc.
  • Multimedia: the same risks as for image files, with particular attention needed to frame rates, sound quality, codecs and wrappers
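
To see why losing a missing value definition matters, consider the small sketch below. The scores and the -99 missing code are hypothetical and chosen only for illustration; the point is that once the definition is lost in conversion, the code is treated as a real value.

```python
from statistics import mean

# Hypothetical survey scores in which -99 was defined as "missing"
# in the original statistical package.
scores = [4, 5, -99, 3]
missing_codes = {-99}

# With the missing-value definition, -99 is excluded from the analysis.
valid_scores = [value for value in scores if value not in missing_codes]
mean_with_definition = mean(valid_scores)

# After conversion to plain delimited text the definition is gone,
# so -99 is treated as a real score and distorts the result.
mean_without_definition = mean(scores)

print(mean_with_definition, mean_without_definition)
```

The two means differ substantially, which is why missing value definitions should be recorded alongside the converted data.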

Before you begin, it is worth familiarising yourself with both the format you are converting from and the format you are converting to; at the very least, look them up on the Web.

Check the integrity of converted files as thoroughly as possible immediately afterwards, e.g. by counting rows and columns, testing functionality, testing export, etc. 'Eyeball' the data too.
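
As a rough illustration of counting rows and columns, the sketch below compares a delimited text table before and after conversion. It assumes both versions can be read as delimited text; the function names are hypothetical, and a check like this supplements rather than replaces looking at the data yourself.

```python
import csv


def table_shape(lines, delimiter=","):
    """Return (number of rows, set of row widths) for delimited text.

    `lines` can be an open text file or any iterable of strings.
    """
    rows = list(csv.reader(lines, delimiter=delimiter))
    # A rectangular table yields exactly one width; more than one
    # suggests rows were split or truncated during conversion.
    widths = {len(row) for row in rows}
    return len(rows), widths


def same_shape(original, converted, original_delimiter=",", converted_delimiter=","):
    """True when both tables have the same row count and row widths."""
    return (table_shape(original, original_delimiter)
            == table_shape(converted, converted_delimiter))


def changed_cells(original, converted, original_delimiter=",", converted_delimiter=","):
    """List (row, column) positions whose text differs after conversion."""
    before = list(csv.reader(original, delimiter=original_delimiter))
    after = list(csv.reader(converted, delimiter=converted_delimiter))
    return [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(before, after))
        for c, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]
```

With real files, open each one with `open(path, newline="")` and pass the file object in, reopening the file for each call because a file can only be read through once. Note that `changed_cells` compares only positions present in both tables, so run the shape check first.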


File Formats Chart

For each type of data, this chart lists the acceptable formats for sharing, reuse and preservation, followed by other acceptable formats for data preservation.

Quantitative tabular data with extensive metadata:

A dataset with variable labels, code labels, and defined missing values, in addition to the matrix of data

Acceptable formats for sharing, reuse and preservation:

  • SPSS portable format (.por)
  • Delimited text and command ('setup') file (SPSS, Stata, SAS, etc.) containing metadata information
  • Some structured text or mark-up file containing metadata information, e.g. DDI XML file

Other acceptable formats for data preservation:

  • Proprietary formats of statistical packages e.g. SPSS (.sav), Stata (.dta)
  • MS Access (.mdb/.accdb)
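
The idea of pairing delimited text with a structured file containing the metadata can be sketched as follows. This is only an illustration: the variables, codes and XML element names (`codebook`, `variable`, `label`, `code`, `missing`) are hypothetical and do not follow the DDI schema, which a real deposit should use instead.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical variable descriptions; names and codes are illustrative only.
VARIABLES = {
    "sex": {"label": "Sex of respondent",
            "codes": {"1": "Female", "2": "Male"},
            "missing": ["-9"]},
    "age": {"label": "Age in years", "codes": {}, "missing": ["-9"]},
}


def data_as_csv(rows, fieldnames):
    """Serialise the data matrix as comma-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def metadata_as_xml(variables):
    """Serialise variable labels, code labels and missing values as XML."""
    root = ET.Element("codebook")  # hypothetical element name, not DDI
    for name, info in variables.items():
        variable = ET.SubElement(root, "variable", name=name)
        ET.SubElement(variable, "label").text = info["label"]
        for code, meaning in info["codes"].items():
            ET.SubElement(variable, "code", value=code).text = meaning
        for value in info["missing"]:
            ET.SubElement(variable, "missing", value=value)
    return ET.tostring(root, encoding="unicode")
```

Writing the two strings to separate files keeps the data matrix readable by any software while the labels and missing value definitions travel with it.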

Quantitative tabular data with minimal metadata:

A matrix of data with or without column headings or variable names, but no other metadata or labelling

Acceptable formats for sharing, reuse and preservation:

  • Comma-separated values (CSV) file (.csv)
  • Tab-delimited file (.tab), including delimited text of given character set with SQL data definition statements where appropriate

Other acceptable formats for data preservation:

  • Delimited text of given character set - only characters not present in the data should be used as delimiters (.txt)
  • Widely-used formats, e.g. MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf) and OpenDocument Spreadsheet (.ods)
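
The rule that only characters not present in the data should be used as delimiters can be checked before export. The sketch below is a minimal illustration; the list of candidate delimiters and the function name are assumptions, not a standard.

```python
# Candidate delimiters to try, in order of preference (an assumption).
CANDIDATES = ("\t", ",", ";", "|")


def safe_delimiter(rows, candidates=CANDIDATES):
    """Return the first candidate that appears in no cell, or None."""
    cells = [str(cell) for row in rows for cell in row]
    for delimiter in candidates:
        if not any(delimiter in cell for cell in cells):
            return delimiter
    # Every candidate occurs somewhere in the data.
    return None
```

If the function returns `None`, every candidate occurs in the data, and you would need either a different character or a format that quotes its fields.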

Geospatial data:

Vector and raster data

Acceptable formats for sharing, reuse and preservation:

  • ESRI Shapefile (essential - .shp, .shx, .dbf, .prj; optional - .sbx, .sbn)
  • Geo-referenced TIFF (.tif, .tfw)
  • CAD data (.dwg)
  • Tabular GIS attribute data

Other acceptable formats for data preservation:

  • ESRI Geodatabase format (.mdb)
  • MapInfo Interchange Format (.mif) for vector data
  • Keyhole Mark-up Language (KML) (.kml)
  • Adobe Illustrator (.ai), CAD data (.dxf or .svg)
  • Binary formats of GIS and CAD packages
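
Because an ESRI Shapefile is a set of files rather than a single file, it is easy to share an incomplete set. The sketch below checks a list of file names for the essential components listed in this chart; the function name and the example layer name are hypothetical.

```python
from pathlib import PurePath

# Components taken from the chart entry for ESRI Shapefile.
ESSENTIAL = (".shp", ".shx", ".dbf", ".prj")
OPTIONAL = (".sbx", ".sbn")


def missing_components(filenames, stem):
    """Return the essential extensions absent for the layer named `stem`."""
    present = {
        PurePath(name).suffix.lower()
        for name in filenames
        if PurePath(name).stem == stem
    }
    return [extension for extension in ESSENTIAL if extension not in present]
```

For example, `missing_components(["roads.shp", "roads.shx", "roads.dbf"], "roads")` reports that the `.prj` file holding the coordinate system is missing.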

Qualitative data:

Acceptable formats for sharing, reuse and preservation:

  • eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml)
  • Rich Text Format (.rtf)
  • Plain text data, ASCII (.txt)

Other acceptable formats for data preservation:

  • Hypertext Mark-up Language (HTML) (.html)
  • Widely-used proprietary formats, e.g. MS Word (.doc/.docx)
  • Some proprietary/software-specific formats, e.g. NUD*IST, NVivo and ATLAS.ti

Digital image data:

Acceptable formats for sharing, reuse and preservation:

  • TIFF version 6 uncompressed (.tif)

Other acceptable formats for data preservation:

  • JPEG (.jpeg, .jpg) but only if created in this format
  • TIFF (other versions) (.tif, .tiff)
  • Adobe Portable Document Format (PDF/A, PDF) (.pdf)
  • Standard applicable RAW image format (.raw)
  • Photoshop files (.psd)

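Digital audio data:

Acceptable formats for sharing, reuse and preservation:

  • Free Lossless Audio Codec (FLAC) (.flac)

Other acceptable formats for data preservation:

  • MPEG-1 Audio Layer 3 (.mp3) but only if created in this format
  • Audio Interchange File Format (AIFF) (.aif)
  • Waveform Audio Format (WAV) (.wav)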

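Digital video data:

Acceptable formats for sharing, reuse and preservation:

  • MPEG-4 (.mp4)
  • Motion JPEG 2000 (.mj2)

Documentation and scripts:

Acceptable formats for sharing, reuse and preservation:

  • Rich Text Format (.rtf)
  • PDF/A or PDF (.pdf)
  • HTML (.htm)
  • OpenDocument Text (.odt)

Other acceptable formats for data preservation:

  • Plain text (.txt)
  • Some widely-used proprietary formats, e.g. MS Word (.doc/.docx) or MS Excel (.xls/.xlsx)
  • XML marked-up text (.xml) according to an appropriate DTD or schema, e.g. XHTML 1.0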

