Within King’s College London (KCL), as at other institutes, a single UKB genotype dataset is shared among multiple applications so that the institute does not have to house multiple copies of the very large dataset. Each application receives a .fam and .sample file linking its application-specific IDs to the genotype data. Therefore, as long as the order of individuals is maintained, genotype-only data derivatives can also be shared across applications. For example, UKB genotype data that has undergone further quality control (QC) can be shared across applications, rather than each application generating its own version of the dataset. This saves time, helps people without the required expertise to use UKB genetic data, and ensures consistency across applications. However, it is essential that no application-specific data is used when deriving data shared across applications, as this would breach the data agreement with UKB.
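
The reliance on row order can be illustrated with a minimal Python sketch. It assumes a shared, genotype-only derivative that holds one value per individual in the same order as the genotype data (for example, a QC pass flag) and attaches an application's own IDs to it by row position. The function names and the shape of the shared derivative are hypothetical; only the .fam layout (family ID in column 1, individual ID in column 2) is taken from the PLINK format.

```python
def read_fam_ids(fam_lines):
    """Return the individual ID (column 2) of each .fam row, in file order."""
    ids = []
    for line in fam_lines:
        fields = line.split()
        if fields:  # skip blank lines
            ids.append(fields[1])
    return ids


def attach_ids(fam_lines, shared_values):
    """Pair application-specific IDs with a shared per-individual derivative.

    Assumption: shared_values has one entry per individual, in the same
    order as the rows of the application's .fam file.
    """
    ids = read_fam_ids(fam_lines)
    if len(ids) != len(shared_values):
        raise ValueError(
            f"row count mismatch: {len(ids)} .fam rows, "
            f"{len(shared_values)} shared values"
        )
    return list(zip(ids, shared_values))
```

The length check only catches a changed number of individuals. It cannot detect a reordering, which is why the order of individuals has to be preserved when the shared derivative is created.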

When received, the UKB genotype data is split into two folders:

  • Genotypes - Contains observed (no imputation) genotype data
  • Imputed - Contains imputed genotype data

1 Observed genotype data

Here, the files of main interest are the binary PLINK-format data merged across all chromosomes:

  • ukb_binary_v2.bed
  • ukb_binary_v2.bim
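
As a quick sanity check on the merged data, the .bim file can be summarised without loading any genotypes. The sketch below is a minimal example that assumes the standard six-column PLINK .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2). The function name is hypothetical.

```python
from collections import Counter


def count_variants_per_chromosome(bim_lines):
    """Count variants per chromosome code from the rows of a PLINK .bim file."""
    counts = Counter()
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:  # skip blank or malformed rows
            continue
        counts[fields[0]] += 1
    return dict(counts)
```

In practice the function would be given an open file handle, e.g. `open("ukb_binary_v2.bim")`, since it only iterates over lines.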

2 Imputed genotype data

Here, the files of main interest are the following:

  • ukb_imp_chr*_v3.bgen - dosage values for imputed variants split by chromosome
  • ukb_mfi_chr*_v3.txt - per variant information for imputed variants split by chromosome
  • ukb_sqc_v2.txt - per sample quality control information
  • ukb_imp_chr*_v3_MAF0_INFO7.bgen - dosage values for imputed variants split by chromosome but restricted to variants with MAF > 0.01 and INFO > 0.7 (made by Joni)
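
A variant list matching the thresholds described above (MAF > 0.01 and INFO > 0.7) could be derived from the per-variant information files roughly as follows. This is a sketch only and not necessarily how the restricted .bgen files were produced. It assumes the ukb_mfi_chr*_v3.txt files are tab-separated, have no header, and contain eight columns in the order alternate ID, rsID, position, allele 1, allele 2, MAF, minor allele, INFO score. The function name is hypothetical.

```python
def select_variants(mfi_lines, min_maf=0.01, min_info=0.7):
    """Return rsIDs of variants with MAF > min_maf and INFO > min_info.

    Assumed column layout (0-based): 1 = rsID, 5 = MAF, 7 = INFO score.
    """
    kept = []
    for line in mfi_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:  # skip blank or malformed rows
            continue
        rsid = fields[1]
        maf = float(fields[5])
        info = float(fields[7])
        if maf > min_maf and info > min_info:
            kept.append(rsid)
    return kept
```

Because the filter uses only per-variant information from the shared dataset, and no application-specific data, a list produced this way is the kind of derivative that can be shared across applications.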