With this new wave of QC, there is an emphasis on retaining all individuals in the dataset, so selection of individuals can occur at the point of analysis. This is mainly to avoid removing non-European individuals prior to QC, so they can be more readily included in analyses. I propose that we stratify UKB participants into the five 1KG super populations (EUR, AMR, EAS, SAS, AFR), and then perform super population specific quality control.
A model predicting super population membership can be derived using the 1KG Phase 3 sample, as these individuals are of known ancestry. First, principal components (PCs) should be derived in the 1KG Phase 3 sample, and then a model predicting super population membership from the PC scores should be derived. The 1KG Phase 3 PCs can then be projected into UKB, to estimate PC scores in UKB on the same axes as in the 1KG Phase 3 sample. The super population membership of each UKB individual can then be calculated using the model derived in the 1KG Phase 3 sample.
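As a minimal sketch of the projection step, assuming each variant is standardised using the reference (1KG) mean and SD before being weighted by the reference PC loadings. The function names and toy values are hypothetical, and this is not necessarily how the script implements it:

```python
def standardise(genotypes, ref_means, ref_sds):
    """Centre and scale allele counts using REFERENCE means and SDs (assumption)."""
    return [(g - m) / s for g, m, s in zip(genotypes, ref_means, ref_sds)]


def project(genotypes, ref_means, ref_sds, loadings):
    """Return PC scores for one target individual.

    loadings[k][j] is the weight of variant j on PC k, taken from the
    PCA run in the reference sample.
    """
    z = standardise(genotypes, ref_means, ref_sds)
    return [sum(w * zj for w, zj in zip(pc, z)) for pc in loadings]


# Toy example: 3 variants, 2 PCs (all values made up for illustration)
ref_means = [1.0, 0.5, 1.5]
ref_sds = [0.5, 0.5, 0.5]
loadings = [[1.0, 0.0, 0.0],
            [0.0, 1.0, -1.0]]
target_genotypes = [2, 0, 1]

pc_scores = project(target_genotypes, ref_means, ref_sds, loadings)
```

Because the scaling and loadings both come from the reference, the target scores sit on the same axes as the reference scores, which is what allows the reference-trained model to be applied to them.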
The same variants must be used to define PCs in 1KG as are used in UKB. To define these variants, we should perform basic SNP-level QC in UKB, and then identify the intersect with the 1KG Phase 3 sample. The variants must be LD-independent for PC estimation, and this should be determined using the 1KG Phase 3 sample, as this is where the PCs are being derived.
To define the intersect between UKB and 1KG, I would use the autosomal UKB imputed data in hard call format, with fairly strict SNP-level QC (pre-conversion QC: MAF > 0.05, INFO = 1, intersect with 1KG MAF > 0.05; post-conversion QC: hard-call thresh = 0.95, geno < 0.02, hwe > 1e-6). bgen conversion and QC can be implemented using qctool V2.
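The thresholds above can be sketched as a simple filter. This only illustrates the selection logic and is not a replacement for qctool; the dictionary keys and variant IDs are hypothetical, the hard-call threshold is not modelled, and I have read "INFO = 1" as INFO >= 1 and "hwe > 1e-6" as an HWE p-value above 1e-6:

```python
def passes_pre_conversion(variant, ref_maf):
    """Pre-conversion QC: UKB MAF > 0.05, INFO = 1, and present in 1KG with MAF > 0.05.

    variant: dict with hypothetical keys 'id', 'maf', 'info'.
    ref_maf: dict mapping variant ID -> MAF in the 1KG reference.
    """
    return (
        variant["maf"] > 0.05
        and variant["info"] >= 1.0
        and ref_maf.get(variant["id"], 0.0) > 0.05  # absent in 1KG -> fails
    )


def passes_post_conversion(variant):
    """Post-conversion QC: missingness < 0.02 and HWE p-value > 1e-6."""
    return variant["missingness"] < 0.02 and variant["hwe_p"] > 1e-6


def select_variants(ukb_variants, ref_maf):
    """Return IDs of variants passing both QC stages."""
    return [
        v["id"]
        for v in ukb_variants
        if passes_pre_conversion(v, ref_maf) and passes_post_conversion(v)
    ]


# Toy data (made up for illustration)
ukb_variants = [
    {"id": "rs1", "maf": 0.20, "info": 1.00, "missingness": 0.001, "hwe_p": 0.5},
    {"id": "rs2", "maf": 0.01, "info": 1.00, "missingness": 0.001, "hwe_p": 0.5},  # low MAF
    {"id": "rs3", "maf": 0.30, "info": 0.97, "missingness": 0.001, "hwe_p": 0.5},  # INFO < 1
    {"id": "rs4", "maf": 0.30, "info": 1.00, "missingness": 0.050, "hwe_p": 0.5},  # missingness
    {"id": "rs5", "maf": 0.30, "info": 1.00, "missingness": 0.000, "hwe_p": 0.4},  # not in 1KG
]
ref_maf = {"rs1": 0.25, "rs2": 0.02, "rs3": 0.30, "rs4": 0.30}

kept = select_variants(ukb_variants, ref_maf)
```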
I have written a script here, which finds the intersect between the target and reference sample, defines N PCs in the reference, derives a multinomial elastic net model predicting super population membership, projects the PCs into the target sample, and then computes super population probabilities for each target individual. The script also reports the accuracy of the model in the reference sample.
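For illustration, a multinomial model turns projected PC scores into super population probabilities via a softmax over one linear score per class. The sketch below assumes that form; the function name, intercepts and coefficients are hypothetical placeholders rather than values from the script:

```python
from math import exp


def class_probabilities(pc_scores, intercepts, coefs):
    """Softmax over per-class linear scores.

    intercepts: dict mapping super population -> intercept.
    coefs: dict mapping super population -> list of per-PC coefficients.
    """
    scores = {
        pop: intercepts[pop] + sum(b * x for b, x in zip(coefs[pop], pc_scores))
        for pop in intercepts
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    expd = {pop: exp(s - top) for pop, s in scores.items()}
    total = sum(expd.values())
    return {pop: e / total for pop, e in expd.items()}


# With all-zero coefficients every class is equally likely
pops = ["EUR", "AMR", "EAS", "SAS", "AFR"]
flat = class_probabilities(
    [0.3, -1.2],
    {p: 0.0 for p in pops},
    {p: [0.0, 0.0] for p in pops},
)
```

The elastic net penalty affects how the coefficients are estimated, not how probabilities are computed once the coefficients are fixed.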
One consideration is that, by default, the ancestry_identifier script uses an elastic net model, whereas PanUKBB used a random forest. I expect the performance of these algorithms to be similar, but this can be tested. PanUKBB use only the first 6 PCs, as they find that using more than this does not improve performance. PanUKBB use a 50% probability threshold to define super populations.
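A probability threshold of this kind can be applied as in the sketch below. The function name is hypothetical, and I have assumed a strict "greater than" comparison, with individuals who do not exceed the threshold for any super population left unassigned:

```python
def assign_super_population(probs, threshold=0.5):
    """Return the most probable super population, or None if no class
    exceeds the threshold (strict inequality is an assumption)."""
    best = max(probs, key=probs.get)
    return best if probs[best] > threshold else None


confident = {"EUR": 0.90, "AMR": 0.05, "EAS": 0.01, "SAS": 0.03, "AFR": 0.01}
ambiguous = {"EUR": 0.40, "AMR": 0.20, "EAS": 0.02, "SAS": 0.35, "AFR": 0.03}

label_confident = assign_super_population(confident)
label_ambiguous = assign_super_population(ambiguous)
```

With a threshold of 0.5, at most one super population can pass per individual, so the assignment is always unambiguous.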
Here we will use Joni’s original QC script to QC each super population separately. Details are provided in the following file: /scratch/groups/ukbiobank/SGDP_Central_Genotype_QC.Rmd