Many analyses assume that observations are independent, so related individuals must be removed in advance. UKB is large and diverse, which makes estimating relatedness challenging. UKB has already estimated the relatedness between its participants, and we will therefore use this pre-prepared file. Unfortunately, the file is only provided with application-specific IDs, so relatedness removal must be done within each application. In practice, it should really be done for each study after the required phenotype data have been identified, in order to maximise the sample size of the analysis.
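
As a toy illustration of why the order of operations matters, consider one related pair in which only one member has phenotype data. All IDs below are made up, and the "drop the first member of each pair" rule is a deliberate simplification, not what GreedyRelated does.

```r
# Toy data (hypothetical): A and B are related; C is unrelated to both
pairs <- data.frame(ID1 = "A", ID2 = "B", stringsAsFactors = FALSE)
all_ids <- c("A", "B", "C")
phenotyped <- c("A", "C")   # B has no phenotype data

# Drop the first member of every related pair found within a set of IDs
drop_relatives <- function(ids, pairs) {
  in_set <- pairs$ID1 %in% ids & pairs$ID2 %in% ids
  setdiff(ids, pairs$ID1[in_set])
}

# Option 1: remove relatives across everyone, then restrict to phenotyped
n_rel_first <- length(intersect(drop_relatives(all_ids, pairs), phenotyped))

# Option 2: restrict to phenotyped individuals, then remove relatives
n_pheno_first <- length(drop_relatives(phenotyped, pairs))

print(c(rel_first = n_rel_first, pheno_first = n_pheno_first))
```

Restricting to phenotyped individuals first keeps both A and C, whereas removing relatives first can discard A and leave only C.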

The application-specific relatedness file provided by UKB can be downloaded as instructed here: https://biobank.ndph.ox.ac.uk/ukb/label.cgi?id=263. Ken has already downloaded this file for the ukb18177 application.

I have created a script called ukb_relative_remover.R, which formats the UKB relatedness file and then runs GreedyRelated, with or without a phenotype file, at the specified relatedness threshold. The script takes less than 1 second to run.

mkdir /scratch/groups/ukbiobank/usr/ollie_pain/ReQC/relative_remover

sbatch -p brc,shared --mem=1G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/UKB-GenoPrep/Scripts/ukb_relative_remover/ukb_relative_remover.R \
  --rel_file /scratch/datasets/ukbiobank/ukb18177/raw/ukb18177_rel_s488264.dat \
  --rel_thresh 0.044 \
  --GreedyRelated /scratch/groups/ukbiobank/Edinburgh_Data/Software/tools/GreedyRelated/GreedyRelated \
  --output /scratch/groups/ukbiobank/usr/ollie_pain/ReQC/relative_remover/ukb18177

# You can also specify a keep file (--keep) to only remove related individuals within a subset of UKB.
# The output contains two columns containing the application specific ID of participants.
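
To illustrate what the relatedness threshold and the pruning step do, here is a minimal sketch in R on made-up data. It assumes the relatedness file contains the columns ID1, ID2 and Kinship, that pairs above the threshold count as related, and it uses a simple "remove the person with the most relatives first" rule. This is a simplified stand-in for illustration, not the algorithm implemented by GreedyRelated or by ukb_relative_remover.R.

```r
# Toy kinship table (hypothetical IDs and values; column names are assumed)
rel <- data.frame(
  ID1     = c(1, 1, 2, 4),
  ID2     = c(2, 3, 3, 5),
  Kinship = c(0.25, 0.06, 0.03, 0.12)
)
rel_thresh <- 0.044

# Keep only pairs whose kinship exceeds the relatedness threshold
pairs <- rel[rel$Kinship > rel_thresh, c("ID1", "ID2")]

# Repeatedly remove the individual involved in the most remaining pairs
removed <- c()
while (nrow(pairs) > 0) {
  counts  <- table(c(pairs$ID1, pairs$ID2))
  worst   <- as.numeric(names(counts)[which.max(counts)])
  removed <- c(removed, worst)
  pairs   <- pairs[pairs$ID1 != worst & pairs$ID2 != worst, , drop = FALSE]
}

print(removed)
```

On this toy input, individuals 1 and 4 are removed. Individuals 2 and 3 are both retained because their kinship (0.03) falls below the threshold.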

Check the proportion of individuals from each population removed due to relatedness

library(data.table)
# Read in list of related individuals
rel<-fread('/scratch/groups/ukbiobank/usr/ollie_pain/ReQC/relative_remover/ukb18177.related')

# Read in the application specific fam
fam<-fread('/scratch/groups/ukbiobank/ukb18177_glanville/genotyped/ukb18177_glanville_binary_pre_qc.fam')
fam<-fam[,1]
fam$row_n<-1:nrow(fam)

# Read in the list of rows for each population and update to application specific IDs
n_remov<-NULL
for(pop in c('EUR','AFR','SAS','EAS','AMR')){
  pop_keep<-fread(paste0('/scratch/groups/ukbiobank/usr/ollie_pain/ReQC/PostQC/UKB.postQC.',pop,'.keep'))
  pop_keep<-pop_keep[,1]
  pop_keep<-merge(pop_keep, fam, by.x='V1', by.y='row_n')
  pop_keep<-pop_keep[,c('V1.y','V1.y')]
  names(pop_keep)<-c('FID','IID')
  pop_keep_unrel<-pop_keep[!(pop_keep$FID %in% rel$V1),]
  
  n_remov<-rbind(n_remov, data.frame(Population=pop,
                                     N_before=nrow(pop_keep),
                                     N_after=nrow(pop_keep_unrel)))
}

n_remov$prop_remove<-1-(n_remov$N_after/n_remov$N_before)

# write.csv always writes the header row; its col.names argument cannot be changed
write.csv(n_remov, '/scratch/groups/ukbiobank/usr/ollie_pain/ReQC/relative_remover/Rel_check_per_pop.csv', row.names=F, quote=F)

N individuals per population before and after removal of related individuals

Population  N_before  N_after  prop_remove
EUR         445255    373841   0.1603890
AFR         7924      7337     0.0740787
SAS         9688      8931     0.0781379
EAS         2502      2452     0.0199840
AMR         1904      1873     0.0162815