Removing relatives from UKB

Many analyses assume that observations are independent, so related individuals must be removed in advance. UKB is large, and estimating relatedness within a diverse population such as UKB is challenging. UKB has already estimated the relatedness between its participants, so we will use this pre-prepared file. Unfortunately, this file is only provided with project-specific IDs, so relatives must be removed separately within each application. In practice, this should be done for each study after the required phenotype data have been identified, in order to maximise the sample size of the analysis.

The application-specific relatedness file provided by UKB can be downloaded as instructed here: https://biobank.ndph.ox.ac.uk/ukb/label.cgi?id=263. Ken has already downloaded this file for the ukb18177 application.

I have created a script called ukb_relative_remover.R, which formats the UKB relatedness file and then runs GreedyRelated, with or without a phenotype file, using the specified relatedness threshold. The script takes less than 1 second to run.
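
To illustrate the general idea of threshold-based relative removal, here is a minimal sketch in R using toy data. It is not the algorithm implemented by GreedyRelated or by ukb_relative_remover.R, which may differ. It is a simple heuristic that drops whichever individual appears in the most related pairs until no pairs remain. The column names ID1, ID2 and Kinship are assumed to mirror the UKB relatedness file, and the values are invented for illustration.

```r
# Toy relatedness table (hypothetical values); ID1, ID2 and Kinship are
# assumed to mimic the columns of the UKB relatedness file.
rel <- data.frame(
  ID1     = c(1, 1, 2, 4, 6),
  ID2     = c(2, 3, 3, 5, 7),
  Kinship = c(0.25, 0.26, 0.24, 0.09, 0.01)
)

rel_thresh <- 0.044  # same value as passed to --rel_thresh

# Keep only pairs whose kinship exceeds the threshold
pairs <- rel[rel$Kinship > rel_thresh, c('ID1', 'ID2')]

# Greedy pruning: repeatedly drop the individual involved in the most
# remaining related pairs until no related pairs are left
removed <- c()
while (nrow(pairs) > 0) {
  counts  <- table(c(pairs$ID1, pairs$ID2))
  worst   <- as.numeric(names(counts)[which.max(counts)])
  removed <- c(removed, worst)
  pairs   <- pairs[pairs$ID1 != worst & pairs$ID2 != worst, , drop = FALSE]
}

print(removed)
```

In this toy example individuals 1, 2 and 3 are all related to each other, so two of them must be removed, while the pair (6, 7) falls below the threshold and is left untouched. Supplying a phenotype file lets the real tool take phenotype data into account when choosing whom to drop, which this sketch does not attempt.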

mkdir /scratch/groups/ukbiobank/usr/ollie_pain/ReQC/relative_remover

sbatch -p brc,shared --mem=1G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/UKB-GenoPrep/Scripts/ukb_relative_remover/ukb_relative_remover.R \
  --rel_file /scratch/datasets/ukbiobank/ukb18177/raw/ukb18177_rel_s488264.dat \
  --rel_thresh 0.044 \
  --GreedyRelated /scratch/groups/ukbiobank/Edinburgh_Data/Software/tools/GreedyRelated/GreedyRelated \
  --output /scratch/groups/ukbiobank/usr/ollie_pain/ReQC/relative_remover/ukb18177

# You can also specify a keep file (--keep) to only remove related individuals within a subset of UKB.
# The output contains two columns containing the application specific ID of participants.
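
For context on the --rel_thresh value: 0.044 presumably corresponds to the lower kinship bound for third-degree relatives under the KING convention (2^-4.5, roughly 0.0442). This is an assumption about why the value was chosen; the script itself only needs a numeric threshold. The sketch below, with a hypothetical helper classify_kinship, shows how kinship coefficients map onto relationship classes under that convention.

```r
# Hypothetical helper: classify kinship coefficients using the KING
# lower bounds 2^-4.5, 2^-3.5, 2^-2.5 and 2^-1.5
classify_kinship <- function(k) {
  cut(k,
      breaks = c(-Inf, 2^(-4.5), 2^(-3.5), 2^(-2.5), 2^(-1.5), Inf),
      labels = c('unrelated', '3rd degree', '2nd degree',
                 '1st degree', 'duplicate/MZ twin'),
      right  = FALSE)  # intervals are closed on the left: [lower, upper)
}

# The third-degree lower bound, rounded to 4 decimal places
third_degree_bound <- round(2^(-4.5), 4)
print(third_degree_bound)

# Example kinship values (invented for illustration)
kin_class <- classify_kinship(c(0.01, 0.05, 0.10, 0.25, 0.49))
print(kin_class)
```

With a threshold of 0.044, pairs classified as third-degree or closer would be treated as related.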

Check the proportion of individuals from each population removed due to relatedness

library(data.table)
# Read in list of related individuals
rel<-fread('/scratch/groups/ukbiobank/usr/ollie_pain/ReQC/relative_remover/ukb18177.related')

# Read in the application specific fam
fam<-fread('/scratch/groups/ukbiobank/ukb18177_glanville/genotyped/ukb18177_glanville_binary_pre_qc.fam')
fam<-fam[,1]
fam$row_n<-1:nrow(fam)

# Read in the list of rows for each population and update to application specific IDs
n_remov<-NULL
for(pop in c('EUR','AFR','SAS','EAS','AMR')){
  pop_keep<-fread(paste0('/scratch/groups/ukbiobank/usr/ollie_pain/ReQC/PostQC/UKB.postQC.',pop,'.keep'))
  pop_keep<-pop_keep[,1]
  pop_keep<-merge(pop_keep, fam, by.x='V1', by.y='row_n')
  pop_keep<-pop_keep[,c('V1.y','V1.y')]
  names(pop_keep)<-c('FID','IID')
  pop_keep_unrel<-pop_keep[!(pop_keep$FID %in% rel$V1),]
  
  n_remov<-rbind(n_remov, data.frame(Population=pop,
                                     N_before=nrow(pop_keep),
                                     N_after=nrow(pop_keep_unrel)))
}

n_remov$prop_remove<-1-(n_remov$N_after/n_remov$N_before)

# write.csv always writes the header row and ignores col.names (with a warning), so it is omitted here
write.csv(n_remov, '/scratch/groups/ukbiobank/usr/ollie_pain/ReQC/relative_remover/Rel_check_per_pop.csv', row.names=F, quote=F)

N individuals per population before and after removal of related individuals
Population	N_before	N_after	prop_remove
EUR	445255	373841	0.1603890
AFR	7924	7337	0.0740787
SAS	9688	8931	0.0781379
EAS	2502	2452	0.0199840
AMR	1904	1873	0.0162815