This page describes calculating reference-standardised genotype-based scores in a target sample. This process requires a number of reference files, which can be made using instructions found here. Here, we will calculate ancestry scores, polygenic scores, and functionally informed polygenic scores. Each of these scores can be used for a number of tasks, such as inferrence of ancestry, estimation of risk, test for genetic covariance between two outcomes, and stratification.

1 Harmonisation with the reference

Before these scores can be calculated, the target samples genetic data must be harmonised with the reference. This process varies depending on the source of the data. Here, I show how the UK biobank and Twins Early Development Study (TEDS) genetic data were harmonised with the reference. Note, both of these samples had been imputed prior to receiving the data.

In all scenarios, I extract SNPs in the HapMap3 SNP-list, convert the data to hard-call PLINK binaries (.bed/.bim/.fam), check whether any SNPs need to be strand flipped, and insert HapMap3 SNPs that are not present into the target data as missing.

I use HapMap3 SNPs only as these are typically available after imputation, enabling comparison of scores from diverse sources. I convert SNPs to hard call with a hard-call threshold of 0, meaning it will take the most likely genotype regardless of the certainty, which speeds subsequent stages of the genotype-based scoring and is unlikely to introduce many errors as these SNPs are typically reliable. I insert HapMap3 SNPs as missing to enable allele frequency imputation while scoring, which ensures scores are comparable between samples with different SNPs available.

1.1 UK Biobank

The code below applies a script called ‘Harmonisation_of_UKBB.R’ to each chromosome of the UK-Biobank data. See here for more information on this script.

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

# Process each chromosome of the UK-Biobank data using the Harmonisation_of_UKBB.R script
for chr in $(seq 1 22); do
qsub -l h_vmem=10G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Harmonisation_of_UKBB/Harmonisation_of_UKBB.R \
  --chr $chr \
  --input_dir ${UKBB_original} \
  --target_fam ${UKBB_fam} \
  --output_dir ${UKBB_output}/Genotype/Harmonised \
  --reference_dir ${Geno_1KG_dir} \
  --qctool2 ${qctool2} \
  --plink ${plink1_9}
done

1.2 TEDS

The TEDS data has already undergone pre-imputation QC, HRC imputation, and post-imputation QC. Genotyping was performed using two chips, Affy and OEE. Each Chip was QC’d and imputed seperatly, and then merged retaining the non-overlapping SNPs to maximise the SNP data available for analysis. The data is in plink format. Here we will harmonise the data with the HapMap3 reference.

See code


qsub -l h_vmem=20G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Harmonisation_of_TEDS/Harmonisation_of_TEDS.R

1.3 23andMe

1.3.1 Format and impute

Here we will prepare genotype data for Oliver_Pain (Me), using the script 23andMe_imputer.R. This code is very similar to that used by the Impute.Me website created by Lasse Folkersen.

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

for individual in $(echo Esther_Anon); do
geno_file=$(ls ${geno_23andMe}/${individual})

# Run for one user
qsub -l h_vmem=15G -pe smp 6 /users/k1806347/brc_scratch/Software/Rscript.sh  /mnt/lustre/users/k1806347/Software/MyGit/GenoPred/Scripts/23andMe_imputer/23andMe_imputer.R \
--FID ${individual} \
--IID ${individual} \
--geno ${geno_23andMe}/${individual}/${geno_file} \
--plink ${plink1_9} \
--out ${geno_23andMe}/${individual}/${individual} \
--ref ${Impute2_1KG_dir} \
--shapeit ${ShapeIt} \
--impute2 ${Impute2} \
--n_core 6 \
--qctool ${qctool2}

done

#######
# Extract HapMap3 SNPs and insert missing genotypes for subsequent MAF imputation.
#######
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

for individual in $(echo Kathy_Pain Esther_Anon James_Pain Sophie_vonStumm); do
qsub -l h_vmem=15G /users/k1806347/brc_scratch/Software/Rscript.sh  /mnt/lustre/users/k1806347/Software/MyGit/GenoPred/Scripts/hm3_harmoniser/hm3_harmoniser.R \
--target ${geno_23andMe}/${individual}/${individual}.1KGphase3.chr \
--ref ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
--plink ${plink1_9} \
--out ${geno_23andMe}/${individual}/${individual}.1KGphase3.hm3.chr
done

2 Ancestry scoring

In genetic analyses it is important to consider an individual’s ancestry as the frequency of alleles vary between populations making comparison across ancestries often invalid. Differences in ancestry can lead to population stratification in genome-wide assocaition studies because the frequency of an outcome may vary between individuals of different ancestries, which will then lead to an association between the outcome and all genetic variants that differ in frequency between ancestral groups. Furthermore, differences in LD between ancestries lead to different patterns of association which influences the predictive utility of polygenic scores across diverse populations.

In this genotype-based scoring pipeline, we estimate factor scores for each individual on the first 100 European specific and 100 all-ancestry principal components (PCs) of populations structure by projecting previously identified PCs in the 1000 Genomes reference (see here). Using the 100 all-ancestry PCs, we estimate which super population individuals belong to. This allows subsquent genotype-based scores to be scaled based on the mean and SD of scores within their ancestral group, improving their interpretability. We also use their inferred ancestry to stratify individuals when building predictive models, to provide population specific models of risk. Furthermore, given that the frequency of outcomes do often vary between ancestral groups, we also derive these estimates of ancestry to be used as predictors of risk.

Given that these PCs are projected and we want them to comparable across target samples, the PCs are derived using a pre-specified subset of LD-independant HapMap3 SNPs. Given that not all SNPs are available in target samples, to avoid bias due to missing data we impute missing SNPs using the MAF from an ancestry match reference population. However, to ensure appriorate ancestry matching, we must first perform PCA using only SNPs that are available in the target sample. Therefore the projected PCs are calculated in two stages:

Ancestry idenitfication - PCs specific to the target sample are calcualted using available data, and compared to a reference to idenitfy the population each individual fits in. This section uses a script called ‘Ancestry_identifier.R’. See more information on the utility of this script here for more information on this script.
Project ancestry - PCs are projected from a reference population and are comparable across target samples. Missing genotypes are imputed using the ancestry matched reference MAF. This section uses a script called ‘scaled_ancestry_scorer.R’. See more information on the utility of this script here for more information on this script.

2.1 UK Biobank

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

#################
# Idenitfy ancestry of UKBB participants
#################

qsub -l h_vmem=15G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Ancestry_identifier/Ancestry_identifier.R \
    --target_plink_chr ${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
  --ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
  --n_pcs 100 \
  --plink ${plink1_9} \
    --plink2 ${plink2} \
    --output ${UKBB_output}/Projected_PCs/Ancestry_idenitfier/UKBB.w_hm3.AllAncestry \
  --ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list \
  --pop_data ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small

#################
# Calculate all ancestry PCs of population structure 
#################

qsub -l h_vmem=15G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/scaled_ancestry_scorer/scaled_ancestry_scorer.R \
    --target_plink_chr ${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
    --target_keep ${UKBB_output}/Projected_PCs/Ancestry_idenitfier/UKBB.w_hm3.AllAncestry.EUR.keep \
    --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
    --ref_score ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.eigenvec.var \
    --ref_eigenvec ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.eigenvec \
    --plink2 ${plink2} \
    --output ${UKBB_output}/Projected_PCs/EUR/AllAncestry/UKBB.w_hm3.AllAncestry \
    --ref_scale ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.EUR.scale \
    --pop_data ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small

# Old code: Use all ancestry MAF to impute missing genotypes
qsub -l h_vmem=15G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/scaled_ancestry_scorer/scaled_ancestry_scorer.R \
    --target_plink_chr ${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
    --ref_freq_chr ${Geno_1KG_dir}/freq_files/AllAncestry/1KGPhase3.w_hm3.AllAncestry.chr \
    --ref_score ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.eigenvec.var \
    --ref_eigenvec ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.eigenvec \
    --plink2 ${plink2} \
    --output ${UKBB_output}/Projected_PCs/AllAncestry/UKBB.w_hm3.AllAncestry \
    --pop_data ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small \
    --pop_model ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.pop_enet_model.rds \
    --pop_model_scale ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.scale \
    --pop_scale_for_keep ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.scalefiles.txt

#################
# Calculate European ancestry PCs of population structure
#################

qsub -l h_vmem=15G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/scaled_ancestry_scorer/scaled_ancestry_scorer.R \
    --target_plink_chr ${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
    --target_keep ${UKBB_output}/Projected_PCs/AllAncestry/UKBB.w_hm3.AllAncestry.EUR.keep \
    --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
    --ref_score ${Geno_1KG_dir}/Score_files_for_ancestry/EUR/1KGPhase3.w_hm3.EUR.eigenvec.var \
    --ref_eigenvec ${Geno_1KG_dir}/Score_files_for_ancestry/EUR/1KGPhase3.w_hm3.EUR.eigenvec \
    --ref_scale ${Geno_1KG_dir}/Score_files_for_ancestry/EUR/1KGPhase3.w_hm3.EUR.scale \
    --plink2 ${plink2} \
    --output ${UKBB_output}/Projected_PCs/EUR/UKBB.w_hm3 \
    --pop_data ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small

2.2 TEDS

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

#################
# Idenitfy ancestry of TEDS participants
#################

qsub -l h_vmem=15G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Ancestry_identifier/Ancestry_identifier.R \
  --target_plink_chr ${TEDS_output_dir}/Genotype/Harmonised/TEDS.w_hm3.QCd.AllSNP.chr \
  --ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
  --plink ${plink1_9} \
  --n_pcs 100 \
    --plink2 ${plink2} \
    --output ${TEDS_output_dir}/Projected_PCs/Ancestry_idenitfier/TEDS.w_hm3.AllAncestry \
  --ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list \
  --pop_data ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small

#################
# Calculate all ancestry PCs of population structure for Europeans
#################

qsub -l h_vmem=15G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/scaled_ancestry_scorer/scaled_ancestry_scorer.R \
    --target_plink_chr ${TEDS_output_dir}/Genotype/Harmonised/TEDS.w_hm3.QCd.AllSNP.chr \
    --target_keep ${TEDS_output_dir}/Projected_PCs/Ancestry_idenitfier/TEDS.w_hm3.AllAncestry.EUR.keep \
    --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
    --ref_score ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.eigenvec.var \
    --ref_eigenvec ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.eigenvec \
    --ref_scale ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.EUR.scale \
    --plink2 ${plink2} \
    --output ${TEDS_output_dir}/Projected_PCs/EUR/AllAncestry/TEDS.w_hm3.AllAncestry \
    --pop_data ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small

module add general/R/3.5.0
module add compilers/gcc/8.1.0
module add general/python/2.7.10
R

opt$target_plink_chr<-'/users/k1806347/brc_scratch/Data/TEDS/Genotype/Harmonised/TEDS.w_hm3.QCd.AllSNP.chr'
opt$target_keep<-'/users/k1806347/brc_scratch/Data/TEDS/Projected_PCs/Ancestry_idenitfier/TEDS.w_hm3.AllAncestry.EUR.keep'
opt$ref_freq_chr<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr'
opt$ref_score<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.eigenvec.var'
opt$ref_eigenvec<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.eigenvec'
opt$ref_scale<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.EUR.scale'
opt$plink2<-'/users/k1806347/brc_scratch/Software/plink2'
opt$output<-'/users/k1806347/brc_scratch/Data/TEDS/Projected_PCs/EUR/AllAncestry/TEDS.w_hm3.AllAncestry'
opt$pop_data<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/integrated_call_samples_v3.20130502.ALL.panel_small'

#################
# Calculate European ancestry PCs of population structure
#################

qsub -l h_vmem=15G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/scaled_ancestry_scorer/scaled_ancestry_scorer.R \
    --target_plink_chr ${TEDS_output_dir}/Genotype/Harmonised/TEDS.w_hm3.QCd.AllSNP.chr \
    --target_keep ${TEDS_output_dir}/Projected_PCs/Ancestry_idenitfier/TEDS.w_hm3.AllAncestry.EUR.keep \
    --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
    --ref_score ${Geno_1KG_dir}/Score_files_for_ancestry/EUR/1KGPhase3.w_hm3.EUR.eigenvec.var \
    --ref_eigenvec ${Geno_1KG_dir}/Score_files_for_ancestry/EUR/1KGPhase3.w_hm3.EUR.eigenvec \
    --ref_scale ${Geno_1KG_dir}/Score_files_for_ancestry/EUR/1KGPhase3.w_hm3.EUR.scale \
    --plink2 ${plink2} \
    --output ${TEDS_output_dir}/Projected_PCs/EUR/EUR_specific/TEDS.w_hm3.EUR \
    --pop_data ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small

2.3 23andMe

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

#################
# Identify ancestry
#################

for individual in $(echo Oliver_Pain); do
sbatch -p brc,shared --mem=10G /users/k1806347/brc_scratch/Software/Rscript_singularity.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Ancestry_identifier/Ancestry_identifier.R \
    --target_plink_chr ${geno_23andMe}/${individual}/${individual}.1KGphase3.hm3.chr \
  --ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
  --plink ${plink1_9} \
  --n_pcs 100 \
    --plink2 ${plink2} \
    --output ${geno_23andMe}/${individual}/Projected_PCs/Ancestry_identifier/${individual}.w_hm3.AllAncestry \
  --ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list \
  --pop_data ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small
done

#################
# Calculate AllAncestry PCs of population structure
#################

for individual in $(echo Kaili_Rimfeld); do
qsub /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/scaled_ancestry_scorer/scaled_ancestry_scorer.R \
    --target_plink_chr ${geno_23andMe}/${individual}/${individual}.1KGphase3.hm3.chr \
    --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
    --ref_score ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.eigenvec.var \
    --ref_eigenvec ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.eigenvec \
    --ref_scale ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry/1KGPhase3.w_hm3.AllAncestry.EUR.scale \
    --plink2 ${plink2} \
    --output ${geno_23andMe}/${individual}/Projected_PCs/AllAncestry/${individual}.w_hm3.AllAncestry \
    --pop_data ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small
done

#################
# Calculate European ancestry PCs of population structure
#################

# Use the EUR.keep file from the AllAncestry run above to select people of EUR ancestry.
for individual in $(echo Kaili_Rimfeld); do
qsub /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/scaled_ancestry_scorer/scaled_ancestry_scorer.R \
    --target_plink_chr ${geno_23andMe}/${individual}/${individual}.1KGphase3.hm3.chr \
    --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
    --ref_score ${Geno_1KG_dir}/Score_files_for_ancestry/EUR/1KGPhase3.w_hm3.EUR.eigenvec.var \
    --ref_eigenvec ${Geno_1KG_dir}/Score_files_for_ancestry/EUR/1KGPhase3.w_hm3.EUR.eigenvec \
    --ref_scale ${Geno_1KG_dir}/Score_files_for_ancestry/EUR/1KGPhase3.w_hm3.EUR.scale \
    --plink2 ${plink2} \
    --output ${geno_23andMe}/${individual}/Projected_PCs/EUR/${individual}.w_hm3 \
    --pop_data ${Geno_1KG_dir}/integrated_call_samples_v3.20130502.ALL.panel_small
done

3 Polygenic scoring

Here we calculate poylgenic scores for a range of outcomes in the target samples based on previously prepared reference scoring files, and scale the polygenic scores based on scores in a reference sample with matched ancestry. For more information on preparing reference files for polygenic scoring, see here.

One aim of the GenoPred project is to compare methods for calculating genotype-based scores. Here we compare three approaches for polygenic scoring:

p-value thresholding and LD-based clumping (pT + clump)
lassosum
PRScs
SBLUP

3.1 pT + clump

This section uses a script called ‘Scaled_polygenic_scorer.R’. See more information on the utility of this script here for more information on this script.

3.1.1 UK Biobank

3.1.1.1 Sparse p-value thresholding

We will calculate PRS both without and with control for ancestry PCs,

See code without control of ancestry

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

# Create a list of gwas that haven't got corresponding PRS files in UKBB
> ${UKBB_output}/PolygenicScores/EUR/todo.txt
for gwas in $(cut -f 1 -d ' ' ~/brc_scratch/Data/GWAS_sumstats/Largest_GWAS_over15K_withoutUKBB.txt);do
if [ ! -f ${UKBB_output}/PolygenicScores/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR.profiles ]; then
echo $gwas >> ${UKBB_output}/PolygenicScores/EUR/todo.txt
fi
done

# Calculate polygenic scores for for Europeans only for now.
n=0
for gwas in $(cat ${UKBB_output}/PolygenicScores/EUR/todo.txt);do
qsub -N $(echo PRS_job$(($n+10))) -hold_jid $(echo PRS_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer/Scaled_polygenic_scorer.R \
  --target_plink_chr ${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
  --target_keep ${UKBB_output}/Projected_PCs/AllAncestry/UKBB.w_hm3.AllAncestry.EUR.keep \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${UKBB_output}/PolygenicScores/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR
n=$(($n+1))
done

See code with control of ancestry

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

# Create a files containing all EUR and AllAncestry PCs
module add general/R/3.5.0
R

library(data.table)

EUR<-fread('/users/k1806347/brc_scratch/Data/UKBB/Projected_PCs/EUR/UKBB.w_hm3.EUR.eigenvec')
names(EUR)[-1:-2]<-paste0('EUR_',names(EUR)[-1:-2])

All<-fread('/users/k1806347/brc_scratch/Data/UKBB/Projected_PCs/EUR/AllAncestry/UKBB.w_hm3.AllAncestry.EUR.eigenvec')
names(All)[-1:-2]<-paste0('AllAncestry_',names(All)[-1:-2])

PCs<-merge(EUR,All,by=c('FID','IID'))

write.table(PCs,'/users/k1806347/brc_scratch/Data/UKBB/Projected_PCs/UKBB.AllAncestry_EUR_PCs.EUR.eigenvec', col.names=T, row.names=F, quote=F)

q()
n

# Create a list of gwas that haven't got corresponding PRS files in UKBB
> ${UKBB_output}/PolygenicScores/EUR/todo.txt
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY03 DIAB05 COAD01);do
if [ ! -f ${UKBB_output}/PolygenicScores/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR.PCadjusted.profiles ]; then
echo $gwas >> ${UKBB_output}/PolygenicScores/EUR/todo.txt
fi
done

# Calculate polygenic scores for for Europeans only for now.
n=0
for gwas in $(cat ${UKBB_output}/PolygenicScores/EUR/todo.txt);do
qsub -N $(echo PRS_job$(($n+10))) -hold_jid $(echo PRS_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer/Scaled_polygenic_scorer.R \
  --target_plink_chr ${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
  --target_keep ${UKBB_output}/Projected_PCs/AllAncestry/UKBB.w_hm3.AllAncestry.EUR.keep \
  --target_covar ${UKBB_output}/Projected_PCs/UKBB.AllAncestry_EUR_PCs.EUR.eigenvec \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.EUR_PC_resid.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.EUR_PC_resid.${gwas}.EUR.scale \
  --covar_model ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.EUR_PC_resid.${gwas}.models.rds \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${UKBB_output}/PolygenicScores/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR.PCadjusted
n=$(($n+1))
done

module add general/R/3.5.0
module add compilers/gcc/8.1.0
module add general/python/2.7.10
R

opt$target_plink_chr<-'/users/k1806347/brc_scratch/Data/UKBB/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr'
opt$target_keep<-'/users/k1806347/brc_scratch/Data/UKBB/Projected_PCs/AllAncestry/UKBB.w_hm3.AllAncestry.EUR.keep'
opt$target_covar<-'/users/k1806347/brc_scratch/Data/UKBB/Projected_PCs/UKBB.AllAncestry_EUR_PCs.EUR.eigenvec'
opt$ref_score<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_poylygenic/DEPR06/1KGPhase3.w_hm3.EUR_PC_resid.DEPR06'
opt$ref_scale<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_poylygenic/DEPR06/1KGPhase3.w_hm3.EUR_PC_resid.DEPR06.EUR.scale'
opt$ref_freq_chr<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr'
opt$covar_model<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_poylygenic/DEPR06/1KGPhase3.w_hm3.EUR_PC_resid.DEPR06.models.rds'
opt$plink<-'/users/k1806347/brc_scratch/Software/plink1.9.sh'
opt$pheno_name<-'DEPR06'
opt$output<-'/users/k1806347/brc_scratch/Data/UKBB/PolygenicScores/EUR/DEPR06/UKBB.w_hm3.DEPR06.EUR.PCadjusted'

gwas=EDUC03
qsub /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer/Scaled_polygenic_scorer.R \
  --target_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
  --target_keep ${Geno_1KG_dir}/keep_files/EUR_samples.keep \
  --target_covar ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry_EUR_PCs.EUR.EUR_scaled.eigenvec \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.EUR_PC_resid.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.EUR_PC_resid.${gwas}.EUR.scale \
  --covar_model ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.EUR_PC_resid.${gwas}.models.rds \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${UKBB_output}/PolygenicScores/EUR/${gwas}/1KG.w_hm3.${gwas}.EUR.PCadjusted

module add general/R/3.5.0
R

library(data.table)
score<-fread('/users/k1806347/brc_scratch/Data/UKBB/Projected_PCs/UKBB.AllAncestry_EUR_PCs.EUR.eigenvec')


  --ref_plink_chr ${Geno_1KG_dir}/1KGPhase3.w_hm3.chr \
  --ref_keep ${Geno_1KG_dir}/keep_files/EUR_samples.keep \
  --sumstats ${gwas_rep}/${gwas}.sumstats.gz \
  --plink ${plink1_9} \
  --memory 3000 \
  --covar ${Geno_1KG_dir}/Score_files_for_ancestry/AllAncestry_EUR_PCs.EUR.eigenvec \
  --output ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.EUR_PC_resid.${gwas} \
  --ref_pop_scale ${Geno_1KG_dir}/super_pop_keep.list

3.1.1.2 Dense p-value thresholding

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

###
# Split the EUR UKBB individuals into 10K batches
###
# Temporary files would require a large amount of storage for all UKBB individuals
# Splitting the data into batches avoids this issue
mkdir /users/k1806347/brc_scratch/Data/UKBB/PolygenicScores_dense
split -l 10000 -d ${UKBB_output}/Projected_PCs/AllAncestry/UKBB.w_hm3.AllAncestry.EUR.keep ${UKBB_output}/PolygenicScores_dense/UKBB.w_hm3.AllAncestry.EUR.keep.Batch_

# Calculate polygenic scores for for Europeans only for now.
n=0
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY03 DIAB05 COAD01);do
for batch in $(ls ${UKBB_output}/PolygenicScores_dense/UKBB.w_hm3.AllAncestry.EUR.keep.Batch_* | cut -d. -f6-);do
qsub -N $(echo PRS_job$(($n+100))) -hold_jid $(echo PRS_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer/Scaled_polygenic_scorer.R \
  --target_plink_chr ${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
  --target_keep ${UKBB_output}/PolygenicScores_dense/UKBB.w_hm3.AllAncestry.EUR.keep.${batch} \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic_dense/${gwas}/1KGPhase3.w_hm3.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic_dense/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${UKBB_output}/PolygenicScores_dense/EUR/${gwas}/${batch}/UKBB.w_hm3.${gwas}.EUR.${batch}
n=$(($n+1))
done
done

# Check all the batches finished correctly There should be 46 files
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY03 DIAB05 COAD01);do
ls ${UKBB_output}/PolygenicScores_dense/EUR/${gwas}/*/UKBB.w_hm3.${gwas}.EUR.*.profiles | wc -l
done

# Combine the batches into a single file
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY03 DIAB05 COAD01);do
head -1 ${UKBB_output}/PolygenicScores_dense/EUR/${gwas}/Batch_00/UKBB.w_hm3.${gwas}.EUR.Batch_00.profiles > ${UKBB_output}/PolygenicScores_dense/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR.profiles
tail -n +2 -q ${UKBB_output}/PolygenicScores_dense/EUR/${gwas}/*/UKBB.w_hm3.${gwas}.EUR.*.profiles >> ${UKBB_output}/PolygenicScores_dense/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR.profiles
done

# Delete per batch files and move log files into a single folder
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY03 DIAB05 COAD01);do
mkdir ${UKBB_output}/PolygenicScores_dense/EUR/${gwas}/log_files
mv ${UKBB_output}/PolygenicScores_dense/EUR/${gwas}/*/UKBB.w_hm3.${gwas}.EUR.*.log ${UKBB_output}/PolygenicScores_dense/EUR/${gwas}/log_files/
rm -r ${UKBB_output}/PolygenicScores_dense/EUR/${gwas}/Batch_*
done

3.1.2 TEDS

3.1.2.1 Sparse p-value thresholding

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

mkdir -p ${TEDS_output_dir}/PolygenicScores/EUR

# Create a list of gwas that haven't got corresponding PRS files in TEDS
> ${TEDS_output_dir}/PolygenicScores/EUR/todo.txt
for gwas in $(cut -f 1 -d ' ' ~/brc_scratch/Data/GWAS_sumstats/Largest_GWAS_over15K_withoutTEDS.txt);do
if [ ! -f ${TEDS_output_dir}/PolygenicScores/EUR/${gwas}/TEDS.w_hm3.${gwas}.EUR.profiles ]; then
echo $gwas >> ${TEDS_output_dir}/PolygenicScores/EUR/todo.txt
fi
done

# Calculate polygenic scores for for Europeans only for now.
n=0
for gwas in $(cat ${TEDS_output_dir}/PolygenicScores/EUR/todo.txt);do
qsub -N $(echo PRS_job$(($n+50))) -hold_jid $(echo PRS_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer/Scaled_polygenic_scorer.R \
  --target_plink_chr ${TEDS_output_dir}/Genotype/Harmonised/TEDS.w_hm3.QCd.AllSNP.chr \
    --target_keep ${TEDS_output_dir}/Projected_PCs/Ancestry_idenitfier/TEDS.w_hm3.AllAncestry.EUR.keep \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${TEDS_output_dir}/PolygenicScores/EUR/${gwas}/TEDS.w_hm3.${gwas}.EUR
n=$(($n+1))
done

3.1.2.2 Dense p-value thresholding

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

###
# Split the EUR TEDS individuals into 10K batches
###
# Temporary files would require a large amount of storage for all TEDS individuals
# Splitting the data into batches avoids this issue
mkdir /users/k1806347/brc_scratch/Data/TEDS/PolygenicScores_dense
split -l 1000 -d ${TEDS_output_dir}/Projected_PCs/Ancestry_idenitfier/TEDS.w_hm3.AllAncestry.EUR.keep ${TEDS_output_dir}/PolygenicScores_dense/TEDS.w_hm3.AllAncestry.EUR.keep.Batch_

# Calculate polygenic scores for for Europeans only for now.
n=0
for gwas in $(echo HEIG03 EDUC03 ADHD04 BODY11);do
for batch in $(ls ${TEDS_output_dir}/PolygenicScores_dense/TEDS.w_hm3.AllAncestry.EUR.keep.Batch_* | cut -d. -f6-);do
qsub -N $(echo PRS_job$(($n+50))) -hold_jid $(echo PRS_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer/Scaled_polygenic_scorer.R \
  --target_plink_chr ${TEDS_output_dir}/Genotype/Harmonised/TEDS.w_hm3.QCd.AllSNP.chr \
  --target_keep ${TEDS_output_dir}/PolygenicScores_dense/TEDS.w_hm3.AllAncestry.EUR.keep.${batch} \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic_dense/${gwas}/1KGPhase3.w_hm3.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic_dense/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${TEDS_output_dir}/PolygenicScores_dense/EUR/${gwas}/${batch}/TEDS.w_hm3.${gwas}.EUR.${batch}
n=$(($n+1))
done
done

# Check all the batches finished correctly There should be 11 files
for gwas in $(echo HEIG03 EDUC03 ADHD04 BODY11);do
ls ${TEDS_output_dir}/PolygenicScores_dense/EUR/${gwas}/*/TEDS.w_hm3.${gwas}.EUR.*.profiles | wc -l
done

# Combine the batches into a single file
for gwas in $(echo HEIG03 EDUC03 ADHD04 BODY11);do
head -1 ${TEDS_output_dir}/PolygenicScores_dense/EUR/${gwas}/Batch_00/TEDS.w_hm3.${gwas}.EUR.Batch_00.profiles > ${TEDS_output_dir}/PolygenicScores_dense/EUR/${gwas}/TEDS.w_hm3.${gwas}.EUR.profiles
tail -n +2 -q ${TEDS_output_dir}/PolygenicScores_dense/EUR/${gwas}/*/TEDS.w_hm3.${gwas}.EUR.*.profiles >> ${TEDS_output_dir}/PolygenicScores_dense/EUR/${gwas}/TEDS.w_hm3.${gwas}.EUR.profiles
done

# Delete per batch files and move log files into a single folder
for gwas in $(echo EDUC03 ADHD04 BODY11);do
mkdir ${TEDS_output_dir}/PolygenicScores_dense/EUR/${gwas}/log_files
mv ${TEDS_output_dir}/PolygenicScores_dense/EUR/${gwas}/*/TEDS.w_hm3.${gwas}.EUR.*.log ${TEDS_output_dir}/PolygenicScores_dense/EUR/${gwas}/log_files/
rm -r ${TEDS_output_dir}/PolygenicScores_dense/EUR/${gwas}/Batch_*
done

3.1.3 23andMe

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

n=0
for individual in $(echo Robert_Plomin Esther_Anon Sophie_vonStumm); do

mkdir -p ${geno_23andMe}/${individual}/PolygenicScores/EUR

# Create a list of gwas that haven't got corresponding PRS files
> ${geno_23andMe}/${individual}/PolygenicScores/EUR/todo.txt
for gwas in $(ls ${Geno_1KG_dir}/Score_files_for_poylygenic/ | sed '/todo.txt/d');do
if [ ! -f ${geno_23andMe}/${individual}/PolygenicScores/EUR/${gwas}/${individual}.w_hm3.${gwas}.EUR.profiles ]; then
echo $gwas >> ${geno_23andMe}/${individual}/PolygenicScores/EUR/todo.txt
fi
done

# Calculate polygenic scores for for Europeans only for now.
for gwas in $(cat ${geno_23andMe}/${individual}/PolygenicScores/EUR/todo.txt);do
qsub -N $(echo PRS_job$(($n+10))) -hold_jid $(echo PRS_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer/Scaled_polygenic_scorer.R \
  --target_plink_chr ${geno_23andMe}/${individual}/${individual}.1KGphase3.hm3.chr \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${geno_23andMe}/${individual}/PolygenicScores/EUR/${gwas}/${individual}.w_hm3.${gwas}.EUR
n=$(($n+1))
done
done

###
# Calculate PRS adjusted for PCs
###
# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

# Create a files containing all EUR and AllAncestry PCs for EUR individuals in the reference
module add general/R/3.5.0
R

library(data.table)

geno_23andMe<-'/users/k1806347/brc_scratch/Data/23andMe'

for(i in c('Oliver_Pain','Kaili_Rimfeld')){
  EUR<-fread(paste0(geno_23andMe,'/',i,'/Projected_PCs/EUR/',i,'.w_hm3.EUR.eigenvec'))
  names(EUR)[-1:-2]<-paste0('EUR_',names(EUR)[-1:-2])
  
  All<-fread(paste0(geno_23andMe,'/',i,'/Projected_PCs/AllAncestry/',i,'.w_hm3.AllAncestry.EUR.eigenvec'))
  names(All)[-1:-2]<-paste0('AllAncestry_',names(All)[-1:-2])
  
  PCs<-merge(EUR,All,by=c('FID','IID'))
  
  write.table(PCs,paste0(geno_23andMe,'/',i,'/Projected_PCs/',i,'.AllAncestry_EUR_PCs.EUR.eigenvec'), col.names=T, row.names=F, quote=F)
}

q()
n

for individual in $(echo Oliver_Pain Kaili_Rimfeld); do
for gwas in $(echo EDUC03 HEIG03 BODY11); do
qsub /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer/Scaled_polygenic_scorer.R \
  --target_plink_chr ${geno_23andMe}/${individual}/${individual}.1KGphase3.hm3.chr \
  --target_covar ${geno_23andMe}/${individual}/Projected_PCs/${individual}.AllAncestry_EUR_PCs.EUR.eigenvec \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.EUR_PC_resid.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.EUR_PC_resid.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --covar_model ${Geno_1KG_dir}/Score_files_for_poylygenic/${gwas}/1KGPhase3.w_hm3.EUR_PC_resid.${gwas}.models.rds \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${geno_23andMe}/${individual}/PolygenicScores/EUR/${gwas}/${individual}.w_hm3.${gwas}.EUR.PCadjusted
done
done

module add general/R/3.5.0
module add compilers/gcc/8.1.0
module add general/python/2.7.10
R

opt$target_plink_chr<-'/users/k1806347/brc_scratch/Data/23andMe/Oliver_Pain/Oliver_Pain.1KGphase3.hm3.chr'
opt$target_covar<-'/users/k1806347/brc_scratch/Data/23andMe/Oliver_Pain/Projected_PCs/Oliver_Pain.AllAncestry_EUR_PCs.EUR.eigenvec'
opt$ref_score<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_poylygenic/EDUC03/1KGPhase3.w_hm3.EUR_PC_resid.EDUC03'
opt$ref_scale<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_poylygenic/EDUC03/1KGPhase3.w_hm3.EUR_PC_resid.EDUC03.EUR.scale'
opt$ref_freq_chr<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr'
opt$covar_model<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_poylygenic/EDUC03/1KGPhase3.w_hm3.EUR_PC_resid.EDUC03.models.rds'
opt$plink<-'/users/k1806347/brc_scratch/Software/plink1.9.sh'
opt$pheno_name<-'EDUC03'
opt$output<-'/users/k1806347/brc_scratch/Data/23andMe/Oliver_Pain/PolygenicScores/EUR/EDUC03/Oliver_Pain.w_hm3.EDUC03.EUR.PCadjusted'

3.2 lassosum

This section uses a script called ‘Scaled_polygenic_scorer_lassosum.R’. See more information on the utility of this script here for more information on this script.

3.2.1 UK Biobank

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

# Here we only calculate scores relevent to the phenotyps that we will use for comparing methods

for gwas in $(echo DEPR06);do
sbatch -n 10 --mem 5G -p brc,shared /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer_lassosum/Scaled_polygenic_scorer_lassosum.R \
  --target_plink_chr ${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
  --target_keep ${UKBB_output}/Projected_PCs/AllAncestry/UKBB.w_hm3.AllAncestry.EUR.keep \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic_lassosum/${gwas}/1KGPhase3.w_hm3.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic_lassosum/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --n_cores 10 \
  --output ${UKBB_output}/PolygenicScores_lassosum/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR
done

module add apps/R/3.6.0

opt$target_plink_chr<-'/users/k1806347/brc_scratch/Data/UKBB/Genotype/mini_test/UKBB.w_hm3.QCd.AllSNP.mini_test.chr'
opt$target_keep<-'/users/k1806347/brc_scratch/Data/UKBB/Projected_PCs/AllAncestry/UKBB.w_hm3.AllAncestry.EUR.keep'
opt$ref_score<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_poylygenic_lassosum/DEPR06/1KGPhase3.w_hm3.DEPR06'
opt$ref_scale<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_poylygenic_lassosum/DEPR06/1KGPhase3.w_hm3.DEPR06.EUR.scale'
opt$ref_freq_chr<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr'
opt$plink<-'/users/k1806347/brc_scratch/Software/plink1.9/plink'
opt$pheno_name<-'DEPR06'
opt$n_cores<-2
opt$output<-'/users/k1806347/brc_scratch/Data/UKBB/PolygenicScores_lassosum/EUR/DEPR06/UKBB.w_hm3.DEPR06.EUR'

3.2.2 TEDS

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

# Here we only calculate scores relevent to the phenotyps that we will use for comparing methods
n=0
for gwas in $(echo HEIG03 EDUC03 ADHD04 BODY11);do
qsub -N $(echo PRS_job$(($n+10))) -hold_jid $(echo PRS_job$(($n+0))) -pe smp 6 -l h_vmem=10G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer_lassosum/Scaled_polygenic_scorer_lassosum.R \
  --target_plink_chr /users/k1806347/brc_scratch/Data/TEDS/Genotype/Harmonised/TEDS.w_hm3.QCd.AllSNP.chr \
    --target_keep /users/k1806347/brc_scratch/Data/TEDS/Projected_PCs/Ancestry_idenitfier/TEDS.w_hm3.AllAncestry.EUR.keep \
  --ref_model ${Geno_1KG_dir}/Score_files_for_poylygenic_lassosum/${gwas}/1KGPhase3.w_hm3.${gwas}.unvalidated.model.RDS \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic_lassosum/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --pheno_name ${gwas} \
  --n_cores 6 \
  --output /users/k1806347/brc_scratch/Data/TEDS/PolygenicScores_lassosum/EUR/${gwas}/TEDS.w_hm3.${gwas}.EUR
n=$(($n+1))
done

3.2.3 23andMe

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

for individual in $(echo Kaili_Rimfeld); do

# Here we only calculate scores relevent to the phenotyps that we will use for comparing methods
n=0
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY03);do
qsub -N $(echo PRS_job$(($n+10))) -hold_jid $(echo PRS_job$(($n+0))) -pe smp 1 -l h_vmem=5G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer_lassosum/Scaled_polygenic_scorer_lassosum.R \
  --target_plink_chr ${geno_23andMe}/${individual}/${individual}.1KGphase3.hm3.chr \
  --target_keep ${geno_23andMe}/${individual}/Projected_PCs/AllAncestry/${individual}.w_hm3.AllAncestry.model_pred.EUR.keep \
  --ref_model ${Geno_1KG_dir}/Score_files_for_poylygenic_lassosum/${gwas}/1KGPhase3.w_hm3.${gwas}.unvalidated.model.RDS \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic_lassosum/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --pheno_name ${gwas} \
  --n_cores 1 \
  --output ${geno_23andMe}/${individual}/PolygenicScores_lassosum/EUR/${gwas}/${individual}.w_hm3.${gwas}.EUR
n=$(($n+1))
done

done

module add general/R/3.5.0
module add compilers/gcc/8.1.0
module add general/python/2.7.10
R

opt$target_plink_chr<-'/users/k1806347/brc_scratch/Data/23andMe/Kaili_Rimfeld/Kaili_Rimfeld.1KGphase3.hm3.chr'
opt$target_keep<-'/users/k1806347/brc_scratch/Data/23andMe/Kaili_Rimfeld/Projected_PCs/AllAncestry/Kaili_Rimfeld.w_hm3.AllAncestry.model_pred.EUR.keep'
opt$ref_model<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_poylygenic_lassosum/DEPR06/1KGPhase3.w_hm3.DEPR06.unvalidated.model.RDS'
opt$ref_scale<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_poylygenic_lassosum/DEPR06/1KGPhase3.w_hm3.DEPR06.EUR.scale'
opt$pheno_name<-'DEPR06'
opt$n_cores<-1
opt$output<-'/users/k1806347/brc_scratch/Data/23andMe/Kaili_Rimfeld/PolygenicScores_lassosum/EUR/DEPR06/Kaili_Rimfeld.w_hm3.DEPR06.EUR'


# Note. lassosum scoring failed with one individual.

3.3 PRScs

This section uses a script called ‘Scaled_polygenic_scorer_PRScs.R’. See more information on the utility of this script here for more information on this script.

3.3.1 UK Biobank

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

# Create a list of gwas that haven't got corresponding PRS files in UKBB
> ${UKBB_output}/PolygenicScores/EUR/todo_PRScs.txt
#for gwas in $(cut -f 1 -d ' ' ~/brc_scratch/Data/GWAS_sumstats/Largest_GWAS_over15K_withoutUKBB.txt);do
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY03 DIAB05 COAD01 CROH01 SCLE02 RHEU01);do
if [ ! -f ${UKBB_output}/PolygenicScores_PRScs/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR.PRScs_profiles ]; then
echo $gwas >> ${UKBB_output}/PolygenicScores/EUR/todo_PRScs.txt
fi
done

n=0
for gwas in $(cat ${UKBB_output}/PolygenicScores/EUR/todo_PRScs.txt);do
qsub -N $(echo PRScs_job$(($n+25))) -hold_jid $(echo PRScs_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer_PRScs/Scaled_polygenic_scorer_PRScs.R \
  --target_plink_chr ${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
  --target_keep ${UKBB_output}/Projected_PCs/AllAncestry/UKBB.w_hm3.AllAncestry.EUR.keep \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic_PRScs/${gwas}/1KGPhase3.w_hm3.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic_PRScs/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${UKBB_output}/PolygenicScores_PRScs/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR
n=$(($n+1))
done

3.3.2 TEDS

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

# Create a list of gwas that haven't got corresponding PRS files in TEDS
> ${TEDS_output_dir}/PolygenicScores/EUR/todo_PRScs.txt
for gwas in $(echo HEIG03 EDUC03 ADHD04 BODY11);do
if [ ! -f ${TEDS_output_dir}/PolygenicScores_PRScs/EUR/${gwas}/TEDS.w_hm3.${gwas}.EUR.PRScs_profiles ]; then
echo $gwas >> ${TEDS_output_dir}/PolygenicScores/EUR/todo_PRScs.txt
fi
done

n=0
for gwas in $(cat ${TEDS_output_dir}/PolygenicScores/EUR/todo_PRScs.txt);do
qsub -N $(echo PRScs_job$(($n+25))) -hold_jid $(echo PRScs_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer_PRScs/Scaled_polygenic_scorer_PRScs.R \
  --target_plink_chr ${TEDS_output_dir}/Genotype/Harmonised/TEDS.w_hm3.QCd.AllSNP.chr \
    --target_keep ${TEDS_output_dir}/Projected_PCs/Ancestry_idenitfier/TEDS.w_hm3.AllAncestry.EUR.keep \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic_PRScs/${gwas}/1KGPhase3.w_hm3.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic_PRScs/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${TEDS_output_dir}/PolygenicScores_PRScs/EUR/${gwas}/TEDS.w_hm3.${gwas}.EUR
n=$(($n+1))
done

3.3.3 23andMe

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

for individual in $(echo Oliver_Pain Kaili_Rimfeld Robert_Plomin); do

mkdir -p ${geno_23andMe}/${individual}/PolygenicScores_PRScs/EUR

# Create a list of gwas that haven't got corresponding PRS files in UKBB
> ${geno_23andMe}/${individual}/PolygenicScores/EUR/todo_PRScs.txt
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY03);do
if [ ! -f ${geno_23andMe}/${individual}/PolygenicScores_PRScs/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR.PRScs_profile ]; then
echo $gwas >> ${geno_23andMe}/${individual}/PolygenicScores/EUR/todo_PRScs.txt
fi
done

n=0
for gwas in $(cat ${geno_23andMe}/${individual}/PolygenicScores/EUR/todo_PRScs.txt);do
qsub -N $(echo PRScs_job$(($n+25))) -hold_jid $(echo PRScs_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer_PRScs/Scaled_polygenic_scorer_PRScs.R \
  --target_plink_chr ${geno_23andMe}/${individual}/${individual}.1KGphase3.hm3.chr \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic_PRScs/${gwas}/1KGPhase3.w_hm3.${gwas} \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic_PRScs/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${geno_23andMe}/${individual}/PolygenicScores_PRScs/EUR/${gwas}/${individual}.w_hm3.${gwas}.EUR
n=$(($n+1))
done

done

3.4 SBLUP

This section uses a script called ‘Scaled_polygenic_scorer_PRScs.R’. See more information on the utility of this script here for more information on this script.

3.4.1 UK Biobank

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

mkdir -p ${UKBB_output}/PolygenicScores_SBLUP/EUR

# Create a list of gwas that haven't got corresponding PRS files in UKBB
> ${UKBB_output}/PolygenicScores_SBLUP/EUR/todo_SBLUP.txt
for gwas in $(echo DEPR06 COLL01 HEIG03 BODY03 DIAB05 COAD01 CROH01 SCLE02 RHEU01);do
if [ ! -f ${UKBB_output}/PolygenicScores_SBLUP/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR.SBLUP_profiles ]; then
echo $gwas >> ${UKBB_output}/PolygenicScores_SBLUP/EUR/todo_SBLUP.txt
fi
done

n=0
for gwas in $(cat ${UKBB_output}/PolygenicScores_SBLUP/EUR/todo_SBLUP.txt);do
qsub -N $(echo SBLUP_job$(($n+25))) -hold_jid $(echo SBLUP_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer_SBLUP/Scaled_polygenic_scorer_SBLUP.R \
  --target_plink_chr ${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
  --target_keep ${UKBB_output}/Projected_PCs/AllAncestry/UKBB.w_hm3.AllAncestry.EUR.keep \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic_SBLUP/${gwas}/GWAS_sumstats_SBLUP.sblup.cojo \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic_SBLUP/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${UKBB_output}/PolygenicScores_SBLUP/EUR/${gwas}/UKBB.w_hm3.${gwas}.EUR
n=$(($n+1))
done

3.4.2 TEDS

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

mkdir -p ${TEDS_output_dir}/PolygenicScores_SBLUP/EUR

# Create a list of gwas that haven't got corresponding PRS files in TEDS
> ${TEDS_output_dir}/PolygenicScores_SBLUP/EUR/todo_SBLUP.txt
for gwas in $(echo HEIG03 EDUC03 ADHD04 BODY11);do
if [ ! -f ${TEDS_output_dir}/PolygenicScores_SBLUP/EUR/${gwas}/TEDS.w_hm3.${gwas}.EUR.SBLUP_profiles ]; then
echo $gwas >> ${TEDS_output_dir}/PolygenicScores_SBLUP/EUR/todo_SBLUP.txt
fi
done

n=0
for gwas in $(cat ${TEDS_output_dir}/PolygenicScores_SBLUP/EUR/todo_SBLUP.txt);do
qsub -N $(echo SBLUP_job$(($n+25))) -hold_jid $(echo SBLUP_job$(($n+0))) /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/Scaled_polygenic_scorer_SBLUP/Scaled_polygenic_scorer_SBLUP.R \
  --target_plink_chr ${TEDS_output_dir}/Genotype/Harmonised/TEDS.w_hm3.QCd.AllSNP.chr \
    --target_keep ${TEDS_output_dir}/Projected_PCs/Ancestry_idenitfier/TEDS.w_hm3.AllAncestry.EUR.keep \
  --ref_score ${Geno_1KG_dir}/Score_files_for_poylygenic_SBLUP/${gwas}/GWAS_sumstats_SBLUP.sblup.cojo \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_poylygenic_SBLUP/${gwas}/1KGPhase3.w_hm3.${gwas}.EUR.scale \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr \
  --plink ${plink1_9} \
  --pheno_name ${gwas} \
  --output ${TEDS_output_dir}/PolygenicScores_SBLUP/EUR/${gwas}/TEDS.w_hm3.${gwas}.EUR
n=$(($n+1))
done

4 Functionally-informed polygenic scoring

Here we calculate functionally-informed poylgenic scores for a range of outcomes in the target samples based on previously prepared reference scoring files, and scale the polygenic scores based on scores in a reference sample with matched ancestry. For more information on preparing reference files for polygenic scoring, see here.

First we need to predict the functional genomic features in the target sample, then we calculate the functionally-informed polygenic scores by combining this information with previously prepared score files.

4.1 Predicting functional genomic features

Here we predict functional genomic features into the target samples. Currently the pipeline only includes features representing gene expression levels. The multiple SNP-predictors of gene expression have been downloaded from the FUSION website, and converted into the correct format for predicting functional features in a target samples. For more information see here.

This section uses a script called ‘FUSION_targ_scorer.R’. See more information on the utility of this script here for more information on this script.

4.1.1 UK Biobank

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

for pop in $(echo AFR AMR EAS EUR SAS);do
# Create file listing weights that haven't been predicted in the reference.
mkdir -p ${UKBB_output}/Predicted_expression/FUSION/${pop}
> ${UKBB_output}/Predicted_expression/FUSION/${pop}/todo.txt
for weights in $(cat ~/brc_scratch/Data/TWAS_sumstats/FUSION/snp_weight_list.txt);do
if [ ! -f ${UKBB_output}/Predicted_expression/FUSION/${pop}/${weights}/UKBB.w_hm3.QCd.AllSNP.FUSION.${weights}.predictions.gz ]; then
echo $weights >> ${UKBB_output}/Predicted_expression/FUSION/${pop}/todo.txt
fi
done
done

for pop in $(echo AFR AMR EAS EUR SAS);do
mkdir ${UKBB_output}/Predicted_expression/FUSION/${pop}
cat <<EOF >${UKBB_output}/Predicted_expression/FUSION/${pop}/todo_array.sh
#!/bin/bash
#SBATCH -p shared,brc
#SBATCH --mem 40G
#SBATCH -n 5

. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

weights=\$(awk -v myvar=\${SLURM_ARRAY_TASK_ID} 'NR == myvar {print}' ${UKBB_output}/Predicted_expression/FUSION/${pop}/todo.txt)
echo \${SLURM_ARRAY_TASK_ID}
echo \$weights

/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/FUSION_targ_scorer/FUSION_targ_scorer.R \
  --targ_plink_chr \${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
  --targ_keep \${UKBB_output}/Phenotype/${pop}.QC.UpdateIDs.keep \
  --weights \${FUSION_dir}/SNP-weights/\${weights}/\${weights}.pos \
  --plink \${plink1_9} \
  --n_cores 5 \
  --memory 40000 \
  --score_files \${FUSION_dir}/SCORE_FILES \
  --ref_scale \${Geno_1KG_dir}/Predicted_expression/FUSION/\${weights}/1KGPhase3.w_hm3.FUSION.\${weights}.${pop}.scale \
  --pigz \${pigz_binary} \
  --output \${UKBB_output}/Predicted_expression/FUSION/${pop}/\${weights}/UKBB.w_hm3.QCd.AllSNP.FUSION.\${weights} \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/${pop}/1KGPhase3.w_hm3.${pop}.chr

EOF
done

for pop in $(echo AMR EAS SAS);do
sbatch --array=1-$(wc -l ${UKBB_output}/Predicted_expression/FUSION/${pop}/todo.txt | cut -d ' ' -f 1)%5 ${UKBB_output}/Predicted_expression/FUSION/${pop}/todo_array.sh
done

######
# Predict gene expression into EUR_test subset for diverse ancestry analysis
######

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

for pop in $(echo EUR_test);do
# Create file listing weights that haven't been predicted in the reference.
mkdir -p ${UKBB_output}/Predicted_expression/FUSION/${pop}
> ${UKBB_output}/Predicted_expression/FUSION/${pop}/todo.txt
for weights in $(cat ~/brc_scratch/Data/TWAS_sumstats/FUSION/snp_weight_list.txt);do
if [ ! -f ${UKBB_output}/Predicted_expression/FUSION/${pop}/${weights}/UKBB.w_hm3.QCd.AllSNP.FUSION.${weights}.predictions.gz ]; then
echo $weights >> ${UKBB_output}/Predicted_expression/FUSION/${pop}/todo.txt
fi
done
done

for pop in $(echo EUR_test);do
cat <<EOF >${UKBB_output}/Predicted_expression/FUSION/${pop}/todo_array.sh
#!/bin/bash
#SBATCH -p shared,brc
#SBATCH --mem 40G
#SBATCH -n 5

. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

weights=\$(awk -v myvar=\${SLURM_ARRAY_TASK_ID} 'NR == myvar {print}' ${UKBB_output}/Predicted_expression/FUSION/${pop}/todo.txt)
echo \${SLURM_ARRAY_TASK_ID}
echo \$weights

/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/FUSION_targ_scorer/FUSION_targ_scorer.R \
  --targ_plink_chr \${UKBB_output}/Genotype/Harmonised/UKBB.w_hm3.QCd.AllSNP.chr \
  --targ_keep /users/k1806347/brc_scratch/Misc/Toshiki/Phenotypes/EUR_test.keep \
  --weights \${FUSION_dir}/SNP-weights/\${weights}/\${weights}.pos \
  --plink \${plink1_9} \
  --n_cores 5 \
  --memory 40000 \
  --score_files \${FUSION_dir}/SCORE_FILES \
  --ref_scale \${Geno_1KG_dir}/Predicted_expression/FUSION/\${weights}/1KGPhase3.w_hm3.FUSION.\${weights}.EUR.scale \
  --pigz \${pigz_binary} \
  --output \${UKBB_output}/Predicted_expression/FUSION/${pop}/\${weights}/UKBB.w_hm3.QCd.AllSNP.FUSION.\${weights} \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr

EOF
done

for pop in $(echo EUR_test);do
sbatch --array=1-$(wc -l ${UKBB_output}/Predicted_expression/FUSION/${pop}/todo.txt | cut -d ' ' -f 1)%5 ${UKBB_output}/Predicted_expression/FUSION/${pop}/todo_array.sh
done

4.1.2 TEDS

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

# Create file listing weights that haven't been predicted in the reference.
mkdir -p ${TEDS_output_dir}/Predicted_expression/FUSION/EUR
> ${TEDS_output_dir}/Predicted_expression/FUSION/EUR/todo.txt
for weights in $(cat ~/brc_scratch/Data/TWAS_sumstats/FUSION/snp_weight_list.txt);do
if [ ! -f ${TEDS_output_dir}/Predicted_expression/FUSION/EUR/${weights}/TEDS.w_hm3.QCd.AllSNP.FUSION.${weights}.predictions.gz ]; then
echo $weights >> ${TEDS_output_dir}/Predicted_expression/FUSION/EUR/todo.txt
fi
done

n=0
for weights in $(cat ${TEDS_output_dir}/Predicted_expression/FUSION/EUR/todo.txt);do
qsub -N $(echo Predjob_$(($n+20))) -hold_jid $(echo Predjob_$(($n+0))) -pe smp 3 -l h_vmem=10G /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/FUSION_targ_scorer/FUSION_targ_scorer.R \
  --targ_plink_chr ${TEDS_output_dir}/Genotype/Harmonised/TEDS.w_hm3.QCd.AllSNP.chr \
  --targ_keep ${TEDS_output_dir}/Projected_PCs/Ancestry_idenitfier/TEDS.w_hm3.AllAncestry.EUR.keep \
  --weights ${FUSION_dir}/SNP-weights/${weights}/${weights}.pos \
  --plink ${plink1_9} \
  --n_cores 3 \
  --memory 40000 \
  --score_files ${FUSION_dir}/SCORE_FILES \
  --ref_scale ${Geno_1KG_dir}/Predicted_expression/FUSION/${weights}/1KGPhase3.w_hm3.FUSION.${weights}.EUR.scale \
  --pigz ${pigz_binary} \
  --output ${TEDS_output_dir}/Predicted_expression/FUSION/EUR/${weights}/TEDS.w_hm3.QCd.AllSNP.FUSION.${weights} \
  --ref_freq_chr ${Geno_1KG_dir}/freq_files/EUR/1KGPhase3.w_hm3.EUR.chr
n=$(($n+1))
done

4.2 Calculating functionally-informed polygenic scores

Here we use the predicted functional genomic features and previously prepared reference score files to calculate functionally informed polygenic scores in the target samples. For more information see here.

This section uses a script called ‘Scaled_polygenic_scorer.R’. See more information on the utility of this script here for more information on this script.

4.2.1 UK Biobank

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config

# Create a list of gwas-tissue combinations that haven't been calculated.
> ${UKBB_output}/FunctionallyInformedPolygenicScores/EUR/todo.txt
for gwas in $(cat ${TWAS_rep}/Largest_TWAS_over15K_withoutUKBB_75comp.txt);do
for weights in $(cat ~/brc_scratch/Data/TWAS_sumstats/FUSION/snp_weight_list.txt);do
if [ ! -f ${UKBB_output}/FunctionallyInformedPolygenicScores/EUR/${weights}/UKBB.w_hm3.EUR.${weights}.${gwas}.fiprofile ]; then
echo $gwas $weights >> ${UKBB_output}/FunctionallyInformedPolygenicScores/EUR/todo.txt
fi
done
done

n=0
for i in $(seq 1 $(wc -l ${UKBB_output}/FunctionallyInformedPolygenicScores/EUR/todo.txt | cut -d ' ' -f 1));do
gwas=$(awk -v var="$i" 'NR == var {print $1}' ${UKBB_output}/FunctionallyInformedPolygenicScores/EUR/todo.txt)
weights=$(awk -v var="$i" 'NR == var {print $2}' ${UKBB_output}/FunctionallyInformedPolygenicScores/EUR/todo.txt)

qsub -N $(echo GeRS_job$(($n+10))) -hold_jid $(echo GeRS_job$(($n+0))) -l h_vmem=4G -pe smp 5 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/scaled_functionally_informed_risk_scorer/scaled_functionally_informed_risk_scorer.R \
  --targ_feature_pred ${UKBB_output}/Predicted_expression/FUSION/EUR/${weights}/UKBB.w_hm3.QCd.AllSNP.FUSION.${weights}.predictions.gz \
  --ref_score ${Geno_1KG_dir}/Score_files_for_functionally_informed_risk_scores/${gwas}/1KGPhase3.w_hm3.EUR.FUSION.${gwas}.${weights}.score \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_functionally_informed_risk_scores/${gwas}/1KGPhase3.w_hm3.EUR.FUSION.${gwas}.${weights}.scale \
  --pheno_name ${gwas} \
  --n_cores 5 \
  --pigz ${pigz_binary} \
  --output ${UKBB_output}/FunctionallyInformedPolygenicScores/EUR/${weights}/UKBB.w_hm3.EUR.${weights}.${gwas}

n=$(($n+1))
done

4.2.2 TEDS

See code

# Set required variables
. /users/k1806347/brc_scratch/Software/MyGit/GenoPred/config_used/Target_scoring.config
mkdir -p ${TEDS_output_dir}/FunctionallyInformedPolygenicScores/EUR

# Create a list of gwas-tissue combinations that haven't been calculated.
> ${TEDS_output_dir}/FunctionallyInformedPolygenicScores/EUR/todo.txt
# for gwas in $(cat ${TWAS_rep}/Largest_TWAS_over15K_withoutTEDS_75comp.txt);do
for gwas in $(echo HEIG03 EDUC03 ADHD04 BODY11);do
for weights in $(cat ~/brc_scratch/Data/TWAS_sumstats/FUSION/snp_weight_list.txt);do
if [ ! -f ${TEDS_output_dir}/FunctionallyInformedPolygenicScores/EUR/${weights}/TEDS.w_hm3.EUR.${weights}.${gwas}.fiprofile ]; then
echo $gwas $weights >> ${TEDS_output_dir}/FunctionallyInformedPolygenicScores/EUR/todo.txt
fi
done
done

n=0
for i in $(seq 1 $(wc -l ${TEDS_output_dir}/FunctionallyInformedPolygenicScores/EUR/todo.txt | cut -d ' ' -f 1));do
gwas=$(awk -v var="$i" 'NR == var {print $1}' ${TEDS_output_dir}/FunctionallyInformedPolygenicScores/EUR/todo.txt)
weights=$(awk -v var="$i" 'NR == var {print $2}' ${TEDS_output_dir}/FunctionallyInformedPolygenicScores/EUR/todo.txt)

qsub -N $(echo GeRS_job$(($n+10))) -hold_jid $(echo GeRS_job$(($n+0))) -l h_vmem=4G -pe smp 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/GenoPred/Scripts/scaled_functionally_informed_risk_scorer/scaled_functionally_informed_risk_scorer.R \
  --targ_feature_pred ${TEDS_output_dir}/Predicted_expression/FUSION/EUR/${weights}/TEDS.w_hm3.QCd.AllSNP.FUSION.${weights}.predictions.gz \
  --ref_score ${Geno_1KG_dir}/Score_files_for_functionally_informed_risk_scores/${gwas}/1KGPhase3.w_hm3.EUR.FUSION.${gwas}.${weights}.score \
  --ref_scale ${Geno_1KG_dir}/Score_files_for_functionally_informed_risk_scores/${gwas}/1KGPhase3.w_hm3.EUR.FUSION.${gwas}.${weights}.scale \
  --pheno_name ${gwas} \
  --n_cores 1 \
  --pigz ${pigz_binary} \
  --output ${TEDS_output_dir}/FunctionallyInformedPolygenicScores/EUR/${weights}/TEDS.w_hm3.EUR.${weights}.${gwas}

module add general/R/3.5.0
module add compilers/gcc/8.1.0
module add general/python/2.7.10
R

opt$targ_feature_pred<-'/users/k1806347/brc_scratch/Data/TEDS/Predicted_expression/FUSION/EUR/Adipose_Subcutaneous/TEDS.w_hm3.QCd.AllSNP.FUSION.Adipose_Subcutaneous.predictions.gz'
opt$ref_score<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_functionally_informed_risk_scores/HEIG03/1KGPhase3.w_hm3.EUR.FUSION.HEIG03.Adipose_Subcutaneous.score'
opt$ref_scale<-'/users/k1806347/brc_scratch/Data/1KG/Phase3/Score_files_for_functionally_informed_risk_scores/HEIG03/1KGPhase3.w_hm3.EUR.FUSION.HEIG03.Adipose_Subcutaneous.scale'
opt$pheno_name<-'HEIG03'
opt$n_cores<-1
opt$pigz<-'/users/k1806347/brc_scratch/Data/TEDS'
opt$output<-'/users/k1806347/brc_scratch/Data/TEDS/FunctionallyInformedPolygenicScores/EUR/Adipose_Subcutaneous/TEDS.w_hm3.EUR.Adipose_Subcutaneous.HEIG03'

n=$(($n+1))
done

Genotype-based scoring in target samples

1 Harmonisation with the reference

1.1 UK Biobank

1.2 TEDS

1.3 23andMe

1.3.1 Format and impute

2 Ancestry scoring

2.1 UK Biobank

2.2 TEDS

2.3 23andMe

3 Polygenic scoring

3.1 pT + clump

3.1.1 UK Biobank

3.1.1.1 Sparse p-value thresholding

3.1.1.2 Dense p-value thresholding

3.1.2 TEDS

3.1.2.1 Sparse p-value thresholding

3.1.2.2 Dense p-value thresholding

3.1.3 23andMe

3.2 lassosum

3.2.1 UK Biobank

3.2.2 TEDS

3.2.3 23andMe

3.3 PRScs

3.3.1 UK Biobank

3.3.2 TEDS

3.3.3 23andMe

3.4 SBLUP

3.4.1 UK Biobank

3.4.2 TEDS

4 Functionally-informed polygenic scoring

4.1 Predicting functional genomic features

4.1.1 UK Biobank

4.1.2 TEDS

4.2 Calculating functionally-informed polygenic scores

4.2.1 UK Biobank

4.2.2 TEDS