Introduction

The UK Biobank (UKB) has recently updated its data access policy, requiring researchers to access and analyze UKB data exclusively via the UKB Research Analysis Platform (UKB-RAP), hosted by DNAnexus. This represents a significant shift in workflows, moving from institutional servers to a cloud-computing environment. DNAnexus and similar cloud-computing systems are likely to become standard for future datasets, so this guide aims to provide instructions on running GenoPred on DNAnexus, with a particular focus on analyzing UKB data. The UKB dataset presents unique challenges due to its size, making efficient analysis essential.

Cloud computing involves requesting access to an instance (or virtual machine) with a specified amount of resources, such as disk space, RAM, and the number of CPU cores. Once access is granted, users must create the required software and data environment within the instance—a step that is often unfamiliar to many. After completing the desired analyses, users must export any outputs they wish to retain, as any data or software left on the instance will be deleted once it is terminated.

This process of transferring software and data into and out of the instance is a significant departure from the experience of working on personal computers or institutional servers, where you can resume work seamlessly each time you log in.


RStudio vs. Cloud Workstation

Currently, the easiest way to run GenoPred on DNAnexus is interactively, using either RStudio or a Cloud Workstation with the GenoPred software container. Each approach offers unique advantages.

RStudio mounts files directly from a DNAnexus project, eliminating the need to import data. This feature saves time and disk space, particularly with large datasets like UKB genetic data.

Unlike RStudio, the cloud workstation requires input data to be manually imported using dx download, which can be time-consuming and requires additional storage space. While dxfuse allows project folders to be mounted (even across projects), it is prone to instability during long analyses, such as when using UKB.

Another distinction is that cloud workstations can be connected to via VScode, which some people may prefer to Rstudio.

Given the advantage of stable project folder mounting in RStudio, I will demonstrate the workflow in that context. For the demo, minimal resources are required, so I will request an mem1_ssd1_v2_x2 instance type. A similar workflow can also be applied in the Cloud Workstation, substituting the Singularity container with the GenoPred Docker container.


Demonstration using Rstudio


Step 0: Prepare test data

To make the demonstration as similar as possible to working with UKB data, I will first upload a version of the GenoPred test data into a DNAnexus project. While it would be simpler to download the test data directly into the instance, this approach will better mimic the set up when working with UKB data.

# Step 1: Download the GenoPred test data from Zenodo
wget -O test_data.tar.gz https://zenodo.org/records/10640650/files/test_data.tar.gz?download=1

# Step 2: Decompress the downloaded file
tar -xf test_data.tar.gz

# Step 3: Extract only the necessary files for the demonstration
mkdir genopred_test_data
mv test_data/target/imputed_sample_plink2/example.chr22* genopred_test_data/
mv test_data/reference/gwas_sumstats/BODY04.gz genopred_test_data/

# Step 4: Load the Conda environment with dxpy installed
# The Conda environment file for installing dxpy is available in the GenoPred repository
# Path: GenoPred/pipeline/misc/dnanexus/dxpy_env.yml
conda activate dxpy_env

# Step 5: Log in to DNAnexus using an API token, if not already logged in
# API tokens can be created on the DNAnexus website (https://platform.dnanexus.com/)
dx login --token <token>

# Step 6: Select the desired DNAnexus project
dx select genopred_demo

# Step 7: Upload the prepared data to DNAnexus
dx upload -r genopred_test_data

# Step 8: Clean up temporary files to save space
rm -r test_data test_data.tar.gz genopred_test_data

Step 1. Install Singularity and Download the GenoPred Container

Once your RStudio session has started, install Singularity via the terminal. Follow the commands below to set up Singularity, along with other essential tools:

# Configure keyboard layout (optional step to avoid prompts)
echo 'keyboard-configuration keyboard-configuration/layout select us' | sudo debconf-set-selections
echo 'keyboard-configuration keyboard-configuration/variant select English (US)' | sudo debconf-set-selections

# Install required dependencies (e.g., tmux, build tools, and libraries for Singularity)
sudo DEBIAN_FRONTEND=noninteractive apt update && \
sudo DEBIAN_FRONTEND=noninteractive apt install -y build-essential libseccomp-dev pkg-config squashfs-tools cryptsetup golang tmux

# Set the Singularity version to install
export VERSION=3.11.0

# Download and extract the Singularity source code
wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-ce-${VERSION}.tar.gz
tar -xvzf singularity-ce-${VERSION}.tar.gz
cd singularity-ce-${VERSION}

# Configure and build Singularity without SUID
./mconfig --without-suid
make -C builddir

# Install Singularity system-wide
sudo make -C builddir install

# Return to the home directory
cd ~

Note: This will need to be done for every new instance of Rstudio that you want to run GenoPred. For convenience, you can store the code in a shell script.

After installing Singularity, download the GenoPred container using the following command:

# Pull the GenoPred container
singularity pull library://opain/genopred/genopred_pipeline:latest

Note: To make your workflow more efficient and reproducible, consider saving the downloaded container file in your project folder. This allows you to re-use the container in future analyses without needing to download it again.


Step 2. Import input data

In a DNAnexus project, data is automatically mounted within the RStudio session and can be accessed from the /mnt/project directory. However, if the data you need is not located in the project folder, you will need to import it manually using the dx download command.

For this demonstration, we are working with the small test dataset, so I will show how to import it. When working with larger datasets, such as UKB genetic data, it is often more efficient to use the mounted version of the dispensed data. This approach saves time and disk space by avoiding the need to download and store large files within the RStudio instance.

# View the files in your current project folder
dx ls

# More instructions on using the dx commands can be found here:
# https://documentation.dnanexus.com/getting-started/cli-quickstart

# Import the test data into the instance
dx download -r genopred_test_data

Step 3. Set up pipeline configuration

Now, we will create the configuration files required to run the GenoPred pipeline. Note that the outdir and resdir must be set to directories located outside the container to ensure proper access and storage.

# Create directories for configuration and output files
dir.create('/home/rstudio-server/genopred/config', recursive = TRUE)
dir.create('/home/rstudio-server/genopred/output', recursive = TRUE)

# Create gwas_list configuration
gwas_list <- data.frame(
  name = 'BODY04', 
  path = '/home/rstudio-server/genopred_test_data/BODY04.gz',
  population = 'EUR',
  n = NA,
  sampling = NA,
  prevalence = NA,
  mean = 0,
  sd = 1,
  label = '"Body Mass Index"'
)

write.table(
  gwas_list, 
  '/home/rstudio-server/genopred/config/gwas_list.txt', 
  col.names = TRUE, 
  row.names = FALSE, 
  quote = FALSE, 
  sep = ' '
)

# Create target_list configuration
target_list <- data.frame(
  name = 'example_plink1',
  path = '/home/rstudio-server/genopred_test_data/example',
  type = 'plink2',
  indiv_report = FALSE
)

write.table(
  target_list, 
  '/home/rstudio-server/genopred/config/target_list.txt', 
  col.names = TRUE, 
  row.names = FALSE, 
  quote = FALSE, 
  sep = ' '
)

# Create main configuration file
conf <- c(
  'outdir: /home/rstudio-server/genopred/output',
  'resdir: /home/rstudio-server/genopred/resources',
  'config_file: /home/rstudio-server/genopred/config/config.yaml',
  'gwas_list: /home/rstudio-server/genopred/config/gwas_list.txt',
  'target_list: /home/rstudio-server/genopred/config/target_list.txt',
  "pgs_methods: ['ptclump']",
  'testing: chr22'
)

writeLines(
  conf, 
  '/home/rstudio-server/genopred/config/config.yaml'
)

Step 4. Run the pipeline in the container

Now we can run the GenoPred pipeline. This can be done either interactively within the container or by executing the desired commands directly. Running the pipeline interactively is often more convenient for performing a dry run before launching the full analysis.


Start an Interactive Session in the Container

To ensure that your analysis persists even when the RStudio server tab is closed, it is recommended to start the container within a tmux session. This will allow you to detach and reattach to the session as needed.

Start a tmux session within the terminal by running:

tmux

This will take you into a tmux session. You can ‘detach’ from the tmux session by pressing Ctrl+b d, and reattach to the session in the future by typing:

tmux attach

Further instructions on using tmux can be found here.

To begin, start an interactive session within the Singularity container. Make sure to mount the home directory within the RStudio session to store the outputs:

singularity shell \
  --bind /home/rstudio-server:/home/rstudio-server \
  --writable-tmpfs \
  /home/rstudio-server/genopred_pipeline_latest.sif

Run GenoPred Inside the Container

Once inside the container, you can use the GenoPred pipeline as usual:

# Activate the GenoPred Environment:
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred

# Navigate to the Pipeline Folder:
cd /tools/GenoPred/pipeline

# Perform a Dry Run: 
# A dry run checks the pipeline's steps without executing them, helping you identify any missing dependencies or issues:
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/config.yaml output_all

# Run the Pipeline: 
# Once satisfied with the dry run, execute the pipeline:
snakemake -j1 --use-conda --configfile=/home/rstudio-server/genopred/config/config.yaml output_all

While this analysis is running, you can detach from the tmux session, close the RStudio tab, and close your browser. When you reopen the RStudio app, you may see that your session appears suspended. Do not worry—your analysis will continue running in the background.

By using tmux, your analysis will continue to run even if the terminal session or RStudio server tab is closed.

After the analysis is complete, you can leave the container by typing:

exit

Step 5. Export ouputs to the project folder

To avoid losing the outputs of your analysis when the RStudio session is terminated, you need to export the results to your DNAnexus project folder. Both the resdir (resources) and outdir (outputs) should be saved for future analyses.

For simplicity and efficiency, we will compress the outputs and resources into a single tar file and then upload it to the DNAnexus project. If you plan to reuse the resources (e.g., for different pipeline configurations), you may choose to store them in separate tar files.

# Compress the GenoPred working directory
cd /home/rstudio-server
tar -cvf test_run_genopred.tar genopred

# Upload the GenoPred container
dx upload genopred_pipeline_latest.sif

# Upload the tar file containing pipeline resources and outputs
dx upload test_run_genopred.tar

Once the files are uploaded, you can safely terminate the RStudio session. Ensure the session is fully terminated by checking the Monitor tab in your DNAnexus project folder.


Extending your analysis in the future

If you want to extend your analysis without rerunning steps that have already completed, you can start a new RStudio session, import the outputs from a previous run, and resume the pipeline from within the container. Note that you will also need to import the input data used in the previous analysis.

# Download the Outputs from the Previous Run:
dx download test_run_genopred.tar
tar -xvf test_run_genopred.tar

# Download the Singularity Container:
dx download genopred_pipeline_latest.sif

# Download the Input Data:
dx download -r genopred_test_data

When using dx download for files that are not part of a tar archive, the original timestamps are lost. This may confuse GenoPred, as it will interpret the files as being updated. To fix this, reset the timestamps of the input files:

find /home/rstudio-server/genopred_test_data/ -type f -exec touch -t 200001010101.01 {} +

If the input data is accessed via the automatic mount in /mnt/project, this step is unnecessary, as the timestamps are preserved. This another advantage of mounting the input data.

# Start an Interactive Session in the Container:
singularity shell \
  --bind /home/rstudio-server:/home/rstudio-server \
  --writable-tmpfs \
  /home/rstudio-server/genopred_pipeline_latest.sif

# Activate the GenoPred Environment:
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred

# Navigate to the Pipeline Folder:
cd /tools/GenoPred/pipeline

### Check the Pipeline State: Run a dry run to verify the current state of the pipeline:
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/config.yaml output_all

The pipeline will indicate that there is nothing to be done if the configuration has not changed and all outputs are up to date.

If an input file is updated (e.g., by changing its timestamp), the pipeline will automatically rerun only the necessary steps:

touch /home/rstudio-server/genopred_test_data/BODY04.gz
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/config.yaml output_all

This dry run will show which steps need to be re-executed due to the update.


Running GenoPred with UK Biobank

The UKB imputed genetic data is provided without post-imputation QC, resulting in large files. The format_target step of GenoPred, which reformats the target genetic data, is time-intensive but reduces file size to ~86GB, making future analyses faster and cheaper. Storing this output for reuse is highly recommended. Additionally, selecting instances with appropriate resources for each pipeline stage ensures cost efficiency, as some steps utilize multiple cores while others do not.


Key Adjustments for UKB Data

  • Mounted UKB Data: Avoid duplicating the large UKB data files by using a mounted version to save time and disk space.
  • File Naming with Symbolic Links: Update file names as required by GenoPred using symbolic links to avoid duplicating data.
  • High-Resource Instance for format_target: Use a mem3_ssd2_v2_x8 instance, which supports 8 processes and provides sufficient RAM and disk space for processing the large UKB files efficiently. Other steps of the pipeline will have other resources requirements.

Workflow

  • Run format_target on a High-Resource Instance: Use the mem3_ssd2_v2_x8 instance to efficiently complete the format_target step.
  • Export and Store Processed Data: Save the reformatted data for future use, avoiding repeated processing.
  • Terminate the High-Resource Instance: After completing format_target, terminate the instance to minimize costs.
  • Continue Analysis on Cost-Effective Instances: For downstream steps, switch to an instance with resources tailored to those stages.
  • This approach balances efficiency and cost when working with large-scale UKB genetic data in the GenoPred pipeline.

Step 1. Install Singularity

Once your RStudio session has started, install Singularity via the terminal. We will use the shell script we created here:

# Download, update permissions and run script to install singularity
dx download install_singularity.sh
chmod a+x install_singularity.sh
./install_singularity

Step 2. Prepare input data

We will use a mounted version of the UKB genetic data to save time and disk space. To meet GenoPred’s file name requirements without duplicating data, we will create symbolic links. Given the size of the UKB genetic data, we will also request an instance with sufficient resources.

# Create symlinks to the dispensed imputed genetic data
mkdir -p /home/rstudio-server/ukb/ukb_symlinks

# Link bgen and bgen.bgi files for all chromosomes
for chr in $(seq 1 22); do
  for file in $(echo bgen bgen.bgi); do
    ln -s /mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype/ukb22828_c${chr}_b0_v3.${file} /home/rstudio-server/ukb/ukb_symlinks/ukb_imp.chr${chr}.${file}
  done
done

# Link the sample file (same for all chromosomes)
ln -s /mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype/ukb22828_c1_b0_v3.sample /home/rstudio-server/ukb/ukb_symlinks/ukb_imp.sample

Step 3. Set up pipeline configuration

Next, create the configuration files required to run the GenoPred pipeline. Ensure that outdir and resdir are set to directories outside the container for proper access and storage.

# Create directories for configuration and output files
dir.create('/home/rstudio-server/genopred/config/ukb/basic', recursive = T)
dir.create('/home/rstudio-server/genopred/output', recursive = T)

# Create target list
# We are specifying the symbolic links we made for the UKB data
target_list <- data.frame(
  name='ukb',
  path='/home/rstudio-server/ukb/ukb_symlinks/ukb_imp',
  type='bgen',
  indiv_report=F
)

write.table(
  target_list,
  '/home/rstudio-server/genopred/config/ukb/basic/target_list.txt',
  col.names = T,
  row.names = F,
  quote = F
)

# Create config file
conf <- c(
  'outdir: /home/rstudio-server/genopred/output',
  'config_file: /home/rstudio-server/genopred/config/ukb/basic/config.yaml',
  'resdir: /home/rstudio-server/genopred/resources',
  'target_list: /home/rstudio-server/genopred/config/ukb/basic/target_list.txt'
)

write.table(
  conf,
  '/home/rstudio-server/genopred/config/ukb/basic/config.yaml',
  col.names = F,
  row.names = F,
  quote = F
)

Step 4. Run the pipeline in the container

Now we can run the GenoPred pipeline. This can be done either interactively within the container or by executing the desired commands directly. Running the pipeline interactively is often more convenient for performing a dry run before launching the full analysis.


Start an Interactive Session in the Container

To ensure that your analysis persists even when the RStudio server tab is closed, it is recommended to start the container within a tmux session. This will allow you to detach and reattach to the session as needed.

Start a tmux session within the terminal by running:

tmux

This will take you into a tmux session. You can ‘detach’ from the tmux session by pressing Ctrl+b d, and reattach to the session in the future by typing:

tmux attach

Further instructions on using tmux can be found here.

To begin, start an interactive session within the Singularity container. Make sure to mount the home directory within the RStudio session and the directory that the symbolic links point to, to store the pipeline outputs and access the input data within the container.

singularity shell \
  --bind /home/rstudio-server:/home/rstudio-server \
  --bind /mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype:/mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype \
  --writable-tmpfs \
  /mnt/project/genopred_pipeline_latest.sif

Run GenoPred Inside the Container

Once inside the container, you can use the GenoPred pipeline as usual. The resources provided by this instance (mem3_ssd2_v2_x8) will not be required for all steps in the GenoPred workflow, so to be cost efficient, I am just carrying out the format_target step within this instance. I will then export the data, terminate this instance, continue my analysis using a new instance with appropriate resources for downstream steps.

# Activate the GenoPred Environment:
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred

# Navigate to the Pipeline Folder:
cd /tools/GenoPred/pipeline

# Perform a Dry Run: 
# A dry run checks the pipeline's steps without executing them, helping you identify any missing dependencies or issues:
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/basic/config.yaml format_target

# Run the Pipeline: 
# Once satisfied with the dry run, execute the pipeline:
snakemake -j8 --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/basic/config.yaml format_target

Note: Since the instance has 8 cores available, I use the -j8 parameter when running snakemake to ensure it utilizes all 8 cores efficiently.

After the analysis is complete, you can leave the container by typing:

exit

In total, this analysis took ~18 hours, costing ~£6.


Step 5. Export ouputs to the project folder

To avoid losing the outputs of your analysis when the RStudio session is terminated, you need to export the results to your DNAnexus project folder. Both the resdir (resources) and outdir (outputs) should be saved for future analyses.

For simplicity and efficiency, we will compress the outputs and resources into a single tar file and then upload it to the DNAnexus project. If you plan to reuse the resources (e.g., for different pipeline configurations), you may choose to store them in separate tar files.

# Compress and upload the GenoPred working directory
cd /home/rstudio-server
tar -cvf ukb_genopred.tar genopred
dx upload ukb_genopred.tar

I will also tar and upload the symlinks created for the UKB data. While these could be recreated, this approach conveniently preserves the timestamps, making it easier to resume the analysis seamlessly when extending it in the future.

# Compress and upload the ukb directory containing symlinks
cd /home/rstudio-server
tar -cvf ukb_symlinks.tar ukb
dx upload ukb_symlinks.tar

Once the files are uploaded, you can safely terminate the RStudio session. Ensure the session is fully terminated by checking the Monitor tab in your DNAnexus project folder.


Step 6. Ancestry Inference and Within Target QC

The ancestry inference step is required prior to polygenic scoring, so we will do this now. In the same session, we will also perform the within-sample QC and project reference principal components, which generate other useful outputs.


Start an Rstudio session

Neither of the these steps require much RAM. The within-sample QC can leverage multiple cores, but ancestry inference doesn’t. We need enough disk space to import the output from the previous run, but not much more. In this demonstration I am using a mem2_ssd1_v2_x8 instance, which seemed to work well.


Install Singularity

Once your RStudio session has started, install Singularity via the terminal. We will use the shell script we created here:

# Download, update permissions and run script to install singularity
dx download install_singularity.sh
chmod a+x install_singularity.sh
./install_singularity

Import the data from previous session

We are going to extend our previous analysis UKB using GenoPred. We need to recreate the environment we had before.

# Download and decompress the symlinks previous run of GenoPred
# Decompressing the mounted data save time and disk space
tar -xvf /mnt/project/ukb_symlinks.tar -C ~/
tar -xvf /mnt/project/ukb_genopred.tar -C ~/

Estimate relatedness

Although GenoPred can estimate relatedness from scratch, UKB is very large and it will be time consuming. Instead, we will use the kinship matrix released by UKBto identify a list of unrelated individuals, using the software GreedyRelated, a script I have written previously ukb_relative_remover.R.

# Install optparse in R
Rscript -e "install.packages('optparse', repos='https://cloud.r-project.org/')"

# Download and give permission to GreedyRelated binary for linux
wget https://gitlab.com/-/project/14754196/uploads/6fdc44072a8a866cb77c2cf91f68d662/GreedyRelated_linux
chmod a+x GreedyRelated_linux

# Identify list of related individuals (using threshold of 0.044)
Rscript ukb_relative_remover.R \
  --rel_file /mnt/project/Bulk/Genotype\ Results/Genotype\ calls/ukb_rel.dat \
  --rel_thresh 0.044 \
  --seed 1 \
  --GreedyRelated ./GreedyRelated_linux \
  --output ukb/ukb82087
# Identify a list of unrelated individuals by removing this list of related indivduals from the full list of individuals in the UKB .fam file
library(data.table)
related <- fread('ukb/ukb82087.related')$V1
fam<-fread('/mnt/project/Bulk/Genotype Results/Genotype calls/ukb22418_c1_b0_v2.fam')$V1

unrelated<-fam[!(fam %in% related)]
write.table(unrelated, 'ukb/ukb82087.unrelated', col.names=F, row.names = F, quote = F)

Update configuration for relatedness

We must update the target_list configuration file for GenoPred to indicate the location of the file indicating unrelated individuals.

target_list <- data.frame(
  name='ukb',
  path='/home/rstudio-server/ukb/ukb_symlinks/ukb_imp',
  type='bgen',
  indiv_report=F,
  unrel='/home/rstudio-server/ukb/ukb82087.unrelated'
)

write.table(
  target_list,
  '/home/rstudio-server/genopred/config/ukb/basic/target_list.txt',
  col.names = T,
  row.names = F,
  quote = F
)

Run pipeline

# Start a tmux session to ensure the analysis persists even the connection is lost
tmux

# Start an interactive session inside the GenoPred container
singularity shell \
  --bind /home/rstudio-server:/home/rstudio-server \
  --bind /mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype:/mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype \
  --writable-tmpfs \
  /mnt/project/genopred_pipeline_latest.sif

# Activate the GenoPred Environment:
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred

# Navigate to the Pipeline Folder:
cd /tools/GenoPred/pipeline

# Perform a Dry Run: 
# A dry run checks the pipeline's steps without executing them, helping you identify any missing dependencies or issues
# We can see that GenoPred will pick up where it left off, and won't rerun steps it ran before.
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/basic/config.yaml ancestry_inference outlier_detection pc_projection

# Run the analysis. Here I am using 8 cores since I am using an instance with 8 cores available (mem2_ssd1_v2_x8).
snakemake -j8 --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/basic/config.yaml ancestry_inference outlier_detection pc_projection

Package and export outputs

Once the analysis is complete, we will compress and dx upload the output. The genopred output folder will contain the contents of the previous run as well (the reformatted UKB data), so we can delete the old version of the ukb_genopred.tar file in our project folder after the upload of the new version is complete. The same goes for the ukb/ukb_symlinks.tar folder - We need to reupload this since we are now storing the list of unrelated individuals in there.

# Compress and upload the GenoPred working directory
cd /home/rstudio-server
tar -cvf ukb_genopred.tar genopred
dx upload ukb_genopred.tar

# Compress and upload the ukb directory containing symlinks and list of unrelated individuals
tar -cvf ukb_symlinks.tar ukb
dx upload ukb_symlinks.tar

Step 7. Generating score files

Score files can be generated using GenoPred either on DNAnexus or on other platforms, as this step does not require access to UK Biobank (UKB) data.

Notably, score files generated in one instance of GenoPred (or with other software) can be reused as input for another instance of GenoPred. For example, you can:

  1. Generate score files using GenoPred on an institutional server (e.g., for free or with existing resources).

  2. Copy these score files to DNAnexus and use them to perform target sample scoring in the UKB dataset on DNAnexus.

There are already several demonstrations of running GenoPred on DNAnexus in this document. The same setup can be used to generate score files. So, I will focus on demonstrating the more common scenario of importing PGS score files from a previous run of GenoPred, to be used for target sample scoring in UKB on DNAnexus.

There are two approaches for using scores files from a different run of GenoPred. The score file can be reformated to the PGS catalogue format, and included in the score_list, but this requires one set of weights per file, which is inefficient, and looses some functionality downstream such as the function to return the pseudovalidated score (find_pseudo()). An alternative solution, is to copy the input GWAS sumstats, QC’d sumstats, and the PGS score files, which provides full downstream functionality.

The most convenient solution will depend on your needs. If you just want to use a relatively fast PGS method, then you might as well run on DNAnexus as it won’t cost much. If you want to use computationally intensive methods, then you may want to save money by running on your institutional server, and then importing the output to DNAnexus. If you only want to use a single score from the computationally intensive method, then exporting that score alone and specifying it using the score_list will be most convenient. However, if you want full functionality of GenoPred, whilst running PGS methods on a different server, then copying the entire GWAS and score file directories from a previous run onto DNAnexus is needed. I will demonstrate only the final scenario as it is the most convoluted.


Step 8. Target sampling scoring

Here I will use a score file generated using GenoPred previously. The score file was generated using the an coronary artery disease GWAS and the ptclump method. I have uploaded it to my DNAnexus project from my institutional server using the dx upload function (similar to I did in this section).

# Package and upload the required sumstats and score files to the DNAnexus project folder.
mkdir -p genopred_scores/gwas_sumstat
mkdir -p genopred_scores/pgs_score_files/sbayesr

cp ~/oliverpainfel/GenoPred/pipeline/example_input/gwas_list.txt genopred_scores/
cp -r ~/oliverpainfel/GenoPred/pipeline/test_data/output/test1/reference/gwas_sumstat/COAD01 genopred_scores/gwas_sumstat/
cp -r ~/oliverpainfel/GenoPred/pipeline/test_data/output/test1/reference/pgs_score_files/ptclump/COAD01 genopred_scores/pgs_score_files/ptclump/

tar -cvf genopred_scores.tar genopred_scores
dx upload genopred_scores.tar

Now I will spin up a new instance in Rstudio to perform target sample scoring in UKB. Using mem2_ssd1_v2_x4 instance.

  • Install Singularity (same as before - I put it in a shell script to make it easier to run)
dx download install_singularity.sh
chmod a+x install_singularity.sh
./install_singularity.sh
  • Import the data from previous session (Decompress from the mounted version to save disk space and time)
# Import GenoPred inputs relating to UKB
tar -xvf /mnt/project/ukb_symlinks.tar -C ~/
tar -xvf /mnt/project/ukb_genopred.tar -C ~/

# Import GenoPred outputs from PGS methods, and move into the apropriate genopred folder 
tar -xvf /mnt/project/genopred_scores.tar -C ~/
mv genopred_scores/gwas_sumstat ~/genopred/output/reference/
mv genopred_scores/pgs_score_files ~/genopred/output/reference/
  • Update configuration to specify the GWAS, score files and PGS methods to use.

The gwas_list and pgs_methods should match the configuration used to generate the score files. However, we will need to create empty files to represent the original raw GWAS summary statistics, which we did not copy over to DNAnexus.

# Make an empty file to represent the unQC'd sumstats
library(data.table)
gwas_list<-fread('~/genopred_scores/gwas_list.txt')
gwas_list<-gwas_list[gwas_list$name == 'COAD01',]
dir.create('/home/rstudio-server/raw_sumstats')
for(i in 1:nrow(gwas_list)){
  path <- paste0('/home/rstudio-server/raw_sumstats/', gwas_list$name[i],'.txt')
  file.create(path)
  gwas_list$path[i] <- path
}

gwas_list$label<-paste0("\"", gwas_list$label, "\"")

dir.create('/home/rstudio-server/genopred/config/ukb/demo')

write.table(
  gwas_list,
  '/home/rstudio-server/genopred/config/ukb/demo/gwas_list.txt',
  col.names = T,
  row.names = F,
  quote = F
)

# Create config file
conf <- c(
  'outdir: /home/rstudio-server/genopred/output',
  'config_file: /home/rstudio-server/genopred/config/ukb/demo/config.yaml',
  'resdir: /home/rstudio-server/genopred/resources',
  'gwas_list: /home/rstudio-server/genopred/config/ukb/demo/gwas_list.txt',
  "pgs_methods: ['ptclump']",
  'target_list: /home/rstudio-server/genopred/config/ukb/basic/target_list.txt',
  'cores_target_pgs: 1'
)

write.table(
  conf,
  '/home/rstudio-server/genopred/config/ukb/demo/config.yaml',
  col.names = F,
  row.names = F,
  quote = F
)
  • Run pipeline
# Start a tmux session
tmux

# Start interactive session in container
singularity shell \
  --bind /home/rstudio-server:/home/rstudio-server \
  --bind /mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype:/mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype \
  --writable-tmpfs \
  /mnt/project/genopred_pipeline_latest.sif

# Activate the GenoPred Environment:
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred

# Navigate to the Pipeline Folder:
cd /tools/GenoPred/pipeline

# It will think the score files need to be recreated due to the sumstat paths changing. So touch the outputs of prep_pgs
# This just updates the file timestamps for step prior to prep_pgs so the pipeline doesn't think it needs to recreate them due to the raw sumstats being newer than the score files.
snakemake -t -j1 --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/demo/config.yaml prep_pgs

# Perform a Dry Run: 
# A dry run checks the pipeline's steps without executing them, helping you identify any missing dependencies or issues.
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/demo/config.yaml output_all

# We can see it will only run target scoring and downstream steps, as it should.
# Now we can run using four cores, matching the resources available in our instance (mem2_ssd1_v2_x4)
snakemake -j4 --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/demo/config.yaml output_all
  • Package and export PGS

You could tar and export the entire genopred folder again, or you could export just the files you need, such as the score files and the report.

# Upload report
dx upload genopred/output/ukb/reports/ukb-report.html

It could also be convenient to store the PGS in an .RDS file, and then export that file.

export LC_ALL=C
export LANG=C
setwd('/tools/GenoPred/pipeline')
library(data.table)

source('../functions/misc.R')
source_all('../functions')

# Read in PGS
pgs <- read_pgs(config = '/home/rstudio-server/genopred/config/ukb/demo/config.yaml')

saveRDS(pgs, file = "/home/rstudio-server/ukb_pgs_COAD01.rds")
dx upload ukb_pgs_COAD01.rds

Troubleshooting

Please post questions as an issue on the GenoPred GitHub repo here.