This document describes how to run the GenoPred pipeline offline (an environment without access to the internet).

First, the dependencies of the pipeline will need to prepared within an environment that does have access to the internet. These include the software container and additional pipeline resources. Then these resources can be transferred to the offline environment, and the pipeline can be run.

Below, I provide an example of this process.

Download software container

The GenoPred image has already been built using the Dockerfile here. This image is hosted on dockerhub and the singularity library.

# Docker
docker \
  pull \
  opaino/genopred_pipeline:latest

# Singularity
singularity \
  pull \
  --arch \
  amd64 \
  /users/k1806347/oliverpainfel/Software/singularity/genopred_pipeline_latest.sif \
  library://opain/genopred/genopred_pipeline:latest

Downloading pipeline resources

The resources required by the pipeline depends on the analyses requested by the user. I have provided rules to download required resources for two scenarios:

get_key_resources: Allows for most PGS methods (dbslmm, lassosum, megaprs, ptclump)
get_all_resources: Allows for all PGS methods (additionally incl. ldpred2, prscs, prscsx, quickprs, sbayesr, sbayesrc, xwing)

Alternatively, the user can download only the data required for ldpred2, prscs, prscsx, quickprs, sbayesr, sbayesrc, xwing:

get_ldpred2_resources: Allows for ldpred2
get_prscs_resources: Allows for prscs
get_prscsx_resources: Allows for prscsx
get_quickprs_resources: Allows for quickprs
get_sbayesr_resources: Allows for sbayesr
get_sbayesrc_resources: Allows for sbayesrc
get_xwing_resources: Allows for xwing

Note. 23andMe format target data will only be allowed for if the download_impute2_data rule is run, as reference data for imputation is required.

In this example, I will run the pipeline using the example configuration with the test data (running ptclump, dbslmm, and lassosum), so I will use the get_key_resources rule to download the required resources

# Create a configuration file specifying directory to save the required resources
echo "resdir: genopred_resources" > config_offline.yaml

# Run GenoPred pipeline using the get_key_resources rule
cd /users/k1806347/oliverpainfel/Software/MyGit/GenoPred/pipeline
conda activate genopred
snakemake \
  --profile slurm \
  --configfile=/users/k1806347/oliverpainfel/test/offline_example/config.yaml \
  get_key_resources

Download test data

For demonstration purposes, we will use the test data for the GenoPred pipeline. This is the same data as is used for the main pipeline tutorial page.

cd /users/k1806347/oliverpainfel/test/offline_example

# Download from Zenodo
wget -O test_data.tar.gz https://zenodo.org/records/10640650/files/test_data.tar.gz?download=1

# Decompress
tar -xf test_data.tar.gz

# Once decompressed, delete compressed version to save space
rm test_data.tar.gz

Run pipeline offline

I will now start an interactive session in the downloaded container. I will mount a folder within the container so I can read and write files outside of the container. See docker and singularity websites for general documentation on their use.

######
# Start interactive session within the container
######
# When using singularity or docker, we must mount folders we want to access within the container

# Singularity 
singularity shell \
  --bind /scratch/prj/oliverpainfel:/scratch/prj/oliverpainfel \
  --writable-tmpfs \
  /users/k1806347/oliverpainfel/Software/singularity/genopred_pipeline_latest.sif

# Docker
docker run \
  -it \
  -v /users/k1806347/oliverpainfel:/users/k1806347/oliverpainfel \
  genopred_pipeline:v0.4

Once the container has been started, we can use the pipeline as normal. A previously downloaded version of the GenoPred repo will be in the folder /tools/GenoPred, and the genopred conda environment will already be available.

In this example, before running the pipeline, I first get the configuration set up to the use the test data, the example configuration files, and run in an offline environment. First I create a symbolic link to the previously downloaded test_data, to align with the example_configuration files. Then, I update the example configuration to run in an offline environment:

Update the resdir parameter configfile to use the previously downloaded resources
Update the score_list to only include locally stored score files

# Activate GenoPred environment
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred

# Create symbolic for test_data inside the pipeline folder to work with the example configuration files
cd /tools/GenoPred/pipeline
ln -s /users/k1806347/oliverpainfel/test/offline_example/test_data ./test_data

# Update configuration files to run offline
# 1. Remove score files requiring direct access to PGS catalogue
awk -F' ' '$2 != "NA" {print}' example_input/score_list.txt > example_input/score_list_2.txt && mv example_input/score_list_2.txt example_input/score_list.txt
# 2. Update resdir to previously downloaded resources
echo "resdir: /scratch/prj/oliverpainfel/test/offline_example/genopred_resources" >> example_input/config.yaml

# Do a dry run to check the scheduled steps are expected (there should not be any steps saying 'download', and it should not be necessary to build the conda environment)
snakemake -n --use-conda --configfile=example_input/config.yaml output_all

# Run pipeline using test data and example configuration
snakemake -j4 --use-conda --configfile=example_input/config.yaml output_all

Any questions?

Please post questions as an issue on the GenoPred GitHub repo here.

GenoPred Pipeline - Running Offline

Download software container

Downloading pipeline resources

Download test data

Run pipeline offline

Any questions?