This document describes how to run the GenoPred pipeline offline, that is, in an environment without internet access.
First, the pipeline's dependencies must be prepared in an environment that does have internet access. These include the software container and additional pipeline resources. The resources can then be transferred to the offline environment, where the pipeline can be run.
Below, I provide an example of this process.
The GenoPred image has already been built using the Dockerfile here. This image is hosted on Docker Hub and the Singularity library.
# Docker
docker \
pull \
opaino/genopred_pipeline:latest
# Singularity
singularity \
pull \
--arch \
amd64 \
/users/k1806347/oliverpainfel/Software/singularity/genopred_pipeline_latest.sif \
library://opain/genopred/genopred_pipeline:latest
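The Singularity pull above writes a .sif file that can be copied directly, whereas a pulled Docker image is held by the local Docker daemon rather than as a file. One way to move it is sketched below, assuming the standard `docker save` and `docker load` commands. The function names and the archive file name are hypothetical and not part of GenoPred.

```shell
# Sketch: move the Docker image to an offline machine as a tar archive.
# Function names and the archive file name are illustrative assumptions.

# Run where there is internet access, after `docker pull`.
export_genopred_image() {
    archive="$1"    # e.g. genopred_pipeline_latest.tar
    docker save -o "$archive" opaino/genopred_pipeline:latest
}

# Run on the offline machine, after copying the archive across.
import_genopred_image() {
    archive="$1"
    docker load -i "$archive"
}

# Example usage:
#   export_genopred_image genopred_pipeline_latest.tar
#   (copy the archive to the offline machine)
#   import_genopred_image genopred_pipeline_latest.tar
```

How the archive is copied (scp, rsync, removable media) depends on the site and is left out here.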
The resources required by the pipeline depend on the analyses requested by the user. I have provided rules to download the required resources for two scenarios:
- get_key_resources: Allows for most PGS methods (dbslmm, lassosum, megaprs, ptclump)
- get_all_resources: Allows for all PGS methods (additionally incl. ldpred2, prscs, sbayesr)
Alternatively, the user can download only the data required for ldpred2, prscs, or sbayesr:
- get_ldpred2_resources: Allows for ldpred2
- get_prscs_resources: Allows for prscs
- get_sbayesr_resources: Allows for sbayesr
Note. 23andMe format target data is only supported if the download_impute2_data rule is run, as reference data for imputation is required.
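The choice between the two main rules can be expressed as a small helper. This is only a sketch: the function name `resource_rule_for` is hypothetical, and it covers only the methods and groupings listed above. It falls back to get_all_resources whenever ldpred2, prscs, or sbayesr is requested, although the method-specific rules may be a lighter option in that case.

```shell
# Sketch: choose a resource rule from the PGS methods to be run.
# The function name is hypothetical; the groupings follow the list above.
resource_rule_for() {
    rule="get_key_resources"
    for method in "$@"; do
        case "$method" in
            dbslmm|lassosum|megaprs|ptclump)
                ;;                              # covered by get_key_resources
            ldpred2|prscs|sbayesr)
                rule="get_all_resources" ;;     # needs the additional resources
            *)
                echo "unknown method: $method" >&2
                return 1 ;;
        esac
    done
    echo "$rule"
}

# Example usage:
#   resource_rule_for ptclump dbslmm lassosum
```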
In this example, I will run the pipeline using the example configuration with the test data (running ptclump, dbslmm, and lassosum), so I will use the get_key_resources rule to download the required resources.
# Create a configuration file specifying the directory in which to save the required resources
# (this should be the same file that is passed to --configfile below)
echo "resdir: genopred_resources" > config_offline.yaml
# Run GenoPred pipeline using the get_key_resources rule
cd /users/k1806347/oliverpainfel/Software/MyGit/GenoPred/pipeline
conda activate genopred
snakemake \
--profile slurm \
--configfile=/users/k1806347/oliverpainfel/test/offline_example/config.yaml \
get_key_resources
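Once the resources have been downloaded, they need to be moved to the offline environment. A minimal sketch of one way to package them is shown here, assuming GNU `tar` and `sha256sum` are available on both machines. The function names and the archive name are hypothetical, and the resources directory is whatever resdir was set to.

```shell
# Sketch: archive a resources directory and record a checksum for transfer.
# Function names and file names are illustrative assumptions.

bundle_resources() {
    resdir="$1"     # e.g. genopred_resources
    archive="$2"    # e.g. genopred_resources.tar.gz
    # -C keeps paths inside the archive relative to the parent of resdir
    tar -czf "$archive" -C "$(dirname "$resdir")" "$(basename "$resdir")" || return 1
    # Record only the file name, so the checksum can be verified from any location
    ( cd "$(dirname "$archive")" &&
      sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256" )
}

# Run on the offline machine after copying the archive and its .sha256 file.
verify_bundle() {
    archive="$1"
    ( cd "$(dirname "$archive")" && sha256sum -c "$(basename "$archive").sha256" )
}

# Example usage:
#   bundle_resources genopred_resources genopred_resources.tar.gz
#   verify_bundle genopred_resources.tar.gz
```

Checking the checksum after transfer catches truncated copies before the pipeline is run against incomplete reference data.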
For demonstration purposes, we will use the test data for the GenoPred pipeline. This is the same data used on the main pipeline tutorial page.
cd /users/k1806347/oliverpainfel/test/offline_example
# Download from Zenodo
wget -O test_data.tar.gz "https://zenodo.org/records/10640650/files/test_data.tar.gz?download=1"
# Decompress
tar -xf test_data.tar.gz
# Once decompressed, delete compressed version to save space
rm test_data.tar.gz
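Because the compressed file is deleted after extraction, it can be worth confirming that the archive is readable first. The helper below is a hypothetical sketch, not part of GenoPred. It lists the archive contents as a basic integrity check and removes the archive only if extraction succeeds.

```shell
# Sketch: extract an archive and delete it only if extraction succeeded.
# The function name is a hypothetical helper, not part of GenoPred.
extract_then_remove() {
    archive="$1"
    # `tar -tf` lists the contents and fails if the archive cannot be read
    tar -tf "$archive" > /dev/null || { echo "cannot read $archive" >&2; return 1; }
    # Extract into the current directory, then remove the archive on success
    tar -xf "$archive" && rm "$archive"
}

# Example usage:
#   extract_then_remove test_data.tar.gz
```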
I will now start an interactive session in the downloaded container. I will mount a folder within the container so I can read and write files outside of the container. See the Docker and Singularity websites for general documentation on their use.
######
# Start interactive session within the container
######
# When using singularity or docker, we must mount folders we want to access within the container
# Singularity
singularity shell \
--bind /scratch/prj/oliverpainfel:/scratch/prj/oliverpainfel \
--writable-tmpfs \
/users/k1806347/oliverpainfel/Software/singularity/genopred_pipeline_latest.sif
# Docker
docker run \
-it \
-v /users/k1806347/oliverpainfel:/users/k1806347/oliverpainfel \
opaino/genopred_pipeline:latest
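Before starting an interactive session, it can help to confirm that the mounted folder is actually visible inside the container. The sketch below assumes the standard `singularity exec --bind` form. The function name `check_bind_visible` is hypothetical, and the image path and folder are supplied by the caller.

```shell
# Sketch: check that a host folder is visible inside the container.
# The function name is a hypothetical helper, not part of GenoPred.
check_bind_visible() {
    image="$1"    # path to the .sif file
    path="$2"     # host folder to mount at the same location in the container
    [ -d "$path" ] || { echo "host path not found: $path" >&2; return 1; }
    # Runs `test -d` inside the container; non-zero exit means the mount is missing
    singularity exec --bind "$path:$path" "$image" test -d "$path"
}

# Example usage:
#   check_bind_visible genopred_pipeline_latest.sif /scratch/prj/oliverpainfel
```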
Once the container has been started, we can use the pipeline as normal. A previously downloaded version of the GenoPred repo will be in the folder /tools/GenoPred, and the genopred conda environment will already be available.
In this example, before running the pipeline, I first set up the configuration to use the test data and the example configuration files, and to run in an offline environment. First, I create a symbolic link to the previously downloaded test_data, to align with the example configuration files. Then, I update the example configuration to run in an offline environment:
- resdir parameter in the configfile, to use the previously downloaded resources
- score_list, to only include locally stored score files
# Activate GenoPred environment
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred
# Create symbolic link for test_data inside the pipeline folder to work with the example configuration files
cd /tools/GenoPred/pipeline
ln -s /users/k1806347/oliverpainfel/test/offline_example/test_data ./test_data
# Update configuration files to run offline
# 1. Remove score files requiring direct access to PGS catalogue
awk -F' ' '$2 != "NA" {print}' example_input/score_list.txt > example_input/score_list_2.txt && mv example_input/score_list_2.txt example_input/score_list.txt
# 2. Update resdir to previously downloaded resources
echo "resdir: /scratch/prj/oliverpainfel/test/offline_example/genopred_resources" >> example_input/config.yaml
# Do a dry run to check the scheduled steps are expected (there should not be any steps saying 'download', and it should not be necessary to build the conda environment)
snakemake -n --use-conda --configfile=example_input/config.yaml output_all
# Run pipeline using test data and example configuration
snakemake -j4 --use-conda --configfile=example_input/config.yaml output_all
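The dry-run check described above can also be automated rather than read by eye. The following is a rough sketch: the function name `check_no_downloads` is hypothetical, and it simply searches the dry-run text for the word "download". It may therefore also flag unrelated lines, such as file paths that happen to contain that word.

```shell
# Sketch: fail if a snakemake dry run schedules any download step.
# The function name is hypothetical; it only searches the dry-run text.
check_no_downloads() {
    # Reads dry-run output on stdin and prints any matching lines
    if grep -i "download"; then
        echo "dry run includes download steps" >&2
        return 1
    fi
    return 0
}

# Example usage:
#   snakemake -n --use-conda --configfile=example_input/config.yaml output_all | check_no_downloads
```

A non-zero exit status suggests that some resource is still missing from resdir, or that the configuration still refers to data that must be fetched online.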
Please post questions as an issue on the GenoPred GitHub repo here.