The UK Biobank (UKB) has recently updated its data access policy, requiring researchers to access and analyze UKB data exclusively via the UKB Research Analysis Platform (UKB-RAP), hosted by DNAnexus. This represents a significant shift in workflows, moving from institutional servers to a cloud-computing environment. DNAnexus and similar cloud-computing systems are likely to become standard for future datasets, so this guide aims to provide instructions on running GenoPred on DNAnexus, with a particular focus on analyzing UKB data. The UKB dataset presents unique challenges due to its size, making efficient analysis essential.
Cloud computing involves requesting access to an instance (or virtual machine) with a specified amount of resources, such as disk space, RAM, and the number of CPU cores. Once access is granted, users must create the required software and data environment within the instance—a step that is often unfamiliar to many. After completing the desired analyses, users must export any outputs they wish to retain, as any data or software left on the instance will be deleted once it is terminated.
This process of transferring software and data into and out of the instance is a significant departure from the experience of working on personal computers or institutional servers, where you can resume work seamlessly each time you log in.
Currently, the easiest way to run GenoPred on DNAnexus is interactively, using either RStudio or a Cloud Workstation with the GenoPred software container. Each approach offers unique advantages.
RStudio mounts files directly from a DNAnexus project, eliminating the need to import data. This feature saves time and disk space, particularly with large datasets like UKB genetic data.
Unlike RStudio, the cloud workstation requires input data to be manually imported using dx download, which can be time-consuming and requires additional storage space. While dxfuse allows project folders to be mounted (even across projects), it is prone to instability during long analyses, such as when using UKB.
Another distinction is that cloud workstations can be accessed via VS Code, which some users may prefer to RStudio.
Given the advantage of stable project folder mounting in RStudio, I will demonstrate the workflow in that context. For the demo, minimal resources are required, so I will request a mem1_ssd1_v2_x2 instance type. A similar workflow can also be applied in the Cloud Workstation, substituting the Singularity container with the GenoPred Docker container.
To make the demonstration as similar as possible to working with UKB data, I will first upload a version of the GenoPred test data into a DNAnexus project. While it would be simpler to download the test data directly into the instance, this approach better mimics the setup used when working with UKB data.
# Step 1: Download the GenoPred test data from Zenodo
wget -O test_data.tar.gz "https://zenodo.org/records/10640650/files/test_data.tar.gz?download=1"
# Step 2: Decompress the downloaded file
tar -xf test_data.tar.gz
# Step 3: Extract only the necessary files for the demonstration
mkdir genopred_test_data
mv test_data/target/imputed_sample_plink2/example.chr22* genopred_test_data/
mv test_data/reference/gwas_sumstats/BODY04.gz genopred_test_data/
# Step 4: Load the Conda environment with dxpy installed
# The Conda environment file for installing dxpy is available in the GenoPred repository
# Path: GenoPred/pipeline/misc/dnanexus/dxpy_env.yml
conda activate dxpy_env
# Step 5: Log in to DNAnexus using an API token, if not already logged in
# API tokens can be created on the DNAnexus website (https://platform.dnanexus.com/)
dx login --token <token>
# Step 6: Select the desired DNAnexus project
dx select genopred_demo
# Step 7: Upload the prepared data to DNAnexus
dx upload -r genopred_test_data
# Step 8: Clean up temporary files to save space
rm -r test_data test_data.tar.gz genopred_test_data
Once your RStudio session has started, install Singularity via the terminal. Follow the commands below to set up Singularity, along with other essential tools:
# Configure keyboard layout (optional step to avoid prompts)
echo 'keyboard-configuration keyboard-configuration/layout select us' | sudo debconf-set-selections
echo 'keyboard-configuration keyboard-configuration/variant select English (US)' | sudo debconf-set-selections
# Install required dependencies (e.g., tmux, build tools, and libraries for Singularity)
sudo DEBIAN_FRONTEND=noninteractive apt update && \
sudo DEBIAN_FRONTEND=noninteractive apt install -y build-essential libseccomp-dev pkg-config squashfs-tools cryptsetup golang tmux
# Set the Singularity version to install
export VERSION=3.11.0
# Download and extract the Singularity source code
wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-ce-${VERSION}.tar.gz
tar -xvzf singularity-ce-${VERSION}.tar.gz
cd singularity-ce-${VERSION}
# Configure and build Singularity without SUID
./mconfig --without-suid
make -C builddir
# Install Singularity system-wide
sudo make -C builddir install
# Return to the home directory
cd ~
Note: This will need to be done for every new RStudio instance in which you want to run GenoPred. For convenience, you can store the code in a shell script (e.g., install_singularity.sh) and upload it to your project.
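One way to make such a script safe to re-run is to start it with a guard that skips the build when Singularity is already on the PATH. The following is a minimal sketch rather than part of GenoPred; the helper name is_installed is hypothetical.

```shell
# Guard for the top of an installation script (illustrative sketch).

# Hypothetical helper: succeeds if the named command is on the PATH.
is_installed() {
  command -v "$1" >/dev/null 2>&1
}

if is_installed singularity; then
  echo "Singularity already installed; skipping build."
else
  echo "Singularity not found; continuing with the installation steps."
  # The apt, wget, mconfig and make commands shown above would go here.
fi
```

Uploading the script once with dx upload makes it available to later sessions via dx download.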
After installing Singularity, download the GenoPred container using the following command:
# Pull the GenoPred container
singularity pull library://opain/genopred/genopred_pipeline:latest
Note: To make your workflow more efficient and reproducible, consider saving the downloaded container file in your project folder. This allows you to re-use the container in future analyses without needing to download it again.
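A minimal sketch of that reuse: look for a saved copy of the container in a list of candidate directories before pulling it again. The function name find_container is hypothetical; the file name genopred_pipeline_latest.sif and the /mnt/project mount point are the ones used in this guide.

```shell
# Hypothetical helper: print the first existing copy of the container found
# in the directories given as arguments; return non-zero if there is none.
find_container() {
  local name="genopred_pipeline_latest.sif"
  local dir
  for dir in "$@"; do
    if [ -f "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}

# Prefer the copy in the mounted project, then the home directory.
if sif=$(find_container /mnt/project "$HOME"); then
  echo "Reusing container at: $sif"
else
  echo "No saved container found; pull it with the singularity pull command above."
fi
```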
In a DNAnexus project, data is automatically mounted within the RStudio session and can be accessed from the /mnt/project directory. However, if the data you need is not located in the project folder, you will need to import it manually using the dx download command.
For this demonstration, we are working with the small test dataset, so I will show how to import it. When working with larger datasets, such as UKB genetic data, it is often more efficient to use the mounted version of the dispensed data. This approach saves time and disk space by avoiding the need to download and store large files within the RStudio instance.
# View the files in your current project folder
dx ls
# More instructions on using the dx commands can be found here:
# https://documentation.dnanexus.com/getting-started/cli-quickstart
# Import the test data into the instance
dx download -r genopred_test_data
Now, we will create the configuration files required to run the GenoPred pipeline. Note that the outdir and resdir must be set to directories located outside the container to ensure proper access and storage.
# Create directories for configuration and output files
dir.create('/home/rstudio-server/genopred/config', recursive = TRUE)
dir.create('/home/rstudio-server/genopred/output', recursive = TRUE)
# Create gwas_list configuration
gwas_list <- data.frame(
name = 'BODY04',
path = '/home/rstudio-server/genopred_test_data/BODY04.gz',
population = 'EUR',
n = NA,
sampling = NA,
prevalence = NA,
mean = 0,
sd = 1,
label = '"Body Mass Index"'
)
write.table(
gwas_list,
'/home/rstudio-server/genopred/config/gwas_list.txt',
col.names = TRUE,
row.names = FALSE,
quote = FALSE,
sep = ' '
)
# Create target_list configuration
target_list <- data.frame(
name = 'example_plink1',
path = '/home/rstudio-server/genopred_test_data/example',
type = 'plink2',
indiv_report = FALSE
)
write.table(
target_list,
'/home/rstudio-server/genopred/config/target_list.txt',
col.names = TRUE,
row.names = FALSE,
quote = FALSE,
sep = ' '
)
# Create main configuration file
conf <- c(
'outdir: /home/rstudio-server/genopred/output',
'resdir: /home/rstudio-server/genopred/resources',
'config_file: /home/rstudio-server/genopred/config/config.yaml',
'gwas_list: /home/rstudio-server/genopred/config/gwas_list.txt',
'target_list: /home/rstudio-server/genopred/config/target_list.txt',
"pgs_methods: ['ptclump']",
'testing: chr22'
)
writeLines(
conf,
'/home/rstudio-server/genopred/config/config.yaml'
)
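Before launching the pipeline, it can be worth checking from the terminal that the files referenced in the configuration actually exist. The following is a minimal shell sketch, not part of GenoPred; the function name check_config_paths is hypothetical, and it assumes each setting is written as a single "key: value" line, as in the configuration above.

```shell
# Hypothetical helper: report any gwas_list, target_list or config_file entry
# in a config file that points to a file which does not exist.
# Assumes simple "key: value" lines.
check_config_paths() {
  local config="$1" missing=0 line key value
  while IFS= read -r line; do
    key=${line%%:*}      # text before the first colon
    value=${line#*: }    # text after the first ": "
    case "$key" in
      gwas_list|target_list|config_file)
        if [ ! -f "$value" ]; then
          echo "Missing file for $key: $value" >&2
          missing=1
        fi
        ;;
    esac
  done < "$config"
  return "$missing"
}

# Example call, run only if the demo configuration exists on this machine.
config=/home/rstudio-server/genopred/config/config.yaml
if [ -f "$config" ]; then
  check_config_paths "$config" && echo "All listed files found."
fi
```

The check is deliberately limited to the three file-valued settings; outdir and resdir are directories that the pipeline may create itself.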
Now we can run the GenoPred pipeline. This can be done either interactively within the container or by executing the desired commands directly. Running the pipeline interactively is often more convenient for performing a dry run before launching the full analysis.
To ensure that your analysis persists even when the RStudio server tab is closed, it is recommended to start the container within a tmux session. This will allow you to detach and reattach to the session as needed.
Start a tmux session within the terminal by running:
tmux
This will take you into a tmux session. You can ‘detach’ from the tmux session by pressing Ctrl+b followed by d, and reattach to the session in the future by typing:
tmux attach
Further instructions on using tmux can be found here.
To begin, start an interactive session within the Singularity container. Make sure to mount the home directory within the RStudio session to store the outputs:
singularity shell \
--bind /home/rstudio-server:/home/rstudio-server \
--writable-tmpfs \
/home/rstudio-server/genopred_pipeline_latest.sif
Once inside the container, you can use the GenoPred pipeline as usual:
# Activate the GenoPred Environment:
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred
# Navigate to the Pipeline Folder:
cd /tools/GenoPred/pipeline
# Perform a Dry Run:
# A dry run checks the pipeline's steps without executing them, helping you identify any missing dependencies or issues:
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/config.yaml output_all
# Run the Pipeline:
# Once satisfied with the dry run, execute the pipeline:
snakemake -j1 --use-conda --configfile=/home/rstudio-server/genopred/config/config.yaml output_all
While this analysis is running, you can detach from the tmux session, close the RStudio tab, and close your browser. When you reopen the RStudio app, you may see that your session appears suspended. Do not worry: because the pipeline was started inside tmux, your analysis will continue running in the background even if the terminal session or RStudio server tab is closed.
After the analysis is complete, you can leave the container by typing:
exit
To avoid losing the outputs of your analysis when the RStudio session is terminated, you need to export the results to your DNAnexus project folder. Both the resdir (resources) and outdir (outputs) should be saved for future analyses.
For simplicity and efficiency, we will compress the outputs and resources into a single tar file and then upload it to the DNAnexus project. If you plan to reuse the resources (e.g., for different pipeline configurations), you may choose to store them in separate tar files.
# Compress the GenoPred working directory
cd /home/rstudio-server
tar -cvf test_run_genopred.tar genopred
# Upload the GenoPred container
dx upload genopred_pipeline_latest.sif
# Upload the tar file containing pipeline resources and outputs
dx upload test_run_genopred.tar
Once the files are uploaded, you can safely terminate the RStudio session. Ensure the session is fully terminated by checking the Monitor tab in your DNAnexus project folder.
If you want to extend your analysis without rerunning steps that have already completed, you can start a new RStudio session, import the outputs from a previous run, and resume the pipeline from within the container. Note that you will also need to import the input data used in the previous analysis.
# Download the Outputs from the Previous Run:
dx download test_run_genopred.tar
tar -xvf test_run_genopred.tar
# Download the Singularity Container:
dx download genopred_pipeline_latest.sif
# Download the Input Data:
dx download -r genopred_test_data
When using dx download for files that are not part of a tar archive, the original timestamps are lost. This may confuse GenoPred, as it will interpret the files as being updated. To fix this, reset the timestamps of the input files:
find /home/rstudio-server/genopred_test_data/ -type f -exec touch -t 200001010101.01 {} +
If the input data is accessed via the automatic mount in /mnt/project, this step is unnecessary, as the timestamps are preserved. This is another advantage of mounting the input data.
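The effect of the timestamp reset can be reproduced on two throwaway files. The sketch below uses the shell's -nt (newer than) test to mimic the kind of modification-time comparison that triggers a rerun. The file names are placeholders, and Snakemake's actual rerun logic may take additional factors into account.

```shell
# Illustration with placeholder files (not GenoPred code).
demo=$(mktemp -d)
touch -t 202001010000.00 "$demo/output.txt"   # stands in for an existing pipeline output
touch "$demo/input.txt"                        # freshly downloaded input: timestamp is "now"

if [ "$demo/input.txt" -nt "$demo/output.txt" ]; then
  before="input looks newer than output"
else
  before="input looks older than output"
fi

# Reset the input timestamp to a fixed date in the past, as in the find command above.
touch -t 200001010101.01 "$demo/input.txt"

if [ "$demo/input.txt" -nt "$demo/output.txt" ]; then
  after="input looks newer than output"
else
  after="input looks older than output"
fi

echo "Before reset: $before"
echo "After reset:  $after"
rm -r "$demo"
```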
# Start an Interactive Session in the Container:
singularity shell \
--bind /home/rstudio-server:/home/rstudio-server \
--writable-tmpfs \
/home/rstudio-server/genopred_pipeline_latest.sif
# Activate the GenoPred Environment:
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred
# Navigate to the Pipeline Folder:
cd /tools/GenoPred/pipeline
# Check the Pipeline State:
# Run a dry run to verify the current state of the pipeline:
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/config.yaml output_all
The pipeline will indicate that there is nothing to be done if the configuration has not changed and all outputs are up to date.
If an input file is updated (e.g., by changing its timestamp), the pipeline will automatically rerun only the necessary steps:
touch /home/rstudio-server/genopred_test_data/BODY04.gz
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/config.yaml output_all
This dry run will show which steps need to be re-executed due to the update.
The UKB imputed genetic data is provided without post-imputation QC, resulting in large files. The format_target step of GenoPred, which reformats the target genetic data, is time-intensive but reduces file size to ~86GB, making future analyses faster and cheaper. Storing this output for reuse is highly recommended. Additionally, selecting instances with appropriate resources for each pipeline stage ensures cost efficiency, as some steps utilize multiple cores while others do not.
format_target: Use a mem3_ssd2_v2_x8 instance, which supports 8 processes and provides sufficient RAM and disk space for processing the large UKB files efficiently. Other steps of the pipeline have different resource requirements.
Run format_target on a high-resource instance: Use the mem3_ssd2_v2_x8 instance to efficiently complete the format_target step.
After completing format_target, terminate the instance to minimize costs.
Once your RStudio session has started, install Singularity via the terminal. We will use the shell script (install_singularity.sh) we created earlier:
# Download, update permissions and run script to install singularity
dx download install_singularity.sh
chmod a+x install_singularity.sh
./install_singularity.sh
We will use a mounted version of the UKB genetic data to save time and disk space. To meet GenoPred’s file name requirements without duplicating data, we will create symbolic links. Given the size of the UKB genetic data, we will also request an instance with sufficient resources.
# Create symlinks to the dispensed imputed genetic data
mkdir -p /home/rstudio-server/ukb/ukb_symlinks
# Link bgen and bgen.bgi files for all chromosomes
for chr in $(seq 1 22); do
  for file in bgen bgen.bgi; do
    ln -s "/mnt/project/Bulk/Imputation/UKB imputation from genotype/ukb22828_c${chr}_b0_v3.${file}" "/home/rstudio-server/ukb/ukb_symlinks/ukb_imp.chr${chr}.${file}"
  done
done
# Link the sample file (same for all chromosomes)
ln -s /mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype/ukb22828_c1_b0_v3.sample /home/rstudio-server/ukb/ukb_symlinks/ukb_imp.sample
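Because the links point into the mounted project, a mistyped path only shows up later as a missing-file error. A quick check is to list any links whose targets do not resolve. This is a sketch; the function name list_broken_links is hypothetical. It relies on find -L, which follows symbolic links, so only links that cannot be resolved are still reported as type l.

```shell
# Hypothetical helper: print symbolic links under a directory whose targets are missing.
list_broken_links() {
  find -L "$1" -type l
}

# Example call, run only if the symlink directory from this guide exists.
links=/home/rstudio-server/ukb/ukb_symlinks
if [ -d "$links" ]; then
  broken=$(list_broken_links "$links")
  if [ -z "$broken" ]; then
    echo "All links resolve."
  else
    echo "Broken links:"
    echo "$broken"
  fi
fi
```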
Next, create the configuration files required to run the GenoPred pipeline. Ensure that outdir and resdir are set to directories outside the container for proper access and storage.
# Create directories for configuration and output files
dir.create('/home/rstudio-server/genopred/config/ukb/basic', recursive = T)
dir.create('/home/rstudio-server/genopred/output', recursive = T)
# Create target list
# We are specifying the symbolic links we made for the UKB data
target_list <- data.frame(
name='ukb',
path='/home/rstudio-server/ukb/ukb_symlinks/ukb_imp',
type='bgen',
indiv_report=F
)
write.table(
target_list,
'/home/rstudio-server/genopred/config/ukb/basic/target_list.txt',
col.names = T,
row.names = F,
quote = F
)
# Create config file
conf <- c(
'outdir: /home/rstudio-server/genopred/output',
'config_file: /home/rstudio-server/genopred/config/ukb/basic/config.yaml',
'resdir: /home/rstudio-server/genopred/resources',
'target_list: /home/rstudio-server/genopred/config/ukb/basic/target_list.txt'
)
write.table(
conf,
'/home/rstudio-server/genopred/config/ukb/basic/config.yaml',
col.names = F,
row.names = F,
quote = F
)
Now we can run the GenoPred pipeline. This can be done either interactively within the container or by executing the desired commands directly. Running the pipeline interactively is often more convenient for performing a dry run before launching the full analysis.
To ensure that your analysis persists even when the RStudio server tab is closed, it is recommended to start the container within a tmux session. This will allow you to detach and reattach to the session as needed.
Start a tmux session within the terminal by running:
tmux
This will take you into a tmux session. You can ‘detach’ from the tmux session by pressing Ctrl+b followed by d, and reattach to the session in the future by typing:
tmux attach
Further instructions on using tmux can be found here.
To begin, start an interactive session within the Singularity container. Make sure to mount the home directory within the RStudio session and the directory that the symbolic links point to, to store the pipeline outputs and access the input data within the container.
singularity shell \
--bind /home/rstudio-server:/home/rstudio-server \
--bind /mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype:/mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype \
--writable-tmpfs \
/mnt/project/genopred_pipeline_latest.sif
Once inside the container, you can use the GenoPred pipeline as usual. The resources provided by this instance (mem3_ssd2_v2_x8) will not be required for all steps in the GenoPred workflow, so to be cost efficient, I am carrying out only the format_target step within this instance. I will then export the data, terminate this instance, and continue my analysis using a new instance with appropriate resources for downstream steps.
# Activate the GenoPred Environment:
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred
# Navigate to the Pipeline Folder:
cd /tools/GenoPred/pipeline
# Perform a Dry Run:
# A dry run checks the pipeline's steps without executing them, helping you identify any missing dependencies or issues:
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/basic/config.yaml format_target
# Run the Pipeline:
# Once satisfied with the dry run, execute the pipeline:
snakemake -j8 --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/basic/config.yaml format_target
Note: Since the instance has 8 cores available, I use the -j8 parameter when running snakemake to ensure it utilizes all 8 cores efficiently.
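Rather than hard-coding the core count, it can be read from the instance. This is a small sketch that assumes nproc (GNU coreutils) is available, with getconf as a fallback; the variable name CORES is arbitrary.

```shell
# Detect the number of available cores (sketch; CORES is an arbitrary variable name).
CORES=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "Detected $CORES cores"

# The value can then be passed to snakemake in place of a fixed number, e.g.:
# snakemake -j"$CORES" --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/basic/config.yaml format_target
```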
After the analysis is complete, you can leave the container by typing:
exit
In total, this analysis took ~18 hours, costing ~£6.
To avoid losing the outputs of your analysis when the RStudio session is terminated, you need to export the results to your DNAnexus project folder. Both the resdir (resources) and outdir (outputs) should be saved for future analyses.
For simplicity and efficiency, we will compress the outputs and resources into a single tar file and then upload it to the DNAnexus project. If you plan to reuse the resources (e.g., for different pipeline configurations), you may choose to store them in separate tar files.
# Compress and upload the GenoPred working directory
cd /home/rstudio-server
tar -cvf ukb_genopred.tar genopred
dx upload ukb_genopred.tar
I will also tar and upload the symlinks created for the UKB data. While these could be recreated, this approach conveniently preserves the timestamps, making it easier to resume the analysis seamlessly when extending it in the future.
# Compress and upload the ukb directory containing symlinks
cd /home/rstudio-server
tar -cvf ukb_symlinks.tar ukb
dx upload ukb_symlinks.tar
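Since anything not uploaded is lost when the instance is terminated, it is worth confirming that an archive is readable before uploading it. The sketch below is a hypothetical helper (archive_and_check) that creates a tar file and then lists it with tar -tf as a basic integrity check; the upload itself is still done with dx upload.

```shell
# Hypothetical helper: archive a directory and confirm the archive can be listed.
# Usage: archive_and_check <parent_dir> <dir_name> <archive_path>
archive_and_check() {
  local parent="$1" name="$2" archive="$3" entries
  tar -cf "$archive" -C "$parent" "$name" || return 1
  # Listing the archive fails if it is truncated or unreadable.
  tar -tf "$archive" >/dev/null || return 1
  entries=$(tar -tf "$archive" | wc -l | tr -d ' ')
  echo "Archive OK: $archive ($entries entries)"
}
```

For example, archive_and_check /home/rstudio-server ukb ukb_symlinks.tar && dx upload ukb_symlinks.tar would upload only if the archive could be listed.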
Once the files are uploaded, you can safely terminate the RStudio session. Ensure the session is fully terminated by checking the Monitor tab in your DNAnexus project folder.
The ancestry inference step is required prior to polygenic scoring, so we will do this now. In the same session, we will also perform the within-sample QC and project reference principal components, which generate other useful outputs.
Neither of these steps requires much RAM. The within-sample QC can leverage multiple cores, but ancestry inference cannot. We need enough disk space to import the output from the previous run, but not much more. In this demonstration I am using a mem2_ssd1_v2_x8 instance, which worked well.
Once your RStudio session has started, install Singularity via the terminal. We will use the shell script (install_singularity.sh) we created earlier:
# Download, update permissions and run script to install singularity
dx download install_singularity.sh
chmod a+x install_singularity.sh
./install_singularity.sh
We are going to extend our previous analysis of UKB using GenoPred. We need to recreate the environment we had before.
# Decompress the symlinks and outputs from the previous run of GenoPred
# Decompressing directly from the mounted project saves time and disk space
tar -xvf /mnt/project/ukb_symlinks.tar -C ~/
tar -xvf /mnt/project/ukb_genopred.tar -C ~/
# Start a tmux session to ensure the analysis persists even if the connection is lost
tmux
# Start an interactive session inside the GenoPred container
singularity shell \
--bind /home/rstudio-server:/home/rstudio-server \
--bind /mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype:/mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype \
--writable-tmpfs \
/mnt/project/genopred_pipeline_latest.sif
# Activate the GenoPred Environment:
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred
# Navigate to the Pipeline Folder:
cd /tools/GenoPred/pipeline
# Perform a Dry Run:
# A dry run checks the pipeline's steps without executing them, helping you identify any missing dependencies or issues
# We can see that GenoPred will pick up where it left off, and won't rerun steps it ran before.
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/basic/config.yaml ancestry_inference outlier_detection pc_projection
# Run the analysis. Here I am using 8 cores since I am using an instance with 8 cores available (mem2_ssd1_v2_x8).
snakemake -j8 --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/basic/config.yaml ancestry_inference outlier_detection pc_projection
Once the analysis is complete, we will compress and dx upload the output. The genopred output folder will also contain the contents of the previous run (the reformatted UKB data), so we can delete the old version of the ukb_genopred.tar file in our project folder after the upload of the new version is complete. The same goes for the ukb folder (ukb_symlinks.tar): we need to re-upload it, since we are now storing the list of unrelated individuals in there.
# Compress and upload the GenoPred working directory
cd /home/rstudio-server
tar -cvf ukb_genopred.tar genopred
dx upload ukb_genopred.tar
# Compress and upload the ukb directory containing symlinks and list of unrelated individuals
tar -cvf ukb_symlinks.tar ukb
dx upload ukb_symlinks.tar
Score files can be generated using GenoPred either on DNAnexus or on other platforms, as this step does not require access to UK Biobank (UKB) data.
Notably, score files generated in one instance of GenoPred (or with other software) can be reused as input for another instance of GenoPred. For example, you can:
Generate score files using GenoPred on an institutional server (e.g., for free or with existing resources).
Copy these score files to DNAnexus and use them to perform target sample scoring in the UKB dataset on DNAnexus.
There are already several demonstrations of running GenoPred on DNAnexus in this document. The same setup can be used to generate score files. So, I will focus on demonstrating the more common scenario of importing PGS score files from a previous run of GenoPred, to be used for target sample scoring in UKB on DNAnexus.
There are two approaches for using score files from a different run of GenoPred. The score file can be reformatted to the PGS Catalog format and included in the score_list, but this requires one set of weights per file, which is inefficient, and loses some downstream functionality, such as the function that returns the pseudovalidated score (find_pseudo()). An alternative solution is to copy the input GWAS sumstats, QC’d sumstats, and the PGS score files, which provides full downstream functionality.
The most convenient solution will depend on your needs. If you just want to use a relatively fast PGS method, then you might as well run it on DNAnexus, as it won’t cost much. If you want to use computationally intensive methods, then you may want to save money by running them on your institutional server and then importing the output to DNAnexus. If you only want to use a single score from the computationally intensive method, then exporting that score alone and specifying it using the score_list will be most convenient. However, if you want the full functionality of GenoPred whilst running PGS methods on a different server, then you need to copy the entire GWAS and score file directories from a previous run onto DNAnexus. I will demonstrate only the final scenario, as it is the most convoluted.
Here I will use a score file generated previously using GenoPred. The score file was generated using a coronary artery disease GWAS and the ptclump method. I have uploaded it to my DNAnexus project from my institutional server using the dx upload function (similar to what I did when uploading the test data earlier in this guide).
# Package and upload the required sumstats and score files to the DNAnexus project folder.
mkdir -p genopred_scores/gwas_sumstat
mkdir -p genopred_scores/pgs_score_files/ptclump
cp ~/oliverpainfel/GenoPred/pipeline/example_input/gwas_list.txt genopred_scores/
cp -r ~/oliverpainfel/GenoPred/pipeline/test_data/output/test1/reference/gwas_sumstat/COAD01 genopred_scores/gwas_sumstat/
cp -r ~/oliverpainfel/GenoPred/pipeline/test_data/output/test1/reference/pgs_score_files/ptclump/COAD01 genopred_scores/pgs_score_files/ptclump/
tar -cvf genopred_scores.tar genopred_scores
dx upload genopred_scores.tar
Now I will spin up a new RStudio instance to perform target sample scoring in UKB, using a mem2_ssd1_v2_x4 instance.
dx download install_singularity.sh
chmod a+x install_singularity.sh
./install_singularity.sh
# Import GenoPred inputs relating to UKB
tar -xvf /mnt/project/ukb_symlinks.tar -C ~/
tar -xvf /mnt/project/ukb_genopred.tar -C ~/
# Import GenoPred outputs from PGS methods, and move them into the appropriate genopred folder
tar -xvf /mnt/project/genopred_scores.tar -C ~/
# Create the destination folder in case it does not exist yet
mkdir -p ~/genopred/output/reference
mv ~/genopred_scores/gwas_sumstat ~/genopred/output/reference/
mv ~/genopred_scores/pgs_score_files ~/genopred/output/reference/
The gwas_list and pgs_methods should match the configuration used to generate the score files. However, we will need to create empty files to represent the original raw GWAS summary statistics, which we did not copy over to DNAnexus.
# Make an empty file to represent the unQC'd sumstats
library(data.table)
gwas_list<-fread('~/genopred_scores/gwas_list.txt')
gwas_list<-gwas_list[gwas_list$name == 'COAD01',]
dir.create('/home/rstudio-server/raw_sumstats')
for(i in 1:nrow(gwas_list)){
path <- paste0('/home/rstudio-server/raw_sumstats/', gwas_list$name[i],'.txt')
file.create(path)
gwas_list$path[i] <- path
}
gwas_list$label<-paste0("\"", gwas_list$label, "\"")
dir.create('/home/rstudio-server/genopred/config/ukb/demo')
write.table(
gwas_list,
'/home/rstudio-server/genopred/config/ukb/demo/gwas_list.txt',
col.names = T,
row.names = F,
quote = F
)
# Create config file
conf <- c(
'outdir: /home/rstudio-server/genopred/output',
'config_file: /home/rstudio-server/genopred/config/ukb/demo/config.yaml',
'resdir: /home/rstudio-server/genopred/resources',
'gwas_list: /home/rstudio-server/genopred/config/ukb/demo/gwas_list.txt',
"pgs_methods: ['ptclump']",
'target_list: /home/rstudio-server/genopred/config/ukb/basic/target_list.txt',
'cores_target_pgs: 1'
)
write.table(
conf,
'/home/rstudio-server/genopred/config/ukb/demo/config.yaml',
col.names = F,
row.names = F,
quote = F
)
# Start a tmux session
tmux
# Start interactive session in container
singularity shell \
--bind /home/rstudio-server:/home/rstudio-server \
--bind /mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype:/mnt/project/Bulk/Imputation/UKB\ imputation\ from\ genotype \
--writable-tmpfs \
/mnt/project/genopred_pipeline_latest.sif
# Activate the GenoPred Environment:
source /opt/mambaforge/etc/profile.d/conda.sh
conda activate genopred
# Navigate to the Pipeline Folder:
cd /tools/GenoPred/pipeline
# Snakemake will think the score files need to be recreated because the sumstats paths have changed, so touch the outputs of prep_pgs.
# The -t (--touch) option only updates the timestamps of the outputs up to and including prep_pgs, without rerunning them, so the pipeline no longer considers them older than the (empty) raw sumstats files.
snakemake -t -j1 --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/demo/config.yaml prep_pgs
# Perform a Dry Run:
# A dry run checks the pipeline's steps without executing them, helping you identify any missing dependencies or issues.
snakemake -n --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/demo/config.yaml output_all
# We can see it will only run target scoring and downstream steps, as it should.
# Now we can run using four cores, matching the resources available in our instance (mem2_ssd1_v2_x4)
snakemake -j4 --use-conda --configfile=/home/rstudio-server/genopred/config/ukb/demo/config.yaml output_all
You could tar and export the entire genopred folder again, or you could export just the files you need, such as the score files and the report.
# Upload report
dx upload genopred/output/ukb/reports/ukb-report.html
It could also be convenient to store the PGS in an .RDS file, and then export that file.
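# Shell: set the locale before starting R inside the container
export LC_ALL=C
export LANG=C
# R: load the GenoPred functions and read in the PGS
setwd('/tools/GenoPred/pipeline')
library(data.table)
source('../functions/misc.R')
source_all('../functions')
# Read in PGS
pgs <- read_pgs(config = '/home/rstudio-server/genopred/config/ukb/demo/config.yaml')
saveRDS(pgs, file = "/home/rstudio-server/ukb_pgs_COAD01.rds")
# Shell: upload the .rds file to the DNAnexus project
dx upload /home/rstudio-server/ukb_pgs_COAD01.rds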
Please post questions as an issue on the GenoPred GitHub repo here.