Abstract
This is a “rewrite” of the Rosetta
ligand_docking_tutorial.md
with comments on how to combine
Docker Container and macOS.
This is a “rewrite” of the Rosetta ligand_docking_tutorial.md
file in R-markdown with comments on how to combine Docker Container and
macOS. (There are no Windows binaries.)
Summary: Combine software and scripts on Docker and local macOS computer (Intel amd64 or arm64 Silicon Chip M series) to follow successfully the Rosetta tutorial Ligand Docking with a G-Protein Coupled Receptor. This method will allow access to the native OS speed while fulfilling all preparatory and exploratory steps that fail or are too complex to set-up on the local computer.
The Rosetta software suite includes algorithms for computational modeling and analysis of protein structures. […] including de novo protein design, enzyme design, ligand docking, and structure prediction of biological macromolecules and macromolecular complexes.
The software is rather complex and it can be difficult to “make things work” considering the breadth of options for algorithms or hardware and operating system support. This method will allow to access the native OS speed while fulfilling all preparatory and exploratory steps.
This post is an attempt to help users that want to use the software on their native OS (macOS) but since some functionality built-in the tutorials assumes a Linux OS, some of the steps are in fact easier handled on the Linux side thanks to Docker.
These instructions should benefit macOS users primarily as there are no Windows binaries available. Windows users should use the WSL2 method to install Linux under Windows.
Users should be somewhat familiar with bash command line (see e.g. my Survival Command Line tutorial) and have the Docker Desktop software installed (see e.g. my tutorial Docker – Beginner for Biologists.)
The Rosetta Docker image can be downloaded from
Terminal
with the command below, assuming
that Docker is already installed:
docker pull rosettacommons/rosetta:latest
While the purpose of the Docker image is to provide access to all the Linux compiled binaries, we will take advantage of some of the Linux functionality as well as the installed Python within.
Rosetta is freely available for academic and non-commercial purposes, under license. The software can be downloaded from the links provided on the Download page.
In order to compute the docking computation “natively” (for faster results) on the local computer users should download the newest “release”, e.g. from the Academic download page.
For this post I used Rosetta 3.14 for M1 (Silicon Chip “M1 binaries”, 13Gb) Macintosh, Intel-based Mac users should download the “Mac binaries” (14Gb).
Note: Unarchiving the file will require about 45 Gb of disk space but will contain the material for all tutorials and demos.
We will use 2 Terminal sessions: one to navigate within the Macintosh natively. The other to run a Docker container that will be activated in a way so that both Terminal sessions will share the same directory area on the local computer.
When Terminal
is running you can choose a different
color for each shell using the Shell > New Window
menu cascade. (Basic is white background.)
Here we’ll use a blue background for when we are looking within the Docker Container, and a pale yellow when we are on the macOS side:
echo pale yellow means macOS side
A pale yellow Terminal can be obtained with the menu cascade:
Shell > New Window > Man Page
echo light blue means we are running from inside the Linux container
A blue background (with white text) can be obtained with the menu
cascade:
Shell > New Window > Ocean
Note: the colored background will only be visible in the HTML versions. (PDF versions are better for printing.)
On the Macintosh side it will be useful to create environment
variables as suggested in the Rosetta Commons section How
To Read These Tutorials. Assuming that the binaries are found within
the Downloads
directory, we can keep that location and its
default name.
For the M1 series the unarchived directory was called
rosetta.binary.m1.release-371
and the following variables
below were created. I replaced my username by $USER
so that
these commands become generic and can be copied, with the caveat that
the binary name might be different (change accordingly!)
Open a Mac Terminal
(/Applications/Utilities/Terminal.app
) and paste the
(edited) commands:
export ROSETTA3=/Users/$USER/Downloads/rosetta.binary.m1.release-371/main/source
export ROSETTA3_DB=/Users/$USER/Downloads/rosetta.binary.m1.release-371/main/database
export ROSETTA_TOOLS=/Users/$USER/Downloads/rosetta.binary.m1.release-371/main/tools
export ROSETTA3_DEMOS=/Users/$USER/Downloads/rosetta.binary.m1.release-371/main/demos
These commands can me made “permanent” by including them within the
.zshrc
or .bashrc
files for the
zsh
or bash
shell choice respectively. (You
can know your shell with command echo $SHELL
)
Open a new Terminal and select a different color to better distinguish this Terminal with the Top menu cascade: Shell > New Window > Ocean
Navigate the the top level of the release directory within the main directory. For me it would be as follows…
cd /Users/$USER/Downloads/rosetta.binary.m1.release-371/main/
Note: the container is not yet running, hence this command is printed with a pale yellow background i.e. we are still on the macOS side.
Verify that this was successful with pwd and
then
continue.
Launch the Docker container: The -v
option lets us share
the current directory (i.e. main
, with the
container, mapping it as /data
within the container, and
making it the default working directory with the
-w
option.
This command is given to macOS. Once we are running, we’ll be in the “blue” session i.e. within the Linux OS.
docker run -it --rm -v ${PWD}:/data -w /data rosettacommons/rosetta
(Note: on a Silicon M series Mac Docker will complain about the “platform” but this can be ignored.)
A listing of the files and directories should reveal the same content as the main directory. We are now within the Linux container and therefore we’ll use blue background.
The shell prompt will be shown as #
preceded by the
working directory as /data #
which is omitted for easier
copy/paste of the commands.
ls -F
The result should be similar to the following:
CITING_ROSETTA.md README.md rosetta_scripts_scripts/
CLA.md database/ source/
CONTRIBUTING.md demos/ tests/
LICENSE.md documentation/ tools/
PyRosetta.notebooks/ pyrosetta_scripts/
We can check a couple more things: the shell as well as the operating system:
echo $SHELL
/bin/bash
cat /etc/os-release
Will show that we are running Ubuntu:
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
Within the Docker container we don’t really need the
ROSETTA
environment variables as all paths can start with
/data
to be “absolute” (i.e. non ambiguous.)
The tutorial assumes that the user is sitting in front of the fully functional Linux computer, including graphical interface, with all ancillary software needed for such a computer already installed. However, within the container a few important utilities are missing, so we need to add them now.
The command whoami
will confirm that we are running as
root
as is indicated by the #
prompt. This
means that we can install any software we need. The first command
provides Ubuntu with information needed to know where
to download the software we ask to install. On the second line we ask to
install (yes by default with -y
) 3 software
Issue the following commands after the # prompt:
apt-get update
apt-get install -y wget nano pymol
With these installed we can run all of the commands in the tutorial.
Adding the pymol
Python module will allow the creation of
PyMOL .pse
files as well as a specific python script, as
described in the tutorial, from the Docker Container Terminal.
This document is a modification of the original tutorial Ligand Docking with a G-Protein Coupled Receptor with added information. A short version is available in the Blog entry about this specific tutorial. This document contains all original writing.
From this point below we’ll start using the original tutorial with the following modifications:
$ROSETTA
environment variables will be
used. On Linux we can simply use /data
. The replace
<path-to-Rosetta>
in the original commands.From this point the text is mostly from the original tutorial, with added options to run from macOS or Linux as well as additional explanations on the commands used.
KEYWORDS: LIGANDS DOCKING
Bold text means that these files and/or this information is provided.
Italicized text means that this material will NOT be conducted during the workshop
fixed width text means you should type the command into your terminal (after > sign)
If you want to try making files that already exist (e.g.,
input files), write them to a different directory!
(mkdir my_dir
)
In addition to following this sample docking problem, the user is encouraged to review the Rosetta user guide including the section on ligand-centric movers for use with RosettaScripts.
https://www.rosettacommons.org/docs/latest/
The experimental data for this tutorial is derived from: Chien, E. Y. T. et al. Structure of the human dopamine D3 receptor in complex with a D2/D3 selective antagonist. Science 330, 1091-5 (2010).
This particular D3/eticlopride protein-ligand complex was used as a target in the GPCR Dock 2010 assessment, the results of which are discussed here: Kufareva, I. et al. Status of GPCR modeling and docking as reflected by community-wide GPCR Dock 2010 assessment. Structure 19, 1108-1126 (2011).
If you are interested in more information on the performance of Rosetta in modeling and docking D3/GPCRs in general, please consult Nguyen, E. D. et al. Assessment and challenges of ligand docking into comparative models of g-protein coupled receptors. PLoS One 8, (2013).
Dopamine is an essential neurotransmitter that exhibits its effects through five subtypes of dopamine receptors, important members of class A G-protein coupled receptors (GPCRs). Both subtype two (D2R) and subtype three (D3R) function via inhibition of adenyl cyclase, and modulation of these two receptors has clinical applications in treating schizophrenia. However, the high degree of binding site conservation between D2R and D3R makes it difficult to generate pharmacological compounds that selectively bind one or the other, and thereby reducing side effects.
Today, we will examine how eticlopride, a D2R/D3R antagonist, binds to human D3R.
A crystal structure is available for the D3R and eticlopride complex (PDB: 3PBL), but for the purposes of this exercise, we will model the protein-ligand interactions anyways. In reality, you may be using a comparative model rather than a crystal structure for the protein receptor, but the steps in this tutorial will apply to both.
For this exercise, we’ll be doing our pre-docking preparations in the
protein_prep
and ligand_prep
folders. The
modeling will be done in the docking
folder. The
scripts
folder contains helpful ligand docking specific
scripts that we’ll be using during this tutorial (you should never
be copying files to or from this folder). All necessary files are
also prepared in the answers directory in case you get stuck.
Go to the desired location:
Navigate to the ligand_docking
directory where you will
find the ligand_prep
, protein_prep
,
docking
, and answers
folders.
On macOS Terminal:
# macOS
cd $ROSETTA3_DEMOS/tutorials/ligand_docking/protein_prep
On Container Terminal:
#Container
cd /data/demos/tutorials/ligand_docking/protein_prep
Prepare a human dopamine 3 receptor structure. We will do this by obtaining the crystal structure (3PBL) and removing the excess information.
The first step has to be accomplished on the Docker Container side as
the called python script clean_pdb.py
calls on
wget
to download a PDB file but wget
is not
installed on macOS by default. The clean_pdb.py
script then
calls on zcat
to unarchive the file. However, on macOS this
software behaves differently and expect a file ending with
.Z
and cause a “file not found” error. Thus it is best to
accomplish this task on the Container side, but since we are sharing the
directories these will “magically” appear on the macOS side as well!
2.1. Change into the protein_prep directory with the cd
command:
We can verify that we are already within the
protein_prep
directory from the last command above.
pwd
The original Tutorial suggested to download the PDB file, but the
clean_pdb.py
script will download it anyway so it is not
really necessary.
2.2. Download
3BLP
(pdb format) from http://www.rcsb.org/pdb/home/home.do into theprotein_prep
directory.
The clean_pdb.py
script will allow you to strip the PDB
of information other than the desired protein coordinates. The
‘A
’ option tells the script to obtain chain A only. The
full crystal structure consists of two monomers as a crystallization
artifact.
On the Container Terminal type:
/data/tools/protein_tools/scripts/clean_pdb.py 3PBL.pdb A
2.3. There are two output files from clean_pdb.py
:
3PBL_A.pdb
contains a single chain of the protein structure
and 3PBL_A.fasta
contains the corresponding sequence.
3PBL_A.pdb
is the receptor structure we will be using for
docking, copy this into the docking directory.
cp 3PBL_A.pdb ../docking
Note: This structure has a T4-lysozyme domain instead of the third cytoplasmic loop as a stabilizing feature for crystallography. Normally, we would truncate this lysozyme segment and perform loop modeling as discussed in the comparative modeling tutorial to regenerate the intracellular loop. However in the interest of time, we will use the lysozyme containing structure as the eticlopride binding site is far from the intracellular domain.
Next, we will prepare the ligand files by generating parameters using a eticlopride conformational library.
For more information about the ligand preparation, check ligand preparation tutroial| prepare_ligand.
3.1. cd
into the directory named
ligand_prep
cd ../ligand_prep
3.2.In the directory, you will find a pair of already prepared files:
eticlopride.sdf
and
eticlopride_conformers.sdf
3.2.1. eticlopride.sdf
: This contains the eticlopride
structure found in the 3PBL protein complex.
Note: You can also find the ligand file from this particular PDB structure by going to the 3PBL page and scrolling down to the “Ligand Chemical Component” section. From there, you can click “Download” under the ETQ identifier.
3.2.2. eticlopride_conformers.sdf
: This is a library of
conformations for eticlopride generated outside of Rosetta. The
downloaded ligand .sdf
file only contains conformations
found in the PDB so we must expand the library to properly sample the
conformational space. We also need to add hydrogens since they are not
resolved in the crystal structure. Feel free to open the file in PyMOL
and use the arrow keys to scroll through the different
conformations:
# Do not run
pymol eticlopride_conformers.sdf
The command pymol eticlopride_conformers.sdf
assumes a
Linux computer with a full Graphical interface and will not work on
macOS or within the Docker Container as it is running as
“Text-only.”
To open PyMOL from command line on a Mac use the command:
open -a /Applications/PyMOL.app
Then slide the file eticlopride_conformers.sdf
onto
PyMOL with your mouse (using the file name as argument does not
currently open it on macOS.) If you are not sure where the file is on
your Mac, the following 2 commands will open the folder where it’s
located:
# Go to location:
cd ROSETTA3_DEMOS/tutorials/ligand_docking/ligand_prep
# Open current folder represented by a dot: .
open .
This particular conformational library was generated using the Meiler lab’s BioChemicalLibrary (BCL). The BCL is a suite of tools for protein modeling, small molecule calculations, and machine learning. If you’re interested in licensing the BCL, please visit http://www.meilerlab.org/bclcommons or ask one of the instructors.
Other methods of ligand conformer generation include OpenEye’s MOE software, CSD Mercury software (CSD Conformer Generator) and web-servers such as Frog 2.1 or DG-AMMOS. The generated libraries will differ depending on the chosen method.
3.2.3. Generate a .params
file and
associated PDB conformations with Rosetta atom types for eticlopride. A
.params
file is necessary for ligand docking because
Rosetta does not have records for custom small molecules in its
database.
This step is done using the script
molfile_to_params.py
.
Note: This command can be run from either Terminal… However,
the first line of the script reads: #!/usr/bin/env python
which assumes that the computer environment has a python
path defined. On my Mac it is currently defined as python3
and therefore it complains with
env: python: No such file or directory
. This is
fixed easily by adding python3
in front of the actual
command on the macOS side.
You can run this step with either of the following commands. The
first one could also work on Mac if
python
is defined as such. For the Mac Terminal option I
make use of the environment variable $ROSETTA3
. You can use
the option -h
first as suggested in the tutorial to
understand the meaning of the command:
Option 1 (In Container):
# Show help
/data/source/scripts/python/public/molfile_to_params.py -h
# Run
/data/source/scripts/python/public/molfile_to_params.py -n ETQ -p ETQ --conformers-in-one-file eticlopride_conformers.sdf
Option 2 (Mac Terminal):
# Show help:
python3 $ROSETTA3/scripts/python/public/molfile_to_params.py
# Run
python3 $ROSETTA3/scripts/python/public/molfile_to_params.py -n ETQ -p ETQ --conformers-in-one-file eticlopride_conformers.sdf
Note: You may encounter a warning about the number of atoms in the residue. This is okay as Rosetta is merely telling you that the ligand has more atoms than an amino acid.
File ETQ.params
contains the necessary information for
Rosetta to process the ligand, ETQ.pdb
contains the first
conformation, and ETQ_conformers.pdb
contains the rest of
the conformational library.
Note: The tutorial assumes that we are within the
ligand_prep/
directory, but also calls the sdf
file with ligand_prep/eticlopride_conformers.sdf
which will
cause an error since we are already within that directory. Thus
the directory name has been removed from the above commands.
3.2.4. If you use the tail
command on
ETQ.params
, you will notice the PDB_ROTAMERS
property line that tells Rosetta where to find the conformational
library. Make sure this line has ETQ_conformers.pdb
as the
property.
tail ETQ.params
Note: The same command could be given on the macOS side as well.
3.2.5. Now that we have the necessary files for ligand docking, let’s
copy them over the ligand_docking
directory.
cp ETQ* ../
Note: The same command could be given on the macOS side as well. Remember that bot Terminal are “looking within” the exact same directory.
Now we want to make our final preparations in the docking directory.
4.1. Switch over to our ligand_docking
directory
cd ../
4.2. Open up our prepared receptor and ligand structures to examine the complex
# Do not run
pymol 3PBL_A.pdb ETQ.pdb
The command pymol 3PBL_A.pdb ETQ.pdb
invites to explore
the complex graphically. Use the methods described previously for
graphical exploration.
(Original NOTE: if you cannot open pymol
from the
command line, you may need to set up your bash
environment.)
Now make a pdb file by concatenating your protein and ligand, running this script:
cp protein_prep/3PBL_A.pdb .
cat 3PBL_A.pdb ETQ.pdb > 3PBL_ETQ.pdb
If you don’t have these files, copy them from the
answers
directory:
cp answers/docking/3PBL_ETQ.pdb .
4.3. Tip: ‘All->Action->preset->ligand sites->cartoon’ will help you visualize the protein/ligand interface. The All button is denoted by a single letter “A” in Pymol GUI.
Since this is a rudimentary exercise, we will start with the ligand in the protein binding site.
In practical application, we may need to define a starting point with the StartFrom mover or to manually place the ligand into an approximate region using PyMOL.
4.4. Once you close PyMOL, make sure Rosetta has these four necessary
input structure/parameter files in the ligand_docking
tutorial directory. If you are missing any of these, copy them from
../answers/docking/
3PBL_A.pdb
: a single chain of the protein receptor
structureETQ.pdb
: a default starting conformation for
eticloprideETQ_conformers.pdb
: A pdb file containing all
conformers generated from the eticlopride library.ETQ.params
: a Rosetta parameter file that provides the
necessary properties for Rosetta to treat eticloprideNext we need to make sure we have the proper RosettaScripts|rosetta_scripting
XML file, input options file, and crystal complex (correct answer).
These files are provided to you as dock.xml
,
options.txt
, and crystal_complex.pdb
. Copy
these to your ligand_docking
directory:
# Remember that . is the current dir
cp docking/dock.xml .
cp docking/options .
cp docking/crystal_complex.pdb .
dock.xml
- This is the RosettaScripts XML file that
tells Rosetta the type of sampling and scoring to do. It defines the
scoring function and provides parameters for both low-resolution coarse
sampling and high-resolution Monte Carlo sampling.options.txt
- This is the options file that tells
Rosetta where to locate our input PDB structures and ligand parameters.
It also directs Rosetta to the proper XML file.crystal_complex.pdb
- This is the D3-eticlopride
complex from the PDB. It will serve as the correct answer in our case
allowing us to make comparisons between our models and actual
structures.Note: This is the point where using macOS binaries would prove valuable for larger computations and when it is useful to have the macOS binaries installed. For large projects the Mac binaries will run faster than running those within the Docker Container by emulation which would mean a double emulation on an M Series Silicon Mac: one to convert the Linux code, and the second to convert from Intel (amd64) to Silicon Chip (arm64), which is done seemlesly for Mac users by a utility aptly called “Rosetta2” but has not connection with the Rosetta we are using for protein or docking computations!
IMPORTANT NOTE: The name of the binary will differ depending on the operating system.
Run the docking study (This should take a few minutes at most, as we’re using a reduced number of output structures):
The command in the original tutorial assumes a standard installation, with binary:
$> $ROSETTA3/bin/rosetta_scripts.linuxgccrelease @options
However, within the Docker Container the binaries have a different
naming convention. The binaries within the container are within
/usr/local/bin
and the specific one to call for this
section is named:
rosetta_scripts.cxx11threadserialization.linuxgccrelease
On the Mac it will be:
$ROSETTA3/bin/rosetta_scripts.static.macosclangrelease
To run the docking use the appropriate binary, followed by
@options
On the Mac it would be:
$ROSETTA3/bin/rosetta_scripts.static.macosclangrelease @options
The Rosetta models are saved with the prefix 3PBL_ETQ_
followed by a four digit identifier such as
3PBL_ETQ_0001.pdb
. Each model PDB contains the coordinates
and Rosetta score corresponding to that model. In addition, the model
scores are summarized in table format in the score.sc
file.
The two main scoring terms to consider are:
total_score
: the total score is reflective of the
entire protein-ligand complex and is good as an overall model
assessmentinterface_delta_X
: the interface score is the
difference between the bound protein-ligand complex and the unbound
protein-ligand. Interface score is useful for analyzing ligand effects
and for comparing different complexes.One other metric to keep an eye on is the
Transform_accept_ratio
. This is the fraction of Monte Carlo
moves that were accepted during the low resolution Transform grid
search. If this number is zero or very low, the search space may be too
restrictive to allow for proper sampling.
In benchmarking examples when we have a correct crystal structure,
ligand_rms_no_super_X
will give us the RMSD difference
between our model ligand and the crystal structure ligand given in
crystal_complex.pdb
.
This is an important metric when benchmarking how well your models correlate to reality. When the crystal structure is unknown, we can also calculate model RMSDs using the best scoring structure as the “true answer”.
Use PyMOL to visually compare your best-scoring model and
worse-scoring model with the crystal structure provided in
crystal_complex.pdb
.
The “All->Action->preset->ligand sites->cartoon” setting in PyMOL is ideal for visualizing interfaces.
What interactions were successfully predicted by Rosetta?
The visualize_ligand.py
script in the
scripts
directory is a useful shortcut to doing quick
visualizations of protein-ligand interfaces.
It takes in a PDB and generates a .pse
PyMOL session by
applying common visualization settings. The example below shows the
command lines for using this script on the 0001
model but
you are free to try it on any one (or more!) of your models.
This can to be done on the Docker Container side if the
pymol
Python package was installed (see above.) Therefore
this is a step that cannot be run “as is” on macOS.
Note that the tutorial file name is 3PBL_A_ETQ_0001.pdb
but our result files do not have the _A_
portion.
Run this command from the Docker Container Terminal, assuming we are
within the ligand_docking
directory which in turn contains
the scripts
directory:
scripts/visualize_ligand.py 3PBL_ETQ_0001.pdb
On the Mac side, double click the resulting
3PBL_ETQ_0001.pse
file which will open with PyMOL.
Since we generated such a small number of structures, it is unlikely
to capture all the possible binding modes that you would expect to
encounter in an actual docking run. In the out
directory,
there are 50 models pre-generated using the exact same protocol.
We will look at an example of how we can analyze this dataset.
cd
into the out directory in your
ligand_docking
:
cd out
Directory out
also contains additional files:
score.sc
, score_vs_rmsd.csv
,
rmsds_to_best_model.data
, and several .png
image files.
score.sc
: summary score file for the 50 structures as
outputted by Rosetta.score_vs_rmsd.csv
: a comma separated file with the
filename in the first column, total_score for the complex in the second
column, the interface score in the third column, and ligand RMSD to the
native structure in the fourth column.This file was tabulated using the extract_scores.bash
script and the score.sc
file as input.
This is a very specific script made for extracting useful information in ligand docking experiments. However, the script can be easily customized for extracting other information from Rosetta score files. If you have any in-depth questions about how it works or how to modify it, feel free to ask.
To see how it in action, run:
# assumes we are in the out directory
../scripts/extract_scores.bash score.sc
rmsds_to_best_model.data
: a space separated file
containing RMSD comparisons with the best scoring model (not crystal
structure!) for all PDB files.A more detailed discussion of this file will come further down in the tutorial. This file has the filename in the first column, an all heavy-atom RMSD in the second column, a ligand only RMSD without superimposition in the third column, a ligand only RMSD with superimposition in the fourth column, and heavy atom RMSDs of side-chains around the ligand in the fifth column.
This file is generated using the
calculate_ligand_rmsd.py
script. It uses pymol
to compare PDB structures containing the same residues and ligand atoms.
(As such this command does not run by default on macOS.)
It’s a quick way of calculate ligand RMSDs of Rosetta models.
To see how this works, let’s try it on the five models we generated in the previous steps:
STOP
The script calculate_ligand_rmsd.py
on lines 83 and 221
using the print
statement from Python version 2 that will
cause an error. These 2 lines need to be converted to the
print()
function of Python 3. This can be done with the
following stream editor (sed
) regular
expression script which edit and overwrite the original script:
# Assumes we are still in out directory
sed -i -r 's/^(\s*print)\s+(.*)/\1(\2)/g' ../scripts/calculate_ligand_rmsd.py
Note: if there is an error such as sed: couldn’t open
temporary file: Permission denied remove -i
and create
a new file with e.g. .py3
extension:
# Assumes we are still in out directory
sed -r 's/^(\s*print)\s+(.*)/\1(\2)/g' ../scripts/calculate_ligand_rmsd.py > ../scripts/calculate_ligand_rmsd.py3
# make it executable
chmod a+x ../scripts/calculate_ligand_rmsd.py3
(Credit: Copilot converted the vi
editor
command found on StackOverflow.)
See the Appendix
of my Blog for details.
Back to the Tutorial material:
# Change directory up one level
cd ../
# Run the script with .py or .py3 ending (see above)
./scripts/calculate_ligand_rmsd.py -n 3PBL_ETQ_0003.pdb -c X -a 7 -o rmsds_to_best_model.data *_000*.pdb
Run the script with -h
to obtain relevant information
that will detail the chosen options. For example -c X
is
for choosing chain X
.
This command compares all five of your models to the one after the
-n
option. Your best scoring model may not be the one
labelled 0003 so feel free to customize that option. The -c tells the
script that the ligand is denoted as chain X. The -a tells the script to
use 7 angstroms as the cutoff sphere for side-chain RMSDs. The -o option
is the output file name. Lastly, we provided a list of PDBs using the
wildcard selection.
The script produces the rmsd_to_best_model.data
file
that you can open in any text editor. Feel free to ask questions if you
like to discuss more of how to customize this script for your own
applications.
Now let’s go back to the pre-generated model directory:
cd out
PNG files
: plots made from the various data file
mentioned above. Python and the matplotlib
package was used
here but you are free to use any plotting software you prefer.In this case, we have the correct answer based on the crystal structure so we can examine a score vs rmsd plot to see if the better scoring models are indeed closer to the native ligand binding mode.
Open up the plot with the following command:
Use macOS command below or choose graphically.
# gthumb score_vs_crystal_rmsd_plot.png <- tutorial orginal command
# cd to out directory
cd $ROSETTA3_DEMOS/tutorials/ligand_docking/out
# Open file
open -a /System/Applications/Preview.app score_vs_crystal_rmsd_plot.png
On the X-axis you will see the ligand RMSD to the ligand in the crystal structure. On the Y-axis you will see the interface delta score in Rosetta Energy Units. Notice the general correlation between RMSD and Rosetta Score, with a large cluster of highly accurate and good scoring models in the lower left hand corner.
In practical applications, we would not have the crystal structure for comparison. However, we can treat the best scoring model as the correct model and see if we generate a similar funnel.
This is one application of how we might use the
calculate_ligand_rmsd.py
script discussed earlier.
Once we identify a desired “best model”, we can run the script to
generate the rmsds_to_best_model.data
.
Some scripting may be required to put the information from multiple files together, depending on which software package you choose to graph with. To identify the best scoring model for this example, I selected the top 200 models based on the best overall score and then identified the best model by interface score.
The best model for these plots is 3PBL_A_ETQ_0347.pdb
.
Open up the first plot with:
# gthumb score_vs_low_rmsd_plot.png <- tutorial original command
# macOS:
open -a /System/Applications/Preview.app score_vs_low_rmsd_plot.png
Again, we see a cluster of good scoring models near the best scoring model with a general downward trend further away. We can zoom in on the cluster in the lower left hand corner to get an even better picture.
# gthumb score_vs_low_rmsd_zoom_plot.png <- tutorial original command
# macOS:
open -a /System/Applications/Preview.app score_vs_low_rmsd_zoom_plot.png
We see the same overall trend in this cluster, suggesting that the top scoring models in this run are likely to be good predictors of the true ligand binding position.
Finally let’s look at some structures. To sort the CSV file by interface score and take the top twenty, type:
sort -t, -nk3 score_vs_rmsd.csv | head -n 20
Note: -t,
defines the column/field delimiter as
comma (since we are using a CSV file.) -n
sorts
numerically, on the 3rd column: -k3
These should all be very low RMSD models. To compare a certain structure to the native in PyMOL, use:
# Do not run
# pymol 3PBL_ETQ_0001.pdb ../crystal_complex.pdb <- tutorial original command
Open PyMOL on your Mac with open -a
command shown
previously.
Don’t forget the ligand site preset mode for visualizing interfaces
or use the visualize_ligand.py
script to generate PyMOL
session .pse
files. If you like, we can also look at some
of the poor scoring models to see exactly what went wrong. To find the
top 20 worse models by interface score:
sort -t, -nk3 score_vs_rmsd.csv | tail -n 20
3PBL_A_ETQ_0033.pdb
(or 3PBL_ETQ_0033.pdb
)
should come up as a poor scoring, high RMSD structure. When we open it
up in PyMOL, we can see that the ligand binding direction is different
from the native position. This can happen when there is an extended
binding pocket but in this case, the Rosetta score was able to discern
the difference between these models.
# Do not run
# pymol 3PBL_A_ETQ_0033.pdb ../crystal_complex.pdb
Open PyMOL on your Mac with open -a
command shown
previously.
Some notes on binding pockets and adjusting the sampling space of the ligand
Q: My protein has a quite large cavity and a small
ligand (not bigger than a Leucine). In the XML file these are the
standard parameters:
<Transform name="transform" chain="X" box_size="7.0" move_distance="0.2" angle="20" cycles="500" repeats="1" temperature="5"/>
.
Why is the move_distance
and the angle so small? Would it
make sense to increase the box_size
to 12, the
move_distance
to 5 and the angle to 360 to sample more
space the ligand is allowed to move in?
A: The Transform algorithm is a Monte Carlo
procedure, and the move_distance
and angle are the size for
the individual steps in the MC protocol, not the maximum extent of the
movement. They’re also the sd
of a gaussian, so
you’re not necessarily limited to the given amount in any given step.
That said, if you’re increasing the size of the pocket, it might make
sense to bump the move size up in proportion. (So for a box size of 12,
a move_distance
of 0.25 to 0.5 or so might be
appropriate.)
If you’re exploring the pocket, I’d also suggest setting the
initial_perturb
option. By default Transform will always
start with the input position. If you add the
initial_perturb=X.X
option, then it will first randomize
the starting location of the ligand within an X.X
Angstrom
sphere from the starting position, as well as randomizing the
orientation. – And a new random position/orientation will be taken for
each nstruct, so you can sample the pocket even if your MC moves aren’t
sufficient to wander across it. Also, if you’re increasing the size of
the pocket, you’re likely going to need to increase the size of the Grid
such that it will cover the maximal extent of ligand travel. If it
doesn’t, Transform will reject any ligand which accidentally falls
outside the grid.
Q: Does the initial_perturb
option also
randomize the angles? Because there is another option
initial_angle_perturb
in the
TransformMover
.
A: Yes, by default setting
initial_perturb
will completely randomize the angles
(ligand orientation) - the initial_angle_perturb
is there
if you want to reduce the angle perturbation. (e.g. if you’re
refining the orientation)
Congratulations, you have performed RosettaLigand docking study! Now use your docked models to generate hypotheses and test them in the wet lab!
The out
directory (see above) PNG plots of comparison of
models with the crystal structure. There is no code explaining the
creation of the plots and the text suggests that the plots were derived
from “top 200 models” i.e. more than the 50
structures included in the directory. However, we could still plot
results from these 50 structures with either R
or
python
.
The following code was created by the Microsoft Copilot AI and is minimally edited for the name and location of the input file(s). It was tested and works!
Assumes ggpplot2
has been
installed, or install with command at the R console:
install.packages("ggplot2")
.
Make sure that you are in the directory containing the desired CSV
file, or pointing to the correct directory (here ./out
).
The size and color of the points within the plot was also added
afterwards.
The plot is shown graphically but can be saved manually.
# Read the data from a text file (assuming the file is named 'data.txt')
data <- read.csv("out/score_vs_rmsd.csv", header=FALSE)
# Extract the 3rd and 4th columns
x <- data$V3
y <- data$V4
# Load the ggplot2 library for plotting
library(ggplot2)
# Create a scatter plot
ggplot(data, aes(x=x, y=y)) +
geom_point(colour = "red", size = 3) +
theme_minimal() +
labs(title="Scatter Plot of 3rd and 4th Columns",
x="3rd Column",
y="4th Column")
Assumes pandas
and matplotlib
are installed
(e.g. using pip
with the more modern command as:
python -m pip install pandas matplotlib
).
In python the first item is o
hence 3rd and 4th columns
are columns 2 and 3 (unlike in R
.)
The script will export a PNG file called plot.png
import pandas as pd
import matplotlib.pyplot as plt
# Read the data from a text file (assuming the file is named 'data.txt')
data = pd.read_csv('./out/score_vs_rmsd.csv', sep=',', header=None)
# Extract the 3rd and 4th columns
x = data.iloc[:, 2]
y = data.iloc[:, 3]
# Create a scatter plot
plt.scatter(x, y)
plt.title('Scatter Plot of 3rd and 4th Columns')
plt.xlabel('3rd Column')
plt.ylabel('4th Column')
plt.grid(True)
# Save the plot to a PNG file
plt.savefig('plot.png')