Comparative ("Homology") Modeling for Beginners
with Free Software
by Eric Martz, June 2001

Disclaimer: I have neither experience nor expertise in comparative ("homology") modeling. In my molecular visualization workshops I am often asked about it, so I have gathered the information below.

This document is a supplement to Protein Explorer. Comparative modeling cannot be done within Protein Explorer, but a comparative model produced outside of Protein Explorer with the methods below can then be loaded into Protein Explorer for visualization.

If you are reading this on paper, you can use the hyperlinks at

http://molvis.sdsc.edu/protexpl/homolmod.htm

Summary.

Comparative ("homology") modeling approximates the 3D structure of a target protein for which only the sequence is available, provided an empirical 3D "template" structure is available with >30% sequence identity. In 2001, about 20% of sequences (in Swiss-Prot/TrEMBL) have suitable templates for comparative modeling of at least part of the sequence. Comparative models are useful for getting a rough idea of where the alpha carbons of key residues sit in the folded protein. They can guide mutagenesis experiments, or hypotheses about structure-function relationships. Comparative models are unreliable in predicting the conformations of insertions or deletions, i.e. portions of the sequence that don't align with the sequence of the template, as well as the details of sidechain positions. Comparative models are unlikely to be useful in modeling ligand docking (drug design) unless the sequence identity with the template is >70%, and even then they are less reliable than an empirical crystallographic or NMR structure.

SWISS-MODEL makes it quick and easy to submit a target sequence and get back an automatically generated comparative model, provided an empirical structure with >30% sequence identity exists to use as a template. (The template will be identified automatically, and the alignment made automatically.) These automated models may be useful, but will sometimes have errors that could be avoided if manual adjustments are made to the sequence alignment by an expert. Learning to optimise your models manually would take some time (see resources below).

In 2002, a new automated comparative modeling server came on-line: ESyPred3D.

DeepView is freeware integrated with SWISS-MODEL to help you visualize and evaluate the model, aligned with the template. The best way to learn how to do this is with Gale Rhodes' superb tutorial.

Contents

What is comparative modeling?

What are the uses of comparative models?

What if there is no template?

How good can comparative modeling be?

The importance of the sequence alignment.

Databases of ready-made comparative models.

Introductions to the Principles of Comparative Modeling.

Tutorials and Courses: How To Do Comparative Modeling.

Comparative Modeling Servers and Software.

References.

1. What is comparative modeling?

Suppose you want to know the 3D structure of a target protein that has not been solved empirically by X-ray crystallography or NMR. You have only the sequence. If an empirically determined 3D structure is available for a sufficiently similar protein (50% or better sequence identity would be good), you can use software that arranges the backbone of your sequence identically to this template. This is called "comparative modeling" or "homology modeling". It is, at best, moderately accurate for the positions of alpha carbons in the 3D structure, in regions where the sequence identity is high. It is inaccurate for the details of sidechain positions, and for inserted loops with no matching sequence in the solved structure.

A comparative modeling routine needs three items of input:

The sequence of the protein with unknown 3D structure, the "target sequence".
A 3D template, chosen by virtue of having the highest sequence identity with the target sequence. The 3D structure of the template must be determined by reliable empirical methods such as crystallography or NMR, and is typically a published atomic coordinate "PDB" file from the Protein Data Bank.
An alignment between the target sequence and the template sequence.
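
Since the template is chosen by sequence identity, it may help to see how that number can be computed from an alignment. The following is a minimal sketch, not the method used by any particular server: the function name percent_identity is hypothetical, and it assumes identity is counted over aligned (non-gap) columns only. Other definitions divide by the length of the shorter sequence or of the whole alignment, and give somewhat different values.

```python
def percent_identity(target_aln: str, template_aln: str, gap: str = "-") -> float:
    """Percent of aligned (non-gap) columns in which the two residues are identical.

    Assumption: columns containing a gap in either sequence are ignored.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for t, s in zip(target_aln.upper(), template_aln.upper()):
        if t == gap or s == gap:
            continue  # skip insertions and deletions
        aligned += 1
        if t == s:
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    target = "MKT-AYIAKQR"
    template = "MKSLAYLA-QR"
    print(round(percent_identity(target, template), 1))  # 7 of 9 aligned columns match: 77.8
```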

First, the comparative modeling routine arranges the backbone identically to that of the template. This means that not only the positions of alpha carbons, but also the phi and psi angles and secondary structure, are made identical to the template. Next, the more sophisticated comparative modeling packages adjust sidechain positions to minimize collisions, and may offer further energy minimization or molecular dynamics in an attempt to improve the model.
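
The first step described above, copying the template backbone onto the target according to the alignment, can be sketched in a few lines. This is a simplified illustration that handles alpha carbon positions only; the function name copy_backbone and the example coordinates are hypothetical, and real packages also build loops, sidechains and the remaining backbone atoms.

```python
from typing import List, Optional, Tuple

Coord = Tuple[float, float, float]


def copy_backbone(target_aln: str, template_aln: str,
                  template_ca: List[Coord], gap: str = "-") -> List[Optional[Coord]]:
    """Give each target residue the alpha carbon position of its aligned template residue.

    Target residues aligned to a gap in the template (insertions) get None,
    because the template offers no coordinates for them.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    if sum(1 for c in template_aln if c != gap) != len(template_ca):
        raise ValueError("need one coordinate per template residue")
    model: List[Optional[Coord]] = []
    tpl_index = 0
    for tgt, tpl in zip(target_aln, template_aln):
        coord = None
        if tpl != gap:
            coord = template_ca[tpl_index]
            tpl_index += 1
        if tgt != gap:
            model.append(coord)  # template residues opposite a target gap are dropped
    return model


if __name__ == "__main__":
    # Made-up coordinates for a four-residue template (A, G, D, E).
    ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    print(copy_backbone("AC-DE", "A-GDE", ca))
```

In the example, the inserted target residue C has no coordinates, which is one way to see why inserted loops are the least reliable parts of a comparative model.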

2. What are the uses of comparative models?

Successful predictions based on comparative models have been reviewed by Baker and Sali (2001). The following is summarized from their review, where references to specific cases may be found. The positions of conserved regions of the protein surface can help to identify putative active sites and binding pockets. If the ligand is known to be charged, the binding site may be predicted by searching the surface for a cluster of complementary charges. The size of a ligand may be predicted from the volume of the putative binding pocket. In one case, relative affinities of a series of ligands have been predicted. Such predictions are useful to guide mutagenesis experiments.

2. What if there is no template?

If there is no empirically determined structure with at least 30% sequence similarity to the target sequence, then there may be no template available that is suitable for reliable comparative modeling. In 2001, this is true for >80% of the sequences in Swiss-Prot (Liisa Holm, personal communication). The goal of "structural genomics" is to crystallize proteins, or protein domains, selected to provide templates for families of related sequences for which suitable templates are lacking. By one estimate (Vitkup et al., 2001), providing templates for 90% of all protein domain families, including membrane proteins, will require the empirical solution of about 16,000 new domain structures. This is well in excess of the structural information presently in the Protein Data Bank, but may be achievable within a decade as a result of advancements in high throughput crystallography.

When no suitable template is available for comparative modeling, de novo modeling methods (also called ab initio modeling) may be used. The success rate with such modeling is considerably lower than that with comparative modeling. According to Baker and Sali: "For roughly 40% of proteins shorter than 150 amino acids [<15 kD] that have been examined, one of the five most commonly recurring models generated by Rosetta has sufficient global similarity to the true structure to recognize it in a search of the protein structure database. ... The accuracy of de novo models is too low for problems requiring high-resolution structure information."

2. How good can comparative modeling be?

Two proteins with a high level of sequence identity, and very similar secondary and tertiary structure (identical "folds"), will nevertheless not have exactly identical backbone conformations, even when determined under comparable conditions. A comparative model can be expected to differ from the real structure to at least this extent. Overall differences in protein backbone structures are quantitated with the root mean square deviation of the positions of alpha carbons, or rmsd. "A model can be considered 'accurate enough' or as 'accurate as you can get' when its rmsd is within the spread of deviations observed for experimental structures displaying a similar sequence identity level as the target and template sequences" (Schwede et al., 3DCrunch). How big is this spread?

The 3DCrunch project used the SWISS-MODEL routines to do comparative modeling on all sequences in the Swiss-Prot database for which appropriate templates exist. (In 2001, about 20% of the sequences have templates with >30% sequence identity with at least part of the sequence [Liisa Holm, personal communication].) In the same project, in order to assess the accuracy of comparative modeling, 1,200 models were made for previously solved structures (see Reliability of models generated by SWISS-MODEL). This enabled comparisons of comparative models with empirical structures for the same sequence, where the comparative model was made using a template with the most similar sequence available, other than the target sequence itself.

To provide a frame of reference for rmsd values, note that up to 0.5 Å rmsd of alpha carbons occurs in independent determinations of the same protein (Chothia and Lesk, 1996). Proteins with 50% sequence identity have on average 1 Å rmsd (Schwede et al., 3DCrunch). The values given above are for X-ray crystallographic determinations; NMR determinations have rmsds several-fold higher.
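
For readers who want to see what an rmsd value measures, here is a minimal sketch of the calculation: the square root of the mean squared distance between corresponding alpha carbons. It assumes the two structures have already been superimposed and that the coordinate lists are in the same residue order; finding the optimal superposition is a separate step not shown here. The function name ca_rmsd and the coordinates are made up for illustration.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def ca_rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """Root mean square deviation (in the units of the input, e.g. Angstroms).

    Assumption: the structures are already superimposed, and coords_a[i]
    corresponds to coords_b[i].
    """
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
    empirical = [(0.0, 0.0, 1.0), (3.8, 0.0, -1.0)]
    print(ca_rmsd(model, empirical))  # each atom is displaced by 1.0, so rmsd is 1.0
```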

If we define a "highly successful comparative model" as one having <=2 Å rmsd from the empirical structure, then the template must have >=60% sequence identity with the target for a success rate >70%. Even at high sequence identities (60%-95%), as many as one in ten comparative models have an rmsd >5 Å vs. the empirical structure. Below 40% sequence identity, serious errors begin to appear more often. For the complete distribution of results, see Reliability of models generated by SWISS-MODEL, particularly Table I.

3. The importance of the sequence alignment.

The comparative modeling routine will proceed to arrange the backbone of the target sequence according to that of the template, using the sequence alignment to decide where to position each residue. Therefore, the quality of the sequence alignment is of crucial importance. Misplaced indels (gaps representing insertions or deletions) will cause residues to be misplaced in space. Although there are many routines that will do alignments automatically, careful inspection and adjustment by someone with specialized training may improve the quality of the alignment, and hence, of the comparative model. Good tutorials on such corrections will be found under the "Correcting alignments" links in Gert Vriend's Homology Modeling Course. DeepView (see below) provides features that assist in adjusting the alignment easily.
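
The effect of a misplaced gap can be shown with a toy example. The sketch below (the function name residue_pairs and the sequences are hypothetical) lists, for each target residue, the index of the template residue it is aligned with, and then compares two alignments of the same pair of sequences that differ only in where the gap is placed.

```python
from typing import List, Optional


def residue_pairs(target_aln: str, template_aln: str, gap: str = "-") -> List[Optional[int]]:
    """For each target residue, the 0-based index of its aligned template residue.

    None means the target residue is aligned to a gap in the template.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    pairs: List[Optional[int]] = []
    tpl_index = 0
    for tgt, tpl in zip(target_aln, template_aln):
        partner = tpl_index if tpl != gap else None
        if tpl != gap:
            tpl_index += 1
        if tgt != gap:
            pairs.append(partner)
    return pairs


if __name__ == "__main__":
    template = "ACDEFGH"
    good = residue_pairs("ACD-FGH", template)     # gap opposite the deleted E
    shifted = residue_pairs("A-CDFGH", template)  # same gap, placed two columns early
    moved = sum(1 for a, b in zip(good, shifted) if a != b)
    print(good)     # [0, 1, 2, 4, 5, 6]
    print(shifted)  # [0, 2, 3, 4, 5, 6]
    print(moved)    # 2 target residues would be built on the wrong template positions
```

Every target residue between the correct and the misplaced gap position inherits the coordinates of the wrong template residue, so the error grows with the distance by which the gap is misplaced.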

4. Databases of Ready-Made Comparative Models.

ModBase is worth checking because if you find a model, it provides a PIR-formatted sequence alignment ready to paste into Protein Explorer's MSA3D (see below). 3DCrunch does not provide this. It might also be worth comparing models of the same sequence from ModBase vs. SWISS-MODEL because they use different algorithms.

It is quicker and easier to submit your sequence to SWISS-MODEL than to try to find a model in 3DCrunch, and you'll get the same "first approach" results either way. 3DCrunch appears not to have been updated since 1998, and only sequences in Swiss-Prot/TrEMBL were modeled, whereas you can submit any sequence to SWISS-MODEL.

ModBase (Andrej Sali et al., Rockefeller U, NY). Over 200,000 models, last updated July 2000. If your search finds models, click on the icon in the "Template-based view" column to get the model. If you find a model here the PIR alignment link will generate the alignment of the template with the target ready to paste into Protein Explorer's MSA3D. This will color the model by identity/similarity/difference from the template. Inserted loops are colored 'different'.
3DCrunch (Manuel Peitsch et al., GlaxoWellcome). 64,000 models made in 1998 from sequences in Swiss-Prot/TrEMBL using the SWISS-MODEL routines. Particularly interesting are the control data, Reliability of models generated by SWISS-MODEL.

5. Introductions to the Principles of Comparative Modeling.

Homology Modeling David R. Bevan, Virginia Tech.
Professional Gambling, R. Rodriguez, Gert Vriend, EMBL Heidelberg, Germany (since 2000, Univ. Nijmegen, Netherlands).
How to evaluate the quality of a model, Torsten Schwede, Manuel C. Peitsch & Nicolas Guex, ExPASy, Geneva, Switzerland.

6. Tutorials and Courses: How To Do Comparative Modeling.

Molecular Modeling for Beginners by Gale Rhodes, Univ. Southern Maine, includes an introduction to DeepView, and a superb tutorial on comparative modeling (look through the left index frame for the link to Homology Modeling).

This is the best starting place for beginners who want to learn about comparative modeling. It guides you through the use of NCBI Entrez to find a sequence in the human genome, using SWISS-MODEL to get a comparative model, and most importantly, using DeepView to visualize and evaluate the model.

DeepView (also known as SwissPDBViewer) is an excellent free modeling program by Nicolas Guex, Alexandre Diemand, Torsten Schwede & Manuel C. Peitsch at GlaxoWellcome. DeepView resources are indexed at molvisindex.org. DeepView comes with a built-in tutorial on homology modeling. This tutorial walks you through the steps but does not explain in detail what the program is doing. The SWISS-MODEL comparative modeling server returns a DeepView-ready PDB file, with the model and each template in a different layer. DeepView has automated routines to display the sequence alignment, adjust gap positions, show energetically unfavorable regions of the alignment, and find and fix sidechain clashes. It is very powerful, but the many keyboard shortcuts and hard-to-find options make it a challenge to use effectively on an occasional basis. The best place to start is Rhodes' tutorial (see immediately above).
Homology Modeling, Gert Vriend, University of Nijmegen (in USA, say "Nigh-maygen"), Netherlands.
Principles of Protein Structure, Comparative Protein Modelling and Visualisation, Nicolas Guex and Manuel C. Peitsch, GlaxoWellcome, Plan-les-Ouates, Switzerland.

7. Comparative Modeling Servers and Software.

SWISS-MODEL, An Automated Comparative Protein Modelling Server, Torsten Schwede, Manuel C. Peitsch & Nicolas Guex, ExPASy, Geneva, Switzerland.

SWISS-MODEL accepts one-letter amino acid code. If you need to convert your sequence from three-letter code, you can do it at Paul Stothard's Sequence Manipulation Site (U Alberta, Edmonton, Canada).
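
If you prefer to do the conversion yourself, it takes only a lookup table. The sketch below covers the 20 standard amino acids; the function name three_to_one is hypothetical, and it assumes the three-letter codes are separated by spaces, hyphens or other non-letter characters. Non-standard residues raise an error rather than being guessed.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def three_to_one(sequence: str) -> str:
    """Convert e.g. 'Met-Lys-Thr' or 'MET LYS THR' to 'MKT'."""
    codes = [c for c in re.split(r"[^A-Za-z]+", sequence) if c]
    letters = []
    for code in codes:
        key = code.upper()
        if key not in THREE_TO_ONE:
            raise ValueError(f"unrecognized residue code: {code!r}")
        letters.append(THREE_TO_ONE[key])
    return "".join(letters)


if __name__ == "__main__":
    print(three_to_one("Met-Lys-Thr-Ala-Tyr"))  # MKTAY
```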

Requirements for Swiss-Model:

BLAST search P value <10^-5.
>25% sequence identity with a template.
Minimal projected model length of 25 amino acids.
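
The three requirements above can be written as a simple pre-check to run on your own BLAST results before submitting. This is only an illustration of the listed thresholds: the function meets_swiss_model_requirements and its parameters are hypothetical and are not part of SWISS-MODEL, and "minimal length of 25" is interpreted here as 25 or more residues.

```python
def meets_swiss_model_requirements(blast_p_value: float,
                                   percent_identity: float,
                                   projected_length: int) -> bool:
    """Check a candidate template against the three thresholds listed above.

    Hypothetical helper for illustration; SWISS-MODEL applies its own checks.
    """
    return (
        blast_p_value < 1e-5          # BLAST search P value < 10^-5
        and percent_identity > 25.0   # >25% sequence identity with the template
        and projected_length >= 25    # projected model of at least 25 amino acids
    )


if __name__ == "__main__":
    print(meets_swiss_model_requirements(1e-12, 42.0, 180))  # True
    print(meets_swiss_model_requirements(1e-3, 42.0, 180))   # False: P value too high
```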

ESyPred3D is a newer automated comparative modeling server. You need only submit a sequence; the sequence alignment is done automatically. Optionally, you may specify one of three alignment methods, and/or you may specify the PDB ID and chain for a template. This server is described in an article in Bioinformatics, Sept. 2002.
DeepView: Integrated with SWISS-MODEL. See above under Tutorials.
WHAT IF Web Interface (click on Build/check/repair model). Roland Krause, Gert Vriend, Univ. Nijmegen (in USA, say "Nigh-maygen"), Netherlands.
To use the WHAT IF model builder, you must choose your template and prepare your alignment first. Instructions for doing these are beyond the scope of the present guide.
The following opinion was sent to the Protein Data Bank Discussion Forum in November, 1999 by Gert Vriend:
There are several other comparative modeling servers, but they appear less fully developed than the two above.

8. References