Sequence Analysis Related Web Servers
by Christophe Lambert
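0. Contents
- 1. General BioScience Resources
- 1.1 Books
- 1.2 Journals, Newsletters & Bulletins
- 1.3 Internet Software & Documentation
- 2. Databases
- 2.1 Protein sequence databases
- 2.1.1 General databases
- 2.1.2 Specialised databases
- 2.2 Databases of 3D protein structures
- 2.3 Genomic resources
- 2.4 Indexed access to the databanks
- 2.5 Genomic tools
- 3. Searching Protein Sequence Databases
- 3.1 Searching by sequence similarity
- 3.2 Searching by structural similarity
- 4. Alignment
- 4.1 Multiple Sequence Alignment
- 4.2 Pairwise Sequence Alignment
- 5. Protein Secondary Structure Prediction
- 6. Protein Tridimensional Structure Prediction
- 6.1 Homology Modeling
- 6.2 Fold recognition
- 7. Protein structure assessment or validation
- 7.1 Atomic level
- 7.2 Amino acid level
- 8. Protein sequence analysis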
1. General BioScience Resources
1.1 Books
- M.J.E. Sternberg (1996), Protein Structure Prediction. A Practical Approach, IRL Press, Oxford.
- R. F. Doolittle (1996), Computer Methods for Macromolecular Sequence Analysis, Methods in Enzymology, volume 266. (DOOL1996)
1.2 Journals, Newsletters & Bulletins
1.3 Internet Software & Documentation
2. Databases
2.1 Protein sequence databases
2.1.1 General databases
- GenBank (Documentation): GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences
- PIR: The Protein Identification Resource consists of an integrated computer system composed of a number of protein and nucleic acid sequence databases and software designed for the identification and analysis of protein sequences and their corresponding coding sequences (DOOL1996, p. 41)
- SWISS-PROT (Documentation): SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases (DOOL1996, p. 4)
- TrEMBL: TrEMBL is a supplement to SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated into SWISS-PROT
2.1.2 Specialised databases
- BLOCKS: The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database. (DOOL1996, p. 88)
- EMOTIF: EMOTIF is a research system that forms motifs for subsets of aligned sequences. EMOTIF ranks the motifs that it finds by both their specificity (expected false positives) and the number of supplied sequences that it covers (true positives).
- GPCRDB: The G protein-coupled receptor database (GPCRDB) was started in 1989 to keep track of all new sequence data of this biologically important class of proteins. The systematic collection of these data has been a large undertaking which has been aided by Amos Bairoch, Gert Vriend, Kevin Lynch and others.
- IDENTIFY: IDENTIFY is the resulting database of EMOTIF motifs from all protein alignments in the BLOCKS and PRINTS databases. For each alignment, the database contains several motifs, each with an associated probability of matching a false positive. Highly specific motifs are well suited for searching entire proteomes, while generating very few false predictions.
- IMGT: IMGT, the international ImMunoGeneTics database, is a high-quality integrated database specialising in Immunoglobulins (Ig), T-cell receptors (TcR) and Major Histocompatibility Complex (MHC) molecules of all vertebrate species, created in 1989 by Marie-Paule Lefranc. IMGT includes three databases: LIGM-DB, a comprehensive database of Ig and TcR from human and other vertebrates, with translation for fully annotated sequences, MHC/HLA-DB, and PRIMER-DB (the last two in development). A tool, IMGT/DNAPLOT, allows Ig, TcR and MHC sequence analysis.
- OWL: The OWL database is a non-redundant protein sequence database produced from the following source databases: SWISS-PROT, PIR, GenBank, NRL-3D
- PRINTS: Protein Motif Fingerprint Database. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of OWL. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs: the database thus provides a useful adjunct to PROSITE
- PRODOM: The ProDom protein domain database consists of an automatic compilation of homologous domains detected in the SWISS-PROT database. The current version of ProDom was built using an entirely novel procedure based on recursive PSI-BLAST searches. Large families are much better processed with this new procedure than with the former DOMAINER program. However, please note that false positives occur more frequently, and in some instances closely related families can fail to be clustered.
- PROSITE: PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs
- PSORT: PSORT is a computer program for the prediction of protein localization sites in cells. It takes as input an amino acid sequence and its source origin, e.g., Gram-negative bacteria. It then analyzes the input sequence by applying stored rules for various sequence features of known protein sorting signals. Finally, it reports the possibility that the input protein is localized at each candidate site, with additional information.
- SignalP: The SignalP World Wide Web server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.
2.2 Databases of 3D protein structures (DOOL1996, p. 643)
- CATH: CATH is a novel hierarchical classification of protein domain structures, which clusters proteins at four major levels: class (C), architecture (A), topology (T) and homologous superfamily (H). Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically. Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. The topology level clusters structures according to their topological connections and numbers of secondary structures. The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons
- FSSP: The FSSP database is based on exhaustive all-against-all 3D structure comparison of protein structures currently in the Protein Data Bank (PDB). The classification and alignments are automatically maintained and continuously updated using the Dali search engine. (DOOL1996, p. 653)
- NRL-3D: The NRL_3D Sequence Database is produced by PIR from sequence and annotation information extracted from the Brookhaven Protein Databank (PDB) of crystallographic structures.
- PDB: The Protein Data Bank (PDB) is an archive of experimentally determined three-dimensional structures of biological macromolecules, serving a global community of researchers, educators, and students.
- SCOP: (DOOL1996, p. 635) The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.
2.3 Genomic resources
2.4 Indexed access to the databanks
- ENTREZ: (DOOL1996, p. 141) WWW Entrez allows you to retrieve molecular biology data and bibliographic citations from the NCBI's integrated databases. These include:
- DNA sequences from GenBank, EMBL, and DDBJ
- Protein sequences from Swiss-Prot, PIR, PRF, PDB, and translated protein sequences from the DNA sequence databases
- Genome and chromosome mapping data
- Three-dimensional protein structures derived from PDB, and incorporated into NCBI's Molecular Modeling Database (MMDB)
- PubMed bibliographic database containing citations for nearly 9 million biomedical articles from the National Library of Medicine's MEDLINE and pre-MEDLINE databases
- SRS: (DOOL1996, p. 114) The Sequence Retrieval System acts on data banks in a flat file or text format. It provides a homogeneous interface to about 80 biological databanks for accessing and querying their contents and for navigating among them.
2.5 Genomic tools
- NNPP: NNPP is a method that finds eukaryotic and prokaryotic promoters in a DNA sequence. The function of the promoter as an initiator for transcription is one of the most complex processes in molecular biology. It has been shown that multiple functional sites in the primary DNA are involved in the polymerase binding process. These elements, such as the TATA-box and the transcription start site ("Initiator") for eukaryotes, are known to function as binding sites for Polymerase II, transcription factors, and other proteins that are involved in the transcription initiation process. These promoter elements are present in various combinations separated by various distances in the sequence.
The basis of the NNPP program is a time-delay neural network. The time-delay network consists mainly of two feature layers, one for recognizing the TATA-box and one for recognizing the "Initiator", which is the region spanning the transcription start site. Both feature layers are combined into one output unit, which gives output scores between 0 and 1.
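As a rough illustration of the architecture described above, the sketch below applies two weight-shared feature detectors at every offset of a DNA sequence (the time-delay idea) and merges them into one output between 0 and 1. It is not the NNPP implementation: the weights, the biases, the placeholder motifs `TATAAA` and `TCAGTC`, the max-pooling step and all function names are assumptions chosen for readability, whereas a real network would learn its weights from promoter data.

```python
import math

BASES = "ACGT"


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def motif_weights(motif):
    """Hypothetical hand-set weights: +1 for the motif base, -1 otherwise."""
    return [[1.0 if base == wanted else -1.0 for base in BASES] for wanted in motif]


def feature_layer(sequence, weights, bias):
    """Apply the same weights at every offset and keep the strongest response.

    Sharing one weight set across all offsets is the time-delay idea;
    taking the maximum over offsets is a simplification made here.
    Raises ValueError for characters outside A, C, G, T.
    """
    width = len(weights)
    if len(sequence) < width:
        raise ValueError("sequence shorter than the feature window")
    responses = []
    for start in range(len(sequence) - width + 1):
        total = bias
        for offset in range(width):
            total += weights[offset][BASES.index(sequence[start + offset])]
        responses.append(sigmoid(total))
    return max(responses)


def promoter_score(sequence):
    """Combine both feature layers into one output unit in (0, 1)."""
    tata = feature_layer(sequence, motif_weights("TATAAA"), bias=-4.0)
    # "TCAGTC" is a placeholder initiator-like motif, not a curated consensus.
    initiator = feature_layer(sequence, motif_weights("TCAGTC"), bias=-4.0)
    return sigmoid(4.0 * tata + 4.0 * initiator - 4.0)


with_motifs = "GGGCGCTATAAAGGCGCGCGCGCGCGTCAGTCGG"
without_motifs = "GGGCGCGCGCGCGGCGCGCGCGCGCGGCGCGCGG"
print(promoter_score(with_motifs) > promoter_score(without_motifs))
```

With these toy weights, a sequence containing both motifs scores higher than a GC-only sequence of the same length; the absolute values carry no biological meaning.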
3. Searching Protein Sequence Databases
3.1 Searching by sequence similarity (DOOL1996, p. 212, 227)
- BLAST - NCBI: BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits. BLAST uses a heuristic algorithm which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity (DOOL1996, p. 131)
- Gapped BLAST - NCBI: The Gapped BLAST algorithm allows gaps (deletions and insertions) to be introduced into the alignments that are returned. Allowing gaps means that similar regions are not broken into several segments. The scoring of these gapped alignments tends to reflect biological relationships more closely.
- PSI-BLAST: In this program, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. It is generally much more sensitive than gapped BLAST
- FASTA : (DOOL1996, p. 227)
- BLITZ-EBI: Searches for similar fragments, with gaps and gap penalties, using the full Smith-Waterman algorithm; the alignments are optimal
- SAM-T99: Protein homology search. This server creates a SAM-T99 multiple alignment from a single sequence, builds an HMM from it, and then scores the selected database using the HMM. The SAM-T99 alignment is not returned, but a selected number of hits from the database are returned as a multiple alignment.
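The Smith-Waterman algorithm named in the BLITZ-EBI entry finds the best-scoring local alignment by dynamic programming. The sketch below is a minimal version under simplifying assumptions: a flat match/mismatch score and a linear gap penalty with illustrative values, where search servers typically use a substitution matrix and separate gap opening and extension penalties. The function name and parameters are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 is what makes the alignment local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GGGACGTGGG", "TTACGTTT"))
```

The table takes time and memory proportional to the product of the two sequence lengths, which is why heuristic programs such as BLAST and FASTA trade some sensitivity for speed.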
If there is significant similarity between a sequence of unknown structure and a sequence of known structure, the three-dimensional structure of the first sequence can be determined by homology (the two sequences are then assumed to share a common evolutionary origin).
Technical note: when using BLAST to search protein sequences, choose the blastp option.
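PSI-BLAST, listed above, searches the database with a position-specific score matrix built from significant alignments. The sketch below shows the general idea on a toy scale: per-column log-odds scores are computed from a few ungapped, equal-length aligned segments and then used to score every window of a target sequence. It is a simplification, not PSI-BLAST's procedure: the uniform background frequencies, the add-one pseudocount and the function names are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build per-column log-odds scores from ungapped, equal-length segments."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(len(aligned[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return pssm


def best_window(pssm, sequence):
    """Score every window of the target; return (score, start) of the best one."""
    width = len(pssm)
    scored = [
        (sum(pssm[k][sequence[start + k]] for k in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


pssm = build_pssm(["ACDE", "ACDE", "ACDF"])
score, start = best_window(pssm, "WWWACDEWWW")
print(start, round(score, 2))
```

Because each column has its own scores, a residue that is tolerated at one position can be penalised at another, which a single substitution matrix cannot express.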
3.2 Searching by structural similarity
- DALI: Compares two structures, or searches PDB for structures similar to a given structure; the structural similarity is presented as an alignment (DOOL1996, p. 653)
- SARF2: Same as DALI, but the program proposes several possible alignments for each pair of structures
4. Alignment
4.1 Multiple Sequence Alignment (DOOL1996, p. 343)
- Match-Box: The Match-Box software proposes protein sequence alignment tools based on strict statistical criteria. The method circumvents the gap penalty requirement: in the Match-Box method, gaps are the result of the alignment and not a governing parameter of the matching procedure. A reliability score is provided below each aligned position. The Match-Box program is particularly suitable for finding and aligning conserved structural motifs, in particular in protein cores
- CLUSTALW: Progressive multiple alignment (DOOL1996, p. 383)
- MAP: Same as CLUSTAL
- DIALIGN 2: DIALIGN is a novel alignment method developed by Burkhard Morgenstern et al. While standard alignment programs rely on comparing single residues and imposing gap penalties, DIALIGN constructs pairwise and multiple alignments by comparing whole segments of the sequences. No gap penalty is used. This approach is especially efficient where sequences are not globally related but share only local similarities, as is the case for genomic DNA sequences and for many protein families
- MSA: The MAP program computes a multiple global alignment of sequences using an iterative pairwise method. The underlying algorithm for aligning two sequences computes a best overlapping alignment between two sequences without penalizing terminal gaps. In addition, long internal gaps in short sequences are not heavily penalized. So MAP is good at producing an alignment where there are long terminal or internal gaps in some sequences. The MAP program is designed in a space-efficient manner, so long sequences can be aligned.
- BLOCKMAKER: Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms. At least two protein sequences must be provided to make blocks.
- BCM Search Launcher: Multiple Sequence Alignments: Submission page that forwards to several methods
4.2 Pairwise Sequence Alignment
- SIM: SIM finds k best non-intersecting alignments between two sequences or within a sequence using dynamic programming techniques. The alignments are reported in order of decreasing similarity score and share no aligned pairs. SIM requires space proportional to the sum of the input sequence lengths and the output alignment lengths, so it accommodates 100,000-base sequences on a workstation.
- BCM Search Launcher: Pairwise Sequence Alignment: Submission page that forwards to several methods
5. Protein Secondary Structure Prediction
- PHD: Prediction of secondary structure, solvent accessibility, and transmembrane helices (HTM) and their topology, based on a multiple alignment (DOOL1996, p. 525)
- Prof: Secondary structure prediction based on a multiple alignment
- PREDATOR: Secondary structure prediction; it performs best when the program is installed on a local workstation and the input data are prepared manually
- NPSA: Network Protein Sequence Analysis, consensus of many prediction methods.
- PSIpred: PSIPRED is a novel and reliable secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST (Position Specific Iterated - BLAST) (Altschul et al., 1997). Using a stringent cross validation method to evaluate the method's performance, PSIPRED is capable of achieving an average Q3 score of nearly 77%. This is the highest result for any published secondary structure prediction method to date. Predictions produced by PSIPRED were also submitted to the CASP3 server and assessed during the CASP3 meeting, which took place in December 1998 at Asilomar. Assessors at CASP3 ranked PSIPRED first out of all of the secondary structure prediction methods evaluated, achieving an average Q3 score of 73.4% over the hardest category and an overall average of 77% across all submitted targets
- TMHMM: Prediction of transmembrane helices in proteins
- BCM Search Launcher: Protein Secondary Structure Prediction: Submission page that forwards to several methods
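The Q3 score quoted for PSIPRED in the list above is the percentage of residues whose predicted three-state class (helix, strand or coil) agrees with the observed class. A minimal sketch, assuming one-letter state strings that use `H`, `E` and `C`; the function name and the example strings are illustrative only:

```python
def q3_score(predicted, observed):
    """Percentage of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Two of the ten positions disagree in this made-up example.
print(q3_score("HHHHCCEEEE", "HHHCCCEEEC"))
```

Q3 weights every residue equally, so it says nothing about whether whole helices or strands were placed correctly; it is best read alongside the per-category figures the authors report.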
6. Protein Tridimensional Structure Prediction
6.1 Homology Modeling
- ESyPred3D:
- SWISS-MODEL: SWISS-MODEL is an Automated Protein Modelling Server running at the GlaxoWellcome Experimental Research in Geneva, Switzerland. The purpose of this server is to make Protein Modelling accessible to all biochemists and molecular biologists worldwide.
6.2 Fold recognition
Fold recognition methods are currently developing rapidly. Their reliability is still limited. It is therefore necessary, on the one hand, to adhere closely to the criteria the authors give for evaluating results and, on the other hand, to combine the results of several methods.
- 3D-PSSM
- PSI-BLAST-BORK: Predicting Protein-3D structures based on homologous sequence search using PSI-BLAST on the PDB databank
- Pcons: The first consensus fold recognition method.
- SAM-T98: Search UCSC's protein structure HMM library with a single sequence. May reveal homologous structures
- TOPITS: Fold recognition based on the secondary structure prediction made by PHD
- UCLA-DOE Structure Prediction Server: Fold recognition by various methods, some of which use a prior secondary structure prediction (DOOL1996, p. 598)
- 123D: Fold recognition combining similarity matrices, secondary structure prediction and "contact capacity potentials"
7. Protein structure assessment or validation
7.1 Atomic level
- ANOLEA: Atomic non-local environment assessment
- PROVE: Protein volume evaluation
7.2 Amino acid level
- WHAT CHECK: Biotech Validation Suite for Protein Structures
- PROSA II: PROtein Structure Analysis
- VERIFY3D: Structure Evaluation server designed to help in the refinement of crystallographic structures
8. Protein sequence analysis
- SAPS: The SAPS program evaluates, using statistical criteria, a wide range of properties of one or more protein sequences: amino acid composition, charge distribution, repetitive structures, multiplets, periodicity.
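Two of the properties listed for SAPS, amino acid composition and charge content, are simple to compute by hand. The sketch below only counts residues and does not reproduce SAPS's statistical evaluation; treating K and R as positive and D and E as negative is an assumption of this example, and the function names are hypothetical.

```python
from collections import Counter


def composition(sequence):
    """Fraction of each amino acid present in the sequence."""
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in sorted(counts)}


def charge_summary(sequence, positive="KR", negative="DE"):
    """Count charged residues; the residue classes are assumptions."""
    pos = sum(sequence.count(aa) for aa in positive)
    neg = sum(sequence.count(aa) for aa in negative)
    return {"positive": pos, "negative": neg, "net": pos - neg}


example = "MKKRDEAL"  # made-up sequence for illustration
print(composition(example)["K"])
print(charge_summary(example))
```

Raw counts like these are only the starting point; deciding whether a composition or charge cluster is unusual requires the kind of statistical comparison the program provides.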
9. Illustrations
10. Acknowledgment
- Guy Baudoux: Thank you for your major contribution to this page. It was your knowledge of the bioinformatics resources available on the web that allowed me to write this page and to maintain it.
Last modified 19/11/1998 - Christophe.Lambert@fundp.ac.be