The primary protein sequence databases are:
1. UniProt
2. SWISS-PROT and TrEMBL (UniProt Consortium)
3. Protein Information Resource (PIR)
UniProt
Universal Protein Resource (UniProt) plays an increasingly important role by providing a stable, comprehensive, freely accessible central resource on protein sequences and functional annotation. UniProt is produced by the UniProt Consortium, formed in 2002 by the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB).
The UniProt Consortium and the individual activities
EBI, located at the Wellcome Trust Genome Campus in Hinxton, UK, hosts a large resource of bioinformatics databases and services. SIB, located in Geneva, Switzerland, maintains the ExPASy (Expert Protein Analysis System) servers that are a central resource for proteomics tools and databases. PIR, hosted by the National Biomedical Research Foundation (NBRF) at the Georgetown University Medical Center in Washington, DC, USA, is heir to the oldest protein sequence database.
Major components of UniProt
The core activities include manual curation of protein sequences assisted by computational analysis, sequence archiving, development of a user-friendly UniProt web site, and the provision of additional value-added information through cross-references to other databases. UniProt comprises three major components, each optimized for different uses: the UniProt Archive (UniParc), the UniProt Knowledgebase (UniProtKB) and the UniProt Reference Clusters (UniRef). An additional component consisting of metagenomic and environmental sequences has recently been added to UniProt to ensure that such sequences are available in a timely fashion. UniProt is updated and distributed on a bi-weekly basis and can be accessed online for searches or download at http://www.uniprot.org.
UniProtKB
UniProtKB consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. The former contains manually annotated records with information extracted from literature and curator-evaluated computational analysis. To achieve accuracy, annotations are performed by biologists with specific expertise. Information including function, catalytic activity, subcellular location, disease, structure and posttranslational modifications is annotated. An important part of the annotation process involves the merging of different reports for a single protein.
UniRef
The UniRef databases provide three clustered sets (UniRef100, 90 and 50) of sequences from UniProtKB and selected UniParc records in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view.
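The idea of clustering the same sequences at several identity thresholds can be illustrated with a toy greedy scheme in Python. This is only a sketch: the function names, the toy sequences, and the use of the standard library's difflib similarity ratio as a stand-in for alignment-based sequence identity are all assumptions made for illustration, not the procedure used to build UniRef.

```python
from difflib import SequenceMatcher


def identity(seq_a, seq_b):
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold):
    """Put each sequence into the first cluster whose representative it matches."""
    clusters = []  # list of (representative, members) pairs
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: seq starts a new cluster.
            clusters.append((seq, [seq]))
    return clusters


# Invented toy sequences, not real database entries.
toy_sequences = [
    "MKTAYIAKQRQISFVKSHFS",
    "MKTAYIAKQRWISFVKSHFS",  # one substitution relative to the first
    "MKTAYIAKQRQISFVKSHFS",  # exact duplicate of the first
    "GGGGGPPPPPGGGGGPPPPP",  # unrelated
    "MKTAYIAKQRQISFWWWWWW",  # shares only its first 14 residues with the first
]

for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(greedy_cluster(toy_sequences, threshold)))
```

Lowering the threshold merges more sequences behind each representative, which is how a coarser set hides more redundancy while still covering the same sequence space.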
UniParc
UniParc is the main sequence storehouse and is a comprehensive repository that reflects the history of all protein sequences. UniParc houses all new and revised protein sequences from various sources to ensure that complete coverage is available at a single site. It includes not only UniProtKB but also translations from the EMBL-Bank/DDBJ/GenBank Nucleotide Sequence Databases, the Ensembl database of animal genomes, the International Protein Index (IPI), the Protein Data Bank (PDB), NCBI's Reference Sequence Collection (RefSeq), the model organism databases FlyBase and WormBase, and protein sequences from the European, American and Japanese Patent Offices. To avoid redundancy, sequences are handled as strings: all sequences that are 100% identical over their entire length are merged, regardless of source organism. New and updated sequences are loaded on a daily basis, cross-referenced to the source database accession number and provided with a sequence version that increments upon changes to the underlying sequence.
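The string-based merging and version counting described above can be sketched in a few lines of Python. The class name, method name, and data layout are hypothetical and chosen only for illustration; the real archive is far more elaborate.

```python
class ToyArchive:
    """Illustrative only: merges identical sequences and versions cross-references."""

    def __init__(self):
        self.records = {}   # sequence string -> set of (source_db, accession)
        self.versions = {}  # (source_db, accession) -> (current sequence, version)

    def load(self, source_db, accession, sequence):
        """Store a sequence from a source database and return its sequence version."""
        sequence = sequence.upper()
        key = (source_db, accession)
        previous = self.versions.get(key)
        if previous is None:
            version = 1                 # first time this accession is seen
        elif previous[0] != sequence:
            version = previous[1] + 1   # underlying sequence changed
        else:
            version = previous[1]       # unchanged reload
        self.versions[key] = (sequence, version)
        # Identical strings share one record, whatever their source or organism.
        # Older records keep their cross-references, preserving the history.
        self.records.setdefault(sequence, set()).add(key)
        return version


archive = ToyArchive()
archive.load("DB_A", "A1", "MKT")
archive.load("DB_B", "B7", "mkt")   # same string from another source: merged
archive.load("DB_A", "A1", "MKTA")  # revised sequence: version becomes 2
print(len(archive.records))
```

Keying records on the sequence string alone is what makes the merge independent of source organism; the per-accession version is kept separately so that a revision in one source does not disturb the others.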
SWISS-PROT
SWISS-PROT is a protein sequence and knowledge database that is valued for its high-quality annotation, the usage of standardized nomenclature, direct links to specialized databases and minimal redundancy. The format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database for standardization purposes.
Core data and annotation
The core data, which is mandatory to each SWISS-PROT entry, consists principally of the amino acid sequence, the protein name (description), taxonomic data and citation information. If further information on the protein is available, the entries contain detailed annotation on items such as the function(s) of the protein, enzyme-specific information (catalytic activity, cofactors, metabolic pathway, regulation mechanisms), biologically relevant domains and sites, posttranslational modification(s), molecular weight determined by mass spectrometry, subcellular location(s) of the protein, tissue-specific expression, developmentally-specific expression of the protein, secondary structure, quaternary structure, splice isoform(s), polymorphism(s), similarities to other proteins, use of the protein in a biotechnological process, diseases associated with deficiencies in the protein, use of the protein as a pharmaceutical drug, sequence conflicts, etc. To acquire a maximum of up-to-date knowledge regarding a protein, information is not only obtained from publications reporting new sequence data, but also from review articles with an aim to revise periodically the annotations of families or groups of proteins.
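Because the SWISS-PROT format follows the EMBL flat-file conventions, each line of an entry starts with a two-letter line code (for example ID, AC, DE, OS and SQ), sequence lines are indented, and the entry ends with a line containing "//". The Python sketch below groups the lines of one entry by line code and collects the sequence. The function name is hypothetical and the entry shown is invented and heavily simplified; it is not a real database record.

```python
from collections import defaultdict


def parse_flat_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):
            break  # end-of-entry marker
        if in_sequence:
            # Sequence lines carry residues in space-separated blocks.
            sequence_parts.append(line.replace(" ", ""))
            continue
        code, value = line[:2], line[5:].strip()
        if code == "SQ":
            in_sequence = True  # everything up to "//" is sequence data
        fields[code].append(value)
    result = dict(fields)
    result["sequence"] = "".join(sequence_parts)
    return result


# Invented, simplified entry for illustration only.
TOY_ENTRY = """\
ID   TOY1_EXAMP     Reviewed;          12 AA.
AC   X00001;
DE   RecName: Full=Toy example protein;
OS   Examplia fictiva.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

entry = parse_flat_entry(TOY_ENTRY)
print(entry["AC"], entry["OS"], entry["sequence"])
```

The mandatory core data named above map onto these line codes: the sequence (SQ), the protein name (DE), taxonomic data (OS and related lines) and, in a full entry, citation lines that this toy example omits.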
Minimal redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. SWISS-PROT tries to merge all these data in order to minimize the redundancy of the database. Differences between sequencing reports due to splice variants, polymorphisms, disease-causing mutations, experimental sequence modifications or simply sequencing errors are indicated in the feature table of the corresponding SWISS-PROT entry. Splice isoforms may differ considerably from one another, with potentially less than 50% sequence similarity between isoforms.
TrEMBL: A computer-annotated supplement to SWISS-PROT
Why TrEMBL?
Due to the increased data flow from genome projects to the sequence databases, the SWISS-PROT protein knowledgebase faced a number of challenges in its time- and labor-intensive way of manual database annotation. While it is necessary to maintain the high annotation quality as described above, it is also vital to make sequences available as quickly as possible. To address this, TrEMBL (translation of EMBL nucleotide sequence database) was introduced in 1996.
TrEMBL consists of computer-annotated entries derived from the translation of all coding sequences (CDS) in the nucleotide sequence databases, except for CDS already included in SWISS-PROT. It also contains protein sequences extracted from the literature and protein sequences submitted directly by the user community.
Sections of TrEMBL
TrEMBL is subdivided into two sections: SP-TrEMBL contains sequences that will eventually be incorporated into SWISS-PROT, and REM-TrEMBL contains those that will not. The REM-TrEMBL entries include immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, fragments of fewer than 8 amino acids and coding sequences where there is strong experimental evidence that the sequence does not code for a real protein.
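The split between the two sections amounts to a simple decision rule, sketched below in Python. The Entry class, the tag names, and the function name are hypothetical; only the categories and the 8-residue fragment limit come from the text.

```python
from dataclasses import dataclass

MIN_FRAGMENT_LENGTH = 8  # fragments shorter than this stay in REM-TrEMBL

# Hypothetical tag names for the categories listed in the text.
REM_TAGS = frozenset({
    "immunoglobulin",
    "t_cell_receptor",
    "synthetic",
    "patent",
    "not_a_real_protein",
})


@dataclass
class Entry:
    sequence: str
    is_fragment: bool = False
    tags: frozenset = frozenset()


def trembl_section(entry):
    """Return the TrEMBL section an entry would fall into under these rules."""
    if entry.tags & REM_TAGS:
        return "REM-TrEMBL"
    if entry.is_fragment and len(entry.sequence) < MIN_FRAGMENT_LENGTH:
        return "REM-TrEMBL"
    return "SP-TrEMBL"


print(trembl_section(Entry("MKTAYIAKQRQISFVK")))
print(trembl_section(Entry("MKTAY", is_fragment=True)))
print(trembl_section(Entry("MKTAYIAKQRQISFVK", tags=frozenset({"patent"}))))
```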
In addition, there is a weekly update to TrEMBL called TrEMBLnew, produced from new nucleotide sequences deposited in the EMBL nucleotide sequence database. At each TrEMBL release, the TrEMBLnew entries are processed: any entries redundant against SWISS-PROT/TrEMBL are merged, and the remainder then progresses into TrEMBL.
Protein Information Resource (PIR)
The Protein Information Resource (PIR), located at Georgetown University Medical Center (GUMC), is an integrated public bioinformatics resource to support genomic and proteomic research and scientific studies. PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information. In 2002, PIR, along with its international partners EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics), was awarded a grant from NIH to create UniProt, a single worldwide database of protein sequence and function, by unifying the PIR-PSD, Swiss-Prot, and TrEMBL databases.
PIR-PSD
The PIR, along with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), continues to enhance and distribute the PIR-International Protein Sequence Database (PSD), a non-redundant, expertly annotated, fully classified and extensively cross-referenced protein sequence database in the public domain. It contains about 250 000 protein sequences with comprehensive coverage across the entire taxonomic range, including sequences from all the publicly available complete genomes.
Architecture of PIR-PSD
a. Superfamily classification
A unique characteristic of the PIR-PSD is the superfamily/family classification that provides complete and non-overlapping clustering of proteins based on global (end-to-end) sequence similarity. Sequences in the same superfamily share common domain architecture (i.e. have the same number, order and types of domains) and do not differ excessively in overall length.
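The two criteria in the last sentence can be expressed as a simple compatibility check, sketched below in Python. This tests only whether two proteins could belong to the same superfamily under those criteria; it does not perform the similarity-based clustering itself. The Protein class, the function name, the domain labels, and in particular the max_length_ratio cutoff are assumptions, since the text does not quantify "excessively".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    length: int      # number of residues
    domains: tuple   # ordered domain types, N- to C-terminus


def could_share_superfamily(a, b, max_length_ratio=1.5):
    """Check common domain architecture and comparable overall length."""
    # Tuple equality covers number, order and type of domains at once.
    same_architecture = a.domains == b.domains
    length_ratio = max(a.length, b.length) / min(a.length, b.length)
    return same_architecture and length_ratio <= max_length_ratio


# Illustrative proteins; lengths and domain labels are made up.
p1 = Protein(300, ("kinase", "SH2"))
p2 = Protein(320, ("kinase", "SH2"))
p3 = Protein(310, ("SH2", "kinase"))   # same domains, different order
p4 = Protein(900, ("kinase", "SH2"))   # same architecture, very different length

print(could_share_superfamily(p1, p2))
print(could_share_superfamily(p1, p3))
print(could_share_superfamily(p1, p4))
```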
b. Bibliography submission and literature mapping
Linking protein data to literature data that describes or characterizes the proteins is crucial for us to increase the amount of experimentally verified data and to improve the quality of protein annotation. Attribution of protein annotations to validated experimental sources provides effective means to avoid propagation of errors that may have resulted from large-scale genome annotation.
iProClass (integrated Protein Classification)
The iProClass (integrated Protein Classification) database is designed to provide comprehensive descriptions of all proteins and to serve as a framework for data integration in a distributed networking environment. The database describes family relationships at both global (whole protein) and local (domain, motif, site) levels, as well as structural and functional classifications and features of proteins. The current version consists of more than 270 000 non-redundant PIR-PSD and SWISS-PROT proteins organized with more than 33 000 PIR superfamilies, 100 000 families, 3400 PIR homology and Pfam domains, 1300 ProClass/ProSite motifs, 280 PIR post-translational modification sites, and links to over 40 databases of protein families, structures, functions, genes, genomes, literature and taxonomy.
Directly linked to the iProClass sequence report are two additional PIR databases, ASDB and RESID. PIR-Annotation and Similarity Database (ASDB) lists pre-computed, biweekly updated FASTA neighbors of all PSD sequences with annotation information and graphical displays of sequence similarity matches. PIR-RESID documents over 280 post-translational modifications and links to PSD entries containing either experimentally determined or computationally predicted modifications with evidence tags.
PIR-NREF
As a major resource of protein information, one of our primary aims is to provide a timely and comprehensive collection of all protein sequence data that keeps pace with the genome sequencing projects and contains source attribution and minimal redundancy. The PIR-NREF protein database includes sequences from PIR, SWISS-PROT, TrEMBL, RefSeq, GenPept, PDB and other protein databases. The NCBI taxonomy is used as the ontology for matching source organism names at the species or strain (if known) levels. The NREF report provides source attribution (containing protein IDs, accession numbers and protein names from underlying databases), in addition to taxonomy, amino acid sequence and composite literature data. The composite protein names, including synonyms, alternate names and even misspellings, can be used to assist the ontology development on protein names and the identification of mis-annotated proteins. Related sequences, including identical sequences from different organisms and closely related sequences within the same organism, are also listed. The database presently consists of about 800 000 entries (Jan 2002) and is updated biweekly.
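A minimal Python sketch of how such a non-redundant reference set with source attribution might be assembled is given below. It assumes that entries are merged when both the sequence and the source organism match, which the description implies but does not spell out. The function names, record layout, database labels, and taxon identifiers are placeholders invented for illustration.

```python
from collections import defaultdict


def build_nref(records):
    """Merge records that share both source organism and sequence.

    records: iterable of (source_db, accession, protein_name, taxon_id, sequence).
    """
    merged = defaultdict(lambda: {"sources": [], "names": set()})
    for source_db, accession, name, taxon_id, sequence in records:
        entry = merged[(taxon_id, sequence.upper())]
        entry["sources"].append((source_db, accession))  # source attribution
        entry["names"].add(name)                         # composite protein names
    return dict(merged)


def identical_in_other_organisms(nref, taxon_id, sequence):
    """List taxa that hold the same sequence under a different organism."""
    sequence = sequence.upper()
    return sorted(t for (t, s) in nref if s == sequence and t != taxon_id)


# Placeholder records; identifiers and names are not real.
records = [
    ("DB_A", "A001", "toy kinase", 111, "MKTAYIAK"),
    ("DB_B", "B042", "toy kinase (fragment?)", 111, "MKTAYIAK"),
    ("DB_C", "C900", "toy kinase homolog", 222, "MKTAYIAK"),
]

nref = build_nref(records)
print(len(nref))
print(nref[(111, "MKTAYIAK")]["sources"])
print(identical_in_other_organisms(nref, 111, "MKTAYIAK"))
```

Collecting every name attached to a merged entry is what allows the composite names, including synonyms and misspellings, to be compared across the underlying databases.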