In genome sequencing data, the majority of proteins translated from
predicted open reading frames have no sequence similarity to any existing
protein. In these cases the proteins remain "hypothetical". We nevertheless
analyze these sequences with a number of programs so that we can add at
least some potential information, rather than having an entry that contains
only submission and sequence data. Again, in these cases, care is taken to
mark this information as potential so that it cannot be confused with data
from classified proteins.

The features we currently look for are signal sequences, transmembrane
regions, coiled coil domains, and a number of conserved domains described
in PROSITE and/or Pfam.


a) Signal sequence prediction

We make use of a program based on the von Heijne method (Nucleic Acids Res. 
14:4683-4690(1986)). The result in the entry is of the type:

FT   SIGNAL        1      x       POTENTIAL.
FT   CHAIN       x+1      y
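
The relationship between the two lines can be sketched in Python as follows.
This is a minimal illustration, not the program used for SWISS-PROT: the
function names `ft_line` and `signal_and_chain` are hypothetical, the
fixed-width layout is inferred from the entries shown in this document, and
the end of the signal sequence is assumed to come from a separate predictor.
The mature chain is taken to start at the residue after the signal sequence.

```python
def ft_line(key, start, end, description=""):
    """Format one feature-table line in the fixed-width layout of the entries."""
    # Key left-justified in 8 columns, positions right-justified in 6 columns,
    # description starting in column 35 (layout inferred from the examples).
    line = f"FT   {key:<8} {start:>6} {end:>6}       {description}"
    return line.rstrip()


def signal_and_chain(signal_end, seq_length):
    """Return the SIGNAL and CHAIN lines for a predicted signal sequence."""
    if not 1 <= signal_end < seq_length:
        raise ValueError("signal sequence must end inside the protein")
    return [
        ft_line("SIGNAL", 1, signal_end, "POTENTIAL."),
        ft_line("CHAIN", signal_end + 1, seq_length),
    ]


# Example: a signal sequence predicted to end at residue 18 of a 530 AA protein.
for line in signal_and_chain(18, 530):
    print(line)
```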


b) Transmembrane region prediction

We make use of a program that is based on four different methods:

  1. Eisenberg, Schwarz, Komaromy and Wall (J. Mol. Biol. 179:125-142(1984));
  2. Rao and Argos (Eur. J. Biochem. 128:565-575(1982));
  3. Klein, Kanehisa and DeLisi (Biochim. Biophys. Acta 815:468-476(1985));
  4. MEMSAT (Jones, Taylor and Thornton) (Biochemistry 33:3038-3049(1994)).

When methods 1 and 4 show the protein to have hydrophobic regions, the
transmembrane regions are added according to the results of method 1. The
result is:

FT   TRANSMEM      x      y       POTENTIAL.
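
The decision rule described above can be sketched as follows. This is only
an illustration under stated assumptions: the function name
`consensus_transmem` and the representation of each method's output as a
list of (start, end) pairs are hypothetical, and the predictors themselves
are not implemented here.

```python
def consensus_transmem(eisenberg_segments, memsat_segments):
    """Annotate only if methods 1 and 4 both find hydrophobic regions,
    and then take the region boundaries from method 1.

    Each argument is a list of (start, end) residue positions (1-based).
    """
    if eisenberg_segments and memsat_segments:
        return sorted(eisenberg_segments)
    return []


# Hypothetical predictor outputs for one protein.
method_1 = [(12, 32), (45, 65)]
method_4 = [(10, 30), (47, 66)]

for start, end in consensus_transmem(method_1, method_4):
    print(f"FT   TRANSMEM {start:>6} {end:>6}       POTENTIAL.")
```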


c) Coiled coil prediction

We make use of a program based on the algorithm of Lupas et al. (Science
252:1162-1164(1991)) that predicts coiled coil regions within the sequence.
A positive result from this program is annotated as:

FT   DOMAIN        x      y       COILED COIL (POTENTIAL).


d) PROSITE

PROSITE (http://www.expasy.ch/prosite/), the database of protein domains and
families, plays a major role in the addition of features to SWISS-PROT
entries, especially when no other information is available for the sequence.
Where patterns are matched, this can lead to the addition of comment lines,
keywords and features, either individually or in any combination. As an
example:

ID   YA4B_SCHPO     STANDARD;      PRT;   411 AA.
AC   Q09728;
DT   01-NOV-1995 (Rel. 32, Created)
DT   01-NOV-1995 (Rel. 32, Last sequence update)
DT   01-NOV-1995 (Rel. 32, Last annotation update)
DE   PUTATIVE METAL-BINDING REGULATORY PROTEIN C31A2.11C IN CHROMOSOME I.
GN   SPAC31A2.11C.
OS   Schizosaccharomyces pombe (Fission yeast).
OC   Eukaryota; Fungi; Ascomycota; Archiascomycetes;
OC   Schizosaccharomycetales; Schizosaccharomycetaceae;
OC   Schizosaccharomyces.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=972;
RA   Devlin K., Churcher C.M., Barrell B.G., Rajandream M.A., Walsh S.V.;
RL   Submitted (JUL-1995) to the EMBL/GenBank/DDBJ databases.
CC   -!- SUBCELLULAR LOCATION: NUCLEAR (POTENTIAL).
CC   -!- SIMILARITY: CONTAINS A "COPPER-FIST" DNA-BINDING DOMAIN.
DR   EMBL; Z50113; CAA90469.1; -.
DR   PFAM; PF00649; Copper-fist; 1.
DR   PRINTS; PR00617; COPPERFIST.
DR   PROSITE; PS01119; COPPER_FIST_1; 1.
DR   PROSITE; PS50073; COPPER_FIST_2; 1.
KW   Hypothetical protein; Transcription regulation; DNA-binding;
KW   Copper; Nuclear protein.
FT   DNA_BIND      1     40       COPPER-FIST.
SQ   SEQUENCE   411 AA;  45472 MW;  DB18E877B65D5699 CRC64;
     MVVINNVKMA CMKCIRGHRS STCKHNDREL FPIRPKGRPI SQCEKCRIAR ITRHLHVKCT
     CNSRKKGSKC STSSTTDLDS SSASNSSCSI PSSISEKLLP RDNVKTHCPK RSASCCGKKP
     DVMPLKINLE SQTDFMGMPL QSQRPHSESY RMLPEPEKFK SEYGYPSQFL PIEKLTSNVA
     YPPNYNNYLK SPYQQPTNFP PEIQYNYSHS PQHSIQEAEE AAVYGPPVYR SGYQILYNNN
     TDSIAAAAAT HDLYPQPDVP LTFAMLADGN YVPLPSSTNT YGPSNSYGYE ININESTNHV
     DSSYLPHPIQ LSNYFTLPSS CAQADAACQC GDNCECLGCL THPNNATTLA ALNHISALEK
     ETISHTDLHH TFKHEVNSSN NYELTNDELA ASSPLYTSSS VPPSHITTGS T
//

In the above example, note the PROSITE patterns represented in the DR lines.
These matches have helped in the addition of the similarity comment and the
copper-fist feature to the feature table. Note also the keywords,
specifically "Copper", which is directly linked to the copper-fist PROSITE
patterns.
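
To make the link between DR lines and annotation concrete, the
cross-references can be pulled out of an entry with a few lines of Python.
This is a minimal sketch, assuming the flat-file layout shown above; the
function name `parse_dr_lines` is hypothetical and this is not the tool used
by the SWISS-PROT curators.

```python
def parse_dr_lines(entry_text):
    """Return (database, accession, identifier) tuples from the DR lines."""
    xrefs = []
    for line in entry_text.splitlines():
        if not line.startswith("DR   "):
            continue
        # Drop the line code and the final full stop, then split the fields.
        fields = [f.strip() for f in line[5:].rstrip().rstrip(".").split(";")]
        if len(fields) >= 3:
            xrefs.append((fields[0], fields[1], fields[2]))
    return xrefs


# DR lines copied from the YA4B_SCHPO entry above.
entry = """\
DR   EMBL; Z50113; CAA90469.1; -.
DR   PFAM; PF00649; Copper-fist; 1.
DR   PRINTS; PR00617; COPPERFIST.
DR   PROSITE; PS01119; COPPER_FIST_1; 1.
DR   PROSITE; PS50073; COPPER_FIST_2; 1.
"""

prosite_hits = [x for x in parse_dr_lines(entry) if x[0] == "PROSITE"]
for database, accession, identifier in prosite_hits:
    print(accession, identifier)
```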

We have a method that automatically annotates a number of sites or domains
using PROSITE patterns. All features copied into the feature table by this
facility are closely assessed to ensure that they are valid for the
particular sequence from that particular organism.

Listed below are examples of just some of the features currently implemented.

 FT   CARBOHYD  POTENTIAL.
          Note: N-glycosylation only. O-linked sugars are added manually
          (only to eukaryotes).
 FT   NP_BIND   ATP (POTENTIAL).
          Note: is also used for the first of the three annotated NP_BIND
          elements in GTP-binding proteins, where ATP is changed to GTP
          manually.
 FT   CA_BIND   POTENTIAL.
          Note: EF-hand type.
 FT   ZN_FING   C2H2-TYPE.
 FT   ZN_FING   ZN-RIBBON.
 FT   BINDING   PHOSPHOPANTETHEINE (POTENTIAL).
 FT   DNA_BIND  HMG BOX.
 FT   DNA_BIND  HOMEOBOX.
 FT   DNA_BIND  ZN(2)-CYS(6), FUNGAL-TYPE.
 FT   DOMAIN    RNA-BINDING (RNP1) (POTENTIAL).
 FT   DOMAIN    HMA.
 FT   SITE      DEAD BOX.
 FT   REPEAT    BIR REPEAT.
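
As an illustration of how a pattern match can become a feature, the sketch
below scans a sequence for potential N-glycosylation sites. It assumes the
PROSITE N-glycosylation pattern N-{P}-[ST]-{P} (PS00001), which is not
spelled out in this document, and the function name
`find_n_glycosylation_sites` is hypothetical. It is not the annotation
program itself, and any hit would still need the assessment described above.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead keeps the match to the Asn only, so overlapping sites are found.
N_GLYC = re.compile(r"N(?=[^P][ST][^P])")


def find_n_glycosylation_sites(sequence):
    """Return 1-based positions of Asn residues matching the pattern."""
    return [m.start() + 1 for m in N_GLYC.finditer(sequence.upper())]


fragment = "VPGANSSSSY"  # short illustrative fragment
for pos in find_n_glycosylation_sites(fragment):
    print(f"FT   CARBOHYD {pos:>6} {pos:>6}       POTENTIAL.")
```

Positions are reported relative to the string passed in, so a fragment taken
from the middle of a protein needs its offset added before a feature line is
written.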


e) Pfam

Pfam (http://www.sanger.ac.uk/Software/Pfam/) is a large collection of
multiple sequence alignments and hidden Markov models covering many common
protein domains. Great use is made of this database, in conjunction with
PROSITE, for the automatic addition of annotation to TrEMBL entries. It also
provides important information for the curators as they begin to annotate
TrEMBL entries by highlighting the type of domain the sequence has.

Please note that when a modification or a binding event is annotated,
"potential" is added to show that it has not been determined
experimentally. Below is an example of such a case.

ID   YA9A_SCHPO     STANDARD;      PRT;   530 AA.
AC   Q09788;
DT   01-NOV-1995 (Rel. 32, Created)
DT   01-NOV-1995 (Rel. 32, Last sequence update)
DT   01-NOV-1995 (Rel. 32, Last annotation update)
DE   HYPOTHETICAL 54.2 KDA SERINE-RICH PROTEIN C13G6.10C IN CHROMOSOME I
DE   PRECURSOR.
GN   SPAC13G6.10C.
OS   Schizosaccharomyces pombe (Fission yeast).
OC   Eukaryota; Fungi; Ascomycota; Archiascomycetes;
OC   Schizosaccharomycetales; Schizosaccharomycetaceae;
OC   Schizosaccharomyces.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=972;
RA   Odell C., Bowman S., Barrell B.G., Rajandream M.A., Walsh S.V.;
RL   Submitted (SEP-1995) to the EMBL/GenBank/DDBJ databases.
DR   EMBL; Z54308; CAA91103.1; -.
KW   Hypothetical protein; Signal.
FT   SIGNAL        1     18       POTENTIAL.
FT   CHAIN        19    530       HYPOTHETICAL PROTEIN C13G6.10C.
FT   CARBOHYD     55     55       POTENTIAL.
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    128    128       POTENTIAL.
SQ   SEQUENCE   530 AA;  54210 MW;  1C6A0261F63DFF02 CRC64;
     MRTTFATVAL AFLSTVGALP YAPNHRHHRR DDDGVLTVYE TILETVYVTA VPGANSSSSY
     TSYSTGLASV TESSDDGAST ALPTTSTESV VVTTSAPAAS SSATSYPATF VSTPLYTMDN
     VTAPVWSNTS VPVSTPETSA TSSSEFFTSY PATSSESSSS YPASSTEVAS SYSASSTEVT
     SSYPASSEVA TSTSSYVAPV SSSVASSSEI SAGSATSYVP TSSSSIALSS VVASASVSAA
     NKGVSTPAVS SAAASSSAVV SSVVSSATSV AASSTISSAT SSSASASPTS SSVSGKRGLA
     WIPGTDLGYS DNFVNKGINW YYNWGSYSSG LSSSFEYVLN QHDANSLSSA SSVFTGGATV
     IGFNEPDLSA AGNPIDAATA ASYYLQYLTP LRESGAIGYL GSPAISNVGE DWLSEFMSAC
     SDCKIDFIAC HWYGIDFSNL QDYINSLANY GLPIWLTEFA CTNWDDSNLP SLDEVKTLMT
     SALGFLDGHG SVERYSWFAP ATELGAGVGN NNALISSSGG LSEVGEIYIS
//

Extradom.txt

This file outlines the nomenclature proposal for domains (or modules) found
mainly in extracellular proteins of higher eukaryotes. It shows the standard
nomenclature applied to these classified domains in SWISS-PROT entries. It
can be found via the Web at http://www.expasy.ch/cgi-bin/lists?extradom.txt.
It is one of numerous documents (all of which are visible from:
http://www.expasy.ch/sprot/sp-docu.html) that are distributed with SWISS-
PROT.


