GenBank
Manpreet S. Katari

OUTLINE
References:

    Baxevanis and Ouellette, Bioinformatics, Wiley, 1998
    Ch. 2: The GenBank Sequence Database.
GenBank is:
The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research 2000 Jan 1;28(1):15-8).

There are approximately 9,546,000,000 bases in 8,214,000 sequence records as of August 2000.

Year    Base Pairs       Sequences
1982    680,338          606
1990    49,179,285       39,533
1995    384,939,485      555,694
2000    8,604,221,980    7,077,491
To view the growth graph, go here: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
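
The growth in the table can be put into numbers. The following is a minimal Python sketch that estimates the average doubling time of the base-pair count between two rows of the table. It assumes steady exponential growth between the two years, which is only an approximation, and the names BASE_PAIRS and doubling_time_years are illustrative and not part of any NCBI tool.

```python
import math

# Base-pair totals per year, copied from the table above.
BASE_PAIRS = {
    1982: 680_338,
    1990: 49_179_285,
    1995: 384_939_485,
    2000: 8_604_221_980,
}


def doubling_time_years(year_a, year_b, counts=BASE_PAIRS):
    """Average doubling time between two years, assuming exponential growth."""
    ratio = counts[year_b] / counts[year_a]
    return (year_b - year_a) / math.log2(ratio)


if __name__ == "__main__":
    for a, b in [(1982, 1990), (1990, 1995), (1995, 2000)]:
        years = doubling_time_years(a, b)
        print(f"{a}-{b}: doubles about every {years:.2f} years")
```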

    GenBank is part of the International Nucleotide Sequence Database Collaboration (1988):
      DNA DataBank of Japan (DDBJ)
      European Molecular Biology Laboratory (EMBL)
      GenBank at NCBI.

GenBank Flat File Format

GenBank Release Notes:

3 Parts:

    Header
    Features
    Sequence

Header:

    LOCUS - A short mnemonic name for the entry. The line contains the locus name (often the same as the accession number), the length of the molecule, the type of molecule (DNA or RNA), a three-letter GenBank division code (for example, PRI for primate sequences), and the date of the last modification.
    DEFINITION - A concise description of the sequence.
    ACCESSION - The primary accession number is a unique, unchanging code assigned to each entry. It is often used when citing a sequence in journals.
    VERSION - The primary accession number and a numeric version number associated with the current version of the sequence data in the record. This is followed by an integer key (a "GI") assigned to the sequence by NCBI.
    KEYWORDS - Short phrases describing gene products and other information about an entry.
    SOURCE - Common name of the organism or the name most frequently used in the literature.
    ORGANISM - Formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines).
    REFERENCE - Citations for all articles containing data reported in this entry.
    AUTHORS - Lists the authors of the citation.
    TITLE - Full title of citation.
    JOURNAL - Lists the journal name, volume, year, and page numbers of the citation.
    MEDLINE - Provides the Medline unique identifier for a citation.
    PUBMED - Provides the PubMed unique identifier for a citation.
    REMARK - Specifies the relevance of a citation to an entry.
    COMMENT - Cross-references to other sequence entries, comparisons to other collections, notes of changes in LOCUS names, and other remarks.
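
As an illustration of how the header fields can be read by a program, here is a minimal Python sketch that parses the LOCUS and VERSION lines. It assumes the simple nucleotide layout shown in the example record in these notes (name, length, "bp", molecule type, division, date). The function names are illustrative and do not come from any existing library, and other layouts (for example protein records, which use "aa" instead of "bp") are not handled.

```python
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def parse_genbank_date(text):
    """Convert a date such as '08-FEB-1999' into a datetime.date."""
    day, month, year = text.split("-")
    return date(int(year), MONTHS.index(month.upper()) + 1, int(day))


def parse_locus_line(line):
    """Parse a nucleotide LOCUS line into a dictionary of its fields."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    bp_index = tokens.index("bp")      # the length is the token before "bp"
    rest = tokens[bp_index + 1:]       # molecule type, ..., division, date
    return {
        "name": tokens[1],
        "length": int(tokens[bp_index - 1]),
        "molecule": rest[0],
        "division": rest[-2],
        "date": parse_genbank_date(rest[-1]),
    }


def parse_version_line(line):
    """Parse 'VERSION  AF067844.1  GI:4240386' into (accession, version, gi)."""
    tokens = line.split()
    accession, _, version = tokens[1].partition(".")
    gi = int(tokens[2].split(":")[1]) if len(tokens) > 2 else None
    return accession, int(version), gi


if __name__ == "__main__":
    print(parse_locus_line(
        "LOCUS       AF067844   218336 bp    DNA             PRI       08-FEB-1999"))
    print(parse_version_line("VERSION     AF067844.1  GI:4240386"))
```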

Features:

    SOURCE: contains information about organism, mapping, chromosome, tissue alignment, clone identification.
    CDS: the coding sequence. Gives the coordinates of the segments to join, which together translate into the amino acid sequence. Includes cross-references to other databases.
    GENE Feature: a segment of DNA identified by a name.
    RNA Feature: used to annotate RNA on genomic sequence (for example: mRNA, tRNA, rRNA)
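
The way a CDS location tells a program which segments to join can be sketched in a few lines of Python. This is a simplified sketch: it handles only plain forward-strand ranges such as join(a..b,c..d), and not complement(), single-base positions, partial-end markers, or references to other accessions. The function names are illustrative.

```python
import re


def parse_join(location):
    """Return (start, end) pairs, 1-based and inclusive, from a simple location."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(segments):
    """Total number of bases covered by the joined segments."""
    return sum(end - start + 1 for start, end in segments)


def splice(sequence, segments):
    """Concatenate the segments of a sequence (1-based, inclusive coordinates)."""
    return "".join(sequence[start - 1:end] for start, end in segments)


if __name__ == "__main__":
    # CDS location copied from the PTEN example record in these notes.
    pten_cds = ("join(23339..23417,51995..52079,83482..83526,89015..89058,"
                "90987..91225,110086..110227,115821..115987,118862..119086,"
                "123258..123443)")
    segments = parse_join(pten_cds)
    print(len(segments), "segments,", spliced_length(segments), "bases")
```

For the PTEN record the joined segments add up to 1212 bases, that is 404 codons: the 403 amino acids of the /translation qualifier plus a stop codon.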

Sequence:

    The entire sequence.
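
The three-part layout can be recovered from the keywords at the start of each line. Below is a minimal Python sketch that splits one record into header, features, and sequence lines. It assumes the record follows the layout of the example in these notes (FEATURES starts the feature table, BASE COUNT or ORIGIN starts the sequence part, // ends the record); the function name is illustrative.

```python
def split_record(text):
    """Split one GenBank flat-file record into header, features, sequence lines."""
    header, features, sequence = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            section = features
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            section = sequence
        elif line.startswith("//"):
            break                      # end-of-record marker
        section.append(line)
    return header, features, sequence


if __name__ == "__main__":
    # A made-up toy record, used only to show the three parts.
    toy = "\n".join([
        "LOCUS       TOY00001   12 bp    DNA             PRI       01-JAN-2000",
        "DEFINITION  Toy record.",
        "FEATURES             Location/Qualifiers",
        "     source          1..12",
        "ORIGIN",
        "        1 acgtacgtac gt",
        "//",
    ])
    for name, part in zip(("header", "features", "sequence"), split_record(toy)):
        print(name, len(part), "lines")
```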

GenBank Flat File Example

FASTA FORMAT

The simplest format, and one that is widely supported by software designed for molecular biology.

>Description Line
    The Entire Seq ....
    ........
    ........
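
Because the format is only a description line followed by sequence lines, a reader takes just a few lines of code. The following Python sketch is one possible way to do it, not a reference implementation; the function name read_fasta is illustrative.

```python
def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            if description is not None:    # finish the previous record
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:            # finish the last record
        yield description, "".join(chunks)


if __name__ == "__main__":
    demo = [">seq1 demo", "ACGT", "TTAA", "", ">seq2", "GG"]
    for description, sequence in read_fasta(demo):
        print(description, len(sequence))
```

To read from a file, pass the open file object directly, for example read_fasta(open("my.fasta")), where my.fasta is a placeholder file name.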

FASTA Example

ASN.1 (Abstract Syntax Notation 1) Format

A format that is hard to read by eye and is meant mainly for machines.
The entire GenBank database can be downloaded in this format.

ASN.1 Example

GenBank Submission Tools

BankIt: HTML-based submission tool.

Sequin: A stand-alone sequence submission tool that runs on PC, Mac, and Unix.

Anyone can submit any sequence to GenBank.

Archival database vs. Curated Database

    GenBank is an archival database: submitted sequences are not independently checked.
    SWISSPROT is a curated database: submitted proteins are checked against published data.

Basic Local Alignment Search Tool (BLAST)

NCBI's similarity search tool. It calculates the similarity between an input sequence and the sequences in either a protein or a nucleotide database, depending on the search you want to perform.

The program also determines the statistical significance of each match. Since the database keeps growing, the statistical significance of a given match may change over time.
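
The dependence on database size can be made concrete with the usual Karlin-Altschul approximation, E = m * n * 2^(-S'), where m is the query length, n is the total length of the database, and S' is the bit score of the match. This formula is not given in these notes. It is the standard textbook form and it ignores the edge-effect corrections that BLAST applies. The sketch below uses the GenBank totals for 1995 and 2000 from the table earlier in these notes purely as stand-ins for the size of a search database; the bit score and query length are made-up values.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value, E = m * n * 2**(-S'), ignoring edge corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same hypothetical match (bit score 50, 300-base query), two database sizes.
    for year, size in [(1995, 384_939_485), (2000, 8_604_221_980)]:
        print(year, f"{expected_hits(50, 300, size):.2e}")
```

The same alignment scores the same number of bits in both years, but its E-value grows in proportion to the database, so a match becomes less significant as the database grows.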

Some databases that are available to BLAST against:

    Peptide Sequence Databases:
      nr - All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
      swissprot - Last major release of the SWISS-PROT protein sequence database (no updates)
    Nucleotide Sequence Databases:
      nr - All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".
      dbest - Database of GenBank+EMBL+DDBJ sequences from EST Divisions
      htgs - Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2 (finished, phase 3 HTG sequences are in nr)

GenBank Retrieval Methods

Entrez can be used to search the website (explained below).
    -- Entrez also allows batch retrievals.
Tools such as BioPerl can retrieve data without manually visiting the website (explained later by Sean).
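
As a sketch of scripted retrieval, the Python code below builds a request URL for NCBI's E-utilities efetch service and optionally downloads the records. Python and E-utilities are used here in place of the BioPerl approach mentioned above, and this is not how these notes originally did it. The parameter names db, id, rettype, and retmode are the ones efetch documents; the function names are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, rettype="gb", db="nuccore"):
    """Build an efetch URL for one or more accessions (batch retrieval)."""
    params = {
        "db": db,
        "id": ",".join(accessions),   # several IDs are separated by commas
        "rettype": rettype,           # "gb" for flat file, "fasta" for FASTA
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch(accessions, rettype="gb"):
    """Download the records as text. Requires network access."""
    with urlopen(efetch_url(accessions, rettype)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch(["AF067844"]) to actually download.
    print(efetch_url(["AF067844"], rettype="fasta"))
```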


Entrez

Entrez is a retrieval system for searching several linked databases. Databases include:
    PubMed: the biomedical literature.
    Nucleotide sequence database: includes sequences from GenBank, EMBL, and DDBJ.
    Protein sequence database: includes sequences from translated coding regions in the nucleotide database, and proteins submitted to PIR, SWISSPROT, PRF, and the Protein Data Bank (PDB).
    Structure: three-dimensional macromolecular structures. MMDB (Molecular Modeling Database) contains structures determined by crystallography and NMR. The data are obtained from PDB.
    Genome: complete genome assemblies. Provides views for a variety of genomes, complete chromosomes, contig sequence maps, and integrated genetic and physical maps.
    PopSet: population study data sets. Contains sequences submitted from studies of either evolution or population variation.
    Taxonomy: organisms in GenBank
    OMIM: Online Mendelian Inheritance in Man

Examples of File Formats

GenBank

LOCUS       AF067844   218336 bp    DNA             PRI       08-FEB-1999
DEFINITION  Homo sapiens chromosome 10 clone PTEN, complete sequence.
ACCESSION   AF067844
VERSION     AF067844.1  GI:4240386
KEYWORDS    HTG.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 218336)
  AUTHORS   Jensen,K., de la Bastide,M., Parsons,R., Parnell,L.D., Dedhia,N.,
            Gottesman,T., Gnoj,L., Kaplan,N., Lodhi,M., Johnson,A.F.,
            Shohdy,N., Hasegawa,A., Haberman,K., Huang,E.N., Schutz,K.,
            Calma,C., Granat,S., Wigler,M. and McCombie,W.R.
  TITLE     Genomic sequence of PTEN/MMAC1
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 218336)
  AUTHORS   Jensen,K., de la Bastide,M., Parsons,R., Parnell,L.D., Dedhia,N.,
            Gottesman,T., Gnoj,L., Kaplan,N., Lodhi,M., Johnson,A.F.,
            Shohdy,N., Hasegawa,A., Haberman,K., Huang,E.N., Schutz,K.,
            Calma,C., Granat,S., Wigler,M. and McCombie,W.R.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-MAY-1998) Lita Annenberg Hazen Genome Sequencing
            Center, Cold Spring Harbor Laboratory, 1 Bungtown Rd., Cold Spring
            Harbor, NY 11724, USA
FEATURES             Location/Qualifiers
     source          1..218336
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="10"
                     /clone="PTEN"
     source          1..106991
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="10"
                     /clone="BAC 265N13"
     5'UTR           22308..23338
                     /gene="PTEN"
                     /note="5'-UTR defined by comparison to PTEN cDNA U93051"
     mRNA            join(22308..23417,51995..52079,83482..83526,89015..89058,
                     90987..91225,110086..110227,115821..115987,118862..119086,
                     123258..124345)
                     /gene="PTEN"
                     /note="mRNA coordinates delineated by comparison to PTEN
                     cDNA U93051"
     gene            22308..124345
                     /gene="PTEN"
                     /note="the coding region of PTEN, as defined by the cDNA,
                     identifies 9 exons within this region; identical to MMAC1
                     (U92346) and PTEN (U93051)"
                     /evidence=experimental
     exon            22308..23417
                     /gene="PTEN"
                     /function="5'-UTR and initial segment of the CDS"
                     /number=1
     CDS             join(23339..23417,51995..52079,83482..83526,89015..89058,
                     90987..91225,110086..110227,115821..115987,118862..119086,
                     123258..123443)
                     /gene="PTEN"
                     /note="coding regions delineated by comparison to PTEN
                     cDNA"
                     /codon_start=1
                     /product="PTEN"
                     /protein_id="AAD13528.1"
                     /db_xref="GI:4240387"
                     /translation="MTAIIKEIVSRNKRRYQEDGFDLDLTYIYPNIIAMGFPAERLEG
                     VYRNNIDDVVRFLDSKHKNHYKIYNLCAERHYDTAKFNCRVAQYPFEDHNPPQLELIK
                     PFCEDLDQWLSEDDNHVAAIHCKAGKGRTGVMICAYLLHRGKFLKAQEALDFYGEVRT
                     RDKKGVTIPSQRRYVYYYSYLLKNHLDYRPVALLFHKMMFETIPMFSGGTCNPQFVVC
                     QLKVKIYSSNSGPTRREDKFMYFEFPQPLPVCGDIKVEFFHKQNKMLKKDKMFHFWVN
                     TFFIPGPEETSEKVENGSLCDQEIDSICSIERADNDKEYLVLTLTKNDLDKANKDKAN
                     RYFSPNFKVKLYFTKTVEEPSNPEASSSTSVTPDVSDNEPDHYRYSDTTDSDPENEPF
                     DEDQHTQITKV"
     exon            51995..52079
                     /gene="PTEN"
                     /number=2
     source          58169..218336
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="10"
                     /clone="BAC 60C5"
     exon            83482..83526
                     /gene="PTEN"
                     /number=3
     exon            89015..89058
                     /gene="PTEN"
                     /number=4
     exon            90987..91225
                     /gene="PTEN"
                     /number=5
     exon            110086..110227
                     /gene="PTEN"
                     /number=6
     exon            115821..115987
                     /gene="PTEN"
                     /number=7
     exon            118862..119086
                     /gene="PTEN"
                     /number=8
     exon            123258..124345
                     /gene="PTEN"
                     /function="terminal segment of the CDS and 3'-UTR"
                     /number=9
     3'UTR           123444..124345
                     /gene="PTEN"
                     /note="3'-UTR defined by comparison to PTEN cDNA U93051"
                     /evidence=experimental
BASE COUNT    64194 a  39437 c  43295 g  71406 t      4 others
ORIGIN      
        1 caagctttac actagagcct atatgaagtt ttgattctaa gtgttaatgt accttctgac
       61 aactgtgaaa tgaaccttgt tcctggggag cgcgttctgg ttttctcttt gcacagttaa
      121 gctgagacta gcatcattct agtttgcagg tgacattctc tgggaagcta gtctatgggg
      181 gagatgacat cttctgaacc tagtccccac agagaacttt gaatgagtgg aatcaagagg
      241 ttgcctgcat tcttgctcat gtcacaatgc tggacatgtg acttcagaga agcatgtgcc
      301 aggtcaatat gattgggctg ttctcacaat acaaggcctt gaccatagag tgattcagag
      361 gcaaatgcag ccttcttaga ctcttaacca aaacattggc atgacataaa attataatta
      421 ataaaagata tacagttatt tcaaaagtac cgttttattg ggacatctca aaggactaag
      481 aaaatgttta ttttcttatc tcctatcttt tgttaatagc tgttcatcgc tcatcagcct
      541 ttactgaaag cttatcatgt atcaaacaat atgccaggtg tcagagaggg cagcaaagag
      601 agtacaattg agttagatag agtacctgca ctcaataata ataacagcta acacttacat
      661 agtgctttct gcgtgccagg cttgtcctaa gtgattttac acacacacac acacacacac
      721 acacacacac acacacactc cctcactcag tccttataaa aacccactga taggccgggt
      781 gcggtggctc atacctgtaa tcccagcaac tttgggaggc tgaagcaggc agatcacttg
      841 aggtcaggag ttcgagatca ccctggccaa catggtgaaa cctcatctct actaaaaata
      901 caaaaattaa ccaagcatgg tggcaggtgc ctgtaatcct agctactcaa gaggctgaga
      961 caggaaaatc acttgaacct ggtaggtgga tgttgcagtg tgccgagatc gtgccaccac
     1021 actccagcct gagcaacaga gtgagactct atctaaaaaa aaaaaaaaaa aaattaaaaa
     
   217861 ggatacggtg gtgtaaaagg caaaacatat acctgatttc atggaactca cattctaggg
   217921 gtggtttgtg tatatatgag aacagtaact agaaaaaaat aatgaacaag gtattttatg
   217981 taacgataag agctatgaag aaaatcagac atgacgattt tcagctagag ctacccaaag
   218041 catgatcttt gagtcaacaa caacatatga gcaatcagtt tgttaaaaat gcagaatctc
   218101 agaagacggc ctagacctac tgattcagaa tcatcattgt aacaggatcc ccttgtcatt
   218161 tctttgcatg ctaatgtttg agaagcactg agctagacag tgggaaatgg aaggtttctc
   218221 tgcctaggtg acatctgagc tgagacttga atgaagaaaa gctgtccatg taaagatctg
   218281 ggagcagaag gatccaggca gaggaaatgg aaagtacaag gggctggatg agagaa
//
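
The sequence part of the record above can be checked and cleaned up with a short script. This Python sketch parses the BASE COUNT line and strips the coordinates and spaces from ORIGIN lines. It assumes the layout shown in the example above, and the function names are illustrative.

```python
def parse_base_count(line):
    """Parse 'BASE COUNT  64194 a  39437 c ...' into {'a': 64194, ...}."""
    tokens = line.split()[2:]              # drop the words BASE and COUNT
    return {label: int(number)
            for number, label in zip(tokens[0::2], tokens[1::2])}


def origin_to_sequence(lines):
    """Join ORIGIN sequence lines, dropping the leading coordinate and spaces."""
    return "".join("".join(line.split()[1:]) for line in lines if line.strip())


if __name__ == "__main__":
    counts = parse_base_count(
        "BASE COUNT    64194 a  39437 c  43295 g  71406 t      4 others")
    # The counts should add up to the length given on the LOCUS line.
    print(sum(counts.values()) == 218336)
    print(origin_to_sequence(["        1 caagctttac actagagcct",
                              "       61 aactgtgaaa"]))
```

For this record the five counts add up to 218336, the length given on the LOCUS line.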

FASTA

>gi|4240386|gb|AF067844.1|AF067844 Homo sapiens chromosome 10 clone PTEN, complete sequence
CAAGCTTTACACTAGAGCCTATATGAAGTTTTGATTCTAAGTGTTAATGTACCTTCTGACAACTGTGAAA
TGAACCTTGTTCCTGGGGAGCGCGTTCTGGTTTTCTCTTTGCACAGTTAAGCTGAGACTAGCATCATTCT
AGTTTGCAGGTGACATTCTCTGGGAAGCTAGTCTATGGGGGAGATGACATCTTCTGAACCTAGTCCCCAC
AGAGAACTTTGAATGAGTGGAATCAAGAGGTTGCCTGCATTCTTGCTCATGTCACAATGCTGGACATGTG
ACTTCAGAGAAGCATGTGCCAGGTCAATATGATTGGGCTGTTCTCACAATACAAGGCCTTGACCATAGAG
TGATTCAGAGGCAAATGCAGCCTTCTTAGACTCTTAACCAAAACATTGGCATGACATAAAATTATAATTA
ATAAAAGATATACAGTTATTTCAAAAGTACCGTTTTATTGGGACATCTCAAAGGACTAAGAAAATGTTTA
TTTTCTTATCTCCTATCTTTTGTTAATAGCTGTTCATCGCTCATCAGCCTTTACTGAAAGCTTATCATGT
ATCAAACAATATGCCAGGTGTCAGAGAGGGCAGCAAAGAGAGTACAATTGAGTTAGATAGAGTACCTGCA
CTCAATAATAATAACAGCTAACACTTACATAGTGCTTTCTGCGTGCCAGGCTTGTCCTAAGTGATTTTAC
ACACACACACACACACACACACACACACACACACACACTCCCTCACTCAGTCCTTATAAAAACCCACTGA
TAGGCCGGGTGCGGTGGCTCATACCTGTAATCCCAGCAACTTTGGGAGGCTGAAGCAGGCAGATCACTTG
AGGTCAGGAGTTCGAGATCACCCTGGCCAACATGGTGAAACCTCATCTCTACTAAAAATACAAAAATTAA
CCAAGCATGGTGGCAGGTGCCTGTAATCCTAGCTACTCAAGAGGCTGAGACAGGAAAATCACTTGAACCT
GGTAGGTGGATGTTGCAGTGTGCCGAGATCGTGCCACCACACTCCAGCCTGAGCAACAGAGTGAGACTCT
ATCTAAAAAAAAAAAAAAAAAAATTAAAAACCCAATGAGGTGGCTACTGTTATCATCCCCATTTTACGGA
TGAGGACATGGGTACATAGAGATTAAGTAACTTGCCAAAGATCTCACAACTGGTAAGTGGCAGAGCAAAA
TTTGAAAACAAACAATCTGGTTCCAGAAACTGTACTTTTAACCTCATGATAGCTTCCTGAGGAATTTATG
ATCTGAGTATATATAGTAAGTACCTCCCCTTTCAGGGTAAGGCAGTAGGTAATGGTGAACAGGGAAGCAA
AAGGTGACTCAGGTTGAGTAAACAACACCAAGCATATCTGACTCAAGGAATGCTTCAGAGGCCAGGGGTG
CATGCCTGTAATCCCAGCACCTTGGAAGGCTGACACAGGAGGATCACTGGAGCCCAAGTTCAAGACCAGC
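
The description line of the FASTA example above packs several identifiers into one string separated by "|" characters. A minimal Python sketch for pulling them apart follows. It assumes exactly the five-field layout seen above (gi, GI number, database tag, accession.version, locus name) and rejects anything else; the function name is illustrative.

```python
def parse_defline(defline):
    """Split a 'gi|...|gb|ACC.V|LOCUS description' line into its parts."""
    ids, _, description = defline.lstrip(">").partition(" ")
    fields = ids.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("expected gi|<gi>|<db>|<accession.version>|<locus>")
    accession, _, version = fields[3].partition(".")
    return {
        "gi": int(fields[1]),
        "database": fields[2],         # "gb" marks a GenBank entry
        "accession": accession,
        "version": int(version),
        "locus": fields[4],
        "description": description,
    }


if __name__ == "__main__":
    print(parse_defline(
        ">gi|4240386|gb|AF067844.1|AF067844 Homo sapiens chromosome 10 "
        "clone PTEN, complete sequence"))
```

The accession, version, and GI number recovered this way match the ACCESSION and VERSION lines of the flat file for the same record.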

ASN.1

Seq-entry ::= set {
  class nuc-prot ,
  descr {
    source {
      genome genomic ,
      org {
        taxname "Homo sapiens" ,
        common "human" ,
        db {
          {
            db "taxon" ,
            tag
              id 9606 } } ,
        orgname {
          name
            binomial {
              genus "Homo" ,
              species "sapiens" } ,
          lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
 Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo" ,
          gcode 1 ,
          mgcode 2 ,
          div "PRI" } } ,
      subtype {
        {
          subtype chromosome ,
          name "10" } ,
        {
          subtype clone ,
          name "PTEN" } } } ,
    pub {
      pub {
        sub {
          authors {
            names
              std {
                {
                  name
                    name {
                      last "Jensen" ,
                      first "Kendall" ,
                      initials "K." } } ,
                {
                  name
                    name {
                      last "de la Bastide" ,
                      first "Melissa" ,
                      initials "M." } } ,
                {
                  name
                    name {
                      last "Parsons" ,
                      first "Ramon" ,
                      initials "R." } } ,
                {
                  name
                    name {
                      last "Parnell" ,
                      first "Laurence" ,
                      initials "L.D." } } ,
                {
                  name
                    name {
                      last "Dedhia" ,
                      first "Neilay" ,
                      initials "N." } } ,
                {
                  name
                    name {
                      last "Gottesman" ,
                      first "Tina" ,
                      initials "T." } } ,
                {
                  name
                    name {
                      last "Gnoj" ,
                      first "Lidia" ,
                      initials "L." } } ,
                {
                  name
                    name {
                      last "Kaplan" ,
                      first "Nancy" ,
                      initials "N." } } ,
                {
                  name
                    name {
                      last "Lodhi" ,
                      first "Muhammad" ,
                      initials "M." } } ,
                {
                  name
                    name {
                      last "Johnson" ,
                      first "Arthur" ,
                      initials "A.F." } } ,
                {
                  name
                    name {
                      last "Shohdy" ,
                      first "Nadim" ,
                      initials "N." } } ,
                {
                  name
                    name {
                      last "Hasegawa" ,
                      first "Amy" ,
                      initials "A." } } ,
                {
                  name
                    name {
                      last "Haberman" ,
                      first "Kristina" ,
                      initials "K." } } ,
                {
                  name
                    name {
                      last "Huang" ,
                      first "Emily" ,
                      initials "E.N." } } ,
                {
                  name
                    name {
                      last "Schutz" ,
                      first "Kristin" ,
                      initials "K." } } ,
                {
                  name
                    name {
                      last "Calma" ,
                      first "Christopher" ,
                      initials "C." } } ,
                {
                  name
                    name {
                      last "Granat" ,
                      first "Susan" ,
                      initials "S." } } ,
                {
                  name
                    name {
                      last "Wigler" ,
                      first "Michael" ,
                      initials "M." } } ,
                {
                  name
                    name {
                      last "McCombie" ,
                      first "W Richard" ,
                      initials "W.R." } } } ,
            affil
              std {
                affil "Cold Spring Harbor Laboratory" ,
                div "Lita Annenberg Hazen Genome Sequencing Center" ,
                city "Cold Spring Harbor" ,
                sub "NY" ,
                country "USA" ,
                street "1 Bungtown Rd." ,
                postal-code "11724" } } ,
          medium other ,
          date
            std {
              year 1998 ,
              month 5 ,
              day 18 } } } } ,
    pub {
      pub {
        gen {
          cit "unpublished" ,
          authors {
            names
              std {
                {
                  name
                    name {
                      last "Jensen" ,
                      first "Kendall" ,
                      initials "K." } } ,
                {
                  name
                    name {
                      last "de la Bastide" ,
                      first "Melissa" ,
                      initials "M." } } ,
                {
                  name
                    name {
                      last "Parsons" ,
                      first "Ramon" ,
                      initials "R." } } ,
                {
                  name
                    name {
                      last "Parnell" ,
                      first "Laurence" ,
                      initials "L.D." } } ,
                {
                  name
                    name {
                      last "Dedhia" ,
                      first "Neilay" ,
                      initials "N." } } ,
                {
                  name
                    name {
                      last "Gottesman" ,
                      first "Tina" ,
                      initials "T." } } ,
                {
                  name
                    name {
                      last "Gnoj" ,
                      first "Lidia" ,
                      initials "L." } } ,
                {
                  name
                    name {
                      last "Kaplan" ,
                      first "Nancy" ,
                      initials "N." } } ,
                {
                  name
                    name {
                      last "Lodhi" ,
                      first "Muhammad" ,
                      initials "M." } } ,
                {
                  name
                    name {
                      last "Johnson" ,
                      first "Arthur" ,
                      initials "A.F." } } ,
                {
                  name
                    name {
                      last "Shohdy" ,
                      first "Nadim" ,
                      initials "N." } } ,
                {
                  name
                    name {
                      last "Hasegawa" ,
                      first "Amy" ,
                      initials "A." } } ,
                {
                  name
                    name {
                      last "Haberman" ,
                      first "Kristina" ,
                      initials "K." } } ,
                {
                  name
                    name {
                      last "Huang" ,
                      first "Emily" ,
                      initials "E.N." } } ,
                {
                  name
                    name {
                      last "Schutz" ,
                      first "Kristin" ,
                      initials "K." } } ,
                {
                  name
                    name {
                      last "Calma" ,
                      first "Christopher" ,
                      initials "C." } } ,
                {
                  name
                    name {
                      last "Granat" ,
                      first "Susan" ,
                      initials "S." } } ,
                {
                  name
                    name {
                      last "Wigler" ,
                      first "Michael" ,
                      initials "M." } } ,
                {
                  name
                    name {
                      last "McCombie" ,
                      first "W Richard" ,
                      initials "W.R." } } } ,
            affil
              std {
                affil "Cold Spring Harbor Laboratory" ,
                div "Lita Annenberg Hazen Genome Sequencing Center" ,
                city "Cold Spring Harbor" ,
                sub "NY" ,
                country "USA" ,
                street "1 Bungtown Rd" ,
                postal-code "11724" } } ,
          title "Genomic sequence of PTEN/MMAC1" } } } ,
    update-date
      std {
        year 1998 ,
        month 6 ,
        day 18 } ,
    create-date
      std {
        year 1999 ,
        month 2 ,
        day 8 } } ,
  seq-set {
    seq {
      id {
        local
          str "HsPTEN.genomic" ,
        genbank {
          name "AF067844" ,
          accession "AF067844" ,
          version 1 } ,
        gi 4240386 } ,
      descr {
        molinfo {
          biomol genomic ,
          tech htgs-3 ,
          completeness complete } } ,
      inst {
        repr raw ,
        mol dna ,
        length 218336 ,
        seq-data