1. GenBank: The Nucleotide Sequence Database
Ilene Mizrachi
Created: October 9, 2002
Updated: August 22, 2007
Summary
The GenBank sequence database is an annotated collection of all publicly available nucleotide
sequences and their protein translations. This database is produced at the National Center for
Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular
Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) and the DNA
Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences produced in
laboratories throughout the world from more than 100,000 distinct organisms. GenBank continues to
grow at an exponential rate, doubling every 10 months. Release 134, produced in February 2003,
contained over 29.3 billion nucleotide bases in more than 23.0 million sequences. GenBank is built
from direct submissions by individual laboratories and bulk submissions from large-scale
sequencing centers.
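The growth figures above can be turned into a rough back-of-the-envelope calculation. The sketch below is only illustrative: it assumes the 10-month doubling time stays constant, and it treats the Release 134 figures (stated in the text as lower bounds) as exact starting values. The function names are hypothetical and chosen for this example.

```python
# Illustrative arithmetic only: assumes a constant doubling time, which is a
# simplification of the growth described in the text.

RELEASE_134_BASES = 29.3e9      # "over 29.3 billion nucleotide bases"
RELEASE_134_SEQUENCES = 23.0e6  # "more than 23.0 million sequences"
DOUBLING_MONTHS = 10.0          # "doubling every 10 months"


def projected_bases(months_after_release,
                    start_bases=RELEASE_134_BASES,
                    doubling_months=DOUBLING_MONTHS):
    """Project the total base count N(t) = N0 * 2 ** (t / T)."""
    return start_bases * 2 ** (months_after_release / doubling_months)


def mean_sequence_length(total_bases=RELEASE_134_BASES,
                         total_sequences=RELEASE_134_SEQUENCES):
    """Average number of bases per sequence record."""
    return total_bases / total_sequences


if __name__ == "__main__":
    for months in (0, 10, 20, 30):
        print(f"{months:>2} months after Release 134: "
              f"{projected_bases(months):.3e} bases")
    print(f"Mean sequence length: {mean_sequence_length():.0f} bases")
```

Under these assumptions, the database would double in 10 months and quadruple in 20, and the mean record length at Release 134 works out to roughly 1,270 bases.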
Direct submissions are made to GenBank using BankIt [http://www.ncbi.nlm.nih.gov/BankIt/],
which is a Web-based form, or the stand-alone submission program, Sequin
[http://www.ncbi.nlm.nih.gov/Sequin/index.html]. Upon receipt of a sequence submission, the GenBank staff
assigns an Accession number to the sequence and performs quality assurance checks. The
submissions are then released to the public database, where the entries are retrievable by Entrez or
downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence Tagged Site
(STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are
most often submitted by large-scale sequencing centers. The GenBank direct submissions group also
processes complete microbial genome sequences.
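As a hedged illustration of retrieving a released entry by its Accession number, the sketch below uses the Entrez Programming Utilities (E-utilities) EFetch endpoint. This interface is not described in the text above and is an assumption of this example. The helper names are hypothetical, and the accession U49845 is used purely as a sample value.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public E-utilities EFetch endpoint; not named in the text.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build an EFetch URL requesting a GenBank flat file as plain text."""
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # Accession number assigned at submission
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",
    }
    return EFETCH_BASE + "?" + urlencode(params)


def fetch_flatfile(accession, timeout=30):
    """Download the record over HTTP; requires network access."""
    with urlopen(efetch_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only the URL is printed here; call fetch_flatfile() to download.
    print(efetch_url("U49845"))
```

Building the URL separately from the download keeps the request easy to inspect and lets the example run without network access.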
History
Initially, GenBank was built and maintained at Los Alamos National Laboratory (LANL). In the early
1990s, this responsibility was awarded to NCBI through congressional mandate. NCBI undertook
the task of scanning the literature for sequences and manually typing the sequences into the
database. Staff then added annotation to these records, based upon information in the published article.
Scanning sequences from the literature and placing them into GenBank is now a rare occurrence.
Nearly all of the sequences are now deposited directly by the labs that generate the sequences.
This is attributable, in part, to a requirement by most journal publishers that nucleotide sequences
first be deposited into publicly available databases (DDBJ/EMBL/GenBank) so that the Accession
number can be cited and the sequence can be retrieved when the article is published. NCBI began
NCBI Handbook GenBank
accepting direct submissions to GenBank in 1993 and received data from LANL until 1996.
Currently, NCBI receives and processes about 20,000 direct submission sequences per month, in
addition to the approximately 200,000 bulk submissions that are processed automatically.
International Collaboration
In the mid-1990s, the GenBank database became part of the International Nucleotide Sequence
Database Collaboration with the EMBL database (European Bioinformatics Institute [http://www.ebi.ac.uk/],
Hinxton, United Kingdom) and the Genome Sequence Database (GSDB; LANL,