Web-based tools for Bioinformatics; A (free) introduction to (freely available)
NCBI, MUSC and World-wide Bioinformatics Resources.
When and Where---Wednesdays at 1pm Room 438 Library Admin Building Beginning
September 10.
Overview September 10
We are going to look at the National Center for Biotechnology Information (NCBI)
main page to review the scope of the NCBI. Here is the NCBI
site map. NCBI maintains the GenBank archive and coordinates daily with its
DDBJ and EMBL counterparts to exchange newly submitted data. NCBI has information
hubs which function as archives only, but it has also developed a growing list of curated
databases whose aim is enhanced and more precise retrieval of bioinformation.
The growth in database size is worth reexamining from time
to time. Here's one page that will link to most statistical
summaries. Current notes about the GenBank release may be found at
this site.
There are a number of on-line
materials and tutorials which NCBI has prepared to help you orient
yourselves in this Bioinformation Age. It's a good idea to review
this Science primer
for a sense of how this bioinformatics world is organized. There is even
a "field guide"
to the NCBI resources. Check here for an NCBI problem
set to work through. There are a number of books now on-line at NCBI On-line
Books. The European Bioinformatics Institute also offers an excellent site for
bootstrapping your knowledge of bioinformatics. Here is the EBI
2can site. 2can has explanations and tutorials and much more.
Introduction/Scope
The central software of the NCBI is the ENTREZ
system. Here is an on-line help document for ENTREZ
and a specific ENTREZ
tutorial.
The concept of "neighbors" is one of the most valuable features of
the ENTREZ system. Neighbors are provided via the "Related"
records link in Entrez search results. They are precomputed by NCBI
using a method specific to each database. The "neighboring methods" are listed
below. The value added by this feature is that you can often locate information
that otherwise might not appear in a straightforward search. ENTREZ serves as
a central information hub for the following databases.
The basic ENTREZ search is simply to type in a phrase and see what happens.
You might get lucky and retrieve related resources from which you can
relatively quickly get your answer. There are, however, much more sophisticated
methods of searching. The Limits, Preview/Index and Cubby links can help you
construct more sophisticated searches. The History mode allows you to combine
searches. The Details mode allows you to view how the system has interpreted
your query; this is a handy way to learn the syntax for more sophisticated query
building. Cubby in particular is useful if you want to scan ENTREZ regularly
for new examples of the same types of references.
The ENTREZ page at NCBI shows the main integrated databases of the Resource.
- PubMed: biomedical literature
Is this better than OVID? Yes, No, Maybe. Try both on the same searches. Better
still, ask a Reference Librarian if you want professional assistance. How does
it compare to SciFinderScholar? If you seek CAS registry numbers and/or structures of drugs
and the literature is of secondary importance, use SciFinderScholar. If you want to perform
experiments and you need literature references and sequence information, use
ENTREZ. They each have features of great utility. They all supply reference
lists suitable for most commercial reference managers such as ENDNOTE. ENTREZ
is by far the better choice for any sort of sequence-based search and retrieval. Here is
a concise comparison of OVID vs PubMed vs SciFinderScholar.
- Nucleotide: sequence database (GenBank)
- Protein: sequence database
- Structure: three-dimensional macromolecular structures
- Genome: complete genome assemblies
- Books: BookShelf online books
- Domains: conserved domains (CDD)
- 3D Domains: domains from Entrez Structure
- GEO: Gene Expression Omnibus
- GEO Datasets: curated GEO data sets
- Journals: journals in Entrez
- MeSH: medical subject headings
- NCBI Web Site: NCBI Web site search
- OMIM: Online Mendelian Inheritance in Man
- PMC: full-text digital archive of life sciences journal literature
- PopSet: population study datasets
- SNP: single nucleotide polymorphisms
- Taxonomy: organisms in GenBank
- UniGene: gene-oriented clusters of transcript sequences
- UniSTS: markers and mapping data
Practical Use of ENTREZ
In this worked example we will start with the general search in the PubMed
database for the literature referring to "caspase". We will then focus the search
using various means and finally use the system to expand a search. The use of
these tricks should apply to any of the databases of the ENTREZ system. In subsequent
weeks we will use the other databases. Today we will stick mostly with PubMed.
Open the ENTREZ
PubMed link and type the word "caspase".

On September 4, 2003 at 12:45pm an OVID search retrieved 10,583 references, ENTREZ
PubMed 11,349 (up from 11,182 on Aug 27), and SciFinder Scholar 23,305 containing the
term and 23,344 with the "concept".
Limit the caspase search by requiring that caspase appear in the title (not
just anywhere in the searchable text), e.g. "caspase[titl]".

This reduces the number of hits to 3000 or so.
It may prove interesting to see a concise view of the search progress. The
"Preview/Index"
menu allows you to add terms in simple or more complex ways to the current query.
The resultant
line shows you the form of the query and the number of resulting hits.
Next let's use the boolean operator "AND" to require that the word
"apoptosis" also appear in our search, e.g. "caspase[titl] AND
apoptosis".
This drops the number of hits further, but only to around 2700 articles.
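The fielded and boolean queries above can also be assembled programmatically. The following is a minimal sketch, not an NCBI tool: the helper names `tagged` and `combine` are hypothetical, and the sketch uses the spelled-out `[Title]` field tag as an assumption. Use the Details view to confirm which tag the system actually applies to your query.

```python
from typing import Optional


def tagged(term: str, field: Optional[str] = None) -> str:
    """Return an Entrez search term, optionally restricted to one field."""
    term = term.strip()
    # Quote multi-word phrases so they are searched as a unit.
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]" if field else term


def combine(operator: str, *terms: str) -> str:
    """Join terms with an uppercase boolean operator (AND, OR, NOT)."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {op} ".join(terms)


query = combine("AND", tagged("caspase", "Title"), tagged("apoptosis"))
print(query)  # caspase[Title] AND apoptosis
```

The resulting string can be pasted directly into the ENTREZ search box.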
Now let's amend the search and try a more drastic limiting procedure. Let's
search for caspase in the title together with the word ischemia. Let's use the "Limits"
menu to restrict the search to articles referring to humans, and then limit by date as well
to articles published Jan 1, 2003 to Aug 27, 2003.

This results in just one publication. So, by judicious use of various limiting
features we have reduced the initial 11000 plus references to just one. We have
severely narrowed the search.
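For those who prefer scripting to menus, the same limited search can be expressed as a request to NCBI's E-utilities. This is a sketch under stated assumptions: the endpoint URL and the `datetype`, `mindate` and `maxdate` parameters come from the E-utilities service, not from this handout, and expressing the human limit as `humans[MeSH Terms]` is assumed to match what the Limits menu does. The code only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities endpoint; it is not named in this handout.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, mindate=None, maxdate=None, db="pubmed"):
    """Build an ESearch URL, optionally limited to a publication-date range."""
    params = {"db": db, "term": term}
    if mindate and maxdate:
        # datetype=pdat restricts by publication date; dates use YYYY/MM/DD.
        params.update({"datetype": "pdat", "mindate": mindate, "maxdate": maxdate})
    return ESEARCH + "?" + urlencode(params)


# Title word, free-text word, and the human limit expressed as a MeSH term.
term = "caspase[Title] AND ischemia AND humans[MeSH Terms]"
url = esearch_url(term, mindate="2003/01/01", maxdate="2003/08/27")
print(url)
```

Opening the printed URL in a browser should return an XML list of matching record IDs. The count will differ from the one quoted above because PubMed has grown since 2003.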
Note that by selecting the "DETAILS" menu we can see (and edit) the actual
interpretation of the query terms as well as see a summary of the results.
By selecting the "HISTORY" menu you can see previous searches,
each indicated by a "#" sign and a number. You can then "combine" searches in
the history listing by typing, say, "#2 AND #6" or "#1 NOT #2".
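To see what a combined history expression stands for, it can help to expand the "#" references by hand. The sketch below does this locally; it does not talk to ENTREZ, and the numbered searches in `history` are hypothetical stand-ins for whatever your own History page lists.

```python
import re


def expand_history(expression, history):
    """Replace each #n reference with the parenthesised query it stands for."""
    def substitute(match):
        number = int(match.group(1))
        if number not in history:
            raise KeyError(f"no search #{number} in history")
        return f"({history[number]})"

    return re.sub(r"#(\d+)", substitute, expression)


# Hypothetical history: search number -> query as typed.
history = {
    1: "caspase",
    2: "caspase[Title]",
    6: "apoptosis",
}

print(expand_history("#2 AND #6", history))  # (caspase[Title]) AND (apoptosis)
print(expand_history("#1 NOT #2", history))  # (caspase) NOT (caspase[Title])
```

The second expression, for example, asks for articles that mention caspase somewhere but not in the title.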
Now we can employ the useful concept of neighboring. This will find us similar
articles. We are now broadening the search. Click on the "Related Articles"
link to the far right of the citation.
This finds us 95 articles.
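The "Related Articles" link has a scripted counterpart in the E-utilities ELink service. The following is a rough sketch, assuming the public ELink endpoint and its `dbfrom`, `db`, `id` and `cmd=neighbor` parameters, none of which appear in this handout. The PubMed ID is a made-up placeholder, not the article found above, and the code only builds URLs.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities endpoint; it is not named in this handout.
ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def neighbor_url(uid, dbfrom="pubmed", db="pubmed"):
    """Build an ELink URL asking for precomputed neighbors of one record."""
    params = {"dbfrom": dbfrom, "db": db, "id": uid, "cmd": "neighbor"}
    return ELINK + "?" + urlencode(params)


EXAMPLE_PMID = "12345678"  # hypothetical placeholder ID

# Neighbors within PubMed: the equivalent of "Related Articles".
related = neighbor_url(EXAMPLE_PMID)

# Changing the target database follows cross-database links instead,
# for example from a citation to its protein sequence records.
to_protein = neighbor_url(EXAMPLE_PMID, db="protein")

print(related)
print(to_protein)
```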
Looking through the list we see regular "text" icons and a bit further
down, another icon.

Clicking on the J Biol Chem article we can see the abstract and a link to
the full text of the article.
If we next click on the "Links" item for this single entry we can
see a protein entry. Clicking that in turn brings us to a protein sequence entry
for a "scramblase". From there we can enter the sequence and possibly
the structure databases.
These exercises serve to illustrate the basic features of the ENTREZ system.
You can start with a rather general word and locate many articles in PubMed which
in one way or another refer to the word. This hit list can be extensively reduced
by carefully applying limits. From any single reference, we can
use the "neighboring" functions to retrieve related articles, thereby
expanding a given search. In some circumstances, these references may have links
to sequence and/or structural information. This is why the BCR page places the
ENTREZ link under the "Relational Databases" link. That is the tremendous
power of the ENTREZ system: it combines neighboring and relational features and
allows the searcher to customize searches to locate articles.
Note two things. Not all journals are indexed by PubMed. PubMed is not instantaneous.
Some articles are indexed quickly. Others take a while. Watch the National Media
(NPR/PBS, NYTimes etc) for references to new research findings and publications
and then watch to see how long it takes PubMed to list the citation.
As researchers, you will find that publication can come a bit late in the
game. Attendance at national/international meetings and direct communication
can still keep you more or less up-to-date in a given field. These Internet
tools are superb for helping you get up to speed on a complex research topic.
Sample Questions/Data
Try some of the same searches with other databases, e.g. the Nucleotide or Protein divisions of ENTREZ. You
will find that the text searching concepts can be applied to these sequence databases as well. The Limits menu options
will, however, be more database specific. Be sure to examine the Links for any related
entries from the other databases. Note as well that both sequences and literature
hits can be displayed in various formats. The retrieval format you choose will depend on the end use of the information.
Programs such as ENDNOTE require the "MEDLARS" format of a literature reference, whereas many sequence analysis programs
look for the "FASTA" format. In other circumstances you might prefer "XML" format. Keep your end use in mind.
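The idea of choosing a format by end use can be sketched with the E-utilities EFetch service. The endpoint and the `rettype` and `retmode` values below are assumptions about that service, not something this handout specifies. In particular, EFetch names the tagged literature format `medline`. The record ID is a made-up placeholder and the end-use labels are purely illustrative.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities endpoint; it is not named in this handout.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Illustrative mapping from end use to EFetch format parameters.
FORMAT_FOR_END_USE = {
    "reference manager": {"rettype": "medline", "retmode": "text"},
    "sequence analysis": {"rettype": "fasta", "retmode": "text"},
    "xml processing": {"retmode": "xml"},
}


def efetch_url(db, uid, end_use):
    """Build an EFetch URL that returns one record in the format the end use needs."""
    params = {"db": db, "id": uid}
    params.update(FORMAT_FOR_END_USE[end_use])
    return EFETCH + "?" + urlencode(params)


EXAMPLE_ID = "12345678"  # hypothetical placeholder ID

print(efetch_url("pubmed", EXAMPLE_ID, "reference manager"))
print(efetch_url("protein", EXAMPLE_ID, "sequence analysis"))
print(efetch_url("pubmed", EXAMPLE_ID, "xml processing"))
```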
Examples of how to keep up:
Subscribe to a newsletter such as HumMolGen.
Another option is to subscribe to Science Magazine's Editor's Choice e-mail service.
Harvest Sequence Information@EMBL
Here's the complete link to the Bioinformatics Harvester
Cancer susceptibility OGG1
NCBI and ENTREZ do not have all the answers. Here's an example.
Suppose you wish to identify potential transcription factor binding sites in
your DNA sequence. OK, bioinformatics sleuth, what would YOU do? Here are some
suggestions.
- NCBI Site Map;
Use your browser's EDIT/find menu to search---Nope they do not help.
- BCR Promoter and Transcription Factor Links
- Amos' WWW links page:
Use edit/find in page; find hits such as TRADAT, AraC-XylS, TFII, TRANSFAC
- Biocatalog by EBI: browse
"DNA" category, "Sequence Analysis" subcategory, site search or search
via SRS server; find hits such as SIGNAL SCAN, Matrix Search, Pol3Scan,
POLYAH/TSS/TSSG, Promoter Scan, MatInd/MatInspector, ConsInspector, tRNAscan-SE;
also try searching for "transcription" in the Biocatalog via SRS I
- INFOBIOGEN's DBCAT:
also has "DNA" category; edit/find in page through that category; find hit
for TRANSFAC; also edit/find in page for "Protein" category; find hit for
HOX DB
- Nucleic Acids Research 2003 Database issue: a search for "transcription" finds
>40 hits for various databases whose articles contain the word "transcription";
some of those databases might be associated with search software; read the articles
to find out
- OR NAR database compilation
article: a search of the summary articles finds many of the databases retrieved
by a search of the 2003 Database issue, and also retrieves additional databases
and software such as TRANSFAC (database associated with various software programs)
and TESS (web-based service that searches DNA sequence for transcription factor
binding sites)
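To get a feel for what such site-searching services do at the very simplest level, here is a toy sketch that scans a DNA string for a consensus pattern written in IUPAC codes. It is not how any of the tools above work internally; dedicated tools generally go well beyond matching a single string. The consensus `TATAWA` and the test sequence are made up for illustration, and only the forward strand is scanned.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(sequence, consensus):
    """Return (1-based position, matched text) for every consensus match."""
    pattern = "".join(IUPAC[base] for base in consensus.upper())
    # A lookahead lets overlapping matches be reported as well.
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


dna = "GGCTATAAAAGGCGCGTATATAAGC"   # made-up sequence
hits = find_sites(dna, "TATAWA")    # illustrative TATA-like consensus
print(hits)  # [(4, 'TATAAA'), (17, 'TATATA')]
```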
From a MolGen Alert you extract the following quote: "the lipid leukotriene B4 (LTB4)
is a major mediator of the migration of 'effector' T cells." What's a leukotriene?