If we begin with an actual nucleic acid sequence, we can quickly perform a BLAST search to identify it. We might begin with a BLASTN (nucleic acid query versus a nucleic acid database). This search could be narrow or broad in scope. The BLASTN should tell us whether our sequence is similar to any other sequences in the database. In many cases, this will also turn up a putative translation to protein and a host of links to informative resources, which might tell us about associated diseases, for instance. If no translation is available, we might use BLASTX (nucleic acid query, translated in all six reading frames, versus a protein database). If we are lucky, the BLASTX search may indicate some or all of the coding regions which are assembled to make a product. We might also BLAST the sequence against the chromosomes of the parent species. It's quite possible that links which turn up from these BLAST searches will lead us to the information known about this sequence, or let us know we have something truly unique. On the other hand, we may discover that our query is "close" to a known sequence but not identical to it. Do the differences reflect our own (sequencing?) errors? Better yet, we may see that it's identical to something for which no annotations are yet available. How can we proceed to assess the new sequence? If we have done a species-specific search, we might look for orthologs, and from there get our understanding underway.
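To make the BLASTX idea a little more concrete, here is a minimal sketch of the translation step only: a nucleic acid query is translated in all six reading frames before anything is compared with a protein database. This is not how NCBI implements it; the function names are made up for illustration, and the standard genetic code is assumed.

```python
# Standard genetic code, built from the compact 64-letter table (codon order T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to translated protein strings."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

A `*` in the output marks a stop codon. A frame with a long stretch free of stops is a hint, nothing more, that the region might be coding.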
NCBI has begun a series of curated databases. These databases contain annotations collated by NCBI for a particular gene and are often referenced in LocusLink (more on LocusLink later). These sequences are Reference Sequences (RefSeq). BLAST and ENTREZ searches can locate such sequences. You can recognize them by their distinctive accession number formats (as compared with GenBank accessions). Here is the RefSeq Accession Key. Note: if you use GCG at MUSC, our GenBank datafiles do NOT contain the RefSeq entries. If you use ENTREZ to locate sequences and then try to fetch them within GCG, you will not be successful, because the RefSeq entries are not in the MUSC GenBank archive at present.
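Since RefSeq entries cannot be fetched from within GCG here, it can save time to check an accession before trying. The sketch below relies on the fact that RefSeq accessions start with two letters and an underscore (NM_, NP_, NC_ and so on), which ordinary GenBank accessions do not. The function names are hypothetical and the prefix table is deliberately partial; the RefSeq Accession Key remains the authority.

```python
import re

# Two upper-case letters, an underscore, optional letters, digits, optional ".version".
REFSEQ_PATTERN = re.compile(r"[A-Z]{2}_[A-Z]*\d+(\.\d+)?")

# Partial table of common prefixes (an illustrative subset, not the full key).
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "NC_": "complete genomic molecule",
    "NT_": "genomic contig",
    "XM_": "predicted (model) mRNA",
    "XP_": "predicted (model) protein",
}


def looks_like_refseq(accession):
    """True if the accession has the RefSeq 'XX_' shape."""
    return REFSEQ_PATTERN.fullmatch(accession.strip()) is not None


def refseq_kind(accession):
    """Describe a RefSeq accession, or return None for non-RefSeq input."""
    accession = accession.strip()
    if not looks_like_refseq(accession):
        return None
    return REFSEQ_PREFIXES.get(accession[:3], "RefSeq (prefix not in this partial table)")


if __name__ == "__main__":
    for acc in ["NM_000546", "NP_000537.3", "U49845"]:
        print(acc, "->", refseq_kind(acc))
```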
Open Reading Frame finding: ORF Finder at NCBI
Screen a sequence for vector sequence contamination: VecScreen at NCBI
Match an mRNA or mRNAs to a genomic contig: SPIDEY at NCBI
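For a feel of what an open reading frame search does, here is a rough sketch that scans both strands, in all three frames, for ATG ... stop stretches. It is not NCBI's ORF Finder algorithm; the function name, the `min_codons` cutoff and the output tuple are assumptions chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(dna, min_codons=30):
    """Return (strand, frame, start, end, n_codons) for each ATG...stop ORF.

    start and end are 0-based, end-exclusive positions on the scanned strand
    (for '-' that is the reverse complement); end includes the stop codon.
    n_codons counts the codons before the stop.
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, seq in strands.items():
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3, n_codons))
                    start = None
    return orfs


if __name__ == "__main__":
    # Tiny toy sequence: CC + ATG AAA TTT TAG + GG
    print(find_orfs("CCATGAAATTTTAGGG", min_codons=2))
```

Keep in mind that in genomic DNA from a eukaryote, exons are interrupted by introns, so a simple ORF scan like this will miss or truncate most real genes. That is one reason tools such as SPIDEY exist.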
All commercial sequence analysis program suites have a host of DNA tools which perform restriction enzyme mapping, codon bias analysis, translation and many additional analyses. If you have access to such software, you should probably make use of it, because an integrated analysis tool kit will be the most time-efficient way for you to examine your sequence. The GCG command line, GCG via SeqLab, GCG via SeqWeb, GCG via W2H, the EMBOSS command line, EMBOSS via W2H, MacVector, VectorNTI Suite, Sequencher, and even old DNAStrider all perform these basic functions. By the way, ALL of the packages from the above screen shots are available here at MUSC. Currently, however, many of these same basic functions are also available from specific web sites (e.g. Baylor College of Medicine Sequence Analysis tools, which need no registration, or Biology WorkBench, where you have to create an account). There is not currently a site as comprehensive as many of these packages. Besides, you STILL have to learn how to use each site/software combination in a handy way.
Basic DNA analysis, reformatting, statistics, CpG analysis and more are available from the Sequence Manipulation Suite (MUSC mirror).
Restriction Enzyme Cut Site Mapping is available from two decent sites: WebCutter and TACG.
Primer design became Web accessible some years ago, courtesy of MIT Whitehead and Steve Rozen; see PRIMER3-MIT, PRIMER3-MUSC.
Transcription Factor Binding Site finding, Promoter Location and UTR analysis (from a BCR Web page) are available here.
Gene Finding tools (also from a BCR Web page) that are not BLASTX
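Of the tools listed above, restriction site mapping is the easiest to picture in code. The sketch below simply searches for the recognition sequences of three familiar enzymes (EcoRI, BamHI, HindIII). It reports where each recognition site starts, not the exact cut position, and makes no attempt to match what WebCutter or TACG do internally. The enzyme table and function name are illustrative.

```python
# Recognition sequences (all three are palindromic, so one strand is enough).
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def map_sites(dna, enzymes=ENZYMES):
    """Return {enzyme: [1-based start positions of each recognition site]}."""
    dna = dna.upper()
    sites = {}
    for name, recognition in enzymes.items():
        positions = []
        pos = dna.find(recognition)
        while pos != -1:
            positions.append(pos + 1)  # convert to 1-based coordinates
            pos = dna.find(recognition, pos + 1)
        sites[name] = positions
    return sites


if __name__ == "__main__":
    print(map_sites("AAGAATTCTTGGATCCGAATTC"))
```

For enzymes with non-palindromic or degenerate recognition sequences you would also need to search the reverse complement and handle ambiguity codes, which this sketch leaves out.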
Nearly every bioinformatics-related web server or site out there has something like a "related sites" link. Here's one of the better-maintained sites: Amos Bairoch's Bioinformatics Links. Here's a site that's part of the SRS suite at EBI in the UK; click on the Tools tab and pick a Nucleic function.
>92403 Unknown for testing
CCCACAGGGGGACCGGCCCTGTGACCCCTCACCGGGGCCGTGGGCCCGAGCCCCGGACTT
CCCTAAGCCGGCAATGACCGCCTGCGCCCGCCGAGCGGGTGGGCTTCCGGACCCCGGGCT
CTGCGGTCCCGCGTGGTGGGCTCCGTCCCTGCCCCGCCTCCCCCGGGCCCTGCGCCGGCT
CCCGCTCCTGCTGCTCCTGCTTCTCCTGCAGCCCCCCGCCCTCTCCGCCGTGTTCACGGT
GGGGGTCCTGGGCCCCTGGGCTTGCGACCCCATCTTCTCTCGGGCTCGCCCGGACCTGGC
CGCCCGCCTGGCCGCCGCCCGCCTGAACCGCGACCCCGGCCTGGCAGGCGGTCCCCGCTT
CGAGGTAGCGCTGCTGCCCGAGCCTTGCCGGACGCCGGGCTCGCTGGGGGCCGTGTCCTC
CGCGCTGGCCCGCGTGTCGGGCCTCGTGTGTCCGGTGATCCCTGCGGCCTGCCGGCCAGC
Here is a section of genomic DNA that's about 7,000 bp. This is a section of some "not quite ready for prime time" sequence, but it's all I'm giving you for now.
Is it coding or junk? If coding, what protein? What species? Are there CpG islands? Are there repeat regions or regions of codon bias?
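As a starting point for the CpG island question, here is a rough sliding-window sketch. It uses the commonly quoted rule of thumb (window of at least 200 bp, GC content of at least 50%, observed/expected CpG ratio of at least 0.6). The thresholds are parameters, the function names are made up, and this does not reproduce the output of any particular web tool.

```python
def cpg_stats(window):
    """Return (gc_fraction, cpg_observed_over_expected) for one DNA window."""
    window = window.upper()
    n = len(window)
    c, g = window.count("C"), window.count("G")
    cpg = window.count("CG")  # CpG dinucleotides on this strand
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def scan_cpg(dna, window=200, step=50, min_gc=0.5, min_obs_exp=0.6):
    """Return (start, end, gc, obs_exp) for windows passing both thresholds.

    Coordinates are 0-based, end-exclusive. A sequence shorter than one
    window is evaluated once as a whole.
    """
    hits = []
    last_start = max(len(dna) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = dna[start:start + window]
        gc, obs_exp = cpg_stats(chunk)
        if gc >= min_gc and obs_exp >= min_obs_exp:
            hits.append((start, start + len(chunk), round(gc, 2), round(obs_exp, 2)))
    return hits


if __name__ == "__main__":
    toy = "AT" * 150 + "CG" * 150 + "AT" * 150
    for hit in scan_cpg(toy):
        print(hit)
```

Overlapping windows that pass would normally be merged into one candidate island. That step is left out here to keep the sketch short.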