Background: Computers have become an important tool for biologists. In addition to general tasks such as word processing, data analysis and figure preparation, a variety of programs have been developed for specific biological applications such as DNA sequence analysis and protein structure prediction. In particular, computers have been essential for storing and making sense of the vast number of DNA and protein sequences acquired by the various genome projects. This has given rise to the field of bioinformatics: the science of making sense of biological information. Bioinformatics combines computer science and biology in order to understand the biological significance of a variety of data. It is one of the fastest-growing fields in science, with applications ranging from designing new drugs to comparing entire genomes. One reason for this rapid growth is the wide variety of resources for biologists that have become available on the Internet. These include databases where biological information is stored for public access, online applications for analyzing many different kinds of data, electronic books and journals, sites where programs may be downloaded, tutorials and short courses in various subjects, "phone books" for finding biologists with specific interests, and many other things.

 

Today we will learn about some of the bioinformatics resources that we will use during this course. We will start by looking at some databases of biological information and ways to search them. GenBank is the most widely used bioinformatics database because it contains all the DNA sequences that have been determined by any publicly funded genome project. GenBank is a collaborative effort between the NCBI (National Center for Biotechnology Information) at NIH, DDBJ (DNA Data Bank of Japan) and EMBL (European Molecular Biology Laboratory). All three exchange information daily, so a sequence submitted to any one may be accessed from all three. This is important because NCBI is frequently difficult to access due to heavy usage, but it is rare for all three sites to be jammed at once. NCBI hosts many databases: in addition to GenBank, it has databases of protein sequences, protein structures, biomedical literature, genetic disorders and many other sorts of information. All of these databases may be queried using a search and retrieval system called Entrez.
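Entrez can also be queried from a script through NCBI's E-utilities web interface. The sketch below only builds the query URLs and does not contact NCBI. The function names are made up for illustration, the search term is just an example, and NM_102124.2 is the accession used later in the procedure.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore", retmax=20):
    """Build a URL that searches an Entrez database for a text term.

    'nuccore' is the Entrez name of the nucleotide database.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build a URL that retrieves one record by accession as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Example search term (an assumption, not part of the exercise).
    print(esearch_url("gigantea AND Arabidopsis thaliana[Organism]"))
    print(efetch_url("NM_102124.2"))
```

Pasting either printed URL into a browser should return the same information you would reach by clicking through the NCBI pages, which is handy when you have many genes to look up.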

We will start by looking at ways to search GenBank for specific genes. We will look up the sequence of one of the genes that Ni et al., 2009 (Nature 457: 327-331) identified as contributing to heterosis by regulating circadian rhythms. We will first find the sequence of the Arabidopsis gene, then use it to see whether rice has any homologous genes using programs called BLAST and FASTA. This is a very commonly used procedure, because as new sequence is obtained during any genome project, you want to find out whether it is related to any sequence of known function. In fact, most genes identified by the human genome project are of unknown function, and even those of known function have mainly been identified by their resemblance to known genes from other organisms. Therefore, many programs have been developed for aligning sequences and determining how closely they are related based on this alignment.
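To make "how closely they are related" concrete, here is a minimal sketch that scores two sequences that have already been aligned (same length, with "-" marking gaps). It is not how BLAST or FASTA work internally; it only counts matches, mismatches and gaps. The function name is made up, and computing percent identity over the ungapped columns only is an assumption, since programs differ on how they treat gaps.

```python
def compare_aligned(seq_a, seq_b, gap="-"):
    """Count matches, mismatches and gaps between two aligned sequences.

    Both sequences must come from the same alignment, so they must have
    the same length. Percent identity is computed over columns where
    neither sequence has a gap (an assumption; conventions vary).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    matches = mismatches = gaps = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1

    compared = matches + mismatches
    identity = 100.0 * matches / compared if compared else 0.0
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "percent_identity": identity,
    }


if __name__ == "__main__":
    # Toy sequences, not real rice data.
    print(compare_aligned("ACGT-ACGT", "ACGTTACCT"))
```

The mismatch count is the number to watch later in the exercise, when we look for genes with enough differences between japonica and indica to design primers that tell them apart.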

 

Next we will use BLAST to see whether the indica and japonica genes differ and, if so, how different they are. Assuming that we do detect differences, we will use a program called Primer3 to design gene-specific primers, then we will test that these primers don't bind elsewhere in the rice genome using BLAST at GRAMENE. We will then double-check that the primers don't have other issues using some programs that check for self-hybridization, stem-loop formation, etc.
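The kinds of checks these programs perform can be sketched in a few lines. The example below is a rough illustration only, not the method used by any of the online tools: it computes GC content, a melting temperature from the simple Wallace rule (2 degrees per A or T, 4 degrees per G or C, reasonable only for short oligos), and the longest stretch of a primer that could pair with another part of the same primer or with a second copy of it. All function names are made up, and the test sequences are toy examples.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_percent(seq):
    """Percentage of bases that are G or C."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature in degrees C from the Wallace rule."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def longest_self_complement(seq):
    """Length of the longest stretch whose reverse complement also occurs
    in the primer. Long stretches hint at hairpins or primer dimers."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    for size in range(len(seq), 0, -1):
        for start in range(len(seq) - size + 1):
            if seq[start:start + size] in rc:
                return size
    return 0


if __name__ == "__main__":
    primer = "GAATTC"  # toy palindromic sequence, not a real primer
    print(gc_percent(primer), wallace_tm(primer), longest_self_complement(primer))
```

A primer with a long self-complementary stretch, like the palindrome in the usage example, is one you would probably want to redesign.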

 

After that we will look at two other websites that post biochemical pathways online with links to the genes encoding the enzymes catalyzing each reaction, and see whether we can identify any additional candidates for further study.

 

Procedure

 

1.     Start your favorite browser and load the course webpage: http://staffweb.wilkes.edu/william.terzaghi/AdvC&M.html

 

2.     Click on "National Center for Biotechnology Information." The "National Center for Biotechnology" page should appear.

 

3.     Type "gigantea" in the search window.

 

4.     Click "nucleotide," then "GI" immediately below "This search in Gene shows 8 results".

 

5.     Scroll to the bottom of the page and copy "NM_102124.2".

 

6.     Go to BLAST search http://blast.ncbi.nlm.nih.gov/Blast.cgi , select "nucleotide BLAST", then paste NM_102124.2 into the search window, select "others, nucleotide collection (nr/nt)" as database and rice as organism, and click "BLAST."

 

7.     Click on NM_001048755.1 in the list of hits.

 

8.     Copy the entire nucleotide sequence at the bottom of the page.

 

9.     Now go to "Gramene" http://www.gramene.org/, and select "BLAST" under the search menu.

 

10.  Paste your sequence into the search window, select "Rice_indica" as species and "cDNAs" as database, then click "RUN." (We could have done this at NCBI or many other places, but this way we narrow our search to the indica rice sequences.)

 

11.  Now look at the output, identify a region that shows a useful number of mismatches, and click A (for alignment).

 

12.  Open a new browser window and go to SDSC Biology Workbench http://workbench.sdsc.edu/

 

13.  Open an account, then go to "nucleic tools."

 

14.  Select "add new nucleic sequence" then click "run."

 

15.  Paste your sequence into Microsoft Word and edit out all of the numbers: go to "replace|special|any digit," leave the replacement empty and click "replace all," then copy the edited sequence.

 

16.  Name your sequence "Japonica Gigantea," then paste in the edited sequence and click "save." Now select your sequence, then "primer 3" and click "run."

 

17.  Enter your target region "1183-2362" then click "submit."

 

18.  If you like what you see, select "import sequence."

 

19.  Now copy each primer sequence, and run a BLAST search at GRAMENE to see where it binds.

 

20.  Go to PCR Primer tools http://molbiol-tools.ca/PCR.htm , then select OligoAnalyzer http://www.idtdna.com/analyzer/Applications/OligoAnalyzer/ and check your primers for bad habits such as self-hybridization and stem-loop formation.

 

21.  Now let's go to KEGG http://www.genome.jp/kegg/ to find some other candidates. Select pathway, then starch metabolism, and click on 3.1.1.11 (pectin esterase).

 

22.  Click on Os08g0450100, then select the nucleotide sequence and compare the indica and japonica genes to see how they differ.

 

23.  Let's see how highly it is expressed. Go to MPSS http://mpss.udel.edu/ , select rice as the database, and enter LOC_Os08g45010, then click "get data."

 

24.  You will get a list of numbers that tell how often the transcript was found per million transcripts.

 

25.  Let's look at this gene with the rice genome browser. Go to http://rice.plantbiology.msu.edu/ , enter your gene in the Landmark window and click "Search."

 

26.  Select Yale tiling array Profile (forward and reverse), EST read pairs and FLcDNAs, then click update image.

 

27.  Let's do the same exercise at GRAMENE. Go to "Gramene" http://www.gramene.org/, and select "PATHWAYS" under the search menu.

 

28.  Pick a pathway related to yield, then pick out a gene.

 

29.  Let's each try to find 5 genes with at least 10 mismatches between japonica and indica, and design primers to tell them apart.
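Two of the steps above are small data manipulations that are easy to script if you end up repeating them for several genes. The sketch below shows one possible way to strip the position numbers and spaces from a pasted sequence (the Microsoft Word step) and how a "per million transcripts" value like the ones reported by MPSS can be computed from raw counts. The function names and the example numbers are made up for illustration, and converting the sequence to upper case is a choice, not a requirement.

```python
import re


def clean_sequence(raw):
    """Remove digits, spaces, line breaks and any other non-letter
    characters from a pasted sequence, and return it in upper case."""
    return re.sub(r"[^A-Za-z]", "", raw).upper()


def per_million(tag_count, total_tags):
    """Scale a raw count to 'per million' so that libraries of
    different sizes can be compared."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return tag_count * 1_000_000 / total_tags


if __name__ == "__main__":
    # Made-up fragment in the numbered layout used on GenBank pages.
    pasted = "        1 atgcc gtaag\n       61 ttgac"
    print(clean_sequence(pasted))
    # Made-up counts: 25 tags for our gene out of 500,000 sequenced.
    print(per_million(25, 500_000))
```

The cleaned sequence can be pasted straight into the "add new nucleic sequence" form.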