Gene Recognition

A project report submitted to

M S Ramaiah Institute of Technology
An Autonomous Institute, Affiliated to

Visvesvaraya Technological University, Belgaum

in partial fulfillment for the award of the degree of

Bachelor of Engineering in Computer Science & Engineering

Submitted by

Mudra Hegde 1MS07CS052
Nakul G V 1MS07CS053

Under the guidance of

Veena G S
Assistant Professor
Computer Science and Engineering
M S Ramaiah Institute of Technology

[pic]

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
M.S.RAMAIAH INSTITUTE OF TECHNOLOGY
(Autonomous Institute, Affiliated to VTU)
BANGALORE-560054
www.msrit.edu

May 2011

Department of Computer Science & Engineering

M. S. Ramaiah Institute of Technology
(Autonomous Institute, Affiliated to VTU)
BANGALORE-560054

[pic]

CERTIFICATE

This is to certify that the project work titled Gene Recognition is carried out by 1MS07CS052 Mudra Hegde and 1MS07CS053 Nakul G V in partial fulfillment for the award of degree of Bachelor of Engineering in Computer Science and Engineering during the year 2011. The Project report has been approved as it satisfies the academic requirements with respect to the project work prescribed for Bachelor of Engineering Degree. To the best of our understanding the work submitted in this report has not been submitted, in part or full, for the award of any diploma or degree of this or any other University.

Veena G S, Guide
Dr. R. Selvarani, Head, Dept of CSE

(External Examiner)

DECLARATION

We hereby declare that the entire work embodied in this report has been carried out by us at M. S. Ramaiah Institute of Technology under the supervision of Veena G S. This report has not been submitted in part or full for the award of any diploma or degree of this or any other University.

1MS07CS052 MUDRA HEGDE
1MS07CS053 NAKUL G V

Abstract

A gene consists of three regions, viz., a promoter region, a coding region and a terminator region. Prokaryotic genes are relatively easy to find compared to eukaryotic genes because they lack introns. Eukaryotic genes that are expressed usually have introns that interrupt the coding sequences. A typical eukaryotic gene, therefore, consists of a set of sequences that appear in mature mRNA (exons) interrupted by introns. Recognizing the promoter regions in a eukaryotic genome is a daunting task. A large amount of sequencing data has been generated, and it needs to be annotated. Annotation helps in gaining a better understanding of the organism, in drug discovery and in finding cures for various genetic disorders.

Gene finding, or gene prediction, in DNA has become one of the foremost computational biology problems for two reasons: first, because completely sequenced genomes have become readily available and second, more importantly, because of the need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge.

In our work we try to find the coding and non-coding regions of an unlabeled string of DNA nucleotides using a Hidden Markov Model (HMM). A Hidden Markov Model is a generalization of a Markov chain in which each (“internal”) state is not directly observable (hence the term hidden) but produces (“emits”) an observable random output (“external”) state, also called an “emission”, according to a given stationary probability law. HMMs employ dynamic programming algorithms such as the Viterbi and forward-backward algorithms to aid in gene recognition.

ACKNOWLEDGEMENT

We consider ourselves privileged to express gratitude and respect towards all those who guided us through the completion of this project.
Firstly, we would like to thank Dr. R. Selva Rani, Head of Department, Department of Computer Science and Engineering, MSRIT, Bangalore for giving us the opportunity to do this project and also for the excellent lab facilities provided.
We would like to express our gratitude to our project guide, Mrs. Veena G.S; we are privileged to experience a sustained enthusiastic and involved interest from her side. We would also like to express our gratitude to Dr K.G Srinivasa, Professor, Department of Computer Science and Engineering for his support.
We would also like to sincerely thank Mr. Shashidhara H S, Associate Professor, Information Science, for all the additional inputs he has given and for helping us gather more information on the various aspects involved in this project. We would like to thank the faculty of the Department of Computer Science and Engineering and the institute for extending a helping hand at every juncture of need and making this possible.

Mudra Hegde Nakul G.V

LIST OF FIGURES

Figure 3.1 Use-case Diagram
Figure 3.2 Flowchart
Figure 4.1 System Architecture
Figure 4.2 Input Component
Figure 4.3 Preprocessing
Figure 4.4 Output Component
Figure 4.5 Input Screen Shot 1
Figure 4.6 Input Screen Shot 2
Figure 4.7 Output Screen Shot
Figure 5.1 Gene Model

Contents

Abstract
Acknowledgements
List of Figures
Contents

1 Introduction
    1.1 General Introduction
    1.2 Statement of the Problem
    1.3 Objectives of the Project
    1.4 Current Scope
    1.5 Future Scope

2 Literature Survey
    2.1 Prokaryotic Gene Structure
    2.2 Eukaryotic Gene Structure
    2.3 Hidden Markov Models
    2.4 GENSCAN Algorithm

3 Software Requirements Specification
    3.1 Introduction
        3.1.1 Purpose
        3.1.2 Scope of the Project
    3.2 General Description
        3.2.1 Project Perspective
        3.2.2 End User Expectation
        3.2.3 General Constraints
        3.2.4 Assumptions and Dependencies
    3.3 Specific Requirements
        3.3.1 Functional Requirements
        3.3.2 Non Functional Requirements
        3.3.3 Software Requirements
        3.3.4 Hardware Requirements
    3.4 Interface Requirements
        3.4.1 User Interface
    3.5 Performance

4 System Design
    4.1 Introduction and Design Overview
    4.2 System Architectural Design
        4.2.1 Chosen System Architecture
        4.2.2 Discussion of Alternative Designs
    4.3 Detailed Description of Components
        4.3.1 Input Component
        4.3.2 Preprocessing
        4.3.3 Computational Component
        4.3.4 Output Component
    4.4 User Interface Design
        4.4.1 Description of User Interface
        4.4.2 Screen Images
        4.4.3 Objects and Actions
    4.5 Test Plan
        4.5.1 Features to be Tested

5 Implementation
    5.1 Hidden Markov Models in Gene Recognition

6 Testing
    6.1 Introduction
        6.1.1 System Overview
        6.1.2 Test Approach
    6.2 Test Cases
        6.2.1 Case I
            6.2.1.1 Purpose
            6.2.1.2 Input
            6.2.1.3 Expected Output & Pass/Fail Criteria
            6.2.1.4 Test Procedure
            6.2.1.5 Test Results
        6.2.2 Case II
            6.2.2.1 Purpose
            6.2.2.2 Input
            6.2.2.3 Expected Output & Pass/Fail Criteria
            6.2.2.4 Test Procedure
            6.2.2.5 Test Results

7 Conclusion & Scope for Future Work

8 Bibliography & References

Chapter 1
INTRODUCTION

1.1 General introduction

Organisms can basically be classified as prokaryotic or eukaryotic. Prokaryotes do not have a well-defined nucleus; they have a single chromosome contained within a nucleoid region. Their gene structure is much simpler than that of eukaryotes. Eukaryotes have a well-defined nucleus with many chromosomes, which are large and linear.
A gene consists of three regions, viz, a promoter region, a coding region and a terminator region. A promoter is a region of DNA that facilitates the transcription of a particular gene. Promoters are located near the genes they regulate, on the same strand and typically upstream (towards the 5' region of the sense strand). In order for the transcription to take place, the enzyme that synthesizes RNA, known as RNA polymerase, must attach to the DNA near a gene. Promoters contain specific DNA sequences and response elements which provide a secure initial binding site for RNA polymerase and for proteins called transcription factors that recruit RNA polymerase.
The coding region of a gene is that portion of a gene's DNA or RNA, composed of exons, that codes for protein. The region is bounded nearer the 5' end by a start codon and nearer the 3' end with a stop codon. The coding region in mRNA is bounded by the five prime untranslated region and the three prime untranslated region, which are also parts of the exons.

1.2 Statement of the Problem

This project aims at annotating a genomic sequence, i.e., finding the coding regions of a gene, given an input DNA sequence that is not annotated.

1.3 Objective of the Project
To find the coding and non-coding regions of an unlabelled string of DNA nucleotides.

1.4 Current Scope
Gene recognition is still an area of active research; algorithms such as GENSCAN and GrailEXP implement it under certain constraints.

1.5 Future Scope
The biggest challenge facing the field of bioinformatics is the huge amount of unannotated data it has to deal with. Huge databases of sequences exist, but what many of those sequences represent is not known. Annotating the genes could therefore enable significant advances: newer and better drugs could be developed for various genetic disorders, and hereditary diseases could be detected in offspring at a very early stage. It would also lead to a better understanding of various organisms.

Chapter 2
LITERATURE SURVEY

2.1 Prokaryotic Gene Structure

Organisms can be broadly classified into prokaryotes and eukaryotes. Prokaryotes are organisms that lack a nucleus and membrane-bound organelles. Prokaryotes have a single chromosome, contained within a nucleoid region rather than a membrane-bound nucleus, but may also have various small circular pieces of DNA called plasmids spread throughout the cell. Their gene structure is much simpler than that of eukaryotic DNA.

The gene is the functional unit of the DNA and the unit of heredity in a living organism. It normally resides on a stretch of DNA that codes for a type of protein or for an RNA chain that has a function in the organism. All living things depend on genes, as they specify all proteins and functional RNA chains. Genes hold the information to build and maintain an organism's cells and pass genetic traits to offspring. A modern working definition of a gene is "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions". The prokaryotic gene is made up of three regions, viz.

• Promoter Region

• Coding Region

• Terminator Region

A promoter region is the part of the gene that facilitates the transcription process. Promoters are located near the genes that they regulate, on the same strand and upstream (towards the 5’ region of the sense strand). For transcription to take place, the enzyme RNA polymerase has to bind to a location near the gene. The promoter region contains specific DNA sequences that form the binding site for the RNA polymerase. The promoter region in prokaryotes consists of two consensus sequences. The first one, known as the Pribnow box, is the sequence of six nucleotides TATAAT, centred about 10 bases upstream of the transcription start site. The other, centred about 35 bases upstream, is the sequence of six nucleotides TTGACA.

The coding region is the part of the DNA that is translated; in prokaryotes it is uninterrupted, since prokaryotic genes are devoid of introns. The coding region starts with the initiation codon (ATG) and ends with a termination codon (TAG, TAA or TGA).
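The start and stop codon rule above can be illustrated with a naive open reading frame (ORF) scan. This is only a sketch of the rule, not the method used in this project: the function name `find_orfs` and the `min_codons` threshold are assumptions made for illustration, only the forward strand is scanned, and, as discussed later in the report, such pattern matching alone is unreliable because ATG and the stop triplets also occur outside genes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs, end exclusive, of ATG...stop stretches.

    Each of the three forward reading frames is scanned codon by codon.
    min_codons is a hypothetical length filter (start and stop codons included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ATG in this frame
    return sorted(orfs)

if __name__ == "__main__":
    # The only ORF is ATG AAA TAG in the third reading frame.
    print(find_orfs("CCATGAAATAGGG"))
```

Scanning each frame separately keeps the codon boundaries aligned with the candidate start codon, which is why a TGA triplet in a different frame does not end the ORF.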

The terminator region is the region that marks the end of the gene or the operon on genomic DNA for transcription (An operon is a functioning unit of the genomic material containing a cluster of genes under the control of a single promoter).

Prokaryotic genes often overlap with each other, which makes the detection of translation initiation sites and the prediction of prokaryotic genes difficult.

Many gene finding programs for prokaryotic genes have been developed, including ECOPARSE, ORPHEUS, GeneMark, GeneMark.hmm, GeneMarkS, EasyGene and GLIMMER. GeneMark uses a Bayesian method. GeneMark.hmm uses a modification of the Viterbi algorithm for HMMs (Hidden Markov Models) with duration to identify the most likely global path through hidden functional states given the DNA sequence. GeneMark.fba, an extension of the GeneMark algorithm, uses the forward-backward algorithm from HMM theory for local posterior decoding. Given the DNA sequence, this program determines an a posteriori probability for each nucleotide to belong to a coding or non-coding region. Also, for any open reading frame (ORF), it assigns a score defined as a probabilistic measure of all paths through hidden states that traverse the ORF as a coding region. In the GLIMMER algorithm, Markov models of several orders are combined into an ‘interpolated’ model for gene prediction.

2.2 Eukaryotic Gene Structure

The eukaryotic organism, as opposed to a prokaryotic organism, has a well-defined nucleus and membrane-bound organelles. In contrast to the prokaryotes, the eukaryotes have many chromosomes that are large and linear. The chromosomes in eukaryotes are also packaged by proteins into a structure known as the chromatin due to which very long chromosomes fit into the nucleus.

The Eukaryotic gene is much more complex than a prokaryotic gene. The eukaryotic gene consists of

• Promoter Region

• Exons

• Introns

• Terminator Region

The transcription start site lies within the basal (core) promoter, which often contains a sequence of 7 bases (TATAAAA) called the TATA box. The basal or core promoter is found in all protein-coding genes. This is in sharp contrast to the upstream promoter, whose structure and associated binding factors differ from gene to gene. Many different genes and many different types of cells share the same transcription factors: not only those that bind at the basal promoter but even some of those that bind upstream. What turns on a particular gene in a particular cell is probably the unique combination of promoter sites and the transcription factors that are chosen.

Eukaryotic promoters are extremely diverse and are difficult to characterize. They typically lie upstream of the gene and can have regulatory elements several kilobases away from the transcriptional start site (enhancers). In eukaryotes, the transcriptional complex can cause the DNA to bend back on itself, which allows for placement of regulatory sequences far from the actual site of transcription. Many eukaryotic promoters, between 10 and 20% of all genes, contain a TATA box (sequence TATAAA), which in turn binds a TATA binding protein which assists in the formation of the RNA polymerase transcriptional complex. The TATA box typically lies very close to the transcriptional start site (often within 50 bases). Eukaryotic promoter regulatory sequences typically bind proteins called transcription factors which are involved in the formation of the transcriptional complex.

An exon is a nucleic acid sequence that is represented in the mature form of an RNA molecule after either portions of a precursor RNA (introns) have been removed by cis-splicing or when two or more precursor RNA molecules have been ligated by trans-splicing. The mature RNA molecule can be a messenger RNA or a functional form of a non-coding RNA such as rRNA or tRNA. Depending on the context, exon can refer to the sequence in the DNA or its RNA transcript.

Genes that are expressed usually have introns that interrupt the coding sequences. A typical eukaryotic gene, therefore, consists of a set of sequences that appear in mature mRNA (called exons) interrupted by introns. The regions between genes are likewise not expressed, but may help with chromatin assembly, contain promoters, and so forth.

Intron sequences contain some common features. Most introns begin with the sequence GT (GU in RNA) and end with the sequence AG. Otherwise, very little similarity exists among them. Intron sequences may be large relative to coding sequences; in some genes, over 90 percent of the sequence between the 5′ and 3′ ends of the mRNA is introns. RNA polymerase transcribes intron sequences. This means that eukaryotic mRNA precursors must be processed to remove introns as well as to add the caps at the 5′ end and polyadenylic acid (poly A) sequences at the 3′ end.

2.3 Hidden Markov Models

A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties of observed real world data. A good HMM accurately models the real world source of the observed data and has the ability to simulate the source. Machine learning techniques based on HMMs have been successfully applied to problems including speech recognition, optical character recognition, and problems in computational biology. The main computational biology problems with HMM-based solutions are protein family profiling, protein binding site recognition and gene finding in DNA.

Gene finding, or gene prediction, in DNA has become one of the foremost computational biology problems for two reasons: first, because completely sequenced genomes have become readily available and second, more importantly, because of the need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge.

A basic Markov model of a process is a model where each state corresponds to an observable event and the state transition probabilities depend only on the current and predecessor state. This model is extended to a Hidden Markov model for application to more complex processes, including speech recognition and computational gene finding. A generalized Hidden Markov Model (HMM) consists of a finite set of states, an alphabet of output symbols, a set of state transition probabilities and a set of emission probabilities. The emission probabilities specify the distribution of output symbols that may be emitted from each state. Therefore in a hidden model, there are two stochastic processes; the process of moving between states and the process of emitting an output sequence. The sequence of state transitions is a hidden process and is observed through the sequence of emitted symbols.

The field of computational biology involves the application of computer science theories and approaches to biological and medical problems. Computational biology is motivated by newly available and abundant raw molecular datasets gathered from a variety of organisms. Though the availability of this data marks a new era in biological research, it alone does not provide any biologically significant knowledge. The goal of computational biology is then to elucidate additional information regarding protein coding, protein function and many other cellular mechanisms from the raw datasets. This new information is required for drug design, medical diagnosis, medical treatment and countless fields of research.

Many effective tools based on HMMs have been created for the purpose of gene finding. Among the most successful tools are Genie, GeneID and HMMGene. Though each tool has a slightly different model, they each use the technique of combining several specialized submodels into a larger framework. The submodels correspond directly to different regions of DNA defined according to their function in the process of gene transcription. Most of these gene finding tools are hybrid models that include neural network components. In these tools, a neural network rather than an HMM models certain regions, such as splice sites. The overall framework of an HMM-based gene finder combines the submodels into a larger model corresponding to the organization of a gene in DNA and its functional roles.

2.4 GENSCAN Algorithm

Identifying genes in DNA sequences by computational methods is a topic on which a lot of research has been done in the past few years. This problem deals with the precise sequence determinants of transcription, translation and RNA splicing. Software for exon prediction has become common in genome sequencing laboratories to identify genes in newly sequenced regions.

Early approaches to gene recognition concentrated on the prediction of individual functional elements such as promoters, coding regions and splice sites. More recent approaches to gene finding focus on integrating all these factors. Some examples of such approaches are FGENEH, GENMARK, GeneID, Genie, GeneParser and GRAIL II. Two important limitations of these existing algorithms are that they assume the input sequence contains exactly one gene, and that their accuracy, when measured on independent control sets, may be lower than was originally presumed. The accuracy is such that only 50% of the exons are actually identified. GENSCAN uses a general probabilistic model for human genomic sequences. The overall architecture of the model is a Generalized Hidden Markov Model. This algorithm differs from the other algorithms in the following aspects:

i) A double-stranded DNA sequence is considered, with potential genes on both strands analyzed simultaneously and in an integrated fashion.

ii) It does not make the assumption, made by other algorithms, that the input sequence contains exactly one complete gene. The model allows for an input sequence that contains a partial gene, a complete gene, multiple complete genes or no gene at all.

iii) It introduces a new method, Maximum Dependence Decomposition, to model the functional signals in DNA sequences.

Chapter 3
SOFTWARE REQUIREMENTS SPECIFICATION

3.1 Introduction

Organisms can basically be classified as prokaryotic or eukaryotic. Prokaryotes do not have a well-defined nucleus; they have a single chromosome contained within a nucleoid region. Their gene structure is much simpler than that of eukaryotes. Eukaryotes have a well-defined nucleus with many chromosomes, which are large and linear.

A gene consists of three regions, viz, a promoter region, a coding region and a terminator region. Gene Recognition is a particularly difficult problem in Bioinformatics. The complexity that the DNA sequences involve makes the task even more daunting. The solution to this problem can be found to a certain extent by using Hidden Markov Models.

A Hidden Markov Model is a generalization of a Markov chain in which each (“internal”) state is not directly observable (hence the term hidden) but produces (“emits”) an observable random output (“external”) state, also called an “emission”, according to a given stationary probability law. In this case, the time evolution of the internal states can be inferred only through the sequence of the observed output states. If the number of internal states is N, the transition probability law is described by a matrix with N × N values; if the number of emissions is M, the emission probability law is described by a matrix with N × M values. A model is fully defined once these two matrices and the initial distribution of the internal states are given.
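To make the definition concrete, the sketch below stores a model as exactly the three objects named above: an initial distribution, an N × N transition matrix and an N × M emission matrix. The class name `HMM`, the two-state layout (coding and non-coding) and every probability value are illustrative assumptions, not parameters taken from this project.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list      # names of the N internal (hidden) states
    symbols: list     # names of the M observable emissions
    initial: list     # length N: initial[i] = P(first state is i)
    transition: list  # N x N: transition[i][j] = P(next state j | current state i)
    emission: list    # N x M: emission[i][k] = P(symbol k | state i)

    def validate(self, tol=1e-9):
        """Check matrix shapes and that every probability row sums to one."""
        n, m = len(self.states), len(self.symbols)
        rows = [("initial", self.initial, n)]
        rows += [("transition", r, n) for r in self.transition]
        rows += [("emission", r, m) for r in self.emission]
        if len(self.transition) != n or len(self.emission) != n:
            raise ValueError("transition and emission need one row per state")
        for name, row, width in rows:
            if len(row) != width or abs(sum(row) - 1.0) > tol:
                raise ValueError(f"bad {name} row: {row}")
        return True

# Hypothetical toy model: all numbers are made up for illustration.
toy = HMM(
    states=["coding", "noncoding"],
    symbols=["A", "C", "G", "T"],
    initial=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.2, 0.8]],
    emission=[[0.2, 0.3, 0.3, 0.2], [0.3, 0.2, 0.2, 0.3]],
)

if __name__ == "__main__":
    print(toy.validate())
```

Here N = 2 and M = 4, so the transition matrix holds 4 values and the emission matrix 8.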

The most commonly used algorithms in Hidden Markov Models are:

1. The Forward Algorithm: computes the probability of the observed emission sequence (given a model), working forward from the beginning of the sequence.

2. The Backward Algorithm: computes the probability of the observed emission sequence (given a model), working backward from the end of the sequence.

3. The Viterbi Algorithm: finds the sequence of internal states that has, as a whole, the highest probability. It is the most widely used of the three.
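The forward and backward recursions can be sketched as follows. This is a minimal, unscaled version suitable only for short sequences (practical implementations rescale or work with logarithms to avoid underflow), and the toy parameters `INIT`, `TRANS` and `EMIT` are assumptions made for illustration. Both recursions do O(N²) work per position, which gives the O(N²L) cost quoted in the requirements of this chapter.

```python
# Hypothetical two-state toy model (state 0 = coding, state 1 = non-coding).
SYMBOLS = "ACGT"
INIT = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.2, 0.8]]
EMIT = [[0.2, 0.3, 0.3, 0.2], [0.3, 0.2, 0.2, 0.3]]

def encode(seq):
    """Map a nucleotide string to a list of symbol indices."""
    return [SYMBOLS.index(ch) for ch in seq.upper()]

def forward(obs, init, trans, emit):
    """f[t][j] = P(obs[0..t], state at position t is j)."""
    n = len(init)
    f = [[init[s] * emit[s][obs[0]] for s in range(n)]]
    for x in obs[1:]:
        prev = f[-1]
        f.append([emit[j][x] * sum(prev[i] * trans[i][j] for i in range(n))
                  for j in range(n)])
    return f

def backward(obs, trans, emit):
    """b[t][i] = P(obs[t+1..] | state at position t is i)."""
    n = len(trans)
    b = [[1.0] * n]                      # the last position has nothing left to emit
    for x in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, [sum(trans[i][j] * emit[j][x] * nxt[j] for j in range(n))
                     for i in range(n)])
    return b

if __name__ == "__main__":
    obs = encode("AGCT")
    f = forward(obs, INIT, TRANS, EMIT)
    b = backward(obs, TRANS, EMIT)
    # Both passes give the probability of the whole observed sequence.
    print(sum(f[-1]))
    print(sum(INIT[i] * EMIT[i][obs[0]] * b[0][i] for i in range(len(INIT))))
```

The two printed values agree up to floating-point rounding, which is a convenient sanity check when implementing either pass.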

3.1.1 Purpose of the Project
The purpose of this project is to annotate the unannotated DNA sequences of various species such as Saccharomyces cerevisiae, Homo sapiens etc.

3.1.2 Scope of the Project
This project aims to recognize the genes thus helping in better drug discovery and a better understanding of the organisms.

3.2 General Description

3.2.1 Project Perspective
In this project we aim to recognize genes in an unannotated sequence of DNA. Each gene will be recognized with its start and stop codons appropriately marked.

3.2.2 End User Expectation
The expected output is the annotated genomic sequence of a particular species with all the parts of the gene, such as the promoter, the introns, the exons and the terminator region, properly identified.

3.2.3 General Constraints
We come across a large number of constraints in gene recognition. Since the start codon, ATG, codes for the amino acid methionine, it may also appear inside the coding region, which makes it difficult to identify the start of the gene. Promoters help in identifying the start of the gene, but in eukaryotic organisms the promoter sequences are many and quite complex. The stop codons, TAA, TGA and TAG, also occur in the intergenic region, which is a further constraint. Another constraint is that a single model cannot be made to work for sequences of all species.

3.2.4 Assumptions and Dependencies
The assumption that we make in our project is that the input sequence is complete and does not contain any partial genes.

3.3 Specific Requirements

3.3.1 Functional Requirements
The functional requirements are as follows:

i) Take an unannotated input sequence and annotate it.

ii) Visualize the annotated output sequence.

3.3.2 Non-Functional Requirements
The Forward and Backward algorithms must run in O(N²L) time, where L is the length of the sequence and N is the number of states in the model.

3.3.3 Software System Requirements
This gene recognition tool requires a Linux operating system with the Apache server.

3.3.4 Hardware System Requirements

• Processor: Intel Pentium (1.8 GHz)

• RAM: 1 GB

• Hard Disk: 80 GB

• Cache: 2 MB L2 Cache

3.4 Interface Requirements

3.4.1 User Interface
We need one GUI for accepting the input sequence and one more GUI for visualizing the annotated output sequence.

[pic]
Fig. 3.1 Use-case Diagram

[pic]
Fig 3.2 Flowchart

3.5 Performance
The efficiency expected of the forward and backward algorithms is of the order of O(N²L). At run time, warnings are issued when iterators follow an inconsistent or out-of-bounds path, and when negative probabilities are encountered. Also, to improve efficiency, constant transition probabilities are pre-computed outside the main loop whenever possible. Finally, all loops over transitions and states are unrolled, reducing the number of run-time decisions and lookups.

Chapter 4
SYSTEM DESIGN

4.1 Introduction and Design Overview

This project deals with one of the most challenging problems in bioinformatics, i.e., gene recognition. Gene prediction in DNA has become one of the foremost computational biology problems for two reasons: first, because completely sequenced genomes have become readily available and second, more importantly, because of the need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge.

This document deals with the design of the architecture used for this project. The problem of gene recognition has been approached using numerous methods, of which Hidden Markov Models are by far the most widely used. Hidden Markov Models avoid the problems that arise when gene recognition is performed by pattern recognition.

The input is an unannotated DNA sequence in FASTA format. It first undergoes preprocessing, which checks that the format is correct and copies the sequence to a temporary file without its description line. The sequence then goes through the computation phase, where the Viterbi and forward-backward algorithms are used to find the optimal path and hence the gene. The output is then displayed to the user.

4.2 System Architectural Design

4.2.1 Chosen System Architecture

[pic]

Fig 4.1 System Architecture

The system has four components:

1. Input Component

2. Preprocessing

3. Computational Component

4. Output Component

4.2.2 Discussion of Alternative Designs

One of the alternative designs proposed for the recognition of genes was pattern matching using parallel programming. It involves recognizing start codons, promoter sequences, terminator codons and intrinsic terminator sequences in a given DNA sequence, and hence recognizing the gene. This design failed for several reasons and was also found to be an inefficient way of locating genes.

The first step involved identifying the terminator codons in parallel by dividing the input sequence into blocks of a fixed number of nucleotides. The major drawback here was that certain sequences in the intergenic region also matched the terminator codons.

Another major drawback is that the start codon, ATG, also codes for the amino acid methionine. Hence, the sequence ATG may also be present inside the coding region. This makes it difficult to recognize the start of the gene using pattern matching.

Another difficulty concerned the promoter sequences. Prokaryotic genes have two fixed promoter sequences that mark the start of a gene, but in eukaryotic genes the promoter sequences are far more complex. They are extremely diverse and difficult to characterize, and their regulatory elements may lie several kilobases away from the transcriptional start site, which makes searching for them by pattern matching highly inefficient. The nucleotides in the consensus sequence may also vary from gene to gene. Hence, it becomes very difficult to search for the consensus sequence in a given DNA sequence.

The termination of a gene may also be identified using sequences known as intrinsic terminators. An intrinsic terminator is an inverted repeat (e.g. 5’ CAGTTA|TAACTG 3’) followed by up to six thymine nucleotides (TTTTTT). The complication here is that the inverted repeat may be of any length.
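The pattern just described can be searched for directly, which also shows why the unknown repeat length is costly: every candidate arm length has to be tried at every position. The sketch below is only an illustration of that rejected pattern-matching approach. The function names and the parameters `min_arm`, `max_arm` and `min_t_run` are assumptions, and, like the example in the text, it assumes the two halves of the inverted repeat are directly adjacent.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.translate(COMPLEMENT)[::-1]

def find_intrinsic_terminators(seq, min_arm=4, max_arm=10, min_t_run=4):
    """Return (start, arm_length) for every inverted repeat followed by a T run.

    An inverted repeat is a stretch ("arm") immediately followed by its own
    reverse complement. All three thresholds are hypothetical defaults.
    """
    seq = seq.upper()
    hits = []
    for arm in range(max_arm, min_arm - 1, -1):          # try every arm length
        for start in range(len(seq) - 2 * arm - min_t_run + 1):
            left = seq[start:start + arm]
            right = seq[start + arm:start + 2 * arm]
            tail = seq[start + 2 * arm:start + 2 * arm + min_t_run]
            if right == reverse_complement(left) and tail == "T" * min_t_run:
                hits.append((start, arm))
    return hits

if __name__ == "__main__":
    # The example repeat from the text, flanked by arbitrary bases.
    print(find_intrinsic_terminators("GGCAGTTATAACTGTTTTTTCC"))
```

The nested loops make the search cost grow with both the sequence length and the range of arm lengths considered.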

Partial genes would also make it difficult to recognize a gene using pattern recognition. Hence, we planned to use Hidden Markov Models for gene recognition.

4.3 Detailed Description of Components
4.3.1 Input Component
The input component takes a text file containing DNA sequences in FASTA format as input. This acts as the sequence that has to be annotated.

[pic]

Fig 4.2 Input Component

4.3.2 Preprocessing
The description line in the input file is removed and the remaining input, which is the DNA sequence, is copied into a temporary file. If the user enters data in any format other than FASTA, appropriate error messages are displayed.
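The preprocessing step can be sketched as follows. This is an illustrative Python version, not the project's own code; it returns the sequence as a string instead of writing a temporary file, and the function name, error messages and the restriction to the letters A, C, G and T are assumptions.

```python
def preprocess_fasta(text):
    """Strip the FASTA description line and return the bare DNA sequence.

    Raises ValueError when the input does not look like a single FASTA record.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("Input must start with a FASTA description line ('>').")
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("FASTA record contains no sequence.")
    invalid = set(sequence) - set("ACGT")
    if invalid:
        raise ValueError(f"Unexpected characters in sequence: {sorted(invalid)}")
    return sequence


# Hypothetical record used only to exercise the function.
record = ">example_id hypothetical description\nATGC\nggta\n"
print(preprocess_fasta(record))
```

A caller such as the front end could catch the ValueError and show its message to the user.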

[pic]
Fig 4.3 Preprocessing

4.3.3 Computational Component
The computational component involved in this project is the application of Hidden Markov Models. The preprocessed sequence is taken as input and the Viterbi algorithm and forward-backward algorithm are applied to it. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, which will generate a given output sequence given the model parameters. The forward-backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of observations/emissions.

4.3.4 Output Component
This component takes the annotated sequence from the computational component and displays the gene.

[pic]
Fig 4.4 Output Component

4.4 User Interface Design
4.4.1 Description of the User Interface
The user interface for this project consists of a page where the user may enter the DNA sequence or upload a file containing the sequence. The output is also displayed on the same page.

4.4.2 Screen Images
[pic]
Fig 4.5 Input Screen Shot 1

The user can enter his DNA sequence in the box provided with the label “Enter the input sequence in FASTA format”. If the user inputs a sequence that is not in FASTA format a dialog box opens with an error message asking the user to input the sequence in the correct format.

[pic]

Fig 4.6 Input Screen Shot 2

If the user wants to upload a file containing the input sequence instead of directly inputting the sequence he may do so by clicking the “Upload File” button. This opens a box wherein the user may type the name of the file that contains the input sequence.
[pic]

Fig 4.7 Output Screen Shot

Once the user inputs the sequence he may click “Submit” to obtain the annotated sequence in the box provided below “The annotated sequence is”.

4.4.3 Objects and Actions

The objects present on the user interface are:

Text Area1: To enter the input sequence directly. The text area is preceded by a statement “Enter the input sequence in FASTA format”.

“Upload File” button: If the user does not want to input the DNA sequence directly he may use this button to upload a file that contains the input sequence. When this button is clicked a text area appears where the user may enter the file name.

“Submit” button: This button may be clicked when the sequence or the file has to be submitted for annotating the sequence that they contain. Once this button is clicked the sequence gets annotated and the output is displayed.

Text Area2: This text area is used to display the output i.e. the annotated sequence. It is preceded by “The annotated output sequence is”.

“Exit” button: This button is used to exit the screen.

4.5 Test Plan
4.5.1 Features to be Tested
The features to be tested are as follows:

|Features to be tested |Input |Expected Output |
|Input in FASTA format |Sequence entered by the user |No: error messages should be displayed. Yes: proceed. |
|Preprocessing |Sequence entered by the user |No: error. Yes: a temporary file containing only the sequence should be created. |
|Start codon identification |Temporary file with sequence |The ATG sequence should be identified at the proper position. |
|Terminator codon identification |Temporary file with sequence |TAA/TAG/TGA should be identified at the correct position. |
|Exon and intron identification |Temporary file with sequence |Exons and introns should follow the AG-GT rule and be identified at the proper position. |

Table 4.1 Features to be Tested

Chapter 5
IMPLEMENTATION

5.1 Hidden Markov Models in Gene Recognition

A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties of observed real world data. A good HMM accurately models the real world source of the observed data and has the ability to simulate the source. Machine Learning techniques based on HMMs have been successfully applied to problems including speech recognition, optical character recognition, and problems in computational biology. The main computational biology problems with HMM-based solutions are protein family profiling, protein binding site recognition and the problem that is the topic of this report, gene finding in DNA. Gene finding or gene prediction in DNA has become one of the foremost computational biology problems for two reasons: first, completely sequenced genomes have become readily available; second, and more importantly, there is a need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge.

A basic Markov model of a process is a model where each state corresponds to an observable event and the state transition probabilities depend only on the current and predecessor state. This model is extended to a Hidden Markov Model for application to more complex processes, including speech recognition and computational gene finding. In general, a Hidden Markov Model (HMM) consists of a finite set of states, an alphabet of output symbols, a set of state transition probabilities and a set of emission probabilities. The emission probabilities specify the distribution of output symbols that may be emitted from each state. Therefore, in a hidden model there are two stochastic processes: the process of moving between states and the process of emitting an output sequence.

The problem of finding genes in DNA has been studied for many years. It was one of the first problems tackled once sufficient genomic data became available. The problem is: given a sequence of DNA, determine the locations of genes, which are the regions containing information that codes for proteins. At a very general level, nucleotides can be classified as belonging to coding regions in a gene, non-coding regions in a gene or intergenic regions. The problem of gene finding can then be stated as follows:

Input: A sequence of DNA $X = (x_1, \ldots, x_n) \in \Sigma^*$, where $\Sigma = \{A, C, G, T\}$.

Output: Correct labeling of each element in X as belonging to a coding region, non-coding region or intergenic region.

Gene finding becomes complicated when the problem is approached in more biological detail. A eukaryotic gene contains coding regions called exons which may be interrupted by non-coding regions called introns. The exons and introns are separated by splice sites. Regions outside genes are called intergenic. The goal of gene finding is then to annotate the sets of genomic data with the location of genes and within these genes, specific areas such as promoter regions, introns and exons.

The Hidden Markov Model that we have used in our project is shown below:

[pic]
Fig 5.1 Gene Model

The Gene Model shown above consists of six states, viz. Start, Background, Start Codon, Gene, Stop and End. These six states are divided into three blocks, viz. Block 1, Block 2 and Block 3. Block 1 consists of just the Start state, which is where scanning of the sequence begins. Block 2 consists of four states: Background, Start Codon, Gene and Stop. Block 3 consists of a single state, End. These states constitute the finite set of states of the Hidden Markov Model. The alphabet of output symbols in this HMM is the set of nucleotides A, T, G and C.

The model moves from one state to another according to the state transition probabilities. The transition from the Start state to the Background state has ‘full’ probability, i.e. 1.0. The Background state accounts for the intergenic region, where the nucleotide encountered may be A, T, G or C with equal likelihood. Therefore, its emission probability is ‘emitbackground = 0.25’. The model remains in the Background state until a start codon is encountered; the transition probability of remaining in the Background state is ‘bgbg = full - bgend - bgstart’. Once the start codon ATG is found, the model moves from the Background state to the Start Codon state.

The transition from Background state to Start Codon state has the probability ‘bgstart= Gene Density’. The emission probability of Start Codon state is ‘emitstart’. If an ATG is encountered then the emission probability is 1.0 otherwise it is considered as 0.0. Once a start codon is encountered it leads to a gene, hence the model moves from the Start Codon state to the Gene state. The transition probability for this transition is ‘full’ i.e. 1.0 since once the start codon is found, the gene is found. There is more than a single codon that is encountered when we are in the Gene state. Hence, the probability of staying in the Gene state is ‘extend= 1.0- (1.0/ Gene Length)’.
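As a hedged aside on the choice of ‘extend’: assume that each visit to the Gene state emits one codon and that the decision to leave is made independently at every step, and let L stand for Gene Length measured in codons (both are assumptions of this sketch). The number of codons N emitted before the model leaves the Gene state then follows a geometric distribution:

\[
P(N = n) = \text{extend}^{\,n-1}\,(1 - \text{extend}), \qquad
E[N] = \frac{1}{1 - \text{extend}} = \frac{1}{1/L} = L .
\]

Under these assumptions the expected number of codons spent in the Gene state equals Gene Length. For a hypothetical L = 300, extend = 1 - 1/300 ≈ 0.9967.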

The emission probability for the Gene state is ‘emitcodon’. This state emits any of the 61 codons that code for the amino acids forming the protein, i.e. every codon except the stop codons TAA, TGA and TAG. Hence, the emission probability is 1/61 for each of these codons and 0.0 for a stop codon. When a stop codon is encountered, a transition to the next state is made.

The next state is the Stop state, which is entered if any of the stop codons (TAA, TAG or TGA) is encountered. The transition probability from the Gene state to the Stop state is ‘genestop= 1.0/ Gene Length’. Encountering a stop codon marks the end of that particular gene, but the DNA sequence may contain more nucleotides. Hence, a transition takes place from the Stop state to the Background state, with probability ‘full’, i.e. 1.0. The Stop state emits the stop codons; hence its emission probability is 1/3 if a stop codon is found and 0.0 otherwise.
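The four emission probabilities described so far can be written down as small functions. The following is an illustrative Python sketch rather than the project's own code; the function names mirror the quoted parameter names but are otherwise assumptions. The final lines check that each codon-emitting state defines a proper probability distribution over the 64 possible codons.

```python
from itertools import product

NUCLEOTIDES = {"A", "C", "G", "T"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
ALL_CODONS = ["".join(triple) for triple in product("ACGT", repeat=3)]


def emit_background(nucleotide):
    """Background state: each single nucleotide is equally likely."""
    return 0.25 if nucleotide in NUCLEOTIDES else 0.0


def emit_start(codon):
    """Start Codon state: only ATG can be emitted."""
    return 1.0 if codon == "ATG" else 0.0


def emit_codon(codon):
    """Gene state: any of the 61 non-stop codons, each with probability 1/61."""
    if codon not in ALL_CODONS or codon in STOP_CODONS:
        return 0.0
    return 1.0 / 61.0


def emit_stop(codon):
    """Stop state: one of the three stop codons, each with probability 1/3."""
    return 1.0 / 3.0 if codon in STOP_CODONS else 0.0


# Sanity check: every emission distribution sums to 1 (up to rounding).
gene_total = sum(emit_codon(c) for c in ALL_CODONS)
stop_total = sum(emit_stop(c) for c in ALL_CODONS)
start_total = sum(emit_start(c) for c in ALL_CODONS)
background_total = sum(emit_background(n) for n in "ACGT")
print(gene_total, stop_total, start_total, background_total)
```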

Once the transition has taken place from Stop to Background the entire sequence is scanned and finally a transition from Background to End state takes place with a transition probability of ‘bgend=0.0001’. The End state belongs to Block 3 of the model and its emission is ‘empty’.
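The transitions described in this section can be collected into a single table. The sketch below is illustrative Python, not the project's implementation; the values used for Gene Density and Gene Length are hypothetical, since the report does not state them, and the state labels are chosen for this example. The check at the end confirms that the outgoing probabilities of every non-final state sum to 1.

```python
def build_transitions(gene_density, gene_length, bgend=0.0001):
    """Return the eight transitions of the gene model as {(from, to): probability}."""
    full = 1.0
    bgstart = gene_density
    bgbg = full - bgend - bgstart
    extend = 1.0 - 1.0 / gene_length
    genestop = 1.0 / gene_length
    return {
        ("Start", "Background"): full,
        ("Background", "Background"): bgbg,
        ("Background", "StartCodon"): bgstart,
        ("Background", "End"): bgend,
        ("StartCodon", "Gene"): full,
        ("Gene", "Gene"): extend,
        ("Gene", "Stop"): genestop,
        ("Stop", "Background"): full,
    }


def outgoing_total(transitions, state):
    """Sum the probabilities of all transitions leaving the given state."""
    return sum(p for (source, _), p in transitions.items() if source == state)


# Hypothetical parameter values, chosen only for illustration.
transitions = build_transitions(gene_density=0.001, gene_length=300)
for state in ("Start", "Background", "StartCodon", "Gene", "Stop"):
    print(state, round(outgoing_total(transitions, state), 12))
```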

The algorithms used in the implementation of this project are the forward-backward algorithm and the Viterbi algorithm.

The forward-backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of observations/emissions $o_{1:T} = o_1, \ldots, o_T$, i.e. it computes, for all hidden state variables $X_t \in \{X_1, \ldots, X_T\}$, the distribution $P(X_t \mid o_{1:T})$. The algorithm makes use of the principle of dynamic programming to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name forward-backward algorithm.
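The two passes can be sketched for a textbook HMM in which every state emits one symbol per step. This is a generic Python illustration, not the project's Forward and Backward functions; the two-state toy model (a background state "B" and a GC-rich state "G") and all of its probabilities are hypothetical. No scaling is applied, so the sketch is only suitable for short sequences.

```python
def forward_backward(observations, states, start_p, trans_p, emit_p):
    """Return, for each position, a dict of posterior state probabilities."""
    # Forward pass: alpha[t][s] = P(o_1..o_t, state at t is s).
    alpha = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    for obs in observations[1:]:
        prev = alpha[-1]
        alpha.append({
            s: emit_p[s][obs] * sum(prev[r] * trans_p[r][s] for r in states)
            for s in states
        })
    # Backward pass: beta[t][s] = P(o_{t+1}..o_T | state at t is s).
    beta = [{s: 1.0 for s in states}]
    for obs in reversed(observations[1:]):
        nxt = beta[0]
        beta.insert(0, {
            s: sum(trans_p[s][r] * emit_p[r][obs] * nxt[r] for r in states)
            for s in states
        })
    # Combine both passes and normalize by the probability of the observations.
    evidence = sum(alpha[-1][s] for s in states)
    return [
        {s: alpha[t][s] * beta[t][s] / evidence for s in states}
        for t in range(len(observations))
    ]


# Hypothetical toy model: "B" is background, "G" is a GC-rich region.
states = ("B", "G")
start_p = {"B": 0.9, "G": 0.1}
trans_p = {"B": {"B": 0.9, "G": 0.1}, "G": {"B": 0.2, "G": 0.8}}
emit_p = {
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
posteriors = forward_backward("ATGCGCAT", states, start_p, trans_p, emit_p)
for position, dist in enumerate(posteriors):
    print(position, {s: round(p, 3) for s, p in dist.items()})
```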

The forward algorithm is used to find the next likely state in the given finite set of states. In our implementation the forward algorithm is implemented in a function called Forward which returns the next likely state. In this function the 8 transitions are defined according to our gene model. It scans the sequence in the forward direction and finds the likely states.

The backward algorithm is also used to find the next likely state in a given finite set of states but it scans the sequence in the reverse direction. In our implementation the backward algorithm has been implemented in the function called Backward. This function also returns the next likely state.

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events, especially in the context of Markov information sources, and more generally, hidden Markov models.

The algorithm makes a number of assumptions.
• First, both the observed events and hidden events must be in a sequence. This sequence often corresponds to time.
• Second, these two sequences need to be aligned, and an instance of an observed event needs to correspond to exactly one instance of a hidden event.
• Third, computing the most likely hidden sequence up to a certain point t must depend only on the observed event at point t, and the most likely sequence at point t − 1.
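These assumptions map directly onto the recurrence. The following is a generic Python sketch of the Viterbi algorithm for an HMM that emits one symbol per state; it is not the project's Viterbi_trace or Viterbi_recurse code, and the toy model is hypothetical. Scores are kept in log space so that long sequences do not underflow.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely sequence of hidden states as a list."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[t][s]: best log-probability of any path that ends in state s at t.
    score = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = []
    for obs in observations[1:]:
        prev = score[-1]
        column, pointers = {}, {}
        for s in states:
            # Only the previous column and the current observation are needed.
            best = max(states, key=lambda r: prev[r] + log(trans_p[r][s]))
            column[s] = prev[best] + log(trans_p[best][s]) + log(emit_p[s][obs])
            pointers[s] = best
        score.append(column)
        back.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: "B" is background, "G" is a GC-rich region.
states = ("B", "G")
start_p = {"B": 0.9, "G": 0.1}
trans_p = {"B": {"B": 0.9, "G": 0.1}, "G": {"B": 0.2, "G": 0.8}}
emit_p = {
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
print("".join(viterbi("ATATGCGCGCGCGCATAT", states, start_p, trans_p, emit_p)))
```

The forward sweep fills the score table and the trace-back recovers the path, which is one common way to split the algorithm into two routines.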

We have implemented the Viterbi algorithm in two functions, Viterbi_trace and Viterbi_recurse. Viterbi_trace finds the most likely path in the forward direction whereas Viterbi_recurse finds the path in the reverse direction. These two functions return the most likely path followed by the sequence. Another function, addEdge, is used to assign an edge from the ‘from’ state to the ‘to’ state with transitions, probability and emissions as parameters.

Chapter 6
TESTING
6.1 Introduction
6.1.1 System Overview
Our code has been implemented on the Linux operating system. The language used for coding is C++. The front end has been developed using PHP.

6.1.2 Test Approach
The input sequences have been taken from the GENBANK database. The output is tested manually in comparison to the details from the GENBANK database.

6.2 Test Cases
6.2.1 Case-1
6.2.1.1 Purpose
To test if the gene has been correctly identified with the start and stop codon appropriately marked.

6.2.1.2 Input
The input is the DNA sequence of Saccharomyces cerevisiae in FASTA format specified in a file of .fasta format. The sequence is:
>gi|1293613|gb|U49845.1|SCU49845 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAGACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAAGTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATATTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGAGCAGTACTCGAGCCCTGTCTCAAGAATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCACAACCTCAAAGCTCCTTGCCGAGAGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAACTTATTTTCTTATTCTTTACTCTCACATCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGAAGAACAGAACAATTACTTAATAGAAAAATTATATCTTCCTCGAAACGATTTCCTGCTTCCAACATCTACGTATATCAAGAAGCATTCACTTACCATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGTAGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAGTTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATACTCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGTCCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAACGGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCGCCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATAAACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTTCTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATCAACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGATCCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTGTCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGTTCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAAGCCAAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATTTCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACATCATTGGC
ATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAAGAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAAATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTTCATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTAGCTCTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATTAGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAACCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAGAAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGGTCTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTCTGATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGAAGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAACGTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAATCACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTTTTGCTGGGTCCATAGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAAATAAGAGTAATGTCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGATTATACGCAACGATATTTTGCTTAATTTTATTTTCCTGTTTTATTTTTTATTAGTGGTTTACAGATACCCTATATTTTATTTAGTTTTTATACTTAGAGACATTTAATTTTAATTCCATTCTTCAAATTTCATTTTTGCACTTAAAACAAAGATCCAAAAATGCTCTCGCCCTCTTCATATTGAGAATACACTCCATTCAAAATTTTGTCGTCACCGCTGATTAATTTTTCACTAAACTGATGAATAATCAAAGGCCCCACGTCAGAACCGACTAAAGAAGTGAGTTTTATTTTAGGAGGTTGAAAACCATTATTGTCTGGTAAATTTTCATCTTCTTGACATTTAACCCAGTTTGAATCCCTTTCAATTTCTGCTTTTTCCTCCAAACTATCGACCCTCCTGTTTCTGTCCAACTTATGTCCTAGTTCCAATTCGATCGCATTAATAACTGCTTCAAATGTTATTGTGTCATCGTTGACTTTAGGTAATTTCTCCAAATGCATAATCAAACTATTTAAGGAAGATCGGAATTCGTCGAACACTTCAGTTTCCGTAATGATCTGATCGTCTTTATCCACATGTTGTAATTCACTAAAATCTAAAACGTATTTTTCAATGCATAAATCGTTCTTTTTATTAATAATGCAGATGGAAAATCTGTAAACGTGCGTTAATTTAGAAAGAACATCCAGTATAAGTTCTTCTATATAGTCAATTAAAGCAGGATGCCTATTAATGGGAACGAACTGCGGCAAGTTGAATGACTGGTAAGTAGTGTAGTCGAATGACTGAGGTGGGTATACATTTCTATAAAATAAAATCAAATTAATGTAGCATTTTA
AGTATACCCTCAGCCACTTCTCTACCCATCTATTCATAAAGCTGACGCAACGATTACTATTTTTTTTTTCTTCTTGGATCTCAGTCGTCGCAAAAACGTATACCTTCTTTTTCCGACCTTTTTTTTAGCTTTCTGGAAAAGTTTATATTAGTTAAACAGGGTCTAGTCTTAGTGTGAAAGCTAGTGGTTTCGATTGACTGATATTAAGAAAGTGGAAATTAAATTAGTAGTGTAGACGTATATGCATATGTATTTCTCGCCTGTTTATGTTTCTACGTACTTTTGATTTATAGCAAGGGGAAAAGAAATACATACTATTTTTTGGTAAAGGTGAAAGCATAATGTAAAAGCTAGAATAAAATGGACGAAATAAAGAGAGGCTTAGTTCATCTTTTTTCCAAAAAGCACCCAATGATAATAACTAAAATGAAAAGGATTTGCCATCTGTCAGCAACATCAGTTGTGTGAGCAATAATAAAATCATCACCTCCGTTGCCTTTAGCGCGTTTGTCGTTTGTATCTTCCGTAATTTTAGTCTTATCAATGGGAATCATAAATTTTCCAATGAATTAGCAATTTCGTCCAATTCTTTTTGAGCTTCTTCATATTTGCTTTGGAATTCTTCGCACTTCTTTTCCCATTCATCTCTTTCTTCTTCCAAAGCAACGATCCTTCTACCCATTTGCTCAGAGTTCAAATCGGCCTCTTTCAGTTTATCCATTGCTTCCTTCAGTTTGGCTTCACTGTCTTCTAGCTGTTGTTCTAGATCCTGGTTTTTCTTGGTGTAGTTCTCATTATTAGATCTCAAGTTATTGGAGTCTTCAGCCAATTGCTTTGTATCAGACAATTGACTCTCTAACTTCTCCACTTCACTGTCGAGTTGCTCGTTTTTAGCGGACAAAGATTTAATCTCGTTTTCTTTTTCAGTGTTAGATTGCTCTAATTCTTTGAGCTGTTCTCTCAGCTCCTCATATTTTTCTTGCCATGACTCAGATTCTAATTTTAAGCTATTCAATTTCTCTTTGATC

6.2.1.3 Expected Output & Pass/ Fail Criteria
The expected output is:
Gene Recognition: (upper case represents the Gene/s) gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattgccgacatgagacagttaggtatcgtcgagagttacaagctaaaacgagcagtagtcagctctgcatctgaagccgctgaagttctactaagggtggataacatcatccgtgcaagaccaagaaccgccaatagacaacatatgtaacatatttaggatatacctcgaaaataataaaccgccacactgtcattattataattagaaacagaacgcaaaaattatccactatataattcaaagacgcgaaaaaaaaagaacaacgcgtcatagaacttttggcaattcgcgtcacaaataaattttggcaacttatgtttcctcttcgagcagtactcgagccctgtctcaagaatgtaataatacccatcgtaggtatggttaaagatagcatctccacaacctcaaagctccttgccgagagtcgccctcctttgtcgagtaattttcacttttcatatgagaacttattttcttattctttactctcacatcctgtagtgattgacactgcaacagccaccatcactagaagaacagaacaattacttaatagaaaaattatatcttcctcgaaacgatttcctgcttccaacatctacgtatatcaagaagcattcacttaccATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGTAGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAGTTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATACTCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGTCCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAACGGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCGCCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATAAACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTTCTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATCAACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGATCCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTGTCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGTTCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAAGCCAAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATTTCGACAAGCTTTCATTAGGTT
TGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACATCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAAGAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAAATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTTCATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTAGCTCTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATTAGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAACCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAGAAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGGTCTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTCTGATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGAAGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAACGTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAATCACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTTTTGCTGGGTCCATAGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAAATAAGAGTAATGTCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGAttatacgcaacgatattttgcttaattttattttcctgttttattttttattagtggtttacagataccctatattttatttagtttttatacttagagacatttaattttaattccattcttcaaatttcatttttgcacttaaaacaaagatccaaaaatgctctcgccctcttcatattgagaatacactccattcaaaattttgtcgtcaccgctgattaatttttcactaaactgatgaataatcaaaggccccacgtcagaaccgactaaagaagtgagttttattttaggaggttgaaaaccattattgtctggtaaattttcatcttcttgacatttaacccagtttgaatccctttcaatttctgctttttcctccaaactatcgaccctcctgtttctgtccaacttatgtcctagttccaattcgatcgcattaataactgcttcaaatgttattgtgtcatcgttgactttaggtaatttctccaaatgcataatcaaactatttaaggaagatcggaattcgtcgaacacttcagtttccgtaatgatctgatcgtctttatccacatgttgtaattcactaaaatctaaaacgtatttttcaatgcataaatcgttctttttattaataatgcagatggaaaatctgtaaacgtgcgttaatttagaaagaacatccagtataagttcttctatatagtcaattaaagcaggatgcctattaatgggaacgaactgcggcaagttgaatgactggtaagtagtgtagtcgaatga
ctgaggtgggtatacatttctataaaataaaatcaaattaatgtagcattttaagtataccctcagccacttctctacccatctattcataaagctgacgcaacgattactattttttttttcttcttggatctcagtcgtcgcaaaaacgtataccttctttttccgaccttttttttagctttctggaaaagtttatattagttaaacagggtctagtcttagtgtgaaagctagtggtttcgattgactgatattaagaaagtggaaattaaattagtagtgtagacgtatatgcatatgtatttctcgcctgtttatgtttctacgtacttttgatttatagcaaggggaaaagaaatacatactattttttggtaaaggtgaaagcataatgtaaaagctagaataaaatggacgaaataaagagaggcttagttcatcttttttccaaaaagcacccaatgataataactaaaatgaaaaggatttgccatctgtcagcaacatcagttgtgtgagcaataataaaatcatcacctccgttgcctttagcgcgtttgtcgtttgtatcttccgtaattttagtcttatcaatgggaatcataaattttccaatgaattagcaatttcgtccaattctttttgagcttcttcatatttgctttggaattcttcgcacttcttttcccattcatctctttcttcttccaaagcaacgatccttctacccatttgctcagagttcaaatcggcctctttcagtttatccattgcttccttcagtttggcttcactgtcttctagctgttgttctagatcctggtttttcttggtgtagttctcattattagatctcaagttattggagtcttcagccaattgctttgtatcagacaattgactctctaacttctccacttcactgtcgagttgctcgtttttagcggacaaagatttaatctcgttttctttttcagtgttagattgctctaattctttgagctgttctctcagctcctcatatttttcttgccatgactcagattctaattttaagctattcaatttctctttgatc

The gene should be correctly identified with the ATG and the stop codon TGA appropriately marked.

6.2.1.4 Test Procedure
The input file containing the DNA sequence is entered in the space provided and the program is run. The output that is obtained is matched with the GENBANK data to verify if the gene has been correctly identified.
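Part of this manual comparison could be mechanized. The sketch below is an illustrative Python helper, not part of the project; it assumes the output convention used in this chapter, where upper case marks the predicted gene and lower case marks everything else. The function names and the toy annotated string are hypothetical.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def extract_genes(annotated):
    """Return (start, end, sequence) for each upper-case run.

    Positions are 0-based with an exclusive end, so a GenBank-style
    1-based inclusive range would be (start + 1, end).
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"[ACGT]+", annotated)]


def looks_like_complete_gene(gene):
    """Check that a predicted gene starts with ATG, ends with a stop codon
    and consists of whole codons."""
    return gene.startswith("ATG") and gene[-3:] in STOP_CODONS and len(gene) % 3 == 0


# Hypothetical annotated output, not taken from the test cases in this chapter.
annotated = "ccATGAAATGAttATGCCCTAAgg"
genes = extract_genes(annotated)
for start, end, gene in genes:
    print(start + 1, end, gene, looks_like_complete_gene(gene))
```

The printed ranges can then be compared with the coding-sequence coordinates listed in the corresponding GENBANK record.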

6.2.1.5 Test Results
The gene has been correctly identified by the program with the ATG and TGA correctly marked as verified from the GENBANK record.

6.2.2 Case-2
6.2.2.1 Purpose
To test if the gene has been correctly identified with the start and stop codon appropriately marked when the sequence ends with the gene.

6.2.2.2 Input
The input is the DNA sequence of Saccharomyces cerevisiae in FASTA format specified in a file of .fasta format. The sequence is:
>gi|296148533|ref|NM_001183937.1| Saccharomyces cerevisiae S288c Rny1p (RNY1) mRNA, complete cds
ATGTTACTGAAAAACTTACACAGTCTCTTACAACTACCAATTTTTTCGAATGGAGCAGATAAGGGTATAGAACCAAACTGCCCTATAAACATTCCATTATCATGTTCCAATAAAACTGATATAGACAACTCGTGTTGTTTTGAATATCCAGGTGGAATATTTTTACAAACCCAATTCTGGAATTACTTTCCAAGCAAAAACGATTTAAATGAAACTGAATTAGTGAAGGAGTTAGGGCCTCTAGATTCATTTACAATTCACGGATTATGGCCAGATAATTGTCATGGTGGCTACCAACAATTCTGTAATAGGTCCTTACAAATTGACGATGTTTACTACTTATTGCATGACAAGAAATTTAATAATAATGATACATCCCTGCAAATATCGGGCGAAAAGCTGCTTGAATACCTAGACTTATATTGGAAGAGTAATAACGGGAATCATGAGTCCTTATGGATACACGAGTTTAATAAACATGGCACGTGCATTAGCACAATTAGACCAGAGTGCTATACTGAGTGGGGTGCTAATAGTGTTGACAGAAAAAGAGCGGTCTATGATTATTTTAGAATAACTTATAATCTATTCAAGAAATTGGACACATTTTCAACACTAGAAAAAAATAATATTGTCCCAAGTGTGGACAATTCCTATTCTTTGGAGCAGATAGAGGCAGCACTAAGTAAAGAGTTTGAAGGAAAAAAAGTCTTCATAGGCTGTGATAGACATAATTCCTTAAACGAAGTATGGTATTATAACCACTTGAAGGGTTCCCTTTTGAGCGAAATGTTTGTGCCCATGGACTCACTTGCCATTCGAACAAATTGTAAAAAAGATGGTATTAAGTTTTTTCCAAAAGGTTATGTCCCAACTTTCAGGAGGAGACCTAATAAGGGAGCAAGATACAGAGGAGTCGTTCGTCTATCAAATATTAATAATGGAGATCAGATGCAAGGCTTTCTAATCAAGAATGGACACTGGATGAGTCAAGGTACACCAGCGAATTACGAGTTGATTAAATCTCCCTATGGGAATTACTACTTGAGAACTAACCAAGGGTTTTGTGACATTATTTCGTCTTCATCTAATGAATTGGTCTGCAAATTCAGGAACATTAAGGATGCAGGTCAATTCGATTTTGATCCAACGAAAGGAGGAGACGGATATATTGGTTATTCTGGTAACTACAACTGGGGCGGTGACACCTATCCAAGGAGAAGGAATCAAAGCCCCATTTTCTCTGTAGACGATGAACAAAATTCCAAGAAATATAAGTTTAAATTAAAATTCATCAAAAATTAA

6.2.2.3 Expected Output & Pass/ Fail Criteria
The expected output is:
Gene Recognition: (upper case represents the Gene/s)
ATGTTACTGAAAAACTTACACAGTCTCTTACAACTACCAATTTTTTCGAATGGAGCAGATAAGGGTATAGAACCAAACTGCCCTATAAACATTCCATTATCATGTTCCAATAAAACTGATATAGACAACTCGTGTTGTTTTGAATATCCAGGTGGAATATTTTTACAAACCCAATTCTGGAATTACTTTCCAAGCAAAAACGATTTAAATGAAACTGAATTAGTGAAGGAGTTAGGGCCTCTAGATTCATTTACAATTCACGGATTATGGCCAGATAATTGTCATGGTGGCTACCAACAATTCTGTAATAGGTCCTTACAAATTGACGATGTTTACTACTTATTGCATGACAAGAAATTTAATAATAATGATACATCCCTGCAAATATCGGGCGAAAAGCTGCTTGAATACCTAGACTTATATTGGAAGAGTAATAACGGGAATCATGAGTCCTTATGGATACACGAGTTTAATAAACATGGCACGTGCATTAGCACAATTAGACCAGAGTGCTATACTGAGTGGGGTGCTAATAGTGTTGACAGAAAAAGAGCGGTCTATGATTATTTTAGAATAACTTATAATCTATTCAAGAAATTGGACACATTTTCAACACTAGAAAAAAATAATATTGTCCCAAGTGTGGACAATTCCTATTCTTTGGAGCAGATAGAGGCAGCACTAAGTAAAGAGTTTGAAGGAAAAAAAGTCTTCATAGGCTGTGATAGACATAATTCCTTAAACGAAGTATGGTATTATAACCACTTGAAGGGTTCCCTTTTGAGCGAAATGTTTGTGCCCATGGACTCACTTGCCATTCGAACAAATTGTAAAAAAGATGGTATTAAGTTTTTTCCAAAAGGTTATGTCCCAACTTTCAGGAGGAGACCTAATAAGGGAGCAAGATACAGAGGAGTCGTTCGTCTATCAAATATTAATAATGGAGATCAGATGCAAGGCTTTCTAATCAAGAATGGACACTGGATGAGTCAAGGTACACCAGCGAATTACGAGTTGATTAAATCTCCCTATGGGAATTACTACTTGAGAACTAACCAAGGGTTTTGTGACATTATTTCGTCTTCATCTAATGAATTGGTCTGCAAATTCAGGAACATTAAGGATGCAGGTCAATTCGATTTTGATCCAACGAAAGGAGGAGACGGATATATTGGTTATTCTGGTAACTACAACTGGGGCGGTGACACCTATCCAAGGAGAAGGAATCAAAGCCCCATTTTCTCTGTAGACGATGAACAAAATTCCAAGAAATATAAGTTTAAATTAAAATTCATCAAAAATTAA

The entire sequence should be in capitals since it is the entire gene.

6.2.2.4 Test Procedure
The input file containing the DNA sequence is entered in the space provided and the program is run. The output that is obtained is matched with the GENBANK data to verify if the gene has been correctly identified.

6.2.2.5 Test Results
The program did not correctly identify the gene, with its ATG and TAA codons, when the output was verified against the GENBANK record. Hence, this is a fail case.

Chapter 7
CONCLUSION & FUTURE ENHANCEMENTS

The previous chapters describe the model that we have used to find genes in given DNA sequences. Our algorithm has been tested on Saccharomyces cerevisiae, Drosophila melanogaster, Zea mays (maize) and Homo sapiens sequences. It has been found to work only partially on the Homo sapiens sequences and to achieve about 80-90% success on the other sequences.

A lot of work remains to be done in this field. The genes of many other organisms still need to be identified, and with greater efficiency. This program is not able to identify partial genes, and a significant future enhancement would be to support them. Multiple genes are handled only to a certain extent. The recognition of promoter sequences is a daunting task that future enhancements could also address. Introns and exons also need to be identified with greater efficiency.

