
Meet the Author

Roshan is an undergraduate student in Information and Communication Technologies at the Vavuniya Campus of the University of Jaffna. He is very interested in IT-related topics and technical subjects. Roshan is also a creative mind with lots of ideas and a potential writer.


Sunday, December 14, 2014

Chou-Fasman Algorithm for Protein Structure Prediction




You can get the related slide show from SlideShare by clicking here!

1. INTRODUCTION


Protein structure determination and prediction has been a vital area in the field of bioinformatics due to the importance of protein structure in understanding the biological and chemical activities of organisms. Understanding proteins is important in many settings, such as finding cures for illnesses, designing new chemical formulas, and studies on food and nutrition. The experimental methods used by biotechnologists to determine the structures of proteins demand sophisticated equipment and time. A host of computational methods have been developed to predict the location of secondary structure elements in proteins, either to complement experimental results or to provide insight into them. The Chou-Fasman algorithm is an empirical algorithm developed for the prediction of protein secondary structure.



1.1 Proteins


Proteins are complex organic compounds that consist of amino acids joined by peptide bonds. Proteins are essential to the structure and function of all living cells and viruses. Many proteins function as enzymes or form subunits of enzymes. Their roles differ according to their structure: some proteins play structural or mechanical roles, while others function in the immune response or in the storage and transport of various ligands. Proteins serve as nutrients as well; they provide the organism with the amino acids that it cannot synthesize itself. Proteins are amongst the most actively studied molecules in biochemistry. An amino acid is any molecule that contains both an amino group and a carboxylic acid group. An amino acid residue is what remains of an amino acid after it forms a peptide bond and loses a water molecule. Since we are interested in amino acids that form proteins, it is safe to use the terms residue and amino acid interchangeably. There are 20 different standard amino acids in nature that form proteins.



Examples of proteins:

a) Protective proteins, for example, keratin (nails).

b) Defence proteins, for example, antibodies.

c) Toxins, for example, snake venom.

d) Structural proteins, for example, collagen of bones.

e) Enzymes (biocatalysts), for example, pepsin, trypsin.

f) Hormones, for example, insulin is a protein.







1.2 Structure of Protein


-NH2 (amino group, amino acid 1) + -COOH (carboxylic group, amino acid 2) = -CONH- (peptide bond)


A chain of amino acids joined by such peptide bonds is called a polypeptide, and a protein is made of one or more polypeptides.

Amino acids are the basic building blocks of proteins. Fundamentally, amino acids are joined together by peptide bonds to form the basic structure of proteins.

Amino acids play central roles both as building blocks of proteins and as intermediates in metabolism. The 20 amino acids that are found within proteins convey a vast array of chemical versatility.

The chemical properties of the amino acids of proteins determine the biological activity of the protein. In addition, proteins contain within their amino acid sequences the necessary information to determine how that protein will fold into a three-dimensional structure, and the stability of the resulting structure.

1.3 Amino Acids


Fig. 1 A generic Amino acid Structure
Amino acids bind together in chains to form the material from which life is built. It is a two-step process: amino acids first join to form peptides or polypeptides, and proteins are then made from these groupings. Commonly recognized amino acids include glutamine, glycine, phenylalanine, tryptophan, and valine. Three of those (phenylalanine, tryptophan, and valine) are essential amino acids for humans; the others include isoleucine, leucine, lysine, methionine, and threonine.

Amino acids are carbon compounds that contain two functional groups: an amino group (NH2) and a carboxylic acid group (COOH). A side chain attached to the compound, the R group, differs from one amino acid to another and gives each amino acid a unique set of characteristics.

2. INVESTIGATING THE PROTEIN STRUCTURE  

Protein structure is investigated at four levels:
Fig. 2 Different representations of protein structure




•     Primary Structure is the sequence of amino acids in the protein. Counting of residues always starts at the N-terminal end (NH2 group), which is the end where the amino group is not involved in a peptide bond. The primary structure of a protein is determined by the gene corresponding to the protein.

•     Secondary Structure is the composition of common patterns in the protein. Some patterns are frequently observed in the native states of proteins. This structure level records which regions of the protein adopt these patterns, but it does not include the coordinates of residues.

•     Tertiary Structure is the native state, or folded form, of a single protein chain. This form is also called the functional form. The tertiary structure of a protein includes the coordinates of its residues in three-dimensional space. The elements of secondary structure are usually folded into a compact shape using a variety of loops and turns.

•     Quaternary Structure is the structure of a protein complex. Some proteins form a large assembly in order to function. This form includes the position of the protein subunits of the assembly with respect to each other.



3. SECONDARY STRUCTURE PREDICTION

Given a protein sequence with amino acids a1, a2, ..., an, the secondary structure prediction problem is to predict whether each amino acid ai is in an α-helix, a β-sheet, or neither. If we know (say, through structural studies) the actual secondary structure for each amino acid, then the 3-state accuracy is the percentage of residues for which our prediction matches reality. It is called "3-state" because each residue can be in one of 3 "states": α, β, or other (O). Because there are only 3 states, random guessing would yield a 3-state accuracy of about 33%, assuming that all structures are equally likely. There are different methods of prediction with various accuracies. Some of these methods are:
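Before turning to the individual methods, the 3-state accuracy defined above can be made concrete with a short sketch. The function name, the one-letter state codes (H, E, O) and the two example strings are assumptions made for illustration only; they are not data from a real protein.

```python
def three_state_accuracy(predicted: str, actual: str) -> float:
    """Percentage of residues whose predicted state matches the actual state."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("the state strings must not be empty")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)


# Hypothetical example: H = alpha-helix, E = beta-sheet, O = other.
actual = "HHHHOOEEEE"
predicted = "HHHOOOEEOE"
print(three_state_accuracy(predicted, actual))  # 8 of 10 residues match -> 80.0
```

A predictor that guessed uniformly at random among the three states would score about 33% on this measure, which is the baseline mentioned above.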



3.1 GOR Method

The GOR method is named for the three scientists who developed it: Garnier, Osguthorpe, and Robson. It uses the information carried by a residue about its own secondary structure, in combination with the information carried by other residues in a local window of eight residues on either side of the residue concerned.

The accuracy of these early methods based on the local amino acid composition of single sequences was fairly low, with often less than 60% of residues being predicted in the correct secondary structure state.

3.2 PHD


The neural net model employed by Rost and Sander was fairly complex and computationally expensive. Because of the computational demands, a 7-fold cross-validation was used in place of jack-knife testing. Accuracy was over 70% using multiple sequence alignment, and the fifth of residues with the highest reliability was predicted with over 90% accuracy. Rost and Sander also tested PHD on 26 new proteins, none with significant sequence similarity to any protein in the training set, and found comparable results. PHD, however, suffers from some problems. Rost and Sander were concerned with overtraining and therefore terminated training once the accuracy was higher than 70% for all training samples.

3.3 Chou-Fasman Method


The Chou-Fasman method was among the first secondary structure prediction algorithms developed and relies predominantly on probability parameters determined from the relative frequencies of each amino acid's appearance in each type of secondary structure. Propensities in this post are scaled by 100, so a value of 100 corresponds to 1.00 on the unscaled scale. In this method, a helix is predicted if, in a run of six residues, four are helix favouring and the average value of the helix propensity is greater than 100 and greater than the average strand propensity. Such a helix is extended along the sequence until a proline (a helix breaker) is encountered or a run of 4 residues with helical propensity less than 100 is found. A strand is predicted if, in a run of 5 residues, three are strand favouring, and the average value of the strand propensity is greater than 104 (1.04 unscaled) and greater than the average helix propensity. Such a strand is extended along the sequence until a run of 4 residues with strand propensity less than 100 is found.

3.4 Data Mining Model used for implementation of the Chou-Fasman method


As a part of the larger process known as knowledge discovery, data mining is the process of extracting information from large volumes of data. This is achieved through the identification and analysis of relationships and trends within commercial databases. Data mining is used in areas as diverse as space exploration and medical research. Here we gather data from known protein structures, and the predictions are based on the data we already have: we compare the parameter values of the protein whose structure we want to predict against that data, and predict its structure accordingly.



4. CHOU-FASMAN METHOD FOR PROTEIN STRUCTURE PREDICTION

The Chou-Fasman algorithm for the prediction of protein secondary structure is one of the most widely used predictive schemes. The Chou-Fasman method of secondary structure prediction depends on assigning a set of prediction values to a residue and then applying a simple algorithm to the conformational parameters and positional frequencies. The Chou-Fasman algorithm is simple in principle.

The conformational parameters for each amino acid were calculated by considering the relative frequency of a given amino acid within a protein, its occurrence in a given type of secondary structure, and the fraction of residues occurring in that type of structure. These parameters are measures of a given amino acid's propensity (its preference to be found in helix, sheet or coil). Using these conformational parameters, one finds nucleation sites within the sequence and extends them until a stretch of amino acids is encountered that is not disposed to occur in that type of structure, or until a stretch is encountered that has a greater disposition for another type of structure. At that point, the structure is terminated. This process is repeated throughout the sequence until the entire sequence is predicted.



4.1 Propensity value

Predicting the secondary structure of a protein from its primary sequence using the Chou-Fasman method requires knowledge of the propensity values. Simply put, a propensity value is the tendency of an amino acid to be present in an α-helix or a β-sheet. Suppose an amino acid has a higher propensity value for α, P(α), than for β. That means the amino acid is more likely to be present in an α-helix than in a β-sheet. The same holds for β-sheets. The propensity value is written as P, so the propensity value for β is P(β), and likewise for the other structures.



4.2 Calculation of the propensity value
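The description in Section 4 (relative frequency of an amino acid, its occurrence in a given type of structure, and the fraction of all residues in that structure) can be sketched as a ratio of two frequencies. This is a minimal sketch under that assumed reading, scaled by 100 to match the values used in this post; the function name and the two toy sequences with their state labels are invented for illustration and are not real structural data.

```python
from collections import Counter


def propensities(sequences, structures, state):
    """Return {residue: propensity x 100} for one secondary-structure state.

    Propensity = (fraction of this residue found in the state)
                 / (fraction of all residues found in the state).
    """
    residue_total = Counter()
    residue_in_state = Counter()
    for sequence, structure in zip(sequences, structures):
        if len(sequence) != len(structure):
            raise ValueError("each sequence needs one state label per residue")
        for residue, label in zip(sequence, structure):
            residue_total[residue] += 1
            if label == state:
                residue_in_state[residue] += 1

    all_residues = sum(residue_total.values())
    all_in_state = sum(residue_in_state.values())
    if all_in_state == 0:
        raise ValueError("no residue carries the requested state")
    overall_fraction = all_in_state / all_residues

    return {
        residue: 100.0 * (residue_in_state[residue] / count) / overall_fraction
        for residue, count in residue_total.items()
    }


# Toy data (hypothetical): H = helix, O = other.
sequences = ["AAGG", "AGAG"]
structures = ["HHOO", "HOHO"]
print(propensities(sequences, structures, "H"))  # {'A': 200.0, 'G': 0.0}
```

In the toy data every A is helical while only half of all residues are, so A gets a propensity of 200; G never appears in a helix and gets 0. A value above 100 therefore means the residue is over-represented in that structure.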



4.3.1 α-Helix/β-Sheet Nucleation

Nucleation concerns the tendency of a stretch of amino acids to start forming a helix or a sheet. It depends on how many α-helix/β-sheet makers and α-helix/β-sheet breakers there are in the amino acid sequence we study. An amino acid acts as a maker or a breaker because of its R group: the R group determines whether the particular amino acid is found mostly in α-helices or in β-sheets. We use this concept to determine the secondary structure in the Chou-Fasman method.

       The rule for an α-helix is:
            If 6 contiguous residues have
                                 more than 1/3 (>1/3, in this case 2) α-helix breakers, they should not form an α-helix;
                                 fewer than 1/2 (<1/2, in this case 3) α-helix makers, they should not form an α-helix.
       The rule for a β-sheet is:
            If 5 contiguous residues have
                                 more than 1/3 (>1/3) β-sheet breakers, they should not form a β-sheet;
                                 fewer than 1/2 (<1/2) β-sheet makers, they should not form a β-sheet.



4.3.2 α-Helix/β-Sheet Termination


This step finds the point in the sequence where an α-helix or a β-sheet ends. Here too we take a window of 6 contiguous residues and apply the nucleation rule above. We then extend the selected set of residues in both directions, adding one amino acid at a time. When we reach 4 consecutive residues whose propensity values are less than 100, we say that the α-helix or β-sheet ends there, and we move on to look for a new segment.



4.3.3 α-Helix/β-Sheet overlapping comparison


When a particular segment of the sequence is assigned to both an α-helix and a β-sheet, we need to determine which of the two it is more likely to adopt. This is decided using the Pα and Pβ values: if Pα is higher, we say the segment is α; if Pβ is higher, it is β.


So the conditions we check in this method are:

•    the number of helix makers (Hα) against the number of helix breakers (Bα);
•    the average Pα or Pβ against a threshold of about 1.03 (103 on the scale used in this post);
•    the average Pα against the average Pβ.

If we find four consecutive residues with Pα or Pβ less than 100, we say the segment ends there and is assigned its structure (α, β, or possibly neither).
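The overlap comparison described in this section can be sketched as a comparison of two averages. The P(α) and P(β) values below are the standard Chou-Fasman parameters as listed in Table I (D is aspartic acid and N is asparagine); the function name and the two example segments are assumptions for illustration. Ties are resolved in favour of β here, which is an arbitrary choice that the text does not specify.

```python
# P(alpha) and P(beta) per residue, scaled by 100 (values from Table I).
P_ALPHA = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}
P_BETA = {
    "A": 83, "R": 93, "D": 54, "N": 89, "C": 119, "E": 37, "Q": 110,
    "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
    "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170,
}


def resolve_overlap(segment: str) -> str:
    """Return 'alpha' or 'beta' for a segment that was assigned to both."""
    if not segment:
        raise ValueError("segment must not be empty")
    average_alpha = sum(P_ALPHA[residue] for residue in segment) / len(segment)
    average_beta = sum(P_BETA[residue] for residue in segment) / len(segment)
    # Assumption: a tie goes to beta; the text leaves this case open.
    return "alpha" if average_alpha > average_beta else "beta"


print(resolve_overlap("AEMLKA"))  # alpha (average P(alpha) ~135.8 vs P(beta) ~85.3)
print(resolve_overlap("VIVIYV"))  # beta  (average P(alpha) 100.5 vs P(beta) ~162.8)
```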



4.4 The Algorithm


The Chou-Fasman method of secondary structure prediction depends on assigning a set of prediction values to a residue and then applying a simple algorithm to those numbers.

The algorithm contains the following steps:
(a) Assign parameter values to all residues of the peptide.

(b) Scan the peptide and identify regions where 4 out of 6 contiguous residues have P(α) > 100. These regions nucleate α-helices. Extend these in both directions until a set of four contiguous residues has an average P(α) < 100. This ends the helix.

(c) Scan the peptide and identify regions where 3 out of 5 contiguous residues have P(β) > 100. These residues nucleate β-strands. Extend these in both directions until a set of four contiguous residues has an average P(β) < 100. This ends the strand.

(d) Any region containing overlapping α and β assignments is taken to be helical or β depending on which of the average P(α) and P(β) for that region is larger. If this reduces an α or β region to fewer than 5 residues, the α or β assignment for that region is removed.

(e) To identify a β-turn at residue number i, the product p(t) = f(i)f(i+1)f(i+2)f(i+3) is calculated, where the f(i) value is used for residue i, the f(i+1) value for residue i+1, the f(i+2) value for residue i+2 and the f(i+3) value for residue i+3. To predict a β-turn, the following three conditions have to be fulfilled simultaneously:

•    p(t) > 0.000075.
•    The average value of P(turn) for the four amino acids is greater than 100.
•    The average P(turn) is larger than the average P(α) as well as the average P(β).

(f) The remaining parts of the sequence without an assignment are considered coils.
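Step (e) can be illustrated with a short sketch that checks the three β-turn conditions for a single window of four residues. Only the four rows of Table I needed for the example are copied in; the function name and the example tetrapeptide NPGS are assumptions chosen for illustration.

```python
# Rows from Table I for the residues used in the example:
# residue: (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3))
PARAMS = {
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
}


def is_beta_turn(tetrapeptide: str):
    """Return (p_t, predicted) for a window of exactly four residues."""
    if len(tetrapeptide) != 4:
        raise ValueError("a beta-turn window has exactly four residues")
    rows = [PARAMS[residue] for residue in tetrapeptide]

    # p(t): the residue at offset k contributes its f(i+k) value.
    p_t = 1.0
    for offset, row in enumerate(rows):
        p_t *= row[3 + offset]

    average_alpha = sum(row[0] for row in rows) / 4
    average_beta = sum(row[1] for row in rows) / 4
    average_turn = sum(row[2] for row in rows) / 4

    predicted = (
        p_t > 0.000075
        and average_turn > 100
        and average_turn > average_alpha
        and average_turn > average_beta
    )
    return p_t, predicted


p_t, predicted = is_beta_turn("NPGS")
print(f"{p_t:.6f}", predicted)  # 0.000976 True
```

For NPGS the product is 0.161 × 0.301 × 0.190 × 0.106 ≈ 0.000976, the average P(turn) is 151.75, and the average P(α) and P(β) are 64.5 and 73.5, so all three conditions hold.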


TABLE I: CONFORMATIONAL PARAMETERS AND POSITIONAL FREQUENCIES FOR α-HELIX, β-SHEET AND TURN RESIDUES

Name              P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
A-Alanine         142   83    66       0.060  0.076   0.035   0.058
R-Arginine        98    93    95       0.070  0.106   0.099   0.085
D-Aspartic acid   101   54    146      0.147  0.110   0.179   0.081
N-Asparagine      67    89    156      0.161  0.083   0.191   0.091
C-Cysteine        70    119   119      0.149  0.050   0.117   0.128
E-Glutamic acid   151   37    74       0.056  0.060   0.077   0.064
Q-Glutamine       111   110   98       0.074  0.098   0.037   0.098
G-Glycine         57    75    156      0.102  0.085   0.190   0.152
H-Histidine       100   87    95       0.140  0.047   0.093   0.054
I-Isoleucine      108   160   47       0.043  0.034   0.013   0.056
L-Leucine         121   130   59       0.061  0.025   0.036   0.070
K-Lysine          114   74    101      0.055  0.115   0.072   0.095
M-Methionine      145   105   60       0.068  0.082   0.014   0.055
F-Phenylalanine   113   138   60       0.059  0.041   0.065   0.065
P-Proline         57    55    152      0.102  0.301   0.034   0.068
S-Serine          77    75    143      0.120  0.139   0.125   0.106
T-Threonine       83    119   96       0.086  0.108   0.065   0.079
W-Tryptophan      108   137   96       0.077  0.013   0.064   0.167
Y-Tyrosine        69    147   114      0.082  0.065   0.114   0.125
V-Valine          106   170   50       0.062  0.048   0.028   0.053
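Using the P(a) column of Table I, step (b) of the algorithm (helix nucleation and extension) can be sketched as follows. This is one possible reading of the step, not a definitive implementation: the extension rule used here (grow the region by one residue while the four-residue window at its edge still averages at least 100) is an assumption, the proline helix-breaker rule from Section 3 is not applied, and the function name and test sequence are invented for illustration.

```python
# P(alpha) per residue, scaled by 100 (P(a) column of Table I).
P_ALPHA = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}


def mean(values):
    return sum(values) / len(values)


def predict_helices(sequence, window=6, needed=4, threshold=100):
    """Mark helix residues with 'H' and everything else with '-'."""
    p = [P_ALPHA[residue] for residue in sequence]
    n = len(p)
    helix = [False] * n

    for start in range(n - window + 1):
        # Nucleation: at least `needed` residues in the window favour a helix.
        favouring = sum(value > threshold for value in p[start:start + window])
        if favouring < needed:
            continue

        left, right = start, start + window  # half-open region [left, right)
        # Extend right while the four residues ending at the new one average >= threshold.
        while right < n and mean(p[right - 3:right + 1]) >= threshold:
            right += 1
        # Extend left while the four residues starting at the new one average >= threshold.
        while left > 0 and mean(p[left - 1:left + 3]) >= threshold:
            left -= 1

        for index in range(left, right):
            helix[index] = True

    return "".join("H" if flag else "-" for flag in helix)


print(predict_helices("GGGGEEEEEEGGGG"))  # --HHHHHHHHHH--
```

In the example, the run of six glutamic acids (P(α) = 151) nucleates a helix, and the extension absorbs two glycines on each side before the four-residue average drops to 80.5. Step (c) for β-strands follows the same pattern with a window of 5, 3 favouring residues and the P(b) column.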

4.5 Choice of sequence format

Amino acid sequences come in various formats, and each has its own character set and uses. To get a deeper understanding and better results, it is essential to choose a valid input format. The various formats are:

•    Plain text format
•    FASTA format
•    Genetic Computer Group Format (GCG)
•    NEXUS
•    NBRF/PIR


Example: plain text format

Plain text format looks like the following:

MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTH TSTMDAQEVETIWTILPAIILILIALPSLRILYMMDEINNPSTVKTMGHQWY WSYEYTDYEDLSFDSYMIPTSELKPGELRLLEVDNRVVLPMEAAQQEEEE MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTH TSTMDAQEVETIWTILPAIILILIALPSL RILYMMDEINNPSTVKTMGHQWY WSYEYTDYEDLSF DSYMIPTSELKPGELRLLEVDNRVVLPMEAAQQE.
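Whichever format is chosen, the input has to be reduced to a clean string of one-letter residue codes before the parameters can be looked up. The sketch below handles only the two simplest cases, plain text and a single-record FASTA file; the function name and the short example inputs are assumptions for illustration, and the other formats listed above would need their own parsers.

```python
# The 20 standard amino acids, one-letter codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def read_sequence(text: str) -> str:
    """Return an upper-case residue string from plain text or single-record FASTA."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and FASTA header lines, which start with '>'.
        if not line or line.startswith(">"):
            continue
        # Remove spaces inside the line (wrapped plain text often has them).
        parts.append("".join(line.split()).upper())

    sequence = "".join(parts)
    invalid = sorted(set(sequence) - VALID_RESIDUES)
    if invalid:
        raise ValueError(f"unexpected characters in sequence: {invalid}")
    return sequence


print(read_sequence("MAYPMQ LGF"))                             # MAYPMQLGF
print(read_sequence(">example header\nMAYPM QLGF\nqdat\n"))    # MAYPMQLGFQDAT
```

Rejecting unknown characters early is a deliberate choice here: a stray digit or punctuation mark would otherwise surface later as a missing key when the propensity table is consulted.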

5 RESULTS AND DISCUSSION


For a given sequence of amino acids, this technique first clusters the amino acids, and these amino acid clusters are then analyzed to predict the structure of the protein. The user inputs the primary structure of the protein, i.e. the amino acid sequence. The clusters of amino acids are extended until an α-helix, a β-sheet or a turn is predicted, using the conformational parameters and positional frequencies for α-helix, β-sheet and turn residues.

The method is illustrated with the following example:


INPQAIFDIQIKRLHEYKRQHHDKQVHMANLCVVGGFA VNGVAALHSDLVVKDLFPEYHQLWPNKFHNVTNGITP RRWIKQCNPALAALLDKSLQKEWANDLDQLINLVKLA DDAKFRQLYRVIKQANKVRLAEFVKVRTIDLNLLHILA LYKERIRENP


The above sequence is divided into clusters, and the conformational parameters and positional frequencies for α-helix, β-sheet and turn residues are looked up in Table I.

For example, for the first 6 residues the Pα values do not satisfy the α-helix condition (Pα > 100), so the segment may not be in an α-helix, and the Pβ values do not satisfy the β-sheet condition (Pβ > 100), so it may not be in a β-sheet either. We then consider it for turns, and we can take the structure to be a turn. To determine the end points, we start from a chosen window in the sequence and apply the algorithm in both directions until we meet the termination criteria. The output after assigning the structure is shown below.

Hence the final secondary structure of the given sequence is (A = α-helix, B = β-sheet, T = turn):



TTTBBBBBBBBBBBBBTTTTAAAAAAABBBBBBBTTTTTTTTTTTTTTTTTTTTTBBBBBTTAA AAAAAAAAAAAAAAATTBBBBBBTTTTTTTTTTTTBBBBBBTTTTAAAAAAAATTTTTTTTB BBBTTTT
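A prediction string like the one above is easier to read when collapsed into segments. The following is a small sketch for doing so; the function name is an assumption, and the short example string is invented rather than taken from the result above.

```python
from itertools import groupby


def segments(states: str):
    """Collapse a per-residue state string into (state, length) runs."""
    return [(label, len(list(run))) for label, run in groupby(states)]


# Hypothetical short prediction: T = turn, B = beta-sheet, A = alpha-helix.
print(segments("TTTBBAAT"))  # [('T', 3), ('B', 2), ('A', 2), ('T', 1)]
```

Applied to a full prediction (with any spaces from line wrapping removed first), the run lengths make it easy to spot assignments shorter than 5 residues, which step (d) of the algorithm says should be removed.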



6 CONCLUSION


The method attempts to classify each amino acid in a protein sequence according to its predicted local structure, which can be subdivided into three states: α-helix, β-sheet or turn.

•     Protein fold can be predicted with better accuracy using this technique.
•     Various other data mining techniques can be used to look for an optimum result.
•     Various formats of amino acid sequences can be used as input.
•     Protein structure and protein function prediction can be done based on an improved Chou-Fasman method, which considers 4 amino acids at a time to identify a reverse β-turn.

7 FUTURE SCOPE OF WORK


There is still a lot of research going on into improving the Chou-Fasman algorithm, and several modified algorithms can be found with a search.

The following improvements to the developed bioinformatics model can be made:

•     The system can be extended to predict the tertiary structure of the protein.
•     Various different mining techniques can be utilized to determine the optimum result.
•     Different formats of amino acid sequences can be utilized.
•     Protein fold can be predicted with better accuracy using this technique.
•     This technique can be further extended for multiple sequence alignment.

Roshan Karunarathna, Web Developer, Programmer
