Chou-Fasman Algorithm for Protein Structure Prediction
1. INTRODUCTION
Protein structure is important for understanding the biological and chemical activities of organisms. Knowledge of proteins matters in many settings, such as finding cures for illnesses, designing new chemical formulas, and studies on food and nutrition. The experimental methods used by biotechnologists to determine the structures of proteins demand sophisticated equipment and time. A host of computational methods have been developed to predict the location of secondary structure elements in proteins, to complement experimental results or give insight into them. The Chou-Fasman algorithm is an empirical algorithm developed for the prediction of protein secondary structure.
1.1 Proteins
Examples of proteins:
a) Protective Proteins, for example, keratin (nails).
b) Defence Proteins, for example, antibodies.
c) Toxins, for example, snake venom.
d) Structural Proteins, for example, collagen of bones.
e) Enzymes (biocatalysts), for example, pepsin, trypsin.
f) Hormones, for example, insulin.
1.2 Structure of Protein
-NH2 (amino group, amino acid 1) + -COOH (carboxylic group, amino acid 2) = -CONH- (peptide bond)
Amino acids are the basic building blocks of proteins. Fundamentally, amino acids are joined together by peptide bonds to form the basic structure of proteins.
Amino acids play central roles both as building blocks of proteins and as intermediates in metabolism. The 20 amino acids that are found within proteins convey a vast array of chemical versatility.
The chemical properties of the amino acids of proteins determine the biological activity of the protein. In addition, proteins contain within their amino acid sequences the necessary information to determine how that protein will fold into a three dimensional structure, and the stability of the resulting structure.
1.3 Amino Acids
Fig. 1 A generic amino acid structure
Amino acids join together to form peptides or polypeptides, and it is from these groupings that proteins are made. Commonly recognized amino acids include glutamine, glycine, phenylalanine, tryptophan, and valine. Three of those (phenylalanine, tryptophan, and valine) are essential amino acids for humans; the others are isoleucine, leucine, lysine, methionine, and threonine.
Amino acids are carbon compounds that contain two functional groups: an amino group (NH2) and a carboxylic acid group (COOH). A side chain, the R group, is also attached to the compound. It differs from one amino acid to another and gives each amino acid a unique set of characteristics.
2. INVESTIGATING THE PROTEIN STRUCTURE
Fig. 2 Different representations of protein structure
• Primary Structure is the sequence of amino acids in the protein. Counting of residues always starts at the N-terminal end (NH2 group), which is the end where the amino group is free, that is, not involved in a peptide bond. The primary structure of a protein is determined by the gene corresponding to the protein.
• Secondary Structure is the set of common patterns in the protein. Some patterns are frequently observed in the native states of proteins. This level of structure identifies the regions of the protein that form these patterns, but it does not include the coordinates of residues.
• Tertiary Structure is the native state, or folded form, of a single protein chain. This form is also called the functional form. The tertiary structure of a protein includes the coordinates of its residues in three-dimensional space. The elements of secondary structure are usually folded into a compact shape using a variety of loops and turns.
• Quaternary Structure is the structure of a protein complex. Some proteins form a large assembly to function. This form includes the position of the protein subunits of the assembly with respect to each other.
3. SECONDARY STRUCTURE PREDICTION
Given a protein sequence with amino acids a1, a2, ..., an, the secondary structure prediction problem is to predict whether each amino acid ai is in an α-helix, a β-sheet, or neither. If we know (say, through structural studies) the actual secondary structure for each amino acid, then the 3-state accuracy is the percentage of residues for which our prediction matches reality. It is called “3-state” because each residue can be in one of 3 “states”: α, β, or other (O). Because there are only 3 states, random guessing would yield a 3-state accuracy of about 33%, assuming that all structures are equally likely. There are different methods of prediction with various accuracies. Some of these methods are:
3.1 GOR Method
The GOR method is named for the three scientists who developed it: Garnier, Osguthorpe, and Robson. It uses the information carried by a residue about its own secondary structure, in combination with the information carried by other residues in a local window of eight residues on either side of the residue concerned. The accuracy of these early methods, based on the local amino acid composition of single sequences, was fairly low, with often less than 60% of residues being predicted in the correct secondary structure state.
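The 3-state accuracy quoted for these methods is simple to compute once the predicted and observed states are written as strings. The following is a minimal sketch, assuming one character per residue (A for α, B for β, O for other); the function name and the example strings are illustrative and not taken from the original work.

```python
def three_state_accuracy(predicted: str, actual: str) -> float:
    """Percentage of residues whose predicted state matches the observed state."""
    if len(predicted) != len(actual):
        raise ValueError("prediction and reference must have the same length")
    if not actual:
        raise ValueError("sequences must not be empty")
    # Count positions where the two state strings agree.
    matches = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)


# Hypothetical ten-residue example: two of the ten states disagree.
predicted = "AAAABBBOOO"
actual = "AAAOBBBBOO"
print(three_state_accuracy(predicted, actual))
```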
3.2 PHD
The neural net model employed by Rost and Sander was fairly complex and computationally expensive. Because of the computational demands, a 7-fold cross-validation was used in place of jack-knife testing. Accuracy was over 70% using multiple sequence alignment, but the fifth of residues with the highest reliability was predicted with over 90% accuracy. Rost and Sander also tested PHD on 26 new proteins, none with significant sequence similarity to any protein in the training set, and found comparable results. PHD, however, suffers from some problems. Rost and Sander were concerned with overtraining and therefore terminated training once the accuracy was higher than 70% for all training samples.
3.3 Chou-Fasman Method
The Chou-Fasman method was among the first secondary structure prediction algorithms developed and relies predominantly on probability parameters determined from the relative frequencies of each amino acid's appearance in each type of secondary structure. In this method, a helix is predicted if, in a run of six residues, four are helix favouring and the average value of the helix propensity is greater than 100 and greater than the average strand propensity. Such a helix is extended along the sequence until a proline (a helix breaker) is encountered or a run of 4 residues with helix propensity less than 100 is found. A strand is predicted if, in a run of 5 residues, three are strand favouring and the average value of the strand propensity is greater than 104 (1.04 before scaling by 100) and greater than the average helix propensity. Such a strand is extended along the sequence until a run of 4 residues with strand propensity less than 100 is found.
3.4 Data Mining Model Used for Implementation of the Chou-Fasman Method
As part of the larger process known as knowledge discovery, data mining is the process of extracting information from large volumes of data. This is achieved through the identification and analysis of relationships and trends within commercial databases. Data mining is used in areas as diverse as space exploration and medical research. Here the data are gathered from known protein structures, and the predictions rest on the data we already have: the values for the protein whose structure we want to predict are compared against them, and the prediction of its structure follows from that comparison.
4. CHOU-FASMAN METHOD FOR PROTEIN STRUCTURE PREDICTION
The Chou-Fasman algorithm for the prediction of protein secondary structure is one of the most widely used predictive schemes. It depends on assigning a set of prediction values to each residue and then applying a simple algorithm to the conformational parameters and positional frequencies. The algorithm is simple in principle. The conformational parameters for each amino acid were calculated by considering the relative frequency of a given amino acid within a protein, its occurrence in a given type of secondary structure, and the fraction of residues occurring in that type of structure. These parameters are measures of a given amino acid's propensity, that is, its preference to be found in helix, sheet or coil. Using these conformational parameters, one finds nucleation sites within the sequence and extends them until a stretch of amino acids is encountered that is not disposed to occur in that type of structure, or until a stretch is encountered that has a greater disposition for another type of structure. At that point, the structure is terminated. This process is repeated throughout the sequence until the entire sequence is predicted.
4.1 Propensity value
Predicting the secondary structure of a protein from its primary sequence with the Chou-Fasman method requires knowledge of propensity values. Simply put, a propensity value is the tendency of an amino acid to be present in an α-helix or a β-sheet. Suppose an amino acid has a higher propensity value for α, P(α). This means that the amino acid is more likely to be found in an α-helix than in a β-sheet, and similarly for β-sheets. The propensity value is denoted P, so the propensity value for β is P(β).
4.2 Calculation of the propensity value
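The text describes the parameters as derived from how often an amino acid occurs in a given type of structure relative to how common that structure is overall. One common way to write this, assumed here, is P = 100 × (fraction of the amino acid's occurrences that lie in the structure) / (fraction of all residues that lie in the structure). The sketch below applies that ratio to toy data; the function name, the scale factor of 100 (chosen to match the scale of the parameter table) and the input strings are assumptions for illustration, not the original authors' procedure.

```python
from collections import Counter


def propensities(sequences, structures, state, scale=100.0):
    """Propensity of each amino acid for `state`, from annotated sequences.

    sequences  : list of amino acid strings
    structures : list of per-residue state strings, same lengths as sequences
    """
    total = Counter()     # occurrences of each amino acid
    in_state = Counter()  # occurrences of each amino acid inside `state`
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure lengths differ")
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_all = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("state never occurs in the data")
    background = n_state / n_all  # fraction of all residues found in `state`
    return {aa: scale * (in_state[aa] / total[aa]) / background for aa in total}


# Toy data (hypothetical): both alanines and one glycine are helical (H).
p = propensities(["AAGG"], ["HHHO"], "H")
print({aa: round(value, 1) for aa, value in p.items()})
```

A value above 100 means the amino acid is found in that structure more often than an average residue, and a value below 100 means less often.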
4.3.1 α-Helix/β-Sheet Nucleation
Nucleation concerns the tendency of the amino acids in the sequence to form helices or sheets. It depends on how many α-helix/β-sheet makers and α-helix/β-sheet breakers there are in the amino acid sequence under study. An amino acid normally acts as a maker or a breaker because of its R group, which determines whether that amino acid is found mostly in α-helices or in β-sheets. The Chou-Fasman method uses this concept to determine the secondary structure.
The rule for the α-helix is:
If more than 1/3 of the residues in a window are α-helix breakers (more than 2 in a window of 6), the window should not form an α-helix.
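The one-third breaker rule can be sketched as a small check on a window. The text does not list which residues count as breakers, so the set used below (proline and glycine, the two residues with the lowest P(α) in the parameter table) is an assumption made only for illustration, as is the function name.

```python
def too_many_breakers(window: str, breakers: set) -> bool:
    """True if more than one third of the window consists of breaker residues."""
    count = sum(1 for aa in window if aa in breakers)
    # Integer form of count / len(window) > 1/3, which avoids float rounding.
    return 3 * count > len(window)


# Assumed breaker set, for illustration only.
HELIX_BREAKERS = {"P", "G"}

print(too_many_breakers("APGPLV", HELIX_BREAKERS))  # 3 breakers in 6 residues
print(too_many_breakers("APGALV", HELIX_BREAKERS))  # 2 breakers in 6 residues
```

With a window of 6, exactly 2 breakers is still allowed; the window is rejected from 3 breakers upward.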
4.3.2 α-Helix/β-Sheet Termination
Termination concerns finding the end point of an α-helix or a β-sheet in the sequence. Here too we take any 6 contiguous residues and apply the rule above, and then extend in both directions from the selected set of residues, adding one amino acid at a time. When we reach 4 residues whose propensity value is less than 100, the α-helix or β-sheet ends there, and we check for a new segment.
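The extension step can be read in more than one way. The sketch below takes one interpretation, assumed here: a residue is added to the segment as long as the four-residue window ending at that residue (or starting at it, when extending to the left) has an average propensity of at least 100. The function name and the test sequence are hypothetical; the three P(α) values are taken from the parameter table.

```python
def extend_segment(seq, propensity, start, end, threshold=100.0, window=4):
    """Extend the half-open segment seq[start:end] in both directions.

    Extension stops on each side when the `window` residues at the growing
    edge have an average propensity below `threshold`.
    """
    def window_average(i, j):
        values = [propensity[aa] for aa in seq[i:j]]
        return sum(values) / len(values)

    # Grow to the right, one residue at a time.
    while end < len(seq):
        lo = max(end + 1 - window, 0)
        if window_average(lo, end + 1) < threshold:
            break
        end += 1
    # Grow to the left, one residue at a time.
    while start > 0:
        hi = min(start - 1 + window, len(seq))
        if window_average(start - 1, hi) < threshold:
            break
        start -= 1
    return start, end


P_ALPHA = {"A": 142, "E": 151, "G": 57}  # P(α) values from the parameter table
seq = "GGGGAEAEAEGGGG"
print(extend_segment(seq, P_ALPHA, 4, 10))  # nucleus AEAEAE at positions 4..9
```

Because the stopping test uses an average, a couple of weak residues next to a strong nucleus can still be absorbed into the segment, as happens with the flanking glycines in this example.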
4.3.3 α-Helix/β-Sheet Overlap Comparison
When a particular segment is assigned both α-helix and β-sheet, we need to determine which of the two it takes. This is decided using the Pα and Pβ values: if Pα is higher the segment is α, and if Pβ is higher it is β.
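So the conditions we check in this method are:
• the number of helix formers (Hα) against the number of helix breakers (Bα) in the window;
• the average Pα against the threshold 1.03 (103 on the scale of the parameter table);
• the average Pα against the average Pβ.
If we find four consecutive residues with Pα or Pβ less than 100, the segment ends there and is assigned its structure (α, β, or neither).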
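The comparison of Pα and Pβ for an overlapping region can be sketched as follows. The function name is illustrative, the propensity values are the table entries for four residues, and sending a tie to β is an arbitrary choice made here because the text does not say how ties are handled.

```python
P_ALPHA = {"A": 142, "E": 151, "V": 106, "I": 108}
P_BETA = {"A": 83, "E": 37, "V": 170, "I": 160}


def resolve_overlap(seq, start, end, p_alpha=P_ALPHA, p_beta=P_BETA):
    """Return 'A' (helix) or 'B' (sheet) for the overlapping region seq[start:end]."""
    region = seq[start:end]
    if not region:
        raise ValueError("region must not be empty")
    avg_alpha = sum(p_alpha[aa] for aa in region) / len(region)
    avg_beta = sum(p_beta[aa] for aa in region) / len(region)
    # Ties go to 'B' here; this is an assumption, not a rule from the text.
    return "A" if avg_alpha > avg_beta else "B"


print(resolve_overlap("VIVI", 0, 4))  # average P(α) 107 against average P(β) 165
print(resolve_overlap("AEAE", 0, 4))  # average P(α) 146.5 against average P(β) 60
```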
4.4 The Algorithm
The Chou-Fasman method of secondary structure prediction depends on assigning a set of prediction values to a residue and then applying a simple algorithm to those numbers.
The algorithm contains the following steps:
(a) Assign parameter values to all residues of the peptide.
(b) Scan the peptide and identify regions where 4 out of 6 contiguous residues have P(α)>100. These regions nucleate α-helices. Extend these in both directions until a set of four contiguous residues has an average P(α)<100. This ends the helix.
(c) Scan the peptide and identify regions where 3 out of 5 contiguous residues have P(β)>100. These regions nucleate β-strands. Extend these in both directions until a set of four contiguous residues has an average P(β)<100. This ends the strand.
(d) Any region containing overlapping α and β assignments is taken to be helical or β depending on which of the average P(α) and P(β) for that region is larger. If this reduces an α or β region to fewer than 5 residues, the α or β assignment for that region is removed.
(e) To identify a β-turn at residue number i, the product p(t) = f(i)f(i+1)f(i+2)f(i+3) is calculated, where the f(i) value is used for residue i, the f(i+1) value for residue i+1, the f(i+2) value for residue i+2 and the f(i+3) value for residue i+3. To predict a β-turn, the following three conditions have to be simultaneously fulfilled:
• p(t)>0.000075.
• The average value of P(turn) for the four amino acids is greater than 100.
• The average P(turn) is larger than the average P(α) as well as the average P(β).
(f) The remaining parts of the sequence without an assignment are considered coils.
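The nucleation scans in steps (b) and (c) differ only in window size, required count and parameter column, so one helper can serve both. This is a minimal sketch; the function name and the test peptide are hypothetical, and only the six P(α) values needed for the example are copied from the parameter table.

```python
def find_nuclei(seq, propensity, window, min_formers, threshold=100):
    """Start indices of windows in which at least `min_formers` residues
    have a propensity above `threshold`."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[i:i + window] if propensity[aa] > threshold)
        if formers >= min_formers:
            starts.append(i)
    return starts


P_ALPHA = {"A": 142, "E": 151, "G": 57, "P": 57, "L": 121, "S": 77}
seq = "GPSAELAEGS"

# Step (b): 4 out of 6 contiguous residues with P(α) > 100.
print(find_nuclei(seq, P_ALPHA, window=6, min_formers=4))
```

For step (c) the same call would use the P(β) values with window=5 and min_formers=3.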
Table: Conformational parameters and positional frequencies for α-helix, β-sheet and turn residues.
Name | P(a) | P(b) | P(turn) | f(i) | f(i+1) | f(i+2) | f(i+3)
A-Alanine | 142 | 83 | 66 | 0.060 | 0.076 | 0.035 | 0.058
R-Arginine | 98 | 93 | 95 | 0.070 | 0.106 | 0.099 | 0.085
D-Aspartic acid | 101 | 54 | 146 | 0.147 | 0.110 | 0.179 | 0.081
N-Asparagine | 67 | 89 | 156 | 0.161 | 0.083 | 0.191 | 0.091
C-Cysteine | 70 | 119 | 119 | 0.149 | 0.050 | 0.117 | 0.128
E-Glutamic acid | 151 | 37 | 74 | 0.056 | 0.060 | 0.077 | 0.064
Q-Glutamine | 111 | 110 | 98 | 0.074 | 0.098 | 0.037 | 0.098
G-Glycine | 57 | 75 | 156 | 0.102 | 0.085 | 0.190 | 0.152
H-Histidine | 100 | 87 | 95 | 0.140 | 0.047 | 0.093 | 0.054
I-Isoleucine | 108 | 160 | 47 | 0.043 | 0.034 | 0.013 | 0.056
L-Leucine | 121 | 130 | 59 | 0.061 | 0.025 | 0.036 | 0.070
K-Lysine | 114 | 74 | 101 | 0.055 | 0.115 | 0.072 | 0.095
M-Methionine | 145 | 105 | 60 | 0.068 | 0.082 | 0.014 | 0.055
F-Phenylalanine | 113 | 138 | 60 | 0.059 | 0.041 | 0.065 | 0.065
P-Proline | 57 | 55 | 152 | 0.102 | 0.301 | 0.034 | 0.068
S-Serine | 77 | 75 | 143 | 0.120 | 0.139 | 0.125 | 0.106
T-Threonine | 83 | 119 | 96 | 0.086 | 0.108 | 0.065 | 0.079
W-Tryptophan | 108 | 137 | 96 | 0.077 | 0.013 | 0.064 | 0.167
Y-Tyrosine | 69 | 147 | 114 | 0.082 | 0.065 | 0.114 | 0.125
V-Valine | 106 | 170 | 50 | 0.062 | 0.048 | 0.028 | 0.053
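The table can be held as a lookup keyed by the standard one-letter code (D for aspartic acid, N for asparagine), which makes the β-turn test of step (e) a few lines. This is a sketch under stated assumptions: the names TABLE and is_beta_turn are illustrative, and the two test tetrapeptides are hypothetical.

```python
# Residue: (P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)), copied from the table.
TABLE = {
    "A": (142, 83, 66, 0.060, 0.076, 0.035, 0.058),
    "R": (98, 93, 95, 0.070, 0.106, 0.099, 0.085),
    "D": (101, 54, 146, 0.147, 0.110, 0.179, 0.081),
    "N": (67, 89, 156, 0.161, 0.083, 0.191, 0.091),
    "C": (70, 119, 119, 0.149, 0.050, 0.117, 0.128),
    "E": (151, 37, 74, 0.056, 0.060, 0.077, 0.064),
    "Q": (111, 110, 98, 0.074, 0.098, 0.037, 0.098),
    "G": (57, 75, 156, 0.102, 0.085, 0.190, 0.152),
    "H": (100, 87, 95, 0.140, 0.047, 0.093, 0.054),
    "I": (108, 160, 47, 0.043, 0.034, 0.013, 0.056),
    "L": (121, 130, 59, 0.061, 0.025, 0.036, 0.070),
    "K": (114, 74, 101, 0.055, 0.115, 0.072, 0.095),
    "M": (145, 105, 60, 0.068, 0.082, 0.014, 0.055),
    "F": (113, 138, 60, 0.059, 0.041, 0.065, 0.065),
    "P": (57, 55, 152, 0.102, 0.301, 0.034, 0.068),
    "S": (77, 75, 143, 0.120, 0.139, 0.125, 0.106),
    "T": (83, 119, 96, 0.086, 0.108, 0.065, 0.079),
    "W": (108, 137, 96, 0.077, 0.013, 0.064, 0.167),
    "Y": (69, 147, 114, 0.082, 0.065, 0.114, 0.125),
    "V": (106, 170, 50, 0.062, 0.048, 0.028, 0.053),
}


def is_beta_turn(seq, i, table=TABLE, cutoff=0.000075):
    """Apply the three β-turn conditions to the four residues starting at index i."""
    tetra = seq[i:i + 4]
    if len(tetra) < 4:
        return False  # not enough residues left for a turn
    rows = [table[aa] for aa in tetra]
    # p(t) = f(i) f(i+1) f(i+2) f(i+3), each factor taken from its own position column.
    p_t = 1.0
    for k, row in enumerate(rows):
        p_t *= row[3 + k]
    avg_alpha = sum(row[0] for row in rows) / 4
    avg_beta = sum(row[1] for row in rows) / 4
    avg_turn = sum(row[2] for row in rows) / 4
    return (p_t > cutoff and avg_turn > 100
            and avg_turn > avg_alpha and avg_turn > avg_beta)


print(is_beta_turn("NPDG", 0))  # turn-favouring residues
print(is_beta_turn("AVLI", 0))  # helix and strand formers
```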
4.5 Choice of sequence format
There are various formats for amino acid sequences, and each has its own set of characters and utility. To get a deeper understanding and better results it is essential to choose a valid input format. The various formats are:
• Plain text format
• FASTA format
• Genetics Computer Group (GCG) format
• NEXUS
• NBRF & PIR
Example: plain text format
Plain text format looks like the following:
MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTH TSTMDAQEVETIWTILPAIILILIALPSLRILYMMDEINNPSTVKTMGHQWY WSYEYTDYEDLSFDSYMIPTSELKPGELRLLEVDNRVVLPMEAAQQEEEE MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTH TSTMDAQEVETIWTILPAIILILIALPSL RILYMMDEINNPSTVKTMGHQWY WSYEYTDYEDLSF DSYMIPTSELKPGELRLLEVDNRVVLPMEAAQQE.
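Whatever the input format, the prediction step needs a clean string of one-letter residue codes. The sketch below handles the plain text form shown above and a simple FASTA record; the function names are illustrative, the FASTA reader keeps only the first record, and the other formats in the list are not covered.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def read_plain_sequence(text: str) -> str:
    """Remove whitespace and a trailing period, upper-case, and validate the letters."""
    seq = "".join(text.split()).rstrip(".").upper()
    bad = sorted(set(seq) - VALID_RESIDUES)
    if bad:
        raise ValueError(f"unexpected characters in sequence: {bad}")
    return seq


def read_fasta(text: str) -> str:
    """Return the residues of the first record of a FASTA string.

    Header lines start with '>'; reading stops at the second header.
    """
    lines = []
    seen_header = False
    for line in text.splitlines():
        if line.startswith(">"):
            if seen_header:
                break  # a second record begins here
            seen_header = True
            continue
        lines.append(line)
    return read_plain_sequence(" ".join(lines))


print(read_plain_sequence("MAYPMQLG FQDATSPI\nMEELLH."))
print(read_fasta(">example record\nMAYPMQLG\nFQDATSPI\n>another\nAAAA"))
```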
5 RESULTS AND DISCUSSION
For a given sequence of amino acids, this technique first clusters the amino acids, and these clusters are then analyzed to predict the structure of the protein. The user inputs the primary structure of the protein, i.e. the amino acid sequence. The clusters of amino acids are extended until an α-helix, β-sheet or turn is predicted using the conformational parameters and positional frequencies for α-helix, β-sheet and turn residues.
The whole detailed method is explained below: Example:
INPQAIFDIQIKRLHEYKRQHHDKQVHMANLCVVGGFA VNGVAALHSDLVVKDLFPEYHQLWPNKFHNVTNGITP RRWIKQCNPALAALLDKSLQKEWANDLDQLINLVKLA DDAKFRQLYRVIKQANKVRLAEFVKVRTIDLNLLHILA LYKERIRENP
The above sequence is divided into clusters and from the table the conformational parameter and positional frequencies for α-helix, ß-sheet and turn residues are established.
Example: for the first 6 residues, Pα does not satisfy the helix condition, so the segment may not be in an α-helix, and Pβ does not satisfy the sheet condition, so it may not be in a β-sheet either. We then consider it for turns, and the segment can be taken to be in a turn. To determine the end points, we start from a chosen position in the sequence and apply the algorithm in both directions until the breaking criteria are met. The following is the output after considering the structure.
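The quantities behind this example can be checked directly for the first six residues of the sequence, INPQAI, using the values in the parameter table. This is only a sketch of the arithmetic; exactly how the original implementation combined the thresholds is not stated, so the variable names and the choice of quantities printed are assumptions.

```python
# P(α) and P(β) for the residues in the first window, copied from the table.
P_ALPHA = {"I": 108, "N": 67, "P": 57, "Q": 111, "A": 142}
P_BETA = {"I": 160, "N": 89, "P": 55, "Q": 110, "A": 83}

window = "INPQAI"  # first six residues of the example sequence

avg_alpha = sum(P_ALPHA[aa] for aa in window) / len(window)
avg_beta = sum(P_BETA[aa] for aa in window) / len(window)
helix_formers = sum(P_ALPHA[aa] > 100 for aa in window)
# Strand formers in each of the two five-residue windows inside the first six.
strand_formers = [sum(P_BETA[aa] > 100 for aa in window[i:i + 5]) for i in range(2)]

print(round(avg_alpha, 1), avg_beta, helix_formers, strand_formers)
```

With these values the average P(α) is about 98.8, which is below 100, and neither five-residue window reaches three strand formers. Both results are consistent with the segment being passed on to the turn test.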
Hence the final secondary structure of the given sequence is:
TTTBBBBBBBBBBBBBTTTTAAAAAAABBBBBBBTTTTTTTTTTTTTTTTTTTTTBBBBBTTAA AAAAAAAAAAAAAAATTBBBBBBTTTTTTTTTTTTBBBBBBTTTTAAAAAAAATTTTTTTTB BBBTTTT
6 CONCLUSION
The method attempts to classify the amino acids in a protein sequence according to their predicted local structure, which can be subdivided into three states: α-helix, β-sheet or turn.
• Protein fold can be predicted with better accuracy with this technique.
• Various other data mining techniques can be used to determine an optimum result.
• Choice of various formats of amino acid sequences can be utilized.
• Protein structure and protein function prediction can be done based on improved Chou-Fasman method which includes 4 amino acids enabling a reverse β- turn.
7 FUTURE SCOPE OF WORK
Research on improving the Chou-Fasman algorithm is still ongoing, and several modified algorithms can be found in the literature.
Following improvements regarding the developed model of bioinformatics can be made:
• The system can be extended to predict the tertiary structure of the protein.
• Various different mining techniques can be utilized to determine the optimum result.
• Different formats of amino acid sequences can be utilized.
• Protein fold can be predicted with better accuracy using this technique.
• This technique can be further extended for multiple sequence alignment.