Chou-Fasman Algorithm for Protein Structure Prediction


1. INTRODUCTION

Protein structure determination and prediction has been a vital area of bioinformatics because protein structure is central to understanding the biological and chemical activities of organisms. Understanding proteins matters in many settings, such as finding cures for illnesses, designing new chemical formulas, and studying food and nutrition. The experimental methods that biotechnologists use to determine protein structures demand sophisticated equipment and time. A host of computational methods have therefore been developed to predict the location of secondary structure elements in proteins, complementing experimental results or offering insight into them. The Chou-Fasman algorithm is an empirical algorithm developed for the prediction of protein secondary structure.

1.1 Proteins

Proteins are complex organic compounds that consist of amino acids joined by peptide bonds. Proteins are essential to the structure and function of all living cells and viruses. Many proteins function as enzymes or form subunits of enzymes, and their roles differ according to their structure. Some proteins play structural or mechanical roles, while others function in the immune response or in the storage and transport of various ligands. Proteins serve as nutrients as well; they provide the organism with the amino acids that it cannot synthesize itself. Proteins are amongst the most actively studied molecules in biochemistry. An amino acid is any molecule that contains both an amino group and a carboxylic acid group. An amino acid residue is what remains of an amino acid after it forms a peptide bond and loses a water molecule. Since we are interested in amino acids that form proteins, it is safe to use the terms residue and amino acid interchangeably. There are 20 different standard amino acids that form proteins.

Examples of proteins:

a) Protective proteins, for example, keratin (nails).
b) Defence proteins, for example, antibodies.
c) Toxins, for example, snake venom.
d) Structural proteins, for example, collagen of bones.
e) Enzymes (biocatalysts), for example, pepsin and trypsin.
f) Hormones, for example, insulin.

1.2 Structure of Protein

-NH2	+	-COOH	=	-CONH-
Amino group (Amino acid 1)		Carboxylic group (Amino acid 2)		Peptide Bond

A chain of amino acids joined by such peptide bonds is called a polypeptide, and a protein consists of one or more polypeptide chains.

Amino acids are the basic building blocks of proteins. Fundamentally, amino acids are joined together by peptide bonds to form the basic structure of proteins.

Amino acids play central roles both as building blocks of proteins and as intermediates in metabolism. The 20 amino acids that are found within proteins convey a vast array of chemical versatility.

The chemical properties of the amino acids of proteins determine the biological activity of the protein. In addition, proteins contain within their amino acid sequences the necessary information to determine how that protein will fold into a three dimensional structure, and the stability of the resulting structure.

1.3 Amino Acids

Fig. 1 A generic Amino acid Structure

Amino acids bind together in chains to form the material from which life is built. It is a two-step process: amino acids first join to form peptides or polypeptides, and proteins are then made from these groupings.

Commonly recognized amino acids include glutamine, glycine, phenylalanine, tryptophan, and valine. Three of those (phenylalanine, tryptophan, and valine) are essential amino acids for humans; the others are histidine, isoleucine, leucine, lysine, methionine, and threonine.

Amino acids are carbon compounds that contain two functional groups: an amino group (NH2) and a carboxylic acid group (COOH). A side chain attached to the compound, written R, differs from one amino acid to another and gives each amino acid a unique set of characteristics.

2. INVESTIGATING THE PROTEIN STRUCTURE

Structures of proteins are investigated at four levels:


Fig. 2 Different representations of protein structure

•     Primary Structure is the sequence of amino acids in the protein. Counting of residues always starts at the N-terminal end (NH2 group), which is the end where the amino group is free, that is, not involved in a peptide bond. The primary structure of a protein is determined by the gene corresponding to the protein.

•     Secondary Structure is the composition of common patterns in the protein. Some patterns are frequently observed in the native states of proteins. This structure class records which regions of the protein form these patterns, but it does not include the coordinates of residues.

•     Tertiary Structure is the native state, or folded form, of a single protein chain. This form is also called the functional form. The tertiary structure of a protein includes the coordinates of its residues in three-dimensional space. The elements of secondary structure are usually folded into a compact shape using a variety of loops and turns.

•     Quaternary Structure is the structure of a protein complex. Some proteins form a large assembly in order to function. This form includes the position of the protein subunits of the assembly with respect to each other.

3. SECONDARY STRUCTURE PREDICTION

Given a protein sequence with amino acids a1, a2, ..., an, the secondary structure prediction problem is to predict whether each amino acid ai is in an α-helix, a β-sheet, or neither. If we know the actual secondary structure for each amino acid (say, through structural studies), then the 3-state accuracy is the percentage of residues for which our prediction matches reality. It is called “3-state” because each residue can be in one of 3 “states”: α, β, or other (O). Because there are only 3 states, random guessing would yield a 3-state accuracy of about 33%, assuming that all structures are equally likely. There are different methods of prediction with various accuracies. Some of these methods are:
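The 3-state accuracy described above can be written down in a few lines. The following is only a minimal sketch: the function name `q3_accuracy` and the one-letter state codes (H for α, E for β, O for other) are assumptions made for illustration, not something defined in this post.

```python
def q3_accuracy(predicted, actual):
    """Percentage of residues whose predicted state matches the actual state.

    Both arguments are strings with one state letter per residue,
    for example 'H' (alpha), 'E' (beta) and 'O' (other).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("sequences must not be empty")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)


# Five of the six residues agree, so the accuracy is 5/6 of 100.
print(round(q3_accuracy("HHEEOO", "HHEOOO"), 1))
```

Any consistent set of three state letters works, as long as the prediction and the reference use the same ones.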

3.1 GOR Method

The GOR method is named for the three scientists who developed it: Garnier, Osguthorpe, and Robson. It uses the information carried by a residue about its own secondary structure, in combination with the information carried by the other residues in a local window of eight residues on either side of the residue concerned.

The accuracy of these early methods, which were based on the local amino acid composition of single sequences, was fairly low, often with less than 60% of residues being predicted in the correct secondary structure state.

3.2 PHD

The neural net model employed by Rost and Sander was fairly complex and computationally expensive. Because of the computational demands, a 7-fold cross-validation was used in place of jack-knife testing. Accuracy was over 70% using multiple sequence alignment, but the fifth of residues with the highest reliability was predicted with over 90% accuracy. Rost and Sander also tested PHD on 26 new proteins, none with significant sequence similarity to any protein in the training set, and found comparable results. PHD, however, suffers from some problems. Rost and Sander were concerned with overtraining and therefore terminated training once the accuracy was higher than 70% for all training samples.

3.2 Chou-Fasman Method

The Chou-Fasman method was among the first secondary structure prediction algorithms developed. It relies predominantly on probability parameters determined from the relative frequencies of each amino acid's appearance in each type of secondary structure. Propensities are quoted here multiplied by 100, so a value of 100 means no preference. In this method, a helix is predicted if, in a run of six residues, four are helix favouring and the average helix propensity is greater than 100 and greater than the average strand propensity. Such a helix is extended along the sequence until a proline (a helix breaker) is encountered or a run of 4 residues with helix propensity less than 100 is found. A strand is predicted if, in a run of 5 residues, three are strand favouring and the average strand propensity is greater than 104 (1.04 on the unscaled scale) and greater than the average helix propensity. Such a strand is extended along the sequence until a run of 4 residues with strand propensity less than 100 is found.

3.3 Data Mining Model used for implementation of the Chou-Fasman method

As a part of the larger process known as knowledge discovery, data mining is the process of extracting information from large volumes of data. This is achieved through the identification and analysis of relationships and trends within databases. Data mining is used in areas as diverse as space exploration and medical research. Here we gather data from known protein structures, and the predictions are based on the data we already have: we compare the values for the protein whose structure we want to predict against that data and predict its structure accordingly.

4. CHOU-FASMAN METHOD FOR PROTEIN STRUCTURE PREDICTION

The Chou-Fasman algorithm for the prediction of protein secondary structure is one of the most widely used predictive schemes. The Chou-Fasman method of secondary structure prediction depends on assigning a set of prediction values to a residue and then applying a simple algorithm to the conformational parameters and positional frequencies. The Chou-Fasman algorithm is simple in principle.

The conformational parameters for each amino acid were calculated by considering the relative frequency of a given amino acid within a protein, its occurrence in a given type of secondary structure, and the fraction of residues occurring in that type of structure. These parameters are measures of a given amino acid's propensity (preference to be found in helix, sheet or coil). Using these conformational parameters, one finds nucleation sites within the sequence and extends them until a stretch of amino acids is encountered that is not disposed to occur in that type of structure or until a stretch is encountered that has a greater disposition for another type of structure. At that point, the structure is terminated. This process is repeated throughout the sequence until the entire sequence is predicted.

4.1 Propensity value

Predicting the secondary structure of a protein from its primary sequence with the Chou-Fasman method requires knowledge of propensity values. Simply put, a propensity value is the tendency of an amino acid to be present in an α-helix or a β-sheet. Suppose an amino acid has a higher propensity value for α, written P(α). That means the amino acid is more likely to be present in an α-helix than in a β-sheet. The same holds, the other way round, for β-sheets. The propensity value is denoted P, so the propensity value for β is written P(β).

4.2 Calculation of the propensity value
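As described in Section 4, a propensity compares how often an amino acid occurs in a given structure with how often any residue occurs in that structure. One way to write this is P = 100 × (fraction of that amino acid found in the structure) / (fraction of all residues found in the structure). The sketch below follows that reading; the function name, the state letters and the tiny input are hypothetical and only illustrate the arithmetic, they are not real statistics.

```python
from collections import Counter


def propensities(sequence, states, target):
    """Propensity (x100) of each residue type for the state `target`.

    `sequence` and `states` are equal-length strings: one residue letter
    and one observed state letter per position.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    in_state = sum(s == target for s in states)
    if in_state == 0:
        raise ValueError("no residue is in the target state")
    overall_fraction = in_state / len(sequence)
    totals = Counter(sequence)
    in_target = Counter(aa for aa, s in zip(sequence, states) if s == target)
    return {
        aa: 100 * (in_target[aa] / totals[aa]) / overall_fraction
        for aa in totals
    }


# Toy data: both alanines are helical, only one of the two glycines is.
toy = propensities("AAGG", "HHHC", target="H")
print({aa: round(p, 1) for aa, p in toy.items()})
```

A value above 100 means the residue is found in that structure more often than an average residue, and a value below 100 means less often.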

4.3.1 α-Helix/β-Sheet Nucleation

Nucleation concerns the tendency of a stretch of amino acids to start a helix or a sheet. It depends on how many α-helix/β-sheet makers and α-helix/β-sheet breakers there are in the amino acid sequence we study. An amino acid becomes a maker or a breaker because of its R group (side chain): the R group determines whether that amino acid is found mostly in α-helices or in β-sheets. We use this concept to determine the secondary structure in the Chou-Fasman method.

The rule for an α-helix is:

If 6 contiguous residues have
•    one third or more (≥ 1/3, in this case 2 or more) α-helix breakers, they should not form an α-helix;
•    less than half (< 1/2, in this case fewer than 3) α-helix makers, they should not form an α-helix.

The rule for a β-sheet is:

If 5 contiguous residues have
•    one third or more (≥ 1/3) β-sheet breakers, they should not form a β-sheet;
•    less than half (< 1/2) β-sheet makers, they should not form a β-sheet.
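The maker/breaker rule can be checked with simple counting. The sketch below is an illustration under assumptions: it reads the breaker rule as "one third or more" (so 2 breakers out of 6 already block a helix), and the two residue sets are the helix former and helix breaker classes commonly quoted for Chou-Fasman, which this post does not list, so treat them as placeholders.

```python
# Assumed helix classes (not given in this post): formers and breakers.
HELIX_MAKERS = set("EMALKFQWIV")
HELIX_BREAKERS = set("NYPG")


def can_nucleate(window, makers, breakers):
    """Apply the maker/breaker rule to one window of residues.

    Nucleation is rejected when one third or more of the residues are
    breakers, or when fewer than half of them are makers.
    """
    n = len(window)
    n_makers = sum(aa in makers for aa in window)
    n_breakers = sum(aa in breakers for aa in window)
    # Integer comparisons avoid floating-point rounding at the thresholds.
    return 3 * n_breakers < n and 2 * n_makers >= n


print(can_nucleate("EMALKF", HELIX_MAKERS, HELIX_BREAKERS))  # six makers
print(can_nucleate("EMAPGL", HELIX_MAKERS, HELIX_BREAKERS))  # two breakers
```

The same function works for β-sheets if it is given a window of 5 residues and the corresponding sheet maker and breaker sets.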

4.3.2 α-Helix/β-Sheet Termination

Termination concerns finding the end point of an α-helix or a β-sheet in the sequence. Here too we take a run of contiguous residues (6 for a helix) and apply the rule above. We then extend the run in both directions, adding one amino acid at a time. When we reach four residues whose propensity values are less than 100, we say that the α-helix or β-sheet ends there, and we go on to check a new stretch of the sequence.

4.3.3 α-Helix/β-Sheet overlapping comparison

When a particular stretch of the sequence qualifies as both α-helix and β-sheet, we need to determine which one it is more likely to adopt. This is decided using the P(α) and P(β) values: if P(α) is higher we call it α, and if P(β) is higher we call it β.

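So the conditions we check in this method compare:

•    the number of helix makers, Hα, with the number of helix breakers, Bα;
•    the average P(α) with the threshold 1.03 (103 on the scale of Table I);
•    the average P(α) with the average P(β).

If we find four consecutive residues with P(α) or P(β) less than 100, we say the segment ends there, and it keeps the structure determined for it (α, β, or neither).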

4.4 The Algorithm

The Chou-Fasman method of secondary structure prediction depends on assigning a set of prediction values to a residue and then applying a simple algorithm to those numbers.

The algorithm contains the following steps:
(a) Assign parameter values to all residues of the peptide.

(b) Scan the peptide and identify regions where 4 out of 6 contiguous residues have P(α) > 100. These regions nucleate α-helices. Extend these in both directions until a set of four contiguous residues has an average P(α) < 100. This ends the helix.

(c) Scan the peptide and identify regions where 3 out of 5 contiguous residues have P(β) > 100. These residues nucleate β-strands. Extend these in both directions until a set of four contiguous residues has an average P(β) < 100. This ends the strand.

(d) Any region containing overlapping α and β assignments is taken to be helical or β depending on which of the average P(α) and P(β) for that region is larger. If this reduces an α or β region to fewer than 5 residues, the α or β assignment for that region is removed.

(e) To identify a β-turn at residue number i, the product p(t) = f(i)f(i+1)f(i+2)f(i+3) is calculated, where the f(i) value is used for residue i, the f(i+1) value for residue i+1, the f(i+2) value for residue i+2 and the f(i+3) value for residue i+3. To predict a β-turn, the following three conditions have to be fulfilled simultaneously:

•    p(t) > 0.000075.
•    The average value of P(turn) for the four amino acids is greater than 100.
•    The average P(turn) is larger than the average P(α) as well as the average P(β).

(f) The remaining parts of the sequence, which have no assignment, are considered coils.
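Steps (a) to (d) and (f) can be sketched in code as follows. This is one possible reading, not a reference implementation: the extension rule is interpreted as moving one residue at a time while the four residues at the growing end still average at least 100, overlaps are resolved separately for each contiguous overlapping run, and the output letters H (helix), E (strand) and C (coil) are chosen here for illustration. The propensities are the P(a) and P(b) columns of Table I, keyed by the standard one-letter code of each named residue.

```python
from itertools import groupby

# Step (a): P(a) and P(b) from Table I (x100), keyed by one-letter code.
P_ALPHA = dict(A=142, R=98, D=101, N=67, C=70, E=151, Q=111, G=57, H=100,
               I=108, L=121, K=114, M=145, F=113, P=57, S=77, T=83, W=108,
               Y=69, V=106)
P_BETA = dict(A=83, R=93, D=54, N=89, C=119, E=37, Q=110, G=75, H=87,
              I=160, L=130, K=74, M=105, F=138, P=55, S=75, T=119, W=137,
              Y=147, V=170)


def mean(values):
    return sum(values) / len(values)


def find_regions(seq, table, window, needed):
    """Steps (b)/(c): mark residues covered by nucleated, extended regions."""
    n = len(seq)
    mask = [False] * n
    for start in range(n - window + 1):
        hits = sum(table[aa] > 100 for aa in seq[start:start + window])
        if hits < needed:
            continue
        left, right = start, start + window - 1  # inclusive bounds
        # Grow right while the four residues ending at right + 1 average >= 100.
        while right + 1 < n and mean(
                [table[aa] for aa in seq[right - 2:right + 2]]) >= 100:
            right += 1
        # Grow left while the four residues starting at left - 1 average >= 100.
        while left >= 1 and mean(
                [table[aa] for aa in seq[left - 1:left + 3]]) >= 100:
            left -= 1
        for i in range(left, right + 1):
            mask[i] = True
    return mask


def predict(seq):
    helix = find_regions(seq, P_ALPHA, window=6, needed=4)
    strand = find_regions(seq, P_BETA, window=5, needed=3)
    states = ["X" if h and s else "H" if h else "E" if s else "C"
              for h, s in zip(helix, strand)]
    # Step (d): each overlapping run goes to the larger average propensity.
    i = 0
    while i < len(states):
        if states[i] != "X":
            i += 1
            continue
        j = i
        while j < len(states) and states[j] == "X":
            j += 1
        region = seq[i:j]
        alpha_wins = (mean([P_ALPHA[aa] for aa in region])
                      >= mean([P_BETA[aa] for aa in region]))
        states[i:j] = ("H" if alpha_wins else "E") * (j - i)
        i = j
    # Step (d), second part, and step (f): short segments become coil.
    out = []
    for state, run in groupby(states):
        length = len(list(run))
        out.append(("C" if state != "C" and length < 5 else state) * length)
    return "".join(out)


print(predict("AAAAAAAAAA"))  # alanine strongly favours helix
print(predict("VVVVVVVV"))    # valine favours strand over helix
```

β-turns from step (e) are left out of this sketch; they need the positional frequencies from Table I as well.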

TABLE I CONFORMATIONAL PARAMETERS AND POSITIONAL FREQUENCIES FOR α-HELIX, β-SHEET AND TURN RESIDUES.

Name	P(a)	P(b)	P(turn)	f(i)	f(i+1)	f(i+2)	f(i+3)
A-Alanine	142	83	66	0.060	0.076	0.035	0.058
R-Arginine	98	93	95	0.070	0.106	0.099	0.085
D-Aspartic acid	101	54	146	0.147	0.110	0.179	0.081
N-Asparagine	67	89	156	0.161	0.083	0.191	0.091
C-Cysteine	70	119	119	0.149	0.050	0.117	0.128
E-Glutamic acid	151	37	74	0.056	0.060	0.077	0.064
Q-Glutamine	111	110	98	0.074	0.098	0.037	0.098
G-Glycine	57	75	156	0.102	0.085	0.190	0.152
H-Histidine	100	87	95	0.140	0.047	0.093	0.054
I-Isoleucine	108	160	47	0.043	0.034	0.013	0.056
L-Leucine	121	130	59	0.061	0.025	0.036	0.070
K-Lysine	114	74	101	0.055	0.115	0.072	0.095
M-Methionine	145	105	60	0.068	0.082	0.014	0.055
F-Phenylalanine	113	138	60	0.059	0.041	0.065	0.065
P-Proline	57	55	152	0.102	0.301	0.034	0.068
S-Serine	77	75	143	0.120	0.139	0.125	0.106
T-Threonine	83	119	96	0.086	0.108	0.065	0.079
W-Tryptophan	108	137	96	0.077	0.013	0.064	0.167
Y-Tyrosine	69	147	114	0.082	0.065	0.114	0.125
V-Valine	106	170	50	0.062	0.048	0.028	0.053
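With the table in hand, the β-turn test of step (e) can be tried on a short stretch. The sketch below copies the values of Table I into a small text block, keyed by the standard one-letter code of each named residue (D for aspartic acid, N for asparagine), and checks the three conditions for a turn starting at a 0-based position. The names `PARAMS` and `is_turn` are made up for this example.

```python
# Columns: P(a) P(b) P(turn) f(i) f(i+1) f(i+2) f(i+3), copied from Table I.
TABLE = """
A 142 83 66 0.060 0.076 0.035 0.058
R 98 93 95 0.070 0.106 0.099 0.085
D 101 54 146 0.147 0.110 0.179 0.081
N 67 89 156 0.161 0.083 0.191 0.091
C 70 119 119 0.149 0.050 0.117 0.128
E 151 37 74 0.056 0.060 0.077 0.064
Q 111 110 98 0.074 0.098 0.037 0.098
G 57 75 156 0.102 0.085 0.190 0.152
H 100 87 95 0.140 0.047 0.093 0.054
I 108 160 47 0.043 0.034 0.013 0.056
L 121 130 59 0.061 0.025 0.036 0.070
K 114 74 101 0.055 0.115 0.072 0.095
M 145 105 60 0.068 0.082 0.014 0.055
F 113 138 60 0.059 0.041 0.065 0.065
P 57 55 152 0.102 0.301 0.034 0.068
S 77 75 143 0.120 0.139 0.125 0.106
T 83 119 96 0.086 0.108 0.065 0.079
W 108 137 96 0.077 0.013 0.064 0.167
Y 69 147 114 0.082 0.065 0.114 0.125
V 106 170 50 0.062 0.048 0.028 0.053
"""
PARAMS = {
    row.split()[0]: tuple(float(x) for x in row.split()[1:])
    for row in TABLE.strip().splitlines()
}
ALPHA, BETA, TURN, F_FIRST = 0, 1, 2, 3  # column indexes inside PARAMS


def is_turn(seq, i):
    """True if a beta-turn is predicted for the four residues starting at i."""
    tetra = seq[i:i + 4]
    if len(tetra) < 4:
        return False
    p_t = 1.0
    for offset, aa in enumerate(tetra):
        p_t *= PARAMS[aa][F_FIRST + offset]  # f(i), f(i+1), f(i+2), f(i+3)

    def average(column):
        return sum(PARAMS[aa][column] for aa in tetra) / 4

    return (p_t > 0.000075
            and average(TURN) > 100
            and average(TURN) > average(ALPHA)
            and average(TURN) > average(BETA))


print(is_turn("NPDG", 0))  # turn-favouring residues in favourable positions
print(is_turn("AAAA", 0))  # alanine has low turn parameters
```

Scanning a whole sequence is then a matter of calling `is_turn(seq, i)` for every i from 0 to len(seq) - 4.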

4.5 Choice of sequence format

There are various formats of Amino acid sequences, and each has its own set of characters and utility. To get a deeper understanding and better results it is essential to choose a valid input format. The various formats are:

•    Plain text format
•    FASTA format
•    Genetics Computer Group (GCG) format
•    NEXUS
•    NBRF/PIR

Example: plain text format looks like the following:

MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTH TSTMDAQEVETIWTILPAIILILIALPSLRILYMMDEINNPSTVKTMGHQWY WSYEYTDYEDLSFDSYMIPTSELKPGELRLLEVDNRVVLPMEAAQQEEEE MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTH TSTMDAQEVETIWTILPAIILILIALPSL RILYMMDEINNPSTVKTMGHQWY WSYEYTDYEDLSF DSYMIPTSELKPGELRLLEVDNRVVLPMEAAQQE.
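Plain text input like the one above often carries spaces, line breaks or a trailing full stop, so it helps to normalise it before looking up parameters. The helper below is a small sketch with a made-up name, `clean_sequence`; it also skips FASTA header lines (those starting with ">") and rejects any letter outside the 20 standard amino acids.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(text):
    """Return an upper-case residue string from plain text or FASTA input."""
    lines = [ln for ln in text.splitlines() if not ln.startswith(">")]
    # Remove all whitespace, normalise case, drop a trailing full stop.
    seq = "".join("".join(lines).split()).upper().rstrip(".")
    unexpected = sorted(set(seq) - VALID_RESIDUES)
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {unexpected}")
    return seq


print(clean_sequence("MAYP MQLG\nfqda."))
print(clean_sequence(">example header\nMA YP"))
```

Raising an error on unknown letters, rather than silently dropping them, keeps residue numbering in the prediction aligned with the input.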

5 RESULTS AND DISCUSSION

For a given sequence of amino acids, this technique first clusters the amino acids, and these amino acid clusters are then analyzed to predict the structure of the protein. The user inputs the primary structure of the protein, i.e. the amino acid sequence. The clusters of amino acids are extended until an α-helix, a β-sheet or a turn is predicted using the conformational parameters and positional frequencies for α-helix, β-sheet and turn residues.

The whole method is explained below with an example:

INPQAIFDIQIKRLHEYKRQHHDKQVHMANLCVVGGFA VNGVAALHSDLVVKDLFPEYHQLWPNKFHNVTNGITP RRWIKQCNPALAALLDKSLQKEWANDLDQLINLVKLA DDAKFRQLYRVIKQANKVRLAEFVKVRTIDLNLLHILA LYKERIRENP

The above sequence is divided into clusters, and the conformational parameters and positional frequencies for α-helix, β-sheet and turn residues are looked up in the table.

For example, for the first 6 residues the P(α) test is not passed, so the segment may not be in an α-helix, and the P(β) test is not passed either, so it may not be in a β-sheet. We then consider it for turns and can take the structure to be a turn. To determine the end points, we pick a starting run from the sequence and apply the algorithm in both directions until we meet the breaking criteria. The following is the output after considering the structure.

Hence the final secondary structure of the given sequence is:

TTTBBBBBBBBBBBBBTTTTAAAAAAABBBBBBBTTTTTTTTTTTTTTTTTTTTTBBBBBTTAA AAAAAAAAAAAAAAATTBBBBBBTTTTTTTTTTTTBBBBBBTTTTAAAAAAAATTTTTTTTB BBBTTTT

6 CONCLUSION

The method attempts to classify the amino acids in a protein sequence according to their predicted local structure, which can be subdivided into three states: α-helix, β-sheet or turn.

•     Protein fold can be predicted with better accuracy with this technique.
•     Various other data mining techniques can be used to determine an optimum result.
•     Various formats of amino acid sequences can be used as input.
•     Protein structure and protein function prediction can be done based on an improved Chou-Fasman method that considers 4 amino acids at a time to identify a reverse β-turn.

7 FUTURE SCOPE OF WORK

A lot of research is still going on into improvements of the Chou-Fasman algorithm, and several modified algorithms can be found in the literature.

The following improvements to the developed bioinformatics model can be made:

•     The system can be extended to predict the tertiary structure of the protein.
•     Different mining techniques can be used to determine the optimum result.
•     Different formats of amino acid sequences can be supported.
•     Protein fold can be predicted with better accuracy using this technique.
•     This technique can be further extended to multiple sequence alignment.



Sunday, December 14, 2014