Characterization of a hypothetical protein of Homo sapiens GI: 2135416 – An in silico approach


Baphilinia Jones Mylliemngap1 and Atanu Bhattacharjee1*

1Department of Biotechnology and Bioinformatics, North Eastern Hill University, Shillong-793022, India

*email:, Phone- +91-9336703339.


The complete human genome sequence provides a way to understand the blueprint of life. Its completion has generated large-scale information on both genes and proteins. Characterizing genes and proteins is important for determining their regulatory mechanisms and functions. In this work, a hypothetical protein was characterized in silico to help determine its function. The hypothetical protein showed 40% similarity to another human hypothetical protein whose function is yet to be determined, but the sequence indicates a high level of conservation and stability of the protein structure. Further research involving the development of appropriate strategies will provide new avenues in the field of medicine and research.

Keywords: Homo sapiens, hypothetical protein


Humans, known taxonomically as Homo sapiens, are the only living species of the genus Homo, the bipedal primates of the family Hominidae. The human genome comprises over 3 billion DNA base pairs organized into two sets of 23 chromosomes (22 pairs of autosomes and one pair of sex chromosomes) and contains an estimated 20,000–25,000 genes [1]. The Human Genome Project, initially headed by Ari Patrinos, began in 1990; a working draft of the human genome was released in 2000 and a complete sequence in 2003 [2].

The large-scale genome sequencing project has generated a wealth of information on both genes and proteins, including a vast number of proteins whose function and structure have not been determined [3]. A protein whose existence has been predicted, but for which there is no experimental evidence of expression in vivo or no clear indication that it encodes a particular function, is termed hypothetical or uncharacterized. Such proteins may nevertheless be important, hence the need to characterize hypothetical proteins for which only the primary sequence is available. Determining protein function can be challenging [4], but the results help provide insight into gene regulatory mechanisms, metabolism and other functional aspects of the organism, and into whether the protein is related to any disease condition. The present work therefore uses a broad range of tools and graphical software to annotate the hypothetical protein (Accession no. GI: 2135416).

Further analysis of the genes and proteins of the human genome will provide new avenues in the field of medicine and research and, in the long term, lead to significant advances in the management of disease. With more information, studying a disease condition and understanding its genomic characterization would help in developing new drugs and antibiotics and in treating the disease.


To analyze the hypothetical protein and assign its functional and structural roles, various tools and software were used. The primary sequence of the hypothetical protein (Accession no. GI: 2135416) was obtained from GenBank at the National Center for Biotechnology Information. The sequence was searched with the Basic Local Alignment Search Tool (BLAST) to detect homologous sequences by finding regions of local similarity between sequences. The program compares a protein sequence against the database and calculates the statistical significance of the matches; it can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families [5]. Conserved domains of the protein, which represent independently stable functional units, were also sought from the BLAST analysis.

The physico-chemical properties of the protein were calculated using ProtParam, which computes various physical and chemical parameters for a given protein, including the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY) [6]. The secondary structure of the protein was analyzed using SOPMA [7], which predicts the secondary structure state of each region of the protein.
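Two of the ProtParam parameters listed above follow simple published formulas and can be illustrated with a short Python sketch. GRAVY is the mean Kyte and Doolittle hydropathy value over all residues, and the aliphatic index is X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)), where X is the mole percent of each residue. The function names and the toy sequence are illustrative assumptions; this is not the ProtParam implementation, and the toy sequence is not the sequence of GI: 2135416.

```python
from collections import Counter

# Kyte and Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Return {residue: mole percent} for a protein sequence."""
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy per residue."""
    return sum(KD[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    pct = composition(seq)
    return (pct.get("A", 0.0)
            + 2.9 * pct.get("V", 0.0)
            + 3.9 * (pct.get("I", 0.0) + pct.get("L", 0.0)))


def charged_residues(seq):
    """Return (Asp + Glu, Arg + Lys) counts."""
    counts = Counter(seq)
    return counts["D"] + counts["E"], counts["R"] + counts["K"]


# Toy sequence for illustration only (hypothetical, not GI: 2135416).
toy = "MKLVAIDERKLLSG"
print(round(gravy(toy), 3), round(aliphatic_index(toy), 2), charged_residues(toy))
```

A negative GRAVY, such as the value reported for this protein, indicates a sequence that is hydrophilic on average.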

Transmembrane regions of the protein were analyzed using TMpred [8]. The TMpred program predicts membrane-spanning regions and their orientation using a combination of several weight matrices for scoring. The algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins.

The hypothetical protein was submitted to DisEMBL [9], a computational tool for predicting disordered or unstructured regions within a protein sequence. Such regions often contain short linear peptide motifs that are important for protein function. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects.

The primary structure of the protein was analyzed for repeats using RADAR, which automatically detects and aligns repeats in a protein sequence. RADAR identifies short composition-biased repeats as well as gapped approximate repeats and complex repeat architectures involving many different types of repeats in the query sequence.

The mitochondrial targeting sequence and the cleavage site of the protein were predicted using MitoProt [10], which analyzes the N-terminal region of the protein. MitoProt also estimates the probability that the protein is imported into the mitochondria.


The similarity search carried out with BLAST showed 40% similarity, with a score of 112, to the human protein PRO0657, which is itself a hypothetical protein. No conserved domain was detected in the protein (Figure 1 and Table 1).

The physico-chemical analysis shows that the protein has 196 amino acids, a molecular weight of 22299.1 Da and a theoretical pI of 8.69. The most abundant amino acid in the sequence is leucine (9.7%) and the least abundant amino acid present is phenylalanine (0.5%) (Table 2).

The total number of negatively charged residues (Asp + Glu) is 25 and the total number of positively charged residues (Arg + Lys) is 29. The instability index of the protein was computed to be 53.06, which classifies the protein as unstable (ProtParam treats values above 40 as unstable). The grand average of hydropathicity was calculated to be -0.931 and the aliphatic index 73.11.

Secondary structure analysis of the protein indicates that random coils are the most frequent element (42.35%), followed by alpha helices (30.61%) and extended strands (16.33%), with beta turns the least frequent (10.71%) (Figure 2).
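As a consistency check, the reported percentages can be converted back to residue counts using the sequence length of 196 given by the physico-chemical analysis. The Python sketch below is illustrative only; the dictionary and function names are assumptions and are not part of the SOPMA output.

```python
# Sequence length reported by the physico-chemical analysis.
LENGTH = 196

# Secondary structure percentages as reported in the text.
SOPMA_PERCENT = {
    "random coil": 42.35,
    "alpha helix": 30.61,
    "extended strand": 16.33,
    "beta turn": 10.71,
}


def to_counts(percentages, length):
    """Convert percentage of residues per state to whole residue counts."""
    return {state: round(p * length / 100) for state, p in percentages.items()}


counts = to_counts(SOPMA_PERCENT, LENGTH)
print(counts)
# {'random coil': 83, 'alpha helix': 60, 'extended strand': 32, 'beta turn': 21}
print(sum(counts.values()))  # 196
```

The counts add up to the full sequence length, so the four states account for every residue of the protein.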

The protein was not predicted to be a transmembrane protein, as no transmembrane helices, either outside-to-inside or inside-to-outside, scored above the significance threshold of 500 (Figure 3).

Analysis of disorder showed that the sequence is mostly disordered by the loops/coils definition, with a maximum probability of 0.8 (Figure 4). Figure 5 shows the sequence regions predicted as disordered by loops/coils.


Figure 1: BLAST result for alignment score


Repeat analysis of the protein predicted two repeats with a calculated score of 136.27 (Figure 6).

The mitochondrial targeting sequence analysis gave a probability of 0.4267 for export of the protein to the mitochondria; the targeting sequence contains 7 basic and 2 acidic residues. No cleavage site was predicted. On the ECS scale, the mean Hd and Hmax values for the protein conform with those reported for the targeting sequences of mitochondrial proteins (Table 3).

Table 1: BLAST result of the hypothetical protein producing the significant alignment

Accession number    Similar hits                              Score    E-value
AAF24054.1          PRO0657 [Homo sapiens]                    112      3e-23
EAW63194.1          hCG2040615 [Homo sapiens]                 90.5     1e-16
BAC86569.1          unnamed protein product [Homo sapiens]    89.0     3e-16
BAC85305.1          unnamed protein product [Homo sapiens]    88.6     3e-16
EAW97490.1          hCG2038211 [Homo sapiens]                 84.0     9e-15
BAH12393.1          unnamed protein product [Homo sapiens]    82.8     2e-14
BAC05317.1          unnamed protein product [Homo sapiens]    75.5     3e-12
EAW78405.1          hCG2021310 [Homo sapiens]                 70.9     9e-11
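The scores and E-values in Table 1 can be related with a short calculation. Assuming the Score column reports bit scores S', the standard BLAST relation E = m·n·2^(-S'), where m·n is the search space, implies m·n = E·2^(S'). The Python sketch below applies this to each hit; the function names are illustrative, and because BLAST works with effective (edge-corrected) lengths and the table values are rounded, the result is only a rough estimate.

```python
# (accession, score, E-value) as listed in Table 1.
# ASSUMPTION: the score column is a bit score.
TABLE_1 = [
    ("AAF24054.1", 112.0, 3e-23),
    ("EAW63194.1", 90.5, 1e-16),
    ("BAC86569.1", 89.0, 3e-16),
    ("BAC85305.1", 88.6, 3e-16),
    ("EAW97490.1", 84.0, 9e-15),
    ("BAH12393.1", 82.8, 2e-14),
    ("BAC05317.1", 75.5, 3e-12),
    ("EAW78405.1", 70.9, 9e-11),
]


def e_value_from_bits(bit_score, search_space):
    """E = (m * n) * 2^(-S'), with search_space standing for m * n."""
    return search_space * 2.0 ** (-bit_score)


def implied_search_space(bit_score, e_value):
    """Invert the relation above: m * n = E * 2^(S')."""
    return e_value * 2.0 ** bit_score


spaces = {acc: implied_search_space(score, e) for acc, score, e in TABLE_1}
for acc, space in spaces.items():
    print(f"{acc}: implied search space ~ {space:.2e}")
```

Under these assumptions every hit implies a search space between about 1.4e11 and 2.0e11, so the reported scores and E-values are mutually consistent to within the rounding of the E-values.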


Table 2: Physico-chemical properties of the protein

Amino acid    Number of residues    Percentage of residues
Ala (A)
Arg (R)
Asn (N)
Asp (D)
Cys (C)
Gln (Q)
Glu (E)
Gly (G)
His (H)
Ile (I)
Leu (L)
Lys (K)
Met (M)
Phe (F)
Pro (P)
Ser (S)
Thr (T)
Trp (W)
Tyr (Y)
Val (V)
Pyl (O)
Sec (U)
Asx (B)
Glx (Z)
Xaa (X)


Figure 2: Graphical representation of secondary elements in the protein


Figure 3: Prediction plot for the topology assignment of unknown proteins


Figure 4: Disordered region of the protein


Figure 5: Disordered sequences of the protein


Figure 6: Repeat sequences of the protein


Table 3: Hydrophobic scale

Parameter    GES       KD        GvHI      ECS
H17          -0.200    0.535     -0.290    0.297
MesoH        -1.730    -0.412    -0.682    -0.030
MuHd_075     27.707    21.130    8.072     6.969
MuHd_095     43.207    28.666    11.310    10.568
MuHd_100     35.264    23.299    10.444    8.598
MuHd_105     40.390    27.975    12.789    9.139
Hmax_075     5.100     7.700     -0.132    3.180
Hmax_095     15.100    16.900    3.427     6.280
Hmax_100     14.700    16.500    4.331     4.030
Hmax_105     16.100    20.500    5.192     6.390


Where GES: Goldman, Engelman and Steitz scale; KD: Kyte and Doolittle scale; GvHI: Gunnar von Heijne scale; ECS: Eisenberg’s consensus scale; MesoH: average of the maximal hydrophobicity; Hd: hydrophobic moments; Hmax: hydrophobic faces.
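The mean Hd and Hmax values referred to in the text can be obtained by averaging the MuHd and Hmax rows of Table 3 for each scale. The Python sketch below does this under the assumption that the four numeric columns follow the order of the scales in the legend (GES, KD, GvHI, ECS); the variable and function names are illustrative and are not part of the MitoProt output.

```python
# ASSUMPTION: column order follows the legend (GES, KD, GvHI, ECS).
SCALES = ("GES", "KD", "GvHI", "ECS")

# MuHd and Hmax rows copied from Table 3.
MU_HD = {
    "MuHd_075": (27.707, 21.130, 8.072, 6.969),
    "MuHd_095": (43.207, 28.666, 11.310, 10.568),
    "MuHd_100": (35.264, 23.299, 10.444, 8.598),
    "MuHd_105": (40.390, 27.975, 12.789, 9.139),
}
HMAX = {
    "Hmax_075": (5.100, 7.700, -0.132, 3.180),
    "Hmax_095": (15.100, 16.900, 3.427, 6.280),
    "Hmax_100": (14.700, 16.500, 4.331, 4.030),
    "Hmax_105": (16.100, 20.500, 5.192, 6.390),
}


def column_means(rows):
    """Average each scale's column over the given rows."""
    columns = zip(*rows.values())
    return {scale: sum(col) / len(col) for scale, col in zip(SCALES, columns)}


mean_hd = column_means(MU_HD)
mean_hmax = column_means(HMAX)
for scale in SCALES:
    print(f"{scale}: mean Hd = {mean_hd[scale]:.3f}, mean Hmax = {mean_hmax[scale]:.3f}")
```

With this column assignment, the ECS means are about 8.82 for Hd and 4.97 for Hmax.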


The analysis of the hypothetical protein showed 40% sequence similarity to another human hypothetical protein whose function is yet to be determined. Secondary structure analysis shows random coils to be more frequent than any other secondary structure element, indicating a high level of conservation and stability of the protein structure. Although the protein is disordered in some regions, the extent of disorder differs according to the definition used. The repeats identified in the protein are tandemly repeated modules of 29 amino acids. The mitochondrial targeting probability of 0.4267 indicates a non-mitochondrial targeting sequence or a non-mitochondrial localization; however, the hydrophobic scales show low values for H17 and MesoH and high values for the mean Hd and Hmax, which indicates that the protein may also be a mitochondrial protein. Further research involving the development of appropriate strategies for studying these repeats and the protein as a whole could be of significance in relation to the genes encoding the domains and their transfer.



  1. Goodman M, Tagle D, Fitch D, Bailey W, Czelusniak J, Koop B, Benson P, Slightom J. Primate evolution at the DNA level and a classification of hominoids. J Mol Evol 30(3): 260–266 (1990).
  2. About the Human Genome Project: What is the Human Genome Project. The Human Genome Management Information System (HGMIS). 2011-07-18. Retrieved 2011-09-02.
  3. Bhattacharjee A, Choudhury H, Maheswari U, Joshi S. In-silico prediction of structural and functional aspects of a hypothetical protein of Arabidopsis thaliana (L) Heynh. Advanced Biotech pp 14-16 (2008).
  4. Sivashankari S, Shanmughavel P. Functional annotation of hypothetical proteins - A review. Bioinformation 1(8): 335-338 (2006).
  5. Altschul S, Gish W, Miller W, Myers E, Lipman D. Basic local alignment search tool. Journal of Molecular Biology 215(3): 403–410 (1990).
  6. Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A. Protein Identification and Analysis Tools on the ExPASy Server; (In) John M. Walker (ed): The Proteomics Protocols Handbook, Humana Press pp 571-607 (2005).
  7. Geourjon C, Deleage G. SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments. Comput Appl Biosci 11(6): 681-684 (1995).
  8. Hofmann K, Stoffel W. TMbase - A database of membrane spanning proteins segments. Biol. Chem. Hoppe-Seyler 374: 166 (1993).
  9. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB. Protein disorder prediction: implications for structural proteomics. Structure 11(11) (2003).
  10. Claros MG, Vincens P. Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur. J. Biochem. 241: 779-786 (1996).