Ajit Kumar Roy
Dept. of Economics and Statistics,
College of Fisheries, CAU, Lembucherra, Tripura – 792210
Biotechnology is progressing at a fast pace with a huge amount of data being continuously generated worldwide. This article discusses the sources, tools and uses of biotechnological data.
An exceptional wealth of biological data has been generated by the Human Genome Project and by sequencing projects in many other organisms. The huge demand for analysis and interpretation of these data is being met by bioinformatics. Bioinformatics is defined as the application of tools of computation and analysis to the capture and interpretation of biological data. It is an interdisciplinary field that harnesses computer science, mathematics, physics, and biology.
Bayat (2002) accurately and broadly defines the discipline as “the application of tools of computation and analysis to the capture and interpretation of biological data” and, operationally, states that “The main tools of a bioinformatician are computer software programs and the internet. A fundamental activity is sequence analysis of DNA and proteins using various programs and databases available on the world wide web”.
NCBI also defines bioinformatics as a single broad discipline, but with three “important sub-disciplines”:
• “The development of new algorithms and statistics with which to assess relationships among members of large data sets;
• The analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and
• The development and implementation of tools that enable efficient access and management of different types of information” (http://www.ncbi.nlm.nih.gov/Education/)
However, Ellis (2003a) notes 40 published operational definitions between 2000-2001 and another 37 (2003b) in 2003, suggesting the definitions vary by subdiscipline. It appears that several specialties were working out for themselves their roles in molecular biology and the legacy of computational biology’s influence on their work.
Features of Bioinformatics
Bioinformatics has three main features. First, at its simplest, bioinformatics organizes data in a way that allows researchers to access existing information and to submit new entries as they are produced, for example 3D macromolecular structures. While data curation is an essential task, the information stored in these databases is essentially useless until it is analyzed, so the purpose of bioinformatics extends much further.
The second feature is the development of tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterized sequences. This needs more than a simple text-based search: programs such as alignment tools and sequence homology search tools must consider what constitutes a biologically significant match. Developing such resources requires expertise in computational theory as well as a thorough understanding of biology.
The third feature is the use of these tools to analyze the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail and frequently compared them with a few related ones. In bioinformatics, we can now conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlighting novel features.
In recent years, however, new directions have emerged that are shaping the future of bioinformatics. The practice of studying genetic disorders is changing from the investigation of single genes in isolation to discovering cellular networks of genes, understanding their complex interactions, and identifying their role in disease (Debouk and Metcalf, 2000). As a result, a whole new age of individually tailored medicine will emerge. Bioinformatics will guide and help molecular biologists and clinical researchers to capitalize on the advantages brought by computational biology (Butler, 2001). The clinical research teams that will be most successful in the coming decades will be those that can switch effortlessly between the laboratory bench, clinical practice, and the use of these sophisticated computational tools. Artificial intelligence, through machine learning and neural networks, is being used to gain a better understanding of disease. This holds the prospect of designing new algorithms that make better use of artificial intelligence.
Table 1. Sources of data used in bioinformatics, the quantity of each type of data that was available, and bioinformatics subject areas that utilize this data.
|Data source||Bioinformatics topics|
|Raw DNA sequence||Separating coding and non-coding regions; Identification of introns and exons; Gene product prediction; Forensic analysis|
|Protein sequence||Sequence comparison algorithms; Multiple sequence alignment algorithms; Identification of conserved sequence motifs|
|Macromolecular structure||Secondary, tertiary structure prediction; 3D structural alignment algorithms; Protein geometry measurements; Surface and volume shape calculations; Intermolecular interactions|
|Genomes||Characterisation of repeats; Structural assignments to genes; Phylogenetic analysis; Genomic-scale censuses (characterisation of protein content, metabolic pathways); Linkage analysis relating specific genes to diseases|
|Gene expression||Correlating expression patterns; Mapping expression data to sequence, structural and biochemical data|
|Literature||Digital libraries for automated bibliographical searches; Knowledge databases of data from literature|
|Metabolic pathways||Pathway simulations|
Applications of Bioinformatics:
1. Sequence Analysis:
Sequence analysis has long been an important technique in bioinformatics. Apart from the basic features that represent the nucleotide or amino acid at each position in a sequence, many other features can be derived, such as higher-order combinations of these building blocks, whose number grows exponentially with the pattern length. The prediction of subsequences that code for proteins has been a focus of interest since the early days of bioinformatics (Saeys et al., 2007). To deal with the large number of possible features and the often limited number of samples, Salzberg et al. (1998) introduced the interpolated Markov model (IMM), which interpolates between different orders of the Markov model to cope with small sample sizes and uses a filter method to select only relevant features.
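The exponential growth in the number of sequence features can be made concrete with a small sketch. The code below counts overlapping k-mers (patterns of length k) in a DNA sequence and builds a fixed-length feature vector with one entry per possible pattern. The function names and the toy sequence are illustrative assumptions, not part of any cited method, and this is not an interpolated Markov model.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # DNA alphabet; a protein alphabet would have 20 letters


def kmer_counts(sequence, k):
    """Count every overlapping k-mer (substring of length k) in a sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_feature_vector(sequence, k):
    """Return counts for all 4**k possible k-mers, in lexicographic order."""
    counts = kmer_counts(sequence, k)
    return [counts["".join(kmer)] for kmer in product(ALPHABET, repeat=k)]


if __name__ == "__main__":
    toy_sequence = "ATGCGATATGC"  # hypothetical example sequence
    for k in (1, 2, 3, 4):
        vector = kmer_feature_vector(toy_sequence, k)
        # The vector length is 4**k, while the number of observed
        # k-mers is bounded by the sequence length.
        print(k, len(vector), sum(vector))
```

With a four-letter alphabet the vector has 4, 16, 64 and 256 entries for k = 1 to 4, while an 11-base sequence contains at most 11 patterns of any length, which is why feature selection becomes necessary as k grows.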
A second class of techniques focuses on the prediction of protein function from sequence. The early work of Chuzhanova et al. (1998), who combined a genetic algorithm with the Gamma test to score feature subsets for classification of large subunits of rRNA, inspired researchers to use feature selection (FS) techniques to focus on important subsets of amino acids that relate to the protein’s functional class (Al-Shahib et al., 2005). An interesting technique is described in Zavaljevsky et al. (2002), which uses selective kernel scaling for support vector machines (SVM) as a way to assess feature weights and subsequently remove features with low weights.
Sequences are also involved in the recognition of conserved signals, which mainly represent binding sites for various proteins or protein complexes. A common approach to finding regulatory motifs is to relate motifs to gene expression levels using a regression approach (Saeys et al., 2007). Feature selection can then be used to search for the motifs that maximize the fit to the regression model (Keles et al., 2002; Tadesse et al., 2004). In Sinha (2003), a classification approach is chosen to find discriminative motifs.
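The regression idea can be illustrated with a deliberately simplified sketch. For each candidate motif, the code below counts its occurrences in a set of promoter sequences, fits a one-variable least-squares line against expression, and ranks motifs by the fit (R squared). This univariate ranking is an assumption made for illustration and is much simpler than the methods of Keles et al. (2002) or Tadesse et al. (2004). All sequences, expression values and function names are hypothetical.

```python
def count_motif(sequence, motif):
    """Count overlapping occurrences of a motif in a sequence."""
    width = len(motif)
    return sum(1 for i in range(len(sequence) - width + 1)
               if sequence[i:i + width] == motif)


def r_squared(x, y):
    """R^2 of the least-squares line y = a + b*x (0.0 if x or y is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def rank_motifs(promoters, expression, motifs):
    """Rank candidate motifs by how well their counts explain expression."""
    scores = {}
    for motif in motifs:
        counts = [count_motif(promoter, motif) for promoter in promoters]
        scores[motif] = r_squared(counts, expression)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data: one promoter sequence and one expression value per gene.
    promoters = ["ACGTACGTAA", "ACGTTTTTTT", "TTTTTTTTTT", "ACGTACGTACGT"]
    expression = [2.0, 1.0, 0.0, 3.0]
    for motif, score in rank_motifs(promoters, expression, ["ACGT", "TTTT"]):
        print(motif, round(score, 3))
```

A real analysis would fit several motifs jointly and guard against overfitting, since many candidate motifs are tested against few genes.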
2. High-throughput genome analysis:
Advances in sequencing technology have led to a remarkable increase in the production of experimental data. Genomics studies now typically involve the analysis of dozens of sequencing datasets. Data such as transcripts/genes, exons/introns, promoter sites, alignments, binding sites, repeat elements, microarray probes, sequencing data (RNA-seq, ChIP-seq, DNA-seq, etc.), or chromosomal conformations (3C-seq, 4C-seq, etc.) can be represented as genomic regions, i.e. ordered sets of genomic intervals, which in turn are defined as tuples: <chromosome, strand, start position, end position>. Handling this amount of data requires efficient management of computational resources such as time and memory, as well as of development time (Tsirigos et al., 2012). Gene finding (the prediction of introns and exons in a segment of DNA sequence), sequence comparison, transcriptome analysis and many other genome analyses are carried out with bioinformatics tools, so that each dataset can be studied thoroughly.
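The interval tuple described above maps naturally onto a small data structure. The sketch below is a minimal illustration and does not reproduce the GenomicTools API. The class and function names are hypothetical, and the coordinate convention (0-based start, exclusive end) is an assumption, since the text does not specify one.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval: <chromosome, strand, start position, end position>."""
    chromosome: str
    strand: str   # "+" or "-"
    start: int    # assumed 0-based, inclusive
    end: int      # assumed exclusive (half-open interval)


def overlaps(a, b, ignore_strand=True):
    """Return True if two intervals share at least one position."""
    if a.chromosome != b.chromosome:
        return False
    if not ignore_strand and a.strand != b.strand:
        return False
    return a.start < b.end and b.start < a.end


def as_region(intervals):
    """Order intervals by chromosome and position to form a genomic region."""
    return sorted(intervals, key=lambda iv: (iv.chromosome, iv.start, iv.end))


if __name__ == "__main__":
    # Hypothetical exon and binding-site coordinates.
    exon = Interval("chr1", "+", 100, 200)
    site = Interval("chr1", "-", 150, 250)
    print(overlaps(exon, site))                       # strand ignored
    print(overlaps(exon, site, ignore_strand=False))  # strand respected
    print(as_region([site, exon]))
```

Keeping regions sorted is what allows overlap queries over large datasets to run as a linear sweep rather than comparing every pair of intervals.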
3. Microarray analysis:
The advent of microarray datasets motivated a new line of research in bioinformatics. Microarray data pose a great challenge for computational techniques because of their high dimensionality and small sample sizes (Somorjai et al., 2003). To deal with these particular characteristics of microarray data, the need for dimension reduction techniques was quickly recognized (Alon et al., 1999; Ben-Dor et al., 2000; Golub et al., 1999; Ross et al., 2000), and their application soon became a de facto standard in the field.
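One simple form of dimension reduction is a filter that scores each gene separately and keeps only the highest-scoring ones. The sketch below ranks genes by a signal-to-noise style score, the difference in class means divided by the sum of class standard deviations. The choice of score, the function names and the expression values are assumptions made for illustration, not the exact procedure of any cited study.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """(mean_a - mean_b) / (sd_a + sd_b); 0.0 when both groups are constant."""
    spread = stdev(values_a) + stdev(values_b)
    if spread == 0:
        return 0.0
    return (mean(values_a) - mean(values_b)) / spread


def top_genes(expression, labels, n_keep):
    """Keep the n_keep genes that best separate two sample classes.

    expression: dict mapping gene name -> list of values, one per sample
    labels: list of class labels, one per sample (exactly two classes)
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = {}
    for gene, values in expression.items():
        group_a = [v for v, lab in zip(values, labels) if lab == classes[0]]
        group_b = [v for v, lab in zip(values, labels) if lab == classes[1]]
        scores[gene] = abs(signal_to_noise(group_a, group_b))
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]


if __name__ == "__main__":
    # Hypothetical data: three genes measured in six samples.
    labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
    expression = {
        "geneA": [8.1, 7.9, 8.3, 2.0, 2.2, 1.9],
        "geneB": [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],
        "geneC": [1.0, 6.0, 3.0, 2.0, 5.0, 4.0],
    }
    print(top_genes(expression, labels, n_keep=2))
```

Because each gene is scored independently, such a filter is fast on thousands of genes but ignores interactions between them.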
4. Mass spectra analysis:
Mass spectrometry (MS) technology is an emerging and attractive framework for disease diagnosis and protein-based biomarker profiling (Petricoin and Liotta, 2003). A mass spectrum sample is characterized by mass/charge (m/z) ratios on the x-axis, each with its corresponding signal intensity value on the y-axis. For data mining and bioinformatics purposes, it can initially be assumed that each m/z ratio represents a distinct variable whose value is the intensity. As Somorjai et al. (2003) explain, the data analysis step is severely constrained by both the high-dimensional input space and its inherent sparseness. Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different samples (Coombes et al., 2007), the next step is to extract the variables that will represent the initial pool of candidate discriminative features.
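The representation described above, one variable per m/z ratio with intensity as its value, can be sketched as follows. The code aligns several spectra on a shared list of m/z values and rescales each spectrum so that its intensities sum to 1. This total-intensity scaling is one common normalization, chosen here as an assumption. The sketch omits noise reduction entirely, and the function names and values are hypothetical.

```python
def normalize_total_intensity(intensities):
    """Scale intensities so that they sum to 1 (unchanged if all are zero)."""
    total = sum(intensities)
    if total == 0:
        return list(intensities)
    return [value / total for value in intensities]


def spectra_to_matrix(spectra):
    """Turn spectra into a samples-by-variables matrix.

    spectra: list of dicts mapping m/z ratio -> signal intensity
    Returns (mz_values, rows): the sorted m/z ratios used as variables and
    one normalized intensity row per spectrum. An m/z ratio missing from a
    spectrum is assumed to have intensity 0.
    """
    mz_values = sorted({mz for spectrum in spectra for mz in spectrum})
    rows = []
    for spectrum in spectra:
        raw = [spectrum.get(mz, 0.0) for mz in mz_values]
        rows.append(normalize_total_intensity(raw))
    return mz_values, rows


if __name__ == "__main__":
    # Hypothetical spectra from two samples.
    sample_1 = {1000.0: 2.0, 1000.5: 6.0}
    sample_2 = {1000.0: 1.0, 1001.0: 3.0}
    mz_values, rows = spectra_to_matrix([sample_1, sample_2])
    print(mz_values)
    for row in rows:
        print(row)
```

Each row can then be passed to a feature selection step, since a real spectrum may contain many thousands of m/z variables for only a few dozen samples.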
References
- Bayat, A. (2002, April). Science, medicine, and the future: Bioinformatics. British Medical Journal 324: 1018-1022. Retrieved June 24, 2003 from http://bmj.com/cgi/reprint/324/7344/1018.
- Ellis, L. (2003a). What is bioinformatics? 2000-2001. Retrieved June 24, 2003 from http://www.binf.umn.edu/whatsbinf2000.html
- Ellis, L. (2003b). What is bioinformatics? 2003. Retrieved June 24, 2003 from http://www.binf.umn.edu/whatsbinf.html
- Debouk C, Metcalf B. The impact of genomics on drug discovery. Annu Rev Pharmacol Toxicol 2000;40:193-208.
- Butler D. Are you ready for the revolution? Nature 2001;409:758-60.
- Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007 Oct 1;23(19):2507-17. Epub 2007 Aug 24.
- Salzberg,S., et al. (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res., 26, 544–548.
- Chuzhanova,N., et al. (1998) Feature selection for genetic sequence classification. Bioinformatics, 14, 139–143.
- Al-Shahib,A., et al. (2005) Feature selection and the class imbalance problem in predicting protein function from sequence. Appl. Bioinformatics, 4, 195–203.
- Zavaljevsky,N., et al. (2002) Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions. Bioinformatics, 18, 689–696.
- Keles,S., et al. (2002) Identification of regulatory elements using a feature selection method. Bioinformatics, 18, 1167–1175.
- Tadesse,M., et al. (2004) Identification of DNA regulatory motifs using Bayesian variable selection. Bioinformatics, 20, 2553–2561.
- Sinha,S. (2003) Discriminative motifs. J. Comput. Biol., 10, 599–615.
- Tsirigos A, Haiminen N, Bilal E, Utro F. GenomicTools: a computational platform for developing high-throughput analytics in genomics. Bioinformatics. 2012 Jan 15;28(2):282-3. Epub 2011 Nov 22.
- Somorjai,R., et al. (2003) Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics, 19, 1484–1491.
- Alon,U. et al. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. USA, 96, 6745–6750.
- Ben-Dor,A., et al. (2000) Tissue classification with gene expression profiles. J. Comput. Biol., 7, 559–584.
- Golub,T., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
- Ross,D., et al. (2000) Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet., 24, 227–234.
- Petricoin,E. and Liotta,L. (2003) Mass spectrometry-based diagnostics: the upcoming revolution in disease detection. Clin. Chem., 49, 533–534.
- Coombes,K., et al. (2007) Pre-processing mass spectrometry data. In Dubitzky,M., et al. (eds.), Fundamentals of Data Mining in Genomics and Proteomics. Kluwer, Boston, pp. 79–99.