Although the complete sequence is already known for the majority of proteins, their molecular function is not yet fully determined. Predicting protein function remains a bottleneck in computational biology research, and many experimental and computational techniques have been developed to infer protein function from interactions with other biomolecules. Large-scale, high-throughput techniques can detect proteins that interact within an organism. The best known among them are pull-down assays [ 2 ], tandem affinity purification (TAP) [ 3 ], yeast two-hybrid (Y2H) [ 4 ], mass spectrometry [ 5 ], microarrays [ 6 ] and phage display [ 7 ].
Widely used datasets recently produced with the aforementioned techniques include the Tong [ 8 ], Krogan [ 9 ], DIP [ 10 ], MIPS [ 11 ], Gavin [ 5 ] and Gavin [ 12 ] datasets. Besides the various experimental methods, a variety of large biological databases containing protein-protein interaction (PPI) data are already available, most of them organism specific. Two additional well-documented services based on text mining analysis are the Stitch [ 22 ] and String [ 23 ] databases.
Gene regulatory networks (GRNs) contain information concerning the control of gene expression in cells. This process is modulated by many variables, such as transcription factors [ 24 ], their post-translational modifications or their association with other biomolecules [ 25 ]. These networks usually use a directed graph representation to model how proteins and other biological molecules are involved in gene expression, and they try to imitate the series of events that take place at different stages of the process.
They often exhibit specific motifs and patterns in their topology. Data collection, data integration and analysis techniques now make it possible to study gene regulatory networks on a larger scale [ 26 ]. Signal transduction networks often use multi-edged directed graphs to represent a series of interactions between different bioentities such as proteins, chemicals or macromolecules, and to investigate how signals are transmitted either from the outside to the inside of the cell or within the cell.
Environmental parameters change the homeostasis of the cell and, depending on the circumstances, different responses can be triggered. Like GRNs, these networks also exhibit common patterns and motifs in their topology [ 34 ]. Metabolic and biochemical networks [ 37 ] are powerful tools for studying and modelling metabolism in various organisms. We define a metabolic pathway as a series of chemical reactions occurring within a cell at different time points.
Enzymes play the main role within a metabolic network, since they are the main determinants in catalyzing biochemical reactions.
Enzymes often depend on cofactors such as vitamins to function properly. The collection of pathways, holding information about a series of biochemical events and the way they are correlated, is called a metabolic network.
Modern sequencing techniques allow the reconstruction of the network of biochemical reactions in many organisms, from bacteria to human [ 38 , 39 ]. Several methods have also been developed to analyze the pathway structure of metabolic networks [ 44–48 ]. Many computer-readable formats are available to describe biological networks. The Systems Biology Markup Language (SBML) can represent metabolic networks, cell signaling pathways, regulatory networks, and many other kinds of systems [ 50 ].
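As a rough illustration of what such a machine-readable description looks like, the following sketch reads a heavily simplified SBML-style fragment and turns each reaction into directed substrate-to-product edges. The pathway, the species and reaction identifiers, and the helper name `reaction_edges` are hypothetical, and the fragment omits attributes that a validating SBML tool would require; it is meant only to show how a network can be recovered from the XML structure, not to serve as a reference SBML reader.

```python
import xml.etree.ElementTree as ET

# XML namespace used by SBML Level 3 Version 1 core documents.
SBML_NS = "http://www.sbml.org/sbml/level3/version1/core"

# Hypothetical two-step pathway. Compartments, units and other attributes
# required by a validating SBML tool are omitted to keep the sketch short.
TOY_SBML = f"""
<sbml xmlns="{SBML_NS}" level="3" version="1">
  <model id="toy_pathway">
    <listOfSpecies>
      <species id="glucose"/>
      <species id="g6p"/>
      <species id="f6p"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="R1">
        <listOfReactants><speciesReference species="glucose"/></listOfReactants>
        <listOfProducts><speciesReference species="g6p"/></listOfProducts>
      </reaction>
      <reaction id="R2">
        <listOfReactants><speciesReference species="g6p"/></listOfReactants>
        <listOfProducts><speciesReference species="f6p"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def reaction_edges(sbml_text):
    """Return (substrate, product, reaction_id) triples, one per directed edge."""
    ns = {"sbml": SBML_NS}
    root = ET.fromstring(sbml_text.strip())
    edges = []
    for reaction in root.iterfind(".//sbml:reaction", ns):
        reactants = [ref.get("species") for ref in
                     reaction.iterfind("sbml:listOfReactants/sbml:speciesReference", ns)]
        products = [ref.get("species") for ref in
                    reaction.iterfind("sbml:listOfProducts/sbml:speciesReference", ns)]
        # Every reactant is linked to every product of the same reaction.
        edges.extend((r, p, reaction.get("id")) for r in reactants for p in products)
    return edges


edges = reaction_edges(TOY_SBML)
```

For real models, a dedicated SBML library would be the safer choice, since it handles validation and the full specification.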
Other formats that can be used in similar ways are the Cell Markup Language (CellML) [ 55 ], an XML-based machine-readable language mainly developed for the exchange of computer-based mathematical models, and the Resource Description Framework (RDF), a language for representing information about resources on the World Wide Web [ 56 , 57 ].
Having given a short overview of how data can be produced experimentally or retrieved from various databases, and of which formats are available for each type of network, we now focus on the computational analysis of networks as defined in graph theory.
We conclude by describing which of the properties discussed below characterize the various networks. To introduce the basic concepts of graph theory, we give both the empirical and the mathematical description of graphs that represent networks, as originally defined in the literature [ 58 , 59 ]. A graph $G$ can be defined as a pair $(V, E)$, where $V$ is a set of vertices representing the nodes and $E$ is a set of edges representing the connections between the nodes. An edge $(i, j) \in E$ is a single connection between nodes $i$ and $j$.
In this case, we say that $i$ and $j$ are neighbors. A multi-edge connection consists of two or more edges that have the same endpoints.
Such multi-edges are especially important for networks in which two elements can be linked by more than one connection. In such cases, each connection indicates a different type of information.
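A minimal sketch of a multi-edge network is given here, assuming an undirected graph in which every connection carries a label describing its type. The class name `MultiGraph`, the protein identifiers and the connection labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class MultiGraph:
    """Undirected graph that allows several typed edges between two nodes."""

    def __init__(self):
        # node -> neighbor -> list of connection types
        self._adj = defaultdict(lambda: defaultdict(list))

    def add_edge(self, i, j, kind):
        """Add one connection of the given type between nodes i and j."""
        self._adj[i][j].append(kind)
        if i != j:
            self._adj[j][i].append(kind)

    def neighbors(self, i):
        """Nodes that share at least one edge with node i."""
        return sorted(self._adj.get(i, {}))

    def edge_types(self, i, j):
        """All connection types recorded between nodes i and j."""
        return list(self._adj.get(i, {}).get(j, []))


# Hypothetical proteins linked by three different kinds of evidence.
g = MultiGraph()
g.add_edge("P1", "P2", "homology")
g.add_edge("P1", "P2", "co-occurrence")
g.add_edge("P1", "P2", "co-expression")
g.add_edge("P2", "P3", "co-expression")

print(g.neighbors("P2"))         # nodes adjacent to P2
print(g.edge_types("P1", "P2"))  # the parallel edges between P1 and P2
```

Keeping the types in a list per node pair preserves each connection separately, so the parallel edges are not collapsed into a single link.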
This feature is important because in some networks, such as protein-protein interaction networks, two proteins might be evolutionarily related, co-occur in the literature or be co-expressed in some experiments, resulting in three different connections, each with a different meaning. An example of a PPI database that takes the different types of interactions between proteins into account is String [ 23 ]. In a directed graph, each edge is an ordered pair of vertices; these ordered pairs are called directed edges, arcs or arrows.
Directed graphs are best suited to representing schemas of biological pathways or procedures that show the sequential interaction of elements at one or multiple time points and the flow of information through the network. These are mainly metabolic, signal transduction or regulatory networks [ 34 ]. In a weighted graph, each edge carries a weight; most of the time, the weight $w_{ij}$ of the edge between nodes $i$ and $j$ represents the relevance of the connection.
Usually, a larger weight corresponds to higher reliability of a connection. Weighted graphs are currently the most widely used networks throughout the field of bioinformatics. For example, weights are frequently assigned to relations in biological data to capture the relevance of co-occurrences identified by text mining, sequence or structural similarities between proteins, or co-expression of genes [ 23 , 60 ]. Bipartite graphs, in which the nodes fall into two disjoint sets and every edge connects a node of one set to a node of the other, are applied to the visualization or modeling of biological networks, ranging from enzyme-reaction links in metabolic pathways to ontologies or ecological connections, as discussed in [ 61 ] or [ 62 ].
Examples and shapes describing the aforementioned graph types can be found in Figure 1. The most common data structures that are used to make these networks computer readable are adjacency matrices or adjacency lists. The following section provides a short mathematical description of these data structures.
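Before the formal description, the two data structures can be sketched in a few lines. The example assumes a hypothetical directed graph with five nodes and six edges; the node labels, the edge list and the helper name `build_adjacency` are illustrative only.

```python
def build_adjacency(nodes, edges, directed=True):
    """Build an adjacency matrix and an adjacency list from an edge list."""
    index = {node: k for k, node in enumerate(nodes)}
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]      # 0 = no edge, 1 = edge
    adj_list = {node: [] for node in nodes}   # node -> list of successors
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
        adj_list[source].append(target)
        if not directed and source != target:
            # An undirected edge is stored in both directions.
            matrix[index[target]][index[source]] = 1
            adj_list[target].append(source)
    return matrix, adj_list


# Hypothetical directed graph: five nodes, six directed edges.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E"), ("E", "A")]

matrix, adj_list = build_adjacency(nodes, edges)
for row in matrix:
    print(row)
print(adj_list)
```

The matrix needs memory proportional to the square of the number of nodes regardless of how many edges exist, whereas the list grows only with the number of edges, which is why lists are usually preferred for sparse networks.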
Figure 1: Undirected, directed, weighted and bipartite graphs.

The degree of a node is the number of edges connected to it. If a network is directed, then each node has two different degrees: the in-degree $\mathrm{deg}_{in}(i)$, which is the number of incoming edges to node $i$, and the out-degree $\mathrm{deg}_{out}(i)$, which is the number of outgoing edges from node $i$. The total connectivity of a network is defined as $C = \frac{E}{N(N-1)}$, where $E$ is the number of edges and $N$ the total number of nodes. The connectivity structure of biological networks is often informative with respect to reaction interplay and reversibility, compounds that structure the network, as in metabolism, or trophic relationships, as in food-web networks.
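The degree definitions above can be sketched as follows for a hypothetical directed graph given as an edge list. The connectivity function assumes the form C = E / (N(N - 1)), that is, the number of edges divided by the number of ordered node pairs; this form is an assumption of the sketch, and the function names are illustrative.

```python
def degrees(nodes, edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_deg = {node: 0 for node in nodes}
    out_deg = {node: 0 for node in nodes}
    for source, target in edges:
        out_deg[source] += 1   # edge leaves the source node
        in_deg[target] += 1    # edge enters the target node
    return in_deg, out_deg


def total_connectivity(num_nodes, num_edges):
    """Connectivity C = E / (N (N - 1)); this form is assumed, see lead-in."""
    if num_nodes < 2:
        raise ValueError("connectivity needs at least two nodes")
    return num_edges / (num_nodes * (num_nodes - 1))


# Hypothetical directed graph: five nodes, six directed edges.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E"), ("E", "A")]

in_deg, out_deg = degrees(nodes, edges)
connectivity = total_connectivity(len(nodes), len(edges))
print(in_deg, out_deg, connectivity)
```

Summing either dictionary gives the number of edges, which is a quick consistency check on an edge list.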
Such connectivity profiles can be detected with mixture models, using software such as MixNet [ 63 ].

In an adjacency matrix $A$, the entry $a_{ij}$ records whether an edge connects nodes $i$ and $j$; for undirected graphs the matrix is symmetric, since $a_{ij} = a_{ji}$. This rule does not apply to directed graphs, because in that case the upper and lower triangular parts of the matrix reveal the direction of the edges. This data structure is more efficient for dense networks, where the density of the connections between elements is relatively high. For a fully connected graph, in which all nodes are connected with each other, adjacency matrices are highly recommended. A symmetric matrix $A$ can be stored more compactly as a one-dimensional array $B$ that hosts only the lower triangular part of $A$.
For a network of $N$ nodes, the 1D array will be of size $N(N+1)/2$ including the diagonal. An example of how these data structures represent a graph is given in Figure 2.

Figure 2: Data structures.
A Directed Graph: A random graph consisting of five nodes and six directed edges. Adjacency List: The data structure which represents the directed graph using lists. Adjacency Matrix: The data structure which represents the directed graph using a 2D matrix. The zeros represent the absence of the connection whereas the ones represent the existence of the connection between two nodes.
The matrix is not symmetric since the graph is directed.

Examining different network properties can provide valuable insight into the internal organization of a biological network, the distribution of molecules among cellular processes, and the evolutionary constraints that have shaped an organism's protein, metabolic or regulatory network into a functional, feasible structure. In the following, we give a short description of the main properties that are commonly analyzed in networks.
The graph density shows how sparse or dense a graph is according to the number of connections per node set, and is defined as $density = \frac{2|E|}{|V|(|V|-1)}$. A dense graph is one where $|E| \approx |V|^2$. It has been argued that biological networks are generally sparsely connected, as this confers an evolutionary advantage for preserving robustness.
This has been observed for a series of organisms: the transcriptional regulatory networks of S.
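To make the notion of density concrete, the sketch below assumes the standard definition for a simple undirected graph, 2|E| / (|V|(|V| - 1)), and compares a graph in which every pair of nodes is connected with a sparse chain of the same size. The function name and both example graphs are hypothetical.

```python
from itertools import combinations


def graph_density(num_nodes, num_edges):
    """Density of a simple undirected graph: 2|E| / (|V| (|V| - 1))."""
    if num_nodes < 2:
        raise ValueError("density needs at least two nodes")
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


n = 5
# Every pair of distinct nodes is connected: n (n - 1) / 2 edges.
complete_edges = list(combinations(range(n), 2))
# A simple chain 0-1-2-3-4 has only n - 1 edges.
chain_edges = [(k, k + 1) for k in range(n - 1)]

dense = graph_density(n, len(complete_edges))
sparse = graph_density(n, len(chain_edges))
print(dense, sparse)
```

The fully connected graph reaches the maximum density, while the chain's density shrinks toward zero as more nodes are added, which is the behaviour expected of sparse networks.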