Skip to content

Chapter 51. Network Reliability

Network reliability studies whether a network continues to function when some of its parts fail.

In graph theory, a network is modeled as a graph. Vertices may represent routers, stations, cities, servers, processors, or people. Edges may represent links, roads, cables, communication channels, dependencies, or relationships. Reliability asks whether the remaining graph still has the required connectivity after some vertices or edges are removed.

This chapter connects the deterministic language of connectivity with probabilistic models of failure.

51.1 Deterministic and Probabilistic Reliability

The previous chapters treated connectivity deterministically.

For example, if

λ(G)=3, \lambda(G)=3,

then at least three edges must be removed before the graph can become disconnected.

This gives a worst-case guarantee. It assumes that failures may occur in the most damaging possible locations.

Network reliability asks a probabilistic question. Instead of assuming a chosen set of failures, we assign probabilities to failures and ask how likely the network is to remain operational. In standard graph reliability models, vertices or edges have probabilities of being operational, and the graph is considered functional if a specified connectivity condition remains true.

Thus there are two complementary viewpoints:

ViewpointQuestion
ConnectivityHow many failures can the graph always tolerate?
ReliabilityWhat is the probability that the graph survives random failures?

Both are useful. Connectivity gives structural guarantees. Reliability gives quantitative risk.

51.2 Edge Failure Model

The simplest reliability model assumes that vertices never fail and edges fail independently.

Let

G=(V,E) G=(V,E)

be a connected graph. Each edge is operational with probability pp and failed with probability

1p. 1-p.

The operational subgraph is obtained by keeping each edge independently with probability pp.

The all-terminal reliability of GG is the probability that the operational subgraph remains connected.

It is often written as

RG(p). R_G(p).

Thus

RG(p)=Pr(Gp is connected), R_G(p)=\Pr(G_p \text{ is connected}),

where GpG_p denotes the random subgraph produced by retaining each edge with probability pp.

This definition turns connectivity into a probability.

51.3 State Enumeration

A state of the network is a choice of which edges are operational.

If GG has mm edges, then there are

2m 2^m

possible edge states.

For a state AEA\subseteq E, the edges in AA are operational, and the edges in EAE\setminus A have failed.

If all edges work independently with the same probability pp, then the probability of state AA is

pA(1p)mA. p^{|A|}(1-p)^{m-|A|}.

The network is connected in state AA precisely when the spanning subgraph

(V,A) (V,A)

is connected.

Therefore

RG(p)=AE(V,A) connectedpA(1p)mA. R_G(p)= \sum_{\substack{A\subseteq E\\(V,A)\text{ connected}}} p^{|A|}(1-p)^{m-|A|}.

This formula is exact, but it is usually not computationally efficient. The number of states doubles with every additional edge.

51.4 Reliability Polynomial

The expression RG(p)R_G(p) is called the reliability polynomial of the graph.

It records the probability that the graph remains operational when edges fail independently with a fixed probability. The reliability polynomial is widely used as a graph invariant in network reliability.

For a graph with nn vertices and mm edges, a connected operational subgraph must contain at least

n1 n-1

edges. Hence the reliability polynomial has the form

RG(p)=i=n1mcipi(1p)mi, R_G(p)= \sum_{i=n-1}^{m} c_i p^i(1-p)^{m-i},

where cic_i is the number of connected spanning subgraphs with exactly ii edges.

The coefficients cic_i have a direct meaning:

CoefficientMeaning
cn1c_{n-1}Number of spanning trees
cic_iNumber of connected spanning subgraphs with ii edges
cmc_mEqual to 11, the full graph

This gives a bridge between reliability, spanning trees, and graph enumeration.

51.5 Example: A Path

Let PnP_n be a path with nn vertices.

It has

n1 n-1

edges. If any one edge fails, the path becomes disconnected. Therefore all edges must be operational.

Thus

RPn(p)=pn1. R_{P_n}(p)=p^{n-1}.

A path is minimally connected. It has no redundant edges. Its reliability decreases quickly as the number of edges grows.

For example, if n=5n=5, then

RP5(p)=p4. R_{P_5}(p)=p^4.

At p=0.9p=0.9, this gives

0.94=0.6561. 0.9^4=0.6561.

Even though each edge is reliable individually, the whole path is less reliable because every edge is essential.

51.6 Example: A Cycle

Let CnC_n be a cycle with nn vertices.

The cycle remains connected if at least

n1 n-1

of its nn edges are operational. It becomes disconnected only if at least two edges fail.

Therefore

RCn(p)=pn+npn1(1p). R_{C_n}(p)=p^n+n p^{n-1}(1-p).

The first term corresponds to the state in which all edges work. The second term corresponds to the nn states in which exactly one edge fails.

This may be simplified:

RCn(p)=pn1(n(n1)p). R_{C_n}(p)=p^{n-1}\bigl(n-(n-1)p\bigr).

Cycles are more reliable than paths with the same number of vertices because they contain one redundant route.

51.7 Two-Terminal Reliability

All-terminal reliability requires the whole graph to remain connected.

Sometimes only two specified vertices matter.

Let s,tVs,t\in V. The two-terminal reliability between ss and tt is the probability that ss and tt remain connected in the operational subgraph.

It is written as

Rs,t(p). R_{s,t}(p).

Thus

Rs,t(p)=Pr(s is connected to t in Gp). R_{s,t}(p)=\Pr(s \text{ is connected to } t \text{ in } G_p).

This model is useful when the network must preserve communication between a source and a destination, but other vertices may be disconnected without causing failure.

There are also intermediate models, such as KK-terminal reliability, where a specified subset KVK\subseteq V must remain mutually connected.

51.8 Vertex Reliability

In the vertex failure model, edges remain available, but vertices fail independently.

Each vertex is operational with probability pp. The operational subgraph is induced by the operational vertices.

One may then ask whether the remaining graph is connected, or whether specified terminals remain connected.

Vertex reliability is useful when vertices represent machines, routers, substations, people, processors, or facilities. Edge reliability is useful when edges represent physical or logical links.

Many real systems require both models. A communication network may fail because routers fail, links fail, or both.

51.9 Mixed Reliability

In mixed reliability, both vertices and edges may fail.

Each vertex vv has an operational probability pvp_v, and each edge ee has an operational probability pep_e. The operational network contains only working vertices and working edges whose endpoints are also working.

The reliability question then depends on the chosen success condition.

Typical success conditions include:

Reliability typeSuccess condition
All-terminalAll operational or required vertices remain connected
Two-terminalA specified pair s,ts,t remains connected
KK-terminalAll vertices in a terminal set KK remain connected
kk-connected reliabilityThe operational graph remains kk-connected
Flow reliabilityEnough capacity remains between terminals

Mixed models are closer to engineering systems but harder to compute.

51.10 Connectivity as a Lower Bound

Connectivity gives a deterministic lower bound on failure tolerance.

If a graph is kk-edge-connected, then deleting fewer than kk edges cannot disconnect it.

Therefore, in an independent edge failure model, the graph can fail only if at least kk edges fail.

This does not determine the exact reliability, because not every set of kk failed edges disconnects the graph. But it gives a useful first estimate.

Similarly, if a graph is kk-vertex-connected, then at least kk vertices must fail before disconnection is possible.

Thus higher connectivity generally improves reliability, but reliability also depends on the number and arrangement of minimum cuts.

51.11 Minimum Cuts and Failure Probability

Suppose GG has edge connectivity

λ(G)=k. \lambda(G)=k.

Then the most likely way for the graph to fail, when pp is close to 11, is often the failure of all edges in a minimum cut of size kk.

If there are many minimum cuts, the graph may be less reliable than another graph with the same value of λ(G)\lambda(G) but fewer minimum cuts.

Thus reliability depends on more than the minimum cut size.

It depends on:

FactorEffect
Edge connectivityMinimum number of edge failures needed
Number of minimum cutsNumber of most vulnerable failure modes
Larger cut structureFailure probability beyond leading terms
Edge probabilitiesNonuniform risk across the network
Network topologyRedundancy and bottlenecks

A graph with the same connectivity can have different reliability because its weak cuts may be arranged differently.

51.12 Reliability and Menger’s Theorem

Menger’s theorem gives a path interpretation of reliability.

If there are kk edge-disjoint paths between two vertices ss and tt, then no set of fewer than kk edge failures can separate them.

Thus edge-disjoint paths provide route redundancy.

Similarly, internally vertex-disjoint paths provide protection against vertex failures.

In network design, this gives a concrete rule:

To improve reliability between two terminals, provide many independent routes between them.

However, independence in graph theory and independence in probability are different ideas. Edge-disjoint paths do not share edges, but their failures may still be statistically dependent in a real system. For example, two cables may be edge-disjoint in a logical graph but run through the same physical conduit.

Graph reliability models are useful only when the graph abstraction matches the failure mechanisms.

51.13 Series and Parallel Structures

Simple reliability calculations often reduce to series and parallel structures.

In a series connection, all components are required. If any component fails, the system fails.

For independent components with operational probabilities

p1,p2,,pk, p_1,p_2,\ldots,p_k,

the series reliability is

p1p2pk. p_1p_2\cdots p_k.

A path is a series system.

In a parallel connection, at least one component must work. If components fail independently, the parallel reliability is

1(1p1)(1p2)(1pk). 1-(1-p_1)(1-p_2)\cdots(1-p_k).

Parallel edges or independent routes create redundancy.

Many reliability computations decompose a graph into combinations of series and parallel parts. General graphs, however, rarely decompose completely in this way.

51.14 Complexity

Exact network reliability is computationally difficult.

The number of possible edge states is exponential in the number of edges. More formally, computing the probability that a random subgraph is connected is called the network reliability problem, and it is #P-hard. The related two-terminal problem asks whether two specified vertices remain connected and is also a standard reliability problem.

This hardness explains why practical methods often use:

MethodPurpose
Deletion-contractionExact recursive computation
Factoring by cut vertices or bridgesReduce graph size
Monte Carlo simulationEstimate reliability
BoundsGive certified upper and lower estimates
Minimal cut enumerationApproximate high-reliability behavior
Binary decision diagramsCompactly represent state spaces

Exact formulas are feasible for small or structured graphs. Large networks usually require approximation or special structure.

51.15 Deletion-Contraction

Deletion-contraction is a basic recurrence for reliability polynomials.

Choose an edge ee. There are two cases:

  1. ee fails.
  2. ee works.

If ee fails, the graph becomes

Ge. G-e.

If ee works, its endpoints may be contracted for connectivity purposes, giving

G/e. G/e.

When every edge has operational probability pp, the recurrence has the form

RG(p)=(1p)RGe(p)+pRG/e(p), R_G(p) = (1-p)R_{G-e}(p)+pR_{G/e}(p),

with suitable base cases.

This recurrence is exact. Its difficulty is that repeated deletion and contraction create many subproblems. Without memoization or structure, the computation grows exponentially.

51.16 Monte Carlo Estimation

Monte Carlo simulation estimates reliability by sampling many random network states.

The method is simple:

  1. Generate a random operational subgraph.
  2. Test whether the required connectivity condition holds.
  3. Repeat many times.
  4. Estimate reliability by the fraction of successful trials.

If NN trials are run and XX are successful, then the estimator is

R^=XN. \widehat{R}=\frac{X}{N}.

This estimator is unbiased under independent sampling.

Monte Carlo methods are flexible. They can handle large graphs, nonuniform probabilities, vertex failures, edge failures, and mixed models. Their limitation is statistical error, especially when failure events are rare.

For highly reliable systems, naive simulation may need many samples to observe enough failures. Rare-event simulation and importance sampling are used to improve efficiency.

51.17 Design Principles

Graph theory suggests several design principles for reliable networks.

First, avoid bridges. A bridge is a single edge failure that disconnects the graph.

Second, avoid articulation points. An articulation point is a single vertex failure that disconnects the graph.

Third, increase edge-disjoint and vertex-disjoint paths between important terminals.

Fourth, avoid small nontrivial cuts between large regions.

Fifth, consider the number of minimum cuts, not only their size.

Sixth, match the graph model to real failure modes. Two logical links may fail together if they share physical infrastructure.

These principles are structural. They do not replace engineering constraints such as cost, geography, latency, bandwidth, power, and maintenance.

51.18 Example: Comparing Two Designs

Consider two networks on four vertices.

The first is a path:

1234. 1-2-3-4.

Its reliability is

R(p)=p3. R(p)=p^3.

The second is a cycle:

12341. 1-2-3-4-1.

Its reliability is

R(p)=p4+4p3(1p). R(p)=p^4+4p^3(1-p).

At p=0.9p=0.9, the path has reliability

0.93=0.729. 0.9^3=0.729.

The cycle has reliability

0.94+4(0.9)3(0.1)=0.9477. 0.9^4+4(0.9)^3(0.1)=0.9477.

Adding one edge greatly improves reliability because the network can survive any single edge failure.

This example illustrates the value of redundancy. A small increase in edge count can produce a large increase in survival probability.

51.19 Limitations of Graph Reliability

Graph reliability models are abstractions.

They usually assume independent failures, fixed probabilities, and a clear success condition. Real systems may violate all three assumptions.

Failures may be correlated. A storm, power outage, software bug, or shared conduit may disable many components together.

Probabilities may change over time. Components age, loads vary, and maintenance changes risk.

Success may be graded rather than binary. A network may remain connected but deliver poor latency, insufficient capacity, or degraded quality.

For these reasons, graph reliability is best used as one layer of analysis. It gives a precise mathematical model of structural survival, but it must be interpreted with the system context.

51.20 Summary

Network reliability studies the probability that a graph remains functional after random failures.

ConceptMeaning
Edge reliabilityEdges work or fail probabilistically
Vertex reliabilityVertices work or fail probabilistically
All-terminal reliabilityThe whole graph remains connected
Two-terminal reliabilityTwo specified vertices remain connected
Reliability polynomialProbability of connectivity as a function of pp
Minimum cutSmall failure set that disconnects the network
Monte Carlo estimationSimulation-based reliability approximation

Connectivity gives worst-case guarantees. Reliability gives probability estimates. Together they describe how a graph behaves under failure, both structurally and statistically.