The most exciting thing about this world is its ever-changing quality.

Monday, June 08, 2009

Distributed system notes - Part I

To be truly reliable, a distributed system must have the following characteristics:

  • Fault-Tolerant: It can recover from component failures without performing incorrect actions.
  • Highly Available: It can restore operations and resume providing services even when some components have failed.
  • Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
  • Consistent: The system can coordinate actions by multiple components, often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
  • Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
  • Predictable Performance: It provides the desired responsiveness in a timely manner.
  • Secure: The system authenticates access to data and services.

The 8 fallacies of distributed computing (assumptions that do not hold)

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

Design principles

  • As Ken Arnold says: "You have to design distributed systems with the expectation of failure." Avoid making assumptions that any component in the system is in a particular state.

  • Explicitly define failure scenarios and identify how likely each one is to occur.

  • Both clients and servers must be able to deal with unresponsive senders/receivers.

  • Think carefully about how much data HAS to be sent over the network. Minimize traffic as much as possible.

  • Latency is the time between initiating a request for data and the beginning of the actual data transfer. Minimizing latency sometimes comes down to a question of whether you should make many little calls/data transfers or one big call/data transfer. The way to make this decision is to experiment. Do small tests to identify the best compromise (a toy cost model follows this list).

  • Don't assume that data sent across a network (or even sent from disk to disk in a rack) is the same data when it arrives. Do checksums or validity checks to verify that the data has not changed (a checksum sketch follows this list).

  • Caches and replication strategies are methods for dealing with state across components. We try to minimize stateful components in distributed systems, but it's challenging. State is something held in one place on behalf of a process that is in another place, something that cannot be reconstructed by any other component. If it can be reconstructed, it's a cache. Caches can be helpful in mitigating the risks of maintaining state across components, but cached data can become stale, so there may need to be a policy for validating a cached data item before using it (a TTL-cache sketch follows this list).

    If a process stores information that can't be reconstructed, it becomes a single point of failure. Replication strategies also help mitigate the risks of maintaining state, but keeping multiple replicas synchronized is a problem of its own. There is a set of tradeoffs in deciding how and where to maintain state, and when to use caches and replication. It's more difficult to run small tests in these scenarios because of the overhead in setting up the different mechanisms.

  • Be sensitive to speed and performance. Take time to determine which parts of your system can have a significant impact on performance: Where are the bottlenecks and why? Devise small tests you can do to evaluate alternatives. Profile and measure to learn more.

  • Acks are expensive and tend to be avoided in distributed systems wherever possible.

  • Retransmission is costly. It's important to experiment so you can tune the retransmission delay to an optimal value (a backoff sketch follows this list).
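
As a toy illustration of the many-small-calls vs. one-big-call tradeoff above, the sketch below (in Python) models total transfer time from two assumed parameters, a fixed per-call latency and a fixed bandwidth. The numbers are made up; a real decision should come from measurement, not from this model.

    # Toy cost model: N small transfers vs. one batched transfer.
    # LATENCY_S and BANDWIDTH_BPS are assumed, made-up numbers.
    LATENCY_S = 0.050            # 50 ms of latency paid per call
    BANDWIDTH_BPS = 10_000_000   # 10 MB/s effective bandwidth

    def transfer_time(num_calls: int, total_bytes: int) -> float:
        """Each call pays the latency once; all bytes share the bandwidth."""
        return num_calls * LATENCY_S + total_bytes / BANDWIDTH_BPS

    total_bytes = 1_000_000  # 1 MB of payload overall
    print(f"1000 small calls: {transfer_time(1000, total_bytes):.2f} s")  # ~50.10 s
    print(f"1 big call:       {transfer_time(1, total_bytes):.2f} s")     # ~0.15 s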
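
The checksum principle above, as a minimal sketch using Python's standard hashlib module; in practice the digest would travel alongside the payload or be obtained from a trusted source.

    import hashlib

    def sha256_hex(data: bytes) -> str:
        """SHA-256 digest of a payload, as a hex string."""
        return hashlib.sha256(data).hexdigest()

    # Sender side: compute a digest and transmit it with the payload.
    payload = b"some bytes headed across the network"
    digest = sha256_hex(payload)

    # Receiver side: recompute and compare before trusting the data.
    received = payload  # imagine these bytes arrived over the wire
    if sha256_hex(received) != digest:
        raise ValueError("payload was corrupted in transit")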
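
One simple validation policy for cached data is a time-to-live (TTL): treat an entry as stale after a fixed age and rebuild it from the authoritative source. A minimal sketch; the loader function and the TTL value here are assumptions for illustration.

    import time

    class TTLCache:
        """Cache whose entries expire after ttl seconds (a stale-data policy)."""

        def __init__(self, loader, ttl: float = 30.0):
            self.loader = loader   # authoritative source a cache can be rebuilt from
            self.ttl = ttl
            self._entries = {}     # key -> (value, timestamp)

        def get(self, key):
            entry = self._entries.get(key)
            if entry is not None:
                value, stamp = entry
                if time.monotonic() - stamp < self.ttl:
                    return value   # fresh enough: serve from the cache
            value = self.loader(key)  # stale or missing: reconstruct
            self._entries[key] = (value, time.monotonic())
            return value

    # Usage, with a hypothetical loader standing in for a remote fetch:
    cache = TTLCache(loader=lambda key: f"value-for-{key}", ttl=5.0)
    print(cache.get("user:42"))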
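
For retransmission, a common tuning strategy is exponential backoff: start with a short delay and double it after each failed attempt, with jitter so many senders don't retry in lockstep. A minimal sketch; send is a hypothetical stand-in for the real network call, and the delays are exactly the values you would tune by experiment.

    import random
    import time

    def send_with_retries(send, message, attempts=5, base_delay=0.1):
        """Retry a flaky send, doubling the delay after each failure."""
        delay = base_delay
        for attempt in range(attempts):
            try:
                return send(message)
            except OSError:            # e.g. a timeout or connection reset
                if attempt == attempts - 1:
                    raise              # out of retries: surface the error
                time.sleep(delay + random.uniform(0, delay))  # add jitter
                delay *= 2             # exponential backoff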

Fault tolerance

Failure is the defining difference between distributed and local programming.
Since failure, whether transient, intermittent, or permanent, is unavoidable in a distributed system (see the consensus attack problem below), the system must be designed to cope with it. Nowadays, problems are most often associated with connections and mechanical devices, i.e., network failures and drive failures.

Residual software bugs in mature systems can be classified into two main categories.

  • Heisenbug: A bug that seems to disappear or alter its characteristics when it is observed or investigated. A common example is a bug that occurs in a release-mode compile of a program but not in a debug-mode build. The name "heisenbug" is a pun on the Heisenberg uncertainty principle, a quantum-physics term commonly (yet inaccurately) used to refer to the way observers affect the measurements of the things they observe by the act of observing alone (that is actually the observer effect, which is often confused with the uncertainty principle).

  • Bohrbug: A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is investigated. A bohrbug typically manifests itself reliably under a well-defined set of conditions.

Types of failures that can occur in a distributed system:

  • Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
  • Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
  • Omission failures: A failure to send or receive messages, primarily due to a lack of buffer space, which causes a message to be discarded with no notification to either the sender or the receiver. This can happen when routers become overloaded.
  • Network failures: A network link breaks.
  • Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
  • Timing failures: A temporal property of the system is violated. For example, clocks on different computers used to coordinate processes are not synchronized, or a message is delayed longer than a threshold period.
  • Byzantine failures: This captures several types of faulty behaviors including data corruption or loss, failures caused by malicious programs, etc.

To achieve fault tolerance, we normally apply redundancy to the system.

  • Information redundancy - replicating or coding the data. For example, a Hamming code can provide extra bits in data to recover a certain ratio of failed bits. Sample uses of information redundancy are parity memory, error-correcting code (ECC) memory, and ECC codes on data blocks (a parity sketch follows this list).
  • Time redundancy - performing an operation several times. Timeouts and retransmissions in reliable point-to-point and group communication are examples of time redundancy. This form of redundancy is useful in the presence of transient or intermittent faults; it is of no use with permanent faults.
  • Physical redundancy - deals with devices, not data. We add extra equipment to enable the system to tolerate the loss of some failed components. RAID disks and backup name servers are examples of physical redundancy.
    • Active replication, e.g., TMR (triple modular redundancy), in which all replicas process every request and a voter takes the majority result.
    • Primary & backup, in which the backup monitors the primary through periodic heartbeats and takes over when they stop (see the sketch after this list).
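
As a toy illustration of information redundancy, the sketch below uses the RAID-style trick of keeping an XOR parity block alongside the data blocks, which lets any single lost block be rebuilt from the survivors; the block contents are made up.

    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-length byte blocks together."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data_blocks = [b"ABCD", b"EFGH", b"IJKL"]  # made-up payload blocks
    parity = xor_blocks(data_blocks)           # stored as one extra, redundant block

    # Suppose block 1 is lost: XOR-ing the parity with the survivors recovers it.
    survivors = [data_blocks[0], data_blocks[2], parity]
    assert xor_blocks(survivors) == data_blocks[1]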
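
A minimal sketch of heartbeat-based detection for the primary & backup scheme; the timeout is an assumed value to be tuned, and the transport is left abstract (beat() would be called whenever a heartbeat message actually arrives).

    import time

    class HeartbeatMonitor:
        """The backup's view of the primary: presume it dead after a silent period."""

        def __init__(self, timeout: float = 3.0):
            self.timeout = timeout
            self.last_beat = time.monotonic()

        def beat(self):
            """Call on every heartbeat message received from the primary."""
            self.last_beat = time.monotonic()

        def primary_alive(self) -> bool:
            return time.monotonic() - self.last_beat < self.timeout

    # The backup polls periodically and takes over when the beats stop.
    monitor = HeartbeatMonitor(timeout=3.0)
    if not monitor.primary_alive():
        print("primary presumed failed; backup taking over")

Note that a timeout establishes only suspicion, not certainty: a primary on the far side of a network partition looks exactly like a dead one, which is why fail-stop (failure with notification) is much easier to handle than plain halting.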

Impossibility of agreement

  • Faulty communication channel - see the consensus attack problem (also known as the Two Generals' Problem): no protocol can guarantee agreement between two parties over a channel that may lose messages.
  • Faulty distributed component - the Byzantine Generals Problem: agreement can still be reached, but it requires a significant number of additional nodes and message exchanges (with unsigned messages, tolerating f faulty nodes requires at least 3f + 1 nodes in total).
