
Monday, June 08, 2009

Distributed system notes - Part II

Intra-machine parallel computing vs. inter-machine distributed computing

Find the potential for parallel tasks, independent data sets, or both -> exploit the commutative property of the problem.

Parallel computing at the intra-machine level is normally achieved via multi-threading on a single CPU or multi-processing across multiple CPUs (while sharing memory and I/O resources). The key is to build parallelism into the task, process and thread level design by exploiting the implicit commutative property of the problem: sub-tasks whose results do not depend on execution order can run concurrently. The main difficulties here are synchronisation on shared resources and dependencies in the non-commutative parts of the logic.
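
Below is a minimal sketch of this idea in Python (the chunking and names are illustrative): independent, commutative chunks are summed concurrently, and a lock is needed only at the single shared resource. Note that CPython threads share one interpreter lock, so for CPU-bound work the same pattern would typically use multiprocessing instead.

    import threading

    total = 0
    total_lock = threading.Lock()

    def partial_sum(chunk):
        # Work on each chunk is independent and commutative:
        # the chunks can be processed in any order.
        global total
        s = sum(chunk)
        # Synchronisation is needed only at the shared resource.
        with total_lock:
            total += s

    data = list(range(1000))
    chunks = [data[i:i + 250] for i in range(0, len(data), 250)]
    threads = [threading.Thread(target=partial_sum, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert total == sum(data)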

Distributed computing across machines, connected through a homogeneous or heterogeneous network, has underlying limitations (as shown by the impossibility results for the coordinated-attack/consensus problem). In real-world scenarios, we always either simplify the network model (e.g. client-server, peer-to-peer) with certain constraints, or assume a probabilistic failure model. Hence the fundamental challenges in inter-machine distributed systems are the communication and messaging protocol, plus shared, synchronised and consistent data support.

Message Passing Interface (MPI) - Open MPI

Explicit data control over nodes; 1-n, n-1 and n-n messaging support; support for high-performance localised applications; defined synchronisation and shared virtual memory; explicit control of distributed processes. Open MPI is a widely used implementation.
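
A hedged sketch of these ideas through the mpi4py bindings for Open MPI (assuming mpi4py is installed; run with something like mpirun -np 2 python script.py). It shows explicit 1-1 data movement between processes and an explicit synchronisation point.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        # Explicit data control: the programmer decides which data
        # goes to which process, and when.
        comm.send([1, 2, 3], dest=1, tag=0)
    elif rank == 1:
        data = comm.recv(source=0, tag=0)
        print('rank 1 received', data)

    # Defined synchronisation: every process waits here until all arrive.
    comm.Barrier()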

MapReduce

Less direct process and data control over the distributed nodes, with the complexity hidden by the framework, based on a distributed file system (DFS) and a JobTracker.

MapReduce -> Map (parallel tasks working on independent raw data sets) + Reduce (parallel tasks working on independent hash key sets)
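
To make this split concrete, here is a minimal word-count sketch in plain Python (the function names are illustrative, not any framework's API): Map emits intermediate key/value pairs from independent inputs, and Reduce aggregates the values of each independent key.

    from collections import defaultdict

    def map_fn(document):
        # Map: runs over independent raw records, emitting (key, value) pairs.
        for word in document.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # Reduce: runs over the independent group of values for one key.
        return (key, sum(values))

    docs = ["the quick fox", "the lazy dog"]
    intermediate = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    print([reduce_fn(k, vs) for k, vs in intermediate.items()])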

  1. The MapReduce library in the user program first shards the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

  2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

  3. A worker who is assigned a map task reads the contents of the corresponding input shard. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

  4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

  5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.

  6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

  7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns to the user code.
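
The whole flow above can be condensed into a single-process sketch (the helper names are hypothetical; a real MapReduce runs these phases on distributed workers coordinated by the master): map output is partitioned into R regions with hash(key) mod R, each region is sorted and grouped by key, and Reduce produces one output per region.

    from itertools import groupby

    def run_mapreduce(records, map_fn, reduce_fn, R=2):
        # Steps 3-4: map tasks emit pairs, partitioned into R regions.
        regions = [[] for _ in range(R)]
        for record in records:
            for key, value in map_fn(record):
                regions[hash(key) % R].append((key, value))
        # Steps 5-6: each reduce task sorts its region, groups equal keys,
        # and applies the Reduce function to each group.
        outputs = []
        for region in regions:          # one output file per region
            region.sort(key=lambda kv: kv[0])
            for key, group in groupby(region, key=lambda kv: kv[0]):
                outputs.append(reduce_fn(key, [v for _, v in group]))
        return outputs                  # step 7: control returns to the caller

    map_fn = lambda doc: ((w, 1) for w in doc.split())
    reduce_fn = lambda k, vs: (k, sum(vs))
    print(run_mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn))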


After successful completion, the output of the MapReduce execution is available in the R output files. To detect failure, the master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.

Completed map tasks are re-executed when a failure occurs because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
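
A short sketch of this failure-handling policy (the data structures and names here are hypothetical, not from the paper): workers that miss the ping deadline are marked failed, and the affected tasks are reset to idle so the master can reschedule them.

    import time

    HEARTBEAT_TIMEOUT = 10.0  # seconds allowed between ping responses

    def check_workers(workers, tasks):
        now = time.time()
        for worker in workers:
            if now - worker['last_ping'] > HEARTBEAT_TIMEOUT:
                worker['failed'] = True
                for task in tasks:
                    if task['worker'] is not worker:
                        continue
                    # All map tasks on the failed worker are reset, even
                    # completed ones: their output sits on its local disk.
                    # Completed reduce output lives in the global file
                    # system, so only in-progress reduce tasks are reset.
                    if task['type'] == 'map' or task['state'] == 'in_progress':
                        task['state'] = 'idle'   # eligible for rescheduling
                        task['worker'] = None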
