Enjoy the evolving life: June 2009

Tuesday, June 30, 2009

/proc vs. /dev

/dev and /proc files are both means for user space programs to communicate with linux kernel but for different purposes.

/dev: file system for devices and stored in hard drive (not the same to DevFS which is also in RAM). (Static)

/proc: file system for processes stored in memory. /proc is an in-memory file system used to provide file based communication with the kernel. Also, to provide the user space applications the information about kernel, modules in the kernel sometimes create entries in /proc about their state information. Often, entries (files and directories) under /proc will refer to (or explain) files under /dev. /proc/devices lists all of the devices (divided into the "block" and "character" categories) available on the system. Any kernel component that wants to communicate with the user can create a file under the /proc and that can be used to exchange data. (Otherwise, it will have to create a system call.) (Dynamic) The sysctl command is another option for dynamic kernel configuration. Here is a good and simpler example of how to create and use entry defined in /proc from a loadable kernel module, and how the memory should be handled. In fact, /proc/devices will always provide you a list of available devices in the system (dynamically updated). This is often being used to create (mknod) /dev file automatically after loadable module is installed via insmod.

Interact with kernel processes

1. via /proc (with kernel): Processes can using files in /proc as media to communicate, between user space and kernel space. Parameters in /proc are used for tuning hardware and kernel internal settings. The reading and writing handlers are the channels for this communication to happen. They could be either blocking or non-blocking mode. This entry can be used by fread and fwrite from user space programs. User program <-> /proc <-> kernel components. /proc file can also be used for hardware configuration.

2. via /dev (with kernel): Processes access /dev file with the intention to issue I/O control or receive I/O data via read, write or mmap. Kernel will recognise these I/O requests and call requested device drivers (kernel modules), which will then communicate with the hardware. Compared with /proc, which could also be used to change hardware configuration, /dev files focus on using configured hardware device rather than configuring. User program <-> /dev/* <-> device driver <-> hardware I/O. Although there is not a one to one relationship between /dev file and hardware piece in the system (except network interface cards). When an user program accesses a /dev file, kernel recognise the device driver based on the /dev file major and minor number (the kernel module numer).

Major number identify the device driver and the minor number is used to identify one of the possible multiple devices managed by this device driver. The kernel uses the major number to redirect an I/O request to the appropriate driver, and the driver uses the minor number to figure out which specific device to access.
You can also access one device module from another kernel device module using try_module_get/module_put/symbol_get/symbol_put/symbol_request, for ensuring loading of the other module, and the fact that it is not unloaded during usage. As long as the functions in the first module has been exposed via EXPORT_SYMBOL (or EXPORT_SYMBOL_GPL) macros.
User space linux APIs applied on /dev files will invoke system calls which will then be passed to device drivers, if the operation is supported. The supported system calls are implemented as file operations, defined in file_operations structure within linux/fs.h , as a collection of function pointers, need to be implemented for device driver to be invoked correctly. There is a good analogy between available system calls applicable for a file and the file operations in the file_operations structure. (Am I just repeating myself? :))
Device driver is one type of kernel module. This means kernel module does not have to implement any of the file operations other than module_init and module_exit.
Of course the beloved ioctl provides all the freedom you need if you do not like any restriction as to the number of functions you want to implement.

On slightly different subject, some special /dev files also can be used solely for IPC between user space programs such as /dev/null, /dev/full, /dev/zero. Particularly, you can create named pipe using mkfifo, which will internally call mknod to create a special /dev file until explicitly deletion. Once created, /dev fifo file can be used for IPC communication between two processes no matter which spaces they live in. Fifo /dev file requires two parties to join the communication with support for both blocking and non-blocking mode accesses.

Implementation

1. /proc entry: to create an entry under /proc in your kernel module, use create_proc_entry().

Normally, there is only read_proc and write_proc for you to create a /proc entry in the kernel module.

2. /dev file: mknod or MAKEDEV or mkfifo to creates a FIFO (pipe), character special file, or block special file with the specified name. "special file" means something that can generate or receive data.

Unfortunately, you have to do this manually when you implement your own new device driver in the linux kernel. Of course you will also be responsible for the module major and minor number allocation although there are kernel APIs which can help you find available numbers. mknod -m permissions name type major minor
You can change the permission of /dev files after creation but need to be careful as you are opening a window to the underneath world!
To support a /dev file to access hardware devices, we need to implement device driver.

Friday, June 26, 2009

Patch the patches

If you have suffered the pain of patching Linux kernel for either driver changes or target device support, I am sure you are just as unimpressed as I am when the patch fails with absolutely no indications, no logs, no nothing, in the middle of file update. With some very careful tracing, it turns out to be that your applications requires you to change the same files which are set for patching, especially those from third parties.

Now here is a clean thread,

1. You have a target device, for argument sake, you need to change some of the kernel driver files to accommodate those hardware various, or even just you fancy ideas of changing Linux itself. After all, that's what you are expecting from open source OS, isn't it?

2. In parallel, you wanted to patch your existing Linux source distribution with some other third party addon, such as RTAI or Xenomai.

In my case, I had a problem that both of them require changes in xxx_gpio.c file, and same lines! The third requirement is, I do not want to manually do the file copy thing or scripted file copying thing every time when I rebuild the whole lot. After a nice lunch time beer, this is what I have done.

Step 1. file_ver0.c => Apply third party changes (only because they are normally dumber!) => file_ver1.c

Step 2. file_ver1.c => Apply your changes on top of third party changes => file_ver2.c

Step 3. Do a diff -u file_ver0.c file_ver2.c > diff.txt

Step 4. Replace the patch section of this file in the original .patch file with content in diff.txt.

Step 5. Or create your own patch file and do patch -p1 < whatever.patch

Ok, this is done once for all.

Sunday, June 21, 2009

For those who love Chrome

Chromium is an open source browser project. In case you have not realised by now, it is the code base where Google Chrome grew from. Crossover chromium is a Mac and Linux port project, where I have to eventually try my luck to get similar user experience of browser of Chrome on Ubuntu. There are some posts introducing way of installing Chrome via Wine. I do not particularly like the idea to install another layer of run time engine between browser and os as one of the nice things about Chrome is its low latency, which is the not exactly the whole idea of wine anyway (portability of windows applications to Linux).

To keep this short, get Chromium up and running, the following steps works for me (Hardy 8.04):

First thing first, you are unlucky as I were, you have to upgrade to 8.10 at least to be able to run Chromium. Regardless how many people say it works, my finding is, yes, the install will be completed with warnings on a few dependencies. The worst thing is, the browser is barely usable and crashes very easily.

So do a "update-manager -d" to get you from 8.04 to 8.10; another one will then get you upgraded to 9.04.

1. Download .deb package

wget http://media.codeweavers.com/pub/crossover/chromium/cxchromium_0.9.0-1_i386.deb

2. Install deb package

sudo dpkg -i cxchromium_0.9.0-1_i386.deb

3. Using Ubuntu PPA

sudo gedit /etc/apt/sources.list

Add the following two lines

deb http://ppa.launchpad.net/chromium-daily/ppa/ubuntu intrepid main

deb-src http://ppa.launchpad.net/chromium-daily/ppa/ubuntu intrepid main

Do NOT use "jaunty" version on Hardy otherwise you will have a crashing install. "jaunty" is for 9.04. Save and exit the file

4. Add and activate GPG key

sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0xfbef0d696de1c72ba5a835fe5a9bf3bb4e5e17b5

5. Update source list in the repository

sudo apt-get update

6. Finally, get chromium installed!

sudo apt-get install chromium-browser

Now go to Applications->Internet->Chromium Web Browser, and you are sorted.

Wednesday, June 10, 2009

Distributed system notes - Part IV

Parallel computing vs. Distributed computing - Get down another level

Flynn gave the taxonomy for distributed architecture as: SISD, SIMD, MISD and MIMD.

To sub-classify MIMD by

Memory

Shared memory system -> multi-processors

No shared memory system -> network of computers

Interconnection

Bus

Switch (Circuit-switched and packet-switched)

Delay/bandwidth

Tightly coupled system - Processors share most resources inc. Memory. The communication is achieved over shared bus too. (Not very flexible and scalable, fault sensitive)

Loosely coupled system - Most resources are not shared with each processor normally own private cache (in tight coupled system too) and dedicated memory. The communication is achieved via messaging over explicit communication link or shared virtual memory.

Five categories in the degree of granularity

Grain size	Description	Synchronisation interval (Instructions)
Fine	Parallelism inherent in a single instruction stream	<>
Medium	Parallel process of multi-tasking in a single application program	20 ~ 200
Coarse	Multiprocessing of concurrent processes in multiprogramming environment	200 ~ 2000
Very Coarse	Distributed processing across network nodes to form a single computing environment	2000 ~ 1M
Independent	Multiple unrelated process	N/A

Intra-machine

Machine architecture

Interconnection

Symmetric Multi-Processing (SMP) - all CPUs connected to single backbone bus

This way memory and peripheral I/O will be accessed via the same shared bus.

Local cache is used to avoid bus overload
To maintain memory coherence -> Snoopy cache

Switched multi-processors
SMP doesn't really scale well. Switched multi-processors divide memory into groups and connect memory chunks with CPUs via a crossbar switch. (n² switch crosspoint, constant 1 switching stage)
To reduce the number of crosspoints, Omega network introduces more switching stage. (for each CPU <-> memory chunk combination, there is a log₂n switching stage, each with n/2 switches; totally n*log₂n/2 switches)
NUMA - hierarchical memory system
HyperCube

Approach to share OS responsibility

Master/slave - Master process executes most of the operating system code, inc. I/O tasks, with slaves performing processor-bound user programs. Thus slave processors normally have to wait master to handle the interrupt for OS level operations.
Separate kernels - Each CPU runs its own OS and responds only to its local processes. Synchronisation can be achieved via critical region or software/hardware (TSL) level spinlock.
Symmetrical (shared OS) - OS is in charge here. OS manages a pool of identical processors with high amount of resource sharing as well as mutual exclusions. This approach allows OS maximise parallelism potential and process to utilise all computation power as much as possible with the load significantly on OS.

Memory access

Uniform memory access (UMA) - All processors share main memory. This would typically be used in shared bus or SMP system.
Non-uniform memory access (NUMA) - Global memory is divided into partitions which are then allocated to processors as local memory. This avoid bus conflicts and quicker local memory access for processors.
COMA
NORMA

OS Scheduling - determine the order and to which processor the process is assigned and dispatched to

Issue break down

Assignment of processes to processors
Use of multiprogramming on individual processor
Actual dispatching of a process

Goals

Achieve parallelism

Timesharing - synchronised single queue based scheduling
Synchronisation of parallel programs - Gang scheduling

Achieve processor Cache affinity - affinity based scheduling, space-partitioning scheduling (what that means is for example, an interrupted process has already put its data, state on the L1 cache of one processor, to move/migrate to the other processor based on naive rescheduling will slow waste the cached info hence slow things down. This could be even worse if underlining architecture is NUMA rather than symmetric access.)

Soft affinity
Hard affinity

Types of scheduling

Job-blind - direct extension from single processor scheduling

FIFO, EDF, Round-robin, shortest process first

Job-aware

The ones to maximise parallelism - shortest-number-of-processes-first, Round-robin job, co-scheduling (MIS ?)
The ones to maximise affinity - dynamic partitioning

Process migration - minimal residual dependency

Request initiated from either sender processor or receiver
Sender suspends the migrating process
Sender creates message queue for migrating process
Sender transmit the migrating process state to a dummy process in receiver - there are various migration strategies as far as how to copy memory pages is concerned.
Sender forward the messages to receiver
Sender and receiver notify the others of the migration
Sender kills the original instance of the process it holds

Load balancing - Queuing theory

Static load balancing - Distribute the load and minimise communication cost
Dynamic load balancing - Present communication challenge

Policies - Sender initiated, receiver initiated, symmetric, random
Algorithms - Bidding, drafting

Case study - background comparison between Windows and Linux.

Linux 2.6 multi-processor scheduling - Prior to 2.6.23, Linux kernel only supports one runqueue per CPU. The introduction of Completely Fair Scheduler (CFS, complexity increased to O(logn) as opposed to O(1)) in 2.6.23, is to improve the separation between I/O intensive tasks and CPU intensive tasks, which was originally estimated by penalising CPU intense processes and favor for I/O intense via tuning the priorities. Bear in mind Linux has mainly been used in symmetrical shared OS structure as far as OS responsibility is concerned. The underneath data structure used by CFS is a red-black tree per runqueue. Scheduling domain based Load balancing was introduced in 2.6.17 to address unsatisfactory Linux scheduling performance on SMP and NUMA. The hierarchy of scheduling domain is built on actual hardware resources available. Load balancing takes place at the domain level between different groups. Linux also introduced energy saving scheduling. A good reference on this can be found on Intel's website.

One queue per processor - a run queue (containing multiple priority array to achieve O(1) scheduling selection as in single processor's case) per processor
Same task tends to stay in one queue - to achieve cache affinity
Load balancing - move tasks around queues
No native support for gang scheduling as yet.

Windows multi-process scheduling - The scheduling on windows is thread-based, with priority from 0 to 31 (as opposed to Linux 140 priority within which 100 is for real time tasks). These threads will be allocated time slices on round-robin fashion. Scheduling on SMP system, Windows introduces the notion of a thread's processor affinity and ideal processor for a given thread.

Inter-machine

Architecture

Centralised model - Terminal-based computing service (slot) provider
Client-server model
P2P - Each peer has equivalent capability => no peer is dedicated to only serve others. Publisher-consumer design patter, brokerless messaging.
Processor pool model - Task scheduling, mapreduce
Grid computing - for heterogeneous and geographically distributed system, providing seamless access to available resources.
Multi-tier C/S

Remote procedure call - To execute function/service defined within other components of the system

Stub based design. An interface definition stage is involved during the design to product consistent stubs to be used both in RPC invokers and invokees. The execution of RPC could either be blocking or non-blocking, in other words, synchronous or asynchronous.
Parameters have to be marshalled -> serialised -> transmitted -> de-serialised -> unmarshalled, by value in essence (be aware of different data representation in different physical nodes)
How to locate the RPC service provider and how/when to bind?
Name services operations
Binding operations (via appropriate protocols)
Endpoints operations
Security operations
Data conversion (marshalling)
Stub memory management (temporarily buffer 'reference' data)
RPC API calls (compatibility, fault tolerance)

Case studies

Native RPC

Sun RPC - also known as ONC RPC, interface defined in IDL (rpcgen compiler). Early binding approach, not by procedure call basis. It requires programmer to pick a unique interface definition ID.

DCE RPC - Interface is defined in IDN file. Integrated with DCE's security service and directory service. uuidgen is used to produce the unique ID.

Object RPC

MS DCOM - Client connects to SCM (Service Control Manager, running on server) to request the creation of object on server. Underneath data transfer DCOM uses DCE RPC protocol packets + version info + IPID (remote object references). MIDL file is compiled by IDL compiler to produce C++ code, proxy (client) and stub (server), both are COM objects. One key point is the object (in the server) lifetime is controlled by remote reference counting.
CORBA - OS and language independent. To achieve, request services of a distributed object. ORB (Object request broker) delivers the client request to the object and return results to client. The interfaces are defined in IDL, interface definition language (Yes, it is a real language!), which will then be compiled to target language client or server use, to produce stub and skeleton. ORB takes a lot of responsibility in this case such as locating remote object, sending request, collecting result. This way, the client and object implementation could be on completely different platform in different language, not like DCOM. IIOP is defined as an inter-ORB network protocol. There are many CORBA standard services providing services as naming, events, managing object life cycle, transaction, load balancing etc.

Java RMI - using Object Registry to relate objects to names within name servers. rmic is the compiler to produce stub and skeleton class files. In the server side, the object implementation needs to register itself with the Object registry via binding, after which, client will inquiry rmiregistry to lookup name. It will then return a remote reference handle to the client for it to invoke exposed methods.

Webservices - the third generation of RPC implementation. The goal is to have remote hosted services behind firewalls could be utilised.

XML RPC - marshal data into XML format with explicit typing, transported over standard HTTP protocol to get around firewall restrictions. There are standard XML-RPC data types defined, as opposed to native data types.
SOAP - to generalise XML RPC, SOAP still defines message in XML format but not necessary RPC, instead SOAP adopts XML over HTTP. Web service interface is defined in WSDL, which is exchangeable. In terms of SOAP v POX (plain-old-XML), AJAX adopts the standard XML approach to send and receive data, allowing client side javascript to initiate asynchronous HTTP request (REST) and process the result at object level. Google has obviously spit out its favorite as POX.
MS .Net Remoting - I have briefly covered .Net remoting in previous blog. Aiming at replacing low level DCOM, MS introduced Remoting on top of the new common language runtime (CLR) as the object run time environment. Webservice will be compiled into immediate language (IL) format whichever language was chosen to implement. CLR just-in-time compiler will generate native code for the first loaded programs. (It is possible to pre-load the interpreted assembly into target machine instruction binary to global assembly cache though.) Object derived from MarshalByRefObject is deemed as remote objects. The communication is supported in XML, SOAP or binary encoding, through HTTP, TCP and SMTP protocol channel. In summary, .Net also has an explicit set of Web Service, based on HTTP (communication protocol) + XML (data format) + SOAP (format to request service) + WSDL (format for defining service) + UDDI (protocol for discovering service). I hope you can see the difference between Remoting and .Net Web Service, as the latter is a fixed combination and higher functional level.

Tuesday, June 09, 2009

3G in China

Market development

1. Step one - standard

Chinese market, IP, Diversity of suppliers. multiple standards presents challenges but also opportunities to the suppliers at different level of the food chain.

TD-SCDMA go abroad by directing/influencing suppliers, technology via massive Chinese market.

2. Step one - network (200 billions RMB in 2009 alone)

Three carriers are being licensed for 3G network infrastructure construction in Jan 2009:

China Mobile - TD-SCDMA

China Telcom - CDMA2000

China Unicom - WCDMA

HuaWei, ZTE will be playing critical roles in infra-structure construction, so as many third party suppliers.

3. Step three - UE

In the next 3 years, 3G mobile phone, data card etc. is projected/planned at 400 billions RMB, more than 40 billions GBP.

Diversity of suppliers

Chipset: Icera, Qualcomm

Device: All the big names + local device vendors

China Mobile joined Google's Open Mobile platform alliance, adopting Android as mobile OS. Very much like DOCOMO in Japan.

WCDMA - G1, iPhone, N97 (Nokia), X1 (Sony Ericsson), HTC Touch HD

CDMA2000 - Blackberry, D90 (ZTE), ZN4 (Motorola), SCH-U940(Samsung)

TD-SCDMA - L800t (Motorola), S700 (Dupuda), U981 (ZTE), TD800 (Lenovo)

4. Step four - service provider, content provider

Application vendors

Value-added service provider

Technical stuff

Baseband processor

Network infrastructure - & integration with each other
Node B implementaion
Integrate existing network

Challengess
Wireless Communication Channel, multiple modulation and channel coding standard co-exist (Soft modem)
Packet Network, modified TCP/IP implmentation on top of MAC and link layer control in application processor.
Small packaging and integrated fabric
Products

Data card - Qualcomm's Gobi, Icera's soft modem

Mobile device - reference design

Monday, June 08, 2009

Distributed system notes - Part III

Virtulisation

Memory virtulisation - process owns individual memory space. Created by MMU, configured by OS
Storage virtulisation - Logical storage images. DFS (distributed file system) is one typical example. Another one is SAN (Storage Area Network).
CPU/Machine virtulisation - Process has its own CPU, machine resources, created by OS pre-emption and scheduler.

For unprivileged instructions, they will be interpreted to the physical CPU instruction sets.
For privileged instruction, VM will trap these instructions and emulate them. However, X86 doesn't support Trap. So either we have to replace these instructions (VMWare) or just don't use them (Xen).
Virtual Machine Monitor - Hypervisor, the program responsible for the virtulisation

Arbitrate accesses to physical resources (hypervisor sometime can be run on top of a thin layer OS itself - Xen.)
Present a set of virtual device interface to each host OS
Hypervisor-based rootkits
VMWare, Parallels, Xen, MS Virtual PC, xVM Virtual Box, Altris

Clustering

Clustering is to achieve reliability and scalability by interconnectin multiple autonomous standard servers into appear-to-be single machine image.

Existing clustering types:

High Performance (Super)-Computing (HPC)

Typically Linux + message passing software + remote exec + remote monitoring
Interconnect network isolated from external network
Network load is determined only by application
Global process ID provided, Global signaling mechanism

Batch processing

Single-queue work distribution
Application specific: graphics rendering
Dispatcher remotely exec process
A portable Batch System - OpenPBS

Grid Computing

Not application-specific, use general-purpose protocols for accessing resources, authenticating, discovery etc.
Coordinate computing resources that are not subject to centralized control

Open Grid Services Architecture (Globus Toolkit - open source software tool kit used for building Grid systems and applications.)

All grid resources are modeled as services
Protocols

Grid Resource Information Protocol (GRIP), Register resources; based on LDAP
Grid Resource Access and Management Protocol (GRAM) Allocation & monitoring of resources; based on HTTP
GridFTP, Data access

Service

Define service’s interface (GWSDL)
Implement service (Java)
Define deployment parameters (WSDD)
Compile & Generate GAR file (Ant)
Deploy service with Ant

High availabitlity (HA)

Load balancing - Redirect request, most router has now load balancing capability, simple NAT table load balancing.

Software load balancing - Forward request via load balancing (IBM interactive network dispatcher)

Leaves original source address
Load balancer not in path of outgoing traffic (high bandwidth)
Kernel extensions for routing TCP and UDP requests

Each client accepts connections on its own address and dispatcher’s address
Dispatcher changes MAC address of packets.

Load balancing Router - Virtual Router Redundancy Protocol, LSNAT protocol

Pick machine with least # TCP connections
Factor in weights when selecting machines
Pick machines round-robin
Pick fastest connecting machine (SYN/ACK time)
Port level redirection
Multiple virtual addresses to be assigned to physical address
Manage Session during redirections.

Network load balancing, component load balancing

Multi-tier clustering system

Top tier: cluster abstractions - Failover manager, resource monitor, cluster registry
Middle tier: distributed operations - Global status update, quorum (keeps track of who’s in charge), membership
Bottom tier: OS and drivers - Cluster disk driver, cluster network drivers, IP address takeover

Clock synchronisation

Physical clock - Keep consistent wall clock across all components.

Problems: Clock drift and skew => NTP and SNTP protocol

Logical clock - Maintain logical relationship between events. The key is the sequence of the messages, which each component maintains its own

Partial ordering - Lamport's algorithm

Each message carries the physical timestamp from the sender;
When a message arrives, if the receiver's physical clock < timestamp =""> set receiver's clock to message's timestamp + 1, otherwise do nothing;
Clock must be advanced between two events in the same process.

When there is no logical sequential ordering among events in different components, i.e. no send-receive causal relationship, we have concurrency.

Total ordering

Force each timestamp in the system to be unique in {T_i , i}. "i" is the process identifier in the system. Total ordering does not related to event ordering.
Use Vector clock to analyse and identify concurrent events in a distributed system.

Distributed system notes - Part II

Intra-machine parallel computing vs. inter-machines distributed computing

Find the parallel task potential and independent data sets, or both -> commutative property.

Parallel computing at intra-machine level normally is achieved via multi-threading on single CPU or multi-processing across multiple CPUs (but sharing memory and I/O resources). The key is how to build in parallelism into the task, process and thread level design by exploiting the implicit commutative property of the problem. Here synchronisation on the shared resource and dependency over the similar non-commutative logic is the main difficulties.

Distributed computing at inter-machines, connected through homogeneous or heterogeneous network, has underlining limitation (proved by consensus attack problem). In real world scenarios, we always either simplify the network model (i.e. client-server, peer-to-peer) with certain constraints, or assuming probability in the failures. Hence the fundamental challenges in inter-machine distributed system is communication and messaging protocol plusshared, synchronised and consistent data support.

Message Passing Interface (MPI) - Open MPI

Explicit data control over nodes; 1-n, n-1 and n-n messaging support; support high performance localised applications; defined synchronisation and shared virtual memory; explicit distributed processes control.Open MPI implementation.

MapReduce

Less direct process and data control in distributed nodes with hidden complexity, based on DFSand JobTracker.

MapReduce -> Map (parallel tasks working on independent raw data sets) + Reduce (parallel tasks working on independent hash key sets)

The MapReduce library in the user program first shards the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.
One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
A worker who is assigned a map task reads the contents of the corresponding input shard. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.
The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

After successful completion, the output of the MapReduce execution is available in the R output files. To detect failure, the master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.

Completed map tasks are re-executed when failure occurs because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.

Distributed system notes - Part I

To be truly reliable, a distributed system must have the following characteristics:

Fault-Tolerant: It can recover from component failures without performing incorrect actions.
Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
Predictable Performance: The ability to provide desired responsiveness in a timely manner.
Secure: The system authenticates access to data and services

8 Fallacies or assumptions

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn't change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.

Design principles

As Ken Arnold says: "You have to design distributed systems with the expectation of failure." Avoid making assumptions that any component in the system is in a particular state.
Explicitly define failure scenarios and identify how likely each one might occur.
Both clients and servers must be able to deal with unresponsive senders/receivers.
Think carefully about how much data HAVE to be sent over the network. Minimize traffic as much as possible.
Latency is the time between initiating a request for data and the beginning of the actual data transfer. Minimizing latency sometimes comes down to a question of whether you should make many little calls/data transfers or one big call/data transfer. The way to make this decision is to experiment. Do small tests to identify the best compromise.
Don't assume that data sent across a network (or even sent from disk to disk in a rack) is the same data when it arrives. Do checksums or validity checks on data to verify that the data has not changed.
Caches and replication strategies are methods for dealing with state across components. We try to minimize stateful components in distributed systems, but it's challenging. State is something held in one place on behalf of a process that is in another place, something that cannot be reconstructed by any other component. If it can be reconstructed it's a cache. Caches can be helpful in mitigating the risks of maintaining state across components. But cached data can become stale, so there may need to be a policy for validating a cached data item before using it.
If a process stores information that can't be reconstructed, then problems arise as single point of failure. To deal with this issue, Replication strategies are also useful in mitigating the risks of maintaining state. But synchronizing multiple Replications is another problem.There are a set of tradeoffs in deciding how and where to maintain state, and when to use caches and replication. It's more difficult to run small tests in these scenarios because of the overhead in setting up the different mechanisms.
Be sensitive to speed and performance. Take time to determine which parts of your system can have a significant impact on performance: Where are the bottlenecks and why? Devise small tests you can do to evaluate alternatives. Profile and measure to learn more.
Acks are expensive and tend to be avoided in distributed systems wherever possible.
Retransmission is costly. It's important to experiment so you can tune the delay that prompts a retransmission to be optimal.

Fault tolerance

Failure is the defining difference between distributed and local programming.

Since failure (either it is transient, intermittent or permanent) is unavoidable in distributed system, (referring to Consensus attack problem). Nowadays, problems are most often associated with connections and mechanical devices, i.e., network failures and drive failures.

Software residual bugs in mature systems can be classified into two main categories.

Heisenbug: A bug that seems to disappear or alter its characteristics when it is observed or researched. A common example is a bug that occurs in a release-mode compile of a program, but not when researched under debug-mode. The name "heisenbug" is a pun on the "Heisenberg uncertainty principle," a quantum physics term which is commonly (yet inaccurately) used to refer to the way in which observers affect the measurements of the things that they are observing, by the act of observing alone (this is actually the observer effect, and is commonly confused with the Heisenberg uncertainty principle).
Bohrbug: A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is researched. A Bohrbug typically manifests itself reliably under a well-defined set of conditions.

Types of failures that can occur in a distributed system:

Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
Omission failures: Failure to send/receive messages primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded.
Network failures: A network link breaks.
Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
Timing failures: A temporal property of the system is violated. For example, clocks on different computers which are used to coordinate processes are not synchronized; when a message is delayed longer than a threshold period, etc.
Byzantine failures: This captures several types of faulty behaviors including data corruption or loss, failures caused by malicious programs, etc.

To achieve fault tolerance, we normally apply redundancy to the system.

Information redundancy - replicating or coding the data. For example, a Hamming code can provide extra bits in data to recover a certain ratio of failed bits. Sample uses of information redundancy are parity memory, ECC (Error Correcting Codes) memory, and ECC codes on data blocks.
Time redundancy - performing an operation several times. Timeouts and retransmissions in reliable point-to-point and group communication are examples of time redundancy. This form of redundancy is useful in the presence of transient or intermittent faults. It is of no use with permanent faults.
Physical redundancy - deals with devices, not data. We add extra equipment to enable the system to tolerate the loss of some failed components. RAID disks and backup name servers are examples of physical redundancy.

Active Replication - TMR (Triple modular redundancy)
Primary & backup, heartbeat is required periodically between servers.

Impossibility of the agreement

Faulty communication channel - Referring to consensus attack problem
Faulty distributed component - Byzantine General problem, agreement can be reached, but requires significant amount of additional nodes and messages transfer.