Flynn's taxonomy classifies computer architectures as: SISD, SIMD, MISD and MIMD.
MIMD can be sub-classified by:
- Memory
Shared memory system -> multiprocessor; no shared memory system -> network of computers (multicomputer)
- Interconnection
Bus
Switch (Circuit-switched and packet-switched)
- Delay/bandwidth
Tightly coupled system - Processors share most resources, including memory, and communication is achieved over the shared bus as well. (Not very flexible or scalable; fault-sensitive.)
Loosely coupled system - Most resources are not shared; each processor normally owns a private cache (as in tightly coupled systems too) and dedicated memory. Communication is achieved via message passing over an explicit communication link or through shared virtual memory.
Five categories by degree of granularity:

| Grain size | Description | Synchronisation interval (instructions) |
|---|---|---|
| Fine | Parallelism inherent in a single instruction stream | < 20 |
| Medium | Parallel processing or multi-tasking within a single application program | 20 ~ 200 |
| Coarse | Multiprocessing of concurrent processes in a multiprogramming environment | 200 ~ 2000 |
| Very coarse | Distributed processing across network nodes to form a single computing environment | 2000 ~ 1M |
| Independent | Multiple unrelated processes | N/A |
Intra-machine
Machine architecture
- Interconnection
- Symmetric Multi-Processing (SMP) - all CPUs connected to single backbone bus
- This way memory and peripheral I/O will be accessed via the same shared bus.
- Local cache is used to avoid bus overload
- To maintain cache coherence -> snoopy cache protocol
- Switched multi-processors
- SMP doesn't really scale well. Switched multiprocessors divide memory into chunks and connect the memory chunks to CPUs via a crossbar switch (n^2 crosspoints for n CPUs and n memory chunks, but only a single switching stage).
- To reduce the number of crosspoints, the omega network introduces more switching stages: every CPU <-> memory-chunk path crosses log2(n) stages, each stage holding n/2 switches, for n*log2(n)/2 switches in total.
- NUMA - hierarchical memory system
- HyperCube
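A quick back-of-the-envelope check of those switch counts (assuming n CPUs and n memory chunks, with n a power of two):

```python
import math

def crossbar_crosspoints(n):
    # Full crossbar: every CPU reaches every memory chunk directly,
    # so n x n crosspoints but only one switching stage.
    return n * n

def omega_switches(n):
    # Omega network: log2(n) stages, each holding n/2 two-by-two switches.
    stages = int(math.log2(n))
    return stages * (n // 2)

for n in (8, 64, 1024):
    print(n, crossbar_crosspoints(n), omega_switches(n))
```

At n = 1024 the crossbar needs over a million crosspoints while the omega network needs only a few thousand switches, at the cost of log2(n) stages of switching delay on every access.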
- Approach to share OS responsibility
- Master/slave - The master processor executes most of the operating system code, including I/O tasks, while the slaves run processor-bound user programs. Slave processors therefore normally have to wait for the master to handle interrupts for OS-level operations.
- Separate kernels - Each CPU runs its own OS and responds only to its local processes. Synchronisation can be achieved via critical regions or software/hardware (TSL, test-and-set lock) spinlocks.
- Symmetrical (shared OS) - The OS manages a pool of identical processors with a high degree of resource sharing, protected by mutual exclusion. This approach lets the OS maximise the parallelism potential and lets processes utilise as much of the computing power as possible, though it places a significant load on the OS itself.
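A software sketch of such a TSL-style spinlock (Python's `threading.Lock` with a non-blocking acquire stands in for the hardware test-and-set instruction; a real kernel would use the atomic instruction directly):

```python
import threading

class SpinLock:
    """Busy-waiting lock: software analogue of a hardware test-and-set (TSL) loop."""
    def __init__(self):
        self._flag = threading.Lock()   # stands in for the shared TSL flag

    def acquire(self):
        # "Test and set" until the flag is grabbed; the CPU spins instead of blocking.
        while not self._flag.acquire(blocking=False):
            pass

    def release(self):
        self._flag.release()

lock, count = SpinLock(), [0]

def worker():
    for _ in range(1000):
        lock.acquire()      # enter critical region
        count[0] += 1
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(count[0])   # 4000: the critical region was never entered concurrently
```

Spinning only pays off when the expected wait is shorter than a context switch, which is why it suits short kernel critical regions on multiprocessors.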
- Memory access
- Uniform memory access (UMA) - All processors share main memory. This would typically be used in shared bus or SMP system.
- Non-uniform memory access (NUMA) - Global memory is divided into partitions which are then allocated to processors as local memory. This avoids bus conflicts and gives processors quicker access to local memory.
- COMA (cache-only memory access) - every local memory behaves as a large cache; data migrates to the processors that use it.
- NORMA (no remote memory access) - processors cannot address each other's memory at all; all communication is via message passing.
OS Scheduling - determines both the order in which processes run and the processor to which each process is assigned and dispatched
- Issue break down
- Assignment of processes to processors
- Use of multiprogramming on individual processor
- Actual dispatching of a process
- Goals
- Achieve parallelism
- Timesharing - synchronised single queue based scheduling
- Synchronisation of parallel programs - Gang scheduling
- Achieve processor cache affinity - affinity-based scheduling, space-partitioning scheduling. For example, an interrupted process has already placed its data and state in the L1 cache of one processor; migrating it to another processor under naive rescheduling wastes that cached information and slows things down. The effect is even worse if the underlying architecture is NUMA rather than uniform access.
- Soft affinity
- Hard affinity
- Types of scheduling
- Job-blind - direct extension from single processor scheduling
- FIFO, EDF, Round-robin, shortest process first
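For instance, shortest-process-first carries over unchanged from the single-processor case; a minimal non-preemptive sketch (job names and burst times are made up):

```python
def shortest_process_first(jobs):
    """Non-preemptive shortest-process-first: a job-blind policy that is a
    direct extension of single-processor scheduling. jobs = [(name, burst)]."""
    waits, t = {}, 0
    for name, burst in sorted(jobs, key=lambda j: j[1]):
        waits[name] = t          # how long this job waited before dispatch
        t += burst
    return waits

print(shortest_process_first([("A", 6), ("B", 2), ("C", 4)]))
# -> {'B': 0, 'C': 2, 'A': 6}
```

Being job-blind, it treats each process in isolation; nothing here knows whether two processes belong to the same parallel job.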
- Job-aware
- The ones to maximise parallelism - shortest-number-of-processes-first, Round-robin job, co-scheduling (MIS ?)
- The ones to maximise affinity - dynamic partitioning
- Process migration - minimal residual dependency
- Request initiated from either sender processor or receiver
- Sender suspends the migrating process
- Sender creates message queue for migrating process
- Sender transmits the migrating process state to a dummy process on the receiver - there are various migration strategies as far as copying memory pages is concerned.
- Sender forwards the queued messages to the receiver
- Sender and receiver notify the other nodes of the migration
- Sender kills the original instance of the process it holds
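The steps above can be sketched as a toy simulation (`Node`, `migrate` and the message handling are illustrative names, not a real migration API):

```python
from queue import Queue

class Node:
    def __init__(self, name):
        self.name = name
        self.processes = {}     # pid -> process state
        self.mailboxes = {}     # pid -> queue of pending messages

def migrate(sender, receiver, pid):
    # 1. Sender suspends the migrating process (pop also kills its local instance).
    state = sender.processes.pop(pid)
    state["status"] = "suspended"
    # 2. Sender holds a queue for messages arriving during the migration.
    pending = sender.mailboxes.pop(pid, Queue())
    # 3. Sender transmits the process state to a dummy process on the receiver.
    receiver.processes[pid] = state
    receiver.mailboxes[pid] = Queue()
    # 4. Sender forwards the queued messages to the receiver.
    while not pending.empty():
        receiver.mailboxes[pid].put(pending.get())
    # 5. Both nodes would now notify other nodes of the new location (omitted).
    # 6. The process resumes on the receiver.
    receiver.processes[pid]["status"] = "running"
```

In a real system step 3 is where the strategies differ (eager copy, pre-copy, copy-on-reference), since copying the address space dominates the cost.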
- Load balancing - Queuing theory
- Static load balancing - distribute the load up front and minimise communication cost
- Dynamic load balancing - adapts to the load at run time, but the coordination traffic itself presents a communication challenge
- Policies - Sender initiated, receiver initiated, symmetric, random
- Algorithms - Bidding, drafting
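A sketch of the sender-initiated policy (the threshold and probe count are made-up tuning knobs, not standard values):

```python
import random

def sender_initiated(queues, threshold=2, probes=3):
    """Sender-initiated load balancing: an overloaded processor probes a few
    random peers and hands one task to the first peer found below threshold."""
    for i, q in enumerate(queues):
        if len(q) <= threshold:
            continue                       # not overloaded: stay quiet
        peers = [k for k in range(len(queues)) if k != i]
        for j in random.sample(peers, min(probes, len(peers))):
            if len(queues[j]) < threshold:
                queues[j].append(q.pop())  # migrate one task to the lucky peer
                break
    return queues

print([len(q) for q in sender_initiated([["t1", "t2", "t3", "t4"], [], []])])
```

The receiver-initiated mirror image has idle processors probe for work instead; it behaves better under high system load, when senders would otherwise flood the network with probes.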
- Case study - background comparison between Windows and Linux.
- Linux 2.6 multi-processor scheduling - Prior to 2.6.23, the Linux kernel supported one runqueue per CPU. The Completely Fair Scheduler (CFS), introduced in 2.6.23 (complexity increased to O(log n) as opposed to O(1)), improves the separation between I/O-intensive and CPU-intensive tasks, which was originally estimated by penalising CPU-intensive processes and favouring I/O-intensive ones via priority tuning. Bear in mind Linux has mainly been used in a symmetrical shared-OS structure as far as OS responsibility is concerned. The underlying data structure used by CFS is a red-black tree per runqueue, ordered by each task's virtual runtime. Scheduling-domain-based load balancing was introduced in 2.6.17 to address unsatisfactory Linux scheduling performance on SMP and NUMA; the hierarchy of scheduling domains is built on the actual hardware resources available, and load balancing takes place at the domain level between different groups. Linux has also introduced energy-saving scheduling; a good reference on this can be found on Intel's website.
- One queue per processor - a run queue (containing multiple priority arrays to achieve O(1) scheduling selection, as in the single-processor case) per processor
- Same task tends to stay in one queue - to achieve cache affinity
- Load balancing - move tasks around queues
- No native support for gang scheduling as yet.
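As a toy illustration of the CFS idea described above (a heap stands in for the kernel's red-black tree; the names, slice and weights are made up):

```python
import heapq

class MiniCFS:
    """Toy model of CFS ordering: always run the task with the smallest
    virtual runtime. A heap stands in for the kernel's red-black tree."""
    def __init__(self):
        self.tree = []

    def enqueue(self, name, vruntime=0.0):
        heapq.heappush(self.tree, (vruntime, name))

    def pick_next(self, slice_ns=1000, weight=1.0):
        vruntime, name = heapq.heappop(self.tree)
        # Charge the time slice scaled by weight: heavier (higher-priority)
        # tasks accumulate vruntime more slowly, so they get picked more often.
        heapq.heappush(self.tree, (vruntime + slice_ns / weight, name))
        return name

cfs = MiniCFS()
cfs.enqueue("A")
cfs.enqueue("B")
print([cfs.pick_next() for _ in range(4)])   # alternates fairly: A, B, A, B
```

Picking the leftmost node of the red-black tree is what makes the real CFS selection O(log n) per scheduling decision.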
- Windows multiprocessor scheduling - Scheduling on Windows is thread-based, with priorities from 0 to 31 (as opposed to Linux's 140 priority levels, of which 100 are for real-time tasks). Threads are allocated time slices in round-robin fashion. For scheduling on SMP systems, Windows introduces the notions of a thread's processor affinity and of an ideal processor for a given thread.
Inter-machine
Architecture
- Centralised model - Terminal-based computing service (slot) provider
- Client-server model
- P2P - Each peer has equivalent capability => no peer is dedicated only to serving others. Publisher-consumer design pattern, brokerless messaging.
- Processor pool model - Task scheduling, mapreduce
- Grid computing - for heterogeneous and geographically distributed system, providing seamless access to available resources.
- Multi-tier C/S
- Stub-based design. An interface definition stage is involved during the design to produce consistent stubs to be used by both RPC invokers and invokees. The execution of an RPC can be either blocking or non-blocking - in other words, synchronous or asynchronous.
- Parameters have to be marshalled -> serialised -> transmitted -> de-serialised -> unmarshalled, passed by value in essence (be aware of different data representations on different physical nodes)
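A minimal sketch of that pipeline, assuming JSON as the explicit typing layer and a length-prefixed wire format (both are illustrative choices, not part of any particular RPC system):

```python
import json
import struct

def marshal(args):
    # Flatten the call arguments into a byte stream (pass-by-value):
    # JSON gives explicit typing independent of any machine's representation.
    payload = json.dumps(args).encode("utf-8")
    # Length-prefix so the receiver knows where the message ends; '!' forces
    # network (big-endian) byte order across heterogeneous nodes.
    return struct.pack("!I", len(payload)) + payload

def unmarshal(data):
    (length,) = struct.unpack("!I", data[:4])
    return json.loads(data[4:4 + length].decode("utf-8"))

wire = marshal({"op": "add", "args": [1, 2]})
print(unmarshal(wire))   # the call arrives by value on the far side
```

Reference parameters cannot travel this way, which is why RPC stubs either forbid them or copy-restore the pointed-to data.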
- How to locate the RPC service provider and how/when to bind?
- Name services operations
- Binding operations (via appropriate protocols)
- Endpoints operations
- Security operations
- Data conversion (marshalling)
- Stub memory management (temporarily buffer 'reference' data)
- RPC API calls (compatibility, fault tolerance)
Case studies
- Native RPC
- Sun RPC - also known as ONC RPC; the interface is defined in an IDL (compiled with rpcgen). It takes an early-binding approach - binding is done once, not per procedure call - and requires the programmer to pick a unique interface definition ID.
- DCE RPC - The interface is defined in an IDL file. Integrated with DCE's security service and directory service. uuidgen is used to produce the unique ID.
- Object RPC
- MS DCOM - The client connects to the SCM (Service Control Manager, running on the server) to request the creation of an object on the server. Underneath, for data transfer DCOM uses DCE RPC protocol packets plus version info and an IPID (remote object reference). The MIDL file is compiled by the IDL compiler to produce C++ code: a proxy (client side) and a stub (server side), both of which are COM objects. One key point is that the object's lifetime on the server is controlled by remote reference counting.
- CORBA - OS- and language-independent; a client requests services of a distributed object. The ORB (Object Request Broker) delivers the client request to the object and returns results to the client. Interfaces are defined in IDL, the Interface Definition Language (yes, it is a real language!), which is then compiled to whichever target language the client or server uses, producing the stub and skeleton. The ORB takes on a lot of responsibility here, such as locating the remote object, sending the request and collecting the result. This way the client and the object implementation can be on completely different platforms in different languages, unlike DCOM. IIOP is defined as an inter-ORB network protocol. There are many CORBA standard services, providing naming, events, object life-cycle management, transactions, load balancing, etc.
- Java RMI - uses an Object Registry to relate objects to names within name servers. rmic is the compiler that produces the stub and skeleton class files. On the server side, the object implementation registers itself with the Object Registry via binding; after that, a client queries rmiregistry to look up the name, which returns a remote reference handle the client can use to invoke the exposed methods.
- Web services - the third generation of RPC implementations. The goal is to let remotely hosted services behind firewalls be utilised.
- XML RPC - marshal data into XML format with explicit typing, transported over standard HTTP protocol to get around firewall restrictions. There are standard XML-RPC data types defined, as opposed to native data types.
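Python's standard library happens to ship an XML-RPC implementation, so the round trip can be shown end-to-end (the `add` procedure and loopback address are just for illustration; port 0 asks the OS for any free port):

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Server side: expose one procedure over plain HTTP.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy marshals the call into typed XML (<int> elements)
# and POSTs it over HTTP, which passes through most firewalls.
client = ServerProxy(f"http://127.0.0.1:{port}")
print(client.add(2, 3))   # 5
server.shutdown()
```

Because the transport is an ordinary HTTP POST, any intermediary that forwards web traffic forwards the RPC too, which is exactly the firewall-traversal point made above.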
- SOAP - generalises XML-RPC: SOAP still defines messages in XML format, but they are not necessarily RPC; SOAP adopts XML over HTTP. The web service interface is defined in WSDL, which is exchangeable. In terms of SOAP vs POX (plain old XML), AJAX adopts the plain XML approach to send and receive data, allowing client-side JavaScript to initiate asynchronous HTTP requests (REST-style) and process the results at object level. Google has clearly picked POX as its favourite.
- MS .NET Remoting - I have briefly covered .NET Remoting in a previous blog post. Aiming to replace low-level DCOM, MS introduced Remoting on top of the new Common Language Runtime (CLR) as the object runtime environment. A web service is compiled into intermediate language (IL) format whichever language is chosen to implement it; the CLR just-in-time compiler generates native code when the program is first loaded. (It is possible to pre-compile an assembly into target-machine instruction binary in the global assembly cache, though.) An object derived from MarshalByRefObject is deemed a remote object. Communication is supported in XML, SOAP or binary encoding, through HTTP, TCP and SMTP protocol channels. In summary, .NET also has an explicit Web Services stack, based on HTTP (communication protocol) + XML (data format) + SOAP (format to request a service) + WSDL (format for defining a service) + UDDI (protocol for discovering services). I hope you can see the difference between Remoting and .NET Web Services, as the latter is a fixed combination at a higher functional level.