The most exciting thing about this world is its ever changing quality.

Wednesday, June 10, 2009

Distributed system notes - Part IV

Parallel computing vs. Distributed computing - Get down another level
Flynn gave the taxonomy for distributed architecture as: SISD, SIMD, MISD and MIMD.
To sub-classify MIMD by

  • Memory
Shared memory system -> multi-processors
No shared memory system -> network of computers

  • Interconnection
Bus
Switch (Circuit-switched and packet-switched)

  • Delay/bandwidth
Tightly coupled system - Processors share most resources inc. Memory. The communication is achieved over shared bus too. (Not very flexible and scalable, fault sensitive)
Loosely coupled system - Most resources are not shared with each processor normally own private cache (in tight coupled system too) and dedicated memory. The communication is achieved via messaging over explicit communication link or shared virtual memory.

Five categories in the degree of granularity

Grain sizeDescriptionSynchronisation interval (Instructions)
FineParallelism inherent in a single instruction stream<>
MediumParallel process of multi-tasking in a single application program20 ~ 200
CoarseMultiprocessing of concurrent processes in multiprogramming environment200 ~ 2000
Very CoarseDistributed processing across network nodes to form a single computing environment2000 ~ 1M
IndependentMultiple unrelated processN/A

Intra-machine

Machine architecture
  • Interconnection
    • Symmetric Multi-Processing (SMP) - all CPUs connected to single backbone bus
      • This way memory and peripheral I/O will be accessed via the same shared bus.
      • Local cache is used to avoid bus overload
      • To maintain memory coherence -> Snoopy cache
    • Switched multi-processors
      • SMP doesn't really scale well. Switched multi-processors divide memory into groups and connect memory chunks with CPUs via a crossbar switch. (n2 switch crosspoint, constant 1 switching stage)
      • To reduce the number of crosspoints, Omega network introduces more switching stage. (for each CPU <-> memory chunk combination, there is a log2n switching stage, each with n/2 switches; totally n*log2n/2 switches)
      • NUMA - hierarchical memory system
      • HyperCube

  • Approach to share OS responsibility
    • Master/slave - Master process executes most of the operating system code, inc. I/O tasks, with slaves performing processor-bound user programs. Thus slave processors normally have to wait master to handle the interrupt for OS level operations.
    • Separate kernels - Each CPU runs its own OS and responds only to its local processes. Synchronisation can be achieved via critical region or software/hardware (TSL) level spinlock.
    • Symmetrical (shared OS) - OS is in charge here. OS manages a pool of identical processors with high amount of resource sharing as well as mutual exclusions. This approach allows OS maximise parallelism potential and process to utilise all computation power as much as possible with the load significantly on OS.

  • Memory access
    • Uniform memory access (UMA) - All processors share main memory. This would typically be used in shared bus or SMP system.
    • Non-uniform memory access (NUMA) - Global memory is divided into partitions which are then allocated to processors as local memory. This avoid bus conflicts and quicker local memory access for processors.
    • COMA
    • NORMA

OS Scheduling - determine the order and to which processor the process is assigned and dispatched to
  • Issue break down
    • Assignment of processes to processors
    • Use of multiprogramming on individual processor
    • Actual dispatching of a process
  • Goals
    • Achieve parallelism
      • Timesharing - synchronised single queue based scheduling
      • Synchronisation of parallel programs - Gang scheduling
    • Achieve processor Cache affinity - affinity based scheduling, space-partitioning scheduling (what that means is for example, an interrupted process has already put its data, state on the L1 cache of one processor, to move/migrate to the other processor based on naive rescheduling will slow waste the cached info hence slow things down. This could be even worse if underlining architecture is NUMA rather than symmetric access.)
      • Soft affinity
      • Hard affinity
  • Types of scheduling
    • Job-blind - direct extension from single processor scheduling
      • FIFO, EDF, Round-robin, shortest process first
    • Job-aware
      • The ones to maximise parallelism - shortest-number-of-processes-first, Round-robin job, co-scheduling (MIS ?)
      • The ones to maximise affinity - dynamic partitioning
  • Process migration - minimal residual dependency
    • Request initiated from either sender processor or receiver
    • Sender suspends the migrating process
    • Sender creates message queue for migrating process
    • Sender transmit the migrating process state to a dummy process in receiver - there are various migration strategies as far as how to copy memory pages is concerned.
    • Sender forward the messages to receiver
    • Sender and receiver notify the others of the migration
    • Sender kills the original instance of the process it holds
  • Load balancing - Queuing theory
    • Static load balancing - Distribute the load and minimise communication cost
    • Dynamic load balancing - Present communication challenge
      • Policies - Sender initiated, receiver initiated, symmetric, random
      • Algorithms - Bidding, drafting
  • Case study - background comparison between Windows and Linux.
    • Linux 2.6 multi-processor scheduling - Prior to 2.6.23, Linux kernel only supports one runqueue per CPU. The introduction of Completely Fair Scheduler (CFS, complexity increased to O(logn) as opposed to O(1)) in 2.6.23, is to improve the separation between I/O intensive tasks and CPU intensive tasks, which was originally estimated by penalising CPU intense processes and favor for I/O intense via tuning the priorities. Bear in mind Linux has mainly been used in symmetrical shared OS structure as far as OS responsibility is concerned. The underneath data structure used by CFS is a red-black tree per runqueue. Scheduling domain based Load balancing was introduced in 2.6.17 to address unsatisfactory Linux scheduling performance on SMP and NUMA. The hierarchy of scheduling domain is built on actual hardware resources available. Load balancing takes place at the domain level between different groups. Linux also introduced energy saving scheduling. A good reference on this can be found on Intel's website.
      • One queue per processor - a run queue (containing multiple priority array to achieve O(1) scheduling selection as in single processor's case) per processor
      • Same task tends to stay in one queue - to achieve cache affinity
      • Load balancing - move tasks around queues
      • No native support for gang scheduling as yet.
    • Windows multi-process scheduling - The scheduling on windows is thread-based, with priority from 0 to 31 (as opposed to Linux 140 priority within which 100 is for real time tasks). These threads will be allocated time slices on round-robin fashion. Scheduling on SMP system, Windows introduces the notion of a thread's processor affinity and ideal processor for a given thread.

Inter-machine

Architecture
  • Centralised model - Terminal-based computing service (slot) provider
  • Client-server model
  • P2P - Each peer has equivalent capability => no peer is dedicated to only serve others. Publisher-consumer design patter, brokerless messaging.
  • Processor pool model - Task scheduling, mapreduce
  • Grid computing - for heterogeneous and geographically distributed system, providing seamless access to available resources.
  • Multi-tier C/S

Remote procedure call - To execute function/service defined within other components of the system
  • Stub based design. An interface definition stage is involved during the design to product consistent stubs to be used both in RPC invokers and invokees. The execution of RPC could either be blocking or non-blocking, in other words, synchronous or asynchronous.
  • Parameters have to be marshalled -> serialised -> transmitted -> de-serialised -> unmarshalled, by value in essence (be aware of different data representation in different physical nodes)
  • How to locate the RPC service provider and how/when to bind?
  • Name services operations
  • Binding operations (via appropriate protocols)
  • Endpoints operations
  • Security operations
  • Data conversion (marshalling)
  • Stub memory management (temporarily buffer 'reference' data)
  • RPC API calls (compatibility, fault tolerance)

Case studies
  • Native RPC
    • Sun RPC - also known as ONC RPC, interface defined in IDL (rpcgen compiler). Early binding approach, not by procedure call basis. It requires programmer to pick a unique interface definition ID.
    • DCE RPC - Interface is defined in IDN file. Integrated with DCE's security service and directory service. uuidgen is used to produce the unique ID.
  • Object RPC
    • MS DCOM - Client connects to SCM (Service Control Manager, running on server) to request the creation of object on server. Underneath data transfer DCOM uses DCE RPC protocol packets + version info + IPID (remote object references). MIDL file is compiled by IDL compiler to produce C++ code, proxy (client) and stub (server), both are COM objects. One key point is the object (in the server) lifetime is controlled by remote reference counting.
    • CORBA - OS and language independent. To achieve, request services of a distributed object. ORB (Object request broker) delivers the client request to the object and return results to client. The interfaces are defined in IDL, interface definition language (Yes, it is a real language!), which will then be compiled to target language client or server use, to produce stub and skeleton. ORB takes a lot of responsibility in this case such as locating remote object, sending request, collecting result. This way, the client and object implementation could be on completely different platform in different language, not like DCOM. IIOP is defined as an inter-ORB network protocol. There are many CORBA standard services providing services as naming, events, managing object life cycle, transaction, load balancing etc.

    • Java RMI - using Object Registry to relate objects to names within name servers. rmic is the compiler to produce stub and skeleton class files. In the server side, the object implementation needs to register itself with the Object registry via binding, after which, client will inquiry rmiregistry to lookup name. It will then return a remote reference handle to the client for it to invoke exposed methods.
  • Webservices - the third generation of RPC implementation. The goal is to have remote hosted services behind firewalls could be utilised.
    • XML RPC - marshal data into XML format with explicit typing, transported over standard HTTP protocol to get around firewall restrictions. There are standard XML-RPC data types defined, as opposed to native data types.
    • SOAP - to generalise XML RPC, SOAP still defines message in XML format but not necessary RPC, instead SOAP adopts XML over HTTP. Web service interface is defined in WSDL, which is exchangeable. In terms of SOAP v POX (plain-old-XML), AJAX adopts the standard XML approach to send and receive data, allowing client side javascript to initiate asynchronous HTTP request (REST) and process the result at object level. Google has obviously spit out its favorite as POX.
    • MS .Net Remoting - I have briefly covered .Net remoting in previous blog. Aiming at replacing low level DCOM, MS introduced Remoting on top of the new common language runtime (CLR) as the object run time environment. Webservice will be compiled into immediate language (IL) format whichever language was chosen to implement. CLR just-in-time compiler will generate native code for the first loaded programs. (It is possible to pre-load the interpreted assembly into target machine instruction binary to global assembly cache though.) Object derived from MarshalByRefObject is deemed as remote objects. The communication is supported in XML, SOAP or binary encoding, through HTTP, TCP and SMTP protocol channel. In summary, .Net also has an explicit set of Web Service, based on HTTP (communication protocol) + XML (data format) + SOAP (format to request service) + WSDL (format for defining service) + UDDI (protocol for discovering service). I hope you can see the difference between Remoting and .Net Web Service, as the latter is a fixed combination and higher functional level.

No comments: