Warning: This document is still a draft.

DRAFT - About big data and distributed systems

Short introduction.



Thesis: Matthieu Durut - Good introduction

Big data

Two options:

  • big machines: parallel computing on many CPUs
  • small heterogeneous machines: distributed computing

Distributed Algorithms

Map-Reduce

Heart of Apache Hadoop
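To make the idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases on a word count. It only illustrates the programming model, not how Hadoop implements it; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # In a real cluster each document would be mapped on a different machine.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data", "big machines", "distributed data"])
print(counts)
```

The map and reduce functions have no shared state, which is what lets a framework run them on many machines in parallel.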

Clustering:

Two options:

  • Message passing clustering (MPC): workers synchronize over several rounds
  • Ensemble clustering: each algorithm performs its clustering on a part of the whole data set and sends the result to a central entity.
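A rough sketch of the ensemble option, under strong assumptions: the data is one-dimensional, every worker runs a tiny k-means on its own partition, and the central entity simply clusters the local centroids again. The merge rule and the names `kmeans_1d` and `ensemble_clustering` are choices made for this illustration, not something taken from a specific paper or library.

```python
def kmeans_1d(points, k, iterations=10):
    """Tiny 1-D k-means with a deterministic initialisation."""
    pts = sorted(points)
    # Pick k evenly spaced points of the sorted data as initial centroids.
    centroids = [pts[i * (len(pts) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

def ensemble_clustering(partitions, k):
    # Local step: each worker only sees its own partition.
    local_results = [kmeans_1d(part, k) for part in partitions]
    # Central step: the central entity only receives the local centroids.
    all_centroids = [c for result in local_results for c in result]
    return kmeans_1d(all_centroids, k)

partitions = [[1.0, 1.2, 9.8, 10.0], [0.8, 1.1, 10.2, 9.9]]
centers = ensemble_clustering(partitions, k=2)
print(centers)
```

Only k numbers per worker travel over the network, instead of the raw data, and there is a single communication round.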

Grid computing + P2P architecture: more resilient than a parallel machine. Beowulf cluster: a homogeneous computing grid built from inexpensive components (e.g. Raspberry Pi).

Grid Computing

Networked computers have little knowledge of the ensemble. Loosely coupled. Not necessarily identical.

Supercomputer: elements are very close to each other (a grid, by contrast, pays a communication overhead).

Cloud Computing

Assumes equivalent servers located close to each other.

Virtual machines (VMs) can run on it.

The hardware is not owned by the consumer. Consumers do not need to invest in hardware.

Software Stack

Three layers:

  • Storage level: Data replication system
  • Execution level: how machines are organized, monitored, and restarted.
  • DSL (domain-specific language): a language to describe what needs to be computed (not how!)

Map Reduce

Sawzall ?


Relational DataBase Management System (RDBMS), w

ACID:

Atomicity

A transaction is applied as if it were instantaneous. If a transaction is made of n operations:

  • either 0 are done
  • or all n are done

There is no in-between state.
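A small sketch of the all-or-nothing behaviour, using Python's standard `sqlite3` module with an in-memory database. The `accounts` table, the `transfer` helper and the simulated failure are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move money in one transaction made of n = 2 operations."""
    try:
        with conn:  # commits if the block succeeds, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            if fail_midway:
                raise RuntimeError("simulated crash between the two operations")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except RuntimeError:
        pass  # the rollback has already happened

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

transfer(conn, "alice", "bob", 30, fail_midway=True)
after_failure = balances(conn)   # 0 operations applied

transfer(conn, "alice", "bob", 30)
after_success = balances(conn)   # both operations applied
```

After the failed transfer the debit is undone, so the money is never "in flight" between the two accounts.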

Consistency

Multiple definitions:

  • user-specified invariants are never broken.
  • all updates are done at the same logical time.
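The first definition can be sketched with a `CHECK` constraint in SQLite (through Python's standard `sqlite3` module). The invariant "a balance is never negative" and the `withdraw` helper are assumptions chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The user-specified invariant is declared once, in the schema.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()

def withdraw(conn, name, amount):
    """Return True if the withdrawal was applied, False if it broke the invariant."""
    try:
        with conn:
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, name))
        return True
    except sqlite3.IntegrityError:
        # The database refused the update, so the invariant still holds.
        return False

small_ok = withdraw(conn, "alice", 40)
overdraft_ok = withdraw(conn, "alice", 500)
final_balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
```

The database, not the application code, is what guarantees the invariant here.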

Isolation

Each transaction executes independently of the others.

Durability

Once committed, a change is permanent.

CAP theorem

A distributed system cannot guarantee all three of the following at the same time:

  • Availability
  • Consistency
  • Partition tolerance
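A toy model of the trade-off, not a proof: two replicas of a single value, and a flag saying whether they can still talk to each other. The class `TwoNodeStore` and its "CP" / "AP" modes are hypothetical names used only for this sketch.

```python
class Replica:
    def __init__(self):
        self.value = None

class TwoNodeStore:
    """Two replicas; `partitioned` means they cannot exchange messages."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Favor consistency: refuse the write, giving up availability.
                raise RuntimeError("unavailable during partition")
            # Favor availability: accept locally, the replicas may now disagree.
            replica.value = value
            return
        # No partition: replicate synchronously, both properties hold.
        replica.value = value
        other.value = value

cp = TwoNodeStore("CP")
cp.write(cp.a, 1)
cp.partitioned = True
try:
    cp.write(cp.a, 2)
    cp_available = True
except RuntimeError:
    cp_available = False

ap = TwoNodeStore("AP")
ap.write(ap.a, 1)
ap.partitioned = True
ap.write(ap.a, 2)
ap_consistent = ap.a.value == ap.b.value
```

Once the network is partitioned, the store has to pick one: the "CP" store stays consistent but rejects the write, the "AP" store answers but its replicas diverge.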

Open Problem

Efficiency of cloud computing vs. an in-house datacenter? (Cloud computing is better.)


