Spark
A cluster computing framework that processes data in memory, created to overcome MapReduce's performance issues. The cluster manager allocates resources based on the job (request). Spark creates Resilient Distributed Datasets (RDDs), partitioning the data once it receives it. Once an RDD is ready, Spark applies transformations as a graph (a Directed Acyclic Graph, DAG).
Processing consists of two phases: transformations and actions.
** Spark relies on memory for processing; just as restarting a laptop/PC reinitializes its memory, in-memory data can be lost. To be fault tolerant, data can be replicated to other worker nodes, and lazy evaluation keeps the lineage so lost partitions can be recomputed.
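A minimal word-count sketch in Scala (assuming a local Spark setup and a hypothetical input.txt) showing the two phases: the transformations only describe the work, and the action submits the job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TwoPhases {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwoPhases").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformation phase: nothing runs yet, Spark only records the DAG
    val lines  = sc.textFile("input.txt")          // hypothetical input file
    val words  = lines.flatMap(_.split("\\s+"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Action phase: collect() submits the job and materializes the result
    counts.collect().foreach(println)

    sc.stop()
  }
}
```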
Spark Components

- Spark Core – the base engine; controls everything
- Spark SQL – SQL interface to query Spark (see the sketch after this list)
- Spark Streaming – data ingestion (getting data in and out); the counterpart to Flume
- MLlib – built-in machine learning library
- GraphX – to store and retrieve data in graph form
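As a rough illustration of the Spark SQL component (the table name and rows below are made up), a DataFrame can be registered as a view and queried with plain SQL:

```scala
import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SqlExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical in-memory data registered as a temporary view
    val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // Spark SQL: query the data with plain SQL
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```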
Resilient Distributed Datasets (RDD)
Data is loaded into memory (partitioned across nodes). The lineage of the data is remembered, and RDDs are immutable: transformations/updates can only create a new RDD, never modify an existing one.
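A small sketch of immutability (the sample numbers are arbitrary): a transformation leaves the original RDD untouched and returns a new one derived from it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ImmutableRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ImmutableRdd").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 10, numSlices = 4)  // partitioned across 4 slices

    // A transformation never changes `numbers`; it returns a brand new RDD
    val doubled = numbers.map(_ * 2)

    println(numbers.collect().mkString(", "))  // original RDD is unchanged
    println(doubled.collect().mkString(", "))  // new RDD derived from the lineage

    sc.stop()
  }
}
```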
Directed Acyclic Graph (DAG)
A job is an action applied to a series of transformation operations on RDDs. No actual data modification is performed until an action is submitted.
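A quick illustration of this lazy evaluation (values are arbitrary): the transformations below only record the DAG, and `toDebugString` prints that lineage; nothing runs until the `count()` action is submitted:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyDag {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyDag").setMaster("local[*]"))

    // Only the DAG of transformations is built here; no data is touched yet
    val rdd = sc.parallelize(1 to 1000)
      .filter(_ % 2 == 0)
      .map(_ * 10)

    println(rdd.toDebugString)  // print the recorded lineage (the DAG)

    // Submitting an action is what actually runs the job
    println(rdd.count())

    sc.stop()
  }
}
```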

