Spark
A cluster computing framework that processes data in memory, created to overcome MapReduce's performance issues. The cluster manager allocates resources based on the job (request). Spark creates Resilient Distributed Datasets (RDDs), partitioning the data once it receives it. Once an RDD is ready, Spark applies transformations as a graph (a Directed Acyclic Graph, DAG).
Processing consists of two phases: transformations and actions.
** Spark relies on memory for processing; just as restarting a laptop/PC reinitializes its memory, in-memory data can be lost. To be fault tolerant, data can be replicated to other worker nodes, and lazy evaluation keeps the lineage so lost partitions can be recomputed.
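A minimal word-count sketch in Scala (assuming a local Spark setup and a hypothetical input.txt) showing the two phases: the transformations only describe the work, and the action submits the job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TwoPhases {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TwoPhases").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformation phase: nothing runs yet, Spark only records the DAG
    val lines  = sc.textFile("input.txt")          // hypothetical input file
    val words  = lines.flatMap(_.split("\\s+"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Action phase: collect() submits the job and materializes the result
    counts.collect().foreach(println)

    sc.stop()
  }
}
```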
Spark Components

- Spark Core – the base engine; controls everything
- Spark SQL – SQL interface to query Spark (see the sketch after this list)
- Spark Streaming – data ingestion (getting data in and out); the counterpart to Flume
- MLlib – built-in machine learning library
- GraphX – to store and retrieve data in graph form
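As a rough illustration of the Spark SQL component (the table name and rows below are made up), a DataFrame can be registered as a view and queried with plain SQL:

```scala
import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SqlExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical in-memory data registered as a temporary view
    val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // Spark SQL: query the data with plain SQL
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```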
Resilient Distributed Datasets (RDD)
Data is loaded into memory (partitioned across nodes). The lineage of the data is remembered, and RDDs are immutable: transformations/updates can only create a new RDD, never modify an existing one.
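A small sketch of immutability (the sample numbers are arbitrary): a transformation leaves the original RDD untouched and returns a new one derived from it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ImmutableRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ImmutableRdd").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 10, numSlices = 4)  // partitioned across 4 slices

    // A transformation never changes `numbers`; it returns a brand new RDD
    val doubled = numbers.map(_ * 2)

    println(numbers.collect().mkString(", "))  // original RDD is unchanged
    println(doubled.collect().mkString(", "))  // new RDD derived from the lineage

    sc.stop()
  }
}
```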
Directed Acyclic Graph (DAG)
A job is an action applied to a series of transformation operations on RDDs. No actual data modification is performed until an action is submitted.
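A quick illustration of this lazy evaluation (values are arbitrary): the transformations below only record the DAG, and `toDebugString` prints that lineage; nothing runs until the `count()` action is submitted:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyDag {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyDag").setMaster("local[*]"))

    // Only the DAG of transformations is built here; no data is touched yet
    val rdd = sc.parallelize(1 to 1000)
      .filter(_ % 2 == 0)
      .map(_ * 10)

    println(rdd.toDebugString)  // print the recorded lineage (the DAG)

    // Submitting an action is what actually runs the job
    println(rdd.count())

    sc.stop()
  }
}
```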

