Starting off as a muggle who is naïve to the Maths and Data Science world.

Day 3

What is the benefit of distributed computing?

Using the parallel concept, the original task might take 11 hours to complete, but if run in parallel on 4 machines, it would take only about 3 hours.
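A quick back-of-the-envelope sketch of that arithmetic (my own illustration, reusing the 11-hour figure above and ignoring coordination overhead, which is why it comes out as roughly 3 hours rather than exactly 11/4):

```python
# Ideal-speedup arithmetic for the example above (ignores coordination overhead).
serial_hours = 11              # time on a single machine
machines = 4                   # machines working in parallel

ideal_parallel_hours = serial_hours / machines
print(f"~{ideal_parallel_hours:.2f} hours")   # 2.75 hours, i.e. roughly 3 hours
```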

Challenges

  • Division of data (e.g. got 4 machines but the data is divided into only 3 splits)
  • Distribution of data (2TB to M1, 1TB to M2, 500GB to M3, 500GB to M4)
  • Combining and consolidating the results of the splits (see the toy sketch after this list)
  • Parallel Computing
  • Costly architecture
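The division/combination challenges above are easy to picture in code, so here is a toy sketch (my own, with made-up numbers): split a dataset into one chunk per machine, compute a partial result on each chunk, then consolidate:

```python
# Toy sketch of "division of data" and "combining results" (hypothetical numbers).
data = list(range(10))                 # pretend this is the full dataset
machines = 4

# Divide: one chunk per machine; chunks may end up uneven (cf. the challenges above).
chunk_size = -(-len(data) // machines)                       # ceiling division
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Each machine computes a partial result on its own chunk (here, a partial sum).
partial_results = [sum(chunk) for chunk in chunks]

# Combine/consolidate the partial results into the final answer.
total = sum(partial_results)
print(chunks, partial_results, total)   # 4 chunks, 4 partial sums, total = 45
```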

Hadoop

Core Principle

  • bring code to data rather than data to code
  • designed with the ability to cope with node failures
  • scale out storage – add more nodes/machines to an existing distributed application (software layer is designed for node additions/removals)

Hadoop Components

  1. Storage
    – Hadoop Distributed File System (HDFS)
    – responsible for storing data on clusters
    – data are split into blocks and distributed across multiple nodes in the cluster
    – each block is replicated multiple times
  2. Processing
    – MapReduce (MR)
    – MapReduce is the system used to process the data in the Hadoop cluster
    – consists of 2 phases: map and reduce (see the word-count sketch after this list)
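A minimal sketch of those two phases, written in the style of a Hadoop Streaming word count (my own illustration, run locally on a tiny made-up sample; on a real cluster the mapper and reducer would be separate scripts submitted to the streaming jar):

```python
# wordcount_sketch.py - map and reduce phases of a word count (Hadoop Streaming style).
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input split."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: group the pairs by word and sum the counts."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data is big", "hadoop processes big data"]
    for word, count in reducer(mapper(sample)):
        print(word, count)   # big 3, data 2, hadoop 1, is 1, processes 1
```

On a cluster, the shuffle/sort between the two phases is handled by the MapReduce framework itself; the sorted(...) call here only stands in for that step.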

Why Hadoop? (Features)

  • ability to store big data
  • compute power to process big data
  • fault tolerance (protected against hardware failure, replication)
  • flexibility (unlike a traditional database, you don’t need to pre-process the data – no need to validate it before storing it); good for fast writes, whereas a traditional database is good for fast reads
  • cost effective (low setup cost, open source)
  • scalability (add nodes whenever required to accommodate more data)

Hadoop Definition

Hadoop is an open-source software framework (LICENSE) for distributed storage and distributed parallel processing (HOW) of big data (SIZE) using a cluster of commodity hardware (COST).

Hadoop Ecosystem

Source: https://www.youtube.com/watch?v=1WY63n2XRLM
Component | Description | Counterpart
Flume | Pipeline for unstructured data; log collector/controller | Kafka, Talend, Spark Streaming
Sqoop | Pipeline for structured data ([SQ]L database + Had[oop]) | –
Category: ETL tools – bring data in/out, bi-directional
Component | Description | Counterpart
YARN / MapReduce | Scheduler; processes big data/large datasets (the old version was YARN + MapReduce separately, now combined) | –
Pig | Created by Yahoo; a scripting language, executed line by line | –
Mahout | Machine learning processing | –
R Connector | Integrates with Mahout to do statistical processing | –
Hive | Created by Facebook; converts HQL into MapReduce jobs, high latency (sketch after this table) | –
Category: Processing tools
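Since the Hive row is the one that best shows "big data tool on top of MapReduce", here is a hedged sketch of submitting HQL from Python (the table name web_logs and the query are hypothetical; it assumes the hive CLI is available on the cluster edge node):

```python
# Hedged sketch: submitting a HiveQL query via the Hive CLI from Python.
# Hive compiles the HQL into MapReduce jobs under the hood, which is why
# results come back with high latency compared to a traditional database.
import subprocess

hql = "SELECT status, COUNT(*) FROM web_logs GROUP BY status;"   # hypothetical table
result = subprocess.run(["hive", "-e", hql], capture_output=True, text=True)
print(result.stdout)
```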
Component | Description | Counterpart
HBase | NoSQL database | Cassandra, MongoDB, DynamoDB
Category: Database (logical storage)

There are 2 types of storage (physical & logical). For example, MSSQL stores its actual data in a .mdf file located on the D drive: logically the data is stored inside the .mdf file, whereas physically the .mdf file itself is stored on the D drive (hard disk). In even more layman terms, a video is a sequence of images (that’s why we talk about frames per second (fps)): a picture is stored in the video, but the video itself is stored on the hard disk.

Component | Description | Counterpart
ZooKeeper | Coordination and synchronization tool | –
Oozie | Job workflow scheduler | –
Ambari | Maintains and monitors the cluster; can create nodes | Cloudera Manager
Category: Administration tools

Problem in Big Data Analytics

  • Hardware failure – overcome by HDFS replication (see the sketch below)
  • Combining the data after it has been distributed (an often-forgotten step) – overcome by MapReduce
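As a hedged illustration of the HDFS point (the path /data/sales.csv is hypothetical, and this assumes the hdfs command is on the PATH), the block report below is where you can actually see the replicas that make hardware failure survivable:

```python
# Hedged sketch: asking HDFS where a file's blocks and replicas live.
# `hdfs fsck <path> -files -blocks -locations` prints each block and the
# datanodes holding its replicas - the replication that protects against
# node failure.
import subprocess

report = subprocess.run(
    ["hdfs", "fsck", "/data/sales.csv", "-files", "-blocks", "-locations"],
    capture_output=True, text=True,
)
print(report.stdout)
```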

Tutorial 2

P.S. Edited for errors in the forms.
Ans 5. Protection against hardware failure. (It replicates the data on 3 nodes, so if any node fails, the jobs running on that node are automatically redirected to one of its replicas.)
Ans 6. Since the data is replicated in Hadoop, does that mean any calculation done on one node will also be replicated on the other two?
No. The calculation is done on only one node, not on its replicas.

Tutorial 3

Read the given article on the characteristics of big data and explore the new characteristics identified to handle big data efficiently, with suitable justifications. https://www.researchgate.net/publication/315867458_A_study_of_big_data_characteristics
Tabulate your findings against each of the new characteristics.

Example Answer
Based on the given research, there are 3 new big data characteristics suggested: Verbosity, Voluntariness, and Versatility.
In my opinion, Verbosity actually has a similar concept to Value, Vagueness, and Veracity in terms of measuring the quality of the data. Verbosity means redundancy of the information available in different sources, which leads to less useful data (Value), unclear data (Vagueness), and inconsistent or inaccurate data (Veracity).
Meanwhile, Voluntariness can be considered a new big data characteristic, as it means the presence of the data whenever it is needed in an organization.
Similarly, Versatility can also be regarded as a new characteristic, as the purpose of the data might differ depending on the context in which it is used.

My Try
The article suggests 3 new characteristics (Verbosity, Voluntariness, Versatility) to be considered for handling big data effectively.

The Verbosity characteristic the article highlights is about redundancy of information (wrong, out-of-date, or incomplete). I would raise a different opinion: in a world where information flows freely, it is hard to identify correct or wrong information, let alone incomplete data. Until the data is processed, it is unwise to say whether the data itself is correct or wrong. I would rather refer back to 1) Volume, whose purpose, as the article mentions, is just about saving storage and processing; 2) Veracity, assessed after data capture; 3) Validity, the method of processing data for result correctness; and 4) Volatility, which defines how long is long enough before data drift causes bad analysis. Apart from that, log analysis does rely on "bad" data to find anomalies.

For Voluntariness, I would agree, as nowadays laws and regulations already show high concern about how data should be collected; acts like the Personal Data Protection Act (PDPA) penalize any entity that violates them. Although a data collector has the capability to collect data and make it available to the public, or to be used in a specific context, it must go through a highly regulated process. We have already seen the backfire of freely available data: the documentary “The Great Hack” shows how big data could psychologically influence people to make biased decisions.

Lastly, for Versatility, I would agree, as we never know how data will behave in a different context. In “Weather-Induced Mood, Institutional Investors, and Stock Returns”, weather conditions show a significant correlation with stock prices. Who knows, maybe Netflix watch history could correlate with health issues?
