Starting off as a muggle who is naïve to the Maths and Data Science world.

Day 3

What is the benefit of distributed computing?

Using the parallel concept, the original task might take 11 hours to complete, but if run in parallel on 4 machines, it would take only about 3 hours.
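A quick back-of-the-envelope sketch of that arithmetic (my own illustration, reusing the 11-hour figure above and ignoring coordination overhead, which is why it comes out as roughly 3 hours rather than exactly 11/4):

```python
# Ideal-speedup arithmetic for the example above (ignores coordination overhead).
serial_hours = 11              # time on a single machine
machines = 4                   # machines working in parallel

ideal_parallel_hours = serial_hours / machines
print(f"~{ideal_parallel_hours:.2f} hours")   # 2.75 hours, i.e. roughly 3 hours
```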

Challenges

  • Division of data (e.g. got 4 machines but the data is divided into only 3 splits)
  • Distribution of data (2TB to M1, 1TB to M2, 500GB to M3, 500GB to M4)
  • Combining and consolidating the results of the splits (see the toy sketch after this list)
  • Parallel Computing
  • Costly architecture
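The division/combination challenges above are easy to picture in code, so here is a toy sketch (my own, with made-up numbers): split a dataset into one chunk per machine, compute a partial result on each chunk, then consolidate:

```python
# Toy sketch of "division of data" and "combining results" (hypothetical numbers).
data = list(range(10))                 # pretend this is the full dataset
machines = 4

# Divide: one chunk per machine; chunks may end up uneven (cf. the challenges above).
chunk_size = -(-len(data) // machines)                       # ceiling division
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Each machine computes a partial result on its own chunk (here, a partial sum).
partial_results = [sum(chunk) for chunk in chunks]

# Combine/consolidate the partial results into the final answer.
total = sum(partial_results)
print(chunks, partial_results, total)   # 4 chunks, 4 partial sums, total = 45
```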

Hadoop

Core Principle

  • bring code to data rather than data to code
  • designed with the ability to cope with node failures
  • scale out storage – add more nodes/machines to an existing distributed application (software layer is designed for node additions/removals)

Hadoop Components

  1. Storage
    – Hadoop Distributed File System (HDFS)
    – responsible for storing data on clusters
    – data are split into blocks and distributed across multiple nodes in the cluster
    – each block is replicated multiple times
  2. Processing
    – MapReduce (MR)
    – MapReduce is the system used to process the data in the Hadoop cluster
    – consists of 2 phases: map and reduce (see the word-count sketch after this list)
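A minimal sketch of those two phases, written in the style of a Hadoop Streaming word count (my own illustration, run locally on a tiny made-up sample; on a real cluster the mapper and reducer would be separate scripts submitted to the streaming jar):

```python
# wordcount_sketch.py - map and reduce phases of a word count (Hadoop Streaming style).
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input split."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: group the pairs by word and sum the counts."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data is big", "hadoop processes big data"]
    for word, count in reducer(mapper(sample)):
        print(word, count)   # big 3, data 2, hadoop 1, is 1, processes 1
```

On a cluster, the shuffle/sort between the two phases is handled by the MapReduce framework itself; the sorted(...) call here only stands in for that step.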

Why Hadoop? (Features)

  • ability to store big data
  • compute power to process big data
  • fault tolerance (protected against hardware failure, replication)
  • flexibility (unlike a traditional database, you don’t need to pre-process the data – no need to validate it before storing it); good for fast writes, whereas a traditional database is good for fast reads
  • cost effective (low setup cost, open source)
  • scalability (add nodes whenever required to accommodate more data)

Hadoop Definition

Hadoop is an open-source software framework (LICENSE) for distributed storage and distributed parallel processing (HOW) of big data (SIZE) using a cluster of commodity hardware (COST).

Hadoop Ecosystem

Source: https://www.youtube.com/watch?v=1WY63n2XRLM
Component | Description | Counterpart
Flume | Pipeline for unstructured data; log collector/controller | Kafka, Talend, Spark Streaming
Sqoop | Pipeline for structured data ([SQ]L database + Had[oop]) | –
Category: ETL tools – bring data in/out, bi-directional
Component | Description | Counterpart
YARN / MapReduce | Scheduler; processes big data/large datasets (the old version was YARN + MapReduce separately, now combined) | –
Pig | Created by Yahoo; a scripting language, executed line by line | –
Mahout | Machine learning processing | –
R Connector | Integrates with Mahout to do statistical processing | –
Hive | Created by Facebook; converts HQL into MapReduce jobs, high latency (sketch after this table) | –
Category: Processing tools
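Since the Hive row is the one that best shows "big data tool on top of MapReduce", here is a hedged sketch of submitting HQL from Python (the table name web_logs and the query are hypothetical; it assumes the hive CLI is available on the cluster edge node):

```python
# Hedged sketch: submitting a HiveQL query via the Hive CLI from Python.
# Hive compiles the HQL into MapReduce jobs under the hood, which is why
# results come back with high latency compared to a traditional database.
import subprocess

hql = "SELECT status, COUNT(*) FROM web_logs GROUP BY status;"   # hypothetical table
result = subprocess.run(["hive", "-e", hql], capture_output=True, text=True)
print(result.stdout)
```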
Component | Description | Counterpart
HBase | NoSQL database | Cassandra, MongoDB, DynamoDB
Category: Database (logical storage)

There are 2 types of storage (physical & logical). For example, MSSQL stores its actual data in a .mdf file located on the D drive: logically the data is stored inside the .mdf file, whereas physically the .mdf file itself is stored on the D drive (hard disk). In even more layman terms, a video is a sequence of images (that’s why we talk about frames per second (fps)): a picture is stored in the video, but the video itself is stored on the hard disk.

Component | Description | Counterpart
ZooKeeper | Coordination and synchronization tool | –
Oozie | Job workflow scheduler | –
Ambari | Maintains and monitors the cluster; can create nodes | Cloudera Manager
Category: Administration tools

Problem in Big Data Analytics

  • Hardware failure – overcome by HDFS replication (see the sketch below)
  • Combining the data after it has been distributed (an often-forgotten step) – overcome by MapReduce
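As a hedged illustration of the HDFS point (the path /data/sales.csv is hypothetical, and this assumes the hdfs command is on the PATH), the block report below is where you can actually see the replicas that make hardware failure survivable:

```python
# Hedged sketch: asking HDFS where a file's blocks and replicas live.
# `hdfs fsck <path> -files -blocks -locations` prints each block and the
# datanodes holding its replicas - the replication that protects against
# node failure.
import subprocess

report = subprocess.run(
    ["hdfs", "fsck", "/data/sales.csv", "-files", "-blocks", "-locations"],
    capture_output=True, text=True,
)
print(report.stdout)
```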

Tutorial 2

P.S. Edited for errors in the forms.
Ans 5. Protection against hardware failure. (It replicates the data on 3 nodes, so if any node fails, the jobs running on that node are automatically redirected to one of its replicas.)
Ans 6. Since the data is replicated in Hadoop, does that mean any calculation done on one node will also be replicated on the other two?
No. The calculation is done on only one node, not on its replicas.

Tutorial 3

Read the given article on the characteristics of big data and explore the new characteristics identified to handle big data efficiently, with suitable justifications. https://www.researchgate.net/publication/315867458_A_study_of_big_data_characteristics
Tabulate your findings against each of the new characteristics.

Example Answer
Based on the given research, there are 3 new big data characteristics suggested: Verbosity, Voluntariness, and Versatility.
In my opinion, Verbosity actually has a similar concept to Value, Vagueness, and Veracity in terms of measuring the quality of the data. Verbosity means redundancy of the information available in different sources, which leads to less useful data (Value), unclear data (Vagueness), and inconsistent or inaccurate data (Veracity).
Meanwhile, Voluntariness can be considered a new big data characteristic, as it means the presence of the data whenever it is needed in an organization.
Similarly, Versatility can also be regarded as a new characteristic, as the purpose of the data might differ depending on the context in which it is used.

My Try
The article suggests 3 new characteristics (Verbosity, Voluntariness, Versatility) to be considered for handling big data effectively.

The Verbosity characteristic the article highlights is about redundancy of information (wrong, out-of-date, or incomplete). I would raise a different opinion: in a world where information flows freely, it is hard to identify correct or wrong information, let alone incomplete data. Until the data is processed, it is unwise to say whether the data itself is correct or wrong. I would rather refer back to 1) Volume, whose purpose, as the article mentions, is just about saving storage and processing; 2) Veracity, assessed after data capture; 3) Validity, the method of processing data for result correctness; and 4) Volatility, which defines how long is long enough before data drift causes bad analysis. Apart from that, log analysis does rely on "bad" data to find anomalies.

For Voluntariness, I would agree, as nowadays laws and regulations already show high concern about how data should be collected; acts like the Personal Data Protection Act (PDPA) penalize any entity that violates them. Although a data collector has the capability to collect data and make it available to the public, or to be used in a specific context, it must go through a highly regulated process. We have already seen the backfire of freely available data: the documentary “The Great Hack” shows how big data could psychologically influence people to make biased decisions.

Lastly, for Versatility, I would agree, as we never know how data will behave in a different context. In “Weather-Induced Mood, Institutional Investors, and Stock Returns”, weather conditions show a significant correlation with stock prices. Who knows, maybe Netflix watch history could correlate with health issues?
