Home Picture

Database Management: Hadoop Ecosystem

08 Aug 2020 |

Categories: Database

Hadoop Ecosystem

Hadoop

When we talk about big data, we often think of Hadoop. Why? Hadoop is a system for storing data, processing data and managing data. It can ingest data from different platform or resources. In the big data world, sometimes, the traditional DBMSs could not meet their reporting requirements. Companies may require to shorten the time to process a significant amount of data. Increasing more storage or computer power could be another benefit using Hadoop.

Hadoop consists of three basic parts:

Hadoop implements cluster management in the system. It stores data and process data in clusters. Each cluster contains a master node and worker nodes. The master node runs the Apache Yarn as Resource Manager. The worker node runs Apache Yarn as Node Manager to getting the information of CPU, memory, disk and network usage to the Resource Manager. The Resource Manager decides where to direct the new tasks based on the current workloads reported from Node Manager with the Scheduler and ApplicationsManager.

Hadoop Data are:

The example of hadoop data are web logs, sales transactions, event records, sensor data…etc


The Hadoop Philosophy

This is the anti-pattern of relational modeling!


How Does Hadoop Differ From Relational?

Relational Hadoop
  • 1. Schema on write; need a table
  • 2. Fast reads
  • 3. Highly structured data
  • 4. Declarative data processing(SQL)
  • 5. Good for ACID transactions, business data
  • 1. Schema on read; schema applied when data are read
  • 2. Fast writes
  • 3. Loosely structured data; write the data “as they are” to HDFS
  • 4. Declarative and procedural data processing
  • 5. Good for logs, data streams, unstructured data discovery

Note: ACID (Atomicity, Consistency, Isolation, Durability)


MapReduce

MapReduce is a processing technique and a program model for distributed computing based on java. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

MapReduce process


HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

HDFS is reponsible for:


Hadoop Ecosystem in Action

Hadoop Ecosystem in Action


Options for Data Ingestion

Top