Monday, October 13, 2014

Introduction to Apache Hadoop and Its Components

Overview:

 In this course, you will learn more about Apache Hadoop, its different components, the design considerations behind building this platform, how the platform handles fault-tolerance, important configuration files, types of Hadoop clusters, and best practices followed during implementations.

Objectives:

  • Describe Apache Hadoop and its main components.
  • Recognize five main design considerations for any distributed platform.
  • Identify the services responsible for HDFS and MapReduce functionality.
  • Identify the way Hadoop offers fault-tolerance for HDFS and MapReduce.
  • Identify five important Hadoop Services.
  • Describe the process by which Hadoop breaks the files in chunks and replicates them across the cluster.
  • List important port numbers.
  • Discuss Yet Another Resource Negotiator (YARN).
  • Identify the best practices to be followed while designing a Hadoop Cluster.

Do You Know?

  • Hadoop breaks one large file into multiple chunks of 64 MB (by default) and replicates them across the cluster.
  • Hadoop can also be used as a storage platform. 
  • Hadoop can handle the failure of multiple machines without disrupting a running job.
  • You can configure your Hadoop cluster to use either MRv1.0 or MRv2.0, and you also have a choice of selecting a different scheduler based on your requirements.
  • You can run entire Hadoop cluster on a single machine.
  • Hadoop clusters running on a set of physical machines perform better than on a set of virtual machines.

What is Apache Hadoop:

  • Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. 
  • It enables applications to work with thousands of computationally independent computers and petabytes of data.
  • Hadoop was derived from Google's MapReduce and Google File System(GFS) papers.
  • It is written in Java and licensed under Apache V2 licence.
  • Hadoop = HDFS (Storage) + MapReduce (Processing)
    MapReduce: distributed programming framework.
    HDFS: Hadoop Distributed File System
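To make this split of responsibilities concrete, here is a minimal sketch of the classic word-count Mapper and Reducer, assuming the standard org.apache.hadoop.mapreduce Java API; the class names are illustrative and not part of Hadoop itself.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs close to the data stored in HDFS and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: aggregates the counts produced by all Mappers for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

HDFS stores and serves the input blocks; MapReduce schedules these Mapper and Reducer tasks across the cluster.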

Hadoop - Where does it sit:

Hadoop is a Java-based framework for large-scale processing distributed across thousands of nodes.
Hadoop sits right on top of native operating systems such as Linux, but it can also run on the Microsoft platform (for example, through the Hortonworks distribution). 
It consists of two main components: MapReduce, which manages the processing requirements, and HDFS (Hadoop Distributed File System), which handles the storage requirements.

Hadoop Design Considerations

  • A distributed platform for data intensive processing.
  • Platform which can scale linearly.
  • Security is not a prime consideration.
  • Built-in Fault-Tolerance for storage and processing
  • Availability of sufficient local storage
  • Write once Read Many (WORM)
  • Large streaming Reads
  • Hadoop is Rack-Aware
  • Nodes are bound to fail.

Distributed Platform Capabilities

Hadoop is designed with distributed requirements in mind. It provides five types of capabilities: Automatic, Transparent, Graceful, Recoverable, and Consistent.
  1. Automatic: No manual intervention is required in the system; a failed task is automatically rescheduled.
  2. Transparent: A failed task is automatically moved to another node.
  3. Graceful: The failure of one or more tasks does not impact the whole job.
  4. Recoverable: A node can re-join the cluster once it recovers from the failure.
  5. Consistent: Data corruption or node failure does not result in inconsistent output.

Hadoop Components

Hadoop runs five main services in MapReduce v1:
  • NameNode, 
  • Secondary NameNode, 
  • Job Tracker, 
  • Task Tracker and 
  • Data Node.


NameNode, Secondary NameNode and DataNode are part of the HDFS component. The JobTracker and TaskTracker services are part of the MapReduce component.
The NameNode and JobTracker services usually run on the same node, but this depends on the cluster size. The TaskTracker and DataNode always run on the same node.
The Secondary NameNode usually runs on a separate node.

NameNode:

  • Holds the metadata information, such as filename, owner, permissions, number of blocks, block locations and so on.
  • Each file requires ~100-200 bytes of metadata information.
  • The "fsimage" file is read whenever the NameNode daemon starts.
  • Changes in metadata are first written to main memory and then to the "edits" log file.
  • The edits log is applied to the fsimage file on a regular basis.
  • Use a machine with ample memory, and RAID-1 for the OS disks and the image disk (i.e., fsimage).
  • Single point of Failure (SPOF)
  • Daemon: hadoop-hdfs-namenode.
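As a hedged illustration of the kind of metadata the NameNode serves, the sketch below uses the standard Hadoop FileSystem client API to read a file's status; the path /user/demo/sample.txt is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowMetadata {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath (e.g. /etc/hadoop).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Every field printed here is answered from the NameNode's in-memory metadata.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
        System.out.println("owner       : " + status.getOwner());
        System.out.println("permissions : " + status.getPermission());
        System.out.println("block size  : " + status.getBlockSize());
        System.out.println("replication : " + status.getReplication());
    }
}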

Secondary NameNode:

  • Applying the "edits" (changes) can be an expensive task, so this activity is off-loaded to the Secondary NameNode.
  • The Secondary NameNode is not a high-availability or fail-over NameNode.
  • It runs on a separate machine. However, for clusters with <20 nodes, it may run on the same node as the NameNode.
  • Daemon: hadoop-hdfs-secondarynamenode.
Recommendation: 
The machine configuration should be the same as the NameNode's, so that in case of an emergency it can be used as a NameNode.

DataNode:

  • Holds the actual data blocks
  • The replication factor decides the number of copies required for a data block.
  • Default block size: 64 MB (recommended: 128 MB for a moderate workload).
  • When a block is written, the DataNode also computes a checksum and stores it for comparison during read operations.
  • Verification: every 3 weeks (default) after the block was created.
  • The DataNode sends a heartbeat to the NameNode every 3 seconds.
  • Daemon: hadoop-hdfs-datanode.
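The following is a small sketch, assuming the standard FileSystem.create() overload, of how a client can choose the replication factor and block size when writing a file; the path and values are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Illustrative values: 3 replicas, 128 MB blocks (the recommendation above).
        short replication = 3;
        long blockSize = 128L * 1024 * 1024;

        FSDataOutputStream out = fs.create(
                new Path("/user/demo/big-file.dat"),  // hypothetical path
                true,                                  // overwrite
                4096,                                  // I/O buffer size
                replication,
                blockSize);
        out.writeUTF("each block written through this stream is checksummed by the DataNode");
        out.close();
    }
}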

Job Tracker:

  • It runs on a separate machine. However, for clusters with <20 nodes, it may run on the same node as the NameNode.
  • Scheduler (FIFO, Fair Scheduler).
  • Assigns tasks to different DataNodes and tracks their progress.
  • Keeps track of task failures and redirects tasks to other nodes.
  • Daemon: hadoop-0.20-mapreduce-jobtracker.

Task Tracker:

  • Runs on every DataNode.
  • It can run either a Mapper task or a Reducer task.
  • The task process sends heartbeats to the TaskTracker. A gap of 10 minutes forces the TaskTracker to kill the JVM and notify the JobTracker.
  • The JobTracker re-runs the task on another DataNode. Any task failing 4 times leads to job failure.
  • Any TaskTracker reporting multiple failures is blacklisted.
  • Daemon: hadoop-0.20-mapreduce-tasktracker.

Web Interface Ports for Users:

HDFS: 
NameNode: 50070
DataNode: 50075
Secondary NameNode: 50090
Backup/Checkpoint Node: 50105

MapReduce:
JobTracker: 50030, TaskTracker: 50060

Hadoop Cluster - Data Upload

Data is uploaded in chunks of 64 MB by default, and each chunk gets replicated across multiple DataNodes (a minimum of 3 copies, defined by the replication factor).


Suppose a user uploads a file of 150 MB. It gets broken into three parts: two parts of 64 MB each and a remaining part of 22 MB.
The first two parts of 64 MB can be placed on any machine. 
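The block arithmetic behind this example can be sketched in plain Java; the sizes mirror the 150 MB upload described above.

public class BlockMath {
    public static void main(String[] args) {
        long fileSize  = 150L * 1024 * 1024;  // 150 MB upload
        long blockSize =  64L * 1024 * 1024;  // default block size

        long fullBlocks  = fileSize / blockSize;                   // 2 blocks of 64 MB
        long remainderMb = (fileSize % blockSize) / (1024 * 1024); // 22 MB left over
        long totalBlocks = fullBlocks + (remainderMb > 0 ? 1 : 0);

        // Prints: 3 blocks = 2 x 64 MB + 22 MB
        System.out.println(totalBlocks + " blocks = " + fullBlocks + " x 64 MB + " + remainderMb + " MB");
    }
}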

Hadoop - Fault Tolerance 

You will see how HDFS handles data corruption or DataNode failure.



The data uploaded by the user gets replicated across three machines.
The red block is replicated on DataNodes 2, 5 and 7. Now if DataNode 2 fails, the job continues to run, since there are two more nodes (DataNode 5 and DataNode 7) where the data still exists.
DataNode 2 also holds a copy of the purple block, which is also available on DataNode 3 and DataNode 6. 

How are Job and Tasks related:

  • The smallest unit of work is a Task.
  • A set of Tasks makes up a Job.
  • A Task may be a Mapper or a Reducer task.
  • A Job consists of a mapper, a reducer, and input data.
  • For example, a 150 MB file will have 3 blocks (input splits).
  • The number of Mapper tasks is driven by the number of input splits, so we will require three Mapper tasks.
  • Each Mapper task reads and processes one record at a time.
  • The number of Reducers is defined by the developer (see the driver sketch below).
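A minimal driver sketch, assuming the org.apache.hadoop.mapreduce.Job API and reusing the hypothetical WordCountMapper/WordCountReducer classes from the earlier sketch, shows where the developer fixes the Reducer count while the Mapper count simply follows from the input splits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);    // hypothetical classes from the earlier sketch
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The Mapper count is derived from the input splits (3 for the 150 MB example above);
        // the Reducer count is a developer decision.
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}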

MapReduce - Fault Tolerance

How MapReduce provides Fault Tolerance. 


A user submits a job that requires 3 Map tasks. While the job is running, if DataNode-2, which holds the purple block, fails, the TaskTracker on that DataNode stops reporting back and the JobTracker is notified of the failure.
The JobTracker connects back with the NameNode to figure out which of the DataNodes holds a copy of the purple block.
The NameNode gets back with DataNode-6 as the alternate DataNode. This information helps the JobTracker re-launch the task on DataNode-6.
This entire process works without any user intervention.

What's New in MRv2.0?

MapReduce v1 was not designed with JobTracker scalability in mind.

Besides, there was no high-availability feature for the JobTracker service, which made it a single point of failure.



The figure shows the newer MapReduce v2 architecture. 
The JobTracker and TaskTracker services are replaced with the ResourceManager and NodeManager services, respectively.

How does this work?
For every job submitted to the cluster, one of the NodeManagers is designated as a temporary ApplicationMaster, which manages the life cycle of that particular job in terms of resource requirements.
Once the job is complete, the role is relinquished.


Hadoop Configuration Files

There are three configuration files which control the behavior of the cluster:
1. core-site.xml: Holds all parameters that are applicable cluster-wide, for example, the NameNode details.
2. hdfs-site.xml: Parameters defined in this file control the behavior of HDFS. It holds all parameters that affect HDFS functionality, such as the replication factor, permissions, folders holding critical information and so on.
3. mapred-site.xml: Parameters defined in this file control the MapReduce behavior. It holds all parameters that affect MapReduce functionality, such as the JobTracker details, folders holding critical information, maximum mappers allowed, maximum reducers allowed and so on.

Default location of these files is: /etc/hadoop 
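As a hedged sketch of how these files surface inside a client program, the snippet below reads a few well-known property names through org.apache.hadoop.conf.Configuration; which values are actually set depends on the cluster, so treat the output as illustrative.

import org.apache.hadoop.conf.Configuration;

public class ShowClusterConfig {
    public static void main(String[] args) {
        // core-site.xml is loaded automatically from the classpath (typically /etc/hadoop);
        // the other two files are added explicitly here.
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        conf.addResource("mapred-site.xml");

        System.out.println("NameNode URI : " + conf.get("fs.defaultFS"));       // core-site.xml
        System.out.println("Replication  : " + conf.get("dfs.replication"));    // hdfs-site.xml
        System.out.println("JobTracker   : " + conf.get("mapred.job.tracker")); // mapred-site.xml
    }
}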

Hadoop Cluster Deployment Modes

There are three modes in which a Hadoop cluster can be configured.
  • Local Job Runner Mode: 
    • No daemons run
    • All Hadoop components run within a single JVM
    • No HDFS
    • Good enough for testing purposes
  • Pseudo-Distributed Mode: 
    • All daemons run on the local machine, but each runs in its own JVM
    • Uses HDFS for data storage
    • Useful to simulate a Hadoop cluster before deploying a program on an actual cluster
  • Fully Distributed Mode: 
    • Each Hadoop component runs on a separate machine
    • HDFS is used to distribute the data across the DataNodes

Hadoop Cluster Sizes

Small(Development/Testing): Number of Nodes > 5 and <20
Medium/Moderate (Small and Medium Businesses): Number of Nodes >20 and < 250
Large (Large Enterprises/Enterprises into Analytics): Number of Nodes > 250 and < 1000
Extra Large (Scientific Research): Number of Nodes > 1000

Hadoop Cluster Type 

  • Basic Cluster: 
    • Namenode, Secondary NameNode, DataNode, JobTracker, Task Tracker(s).
  • High Available Cluster:
    • HA NameNode, HA JobTracker, DataNode, Task Tracker(s).
  • Federated Cluster:
    • Namenode(s), DataNode(s), JobTracker, Task Tracker(s).
  • Federated Cluster with HA:
    • Namenode(s), StandbyNameNode*(s), DataNode(s), JobTracker, Task Tracker(s).

Hadoop Cluster - Deployment Architecture



The image shows a deployed Hadoop cluster on a rack of servers.
Data gets uploaded and replicated across the DataNodes. 
The cluster manager machine is used to manage the cluster and perform all administrative activities.
The JobTracker (MapReduce v1) or the ResourceManager (MapReduce v2) machine is used as the job scheduler for this cluster.
NameNode machine holds the metadata information of the uploaded data. 
All these machines can be part of Rack servers but we have shown them separately. 
The user interacts with the cluster through the Client Node, also known as the Edge Node.

Recommendations and Best Practices

  • Use Server-Class Hardware for NameNode (NN), secondary NameNode (SNN) and JobTracker (JT)
  • Use Enterprise Grade OS for NN, SNN and JT.
  • Use Fedora/Ubuntu/SUSE/CentOS for DataNodes
  • Avoid virtualization
  • Avoid Blade Chassis
  • Avoid Over-Subscription of Top-of the Rack and Core-Switches
  • Avoid LVM on DataNodes.
  • Avoid RAID implementation on DataNodes.
  • Hadoop has no specific file system requirement, such as, ext3/ext4.
  • Reduce "swappiness" on the DataNodes (i.e., vm.swappiness)
  • Enable Hyper-Threading on DataNodes.
  • Prefer more disks over capacity and speed
  • For SATA disks, ensure IDE simulation is not enabled
  • Disable IPv6 and SE Linux
  • Enable NTP among all nodes.

Friday, October 10, 2014

What is NoSQL

NoSQL is not the name of any particular database instead it refers to a broad class of non-relational databases that differ from classical relational database management systems (RDBMS) in some significant aspects, most notably because they do not use SQL as their primary query language, instead providing access by means of Application Programming Interfaces (API).

NoSQL databases and data-processing frameworks are primarily utilized because of their speed, scalability and flexibility. Adoption of NoSQL in the enterprise level, however, is still emerging. Some consider it the absolute apogee of achievement, while others maintain it at the peak of the Inflated Expectations Phase of Gartner’s Hype Cycle, used to characterize the over-enthusiasm or “hype” and subsequent disappointment that typically happen with the introduction of new technologies. Still others relegate it to an inferior and inconspicuous position in favor of columnar relational databases such as Sybase IQ or Oracle 11g.

NoSQL...can be considered "Internet age" databases that are being used by Amazon, Facebook, Google and the like to address performance and scalability requirements that cannot be met by traditional relational databases.

Features of NoSQL databases

One major difference between traditional relational databases and NoSQL is that the latter do not generally provide guarantees for atomicity, consistency, isolation and durability (commonly known as the ACID properties), although some support is beginning to emerge. Instead of ACID, NoSQL databases more or less follow something called "BASE". We will discuss this in more detail later in the article.
ACID is comprised of a set of properties that guarantees that database transactions are processed reliably. To know more about ACID, read What is a database? A question for both pro and newbie
The other major difference is that NoSQL databases are generally schema-less - that is, records in these databases are not required to conform to a pre-defined storage schema.
In a relational database, the schema is the structure of a database system, described in a formal language supported by the DBMS, and refers to how the database will be constructed and divided into database objects such as tables, fields, relationships, views, indexes, packages, procedures, functions, queues, triggers and other elements.
In NoSQL databases, schema-free collections are utilized instead, so that different types and document structures such as {“color”, “blue”} and {“price”, “23.5”} can be stored within a single collection.
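As a purely illustrative sketch in plain Java (not any particular NoSQL product), two documents with completely different shapes can sit side by side in one collection:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaLessCollection {
    public static void main(String[] args) {
        List<Map<String, Object>> collection = new ArrayList<>();

        // Two documents with different fields in the same collection - no schema enforced.
        Map<String, Object> doc1 = new LinkedHashMap<>();
        doc1.put("color", "blue");

        Map<String, Object> doc2 = new LinkedHashMap<>();
        doc2.put("price", 23.5);
        doc2.put("currency", "USD");

        collection.add(doc1);
        collection.add(doc2);

        System.out.println(collection); // [{color=blue}, {price=23.5, currency=USD}]
    }
}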
The major characteristic features of NoSQL databases are listed below:
  • Schema-less: "Tables" don't have a pre-defined schema. Records have a variable number of fields that can vary from record to record. Record contents and semantics are enforced by applications.
  • Shared-nothing architecture: Instead of using a common storage pool (e.g., SAN), each server uses only its own local storage. This allows storage to be accessed at local disk speeds instead of network speeds, and it allows capacity to be increased by adding more nodes. Cost is also reduced since commodity hardware can be used.
  • Elasticity: Both storage and server capacity can be added on-the-fly by merely adding more servers. No downtime is required. When a new node is added, the database begins giving it something to do and requests to fulfill.
  • Sharding: Instead of viewing the storage as a monolithic space, records are partitioned into shards. Usually, a shard is small enough to be managed by a single server, though shards are usually replicated. Sharding can be automatic (e.g., an existing shard splits when it gets too big), or applications can assist in data sharding by assigning each record a partition ID.
  • Asynchronous replication: Compared to RAID storage (mirroring and/or striping) or synchronous replication, NoSQL databases employ asynchronous replication. This allows writes to complete more quickly since they don't depend on extra network traffic. One side effect of this strategy is that data is not immediately replicated and could be lost in certain windows. Also, locking is usually not available to protect all copies of a specific unit of data.
  • BASE instead of ACID: NoSQL databases emphasize performance and availability. This requires prioritizing the components of the CAP theorem (described elsewhere), which tends to make true ACID transactions implausible.

Types of NoSQL databases

NoSQL database systems were brought into being by some of the major internet players such as Google, Facebook, LinkedIn and others, which had significantly different challenges in dealing with data than those addressed by traditional RDBMS solutions. There was a need to extract information out of large volumes of data that, to a greater or lesser degree, adhered to similar horizontal structures. These companies realized that performance and real-time behavior were more important than consistency, to which much of the processing time in a traditional RDBMS had been devoted.
As such, NoSQL databases are often highly optimized for retrieve and append operations and often offer little functionality beyond record storage. The reduced run-time flexibility compared to full SQL systems is counterbalanced by significant gains in scalability and performance for certain data models. NoSQL databases demonstrate their strengths above all with regard to the flexible handling of variable data by document-oriented databases, in the representation of relationships by graph databases and in the reduction of a database to a container with key-value pairs provided by key-value databases.
Consequently, NoSQL databases are often categorized according to the way they store data and fall under the following major categories:
  • Key-value stores
  • Columnar (or column-oriented) databases
  • Graph databases
  • Document databases

Key-value stores

Key-value stores allow the application to store its data as schema-less (key, value) pairs. The data can be stored in hash-table-like datatypes of a programming language, so that each value can be accessed by its key. Although such storage might not be very efficient - since it provides only a single way to access the values - it eliminates the need for a fixed data model.
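A toy sketch of the idea, written here in plain Java rather than any real key-value database, makes the trade-off visible: the only access path is a lookup by key.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// A minimal in-process key-value store: values are opaque byte arrays,
// and the single way to reach a value is through its key.
public class TinyKeyValueStore {
    private final ConcurrentMap<String, byte[]> data = new ConcurrentHashMap<>();

    public void put(String key, byte[] value) { data.put(key, value); }

    public byte[] get(String key) { return data.get(key); }

    public static void main(String[] args) {
        TinyKeyValueStore store = new TinyKeyValueStore();
        store.put("user:42", "{\"name\":\"Ada\"}".getBytes());
        System.out.println(new String(store.get("user:42"))); // {"name":"Ada"}
    }
}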

Columnar databases

A column-oriented DBMS stores its content by column rather than by row. It contains predefined families of columns and is more accomplished at scaling and updating at relatively high speeds, which offers advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items.

Graph databases

Graph databases optimize the storage of networks – or “Graphs“ – of related nodal data as a single logical unit. A graph database uses graph structures with nodes, edges and properties to represent and store data and provides index-free adjacency, meaning that every element contains a direct pointer to its adjacent element and no index lookups are necessary. This can be useful in cases of finding degrees of separation where SQL would require extremely complex queries. A popular movie service, for example, shows the logged-in user a “Best Guess for You” rating for each film based on how similar people rated it, while other services such as LinkedIn, Facebook or Netflix show people in a network at various degrees of separation. Although such queries become simple in Graph databases, the relevance of this technology in a financial enterprise is difficult to determine.
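A toy sketch of index-free adjacency, in plain Java rather than any real graph database: each node holds direct references to its neighbours, so "degrees of separation" becomes a breadth-first walk instead of a chain of joins.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class TinyGraph {
    static final class Node {
        final String name;
        final List<Node> friends = new ArrayList<>();
        Node(String name) { this.name = name; }
        void connect(Node other) { friends.add(other); other.friends.add(this); }
    }

    // Breadth-first search over direct references; no index lookups involved.
    static int degreesOfSeparation(Node from, Node to) {
        Set<Node> seen = new HashSet<>();
        Queue<Node> frontier = new LinkedList<>();
        frontier.add(from);
        seen.add(from);
        for (int depth = 0; !frontier.isEmpty(); depth++) {
            Queue<Node> next = new LinkedList<>();
            for (Node n : frontier) {
                if (n == to) return depth;
                for (Node friend : n.friends) {
                    if (seen.add(friend)) next.add(friend);
                }
            }
            frontier = next;
        }
        return -1; // not connected
    }

    public static void main(String[] args) {
        Node alice = new Node("Alice"), bob = new Node("Bob"), carol = new Node("Carol");
        alice.connect(bob);
        bob.connect(carol);
        System.out.println(degreesOfSeparation(alice, carol)); // 2
    }
}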

Document databases

Document stores are used for large, unstructured or semi-structured records. Data is organized in documents that can contain any number of fields of any length. All document-oriented database implementations assume documents encapsulate and encode data in some sort of standard formats – known as encodings – and are ideal for MS Office or PDF documents. Document databases should not be confused with Document Management Systems, however. The documents referred to are not actual documents as such, although they can be.

Documents inside a document-oriented database are similar in some ways to records or rows in relational databases, but they are less rigid because they are not required to adhere to a standard schema. Unlike a relational database, where each record would have the same set of fields and unused fields might be kept empty, there are no empty fields in document records. This system allows new information to be added to or removed from any record without wasting space by creating empty fields on all other records.

In contrast to key-value and columnar databases, which view each record as a list of attributes which are updated one at a time, document stores allow insertion, updates and queries of entire records using a JavaScript Object Notation (JSON) format. The concept of a join is less relevant in document databases than in traditional RDBMS systems. As a result, records that might be joined in a traditional RDBMS are generally denormalized into wide records. Denormalization refers to a process by which the read-performance of a database is optimized by the addition of redundant or grouped data. Some of the NoSQL vendors, most notably MongoDB, do in fact feature add-on join capabilities as well.

Many of these database categories are beginning to blur, however. As all of them support the association of values with keys, they are therefore all fundamentally key-value stores; document databases, moreover, can perform all of the capabilities of columnar databases from a semantic point of view. As a result, the distinguishing factors must be evaluated in terms of performance and ease of use for a particular solution.

Popular incarnations of NoSQL databases

Most implemented solutions cannot be strictly assigned to a specific type and contain features from two or more categories. We should also recognize that each NoSQL implementation has its own special nuances. Popular offerings include the following:

Apache Cassandra

Apache Cassandra is an open-source, distributed database-management system designed to handle very large amounts of data spread out across many commodity servers while providing a high degree of service availability with no single point of failure. It is particularly fast at write operations as opposed to reads and might therefore lend itself best to applications that require analysis of large sets of data with write-backs.

HBase

HBase is also an open-source, distributed database modeled after Google’s BigTable. HBase technologies are not strictly a data-store, but generally work closely with a NoSQL database to accomplish highly scalable analyses. HBase scales linearly with the number of nodes and can quickly return queries on tables consisting of billions of rows and millions of columns.

BigTable

BigTable can be defined as a sparse, distributed, multi-dimensional sorted map. BigTable is designed to scale into the petabyte range – a petabyte is equivalent to 1 million gigabytes - across hundreds or thousands of machines and to make it easy to add more machines to the system and start taking advantage of those resources automatically without any reconfiguration.

Coherence and Ehcache

Coherence and Ehcache are equipped with In-Memory caches. Coherence is in heavy use in financial industries where network latency – defined as the time it takes to cross a network connection from sender to receiver - is a factor.

NoSQL versus relational columnar databases – Is NoSql right for you?

Relational columnar databases such as SybaseIQ continue to use a relational model and are accessed via traditional SQL. The physical storage structure is very different when compared to non-relational NoSQL columnar stores, which store data as rows whose structure may vary and are organized by the developer into families of columns according to the application use case.
Relational columnar databases, on the other hand, require a fixed schema with each column physically distinct from the others, which makes it impossible to declaratively optimize retrievals by organizing logical units or families. Because a NoSQL database retrieval can specify one or more column families while ignoring others, NoSQL databases can offer a significant advantage when performing individual row queries. NoSQL databases cannot meet the performance characteristics of relational columnar databases when it comes to retrieving aggregated results from groups of underlying records, however.
This distinction is a litmus test when deciding between NoSQL and traditional SQL databases. NoSQL databases are not as flexible and are exceptional at speedily returning individual rows from a query. Traditional SQL databases, on the other hand, forfeit some storage capacity and scalability but provide extra flexibility with a standard, more familiar SQL interface.
Since relational databases must adhere to a schema, they typically need to reserve space even for unused columns. NoSQL databases have a dense per-row schema and so tend to be better at optimizing the storage of sparse data, although the relational databases often use sophisticated storage-optimization techniques to mitigate this perceived shortcoming.
Most importantly, relational columnar databases are generally intended for the read-only access found in conjunction with data warehouses, which provide data that was loaded collectively from conventional data stores. This can be contrasted with NoSQL columnar tables, which can handle a much higher rate of updates.

The CAP Theorem

Despite the high demand in recent years for massively distributed databases with high partition fault-tolerance, the CAP theorem stipulates that it is actually impossible for a distributed system to provide consistency, availability and partition fault-tolerance guarantees simultaneously; a distributed system can satisfy at most any two of these guarantees at the same time, but not all three. These guarantees can be understood as follows:
Consistency – Concurrently executing queries see the same valid and consistent data at the same time.
Availability – This is a guarantee that every request receives a response about whether it succeeded or failed.
Partition-tolerance – Also known as fault-tolerance, this is a guarantee that the system continues to operate despite arbitrary message loss.
Because no distributed system is capable of satisfying all three guarantees at the same time, a tradeoff must be made. While traditional databases make that decision for us, NoSQL databases provide these guarantees as tuning options. Database vendors must always decide which two to prioritize. The options are as follows:
Availability is compromised in favor of consistency and partition-tolerance.
Partition-tolerance is forfeited in favor of consistency and availability.
Consistency is compromised but systems are always available and can work when parts are partitioned.
Traditional SQL databases place a high priority on consistency and fault-tolerance and have generally as a result chosen to go with the first option above and forfeit high availability. NoSQL databases frequently leave that decision to the application operations team and provide configuration options so that the preferred options can be chosen based on the application use case.

Concepts of BASE - Basically Available, Soft-state, Eventually consistent

Sometimes, however, perfect consistency is not a requirement and “eventual consistency” will suffice. Consequently, many NoSQL databases use eventual consistency to provide both availability and partition-tolerance guarantees with a maximum level of data consistency. In contrast to immediate consistency, which guarantees that updates are immediately visible to all when an update operation returns to the user with a successful result, eventual consistency means that, given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent.
In database terminology, this is known as “Basically Available Soft-state Eventually” (BASE) consistent as opposed to the database concept of ACID. No doubt the juxtaposition of the terms ACID and BASE was more than a mere coincidence.
Apache CouchDB, for example, uses a versioning system similar to software version control systems such as Subversion (SVN). An update to a record does not overwrite the old value, but rather creates a new version of that record. If two clients are operating on the same record and client A updates the record before client B, then client B will be notified that the version being modified is out of date and will have the option to requery the revised record and make the change there in a manner similar to an “update and merge” operation in SVN.
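A minimal sketch of this versioning idea, in plain Java and deliberately not CouchDB's actual API: each value carries a revision number, and a writer presenting a stale revision is rejected so it can re-read and merge, much like the "update and merge" flow described above.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class VersionedStore {
    static final class Versioned {
        final long revision;
        final String value;
        Versioned(long revision, String value) { this.revision = revision; this.value = value; }
    }

    private final ConcurrentMap<String, Versioned> data = new ConcurrentHashMap<>();

    public Versioned read(String key) { return data.get(key); }

    // Returns false when another writer got there first; the caller must re-read and merge.
    public boolean update(String key, long expectedRevision, String newValue) {
        Versioned current = data.get(key);
        if (current == null) {
            return expectedRevision == 0
                    && data.putIfAbsent(key, new Versioned(1, newValue)) == null;
        }
        if (current.revision != expectedRevision) {
            return false; // stale revision
        }
        return data.replace(key, current, new Versioned(current.revision + 1, newValue));
    }
}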
In order to use NoSQL databases at the present time, an understanding of the API language is required and queries must be written in that language. This is, however, greatly facilitated by the fact that Java is supported in every case. Work has also been done recently to create a unified NoSQL language called Unstructured Query Language (UNQL), which is semantically a superset of SQL Data Manipulation Language (DML). There is also an Apache incubator project called Thrift which involves an interface-definition language particularly well-suited to NoSQL use cases. Thrift is reminiscent of CORBA IDL and provides a means by which language-specific interfaces can be generated for most popular languages. Originally developed at Facebook, it has been shared as an open-source project since 2007.

MongoDB explain() Query Analyzer and its Verbosity

First creating 1 million documents: > for(i=0; i<100; i++) { for(j=0; j<100; j++) {x = []; for(k=0; k<100; k++) { x.push({a:...