5 Reasons When to Use and When Not to Use Hadoop

Hold on! Wait a minute and think before you join the race and become a Hadoop maniac. Hadoop has been the buzzword in the IT industry for some time now. Everyone seems to be in a rush to learn, implement and adopt Hadoop. And why should they not? The IT industry is all about change, and you would not like to be left behind while others leverage Hadoop. However, just learning Hadoop is not enough. What most people overlook is, in my view, the most important aspect: when to use Hadoop and when not to use it.

In this blog you will see various scenarios where using Hadoop directly is not the best choice, but where it can still be of benefit when used in industry-accepted ways. You will also see scenarios where Hadoop should be the first choice. Your time is too valuable for me to waste, so let us get straight to the subject.

First, let us look at the situations in which Hadoop should not be used directly.

When Not To Use Hadoop

# 1. Real Time Analytics

If you want to do real-time analytics, where you expect results quickly, Hadoop should not be used directly. Hadoop MapReduce works on batch processing, so its response time is high.

The diagram below explains how processing is done using MapReduce in Hadoop.
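To make the batch flow concrete, here is a minimal sketch of the three MapReduce stages (map, shuffle, reduce) in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are made up for this illustration. The point is that the reduce step cannot start until every input line has been mapped and grouped, which is why a batch job does not give instant answers.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse the list of values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three stages one after another over the whole input (a batch)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data in batches"]
    print(word_count(sample))
```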

Real Time Analytics – Industry Accepted Way

Since Hadoop MapReduce is not suited to real-time analytics, people explored and developed a new way to keep the strength of Hadoop (HDFS) and still make the processing fast. The industry accepted way is to store the Big Data in HDFS and run Spark on top of it. With Spark, the processing can be done in near real time.

For the record, Spark is said to be up to 100 times faster than Hadoop MapReduce when the data fits in memory. Oh yes, I said 100 times faster; it is not a typo. The diagram below shows the comparison between MapReduce processing and processing using Spark.

MapReduce processing vs processing using Spark

I took a dataset and ran the same line-processing code on it, written once in MapReduce and once in Spark. The size of the dataset, the logic and the other conditions were kept the same for both technologies. The time taken by each was:

MapReduce – 14 sec
Spark – 0.6 sec

That is roughly a 23 times difference in this test, which is good. However, good is not good enough. To get the best performance out of Spark we have to take a few more measures, such as fine-tuning the cluster.

# 2. Not a Replacement for Existing Infrastructure

Hadoop is not a replacement for your existing data processing infrastructure. However, you can use Hadoop along with it.

Industry accepted way:

All the historical big data can be stored in Hadoop HDFS, where it can be processed and transformed into structured, manageable data. After processing the data in Hadoop, you send the output to relational database technologies for BI, decision support, reporting and so on.

The diagram below makes this clearer.

Word to the wise:

Hadoop is not going to replace your database, but your database isn’t likely to replace Hadoop either.
Different tools for different jobs, as simple as that.

# 3. Multiple Smaller Datasets

The Hadoop framework is not recommended for small, structured datasets, because other tools on the market, such as MS Excel or an RDBMS, can do this work more easily and faster than Hadoop. For small data analytics, Hadoop can be costlier than other tools.

Industry Accepted Way:

We are smart people. We always find a better way. In this case, since all the small files (for example, daily server logs) are of the same format and structure, and the processing to be done on them is the same, we can merge all the small files into one big file and then run our MapReduce program on it.

In order to prove the above theory, we carried out a small experiment.

The diagram below explains the same:

Merging smaller files into one big file

We took 9 files of x MB each. Since these files were small, we merged them into one big file with a total size of 9x MB. (Pretty simple math: 9 * x MB = 9x MB.)

Finally, we wrote a MapReduce program and ran it twice.
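The merge step itself can be sketched in a few lines of Python. This is only an illustration on the local file system, not the code used in the experiment: the function name `merge_files` and the sample log file names are made up, and it assumes every small file ends with a newline so that records from different files do not run together.

```python
import shutil
import tempfile
from pathlib import Path


def merge_files(input_paths, output_path):
    """Concatenate the input files, in the given order, into one output file.

    Returns the size of the merged file in bytes.
    Assumes each input file ends with a newline.
    """
    with open(output_path, "wb") as merged:
        for path in input_paths:
            with open(path, "rb") as part:
                # Stream the bytes across without loading a whole file in memory.
                shutil.copyfileobj(part, merged)
    return Path(output_path).stat().st_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        parts = []
        # Create 9 small, same-format files (hypothetical daily server logs).
        for day in range(1, 10):
            part = work / f"server-log-{day}.txt"
            part.write_text(f"day {day}: request served\n")
            parts.append(part)
        size = merge_files(parts, work / "merged.txt")
        print(f"merged {len(parts)} files into one file of {size} bytes")
```

The merged file would then be copied to HDFS and used as the single input of the job.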

First execution (input as small files):

Input data: 9 files of x MB each
Output: 4225284 records
Time taken: 10400 ms

Second execution (input as one big file):

Input data: 1 file of 9x MB
Output: 4225284 records
Time taken: 6140 ms

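As you can see, the second execution took less time than the first one, with the same output. Each small file typically gets its own map task, so fewer, larger files mean less per-task overhead. The result supports the point.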
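To put a number on the difference, the two timings reported above can be compared with a little arithmetic. The helper names `speedup` and `percent_saved` are made up for this sketch; the two inputs are the measured times from the experiment.

```python
def speedup(baseline_ms, improved_ms):
    """How many times faster the improved run is than the baseline run."""
    return baseline_ms / improved_ms


def percent_saved(baseline_ms, improved_ms):
    """Share of the baseline run time that the improved run saves, in percent."""
    return 100.0 * (baseline_ms - improved_ms) / baseline_ms


if __name__ == "__main__":
    small_files_ms = 10400  # first execution: 9 small files
    merged_file_ms = 6140   # second execution: 1 merged file
    print(f"speedup: {speedup(small_files_ms, merged_file_ms):.2f}x")           # 1.69x
    print(f"time saved: {percent_saved(small_files_ms, merged_file_ms):.1f}%")  # 41.0%
```

So in this single run, merging the files made the job about 1.7 times faster.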

# 4. Novice Hadoopers

Unless you have a solid understanding of the Hadoop framework, it is not advisable to use Hadoop in production. Hadoop is a technology which should come with a disclaimer: “Handle with care”. You should know it before you use it, or else you will end up like the kid below.

Learning Hadoop and its ecosystem tools, and deciding which technology suits your need, adds yet another level of complexity.

# 5. Where Security Is the Primary Concern

Many enterprises, especially those in highly regulated industries dealing with sensitive data, are not able to move as quickly as they would like towards implementing Big Data projects and Hadoop.

Industry Accepted Way:

There are multiple ways to ensure that your sensitive data is secure with the elephant (Hadoop).

Encrypt your data while moving it to Hadoop. You can write a MapReduce program that encrypts the data with an encryption algorithm of your choice and stores the result in HDFS.

You can then use the data for further MapReduce processing to get relevant insights.

The other way that I know and have used is running Apache Accumulo on top of Hadoop. Apache Accumulo is a sorted, distributed key/value store that provides robust, scalable, high-performance data storage and retrieval.

Ref: https://accumulo.apache.org/

What the definition does not mention is that Accumulo implements a security mechanism known as cell-level security, which makes it a good option where security is a concern.
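The idea behind cell-level security can be sketched in plain Python. This is a simplified model and not the Accumulo API: the `Cell` class, the `scan` function and the label names are all made up for this illustration, and a cell here simply lists the labels a reader must hold (all of them), whereas real visibility expressions are richer. What it shows is that access is decided per cell, not per table or per file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    # Simplified visibility: the reader must hold ALL of these labels.
    # An empty set means the cell is visible to everyone.
    required_labels: frozenset = frozenset()


def scan(cells, authorizations):
    """Return only the cells that the given authorizations are allowed to see."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if cell.required_labels <= auths]


if __name__ == "__main__":
    table = [
        Cell("patient-1", "name", "Asha"),
        Cell("patient-1", "diagnosis", "flu", frozenset({"doctor"})),
        Cell("patient-1", "billing", "paid", frozenset({"finance"})),
    ]
    # A doctor sees the public cell and the diagnosis, but not the billing cell.
    for cell in scan(table, {"doctor"}):
        print(cell.column, "=", cell.value)
```

Two readers scanning the same row get different views of it, depending on the labels they hold.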

When To Use Hadoop

# 1. Data Size and Data Diversity

When you are dealing with huge volumes of data coming from various sources and in a variety of formats then you can say that you are dealing with Big Data. In this case, Hadoop is the right technology for you.

# 2. Future Planning

It is all about getting ready for the challenges you may face in the future. If you anticipate Hadoop as a future need, you should plan accordingly. To implement Hadoop on your data, you should first understand the complexity of the data and the rate at which it is going to grow. So you need to plan your cluster. You may begin by building a small or medium cluster in your organization, sized for the data available at present (in GBs or a few TBs), and scale it up in the future as your data grows.
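A back-of-the-envelope version of this planning can be sketched in Python. Everything here is an assumption made for illustration: the function names, the 8 TB of usable disk per node, the 25% headroom and the growth figures are hypothetical, and only the replication factor of 3 reflects the HDFS default. Treat it as a starting point for your own numbers, not as a sizing rule.

```python
import math


def projected_tb(current_tb, monthly_growth_rate, months):
    """Data volume after compound growth, e.g. 0.05 means 5% per month."""
    return current_tb * (1 + monthly_growth_rate) ** months


def nodes_needed(raw_tb, replication=3, usable_tb_per_node=8.0, headroom=0.25):
    """Estimate the number of datanodes needed to hold raw_tb of data.

    replication: copies HDFS keeps of each block (3 is the HDFS default).
    usable_tb_per_node: hypothetical usable disk per datanode.
    headroom: hypothetical share of disk kept free for temporary data.
    """
    stored_tb = raw_tb * replication
    required_tb = stored_tb / (1 - headroom)
    return math.ceil(required_tb / usable_tb_per_node)


if __name__ == "__main__":
    today_tb = 2.0
    in_two_years_tb = projected_tb(today_tb, monthly_growth_rate=0.05, months=24)
    print(f"nodes needed today: {nodes_needed(today_tb)}")
    print(f"data in two years: {in_two_years_tb:.1f} TB")
    print(f"nodes needed in two years: {nodes_needed(in_two_years_tb)}")
```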

# 3. Multiple Frameworks for Big Data

There are various tools for various purposes. Hadoop can be integrated with multiple analytics tools to get the best out of it, such as Mahout for machine learning, R and Python for analytics and visualization, Spark for real-time processing, MongoDB and HBase as NoSQL databases, Pentaho for BI, and so on.

I will not be showing the integration in this blog but will show them in the Hadoop Integration series. I am already excited about it and I hope you feel the same.

# 4. Lifetime Data Availability

When you want your data to stay available for as long as you need it, you can achieve that with Hadoop’s scalability. There is no fixed limit to the size of the cluster that you can have. You can increase its size at any time, as per your need, by adding datanodes to it at relatively low cost.

The bottom line: use the right technology for your need.

Source: http://www.edureka.co/blog/5+Reasons-when-to-use-and-not-to-use-hadoop/

Big Data - B'cuz My Data is Big

Monday, January 5, 2015