What Caused the Downfall of Hadoop in Big Data Domain? – Analytics Insight

While Hadoop emerged as a favorite among big data technologies, it could not keep up with the hype!

Hadoop is one of the most popular open-source frameworks from Apache, used in the big data community for data processing. Its origins go back to 2002, when Doug Cutting and Mike Cafarella, two software engineers who wanted to improve web indexing, began work on what became the Apache Nutch project. Building on Google’s File System paper, that work grew into Hadoop, which debuted in 2006 as version 0.1.0. Since then, Hadoop has been used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more.

With the rising importance of big data across industries, many business activities now revolve around data. Hadoop is well suited to MapReduce analysis of huge amounts of data. Its specific use cases include data searching, data analysis, data reporting, large-scale indexing of files and other big data functions. It can also store and process any kind of file data, be it large or small, plain text files or binary files like images, and even multiple versions of data across different time periods. In essence, it stores data using the Hadoop Distributed File System (HDFS) and processes it using the MapReduce programming model. Because it runs on inexpensive servers and keeps the cost of storing and processing data low, Hadoop became a huge hit in the business sector.
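To make the MapReduce programming model mentioned above more concrete, here is a minimal sketch of the classic word-count example. It runs in a single Python process and does not use Hadoop or HDFS; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group values by key, which a framework does between map and reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big files", "data pipelines"]))
```

In a real cluster the map and reduce steps would run in parallel on many machines, with the input split into blocks stored across the file system; this sketch only shows the shape of the model.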

Hadoop has three components, viz.,

• Hadoop HDFS – Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.

• Hadoop MapReduce – Hadoop MapReduce is the processing unit of Hadoop.

• Hadoop YARN – Hadoop YARN is the resource management unit of Hadoop.

Hadoop seemed highly promising a decade ago. In 2008, Cloudera became the first dedicated Hadoop company, followed by MapR in 2009 and Hortonworks in 2011. It was a huge hit among Fortune 500 companies, which were fascinated by big data’s potential to generate a competitive advantage. However, as data analytics became mainstream, Hadoop faltered because it offered very little in the way of analytic capabilities. Further, as businesses migrated to the cloud, they soon found alternatives to HDFS and the Hadoop processing engine.

Every cloud vendor offered its own big data services, which could handle work that was previously possible only on Hadoop, and do so in a more efficient and hassle-free manner. Users no longer had to deal with the administration, security, and maintenance issues they faced with Hadoop. The security issues are often attributed to the fact that Hadoop is written in Java, a widely used programming language that cybercriminals have heavily exploited, which makes it a bull’s-eye for security breaches.

A 2015 study from Gartner found that 54% of companies had no plans to invest in Hadoop. Among those that were not investing, 49% were still trying to figure out how to get value from it, while 57% cited the skills gap as the major reason. The skills gap is another key reason behind the downfall of Hadoop. Many companies had jumped on the bandwagon because of the hype surrounding it. Some of them did not have enough data to warrant a Hadoop rollout, or started using big data technologies without estimating how much data they would actually need to process.

While file-intensive MapReduce was a great piece of software for simple requests, it could not do much for iterative data processing. This is why it is a poor option for machine learning too. Machine learning relies on a cyclic flow of data, whereas in Hadoop data flows through a chain of stages in which the output of one stage becomes the input of the next. Therefore, machine learning is impractical in Hadoop unless it is paired with a third-party library.
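The mismatch between iterative algorithms and a chain of stages can be illustrated with a toy sketch. The code below fits a one-parameter model `y = weight * x` by gradient descent, treating every iteration as a separate MapReduce-style job that makes a full pass over the dataset. It is an in-memory simplification, not Hadoop code; the names `run_job` and `train`, the learning rate and the sample data are all assumptions made for illustration.

```python
def run_job(dataset, weight, stats):
    """One MapReduce-style job: a full pass over the stored dataset."""
    stats["full_passes"] += 1  # every job reads the whole input again
    # Map step: per-record gradient of the squared error for y = weight * x.
    gradients = [2 * (weight * x - y) * x for x, y in dataset]
    # Reduce step: average the per-record gradients into one number.
    return sum(gradients) / len(gradients)


def train(dataset, iterations=50, learning_rate=0.05):
    """Chain one job per iteration, feeding each result into the next."""
    stats = {"full_passes": 0}
    weight = 0.0
    for _ in range(iterations):
        gradient = run_job(dataset, weight, stats)
        weight -= learning_rate * gradient
    return weight, stats["full_passes"]


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # points on the line y = 2x
    weight, passes = train(data)
    print(f"weight={weight:.4f} after {passes} full passes over the data")
```

In this sketch the extra passes are cheap because the data sits in memory. In a file-based batch system each pass would mean launching a new job and reading the input from storage again, which is where the cost of iteration comes from.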

It was also an inefficient solution for smaller datasets. In other words, while it is perfect for a small number of large files, Hadoop fails again when an application deals with a large number of small files. This is because a large number of small files tends to overload the NameNode, which stores the namespace for the file system, and makes it difficult for Hadoop to function. It is also not suitable for non-parallel data processing.
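A back-of-the-envelope calculation shows why many small files strain the NameNode. The sketch below assumes a commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per namespace object (file or block) and a block size of 128 MB; both figures are assumptions for illustration and not measurements from any particular cluster. Directories are ignored for simplicity.

```python
import math

BYTES_PER_OBJECT = 150        # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 ** 2  # assumed HDFS block size of 128 MB


def namenode_objects(file_count: int, file_size: int) -> int:
    """Count namespace objects: one per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file)


def namenode_bytes(file_count: int, file_size: int) -> int:
    """Estimate NameNode memory in bytes for the given files."""
    return namenode_objects(file_count, file_size) * BYTES_PER_OBJECT


if __name__ == "__main__":
    gib, mib = 1024 ** 3, 1024 ** 2
    # The same 1 TiB of data, stored two different ways.
    large = namenode_bytes(file_count=1024, file_size=gib)
    small = namenode_bytes(file_count=1024 ** 2, file_size=mib)
    print(f"1,024 files of 1 GiB:     {large / mib:.1f} MiB of metadata")
    print(f"1,048,576 files of 1 MiB: {small / mib:.1f} MiB of metadata")
```

Under these assumptions, the same volume of data needs a few hundred times more NameNode memory when it is stored as small files, because the metadata cost grows with the number of files and blocks, not with the number of bytes.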

At the same time, Cloudera and Hortonworks were seeing adoption decline every year, which led to the eventual merger of the two companies in 2019.

Lastly, another major reason behind the downfall of Hadoop is the fact that it is a batch processing engine. Batch processes are those that run in the background and do not interact with the user. The engines used for them are not efficient at stream processing. They also cannot produce output in real time with low latency, which is a must for real-time data analysis.
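The difference between batch and stream processing can be sketched with a running average. This is a simplified, single-process illustration that does not use Hadoop or any streaming engine; the function names `batch_average` and `streaming_average` are hypothetical.

```python
from typing import Iterable, Iterator


def batch_average(events: Iterable[float]) -> float:
    """Batch style: collect everything, then compute once at the end."""
    collected = list(events)
    return sum(collected) / len(collected)


def streaming_average(events: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated result after every event."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


if __name__ == "__main__":
    readings = [10.0, 20.0, 30.0]
    print("batch result:", batch_average(readings))
    for latest in streaming_average(readings):
        print("streaming result so far:", latest)
```

The batch version has nothing to report until the whole input has been read, while the streaming version has an up-to-date answer after each event. That gap in latency is what the paragraph above describes.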