Hi Guys!

This post introduces you to Hadoop and MapReduce. In our previous post, we discussed what Big Data is, what its types are, and the reasons to learn Hadoop.

Let's see how Hadoop came into existence.

                                                       History of Hadoop

  • Even though the word “Hadoop” may be new to you, it is already more than 10 years old.
  • Like everything else, Hadoop has a history of its own.
  • The story begins on a sunny afternoon in 1997, when Doug Cutting started writing the first version of Lucene.

What is Lucene ?

  • Lucene is a full-text search library written by Doug Cutting. It was used for faster searching of web pages.
  • After a few years, he ran into “Dead Code Syndrome”, so in search of better solutions he open-sourced it on SourceForge.
  • In 2001 it became Apache Lucene, and the focus turned to indexing web pages.
  • Mike Cafarella, then a graduate student at the University of Washington, joined him with the goal of indexing the entire web.
  • This combined effort yielded a new Lucene sub-project called Apache Nutch.
  • An important algorithm used to rank web pages by their relative importance is PageRank, named after Larry Page, who came up with it.
  • It is a simple but brilliant algorithm that basically counts how many links from other pages on the web point to a page; the page with the highest count is ranked highest (shown at the top of search results). Of course, that is not the only method of determining page importance, but it is certainly one of the most relevant.
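To make the counting idea concrete, here is a minimal PageRank sketch in Python. The tiny three-page link graph, the damping factor of 0.85, and the iteration count are assumptions made up for this illustration, not details from the story above:

```python
# Toy link graph: links[p] = list of pages that p links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively spread each page's rank across the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page keeps a small base rank, plus shares from its in-links.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

ranks = pagerank(links)
# "C" receives links from both "A" and "B", so it ends up ranked highest.
```

Since "C" has the most incoming links, it comes out on top, matching the intuition that more in-links mean more importance.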

Origin of HDFS

  • During this time, Cutting and Cafarella identified four requirements that the existing file systems could not meet:

[1] Schema-less (no tables and columns)

[2] Durable (data, once written, should never be lost)

[3] Capable of handling component failure (CPU, memory, network)

[4] Automatically rebalanced (to even out disk space consumption)

                                                                            Google’s Solution 

  • In 2003, Google published the GFS (Google File System) paper. Cutting and Cafarella were astonished to see solutions for the very difficulties they were facing.
  • Using the GFS paper as a blueprint, they implemented their own file system in Java, called NDFS (Nutch Distributed File System).
  • Durability and fault tolerance still had to be addressed.
  • They solved both by splitting files into 64 MB chunks and storing each chunk on 3 different nodes; the number of copies is called the replication factor, and its default was set to 3.
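The split-and-replicate idea above can be sketched in a few lines of Python. The 64 MB block size and replication factor of 3 come from the post; the round-robin placement and node names are assumptions for illustration only (real HDFS uses a rack-aware placement policy):

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB blocks, as in early NDFS/HDFS
REPLICATION = 3                # default replication factor

def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Simple round-robin placement across the cluster.
        placement[block] = [nodes[(block + i) % len(nodes)]
                            for i in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
layout = place_blocks(200 * 1024 * 1024, nodes)
# A 200 MB file becomes 4 blocks, each stored on 3 different nodes,
# so losing any single node never loses data.
```

With every block living on 3 nodes, any single node (or disk) can fail without data loss, which is exactly the durability and fault-tolerance property they were after.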

Time for Map Reduce

  • Now they needed an algorithm for NDFS that would let them integrate parallel processing, that is, running a job on multiple nodes at the same time.
  • Thus, in 2004, Google published a paper called MapReduce: Simplified Data Processing on Large Clusters.
  • This algorithm solved problems like:

[1] Parallelization

[2] Distribution

[3] Fault-tolerance
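The canonical illustration of the MapReduce model is word count. Below is a minimal single-machine sketch in Python of the three phases; the helper names are ours, and the real framework distributes these steps across many nodes, which is where the parallelization, distribution, and fault tolerance come in:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["hadoop map reduce", "map reduce map"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"hadoop": 1, "map": 3, "reduce": 2}
```

Because each map call touches only one document and each reduce call only one word's counts, the framework can run many of them in parallel and simply re-run any task whose node fails.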

Rise of Hadoop

  • In 2005, Cutting reported that MapReduce was integrated into Nutch.
  • In 2006, he pulled NDFS and MapReduce out of the Nutch codebase and named the new project Hadoop.
  • Hadoop included Hadoop Common (core libraries), HDFS, and MapReduce.
  • Yahoo! was facing the same problems, so they employed Cutting to transform their file system with Hadoop, which effectively saved Yahoo!.

Facebook, Twitter, LinkedIn…

  • Later, companies like Facebook, Twitter, and LinkedIn started using Hadoop.
  • In 2008, Hadoop was still a sub-project of Lucene, so Cutting made it a separate project licensed under the Apache Software Foundation.
  • Other companies began noticing problems with their own file systems, started experimenting with Hadoop, and created sub-projects like Hive, Pig, HBase, and ZooKeeper.

That is all about how Hadoop and MapReduce came into existence. People say Hadoop is a new technology, but it is already more than 10 years old.

Best Regards,

Ajay Kumar Jogawath

Research and Development Engineer

Big Data Evangelist