Unstructured and distributed data sets are becoming the norm in the new data-centric world; petabytes of data are processed every single day on the web. First, we need to convert this unstructured data, with its billions of transactions, into knowledge. Second, we need a fast data processing model: Hadoop/MapReduce was the main platform for the last couple of years, but there was strong demand for faster processing, and that's what Apache Spark brought on top of Hadoop. Third, we need to analyze big data in real time, and that's what Apache Storm brought to the architecture.
MapReduce came out of functional programming ways of thinking. In the Hadoop MapReduce framework, datasets are divided into pieces called chunks; a map function is applied to each chunk to create intermediate key-value pairs, and the framework then groups the intermediate values by key and passes them to reduce function invocations that create the output values. Hadoop has the concept of a job tracker, a master node that coordinates everything in the Hadoop cluster. When a client submits a job, the job tracker breaks it into chunks and assigns work to task trackers, which apply the map() and reduce() functions and store the output data on HDFS. But the Hadoop job tracker was a barrier to scaling, and that's what YARN, Yet Another Resource Negotiator, addressed on top of Hadoop. The YARN resource manager replaced the resource management service of the job tracker in Hadoop. In YARN, the application master determines the number of map and reduce tasks, while the resource manager schedules tasks on the node managers.
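The map/shuffle/reduce flow described above can be sketched in plain Python. This is a minimal, single-process sketch of the programming model only; in Hadoop these phases run distributed across task trackers, and the function and variable names here are illustrative:

```python
from collections import defaultdict

# Map phase: each chunk is turned into intermediate key-value pairs.
def map_fn(chunk):
    for word in chunk.split():
        yield (word, 1)

# Shuffle: the framework groups the intermediate values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: one invocation per key creates an output value.
def reduce_fn(key, values):
    return (key, sum(values))

chunks = ["to be or not", "to be"]
intermediate = [pair for chunk in chunks for pair in map_fn(chunk)]
result = dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The framework, not the user, performs the shuffle step; the programmer supplies only the map and reduce functions.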
Spark, on the other hand, is significantly faster and easier to program than MapReduce. Apache Spark extends the MapReduce model to better support iterative algorithms and interactive data mining. Spark has the notion of resilient distributed datasets (RDDs), which means the data stays in memory and we can run multiple iterations over the same dataset. But if you're dealing with an amount of data that doesn't fit within the RAM available in the cluster, then Spark will not be able to process it and we have to go back to Hadoop. From a design and architecture perspective, it's very critical to have the Spark-Hadoop integration in place before moving the data processing to Spark.
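The advantage of keeping an RDD in memory shows up in iterative algorithms: the dataset is loaded once and then re-scanned on every pass, instead of being re-read from disk each iteration as in MapReduce. A minimal plain-Python sketch of that access pattern (the list stands in for a cached RDD; the iterative update is a toy gradient step toward the mean, and all names here are hypothetical, not Spark API calls):

```python
# Stand-in for a cached RDD: loaded once, kept in memory, reused many times.
data = [1.0, 2.0, 3.0, 4.0, 5.0]

# Iterative algorithm (gradient descent toward the mean): every pass
# re-scans the same in-memory dataset -- the pattern RDDs are built for.
estimate = 0.0
for _ in range(20):
    gradient = sum(estimate - x for x in data) / len(data)
    estimate -= 0.5 * gradient

print(round(estimate, 3))  # → 3.0
```

In real Spark code the dataset would be an RDD marked with cache(), and each iteration's scan would be a distributed transformation rather than a local loop.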
Moreover, we'd like to process that big data within a few seconds and convert it into knowledge very quickly. Apache Storm is the solution for real-time data processing. The core concepts of Storm are tuples, which are lists of key-value pairs; streams, which are sequences of tuples; spouts, the entities that generate tuples from the datasets; bolts, the entities that process the data streams; and a topology, which is a directed graph of spouts and bolts. From an architecture perspective, Storm has a master node that runs a daemon called Nimbus, worker nodes that run a daemon called Supervisor, and ZooKeeper, which coordinates Nimbus-Supervisor communication and maintains consistency. Nimbus instructs the supervisors to run workers, worker daemons run executors, and executors run user tasks. Regarding processing guarantees, Storm uses a tuple-tree mechanism, with anchoring and spout replay, to provide an at-least-once processing guarantee. If you want exactly-once processing, you need to track the state of the topology, and that's provided by Trident, which is built on Storm with connectors to the HBase NoSQL data store.
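The spout → bolt → topology structure can be illustrated with a small single-process sketch. This is a simplified simulation, not the Storm API: a real topology is declared through Storm's topology builder and scheduled by Nimbus onto supervisor-managed workers, and the names below are hypothetical:

```python
# Spout: generates a stream (a sequence of tuples) from a dataset.
def sentence_spout(sentences):
    for sentence in sentences:
        yield ("sentence", sentence)  # simplified tuple: a key-value pair

# Bolt 1: consumes sentence tuples, emits one word tuple per word.
def split_bolt(stream):
    for _, sentence in stream:
        for word in sentence.split():
            yield ("word", word)

# Bolt 2: stateful bolt that maintains running word counts.
def count_bolt(stream):
    counts = {}
    for _, word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Topology: a directed graph wiring the spout into the chain of bolts.
sentences = ["storm counts words", "storm streams tuples"]
counts = count_bolt(split_bolt(sentence_spout(sentences)))
print(counts)  # {'storm': 2, 'counts': 1, 'words': 1, 'streams': 1, 'tuples': 1}
```

What the sketch cannot show is the reliability machinery: in Storm each emitted tuple would be anchored to its parent, and the spout would replay any tuple whose tree is not fully acknowledged, which is what yields the at-least-once guarantee.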
In summary, both accuracy and real-time response are important, so we need to integrate both worlds of data processing: Hadoop for batch processing of data at scale, and Storm's graph of spouts and bolts for real-time processing.