Apache Spark is a general-purpose, lightning-fast data processing engine suitable for a wide range of workloads. Spark leverages Hadoop’s strengths in cluster management, data persistence, and compliance.
Spark was developed in 2009 at UC Berkeley’s AMPLab and open-sourced in 2010, later becoming the Apache Spark project. According to figures on Apache.org, Spark can “run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.”
In this blog post, let’s talk about how Spark complements Hadoop. Although Spark is a viable alternative to Hadoop MapReduce in many circumstances, it is not a replacement for Hadoop.
Spark has been designed to run on top of Hadoop as an alternative to the traditional batch map/reduce model, leveraging Hadoop’s cluster manager (YARN) and underlying storage (HDFS, HBase, etc.). Spark can also run entirely outside Hadoop, integrating with alternative cluster managers such as Mesos and alternative storage platforms such as Cassandra and Amazon S3.
MapReduce is a programming model. In Hadoop MapReduce, each job reads its input from disk and writes its output back to disk, so a multi-step pipeline pays the disk cost at every step. Spark can be ten or more times faster because it does not have to write intermediate data back to disk between steps; where the data fits, the work is done in memory. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data model, resilient distributed datasets (RDDs), which guarantee fault tolerance in a clever way that minimizes network I/O.
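To make the programming model concrete, here is a minimal single-process sketch of a word count written as map, shuffle, and reduce phases. It is plain Python, not Hadoop’s or Spark’s API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration. In Hadoop MapReduce the output of such a job would be written to disk before the next job could read it; here everything simply stays in memory.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases; intermediate results never leave memory."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark runs on hadoop", "spark runs in memory"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions run in parallel on many machines and the shuffle moves data across the network; the sketch only shows the shape of the computation.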
From the Spark academic paper: “RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition.” This removes the need for replication to achieve fault tolerance.
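The lineage idea can be illustrated with a toy sketch. `ToyRDD` below is a hypothetical, single-process stand-in written for this post, not Spark’s actual RDD class: each dataset remembers how to derive any one of its partitions from its parent, so a lost partition can be recomputed on its own. The `computed` list is an assumption added only to show which partitions were rebuilt.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; illustrates lineage, not Spark's API."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute   # lineage: recipe for deriving partition i
        self._cache = {}          # partitions currently held in memory
        self.computed = []        # log of partition indexes that were (re)built

    @classmethod
    def from_source(cls, partitions):
        # The copied lists stand in for stable storage such as HDFS.
        data = [list(p) for p in partitions]
        return cls(len(data), lambda i: list(data[i]))

    def map(self, fn):
        # The child keeps a reference to its parent instead of a copy of the data.
        parent = self
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in parent.partition(i)],
        )

    def partition(self, i):
        # Build the partition from lineage only if it is not already in memory.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
            self.computed.append(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate losing one in-memory partition, e.g. a failed node.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


if __name__ == "__main__":
    squares = ToyRDD.from_source([[1, 2], [3, 4], [5, 6]]).map(lambda x: x * x)
    print(squares.collect())
    squares.lose_partition(1)
    print(squares.collect())   # same result; only partition 1 is rebuilt
    print(squares.computed)
```

Because only the recipe is stored, nothing has to be copied to other machines ahead of time; the cost is paid as recomputation, and only for the partition that was actually lost.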
Spark is an independent project. It has become another data processing engine in the Hadoop ecosystem, adding more capability to the Hadoop stack. Spark also lets developers write applications in Java, Python, or Scala and build parallel applications that take full and fast advantage of a distributed environment.
Spark complements Hadoop by adding:
- Iterative Algorithms in Machine Learning
- Interactive Data Mining and Data Processing
- Hive-compatible data warehousing: fully Apache Hive-compatible queries that can run up to 100x faster than Hive
- Stream processing: Log processing and Fraud detection in live streams for alerts, aggregates and analysis
In certain circumstances, Spark’s SQL, streaming, and graph processing capabilities may also prove valuable.
In this blog post, I discussed how Spark adds value to Hadoop. The signs point to Spark becoming a significant component within Hadoop.