Apache Spark

From Christoph's Personal Wiki
Revision as of 19:20, 23 March 2017 by Christoph (Talk | contribs) (Created page with "'''Apache Spark''' is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated...")

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

What is Spark?
  • Fast and general engine for large-scale data processing and analysis
  • Parallel distributed processing on commodity hardware
  • Easy to use
  • A comprehensive, unified framework for Big Data analytics
  • Open source and a top-level Apache project
Spark use cases
  • Big Data use case like intrusion detection, product recommendations, estimating financial risks, detecting genomic associations to a disease, etc. require analysis on large-scale data.
  • In depth analysis requires a combination of tools like SQL, statistics, and machine learning to gain meaningful insights from data.
  • Historical choices like R and Octave operate only on single node machines and are, therefore, not suitable for large volume data sets.
  • Spark allows rich programming APIs like SQL, machine learning, graph processing, etc. to run on clusters of computers to achieve large-scale data processing and analysis.
Spark and distributed processing
  • Challenges of distributed processing:
    • Distributed programming is much more complex than single node programming
    • Data must be partitioned across servers, increasing the latency if data has to be shared between servers over the network
    • Chances of failure increases with the increase in the number of servers
  • Spark makes distributed processing easy:
    • Provides a distributed and parallel processing framework
    • Provides scalability
    • Provides fault-tolerance
    • Provides a programming paradigm that makes it easy to write code in a parallel manner
Spark and its speed
  • Lightning fast speeds due to in-memory caching and a DAG-based processing engine.
  • 100 times faster than Hadoop's MapReduce for in-memory computations and 10 time faster for on-disk.
  • Well suited for iterative algorithms in machine learning
  • Fast, real-time response to user queries on large in-memory data sets.
  • Low latency data analysis applied to processing live data streams
Spark is easy to use
  • General purpose programming model using expressive languages like Scala, Python, and Java.
  • Existing libraries and API makes it easy to write programs combining batch, streaming, interactive machine learning and complex queries in a single application.
  • An interactive shell is available for Python and Scala
  • Built for performance and reliability, written in Scala and runs on top of JVM.
  • Operational and debugging tools from the Java stack are available for programmers.
Spark components
  • Spark SQL
  • Spark Streaming
  • MLib (machine learning)
  • GraphX (graph)
  • Spark R
Spark is a comprehensive unified framework for Big Data analytics
  • Collapses the data science pipeline by allowing pre-processing of data to model evaluation in one single system.
  • Spark provides an API for data munging, Extract, transform, load (ETL), machine learning, graph processing, streaming, interactive, and batch processing. Can replace several SQL, streaming, and complex analytics systems with one unified environment.
  • Simplifies application development, deployment, and maintenance.
  • Strong integration with a variety of tools in the Hadoop ecosystem.
  • Can read and write to different data formats and data sources, including HDFS, Cassandra, S3, and HBase.
Spark is not a data storage system
  • Spark is not a data store but is versatile in reading from and writing to a variety of data sources.
  • Can access traditional business intelligence (BI) tools using a server mode that provides standard Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC).
  • The DataFrame API provides a pluggable mechanism to access structured data using Spark SQL.
  • Its API provides tight optimization integration, thereby enhances the speed of the Spark jobs that process vast amounts of data.
History of Spark
  • Originated as a research project in 2009 at UC Berkeley AMPLab.
  • Motivated by MapReduce and the need to apply machine learning in a scalable fashion.
  • Open sourced in 2010 and transferred to Apache in 2013.
  • A top-level Apache project, as of 2017.
  • "Apache Spark is the Taylor Swift of big data software. The open source technology has been around and popular for a few years. But 2015 was the year Spark went from an ascendant technology to a bona fide superstar".[1]

Spark use cases

  • Fraud detection: Spark streaming an machine learning applied to prevent fraud
  • Network intrusion detection: Machine learning applied to detect cyber hacks
  • Customer segmentation and personalization: Spark SQL and machine learning applied to maximize customer lifetime value
  • Social media sentiment analysis: Spark streaming, Spark SQL, and Stanford's CoreNLP wrapper helps achieve sentiment analysis
  • Real-time ad targeting: Spark used to maximize Online ad revenues
  • Predictive healthcare: Spark used to optimize healthcare costs
Example - Spark at Uber
  • Business Problem: A simple problem of getting people around a city with an army of more than 100,000 drivers and to use data to intelligently perfect the business in an automated and real-time fashion.
  • Requirements:
    • Accurately pay drivers per trips in the dataset
    • Maximize profits by positioning vehicles optimally
    • Help drivers avoid accidents
    • Calculate surge pricing
  • Solution: Use Spark Streaming and Spark SQL as the ELT system and Spark MLlib and GraphX for advanced analytics
  • Reference: Talk by Uber engineers at Apache Spark meetup: https://youtu.be/zKbds9ZPjLE
Example - Spark at Netflix
  • Business Problem: A video streaming service with emphasis on data quality, agility, and availability. Using analytics to help users discover films and TV shows that they like is key to Netflix's success.
  • Requirements:
    • Streaming applications are long-running tasks that need to be resilient in Cloud deployments.
    • Optimize content buying
    • Renowned personalization algorithms
  • Solution: Use Spark Streaming in AWS and Spark GraphX for recommendation system.
  • Reference: Talk by Netflix engineers at Apache Spark meetup: https://youtu.be/gqgPtcDmLGs
Example - Spark at Pinterest
  • Business Problem: Provide a recommendation and visual bookmarking tool that lets users discover, save, and share ideas. Also get an immediate view of Pinterest engagement activity with high-thoughput and minimal latency.
  • Requirements:
    • Real-time analytics to process user's activity.
    • Process petabytes of data to provide recommendations and personalization.
    • Apply sophisticated deep learning techniques to a Pin image in order to suggest related Pins.
  • Solution: Use Spark Streaming, Spark SQL, MemSSQL's Spark connector for real-time analytics, and Spark MLlib for machine learning use cases.
  • Reference: https://medium.com/@Pinterest_Engineering/real-time-analytics-at-pinterest-1ef11fdb1099#.y1cdhb9c3

See also

References

  1. Survey shows huge popularity spike for Apache Spark. Fortune.com. 2015-09-25.

External links