Apache Spark

Spark is a distributed, in-memory compute framework. It provides a platform for ingesting, analyzing, and querying data. In addition to high-level APIs in Java, Scala, Python, and R, Spark has a broad ecosystem of libraries, including Spark SQL (structured data), MLlib (machine learning), GraphX (graph data), and Spark Streaming (micro-batch data streams).
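
As a minimal sketch of the high-level API (the application name and file path below are illustrative, not taken from any particular deployment), a PySpark job creates a SparkSession and can then mix DataFrame operations and SQL:

    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession, the entry point to the DataFrame and SQL APIs
    spark = SparkSession.builder.appName("spark-intro").getOrCreate()

    # Read a JSON file into a DataFrame; the path is a placeholder
    events = spark.read.json("hdfs:///data/events.json")

    # Run a simple aggregation with the DataFrame API
    events.groupBy("event_type").count().show()

    spark.stop()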

Developed at UC Berkeley in 2009, Apache Spark is well suited for interactive querying and analysis of extremely large datasets. It's available in the cloud through AWS Elastic MapReduce and Databricks, or it can be deployed on-premises. Spark can scale to thousands of machines to handle petabytes of data.

Apache Spark Use Cases

Flexibility

Because Spark is a platform, it offers numerous APIs and a broad ecosystem of libraries, and it works well across a wide variety of datasets. As such, it's well suited to serve many units within a single business: data engineering, data science, analytics, and business intelligence.

Spark is a great choice for folks who have invested time and money into a Hadoop cluster, as it can easily slot into existing infrastructure and run on YARN, Mesos, or in standalone mode. Spark is also easy to spin up using AWS Elastic MapReduce (EMR), making it a relatively low-cost option for one-off or scheduled batch jobs.
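
As a rough sketch of how the same application can target different cluster managers (the master URLs and app name are illustrative; in practice the master is usually supplied via spark-submit's --master flag rather than hard-coded):

    from pyspark.sql import SparkSession

    # The master URL decides where the job runs:
    #   "local[*]"          - run locally using all available cores
    #   "yarn"              - run on an existing Hadoop/YARN cluster (e.g., EMR)
    #   "spark://host:7077" - run against a standalone Spark cluster
    spark = (SparkSession.builder
             .appName("existing-infrastructure-demo")
             .master("local[*]")
             .getOrCreate())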

Mature SQL Syntax

As of Spark 2.0, Spark is ANSI SQL:2003 compliant, which means Spark SQL supports operations, such as subqueries and window functions, that are not available in some other dialects.
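
For example, a query combining a subquery and a window function runs as-is in Spark SQL. The sketch below assumes a table or view named sales with region, product, and revenue columns already exists; the names are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    # Rank products by revenue within each region and keep the top one per region
    top_per_region = spark.sql("""
        SELECT region, product, revenue
        FROM (
            SELECT region, product, revenue,
                   RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rnk
            FROM sales
        ) ranked
        WHERE rnk = 1
    """)
    top_per_region.show()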

Large Developer Network

Spark enjoys a large amount of developer and vendor support. This allows Spark to have a rapid and continuous development cycle.

What is Spark great for?

Very Small Analytical Queries

Spark SQL performs particularly well on smaller analytical queries. AtScale, a business intelligence vendor that focuses on Hadoop solutions, compared Impala, Hive on Tez, Spark, and Presto in its 2016 benchmark, and Spark (along with Impala) was found to perform particularly well against smaller datasets.

Very Large Analytical Queries

Spark is also particularly well suited to handling very large, billion-row datasets. Spark characteristically outperforms other SQL engines when joining billion-row tables, and it outperformed all other SQL engines in AtScale's 2016 benchmark when tested against the largest datasets.

Data Science Workflows

Spark integrates seamlessly with other tools in the data scientist’s toolkit, such as R (via SparkR) and Python (via PySpark), and it comes with a machine learning library (MLlib), which makes Apache Spark a favorite for data scientists interested in exploring data stored on Hadoop.
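
As a small, self-contained sketch of an MLlib workflow in PySpark (the toy data and column names are made up purely for illustration):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    # A tiny illustrative dataset with two numeric features and a binary label
    df = spark.createDataFrame(
        [(1.0, 0.5, 1.0), (0.4, 2.3, 0.0), (3.1, 0.2, 1.0), (0.2, 1.8, 0.0)],
        ["f1", "f2", "label"])

    # Assemble the feature columns into a vector and fit a logistic regression
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(df)

    model.transform(df).select("label", "prediction").show()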

Price Considerations

Managed Spark platforms are offered by vendors Databricks and Amazon Web Services.

Databricks prices its platform according to how you use it, and free trials of its plans are available upon request. The pricing tiers Databricks offers are:

  • Community: Free
  • Databricks for Data Engineering (scheduled service)
  • Databricks for Data Analytics (using functionality within a Databricks notebook)
  • Enterprise (comes packed with a host of security features for enterprise)

You can spin up an Amazon Elastic MapReduce (EMR) cluster with a large number of configuration options and variable pricing. Because Spark does most of its processing in memory, it’s recommended that you choose memory-optimized instance types.

Hadoop Spark Architecture

Types of data

Spark’s core APIs can process unstructured, semi-structured, and structured data; Spark SQL specifically operates on the latter two, where some schema is present. Spark’s native data structure is the resilient distributed dataset (RDD): effectively a distributed, fault-tolerant collection of data. DataFrames are the lingua franca between Spark and Spark SQL. They add a columnar structure to data and are analogous to dataframes in both R and Python (pandas).
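
A brief sketch of the relationship between RDDs and DataFrames in PySpark (the names and values are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

    # An RDD: a distributed, fault-tolerant collection of arbitrary records
    rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])

    # A DataFrame adds a columnar schema on top of the same data
    df = rdd.toDF(["name", "age"])
    df.printSchema()

    # Once registered as a view, the DataFrame can be queried with Spark SQL
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()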

Maximum Recommended Size of Data

Spark can theoretically scale up to thousands of machines, making it able to handle truly big data—petabyte scale. It can also ingest plain text, Avro, and columnar formats such as Parquet and ORC, meaning it can handle many different types of structured data.
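
For example, text and columnar files can be read with the built-in readers, while Avro usually requires the external spark-avro package on the classpath (all paths below are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("formats-demo").getOrCreate()

    # Columnar (Parquet) and plain-text data are supported out of the box
    parquet_df = spark.read.parquet("s3a://my-bucket/events/")
    text_df = spark.read.text("hdfs:///logs/app.log")

    # Avro typically needs the spark-avro package added to the job
    avro_df = spark.read.format("avro").load("hdfs:///data/events.avro")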

Using Spark

Process for new data

You can read new data directly into Spark SQL using the built-in JSON or delimited text-file parsers, and you can read from local filesystems, HDFS, S3, relational databases (via JDBC), and other sources. Databricks has also released a Redshift-to-Spark connector, which enables you to move data from Amazon Redshift, via S3, into Spark to query in any manner you choose.
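
A sketch of reading new data from a few common sources (the paths, connection details, and credentials are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ingest-demo").getOrCreate()

    # JSON and delimited text files can be parsed directly
    json_df = spark.read.json("s3a://my-bucket/raw/events.json")
    csv_df = spark.read.csv("file:///data/incoming/orders.csv",
                            header=True, inferSchema=True)

    # Relational sources are read over JDBC
    jdbc_df = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://dbhost:5432/mydb")
               .option("dbtable", "public.orders")
               .option("user", "reader")
               .option("password", "secret")
               .load())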

Alternatively, you can set up a pipeline that first uses Spark or Spark Streaming to process unstructured data, defines an explicit schema for the data, and converts it to a DataFrame. You can then query the structured data in a number of ways: interactively from the Spark shell, by submitting an application, or over JDBC (for example, through beeline).
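
A sketch of that pipeline in PySpark, assuming space-delimited log lines containing a host and a status code (the file path, parsing logic, and view name are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("schema-demo").getOrCreate()

    # Parse unstructured log lines into (host, status) tuples
    lines = spark.sparkContext.textFile("hdfs:///logs/access.log")
    parsed = lines.map(lambda l: l.split(" ")).map(lambda p: (p[0], int(p[1])))

    # Define an explicit schema and convert the RDD to a DataFrame
    schema = StructType([
        StructField("host", StringType(), True),
        StructField("status", IntegerType(), True),
    ])
    logs_df = spark.createDataFrame(parsed, schema)

    # Register a view so the data can be queried with SQL from the shell,
    # an application, or over JDBC
    logs_df.createOrReplaceTempView("access_logs")
    spark.sql("SELECT host, COUNT(*) AS hits FROM access_logs GROUP BY host").show()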

Maintenance

Spark has many parameters and configuration settings for tuning performance, which can be a little overwhelming for newcomers. While Spark will run with the defaults in place, optimal performance will likely require tuning.
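
A few of the settings people commonly tune, shown here programmatically for illustration (the values are placeholders rather than recommendations, and in practice these are often set in spark-defaults.conf or via spark-submit):

    from pyspark.sql import SparkSession

    # Illustrative settings; sensible values depend on cluster size and workload
    spark = (SparkSession.builder
             .appName("tuning-demo")
             .config("spark.executor.memory", "8g")
             .config("spark.executor.cores", "4")
             .config("spark.sql.shuffle.partitions", "400")
             .config("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
             .getOrCreate())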

Most BI tools connect through JDBC or ODBC and, therefore, cannot issue Scala, Python, or DataFrame operations. Instead, they issue SQL against tables registered in the Hive metastore. This limits the flexibility one might expect to get out of Spark and their preferred BI tool: BI tools need to know the schema to do anything with the underlying data.
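
One common pattern, sketched below, is to persist prepared DataFrames as metastore tables so that BI tools connecting over JDBC/ODBC can discover the schema and query them with plain SQL (the database and table names here are illustrative):

    from pyspark.sql import SparkSession

    # Enable Hive support so that saved tables are registered in the Hive metastore
    spark = (SparkSession.builder
             .appName("bi-demo")
             .enableHiveSupport()
             .getOrCreate())

    # A placeholder DataFrame standing in for prepared, structured data
    df = spark.range(10).withColumnRenamed("id", "order_id")

    # Persist it as a managed table that BI tools can see and query
    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
    df.write.mode("overwrite").saveAsTable("analytics.orders_summary")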