If you are a beginner, understanding the basics of Apache Spark will help you build a strong foundation before you move on to more complex concepts. These concepts often come wrapped in new terminology. Arranging the terms hierarchically, so that the relationships between them are visible, can help you absorb the information more effectively and in less time.
The diagram below is a mind map of the key Apache Spark concepts. A mind map is a technique used to organize information visually, which can improve the brain's ability to retain it. For more information on mind mapping, read this.
The central idea in the mind map below is Apache Spark, with Resilient Distributed Datasets (RDD) and Spark SQL as the main branches. Each of those main branches has secondary and tertiary branches radiating from it, denoting the relationships between the Apache Spark concepts. Print a copy of the diagram and refer to it until these Spark concepts become second nature.
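A Resilient Distributed Dataset (RDD) is a collection of elements partitioned and distributed across multiple nodes in a cluster. It is resilient because a lost partition can be recomputed from the sequence of operations that produced it.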
In Apache Spark, an RDD can be created in one of two ways. The first is from an External Dataset: data stored outside Apache Spark, for example in a local file system, HDFS, or a NoSQL database, is used to create the RDD.
The second way to create an RDD is from a Parallelized Collection. With this method, sample data can be typed in or pasted into the program to create a sample RDD on the fly.
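In PySpark, a parallelized collection is usually created by passing a local list to `sc.parallelize`. The sketch below does not use Spark at all; it is a plain-Python illustration of the underlying idea, which is that a local collection is cut into partitions that could then be handed to different nodes. The function name `partition` and the parameter `num_slices` are hypothetical and chosen only for this example.

```python
def partition(data, num_slices):
    """Split a local list into num_slices contiguous, near-equal partitions."""
    n = len(data)
    return [
        data[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


# Sample data typed directly into the program.
sample = list(range(1, 11))

# Three partitions, each of which could live on a different node.
partitions = partition(sample, 3)
print(partitions)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]
```

Every element lands in exactly one partition, so concatenating the partitions gives back the original collection.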
An action is an operation performed on an RDD that returns a value. For example, the count() action returns the number of elements in the RDD. For a list of Apache Spark actions, click here.
A transformation is an operation performed on an RDD that results in the creation of a new dataset. A filter transformation, for example, takes an existing dataset and keeps only the elements that match the filter criteria. The result is a new dataset; the original is left unchanged.
map, filter, distinct, groupByKey, reduceByKey, and join are some of the most commonly used transformations on an RDD in Apache Spark.
For a detailed list of Apache Spark transformations, click here.
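The difference between transformations and actions can be sketched with a toy, single-machine class. `MiniRDD` below is hypothetical and is not part of Spark; it only borrows the method names `map`, `filter`, `distinct`, `count`, and `collect` to show that a transformation returns a new dataset while an action returns a plain value. As in Spark, the transformations here are lazy: nothing is computed until an action is called.

```python
class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark)."""

    def __init__(self, source):
        # `source` is a zero-argument function that yields the elements.
        self._source = source

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: each returns a NEW MiniRDD and computes nothing yet.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, predicate):
        return MiniRDD(lambda: (x for x in self._source() if predicate(x)))

    def distinct(self):
        # dict.fromkeys keeps the first occurrence of each element, in order.
        return MiniRDD(lambda: iter(dict.fromkeys(self._source())))

    # Actions: each runs the pipeline and returns a plain Python value.
    def count(self):
        return sum(1 for _ in self._source())

    def collect(self):
        return list(self._source())


numbers = MiniRDD.from_list([1, 2, 2, 3, 4, 4, 5, 6])
evens = numbers.filter(lambda x: x % 2 == 0)          # new dataset
squares = evens.distinct().map(lambda x: x * x)       # new dataset

print(evens.count())      # 5
print(squares.collect())  # [4, 16, 36]
print(numbers.count())    # 8, the original dataset is unchanged
```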
A Key-Value pair represents a data value together with an attribute that describes it. The attribute often identifies the value uniquely, hence the term Key.
Apache Spark provides more than a handful of operations that work on Key-Value pairs: aggregateByKey, combineByKey, and lookup are some examples.
For a full list of Apache Spark Key-Value pair operations, click here.
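To make the Key-Value operations concrete, here is a plain-Python sketch of what grouping, reducing, and looking up by key mean for a small list of pairs. The helper functions `group_by_key`, `reduce_by_key`, and `lookup` are hypothetical local functions named after the corresponding Spark operations; they run on one machine and say nothing about how Spark distributes the work.

```python
from collections import defaultdict
from functools import reduce


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_by_key(pairs, fn):
    """Combine the values of each key into a single value using fn."""
    return {key: reduce(fn, values) for key, values in group_by_key(pairs).items()}


def lookup(pairs, key):
    """Return every value stored under the given key."""
    return [v for k, v in pairs if k == key]


# Each tuple is a (key, value) pair: product name and quantity sold.
sales = [("apples", 3), ("pears", 2), ("apples", 5), ("pears", 1), ("plums", 4)]

print(group_by_key(sales))                       # {'apples': [3, 5], 'pears': [2, 1], 'plums': [4]}
print(reduce_by_key(sales, lambda a, b: a + b))  # {'apples': 8, 'pears': 3, 'plums': 4}
print(lookup(sales, "apples"))                   # [3, 5]
```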
Spark SQL is a module in Apache Spark used for processing structured data. It provides the capability to interact with data using Structured Query Language (SQL) or the Dataset application programming interface.
The main benefit of the Spark SQL module is that it brings the familiarity of SQL for interacting with data. For a comprehensive list of Spark SQL functions, click here.
A Dataset in the context of Apache Spark SQL is a collection of data distributed over one or more partitions.
A Spark SQL DataFrame is a Dataset of rows distributed over one or more partitions, arranged into named columns. It is analogous to a table in a relational database.
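The relational-table analogy can be tried out without a Spark cluster. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in: it builds a small table with named columns and queries it with SQL. The table name `people` and its rows are made up for the example. In Spark, the comparable step would be to register a DataFrame as a temporary view and query it through `spark.sql`.

```python
import sqlite3

# An in-memory relational table with named columns, standing in for a DataFrame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ana", 34), ("Ben", 19), ("Chen", 27)],
)

# Query the rows by column name using SQL.
adults = conn.execute(
    "SELECT name FROM people WHERE age >= 21 ORDER BY name"
).fetchall()
conn.close()

print(adults)  # [('Ana',), ('Chen',)]
```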
Spark SQL has the functionality to operate on data in a number of different formats. Parquet, JSON, Hive and ORC are some of these formats. Spark SQL loads data in these formats into a DataFrame, which can then be queried using SQL or transformations.
Apache Parquet is a columnar storage format from the Apache Software Foundation that is heavily used in the Hadoop ecosystem. Its main advantages are support for nested data structures and efficient compression. For documentation and examples on Parquet usage, click here.
Spark SQL has built-in functionality to infer the schema of a JSON dataset and load the data into a Spark SQL Dataset.
For documentation on JSON structure, click here.
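Schema inference can be pictured as scanning the records and noting which fields appear and what types their values have. The sketch below is a simplified, plain-Python illustration of that idea and is not the algorithm Spark uses; the function `infer_schema` and the sample records are hypothetical. The input is written with one JSON object per line, which is the layout Spark's JSON reader expects by default.

```python
import json


def infer_schema(json_lines):
    """Map each field name to the set of Python type names seen for it."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# One JSON object per line; fields may be missing or null in some records.
lines = [
    '{"name": "Ana", "age": 34}',
    '{"name": "Ben", "age": 19, "city": "Oslo"}',
    '{"name": "Chen", "age": null}',
]

schema = infer_schema(lines)
for field in sorted(schema):
    print(field, sorted(schema[field]))
# age ['NoneType', 'int']
# city ['str']
# name ['str']
```

A field that is absent from some records, such as `city` here, still becomes a column; the records without it would simply hold no value for that column.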
Apache Hive is data warehouse software from the Apache Software Foundation. Tables in Hive can be read by Apache Spark SQL for analysis, summarization, and other related processing.
For documentation on Apache Hive tables, and DDL examples, click here.
ORC, short for Optimized Row Columnar, is another columnar data format supported by Apache Spark. Its features include built-in indexes, complex types, and ACID support.
For more details on ORC, click here.