If you are new to Apache Spark, this Spark tutorial on the fundamentals is for you. Mastering the basics of Apache Spark will help you build a strong foundation before you get to the more complex concepts.
These concepts often come intertwined with new terminology. Associating the terms with one another in a hierarchical manner to build relationships helps you assimilate the information effectively, and in a shorter time frame.
The diagram below is a mind map of the key Apache Spark concepts. A mind map is a technique used to organize information visually, dramatically increasing the brain’s ability to retain it. For more information on mind mapping, read this.
Apache Spark is a data processing engine. You can use Spark to process large volumes of structured and unstructured data for Analytics, Data Science, Data Engineering and Machine Learning initiatives.
However, it’s important to note that Apache Spark is not a database or a distributed file system. In other words, it does not store your data. Spark provides raw processing power to crunch data and extract meaningful information. Spark provides four modules for this: Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX (graph processing).
Apache Spark is not limited to a single programming language. You have the flexibility to use Java, Python, Scala, R or SQL to build your programs. Some minor limitations do apply.
If you are just starting out and do not have any programming experience, I highly recommend beginning with SQL and Scala. Those are the two most popular programming languages for Spark and give you the best bang for your buck!
RDD in Apache Spark stands for Resilient Distributed Dataset. It is nothing more than your data file converted into an internal Spark format, then partitioned and distributed across multiple computers (nodes).
Resilient Distributed Dataset (RDD) sounds intimidating. Keep things simple. Just think of it as data in Apache Spark internal format.
The diagram below depicts the key components of an Apache Spark RDD as a mind map.
In Apache Spark, an RDD can be created in one of two ways. The first is from an external dataset: a file that lives outside Apache Spark, on the local file system, in HDFS, or in a NoSQL database, can be loaded to create an RDD, as in the sketch below.
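A minimal sketch, assuming a Spark shell where sc (the SparkContext) is already defined; the HDFS path is a hypothetical example:

```scala
// Assumes a Spark shell where sc (SparkContext) is predefined.
// The path "hdfs:///data/sales.txt" is a hypothetical example.
val linesRdd = sc.textFile("hdfs:///data/sales.txt")

// Each element of the RDD is one line of the file, partitioned
// and distributed across the worker nodes.
println(linesRdd.getNumPartitions)
```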
Parallelized Collections are the alternate method of creating an RDD. Here, sample data can be typed in or copy-pasted to create an RDD on the fly.
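A quick sketch of the same idea, again assuming a Spark shell with sc predefined; the numbers are made up on the spot:

```scala
// A small in-memory collection becomes a distributed RDD.
val numbers = List(10, 20, 30, 40, 50)
val numbersRdd = sc.parallelize(numbers)

// The collection is now partitioned across the cluster like any other RDD.
println(numbersRdd.getNumPartitions)
```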
An action is an operation performed on an RDD which returns a value. For example, the count() action returns the number of elements in the dataset. For a list of Apache Spark actions, click here.
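A short sketch of actions in practice, assuming a Spark shell with sc available; the words are illustrative:

```scala
// Illustrative data only.
val wordsRdd = sc.parallelize(Seq("spark", "rdd", "action", "spark"))

// count() is an action: it triggers computation and returns a value to the driver.
val total = wordsRdd.count()       // 4

// first() is another action, returning a single element.
val firstWord = wordsRdd.first()   // "spark"
```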
A transformation is an operation performed on an RDD which results in the creation of a new dataset. A filter transformation, for example, takes an existing dataset and reduces it based on the filter criteria; the result is a new dataset, as in the sketch below.
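A sketch of the filter transformation, assuming a Spark shell with sc available; the numbers are illustrative:

```scala
// Illustrative data only.
val numbersRdd = sc.parallelize(1 to 10)

// filter() is a transformation: it returns a new RDD and is evaluated lazily.
val evensRdd = numbersRdd.filter(n => n % 2 == 0)

// Nothing is computed until an action (collect, here) is invoked.
println(evensRdd.collect().mkString(", "))   // 2, 4, 6, 8, 10
```

Note that transformations are lazy: Spark only records the lineage of operations and computes results when an action such as collect() is called.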
map, filter, distinct, groupByKey, reduceByKey and join are some of the most commonly used transformations on an RDD in Apache Spark.
For a detailed list of Apache Spark transformations, click here.
A Key-Value pair is a representation of a data value and its attribute as a set. The data attribute often uniquely identifies the value, hence the term Key.
In Apache Spark, there are more than a handful of operations that work on Key-Value pairs; aggregateByKey, combineByKey and lookup are some examples.
For a full list of Apache Spark Key-Value pair operations, click here.
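A brief sketch of a couple of these operations, assuming a Spark shell with sc available; the sales pairs are made up:

```scala
// Illustrative Key-Value data: (key, value) tuples.
val salesRdd = sc.parallelize(Seq(("apple", 2), ("banana", 1), ("apple", 3)))

// reduceByKey combines all values that share the same key.
val totalsRdd = salesRdd.reduceByKey(_ + _)

// lookup is a Key-Value action returning every value for a given key.
println(totalsRdd.lookup("apple").head)   // 5
```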
Spark SQL is a module in Apache Spark used for processing structured data. It provides the capability to interact with data using Structured Query Language (SQL) or the Dataset application programming interface.
The main benefit of the Spark SQL module is that it brings the familiarity of SQL for interacting with data. For a comprehensive list of Spark SQL functions, click here.
A Dataset in the context of Apache Spark SQL is a collection of data distributed over one or more partitions.
A Spark SQL DataFrame is a Dataset of rows distributed over one or more partitions, arranged into named columns. It is analogous to a table in a relational database.
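A minimal sketch of building and querying a DataFrame; the application name and the people data are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// The app name and data below are illustrative.
val spark = SparkSession.builder().appName("dataframe-demo").getOrCreate()
import spark.implicits._

// A small DataFrame with named columns, analogous to a database table.
val df = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

// Register it as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()
```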
Spark SQL can operate on data in a number of different formats, including Parquet, JSON, Hive and ORC. Spark SQL loads data in these formats into a DataFrame, which can then be queried using SQL or transformations.
Apache Parquet is a columnar storage format from the Apache Software Foundation, heavily used in the Hadoop ecosystem. Parquet’s advantages lie in its support for nested data structures and efficient compression. For documentation and examples on Parquet usage, click here.
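A short sketch of writing and reading Parquet, reusing the SparkSession spark and the implicits import from the DataFrame example above; the output path is hypothetical:

```scala
// Reuses `spark` and spark.implicits._ from the DataFrame example.
val df = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

// Write the DataFrame out as Parquet, then read it back into a new DataFrame.
df.write.mode("overwrite").parquet("/tmp/people.parquet")
val parquetDf = spark.read.parquet("/tmp/people.parquet")
parquetDf.show()
```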
Spark SQL has built-in functionality to recognize a JSON schema and load the data into a Spark SQL Dataset.
For documentation on JSON structure, click here.
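A minimal sketch of loading JSON, assuming the SparkSession spark from the DataFrame example; the path is hypothetical and should point at a file with one JSON object per line:

```scala
// The path is a hypothetical example.
val jsonDf = spark.read.json("/tmp/people.json")

// Spark SQL infers the schema from the JSON documents automatically.
jsonDf.printSchema()
jsonDf.show()
```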
Hive is a Data Warehouse application from the Apache Software Foundation. Tables in Hive can be read by Apache Spark SQL for analysis, summarization and other related processing.
For documentation on Apache Hive tables and DDL examples, click here.
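A sketch of reading a Hive table, with the caveat that Hive support must be enabled when the session is built; the table name sales is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Hive support must be enabled on the session; `sales` is a hypothetical table.
val spark = SparkSession.builder()
  .appName("hive-demo")
  .enableHiveSupport()
  .getOrCreate()

// Existing Hive tables can then be queried directly with SQL.
spark.sql("SELECT COUNT(*) FROM sales").show()
```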
ORC, short for Optimized Row Columnar, is another columnar data format supported by Apache Spark. Some of ORC’s features include built-in indexes, complex types and ACID support.
For more details on ORC, click here.
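A short sketch of writing and reading ORC, reusing the SparkSession spark and implicits from the DataFrame example; the path is hypothetical:

```scala
// Reuses `spark` and spark.implicits._ from the DataFrame example.
val df = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

// Write the data in ORC format, then read it back for querying.
df.write.mode("overwrite").orc("/tmp/people.orc")
spark.read.orc("/tmp/people.orc").show()
```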