
What is Apache Spark – explained using mind maps


If you are a beginner to Apache Spark, this Spark tutorial on the fundamentals is for you. Mastering the basics of Apache Spark will help you build a strong foundation before you get to the more complex concepts.

Often these concepts come intertwined with new terminology. Associating the terms with each other in a hierarchical manner to build relationships helps you assimilate the information more effectively, and in a shorter time frame.

The diagram below is a mind map of the key Apache Spark concepts. A mind map is a technique used to organize information visually, dramatically increasing the brain’s ability to retain information. For more information on mind mapping, read this.

 

What is Apache Spark?

Mind map: Apache Spark modules and their uses

Apache Spark is a data processing engine. You can use Spark to process large volumes of structured and unstructured data for Analytics, Data Science, Data Engineering and Machine Learning initiatives. 


However, it’s important to note that Apache Spark is not a database or a distributed file system. In other words, it does not store your data. Spark provides raw processing power to crunch data and extract meaningful information. Spark provides four modules for this:

 

  • GraphX: Used to process complex relationships between data using Graph theory concepts of Nodes, Edges and Properties.

     

  • MLlib: An extensive Machine Learning library and tools to build smarter apps and prediction systems.

     

  • Spark Streaming: For processing real-time data for analytics.

     

  • Spark SQL: Used to build interactive queries and run batch processing on structured data.


Apache Spark is not limited to a single programming language. You have the flexibility to use Java, Python, Scala, R or SQL to build your programs. Some minor limitations do apply. 


If you are just starting out and do not have any programming experience, I highly recommend starting with SQL and Scala. Those are the two most popular programming languages, with the best bang for your buck!

What is Apache Spark RDD?

RDD in Apache Spark stands for Resilient Distributed Dataset. It is nothing more than your data file converted into an internal Spark format, then partitioned and distributed across multiple computers (nodes).

Resilient Distributed Dataset (RDD) sounds intimidating. Keep things simple. Just think of it as data in Apache Spark internal format.

The diagram below depicts the key components of Apache Spark RDD using mind maps. 

Mind map: key components of Apache Spark RDD

1. External Datasets

In Apache Spark, an RDD can be created in one of two ways. External Datasets is one such method, in which a file external to Apache Spark (on the local file system, in HDFS, or in a NoSQL database) is used to create an RDD.
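As a minimal sketch in Scala from the spark-shell (where the SparkContext is available as the predefined variable sc), this reads a hypothetical local text file into an RDD:

  // Create an RDD from an external file (a local path, or an hdfs:// or s3:// URI)
  val lines = sc.textFile("data/sample.txt")

  // Each element of the RDD is one line of the file
  println(lines.count())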

2. Parallelized Collections

This is an alternate method of creating an RDD, instead of using the External Datasets method. With Parallelized Collections, sample data can be typed in or copy-pasted to create an RDD on the fly.
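A minimal sketch, again from the spark-shell, turning a small in-memory Scala collection into an RDD with parallelize:

  // Distribute a local collection across the cluster as an RDD
  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

  println(numbers.sum())   // 15.0

This is handy for experimenting with transformations and actions without needing an external file.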

3. Actions

An action is an operation performed on an RDD which returns a value. For example, the count() action returns the number of elements in the RDD as a value. For a list of Apache Spark actions, click here.
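A few common actions, sketched against the parallelized RDD from the previous example:

  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

  // Actions trigger computation and return a value to the driver
  numbers.count()     // 5
  numbers.first()     // 1
  numbers.collect()   // Array(1, 2, 3, 4, 5)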

4. Transformations

A transformation is an operation performed on an RDD which results in the creation of a new dataset. For example, a filter transformation takes an existing dataset and reduces it based on the filter criteria; the result is a new dataset.

Some of the most commonly used transformations on an RDD in Apache Spark are map, filter, distinct, groupByKey, reduceByKey and join.
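The sketch below chains two transformations and then calls an action. Transformations are lazy, so nothing is actually computed until collect() runs:

  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5, 4, 3))

  // filter and distinct are transformations: each returns a new RDD
  val evens = numbers.filter(_ % 2 == 0).distinct()

  // collect is the action that materializes the result
  evens.collect()   // e.g. Array(4, 2) -- element order may vary across partitions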

For a detailed list of Apache Spark transformations, click here.

5. Key-Value Pairs

A Key-Value pair is a representation of a data value and its attribute as a pair. The attribute often uniquely identifies the value, hence the term Key.

In Apache Spark, there are more than a handful of operations that work on Key-Value pairs; aggregateByKey, combineByKey, lookup are some examples. 
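As an illustration, the classic word count pairs each word with the value 1 and then sums the values per key with reduceByKey; a minimal spark-shell sketch:

  val words = sc.parallelize(Seq("spark", "sql", "spark", "rdd"))

  // Map each word to a (key, value) pair, then add up the values for each key
  val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

  counts.collect()   // e.g. Array((spark,2), (sql,1), (rdd,1))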

For a full list of Apache Spark Key-Value pair operations, click here.

What is Apache Spark SQL?

Spark SQL is a module in Apache Spark used for processing structured data. It provides the capability to interact with data using Structured Query Language (SQL) or the Dataset application programming interface.

The main benefit of the Spark SQL module is that it brings the familiarity of SQL for interacting with data. For a comprehensive list of  Spark SQL functions, click here. 

1. Datasets

A Dataset in the context of Apache Spark SQL is a strongly typed collection of data distributed over one or more partitions.
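For example, a small Dataset can be built from a Scala case class in the spark-shell (where the SparkSession is available as the predefined variable spark); the Person class here is purely illustrative:

  case class Person(name: String, age: Int)

  // toDS() comes from the implicits bundled with the SparkSession
  import spark.implicits._

  val people = Seq(Person("Ann", 34), Person("Bob", 28)).toDS()

  people.filter(_.age > 30).show()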

2. DataFrames

A Spark SQL DataFrame is a Dataset of rows distributed over one or more partitions, arranged into named columns. It is analogous to a table in a relational database.
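A minimal sketch that builds a DataFrame, registers it as a temporary view and queries it with SQL; the column and view names are illustrative:

  import spark.implicits._

  val people = Seq(("Ann", 34), ("Bob", 28)).toDF("name", "age")

  // Expose the DataFrame to SQL as a temporary view
  people.createOrReplaceTempView("people")

  spark.sql("SELECT name FROM people WHERE age > 30").show()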

Apache Spark Data Sources and Format

Spark SQL has the functionality to operate on data in a number of different formats. Parquet, JSON, Hive and ORC are some of these formats. Spark SQL loads data in these formats into a DataFrame, which can then be queried using SQL or processed with transformations.

1. Parquet files

Apache Parquet is a columnar storage format from the Apache Software Foundation and is heavily used in the Hadoop ecosystem. The advantage of Parquet lies in its support for nested data structures and efficient compression. For documentation and examples on Parquet usage, click here.
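A minimal sketch of writing and reading Parquet with the DataFrame API; the path is a hypothetical placeholder:

  import spark.implicits._

  val people = Seq(("Ann", 34), ("Bob", 28)).toDF("name", "age")

  // Write the DataFrame out as Parquet, then read it back
  people.write.parquet("people.parquet")
  val parquetDF = spark.read.parquet("people.parquet")
  parquetDF.show()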

2. JSON Datasets

JSON, short for JavaScript Object Notation, is a format recognized for its ease of readability by humans and machines. At its core, JSON consists of name/value pair collections and ordered lists of values.

Spark SQL has the built in functionality to recognize a JSON schema and load the data into a Spark SQL Dataset.
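For example, given a hypothetical people.json containing one JSON object per line, Spark SQL infers the schema automatically:

  // Each line of the file is expected to be a separate JSON object, e.g. {"name":"Ann","age":34}
  val peopleDF = spark.read.json("people.json")

  peopleDF.printSchema()
  peopleDF.show()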

For documentation on JSON structure, click here.

3. Hive Tables

Hive is a Data Warehouse application by the Apache Software Foundation. The tables in Hive can be read by Apache Spark SQL for analysis, summarization and other related processing.
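A minimal sketch, assuming a Hive metastore is configured and Hive support is enabled when the SparkSession is created; the table name is hypothetical:

  import org.apache.spark.sql.SparkSession

  // Hive support must be enabled on the session to read Hive tables
  val spark = SparkSession.builder()
    .appName("HiveExample")
    .enableHiveSupport()
    .getOrCreate()

  // Query an existing Hive table with plain SQL
  spark.sql("SELECT * FROM sales LIMIT 10").show()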

For documentation on Apache Hive tables and DDL examples, click here.

4. ORC Files

ORC, short for Optimized Row Columnar, is another columnar data format supported by Apache Spark. Some of ORC’s features include built-in indexes, complex types and ACID support.
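Reading and writing ORC looks just like Parquet in the DataFrame API; a minimal sketch with hypothetical paths:

  import spark.implicits._

  val people = Seq(("Ann", 34), ("Bob", 28)).toDF("name", "age")

  // Write the DataFrame out as ORC, then read it back
  people.write.orc("people.orc")
  val orcDF = spark.read.orc("people.orc")
  orcDF.show()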

For more details on ORC, click here.

Spark SQL links
Programming Guide

The official Apache Spark v2.3.2 Spark SQL Programming Guide with everything you need to know in a single place.

Hive Tutorial

Learn how Apache Hive fits into the Hadoop ecosystem with this Hive Tutorial for Beginners on guru99.com.
