Spark Scala Examples: Your baby steps to Big Data

September 10, 2021

This post elaborates on Apache Spark transformation and action operations by providing a step-by-step walkthrough of Spark examples in Scala.

Before you dive into these examples, make sure you know some of the basic Apache Spark Concepts.

The examples below are in no particular sequence and form the first part of our five-part Spark Scala examples post. They assume you have an Apache Hadoop ecosystem set up and some sample data files created.

If you do not have Apache Hadoop installed, follow this link to download and install.

Let us start by looking at 4 Spark examples.

Spark Context
Spark Parallelize
Spark read from hdfs
Spark Filter

1. Spark Context Example - How to run Spark

If you are struggling to figure out how to run a Spark Scala program, this section gets straight to the point.

The first step to writing an Apache Spark application (program) is to invoke the program, which includes initializing the configuration variables and accessing the cluster. SparkContext is the gateway to accessing Spark functionality.

For beginners, the best and simplest option is to use the Scala shell, which automatically creates a SparkContext (available as the variable sc). Below are 4 Spark examples on how to connect and run Spark.

Method 1:

To start the Scala shell, run this from your Spark installation directory: "./bin/spark-shell"

Method 2:

To start the shell and run Spark locally without parallelism (a single worker thread): "./bin/spark-shell --master local"

Method 3:

To start the shell and run Spark locally in parallel mode, setting the parallelism level to the number of cores on your machine: "./bin/spark-shell --master local[*]"

Method 4:

To start the shell and connect to YARN in client mode: "./bin/spark-shell --master yarn --deploy-mode client" (older Spark releases also accepted "--master yarn-client", which has been deprecated since Spark 2.0)
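Outside the shell, a standalone program has to create its own SparkContext. The following is a minimal sketch of that step, assuming the Spark core library is on the classpath. The object name SparkContextSketch, the helper buildContext and the application name are illustrative and not part of the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextSketch {
  // Hypothetical helper: builds a SparkContext for the given master URL,
  // for example "local", "local[*]" or "yarn".
  def buildContext(master: String): SparkContext = {
    val conf = new SparkConf()
      .setAppName("spark-context-sketch") // illustrative application name
      .setMaster(master)
    new SparkContext(conf)
  }

  def main(args: Array[String]): Unit = {
    // local[*] mirrors Method 3: one worker thread per core.
    val sc = buildContext("local[*]")
    try {
      println(s"master = ${sc.master}")
      println(s"default parallelism = ${sc.defaultParallelism}")
    } finally {
      sc.stop() // always release the context when done
    }
  }
}
```

In the Scala shell none of this is needed, because the shell has already created the context for you.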

2. Spark Parallelize example

Before we look at a couple of examples, it is important to understand what parallelize means in Spark.

Parallelize is a method that creates an RDD from an existing collection in your program and splits it into partitions so the data can be processed in parallel. The syntax for parallelizing a collection is sc.parallelize(data, p), where ‘p’ represents the number of partitions.

For the most part the number of partitions is set automatically based on the cluster, so you do not need to worry about it.
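To see the effect of the partition argument, here is a small sketch that parallelizes the numbers 1 to 100 with an explicit partition count and reports how many elements land in each partition. It assumes a local Spark installation; the object name ParallelizeSketch and the helper partitionSizes are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeSketch {
  // Hypothetical helper: parallelizes 1 to 100 into `p` partitions and
  // returns the number of elements held by each partition.
  def partitionSizes(sc: SparkContext, p: Int): Array[Int] = {
    val rdd = sc.parallelize(1 to 100, p)
    // glom() turns every partition into a single array of its elements.
    rdd.glom().map(_.length).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("parallelize-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Without a second argument Spark picks the partition count itself.
      println(sc.parallelize(1 to 100).getNumPartitions)
      // With an explicit count, the data is split into exactly 4 partitions.
      println(partitionSizes(sc, 4).mkString(","))
    } finally {
      sc.stop()
    }
  }
}
```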

In the Spark Scala examples below, we look at parallelizing a sample set of numbers, a List and an Array.

Method 1:

To create an RDD by applying the Parallelize method to a sample set of numbers, say 1 through 100:

scala > val parSeqRDD = sc.parallelize(1 to 100)

Method 2:

To create an RDD from a Scala List using the Parallelize method:

scala > val parListRDD = sc.parallelize(List("pen","laptop","pencil","mouse"))

Note: To view a sample set of data loaded in the RDD, type this at the command line: parListRDD.take(3).foreach(println)

Method 3:

To create an RDD from an Array using the Parallelize method:

scala > val parNumArrayRDD = sc.parallelize(Array(1,2,3,4,5))

Note: To count the number of elements created in the RDD, type this at the command line: parNumArrayRDD.count()

3. Spark read from hdfs example

Creating an RDD in Apache Spark requires data. In Spark, there are two ways to acquire this data: parallelized collections and external datasets. Data that lives outside your program is classified as an external dataset and includes flat files, binary files, sequence files, HDFS files, HBase, Cassandra, or data in any other format.

The 3 Spark examples listed below show you the most common ways to read data from HDFS.
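As a minimal sketch, three common reads look like this: a single text file, a set of files matched by a wildcard, and a directory read as whole files. The NameNode address hdfs://namenode:8020 and every path below are hypothetical placeholders, and the object and helper names are illustrative; substitute the locations of your own sample data files.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object HdfsReadSketch {
  // Reads a file (or a wildcard pattern); each RDD element is one line of text.
  def readLines(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)

  // Reads every file in a directory; each element is (file path, file content).
  def readWholeFiles(sc: SparkContext, dir: String): RDD[(String, String)] =
    sc.wholeTextFiles(dir)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hdfs-read-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical HDFS locations, not taken from the original post.
      val base = "hdfs://namenode:8020/user/demo"

      // 1. A single text file, one element per line.
      readLines(sc, s"$base/sample.txt").take(3).foreach(println)

      // 2. Several files at once through a wildcard.
      println(readLines(sc, s"$base/logs/*.txt").count())

      // 3. A directory of small files, keyed by file path.
      readWholeFiles(sc, s"$base/logs").keys.take(3).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The same calls accept local paths with the file:// scheme, which is handy for trying them out before pointing at a cluster.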

4. Spark filter example

A Spark filter is a transformation operation which takes an existing dataset, applies a predicate function to each element, and returns a new dataset containing only the elements for which the predicate returns true.

Conceptually, this is similar to applying a column filter in an Excel spreadsheet, or a “where” clause in a SQL statement.

Listed below are a few Spark Scala examples on using a filter operation.

Prerequisite: Create an RDD with the sample data as shown below.
scala > val sampleColorRDD = sc.parallelize(List("red", "blue", "green", "purple", "blue", "yellow"))

Method 1:

To apply a filter on sampleColorRDD and only select the color "blue" from the RDD dataset.

scala > val filterBlueRDD = sampleColorRDD.filter(color => color == "blue")

Method 2:

To apply a filter on sampleColorRDD and select all colors other than the color "blue" from the RDD dataset.

scala > val filterNotBlueRDD = sampleColorRDD.filter(color => color != "blue")

Method 3:

To apply a filter on sampleColorRDD and select multiple colors, "red" and "blue", from the RDD dataset.

scala > val filterMultipleRDD = sampleColorRDD.filter(color => (color == "blue" || color == "red"))

Note 1: To perform a count() action on the filter output and validate, type the below at the command line:

scala > filterMultipleRDD.count()

Note 2: Once you get familiar with the basics, you could minimize your code by combining transformation and action operations into a single line like this:
scala > sampleColorRDD.filter(color => (color == "blue" || color == "red")).count()
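To check the three filters end to end outside the shell, the sketch below runs each predicate against the same color list and collects the surviving elements. It assumes a local Spark installation; the object name FilterSketch and the helper keep are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterSketch {
  // Same sample data as sampleColorRDD above.
  val colors = List("red", "blue", "green", "purple", "blue", "yellow")

  // Hypothetical helper: applies the predicate as a filter transformation,
  // then runs the collect() action to bring the result back to the driver.
  def keep(sc: SparkContext, predicate: String => Boolean): Array[String] =
    sc.parallelize(colors).filter(predicate).collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("filter-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(keep(sc, _ == "blue").mkString(", "))                   // blue, blue
      println(keep(sc, _ != "blue").mkString(", "))                   // red, green, purple, yellow
      println(keep(sc, c => c == "blue" || c == "red").mkString(", ")) // red, blue, blue
    } finally {
      sc.stop()
    }
  }
}
```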
