Athena SQL basics – How to write SQL against files

aws athena sql picture

Athena SQL is the query language used in Amazon Athena to interact with data in S3. Mastering Athena SQL is not a monumental task if you get the basics right. There are 5 areas you need to understand as listed below.

    1. Athena Data Types
    2. Athena SQL Operators
    3. Athena SQL Functions
      1. Aggregate Functions
      2. Date Functions
      3. String Functions
      4. Window Functions
    4. Athena SQL DDL
    5. Athena SQL DML
 

Before we get to the SQL part, lets make sure you have a good understanding of what Amazon Athena is.

Whats is Amazon Athena ?

Amazon Athena, is a web service by AWS used to analyze data in Amazon S3 using SQL.

It runs in the Cloud (or a server) and is part of the AWS Cloud Computing Platform. In many respects, it is like a SQL graphical user interface (GUI) we use against a relational database to analyze data. The main difference is Amazon Athena helps you read and analyze data in files using SQL instead of data stored in a database.

What makes Amazon Athena different ?

The key difference, unlike traditional SQL queries that run against tables in a database Amazon Athena runs against files. Athena can analyze structured, unstructured and semi-structured data stored in an S3 bucket. It can read Apache Web Logs and data formatted in JSON, ORC, Parquet, TSV, CSV and text files with custom delimiters.

Secondly, Amazon Athena does not store the data being analyzed. Athena does have the concept of databases and tables, but they store metadata regarding the file location and the structure of the data.

Thirdly, Amazon Athena is serverless, which means provisioning capacity,  scaling, patching, and OS maintenance is handled by AWS. And finally, Athena executes SQL queries in parallel, which means faster outputs.

1. Athena Data types

A Data Type defines the attributes of a value. It also classifies the SQL operations that can be performed on a value. For example, an Athena data type of DATE denotes that a value is a date, and should contain Year, Month and Day information. It also means only DATE related SQL operations can be performed on that value.

Similar to defining Data Types in a relational database, AWS Athena Data Types are defined for each column in a table. These data types form the meta data definition of the dataset, which is stored in the AWS Glue Data Catalog. 

AWS Athena has 18 distinct data types, which are listed below in alphabetical order.

  • ARRAY
  • BIGINT
  • BINARY
  • BOOLEAN
  • CHAR
  • DATE
  • DECIMAL
  • DOUBLE
  • FLOAT
  • INT
  • INTEGER
  • MAP
  • SMALLINT
  • STRING
  • STRUCT
  • TIMESTAMP
  • TINYINT
  • VARCHAR 

2. Athena SQL Operators

An Operator performs an action on one or more data values. For example, every time we add two numbers, we are performing an addition operation using the “+” operator.

Athena SQL has nine different types of Operators depending on the data type. They are Array Operators, Comparison Operators, Decimal Operators, Date and Time Operators, JSON Operators, Logical Operators, Map Operators, Mathematical Operators and String Operators.

The below table lists the Operator definitions and syntax in Athena SQL.

Less than

Greater than

<=

Less than or equal to

>=

Greater than or equal to

=

Equal

<>  or  !=

Not equal

+

Addition

Subtraction

*

Multiplication

/

Division

%

Remainder or Modulus

||

Concatenate

AND

Logical AND

OR

Logical OR

NOT

Logical NOT

3. Athena SQL Functions

A function in Athena SQL is very similar to an Operator. Operators are great for performing simple operations. Functions on the other hand performs complex computations on multiple columns simultaneously.

Athena SQL Functions can be categorized into 17 areas as below.

Let us start by looking at the four most commonly used categories: Athena Aggregate Functions, Athena String Functions, Athena Date Functions and Athena Window Functions.

3.1 Athena Aggregate Functions

In Athena, aggregate functions are used to create a condensed or summarized view of your data. They work the same as in any relational database.

The table below lists all the aggregate functions in Athena with the sql syntax.

Function
Description
approx_distinct(x )Returns the approximate number of distinct input values
approx_distinct(x, e)Returns the approximate number of distinct input values with a standard error less than e
approx_percentile(x, percentage )Returns the approximate percentile for all input values of x at the given percentage
approx_percentile(x, percentages )Returns the approximate percentile for all input values of x at each of the specified percentages
approx_percentile(x, w, percentage )Returns the approximate weighed percentile for all input values of x using the per-item weight w at the percentage p
approx_percentile(x, w, percentage, accuracy)Returns the approximate weighed percentile for all input values of x using the per-item weight w at the percentage p, with a maximum rank error of accuracy
approx_percentile(x, w, percentages)Returns the approximate weighed percentile for all input values of x using the per-item weight w at each of the given percentages specified in the array
arbitrary(x)Returns an arbitrary non-null value of x
array_agg(x)Returns an array created from the input x elements
avg(x)Returns the average (arithmetic mean) of all input values
bitwise_and_agg(x)Returns the bitwise AND of all input values in 2’s complement representation
bitwise_or_agg(x )Returns the bitwise OR of all input values in 2’s complement representation
bool_and(boolean )Returns TRUE if every input value is TRUE, otherwise FALSE
bool_or(boolean)Returns TRUE if any input value is TRUE, otherwise FALSE
checksum(x )Returns an order-insensitive checksum of the given values
corr(y, x)Returns correlation coefficient of input values
count(* )Returns the number of input rows
count(x )Returns the number of non-null input values
count_if(x )Returns the number of TRUE input values
covar_pop(y, x)Returns the population covariance of input values
covar_samp(y, x)Returns the sample covariance of input values
every(boolean)Alias for bool_and() function
geometric_mean(x )Returns the geometric mean of all input values
histogram(x)Returns a map containing the count of the number of times each input value occurs
kurtosis(x)Returns the excess kurtosis of all input values
map_agg(key, value)Returns a map created from the input key / value pairs
map_union(x<K, V>)Returns the union of all the input maps
max(x)Returns the maximum value of all input values
max(x, n)Returns n largest values of all input values of x
max_by(x, y)Returns the value of x associated with the maximum value of y over all input values
max_by(x, y, n )Returns n values of x associated with the n largest of all input values of y in descending order of y
min(x)Returns the minimum value of all input values
min(x, n )Returns n smallest values of all input values of x
min_by(x, y )Returns the value of x associated with the minimum value of y over all input values
min_by(x, y, n )Returns n values of x associated with the n smallest of all input values of y in ascending order of y
multimap_agg(key, value)Returns a multimap created from the input key / value pairs. Each key can be associated with multiple values
numeric_histogram(buckets, value)Computes an approximate histogram with up to buckets number of buckets for all values
numeric_histogram(buckets, value, weight )Computes an approximate histogram with up to buckets number of buckets for all values with a per-item weight of weight
regr_intercept(y, x)Returns linear regression intercept of input values. y is the dependent value. x is the independent value
regr_slope(y, x)Returns linear regression slope of input values. y is the dependent value. x is the independent value
skewness(x )Returns the skewness of all input values
stddev(x)Alias for stddev_samp() function
stddev_pop(x )Returns the population standard deviation of all input values
stddev_samp(x )Returns the sample standard deviation of all input values
sum(x)Returns the sum of all input values
var_pop(x)Returns the population variance of all input values
var_samp(x)Returns the sample variance of all input values
variance(x )Alias for var_samp() function

3.2 Athena String Functions

Similar to string functions in a database, you can use Athena String functions to manipulate data stored as character strings. 

Since Athena is based on Presto, Athena String functions are a one to one match between the two.  The table below lists string functions, and the Athena SQL syntax for it.

Function
Description
chr(n )Returns the Unicode code point n as a single character string
codepoint(string)Returns the Unicode code point of the only character of string
concat(string1, …, stringN)Returns the concatenation of string1, string2, …, stringN
from_utf8(binary)Decodes a UTF-8 encoded string from binary
from_utf8(binary, replace )Decodes a UTF-8 encoded string from binary
length(string)This function returns the length of string in Athena
levenshtein_distance(string1, string2)Returns the Levenshtein edit distance of string1 and string2
lower(string)Converts string to lowercase
lpad(string, size, padstring)Left pads string to size characters with padstring
ltrim(string)Removes leading whitespace from string
normalize(string)Transforms string with NFC normalization form
normalize(string, form)Transforms string with the specified normalization form
position(substring IN string)Returns the starting position of the first instance of substring in string
replace(string, search )Removes all instances of search from string
replace(string, search, replace )Replaces all instances of search with replace in string
reverse(string)Returns string with the characters in reverse order
rpad(string, size, padstring)Right pads string to size characters with padstring
rtrim(string )Removes trailing whitespace from string
split(string, delimiter )Splits string on delimiter and returns an array
split(string, delimiter, limit)Splits string on delimiter and returns an array of size at most limit
split_part(string, delimiter, index)Splits string on delimiter and returns the field index
split_to_map(string, entryDelimiter, keyValueDelimiter)Splits string by entryDelimiter and keyValueDelimiter and returns a map. entryDelimiter splits string into key-value pairs
strpos(string, substring )Returns the starting position of the first instance of substring in string
substr(string, start )This Athena substring function returns a subset of a given string starting at position start
substr(string, start, length )If you want a specific number of characters (length) from a starting position, then use this alternate version of the Athena substring function
to_utf8(string)Encodes string into a UTF-8 varbinary representation
trim(string)Removes leading and trailing whitespace from string
upper(string )Converts string to uppercase

3.3 Athena Date Functions

Athena Date Functions have some quirks you need to be familiar with.  

  • Date Functions listed without parenthesis below do not require them
  • The Unit parameter below can range from time to year. The valid unit values and formats are : millisecond, second, minute, hour, day, week, month, quarter, year.
  • Athena Date and time format specifiers are listed in the table below.

%a

Abbreviated weekday name (Sun .. Sat)

%I

Hour (01 .. 12)

%r

Time, 12-hour

%b

Abbreviated month name (Jan .. Dec)

%i

Minutes, numeric (00 .. 59)

%s

Seconds (00 .. 59)

%c

Month, numeric (0 .. 12)

%j

Day of year (001 .. 366)

%T

Time, 24-hour

%d

Day of the month, numeric (00 .. 31)

%k

Hour (0 .. 23)

%v

Week (01 .. 53)

%e

Day of the month, numeric (0 .. 31)

%l

Hour (1 .. 12)

%W

Weekday name

%f

Fraction of second

%M

Month

%Y

Year, numeric, four digits

%H

Hour (00 .. 23)

%m

Month in numeric

%y

Year, numeric (two digits) [2]

%h

Hour (01 .. 12)

%p

AM or PM

%r

Time, 12-hour

Function

Description

current_date

Returns the current date as of the start of the query

current_time

Returns the current time as of the start of the query

current_timestamp

Returns the current timestamp as of the start of the query

current_timezone( )

Returns the current time zone

date_add(unit, value, timestamp)

Adds an interval value of type unit to timestamp

date_diff(unit, timestamp1, timestamp2)

Returns timestamp2 – timestamp1 expressed in terms of unit

date_format(timestamp, format)

Formats timestamp as a string using format

date_parse(string, format)

Parses string into a timestamp using format

date_trunc(unit, x)

Returns x truncated to unit

day(x)

Returns the day of the month from x

day_of_month(x)

This is an alias for day()

day_of_week(x)

Returns the ISO day of the week from x

day_of_year(x)

Returns the day of the year from x

extract(field FROM x)

Returns field from x where field can be DAY, DAY_OF_MONTH, DAY_OF_WEEK, DAY_OF_YEAR, HOUR, MINUTE, MONTH, QUARTER, SECOND, TIMEZONE_HOUR, TIMEZONE_MINUTE, WEEK, YEAR, YEAR_OF_WEEK

format_datetime(timestamp, format)

Formats timestamp as a string using format

from_iso8601_date(string)

Parses the ISO 8601 formatted string into a date

from_iso8601_timestamp(string)

Parses the ISO 8601 formatted string into a timestamp with time zone

from_unixtime(unixtime)

Returns the UNIX timestamp unixtime as a timestamp

from_unixtime(unixtime, hours, minutes)

Returns the UNIX timestamp unixtime as a timestamp with time zone using hours and minutes

from_unixtime(unixtime, string)

Returns the UNIX timestamp unixtime as a timestamp

hour(x)

Returns the hour of the day from x

localtime

Returns the current time as of the start of the query

localtimestamp

Returns the current timestamp as of the start of the query

minute(x)

Returns the minute of the hour from x

month(x)

Returns the month of the year from x

now()

This is an alias for current_timestamp

parse_datetime(string, format)

Parses string into a timestamp with time zone using format

quarter(x)

Returns the quarter of the year from x

3.4 Athena Window Functions

Type

Function

Description

Aggregate Function

* Refer Aggregate Functions tab

Agrregate Functions can be used as window functions by adding the OVER clause

Ranking Function

cume_dist( )

Returns the cumulative distribution of a value in a group of values

Ranking Function

dense_rank()

Returns the rank of a value in a group of values

Ranking Function

ntile(n)

Divides the rows for each window partition into n buckets ranging from 1 to at most n

Ranking Function

percent_rank()

Returns the percentage ranking of a value in group of values

Ranking Function

rank( )

Returns the rank of a value in a group of values

Ranking Function

row_number( )

Returns a unique, sequential number for each row, starting with one

Value Function

first_value(x)

Returns the first value of the window

Value Function

last_value(x)

Returns the last value of the window

Value Function

nth_value(x, offset)

Returns the value at the specified offset from beginning the window

Value Function

lead(x[, offset[, default_value]] )

Returns the value at offset rows after the current row in the window

Value Function

lag(x[, offset[, default_value]])

Returns the value at offset rows before the current row in the window starting at 0

4. Athena SQL DDL Clauses

DDL stands for Data Definition Language, and is a part of the Structured Query Language (SQL) class. DDL statements are generally used to create or modify the structural metadata of the actual data. In Amazon Athena, objects such as Databases, Schemas, Tables, Views and Partitions are part of DDL.

Athena SQL DDL is based on Hive DDL, so if you have used the Hadoop framework, these DDL statements and syntax will be quite familiar. Key point to note, not all Hive DDL statements are supported in Amazon Athena SQL. This is because data in Athena is stored externally in S3, and not in a database. For example, DDL statements related to INDEXES, ROLES, LOCKS, IMPORT, EXPORT and COMMIT are not supported in Athena SQL.

The table below lists the 24 DDL statements supported in Athena SQL. For details on Athena DDL syntax, usage and parameters click here.

1.      ALTER DATABASE SET DBPROPERTIES

10.   CREATE TABLE

19.   SHOW COLUMNS

2.      ALTER TABLE ADD COLUMNS

11.   CREATE TABLE AS

20.   SHOW CREATE TABLE

3.      ALTER TABLE ADD PARTITION

12.   CREATE VIEW

21.   SHOW CREATE VIEW

4.      ALTER TABLE DROP PARTITION

13.   DESCRIBE TABLE

22.   SHOW DATABASES

5.      ALTER TABLE RENAME PARTITION

14.   DESCRIBE VIEW

23.   SHOW PARTITIONS

6.      ALTER TABLE REPLACE COLUMNS

15.   DROP DATABASE

24.   SHOW TABLES

7.      ALTER TABLE SET LOCATION

16.   DROP TABLE

25.   SHOW TBLPROPERTIES

8.      ALTER TABLE SET TBLPROPERTIES

17.   DROP VIEW

26.   SHOW VIEWS

9.      CREATE DATABASE

18.   MSCK REPAIR TABLE

 

5. Athena SQL DML Clauses

DML stands for Data Manipulation Language and is a part of the Structured Query Language (SQL) class. In a relational database, every time a SELECT, INSERT, DELETE or UPDATE statement is executed you are manipulating data and thereby executing a DML statement.

When an Athena SQL DML statement is executed it manipulates data stored in Amazon S3 (Simple Storage Service); therefore, support for DML statements like INSERT, DELETE, UPDATE and MERGE does not exist in Athena SQL. Currently, the only Athena SQL DML supported is the SELECT statement.

Helpful links

AWS Athena Documentation

This is the latest user guide version of AWS Athena Documentation

AWS Athena CLI

AWS Athena CLI (Command Line Interface) allows users to interact with Athena using shell commands from Windows PowerShell, Linux or remotely.

Service Offerings by Obstkel

Get to know the AWS Cloud Services offered by Obstkel

Interested in our services ?

email us at : info@obstkel.com

Copyright 2021 © OBSTKEL LLC. All rights Reserved