Athena SQL basics – How to write SQL against files

September 14, 2021

Athena SQL is the query language used in Amazon Athena to interact with data in S3. Mastering Athena SQL is not a monumental task if you get the basics right. There are 5 areas you need to understand as listed below.

Before we get to the SQL part, lets make sure you have a good understanding of what Amazon Athena is.

What is Amazon Athena?

Amazon Athena is a web service by AWS used to analyze data in Amazon S3 using SQL.
It runs in the Cloud (or a server) and is part of the AWS Cloud Computing Platform.

In many respects, it is like a SQL graphical user interface (GUI) we use against a relational database to analyze data. The main difference is Amazon Athena helps you read and analyze data in files using SQL instead of data stored in a database.

What makes Amazon Athena different?

The key difference, unlike traditional SQL queries that run against tables in a database Amazon Athena runs against files. Athena can analyze structured, unstructured and semi-structured data stored in an S3 bucket. It can read Apache Web Logs and data formatted in JSON, ORC, Parquet, TSV, CSV and text files with custom delimiters.

Secondly, Amazon Athena does not store the data being analyzed. Athena does have the concept of databases and tables, but they store metadata regarding the file location and the structure of the data.

Thirdly, Amazon Athena is serverless, which means provisioning capacity, scaling, patching, and OS maintenance is handled by AWS. And finally, Athena executes SQL queries in parallel, which means faster outputs.

1. Athena Data types

A Data Type defines the attributes of a value. It also classifies the SQL operations that can be performed on a value. For example, an Athena data type of DATE denotes that a value is a date, and should contain Year, Month and Day information. It also means only DATE related SQL operations can be performed on that value.

Similar to defining Data Types in a relational database, AWS Athena Data Types are defined for each column in a table. These data types form the meta data definition of the dataset, which is stored in the AWS Glue Data Catalog.

AWS Athena has 18 distinct data types, which are listed below in alphabetical order.

ARRAY
BIGINT
BINARY
BOOLEAN
CHAR
DATE
DECIMAL
DOUBLE
FLOAT
INT
INTEGER
MAP
SMALLINT
STRING
STRUCT
TIMESTAMP
TINYINT
VARCHAR

2. Athena SQL Operators

An Operator performs an action on one or more data values. For example, every time we add two numbers, we are performing an addition operation using the “+” operator.

Athena SQL has 9 different types of Operators depending on the data type. They are Array Operators, Comparison Operators, Decimal Operators, Date and Time Operators, JSON Operators, Logical Operators, Map Operators, Mathematical Operators and String Operators.

The below table lists the Operator definitions and syntax in Athena SQL.

<	Less than
>	Greater than
<=	Less than or equal to
>=	Greater than or equal to
=	Equal
<> or !=	Not equal
+	Addition
–	Subtraction

*	Multiplication
/	Division
%	Remainder or Modulus
\|\|	Concatenate
AND	Logical AND
OR	Logical OR
NOT	Logical NOT

3. Athena SQL Functions

A function in Athena SQL is very similar to an Operator. Operators are great for performing simple operations. Functions on the other hand performs complex computations on multiple columns simultaneously.

Athena SQL Functions are broken down into 24 areas, which is way beyond the scope of this post. To keep things relevant, we will be focusing on the commonly used function categories.

Athena Aggregate Functions.
Athena String Functions.
Athena Date Functions.
Athena Window Functions.

3.1 Athena Aggregate Functions

In Athena, aggregate functions are used to create a condensed or summarized view of your data. They work the same as in any relational database.

The table below lists all the aggregate functions in Athena with the sql syntax.

Function	Description
approx_distinct(x )	Returns the approximate number of distinct input values
approx_distinct(x, e)	Returns the approximate number of distinct input values with a standard error less than e
approx_percentile(x, percentage )	Returns the approximate percentile for all input values of x at the given percentage
approx_percentile(x, percentages )	Returns the approximate percentile for all input values of x at each of the specified percentages
approx_percentile(x, w, percentage )	Returns the approximate weighed percentile for all input values of x using the per-item weight w at the percentage p
approx_percentile(x, w, percentage, accuracy)	Returns the approximate weighed percentile for all input values of x using the per-item weight w at the percentage p, with a maximum rank error of accuracy
approx_percentile(x, w, percentages)	Returns the approximate weighed percentile for all input values of x using the per-item weight w at each of the given percentages specified in the array
arbitrary(x)	Returns an arbitrary non-null value of x
array_agg(x)	Returns an array created from the input x elements
avg(x)	Returns the average (arithmetic mean) of all input values
bitwise_and_agg(x)	Returns the bitwise AND of all input values in 2’s complement representation
bitwise_or_agg(x )	Returns the bitwise OR of all input values in 2’s complement representation
bool_and(boolean )	Returns TRUE if every input value is TRUE, otherwise FALSE
bool_or(boolean)	Returns TRUE if any input value is TRUE, otherwise FALSE
checksum(x )	Returns an order-insensitive checksum of the given values
corr(y, x)	Returns correlation coefficient of input values
count(* )	Returns the number of input rows
count(x )	Returns the number of non-null input values
count_if(x )	Returns the number of TRUE input values
covar_pop(y, x)	Returns the population covariance of input values
covar_samp(y, x)	Returns the sample covariance of input values
every(boolean)	Alias for bool_and() function
geometric_mean(x )	Returns the geometric mean of all input values
histogram(x)	Returns a map containing the count of the number of times each input value occurs
kurtosis(x)	Returns the excess kurtosis of all input values
map_agg(key, value)	Returns a map created from the input key / value pairs
map_union(x<K, V>)	Returns the union of all the input maps
max(x)	Returns the maximum value of all input values
max(x, n)	Returns n largest values of all input values of x
max_by(x, y)	Returns the value of x associated with the maximum value of y over all input values
max_by(x, y, n )	Returns n values of x associated with the n largest of all input values of y in descending order of y
min(x)	Returns the minimum value of all input values
min(x, n )	Returns n smallest values of all input values of x
min_by(x, y )	Returns the value of x associated with the minimum value of y over all input values
min_by(x, y, n )	Returns n values of x associated with the n smallest of all input values of y in ascending order of y
multimap_agg(key, value)	Returns a multimap created from the input key / value pairs. Each key can be associated with multiple values
numeric_histogram(buckets, value)	Computes an approximate histogram with up to buckets number of buckets for all values
numeric_histogram(buckets, value, weight )	Computes an approximate histogram with up to buckets number of buckets for all values with a per-item weight of weight
regr_intercept(y, x)	Returns linear regression intercept of input values. y is the dependent value. x is the independent value
regr_slope(y, x)	Returns linear regression slope of input values. y is the dependent value. x is the independent value
skewness(x )	Returns the skewness of all input values
stddev(x)	Alias for stddev_samp() function
stddev_pop(x )	Returns the population standard deviation of all input values
stddev_samp(x )	Returns the sample standard deviation of all input values
sum(x)	Returns the sum of all input values
var_pop(x)	Returns the population variance of all input values
var_samp(x)	Returns the sample variance of all input values
variance(x )	Alias for var_samp() function

3.2 Athena String Functions

Similar to string functions in a database, you can use Athena String functions to manipulate data stored as character strings.

Since Athena is based on Presto, Athena String functions are a one-to-one match between the two. The table below lists string functions, and the Athena SQL syntax for it.

Function	Description
chr(n )	Returns the Unicode code point n as a single character string
codepoint(string)	Returns the Unicode code point of the only character of string
concat(string1, …, stringN)	Returns the concatenation of string1, string2, …, stringN
from_utf8(binary)	Decodes a UTF-8 encoded string from binary
from_utf8(binary, replace )	Decodes a UTF-8 encoded string from binary
length(string)	This function returns the length of string in Athena
levenshtein_distance(string1, string2)	Returns the Levenshtein edit distance of string1 and string2
lower(string)	Converts string to lowercase
lpad(string, size, padstring)	Left pads string to size characters with padstring
ltrim(string)	Removes leading whitespace from string
normalize(string)	Transforms string with NFC normalization form
normalize(string, form)	Transforms string with the specified normalization form
position(substring IN string)	Returns the starting position of the first instance of substring in string
replace(string, search )	Removes all instances of search from string
replace(string, search, replace )	Replaces all instances of search with replace in string
reverse(string)	Returns string with the characters in reverse order
rpad(string, size, padstring)	Right pads string to size characters with padstring
rtrim(string )	Removes trailing whitespace from string
split(string, delimiter )	Splits string on delimiter and returns an array
split(string, delimiter, limit)	Splits string on delimiter and returns an array of size at most limit
split_part(string, delimiter, index)	Splits string on delimiter and returns the field index
split_to_map(string, entryDelimiter, keyValueDelimiter)	Splits string by entryDelimiter and keyValueDelimiter and returns a map. entryDelimiter splits string into key-value pairs
strpos(string, substring )	Returns the starting position of the first instance of substring in string
substr(string, start )	This Athena substring function returns a subset of a given string starting at position start
substr(string, start, length )	If you want a specific number of characters (length) from a starting position, then use this alternate version of the Athena substring function
to_utf8(string)	Encodes string into a UTF-8 varbinary representation
trim(string)	Removes leading and trailing whitespace from string
upper(string )	Converts string to uppercase

3.3 Athena Date Functions

Athena Date Functions have some quirks you need to be familiar with.

Date Functions listed without parenthesis below do not require them.
The Unit parameter below can range from time to year. The valid unit values and formats are millisecond, second, minute, hour, day, week, month, quarter, year.
Athena Date and time format specifiers are listed in the table below.

%a	Abbreviated weekday name (Sun .. Sat)	%I	Hour (01 .. 12)	%r	Time, 12-hour
%b	Abbreviated month name (Jan .. Dec)	%i	Minutes, numeric (00 .. 59)	%s	Seconds (00 .. 59)
%c	Month, numeric (0 .. 12)	%j	Day of year (001 .. 366)	%T	Time, 24-hour
%d	Day of the month, numeric (00 .. 31)	%k	Hour (0 .. 23)	%v	Week (01 .. 53)
%e	Day of the month, numeric (0 .. 31)	%l	Hour (1 .. 12)	%W	Weekday name
%f	Fraction of second	%M	Month	%Y	Year, numeric, four digits
%H	Hour (00 .. 23)	%m	Month in numeric	%y	Year, numeric (two digits) [2]
%h	Hour (01 .. 12)	%p	AM or PM	%r	Time, 12-hour

Function	Description
current_date	Returns the current date as of the start of the query
current_time	Returns the current time as of the start of the query
current_timestamp	Returns the current timestamp as of the start of the query
current_timezone( )	Returns the current time zone
date_add(unit, value, timestamp)	Adds an interval value of type unit to timestamp
date_diff(unit, timestamp1, timestamp2)	Returns timestamp2 – timestamp1 expressed in terms of unit
date_format(timestamp, format)	Formats timestamp as a string using format
date_parse(string, format)	Parses string into a timestamp using format
date_trunc(unit, x)	Returns x truncated to unit
day(x)	Returns the day of the month from x
day_of_month(x)	This is an alias for day()
day_of_week(x)	Returns the ISO day of the week from x
day_of_year(x)	Returns the day of the year from x
extract(field FROM x)	Returns field from x where field can be DAY, DAY_OF_MONTH, DAY_OF_WEEK, DAY_OF_YEAR, HOUR, MINUTE, MONTH, QUARTER, SECOND, TIMEZONE_HOUR, TIMEZONE_MINUTE, WEEK, YEAR, YEAR_OF_WEEK
format_datetime(timestamp, format)	Formats timestamp as a string using format
from_iso8601_date(string)	Parses the ISO 8601 formatted string into a date
from_iso8601_timestamp(string)	Parses the ISO 8601 formatted string into a timestamp with time zone
from_unixtime(unixtime)	Returns the UNIX timestamp unixtime as a timestamp
from_unixtime(unixtime, hours, minutes)	Returns the UNIX timestamp unixtime as a timestamp with time zone using hours and minutes
from_unixtime(unixtime, string)	Returns the UNIX timestamp unixtime as a timestamp
hour(x)	Returns the hour of the day from x
localtime	Returns the current time as of the start of the query
localtimestamp	Returns the current timestamp as of the start of the query
minute(x)	Returns the minute of the hour from x
month(x)	Returns the month of the year from x
now()	This is an alias for current_timestamp
parse_datetime(string, format)	Parses string into a timestamp with time zone using format
quarter(x)	Returns the quarter of the year from x

3.4 Athena Window Functions

Type	Function	Description
Aggregate Function	* *Refer Aggregate Functions tab*	Agrregate Functions can be used as window functions by adding the OVER clause
Ranking Function	cume_dist( )	Returns the cumulative distribution of a value in a group of values
Ranking Function	dense_rank()	Returns the rank of a value in a group of values
Ranking Function	ntile(n)	Divides the rows for each window partition into n buckets ranging from 1 to at most n
Ranking Function	percent_rank()	Returns the percentage ranking of a value in group of values
Ranking Function	rank( )	Returns the rank of a value in a group of values
Ranking Function	row_number( )	Returns a unique, sequential number for each row, starting with one
Value Function	first_value(x)	Returns the first value of the window
Value Function	last_value(x)	Returns the last value of the window
Value Function	nth_value(x, offset)	Returns the value at the specified offset from beginning the window
Value Function	lead(x[, offset[, default_value]] )	Returns the value at offset rows after the current row in the window
Value Function	lag(x[, offset[, default_value]])	Returns the value at offset rows before the current row in the window starting at 0

4. Athena SQL DDL Clauses

DDL stands for Data Definition Language and is a part of the Structured Query Language (SQL) class. DDL statements are generally used to create or modify the structural metadata of the actual data. In Amazon Athena, objects such as Databases, Schemas, Tables, Views and Partitions are part of DDL.

Athena SQL DDL is based on Hive DDL, so if you have used the Hadoop framework, these DDL statements and syntax will be quite familiar.

Key point to note, not all Hive DDL statements are supported in Amazon Athena SQL. This is because data in Athena is stored externally in S3, and not in a database. For instance, DDL statements related to INDEXES, ROLES, LOCKS, IMPORT, EXPORT and COMMIT are not supported in Athena SQL.

The table below lists the 24 DDL statements supported in Athena SQL. For details on Athena DDL syntax, usage and parameters click here.

1. ALTER DATABASE SET DBPROPERTIES	10. CREATE TABLE	19. SHOW COLUMNS
2. ALTER TABLE ADD COLUMNS	11. CREATE TABLE AS	20. SHOW CREATE TABLE
3. ALTER TABLE ADD PARTITION	12. CREATE VIEW	21. SHOW CREATE VIEW
4. ALTER TABLE DROP PARTITION	13. DESCRIBE TABLE	22. SHOW DATABASES
5. ALTER TABLE RENAME PARTITION	14. DESCRIBE VIEW	23. SHOW PARTITIONS
6. ALTER TABLE REPLACE COLUMNS	15. DROP DATABASE	24. SHOW TABLES
7. ALTER TABLE SET LOCATION	16. DROP TABLE	25. SHOW TBLPROPERTIES
8. ALTER TABLE SET TBLPROPERTIES	17. DROP VIEW	26. SHOW VIEWS
9. CREATE DATABASE	18. MSCK REPAIR TABLE

5. Athena SQL DML Clauses

DML stands for Data Manipulation Language and is a part of the Structured Query Language (SQL) class. In a relational database, every time a SELECT, INSERT, DELETE or UPDATE statement is executed you are manipulating data and thereby executing a DML statement.

When an Athena SQL DML statement is executed, it manipulates data stored in Amazon S3 (Simple Storage Service); therefore, support for DML statements like INSERT, DELETE, UPDATE and MERGE does not exist in Athena SQL.

Currently, the only Athena SQL DML supported is the SELECT statement.

Redshift materialized views: The good, the bad and the ugly

redshift coalesce cover image of girl on computer

Redshift Coalesce: What you need to know to use it correctly

15 Redshift date functions frequently used by developers

Helpful links

AWS Athena CLI

Interact with Athena using shell commands from Windows PowerShell, Linux or remotely.

AWS Athena Documentation

This is the latest user guide version of AWS Athena Documentation

Interested in our services ?

email us at : info@obstkel.com

Privacy Overview

This website uses cookies to improve your experience while you navigate through the website. Out of these, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may affect your browsing experience.

Necessary

Always Enabled

Necessary cookies are absolutely essential for the website to function properly. These cookies ensure basic functionalities and security features of the website, anonymously.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.