Athena SQL is the query language used in Amazon Athena to interact with data in S3. Mastering Athena SQL is not a monumental task if you get the basics right. There are 5 areas you need to understand as listed below.
Before we get to the SQL part, lets make sure you have a good understanding of what Amazon Athena is.
Amazon Athena, is a web service by AWS used to analyze data in Amazon S3 using SQL.
It runs in the Cloud (or a server) and is part of the AWS Cloud Computing Platform. In many respects, it is like a SQL graphical user interface (GUI) we use against a relational database to analyze data. The main difference is Amazon Athena helps you read and analyze data in files using SQL instead of data stored in a database.
The key difference, unlike traditional SQL queries that run against tables in a database Amazon Athena runs against files. Athena can analyze structured, unstructured and semistructured data stored in an S3 bucket. It can read Apache Web Logs and data formatted in JSON, ORC, Parquet, TSV, CSV and text files with custom delimiters.
Secondly, Amazon Athena does not store the data being analyzed. Athena does have the concept of databases and tables, but they store metadata regarding the file location and the structure of the data.
Thirdly, Amazon Athena is serverless, which means provisioning capacity, scaling, patching, and OS maintenance is handled by AWS. And finally, Athena executes SQL queries in parallel, which means faster outputs.
A Data Type defines the attributes of a value. It also classifies the SQL operations that can be performed on a value. For example, an Athena data type of DATE denotes that a value is a date, and should contain Year, Month and Day information. It also means only DATE related SQL operations can be performed on that value.
Similar to defining Data Types in a relational database, AWS Athena Data Types are defined for each column in a table. These data types form the meta data definition of the dataset, which is stored in the AWS Glue Data Catalog.
AWS Athena has 18 distinct data types, which are listed below in alphabetical order.
An Operator performs an action on one or more data values. For example, every time we add two numbers, we are performing an addition operation using the “+” operator.
Athena SQL has nine different types of Operators depending on the data type. They are Array Operators, Comparison Operators, Decimal Operators, Date and Time Operators, JSON Operators, Logical Operators, Map Operators, Mathematical Operators and String Operators.
The below table lists the Operator definitions and syntax in Athena SQL.
<  Less than 
>  Greater than 
<=  Less than or equal to 
>=  Greater than or equal to 
=  Equal 
<> or !=  Not equal 
+  Addition 
–  Subtraction 
*  Multiplication 
/  Division 
%  Remainder or Modulus 
  Concatenate 
AND  Logical AND 
OR  Logical OR 
NOT  Logical NOT 
A function in Athena SQL is very similar to an Operator. Operators are great for performing simple operations. Functions on the other hand performs complex computations on multiple columns simultaneously.
Athena SQL Functions can be categorized into 17 areas as below.
Let us start by looking at the four most commonly used categories: Athena Aggregate Functions, Athena String Functions, Athena Date Functions and Athena Window Functions.
In Athena, aggregate functions are used to create a condensed or summarized view of your data. They work the same as in any relational database.
The table below lists all the aggregate functions in Athena with the sql syntax.
approx_distinct(x )  Returns the approximate number of distinct input values 
approx_distinct(x, e)  Returns the approximate number of distinct input values with a standard error less than e 
approx_percentile(x, percentage )  Returns the approximate percentile for all input values of x at the given percentage 
approx_percentile(x, percentages )  Returns the approximate percentile for all input values of x at each of the specified percentages 
approx_percentile(x, w, percentage )  Returns the approximate weighed percentile for all input values of x using the peritem weight w at the percentage p 
approx_percentile(x, w, percentage, accuracy)  Returns the approximate weighed percentile for all input values of x using the peritem weight w at the percentage p, with a maximum rank error of accuracy 
approx_percentile(x, w, percentages)  Returns the approximate weighed percentile for all input values of x using the peritem weight w at each of the given percentages specified in the array 
arbitrary(x)  Returns an arbitrary nonnull value of x 
array_agg(x)  Returns an array created from the input x elements 
avg(x)  Returns the average (arithmetic mean) of all input values 
bitwise_and_agg(x)  Returns the bitwise AND of all input values in 2’s complement representation 
bitwise_or_agg(x )  Returns the bitwise OR of all input values in 2’s complement representation 
bool_and(boolean )  Returns TRUE if every input value is TRUE, otherwise FALSE 
bool_or(boolean)  Returns TRUE if any input value is TRUE, otherwise FALSE 
checksum(x )  Returns an orderinsensitive checksum of the given values 
corr(y, x)  Returns correlation coefficient of input values 
count(* )  Returns the number of input rows 
count(x )  Returns the number of nonnull input values 
count_if(x )  Returns the number of TRUE input values 
covar_pop(y, x)  Returns the population covariance of input values 
covar_samp(y, x)  Returns the sample covariance of input values 
every(boolean)  Alias for bool_and() function 
geometric_mean(x )  Returns the geometric mean of all input values 
histogram(x)  Returns a map containing the count of the number of times each input value occurs 
kurtosis(x)  Returns the excess kurtosis of all input values 
map_agg(key, value)  Returns a map created from the input key / value pairs 
map_union(x<K, V>)  Returns the union of all the input maps 
max(x)  Returns the maximum value of all input values 
max(x, n)  Returns n largest values of all input values of x 
max_by(x, y)  Returns the value of x associated with the maximum value of y over all input values 
max_by(x, y, n )  Returns n values of x associated with the n largest of all input values of y in descending order of y 
min(x)  Returns the minimum value of all input values 
min(x, n )  Returns n smallest values of all input values of x 
min_by(x, y )  Returns the value of x associated with the minimum value of y over all input values 
min_by(x, y, n )  Returns n values of x associated with the n smallest of all input values of y in ascending order of y 
multimap_agg(key, value)  Returns a multimap created from the input key / value pairs. Each key can be associated with multiple values 
numeric_histogram(buckets, value)  Computes an approximate histogram with up to buckets number of buckets for all values 
numeric_histogram(buckets, value, weight )  Computes an approximate histogram with up to buckets number of buckets for all values with a peritem weight of weight 
regr_intercept(y, x)  Returns linear regression intercept of input values. y is the dependent value. x is the independent value 
regr_slope(y, x)  Returns linear regression slope of input values. y is the dependent value. x is the independent value 
skewness(x )  Returns the skewness of all input values 
stddev(x)  Alias for stddev_samp() function 
stddev_pop(x )  Returns the population standard deviation of all input values 
stddev_samp(x )  Returns the sample standard deviation of all input values 
sum(x)  Returns the sum of all input values 
var_pop(x)  Returns the population variance of all input values 
var_samp(x)  Returns the sample variance of all input values 
variance(x )  Alias for var_samp() function 
Similar to string functions in a database, you can use Athena String functions to manipulate data stored as character strings.
Since Athena is based on Presto, Athena String functions are a one to one match between the two. The table below lists string functions, and the Athena SQL syntax for it.
chr(n )  Returns the Unicode code point n as a single character string 
codepoint(string)  Returns the Unicode code point of the only character of string 
concat(string1, …, stringN)  Returns the concatenation of string1, string2, …, stringN 
from_utf8(binary)  Decodes a UTF8 encoded string from binary 
from_utf8(binary, replace )  Decodes a UTF8 encoded string from binary 
length(string)  This function returns the length of string in Athena 
levenshtein_distance(string1, string2)  Returns the Levenshtein edit distance of string1 and string2 
lower(string)  Converts string to lowercase 
lpad(string, size, padstring)  Left pads string to size characters with padstring 
ltrim(string)  Removes leading whitespace from string 
normalize(string)  Transforms string with NFC normalization form 
normalize(string, form)  Transforms string with the specified normalization form 
position(substring IN string)  Returns the starting position of the first instance of substring in string 
replace(string, search )  Removes all instances of search from string 
replace(string, search, replace )  Replaces all instances of search with replace in string 
reverse(string)  Returns string with the characters in reverse order 
rpad(string, size, padstring)  Right pads string to size characters with padstring 
rtrim(string )  Removes trailing whitespace from string 
split(string, delimiter )  Splits string on delimiter and returns an array 
split(string, delimiter, limit)  Splits string on delimiter and returns an array of size at most limit 
split_part(string, delimiter, index)  Splits string on delimiter and returns the field index 
split_to_map(string, entryDelimiter, keyValueDelimiter)  Splits string by entryDelimiter and keyValueDelimiter and returns a map. entryDelimiter splits string into keyvalue pairs 
strpos(string, substring )  Returns the starting position of the first instance of substring in string 
substr(string, start )  This Athena substring function returns a subset of a given string starting at position start 
substr(string, start, length )  If you want a specific number of characters (length) from a starting position, then use this alternate version of the Athena substring function 
to_utf8(string)  Encodes string into a UTF8 varbinary representation 
trim(string)  Removes leading and trailing whitespace from string 
upper(string )  Converts string to uppercase 
Athena Date Functions have some quirks you need to be familiar with.
%a  Abbreviated weekday name (Sun .. Sat)  %I  Hour (01 .. 12)  %r  Time, 12hour 
%b  Abbreviated month name (Jan .. Dec)  %i  Minutes, numeric (00 .. 59)  %s  Seconds (00 .. 59) 
%c  Month, numeric (0 .. 12)  %j  Day of year (001 .. 366)  %T  Time, 24hour 
%d  Day of the month, numeric (00 .. 31)  %k  Hour (0 .. 23)  %v  Week (01 .. 53) 
%e  Day of the month, numeric (0 .. 31)  %l  Hour (1 .. 12)  %W  Weekday name 
%f  Fraction of second  %M  Month  %Y  Year, numeric, four digits 
%H  Hour (00 .. 23)  %m  Month in numeric  %y  Year, numeric (two digits) [2] 
%h  Hour (01 .. 12)  %p  AM or PM  %r  Time, 12hour 
Function  Description 
current_date  Returns the current date as of the start of the query 
current_time  Returns the current time as of the start of the query 
current_timestamp  Returns the current timestamp as of the start of the query 
current_timezone( )  Returns the current time zone 
date_add(unit, value, timestamp)  Adds an interval value of type unit to timestamp 
date_diff(unit, timestamp1, timestamp2)  Returns timestamp2 – timestamp1 expressed in terms of unit 
date_format(timestamp, format)  Formats timestamp as a string using format 
date_parse(string, format)  Parses string into a timestamp using format 
date_trunc(unit, x)  Returns x truncated to unit 
day(x)  Returns the day of the month from x 
day_of_month(x)  This is an alias for day() 
day_of_week(x)  Returns the ISO day of the week from x 
day_of_year(x)  Returns the day of the year from x 
extract(field FROM x)  Returns field from x where field can be DAY, DAY_OF_MONTH, DAY_OF_WEEK, DAY_OF_YEAR, HOUR, MINUTE, MONTH, QUARTER, SECOND, TIMEZONE_HOUR, TIMEZONE_MINUTE, WEEK, YEAR, YEAR_OF_WEEK 
format_datetime(timestamp, format)  Formats timestamp as a string using format 
from_iso8601_date(string)  Parses the ISO 8601 formatted string into a date 
from_iso8601_timestamp(string)  Parses the ISO 8601 formatted string into a timestamp with time zone 
from_unixtime(unixtime)  Returns the UNIX timestamp unixtime as a timestamp 
from_unixtime(unixtime, hours, minutes)  Returns the UNIX timestamp unixtime as a timestamp with time zone using hours and minutes 
from_unixtime(unixtime, string)  Returns the UNIX timestamp unixtime as a timestamp 
hour(x)  Returns the hour of the day from x 
localtime  Returns the current time as of the start of the query 
localtimestamp  Returns the current timestamp as of the start of the query 
minute(x)  Returns the minute of the hour from x 
month(x)  Returns the month of the year from x 
now()  This is an alias for current_timestamp 
parse_datetime(string, format)  Parses string into a timestamp with time zone using format 
quarter(x)  Returns the quarter of the year from x 
Type  Function  Description 

Aggregate Function  * Refer Aggregate Functions tab  Agrregate Functions can be used as window functions by adding the OVER clause 
Ranking Function  cume_dist( )  Returns the cumulative distribution of a value in a group of values 
Ranking Function  dense_rank()  Returns the rank of a value in a group of values 
Ranking Function  ntile(n)  Divides the rows for each window partition into n buckets ranging from 1 to at most n 
Ranking Function  percent_rank()  Returns the percentage ranking of a value in group of values 
Ranking Function  rank( )  Returns the rank of a value in a group of values 
Ranking Function  row_number( )  Returns a unique, sequential number for each row, starting with one 
Value Function  first_value(x)  Returns the first value of the window 
Value Function  last_value(x)  Returns the last value of the window 
Value Function  nth_value(x, offset)  Returns the value at the specified offset from beginning the window 
Value Function  lead(x[, offset[, default_value]] )  Returns the value at offset rows after the current row in the window 
Value Function  lag(x[, offset[, default_value]])  Returns the value at offset rows before the current row in the window starting at 0 
DDL stands for Data Definition Language, and is a part of the Structured Query Language (SQL) class. DDL statements are generally used to create or modify the structural metadata of the actual data. In Amazon Athena, objects such as Databases, Schemas, Tables, Views and Partitions are part of DDL.
Athena SQL DDL is based on Hive DDL, so if you have used the Hadoop framework, these DDL statements and syntax will be quite familiar. Key point to note, not all Hive DDL statements are supported in Amazon Athena SQL. This is because data in Athena is stored externally in S3, and not in a database. For example, DDL statements related to INDEXES, ROLES, LOCKS, IMPORT, EXPORT and COMMIT are not supported in Athena SQL.
The table below lists the 24 DDL statements supported in Athena SQL. For details on Athena DDL syntax, usage and parameters click here.
1. ALTER DATABASE SET DBPROPERTIES  10. CREATE TABLE  19. SHOW COLUMNS 
2. ALTER TABLE ADD COLUMNS  11. CREATE TABLE AS  20. SHOW CREATE TABLE 
3. ALTER TABLE ADD PARTITION  12. CREATE VIEW  21. SHOW CREATE VIEW 
4. ALTER TABLE DROP PARTITION  13. DESCRIBE TABLE  22. SHOW DATABASES 
5. ALTER TABLE RENAME PARTITION  14. DESCRIBE VIEW  23. SHOW PARTITIONS 
6. ALTER TABLE REPLACE COLUMNS  15. DROP DATABASE  24. SHOW TABLES 
7. ALTER TABLE SET LOCATION  16. DROP TABLE  25. SHOW TBLPROPERTIES 
8. ALTER TABLE SET TBLPROPERTIES  17. DROP VIEW  26. SHOW VIEWS 
9. CREATE DATABASE  18. MSCK REPAIR TABLE 

DML stands for Data Manipulation Language and is a part of the Structured Query Language (SQL) class. In a relational database, every time a SELECT, INSERT, DELETE or UPDATE statement is executed you are manipulating data and thereby executing a DML statement.
When an Athena SQL DML statement is executed, it manipulates data stored in Amazon S3 (Simple Storage Service); therefore, support for DML statements like INSERT, DELETE, UPDATE and MERGE does not exist in Athena SQL. Currently, the only Athena SQL DML supported is the SELECT statement.
Related: CloudFormation Parameters: Make your life simple
Interact with Athena using shell commands from Windows PowerShell, Linux or remotely.
This is the latest user guide version of AWS Athena Documentation
Cookie  Duration  Description 

cookielawinfocheckboxanalytics  11 months  This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics". 
cookielawinfocheckboxfunctional  11 months  The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional". 
cookielawinfocheckboxnecessary  11 months  This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary". 
cookielawinfocheckboxothers  11 months  This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other. 
cookielawinfocheckboxperformance  11 months  This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance". 
viewed_cookie_policy  11 months  The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data. 