obstkel.com logo

Redshift Coalesce: One for all and all for NULL

Redshift coalesce

Coalesce means to combine!

Redshift Coalesce is a conditional expression which returns the first non-null value from multiple input values.  The beauty of coalesce is, there is no limit to the number of values you can input. It’s a complex sounding word, but one of the most helpful expressions in Redshift.

The syntax for the coalesce function is as below.

coalesce (expression1, expression2, expression3.......)

If you pay close attention, the coalesce function is pretty much an abbreviated version of an if/else statement. The above statement would break down as 

if expression1 is null then set to expression2 else 
if expression2 is null then set to expression3........

 
So, when should we use the Coalesce expression?

  • Incomplete source data – Scenarios where the data is not always populated. In such cases you may need to add a secondary logic to make the data complete.
  • Derived field – Certain business rules might require field data to be derived based on multiple source fields. 
  • Nulls not allowed in target – This is one of the most common use cases for using coalesce. 


As a matter of personal preference, if you don’t like the sound of “coalesce” or think it’s too much to remember, don’t worry. You can use the NVL expression. COALESCE and NVL do the exact same thing.

For now, let’s look at 3 nuances of the coalesce function – data types, null values and expression priority. Mastering these will have you using coalesce much more effectively.

Data type impact on Redshift Coalesce

The coalesce expression is data type independent. This means you can use this expression on String, Date and Numerical values as long as all the inputs are of the same type.  The last part needs some clarification.

For example, you cannot have one input field(expression1) of type date, and another one (expression2) of type number.  All source data fields, in the coalesce function should have the same data type.

To keep things simple, I will be referring to CHAR/VARCHAR data types, combinedly as string(s).

Tip #1: The target field data type is the driver – When using the Redshift Coalesce function, make sure all the source data types are the same as the target data type. If they are not, then use a function to convert them to the target data type.

Now let’s look at a few examples. Take a look at the table below and see if come to the same conclusion.

redshift coalesce datatype
  1. The target data type is DATE. Expression1 and 3 are of type DATE. Expression2 is has a value of null. This combination works.

     

  2. The target data type is NUMERIC. Expression1 works, and expression2 is okay since it’s a NULL value. However, expression2 is a date type. This will not work.

     

  3. The target data type for #3 is a VARCHAR. All 3 expressions are in string format and are characters. So, this works.

  4. This example has a target field of type NUMERIC. Expression1 and expression2 looks great. However, expression3 is a number in string format. This combination will not work. But if you were to exclude the quotes around expression3, we have a winner! 


    Related: 10 Redshift Create Table Examples to make you an expert 

 

Table of Contents

Null values - the nemesis of Redshift Coalesce?

The Redshift Database Developer Guide defines a null as “If a column in a row is missing, unknown, or not applicable, it is a null value or is said to contain null”.

A simplified version, a null value in a database denotes the absence of data. DATE and NUMERIC data types are the cleanest when it comes to determining null/not null cases. You either have a valid value or you have a null value.  The complexity arises when the data is stored in a VARCHAR/CHAR datatype field. 

A space or blank in a CHAR/VARCHAR field is not the same as null value. Space might not be the value you are looking for, but that does not mean it is a null value.

Likewise, a zero in a numeric field is not the same as a null value.

An incorrect date like 30 February 2022 might be okay in a CHAR/VARCHAR field but will be a null value in a DATE data type field.

In some instances, we may even have junk/garbage characters in a string field.

How, then do we make sure a source field is actually a pure null?

A clever move would be to write a SELECT with a TRIM function and an IS NULL in sql as shown below.

Tip#2: Determine if the source field contains null value(s) using IS NULL in SQL.
     SELECT * FROM orders where TRIM(delivery_status) IS NULL;

Alternately, if you suspect the data has junk/garbage characters, you can add the REGEXP_REPLACE function to the above IS NULL SQL statement as below. 

The function as written, replaces all occurrences of numbers ‘0-9’, upper case letters ‘A-Z’ and lowercase letters ‘a-z’ with a null value. If a junk/garbage character exists in the field, the SELECT statement returns a 1.

Tip#3: Use REGEXP_REPLACE and NOT NULL to determine if a field contains junk/garbage characters.

SELECT 1 FROM orders where
REGEXP_REPLACE(delivery_status, '[[A-Z,a-z,0-9]]*','') IS NOT NULL;

Prioritizing expressions in Coalesce

Moving on to the last and final topic on Coalesce – expressions. 

The source fields you choose to be expression1, expression2, expression3 and so on makes a significant difference in the value returned by the Redshift Coalesce function. A technique to ensure you do not end up with incorrect values is to prioritize your expression selection. 

How do you do that?

  • Let the target field requirement be the driver
  • Use fields in the order of decreasing data quality, i.e., highest to lowest data quality


Just like you send in the best players first in football, make sure the field with the best data quality is used first. The image below depicts the order in which it is best to select fields in a Coalesce expression.

redshift coalesce expression priority

Redshift Coalesce examples

Let’s look at a few of Redshift Coalesce examples using date, string and numeric data. If you would like to see more variations on coalesce examples, sent us an email at info@obstkel.com

1.Redshift Coalesce in sql example using dates

In this example #2, let’s look at using coalesce on 3 input date columns. Remember in this context column and expression means the same.

  • expression1   order_delivered, a column in a table of type DATE.
  • expression2 – expected_delivery date, a column of type VARCHAR with date stored in format YYYYMMDD.
  • expression3current_date function.
SELECT coalesce ( order_delivered, to_date(expected_delivery,'YYYYMMDD'), current_date) FROM orders;

2.Redshift Coalesce in sql example using strings

In this example #3, let’s look at using coalesce on 3 CHAR/VARCHAR columns. The goal is to determine the email address using a combination of existing data and deriving it if necessary.

  • expression1   email_address, a column in a table of type VARCHAR used to store primary email.
  • expression2 – email_address2, a column of type VARCHAR used to store secondary email addresses.
  • expression3 – This is a concatenation of 2 fields, first_name and last_name. Both fields are stored in the customers table as type VARCHAR.
SELECT coalesce ( email_address,trim(email_address2), last_name + first_name'@obstkel.com') FROM customers;

3.Redshift Coalesce in sql example using numeric data

In this 4th example, let’s look at using coalesce on 3 numeric columns. The goal is to determine the price an item was sold at.

  • expression1 – sale_price, the price at which an item was sold if it was part of a sale. This field is stored as a decimal type.
  • expression2 – recommended_sale_price. This is potential price an item should be sold, should it be part of a sale. This price is stored as a VARCHAR.
  • expression3   current_price, a column in a table of type DECIMAL used to store the current price of an item.
SELECT coalesce ( sale_price, cast(recommended_sale_price as decimal), current_price) FROM items;

In case you are wondering why the columns were listed in that order, go back and review the section on ‘Prioritizing expressions in Coalesce’. 

For this specific example, we are trying to determine the price at which an item was sold.

The sale_price has the highest priority in this context and becomes the first expression. If the item was not listed for sale up to that point, sale_price will be null. Our next best bet is to use the recommended_sale_price, which is stored as a string. You can use either the TO_NUMBER function or the CAST function.

In this case I choose the CAST function to convert the recommended sale price to a decimal. If both the sale_price and recommended_sale_price is null, then we set the item price to the current_price.

Conclusion

Redshift Coalesce is a powerful and straightforward conditional expression. Just make sure you pay attention to the below listed details. 

  • Target column data type – let this field be your driver.
  • Source column data type(s) and how they are stored.
  • Quality of data in source columns.
  • Order of listing expressions in coalesce.

Recent Posts

Interested in our services ?

email us at : info@obstkel.com

Copyright 2022 © OBSTKEL LLC. All rights Reserved