DISTINCT vs. GROUP BY: Understanding Performance Differences Based on Data Demographics in Teradata

Over the years, there has been much debate about which of these two statements performs better:

SELECT <COLUMN> FROM <TABLE> GROUP BY 1 or
SELECT DISTINCT <COLUMN> FROM <TABLE>

Many people share their experiences, but they often attribute the result to the wrong cause: they build a single test scenario and draw sweeping conclusions from it.

The speculation can stop. Here is the answer:


Which statement is faster depends on the data demographics.

To choose between DISTINCT and GROUP BY in Teradata, you need to understand how each statement is executed.

DISTINCT first redistributes all rows to the responsible AMPs and only then eliminates duplicates. GROUP BY first groups the rows locally on each AMP and then redistributes only the remaining rows for the final aggregation.
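The difference between the two plans can be sketched with a small Python simulation. This is only an illustrative model, not Teradata's actual implementation: the four-AMP system, the `target_amp` function (a simple modulo standing in for the real row hash), and the function names are all assumptions made for the example.

```python
NUM_AMPS = 4  # hypothetical system size


def target_amp(value: int) -> int:
    """Stand-in for the row hash: maps a value to the AMP responsible for it."""
    return value % NUM_AMPS


def distinct_plan(amps):
    """Redistribute every row first, then remove duplicates on the target AMP."""
    moved = 0
    received = [set() for _ in range(NUM_AMPS)]
    for rows in amps:
        for value in rows:
            received[target_amp(value)].add(value)  # duplicates removed after the move
            moved += 1
    return sorted(v for s in received for v in s), moved


def group_by_plan(amps):
    """Group locally on each AMP, then redistribute one row per local group."""
    moved = 0
    received = [set() for _ in range(NUM_AMPS)]
    for rows in amps:
        for value in set(rows):  # AMP-local grouping
            received[target_amp(value)].add(value)  # final grouping on the target AMP
            moved += 1
    return sorted(v for s in received for v in s), moved


# Four AMPs, each holding six rows drawn from only two distinct values.
amps = [[1, 2, 1, 2, 1, 2] for _ in range(NUM_AMPS)]
distinct_result, distinct_moved = distinct_plan(amps)
group_result, group_moved = group_by_plan(amps)
print(distinct_result, distinct_moved)  # [1, 2] 24
print(group_result, group_moved)        # [1, 2] 8
```

Both plans return the same result set. In this toy model, they differ only in how many rows cross the network and in whether the extra local grouping step is performed.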

Once the basic principles are understood, it becomes simple to identify the appropriate statement to use on a Teradata system.

If the grouping columns contain many distinct values, AMP-local aggregation removes few rows and only adds an extra step. In this case, use DISTINCT.

If the grouping columns contain only a handful of distinct values, use GROUP BY. Its AMP-local grouping step sharply reduces the number of rows that have to be redistributed to the AMPs for the final aggregation.
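A rough back-of-the-envelope estimate makes the rule concrete. The sketch below assumes an evenly distributed table and uses a worst-case bound for GROUP BY (every distinct value present on every AMP). The function name `rows_redistributed` and the example numbers are made up for illustration.

```python
def rows_redistributed(total_rows: int, distinct_values: int, num_amps: int) -> dict:
    """Estimate how many rows each plan moves between AMPs (simplified model)."""
    rows_per_amp = total_rows // num_amps  # assumes an evenly distributed table
    # DISTINCT ships every row before duplicates are removed.
    distinct_moved = rows_per_amp * num_amps
    # GROUP BY ships at most one row per value that is present on an AMP.
    group_by_moved = num_amps * min(rows_per_amp, distinct_values)
    return {"distinct": distinct_moved, "group_by": group_by_moved}


few = rows_redistributed(total_rows=1_000_000, distinct_values=10, num_amps=100)
many = rows_redistributed(total_rows=1_000_000, distinct_values=1_000_000, num_amps=100)
print(few)   # {'distinct': 1000000, 'group_by': 1000}
print(many)  # {'distinct': 1000000, 'group_by': 1000000}
```

With ten distinct values, local grouping cuts the redistributed rows from one million to at most one thousand. With all values unique, GROUP BY moves just as many rows as DISTINCT and has paid for a local grouping step that removed nothing.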

One caveat:

If the grouped columns are highly skewed, DISTINCT moves many rows to a single AMP or a few AMPs, which can cause an "out of spool space" error on those AMPs. In this scenario, use GROUP BY even if DISTINCT would otherwise be the preferred choice.
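The effect of skew can be illustrated with another small simulation. As before, this is a simplified model under stated assumptions: four AMPs, a modulo function in place of the real row hash, and an invented data set in which the value 7 dominates on every AMP.

```python
NUM_AMPS = 4  # hypothetical system size


def target_amp(value: int) -> int:
    """Stand-in for the row hash: maps a value to the AMP responsible for it."""
    return value % NUM_AMPS


def rows_received_per_amp(amps, local_grouping: bool):
    """Count the rows that arrive on each AMP after redistribution."""
    received = [0] * NUM_AMPS
    for rows in amps:
        # With local grouping, each AMP sends one row per locally present value.
        outgoing = set(rows) if local_grouping else rows
        for value in outgoing:
            received[target_amp(value)] += 1
    return received


# Skewed data: on every AMP, 98 of 100 rows carry the value 7.
amps = [[7] * 98 + [1, 2] for _ in range(NUM_AMPS)]
without_grouping = rows_received_per_amp(amps, local_grouping=False)
with_grouping = rows_received_per_amp(amps, local_grouping=True)
print(without_grouping)  # [0, 4, 4, 392]
print(with_grouping)     # [0, 4, 4, 4]
```

Without local grouping, almost all rows pile up in the spool of the one AMP that owns the dominant value. With local grouping, that AMP receives only one row per sending AMP.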

I hope most of your questions have been answered. There is no clear winner between DISTINCT and GROUP BY.


Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.
