Share

Teradata GROUP BY vs. DISTINCT- Aggregation Methods

Learn about the differences between GROUP BY and DISTINCT for aggregations in Teradata. The optimizer now selects the appropriate algorithm depending on data demographics. Understand when to use each method to optimize query performance.

Teradata GROUP BY vs. DISTINCT-  Aggregation Methods
sql4

Introduction

When performing data analysis, choosing between GROUP BY and DISTINCT for aggregations is often necessary. While both approaches yield similar results, earlier versions of Teradata allowed users to select the more efficient option based on specific query data characteristics. This was possible because those versions used distinct algorithms for each method.

This article has been updated to reflect advancements in technology. The Teradata Optimizer removes the need to choose between GROUP BY and DISTINCT. This article outlines the two algorithms and how the optimizer selects between them.

Earlier versions of Teradata did not pre-aggregate data when using DISTINCT for aggregation. Instead, the rows intended for final aggregation were repeatedly redistributed to the target AMPs.


Want more practical data engineering analysis like this?

Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. Early subscribers keep launch access before the paid plan launches.

Get the next issue


Teradata pre-aggregates data and distributes it among target AMPs when using GROUP BY aggregation.

Get the next issue by email.

As noted above, the fixed algorithm assignment for aggregation syntax (DISTINCT vs. GROUP BY) has been discontinued. The Teradata Optimizer now determines the appropriate algorithm based on data demographics.

For DISTINCT, rows are immediately redistributed without pre-aggregation. For GROUP BY, a preliminary pre-aggregation step occurs before redistributing distinct values among the AMPs.

We must consider how data demographics affect the algorithm choice. It is straightforward: pre-aggregation is advisable if the aggregated columns contain a small number of distinct values. Alternatively, distributing and aggregating rows directly is often the better option when there are many distinct values.

When does the Teradata Optimizer use which method?

Pre-aggregation offers the advantage of localized AMP processing and reduced BYNET distribution of rows. However, this method can be disadvantageous when columns contain highly diverse values, as each AMP must sort and eliminate duplicates based on ROWHASH. Pre-aggregation is least beneficial when columns have many distinct values.

Distributing rows directly to the target AMPs is beneficial when columns have highly varied values.

The optimizer must also account for skewed aggregated columns, which may cause certain AMPs to run out of spool space if rows are redistributed too quickly. In such cases, performing a pre-aggregation can be beneficial.

Fortunately, the question of whether to use GROUP BY or DISTINCT no longer requires manual consideration. However, collecting statistics on the aggregated columns is essential so that the optimizer can identify any imbalances and choose the appropriate method.

Summary

We outlined the decision criteria for the optimizer's aggregation method selection. Our responsibility is to provide the optimizer with the necessary statistics so it can make the right choice.

  • GROUP BY    for many duplicates or skew
  • DISTINCT      for a few duplicates only and no skew

Planning or surviving an enterprise data platform migration?

I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.

Subscribe before the paid plan launches and keep launch access.

Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.

Subscribe to DWHPro Letters

Practical field notes on enterprise data engineering, production AI systems, platform migration, and the senior engineering market.
Written by Roland Wenzlofsky Founder of DWHPro Author of Teradata Query Performance Tuning
Get the next issue
Subscribe