Teradata Join Strategy: The Benefits and Costs of Merge and Product Join Methods

tune4

Teradata employs various join methods and techniques to merge the rows of two tables onto a single AMP, which is essential for joining. The combination of join technique and data geography is called the join strategy, and its goal is to reduce resource consumption (CPU seconds, I/Os). Some join methods, like the Teradata Merge Join, have highly efficient algorithms, but preparing them can be expensive, particularly when merging tables with different primary indexes or partitioning. Other join methods are less efficient algorithmically but do not incur any preparation costs, such as those related to join geography.

We should consider the following costs in the join strategy:

The number of required comparisons between the rows of both tables.
How often must each data block of both tables be loaded from disk into main memory during the join process?
The amount of data that must be copied by duplicating or redistributing between AMPs.
The number of rows of both tables that need to be sorted (if any).

The merge join is highly efficient, with minimal comparisons required after sorting the rows on the common AMP. This applies only when both tables have identical primary indices and partitions; otherwise, redistributing and sorting one or both tables may be necessary depending on the specific situation. For row-partitioned tables, it may be necessary to convert one or both to non-partitioned tables, or to partition non-partitioned tables. These preparation steps for the join can be quite expensive.

Want more practical data engineering analysis like this?

Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. DWHPro Letters is free. Subscribe to get new issues by email.

Get the next issue

The product join is less costly to prepare than other types of joins. However, it requires many comparisons because every row in one table must be compared to every row in the other. Generally, the only preparation required for a product join is duplicating the smaller of the two tables to all AMPs. Rows do not need to be sorted and can be compared immediately.

The cost of the product join increases when a small table is erroneously identified as a large table due to absent or outdated statistics. This may cause the smaller table to exceed the available memory, requiring Teradata to repeatedly retrieve data blocks from disk into main memory.

Recapping the cost characteristics of the product join:

High costs for comparison
Low costs for join preparation

To summarize the cost characteristics of the merge join:

No costs (if the primary index and partitioning of both tables are the same) to very high costs for join preparation
Low costs for comparison

Let's proceed to our test scenario to demonstrate that the product join is not always the wrong choice. When an execution plan involves a product join, it's essential to consider the following factors:

Get the next issue by email.

Is it an unwanted product join, e.g., because of a wrong alias or a forgotten join condition?
Is at least one of the two tables very small?

If one of the two tables is small, a product join may perform better than a merge join. Remember that Teradata can also apply dynamic partition elimination to product joins.

The Teradata Merge Join

Here is a query that can be resolved with a merge join:

SELECT Customer.ISO, Product.DescFROM Customer INNER JOIN Product ON Customer.ProductCd = Product.ProductCd;

4)We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with no residual conditions into Spool 2, redistributed by hash code to all AMPs. Then, we do a SORT to order Spool 2 by row hash. The result spool file will not be cached in memory. The size of Spool 2 is estimated with low confidence to be 10,000,100 rows. The estimated time for this step is 30 minutes and 12 seconds.5) We do an all-AMPs JOIN step from Customer by way of a RowHash match scan with no residual conditions, which is joined to Spool 2 (Last Use). Product and Spool 2 are joined using a merge join, with a join condition of ("Product.ProductCd=ProductCd"). The result goes into Spool 1, which is built locally on the AMPs. The result spool file will not be cached in memory. The size of Spool 1 is estimated with low confidence to be 10,000,000 rows. The estimated time for this step is 11 minutes and 20 seconds.

The execution plan shows that join preparation will take around 30 minutes, followed by an estimated 11 minutes for the join operation itself. These estimates align with the query's actual runtime. It is important to note that redistributing and sorting all table rows is necessary to ensure the primary index of each table aligns, so the rows end up on a common AMP.

The Teradata Product Join

We improve the query by adding a WHERE condition, prompting the optimizer to use a product join.

SELECT Customer.ISO, Product.DescFROM Customer INNER JOIN Product ON Customer.ProductCd = Product.ProductCdWHERE Product.Price <= 1000;

4) We do an all-AMPs RETRIEVE step from Product by way of an all-rows scan with a condition of ("Product.Price <= 1000") into Spool 2, which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 100 rows. The estimated time for this step is 0.15 seconds.

5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to Customer. Spool 2 and Customer are joined using a product join, with a join condition of ("ProductCd = Customer.ProductCd"). The input table Customer will not be cached in memory, but it is eligible for synchronized scanning. The result goes into Spool 1, which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 800,000 rows. The estimated time for this step is 18 minutes and 17 seconds.

The updated execution plan projects a sub-one-second duration for copying the small table to all AMPs. The join is estimated to take approximately 18 minutes. These projections align closely with the actual runtimes.

The Product Join performs significantly better than the Merge Join.

The WHERE condition facilitates a product join in this example. However, the optimizer may instead choose to duplicate or redistribute the smaller table across all available AMPs, sort it, and execute a merge join via a rowhash match scan. This method incurs greater expense than joining the unsorted table. If Teradata were to join the tables through columns that lack a primary index, the optimizer would be forced to sort both tables, which is exceedingly costly for large tables.

Summary

While seeing a product join in an execution plan may be alarming, it should not automatically be deemed unfavorable. It is worth noting that this join type is universally applicable, unlike others. In certain cases, such as when OR conditions are present in the join, the optimizer may have no option but to use a product join.

If you're interested in join techniques and tactics, I suggest reading this article as well:

https://letters.dwhpro.com/content/files/2026/05/teradata-nested-join-performance-1.html

To learn more about product joins, we suggest referring to the official Teradata documentation.

Official Teradata Docs: The Product Join

Planning or surviving an enterprise data platform migration?

I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.

Subscribe for free and keep launch access.

Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.

The Teradata Merge Join

The Teradata Product Join

Summary

Subscribe to DWHPro Letters