How to Reduce Skew in Teradata Joins Using the RANDOM() Function
Natural skew may hinder join performance, but the PDPR capability in Teradata 14 can alleviate this problem. However, PDPR may not detect all potential applications if skewed values are not evident in the statistics.
If your Teradata version is below 14, PDPR is unavailable. To address this, you can use a technique based on the RANDOM() function to reduce skew.
Teradata Joins are vulnerable to Skewing.
Let's walk through the specifics step by step.
Want more practical data engineering analysis like this?
Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. Early subscribers keep launch access before the paid plan launches.
Our example involves a join between the two tables shown below. Certain facts are already established about these tables:

- Table 1 contains no rows with 'A' in the join column. Furthermore, it's ensured that the value 'A' will never be added to this column.
- The join column of Table 2 contains many rows with the value 'A'. It's skewed (our example tables only include a few rows, but that's enough to get the idea).
- Both tables are evenly distributed across all AMPs on their primary index column (called "primary key" in our example).
- There exists a range of join column values that are never present in the regular data of Table 2 (we will explain later why this is a requirement).
The above facts are necessary for our testing framework to operate.
Get the next issue by email.
By performing a left join of Table 1 and Table 2 on the common column, both tables' rows are hashed on this column. Assuming Teradata redistributes both tables, this is exemplified.
The informed reader understands Teradata's architecture and will recognize that all rows containing 'A' in the join column will be relocated to a single AMP, causing Table 2's spool to be imbalanced. Conversely, the redistribution of Table 1 should yield an evenly distributed spool. When joining, the AMP holding Table 2's rows with 'A' will bear a greater workload than the other AMPs.
We can use the RANDOM() function to prevent skewing in this scenario.
Table 1 does not contain any rows with the value 'A' in the join column, meaning there will be no matches for any row in Table 2 containing 'A'.
We can easily replace 'A' in Table 2 with a range of values unavailable as regular data (see requirement four above). This is where the RANDOM() function comes into play.
We can generate a range of values, such as TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))), which replace the 'A' values of Table 2. As these values are non-existent in regular data, this will not affect the result set:
CREATE VOLATILE TABLE NOT_SKEWING_TABLE2AS(SELECT PrimaryKey, TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))) AS JoinColumn FROM Table2) PRIMARY INDEX (PrimaryKey);We will join NOT_SKEWING_TABLE2 with Table 1 instead of Table 2. Redistributing NOT_SKEWING_TABLE2 will not result in a skewed spool because the randomly generated values will be evenly distributed across all AMPs.
By implementing this technique, our workload will be uniformly distributed among all AMPs, resulting in improved performance for the join operation. Applying this approach reduced the query's runtime from 4 hours to just 3 minutes.
Planning or surviving an enterprise data platform migration?
I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.
Subscribe before the paid plan launches and keep launch access.
Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.