Mastering Teradata Performance Optimization: Introduction to Partial Redistribution and Duplication
Introduction to Teradata Partial Redistribution and Partial Duplication
Managing data skew is a critical aspect of optimizing Teradata performance. While table-level skew can typically be prevented by selecting an appropriate Primary Index, spool skew occurs during query execution - a common and often significant problem.
Spool skew is a common issue when joining tables in Teradata. The join process is executed simultaneously by all available AMPs in parallel.
Let us consider the following example of a join:
Want more practical data engineering analysis like this?
Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. Early subscribers keep launch access before the paid plan launches.
B ... Big table with 100 Million rows
S ... Small table with 10,000 rows
Assuming that the two tables are joined on a column that is not the primary index for either table, the optimizer may rehash and redistribute both tables. This approach is ideal if the join column values are evenly distributed across all AMPs.
We assume that in the big table B, the value in the join column is the same for approximately 99% of the rows. This equates to 99 million out of 100 million rows sharing the same value in the join column.
Get the next issue by email.
Currently, we are experiencing a significant performance issue due to data skew.
Rehashing the rows assigns 99 million rows to a single AMP because they share the same hash value.
The AMP designated for receiving may experience a Hot AMP situation and deplete its spool space.
Before Teradata Release 13, a performance specialist's sole responsibility was to address dynamic skew issues by manipulating the optimizer to adopt an alternate execution plan. This issue could occasionally be resolved by incorporating statistics, while in other cases, the only viable solution was to rewrite the SQL statement fully.
Teradata 14 introduced a refined approach to resolve data skew problems.
The AMP is informed of any skewed values pertaining to the big Table B.
Skewed rows in Table B remain confined to their respective AMPs. Non-skewed rows are redistributed by rehashing the join column(s).
Non-biased rows from small table S are rehashed on their join column(s), while rows that would become skewed during redistribution are duplicated across all AMPs.
Upon completing the partial redistribution and duplication of the tables, each AMP proceeds to generate the initial result by merging the contents of the redistributed spools.
Each AMP produces a secondary output by merging table S's replicated spool contents with table B's local spool.
With Teradata 14, PRPD alleviates certain performance issues related to skew. However, the Teradata optimizer requires knowledge of biased values, underscoring the importance of maintaining up-to-date statistics.
Planning or surviving an enterprise data platform migration?
I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.
Subscribe before the paid plan launches and keep launch access.
Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.