Doing Teradata Hashing The Right Way
This article assumes you are familiar with the basic architecture of a Teradata System.
As you know, a Teradata System does its work through AMPs (Access Module Processors).
Each AMP manages the rows stored on its own virtual disk. The number of AMPs depends on the size of the system and can run into the hundreds.
As noted in my article about the Teradata High-Level Architecture, the primary goal is to distribute rows evenly among all AMPs in order to achieve parallelism.
How Does Teradata Achieve Uniform Data Distribution?
How evenly the rows are distributed across the AMPs depends on the choice of the Primary Index.
Primary Index vs. Primary Key
The Primary Index and the Primary Key are distinct concepts. The Primary Key belongs to logical data modeling, whereas the Primary Index is a physical design concept in Teradata.
Please note that the following paragraph is a simplified example but should provide an adequate understanding of the general concepts.
The Primary Index comprises table columns used as input for a highly efficient hashing algorithm. The algorithm's output designates a responsible AMP that assumes all tasks pertaining to the corresponding row. The designated AMP stores the row on its associated virtual disk and becomes solely accountable for handling it thereafter.
The order in which the columns are passed to the hashing algorithm does not matter. Data types do matter, however: the same value stored in two incompatible data types can hash to different results.
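The idea can be illustrated with a small Python sketch. It is not Teradata's hashing algorithm: SHA-256 is a stand-in hash, the per-column hashes are simply added so that column order does not matter, and the AMP is picked with a plain modulo (a real system maps the row hash to an AMP through hash buckets and a hash map). The names `NUM_AMPS`, `column_hash`, `row_hash`, and `amp_for` are hypothetical and exist only for this illustration.

```python
import hashlib

NUM_AMPS = 8  # assumed system size, for illustration only


def column_hash(value) -> int:
    """Hash one column value together with its type name (stand-in, not Teradata's hash)."""
    typed = f"{type(value).__name__}:{value!r}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(typed).digest()[:4], "big")


def row_hash(pi_values) -> int:
    """Add the per-column hashes modulo 2**32, so the column order is irrelevant."""
    total = 0
    for value in pi_values:
        total = (total + column_hash(value)) % 2**32
    return total


def amp_for(pi_values) -> int:
    """Pick the responsible AMP for a row (simplified to a modulo)."""
    return row_hash(pi_values) % NUM_AMPS


# Same values, different column order: same row hash, same AMP.
same_amp = amp_for((42, "Vienna")) == amp_for(("Vienna", 42))

# Same "content", different data type: the hashes differ in this toy model.
type_matters = row_hash((1,)) != row_hash(("1",))
```

In this toy model the type name is part of the hashed input, which is one simple way to show why the integer 1 and the string '1' are not interchangeable as Primary Index values.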
Data Distribution Strategy
The data distribution strategy is both simple and effective. Each AMP is assigned a portion of the data, and the Primary Index alone determines which rows end up on which AMP. Every table is spread across all AMPs of the Teradata System, and each AMP stores and maintains its own share of the rows of every table.
When a DML statement involves all AMPs, the AMP with the most rows to process determines the overall response time, because the statement is only finished when the slowest AMP is finished. Distributing rows evenly across all AMPs is therefore the precondition for linear scalability and should be the primary focus when designing the physical data model.
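To make the effect of a poorly chosen Primary Index visible, here is a minimal simulation. It again uses SHA-256 and a modulo as stand-ins for the real hashing, and the helper names as well as the skew measure (busiest AMP divided by the average) are assumptions made for this sketch, not a Teradata metric.

```python
from collections import Counter
import hashlib

NUM_AMPS = 4  # assumed system size, for illustration only


def amp_for(pi_value) -> int:
    """Stand-in for the hash-to-AMP mapping (not Teradata's algorithm)."""
    digest = hashlib.sha256(repr(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


def rows_per_amp(pi_values):
    """Count how many rows each AMP would have to store."""
    counts = Counter(amp_for(value) for value in pi_values)
    return [counts.get(amp, 0) for amp in range(NUM_AMPS)]


def skew_factor(counts) -> float:
    """Busiest AMP relative to the average AMP; 1.0 means perfectly even."""
    average = sum(counts) / len(counts)
    return max(counts) / average


# Nearly unique Primary Index values spread well across the AMPs.
even = rows_per_amp(range(10_000))

# Only three distinct Primary Index values: at most three AMPs receive rows.
skewed = rows_per_amp([i % 3 for i in range(10_000)])

print("even:  ", even, round(skew_factor(even), 2))
print("skewed:", skewed, round(skew_factor(skewed), 2))
```

With only three distinct values, at least one of the four simulated AMPs stays empty, and the busiest AMP carries noticeably more than its fair share of the work.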
The Primary Index also allows direct access to data: the hash value of the Primary Index identifies the responsible AMP, as in any hash-based lookup. Access via the Primary Index is the fastest way to retrieve rows from disk.
Changing the Primary Index
Updating the Primary Index columns of a row changes its hash value, so the row usually has to be handed over to a different AMP, which is expensive. It is therefore advisable to avoid updates to PI columns.
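The handover can be pictured as a delete on one AMP followed by an insert on another. The following sketch models each AMP as a Python dictionary keyed by the Primary Index value, which is a strong simplification; `amp_for` and `update_pi` are hypothetical helpers, and the hash is a stand-in for Teradata's.

```python
import hashlib

NUM_AMPS = 4  # assumed system size, for illustration only


def amp_for(pi_value) -> int:
    """Stand-in for the hash-to-AMP mapping (not Teradata's algorithm)."""
    digest = hashlib.sha256(repr(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


def update_pi(amps, old_pi, new_pi):
    """Re-home a row under its new PI value; returns (source AMP, target AMP)."""
    source, target = amp_for(old_pi), amp_for(new_pi)
    payload = amps[source].pop(old_pi)   # remove the row from the old AMP
    amps[target][new_pi] = payload       # store it on the AMP owning the new hash
    return source, target


# One dictionary per AMP, keyed by PI value (one row per value in this toy model).
amps = [dict() for _ in range(NUM_AMPS)]
amps[amp_for(1001)][1001] = {"name": "Alice"}

source, target = update_pi(amps, 1001, 2002)
print(f"row moved from AMP {source} to AMP {target}")
```

Source and target can occasionally be the same AMP, but in general the row has to travel, which is why a non-PI update is so much cheaper.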
A Primary Index is either UNIQUE or non-UNIQUE. Rows with the same Primary Index values are assigned the same hash value and are therefore managed by the same AMP. To tell such rows apart, a uniqueness value is added to the hash value; together they identify the row.
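A minimal sketch of this idea follows. The class `Amp`, the counter per hash value, and the tuple layout of the row identifier are assumptions chosen for illustration; the hash is again a stand-in, and the actual storage format is not modeled.

```python
from collections import defaultdict
import hashlib


def row_hash(pi_value) -> int:
    """Stand-in hash of the Primary Index value (not Teradata's algorithm)."""
    digest = hashlib.sha256(repr(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Amp:
    """Toy AMP that identifies rows by (row hash, uniqueness value)."""

    def __init__(self):
        self.next_uniqueness = defaultdict(int)  # one counter per row hash
        self.rows = {}

    def insert(self, pi_value, payload):
        h = row_hash(pi_value)
        self.next_uniqueness[h] += 1
        row_id = (h, self.next_uniqueness[h])
        self.rows[row_id] = payload
        return row_id


amp = Amp()
first = amp.insert("Smith", {"city": "Vienna"})
second = amp.insert("Smith", {"city": "Graz"})

# Same PI value, same row hash, but different uniqueness values.
print(first, second)
```

Both rows share the hash part of the identifier and live on the same AMP; only the uniqueness value distinguishes them.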
While the data distribution process involves additional details, the preceding description should suffice for your everyday use of Teradata.
Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.