The Impact of Skewing on Teradata demonstrated
Learn how skewing on Teradata can impact query run times with this informative test scenario. Find key figures and observations to help optimize your performance.
Teradata professionals inevitably encounter the detrimental effects of data skew on query performance.
Yet, I believe only a handful of us know the precise extent of impact we should anticipate.
I sought out crucial data points and created a test scenario utilizing precise measurements to gain a comprehensive understanding.
Want more practical data engineering analysis like this?
Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. Early subscribers keep launch access before the paid plan launches.
I am an enthusiast of CPU utilization and disk activity, as opposed to run times, as these metrics consistently reflect accurate information. Obtaining this data is straightforward through DBC.DBQLOGTBL, unless restricted by database administrators in adverse scenarios.
I conducted tests on both SET and MULTISET tables to broaden the test scenario. The rationale behind incorporating this attribute at the table level was to:
Considering the impact of skewing on performance, wouldn't it be necessary to impose duplicate row checks on a SET table, thereby increasing the workload on the AMPs affected by skewing?
The configuration of my test environment is as follows:
CREATE SET TABLE TABLE_NOT_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE MULTISET TABLE TABLE_NOT_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (A) ;
CREATE SET TABLE TABLE_SKEWED (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
CREATE MULTISET TABLE TABLE_SKEWED_MULTISET (A INTEGER, B INTEGER) PRIMARY INDEX (B) ;
Get the next issue by email.
I generated both a skewed and unskewed table as both SET and MULTISET. I populated them with data, wherein Column A is nearly unique, and Column B contains only a handful of distinct values.
INSERT INTO TABLE_NOT_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_NOT_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
INSERT INTO TABLE_SKEWED_MULTISET SELECT RANDOM(1,1000000), RANDOM(1,5) FROM SYS_CALENDAR.CALENDAR;
After careful analysis, I examined DBC.DBQLOGTBL:
SELECT QUERYTEXT, MAXAMPCPUTIME, MINAMPCPUTIME, MAXAMPIO, MINAMPIO, TOTALIOCOUNT, AMPCPUTIME FROM DBC.DBQLOGTBL WHERE SESSIONID = ; -- my SQL Assistant session
Below you can see the result of this test. Here are my observations:| QueryText | MaxCPU | MinCPU | MaxIO | MinIO | TotalIO | AMPCPU |
| INSERT INTO TABLE_NOT_SKEWED… | 0,01 | 0 | 41 | 39 | 3.619 | 0,33 |
| INSERT INTO TABLE_NOT_SKEWED_MULTISET… | 0,02 | 0 | 38 | 40 | 3.534 | 0,44 |
| INSERT INTO TABLE_SKEWED… | 8,59 | 0 | 41.536 | 13 | 204.651 | 42,44 |
| INSERT INTO TABLE_SKEWED_MULTISET… | 0,03 | 0 | 86 | 11 | 1.350 | 0,33 |
SET TABLES
The skewed SET table requires vastly more disk accesses compared to the non-skewed table. This means the database system takes significantly longer to retrieve data from the skewed table than from the non-skewed table.
The most active AMP must perform 4000 times as many disk accesses as the least busy AMP while utilizing similar CPU seconds.
MULTISET TABLES
Multiset tables require fewer disk accesses than their set table counterparts because the expensive duplicate row check is omitted.
The skewed multiset table requires fewer disk accesses than the unskewed table by less than 50%. However, this does not necessarily indicate that it is superior.
Most rows are likely adjacent on a small number of AMPs, which reduces disk access by utilizing memory caching and optimizing block reading methods.
Unfortunately, only a limited number of AMPs are assigned to manage numerous rows, resulting in a bottleneck. Therefore, it would be unwise to be deceived by these figures as inserting into the table remains significantly slower than when undertaken by a single or a select few AMPs.
Use multiset tables when available to minimize table skew and reduce the negative effects on duplicate row checks.
Planning or surviving an enterprise data platform migration?
I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.
Subscribe before the paid plan launches and keep launch access.
Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.