Share

Understanding Teradata Statistics Histograms: How the Optimizer Estimates Cardinality for WHERE Conditions

Understanding Teradata Statistics Histograms: How the Optimizer Estimates Cardinality for WHERE Conditions
tune3

Teradata Statistics Histograms - A Short Introduction

Many are familiar with the Optimizer's statistical confidence levels. I was recently surprised to discover that a "high confidence" rating does not guarantee a fully accurate estimation (provided the statistics collected are not stale). While I remain hopeful that my observations may be attributed to a bug, I wanted to share this information with you.

The Optimizer can utilize gathered statistics for predicates used in WHERE clauses in various ways.

They can be used alone or in conjunction with advanced methods like random AMP sampling (for detecting staleness) and extrapolation.


Want more practical data engineering analysis like this?

Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. Early subscribers keep launch access before the paid plan launches.

Get the next issue


The Optimizer can obtain accurate cardinality from the biased value histogram or multiple equal height histograms. However, if the histogram's most frequent value is not retrieved, the estimated row count may differ from the actual cardinality.

The most common value in every histogram with equal heights is stored along with its frequency, while other values in the histogram are computed.

Number of Other Values (i.e., not the most frequent in the histogram) / Distinct Other Values = Average Rows per Value

The next step involves determining the Average Rows per Value, the foundation for calculating cardinality. For instance, consider the following query:

SELECT * FROM TheTable WHERE COL1 = 5;

The histogram displays values 1 through 5. Values 1 through 4 have a frequency of 5, while the value 5 has a frequency of 30.

->  50 "Other values" (not being the most frequent of the histogram)
-> 5 distinct "other values"

The Average Rows per Value is 50/5 = 10.

In the example query, the Optimizer estimated the cardinality as 10 when it was 30.

Get the next issue by email.

Using any other WHERE condition that matches the same histogram, for example, SELECT * FROM TheTable WHERE COL1 = 1, will, fortunately, yield the accurate cardinality of 10.

However, we can anticipate that the estimated row count for statistics, which are not stale and relied upon by the Optimizer with a "high level of confidence," will closely approximate the actual number of rows.

Previously, I believed this to be the narrative. However, I was astounded when I sought to enhance the subsequent inquiry. Note that there were no archival data in the histogram at my disposal, and solely the statistics provided below were gathered:

COLLECT STATISTICS COLUMN (COL1) ON TheDatabase.TheTable;
SELECT * FROM TheDatabase.TheTable WHERE  COL1= 'ZZZ';

'ZZZ' was not found in the table. This information is also evident in the following SHOW STATISTICS output:

SHOW  STATISTICS VALUES COLUMN(COL1) ON TheDatabase. TheTable;

/** SummaryInfo **//* Version               */ 6,/* OriginalVersion       */ 6,/* DBSVersion            */ '14.10.06.03',/* UsageType             */ 'D',/* ComplexStatInfo       */ 'ComplexStatInfo',/* NumOfBiasedValues     */ 4,/* NumOfEHIntervals      */ 9,/* NumOfHistoryRecords   */ 0,/* HighModeFreq          */ 26412500,/* NumOfDistinctVals     */ 13,/* NumOfRows             */ 65057255,/** Biased: Value, Frequency **//*   1 */   'Text0', 253267,/*   2 */   'Text99', 26412500,/*   3 */   'Text25', 16767796,/*   4 */   'Text10', 21611177,
/** Interval: MaxVal, ModeVal, ModeFreq, LowFreq, OtherVals, OtherRows **//*     1   */ 'Text1', 'Text1', 55, 55, 0, 0,/*     2   */ 'Text2', 'Text2', 9840, 9840, 0, 0,/*     3   */ 'Text3', 'Text3', 2, 2, 0, 0,/*     4   */ 'Text4', 'Text4', 1965, 1965, 0, 0,/*     5   */ 'Text5', 'Text5', 1, 1, 0, 0,/*     6   */ 'Text6', 'Text6', 10, 10, 0, 0,/*     7   */ 'Text7', 'Text7', 4, 4, 0, 0,/*     8   */ 'Text8', 'Text8', 3, 3, 0, 0,/*     9   */ 'Text9', 'Text9', 635, 635, 0, 0

COL1='ZZZ' is not hitting a biased value (Text0, Text99, Text25, Text10) or any of the 9 equal height histograms. I would have expected that the cardinality estimation, therefore, would be 0 rows, but executing the EXPLAIN statement, I saw this:

Explain SELECT * FROM TheDatabase. TheTable WHERE  COL1 = 'ZZZ'

1) First, we lock a distinct TheDatabase."pseudo table"for read on a RowHash to prevent global deadlock forTheDatabase.TheTable.2) Next, we lock TheDatabase.TheTable for read.3) We do an all-AMPs RETRIEVE step fromTheDatabase.TheTable by way of an all-rowsscan with a condition of ("TheDatabase.TheTable.COL1 = 'XYZ'") intoSpool 1 (group_amps), which is built locally on the AMPs. Theinput table will not be cached in memory but is eligible forsynchronized scanning. The size of Spool 1 is estimated with highconfidence to be 5,004,405 rows (3,337,938,135 bytes). Theestimated time for this step is 37.22 seconds.4) Finally, we send out an END TRANSACTION step to all AMPs involvedin processing the request.-> The contents of Spool 1 are sent back to the user as the result ofstatement 1. The total estimated time is 37.22 seconds.

The estimation was about 5 Million rows with high confidence!

This appears peculiar to me as the Optimizer should know that none of the histograms contain the designated value. The projected cardinality must be zero.

The Optimizer performs the following calculation:

The number of Table Rows (65,057,255) / Number of Distinct Values (13) = 5,004,405 (The Number of table rows and distinct values you can see in the above SHOW STATISTICS output).

The rule is: We have, on average, 5,004,405 rows per value. As I don't have the WHERE condition value = 'ZZZ' in my histograms, I will use the average value for my estimations. In my example, this leads to an entirely wrong estimate, even with "high confidence".

I am even more concerned that the margin of error increases proportionally to the size of the table.

Selecting an available column value and a non-existent value from the same column will limit the total estimated cardinality to the number of table rows. This is a positive message to take away.

I hope this issue is a bug. If you have further information, kindly share it by commenting. Thank you!


Planning or surviving an enterprise data platform migration?

I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.

Subscribe before the paid plan launches and keep launch access.

Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.

Subscribe to DWHPro Letters

Practical field notes on enterprise data engineering, production AI systems, platform migration, and the senior engineering market.
Written by Roland Wenzlofsky Founder of DWHPro Author of Teradata Query Performance Tuning
Get the next issue
Subscribe