How Teradata Optimizer Uses Multi-Column Statistics

tune1

A recent question came in about how the Teradata Optimizer uses multi-column statistics.

Here are the essential details:

The Optimizer uses multi-column statistics when the query's WHERE clause covers all columns.

Want more practical data engineering analysis like this?

Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. DWHPro Letters is free. Subscribe to get new issues by email.

Get the next issue

COLLECT STATISTICS ON DB.THE_TABLE COLUMN (first_column,second_column);EXPLAINSELECT *  FROM DB.THE_TABLE WHERE first_column ='A'  AND second_column = 'B';

3) We do an all-AMPs RETRIEVE step from DB.THE_TABLE by way ofan all-rows scan with a condition of ("(DB.THE_TABLE.second_column ='B ') AND (DB.THE_TABLE.first_column = 'A ')") into Spool 1(group_amps), which is built locally on the AMPs.  The size ofSpool 1 is estimated with low confidence to be 669,939 rows (1,426,300,131 bytes).  The estimated time for this step is 9.58seconds.

This example pertains to Teradata 13.10. The query was executed without gathering Primary Index statistics, resulting in low confidence from the Optimizer. To boost confidence to a high level, collecting Primary Index statistics is necessary. These statistics provide the Optimizer with details on table cardinality.

Primary Index statistics are unnecessary in Teradata 14.10 due to the availability of summary statistics that include table cardinalities. These statistics are obtained automatically when any statistics are collected. To add summary statistics in Teradata 14.10 without collecting statistics, execute the following statement to achieve a high level of confidence:

HELP STATS DB.THE_TABLE; -- Adds summary statistics for free…

Changing the order of statistic columns has no impact on Teradata 13.10 or Teradata 14.10. The example below delivers the same result as the previous one:

Get the next issue by email.

DROP STATISTICS ON DB.THE_TABLE;COLLECT STATISTICS ON DB.THE_TABLE COLUMN (second_column,first_column);

EXPLAINSELECT *  FROM DB.THE_TABLE WHERE first_column ='A'  AND second_column = 'B';

3) We do an all-AMPs RETRIEVE step from DB.THE_TABLE by way of
an all-rows scan with a condition of ("(DB.THE_TABLE.second_column =
'B ') AND (DB.THE_TABLE.first_column = 'A ')") into Spool 1
(group_amps), which is built locally on the AMPs.  The size of
Spool 1 is estimated with low confidence to be 669,939 rows (
1,426,300,131 bytes).  The estimated time for this step is 9.58
seconds.

Can the Optimizer use multi-column statistics if a query only references some of the WHERE condition's columns?

In Teradata 13.10, the Optimizer can only use a multi-column statistic when all columns in the WHERE condition are referenced in the query.

The Optimizer's heuristics assume that Teradata will retrieve 10% of the table's rows in the following example:

COLLECT STATS ON DB.The_Table COLUMN (first_column,second_column);

EXPLAIN SELECT * FROM DB.The_Table WHERE first_column = 'A';

3) We do an all-AMPs RETRIEVE step from DB.The_Table by way of
an all-rows scan with a condition of ("DB.The_Table.first_column =
'A'") into Spool 1 (group_amps), which is built locally on the
AMPs. The size of Spool 1 is estimated with no confidence to be
669,939 rows (1,426,300,131 bytes). The estimated time for this
step is 9.58 seconds.

Teradata 14.10 introduces substantial enhancements that help the Optimizer make more informed decisions. Specifically, the Optimizer uses multi-column statistics when columns are referenced, but only if the referenced columns are located at the start of the statistical definition.

COLLECT STATS ON DB.The_Table COLUMN (first_column,second_column);

EXPLAIN SELECT * FROM DB.The_Table WHERE first_column = 'A';

3) We do an all-AMPs RETRIEVE step from DB.The_Table by way ofan all-rows scan with a condition of ("DB.The_Table.first_column ='A'") into Spool 1 (group_amps), which is built locally on theAMPs. The size of Spool 1 is estimated with low confidence to be309,034 rows (333,447,686 bytes).  The estimated time for thisstep is 0.09 seconds.Here is an example where the Optimizer can't use the multi-column statistics: The referenced column is not located at the beginning of the definition of the multi-column statistics:

COLLECT STATS ON DB.THE_TABLE COLUMN (second_column,first_column);

EXPLAIN SELECT * FROM DB.THE_TABLE WHERE first_column = 'A';

3) We do an all-AMPs RETRIEVE step from DB.THE_TABLE by way ofan all-rows scan with a condition of ("DB.THE_TABLE.first_column ='A'") into Spool 1 (group_amps), which is built locally on theAMPs. The size of Spool 1 is estimated with no confidence to be61,853 rows (66,739,387 bytes).  The estimated time for this stepis 0.05 seconds.

I hope this post has clarified how multi-column statistics are used. Feel free to ask if anything remains unclear.

Planning or surviving an enterprise data platform migration?

I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.

Subscribe for free and keep launch access.

Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.

Can the Optimizer use multi-column statistics if a query only references some of the WHERE condition's columns?

Subscribe to DWHPro Letters