Share

The Importance of Avoiding SQL Functions on WHERE Condition Columns for Optimizer Performance

Learn why it's important to be careful when applying SQL functions on WHERE condition columns and how it can impact query performance.

The Importance of Avoiding SQL Functions on WHERE Condition Columns for Optimizer Performance
tune2

It is important to exercise caution when utilizing SQL functions on WHERE condition columns as they can limit the Optimizer's ability to employ available statistics. Without collected statistics, estimations are conducted via either random AMP sampling (for indexed columns) or heuristics (for non-indexed columns). A notable disadvantage of heuristics is discussed in the following article: Teradata Heuristics.

Incorrect physical data models often use functions on columns within WHERE conditions. While minimizing costs is important, one should also consider the adage that purchasing something cheaply may ultimately result in higher costs.

The following example violates basic normalization rules:


Want more practical data engineering analysis like this?

Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. Early subscribers keep launch access before the paid plan launches.

Get the next issue


The CustomerType column contains dual information and requires division into two distinct columns.

COLLECT STATISTICS ON Customer COLUMN(CustomerType);SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) || SUBSTR(CustomerType,10,4) = ‚GOLDPRIV';

The Optimizer cannot utilize the gathered statistics for the CustomerType column and must instead rely on basic heuristic approximations.

Assuming our Customer table has 100,000 rows and we have 100 customers with a "Gold" status in the "Private" category, the execution plan will appear as follows:

"We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of
("((SUBSTR(Customer .ProductCode,1,4))||(SUBSTR(Customer .ProductCode,10,4)))=‚GOLDPRIV'")
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with no confidence to be 10,000 rows."

The Optimizer assumes a 10% row selection rate based on heuristics, which is inaccurate. The actual selection rate is only 100 rows.

Get the next issue by email.

The example is simple: the execution plan includes just one retrieve step. The column in the WHERE condition is not indexed, and its estimation does not affect the execution plan. The plan will inevitably entail a full table scan if no alternative access paths (such as join indexes or secondary indexes) are present.

Query execution plans are often more intricate than in the aforementioned simple example. Retrieval would be only one stage among numerous other steps. Miscalculations would result in poor query performance, as illustrated by the example above.

How can we enhance our query? We need to improve the estimations for retrieval.

Below is an example of how the query could be rewritten. The optimized query allows the Optimizer to use the collected statistics on column CustomerType, and as a consequence, the retrieval step evaluates to "low confidence." The estimated number of selected rows reflects reality:

SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) = ‚GOLD' AND SUBSTR(CustomerType,10,4) = ‚PRIV';

"We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of ("((SUBSTR(Customer. CustomerType,1,3))= '001')
AND ((SUBSTR(Customer .CustomerType,6,3))= '999')")
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 100 rows."

When designing queries with complex WHERE conditions, adhering to a fundamental rule is crucial.

When nesting WHERE condition column functions, prioritize applying the SUBSTR and LIKE functions first.

The Optimizer can utilize column statistics even when querying only a partial character string by leveraging the information stored in statistic histograms.

Unfortunately, statistics cannot be applied if the first function used is TRIM, although there are some exceptions, such as the UPPER or LOWER functions.

My most valuable advice is to prioritize fixing your data model before attempting to address problematic queries.


Planning or surviving an enterprise data platform migration?

I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.

Subscribe before the paid plan launches and keep launch access.

Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.

Subscribe to DWHPro Letters

Practical field notes on enterprise data engineering, production AI systems, platform migration, and the senior engineering market.
Written by Roland Wenzlofsky Founder of DWHPro Author of Teradata Query Performance Tuning
Get the next issue
Subscribe