Teradata SQL Performance: IN vs BETWEEN and the Use of Statistics on Columns
Learn about optimizing SQL queries in Teradata 14.10 by analyzing the execution plan of two queries and using statistics on columns and expressions.
I want to present two queries for Teradata 14.10 before we delve into the specifics.
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';CREATE TABLE TheTable(PK INTEGER NOT NULL,TheCol CHAR(10)) PRIMARY INDEX (PK);COLLECT STATISTICS COLUMN (PK) ON TheTable;COLLECT STATISTICS COLUMN (TheCol) ON TheTable;The "TheTable" table in our test scenario comprises 200,000 rows. We obtained statistics on each column, including those that are indexed and non-indexed.
Both queries yield identical results. However, which query performs better is not immediately obvious.
Want more practical data engineering analysis like this?
Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. Early subscribers keep launch access before the paid plan launches.
Must the Optimizer utilize statistics for "TheCol"?
Let us examine the first query's execution plan.
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
3) We do an all-AMPs RETRIEVE step from Indexing. Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable. TheCol,1,1))<= '2') AND ((SUBSTR(TheTable. TheCol,,1,1))>= '1')") into Spool 1 (group_amps), which is built locally on the AMPs.
The input table will not be cached in memory but is eligible for synchronized scanning.
The size of Spool 1 is estimated with no confidence to be 40,000 rows (1,440,000 bytes).
The Optimizer approximates the result set to contain 40,000 rows, which is precisely 20% of the table's total rows. It's worth noting that this estimation holds no confidence, indicating that the optimizer used heuristics to determine the number of rows.
Get the next issue by email.
The collected statistics were unusable. The 20% heuristic is used for estimating closed ranges.
Let's revisit our second query without altering any statistics:
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');3) We do an all-AMPs RETRIEVE step from Indexing. Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable. TheCol ,1,1))= '1') OR ((SUBSTR(TheTable. TheCol,1,1))= '2')") into Spool 1 (group_amps), which is built locally on the AMPs.
The input table will not be cached in memory but is eligible for synchronized scanning. The result spool file will not be
cached in memory. The size of Spool 1 is estimated with low confidence to be 120,493 rows (4,337,748 bytes).
The explain plan indicates that the Optimizer utilizes statistics found in the "TheCol" column. The estimation is of "low confidence" type, and the projected number of rows closely approximates the actual number of 122,000.
Conclusion:
To optimize queries for closed ranges with limited values, use the IN(val1,...,valn) variation. This enables the Optimizer to utilize statistics.
Lastly, allow me to present an alternative option available in Teradata 14.10:
Teradata Tuning with Expression Statistics
COLLECT STATS COLUMN SUBSTR(TheCol,1,1) AS CHAR1 ON TheTable;SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';The Optimizer can utilize available statistics for both queries with high confidence.
Limiting the use of expression statistics is advisable since they are only beneficial for certain queries. Column statistics, on the other hand, have a broader range of uses. In our test scenario, for instance, we can solve the issue using column statistics and the IN list syntax, thereby avoiding the need for expression statistics.
Summary:
SQL is typically studied in diverse settings and through multiple educational resources. Our familiarity with specific statements can lead to their frequent use. This article highlights the benefits of exploring alternative approaches.
Planning or surviving an enterprise data platform migration?
I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.
Subscribe before the paid plan launches and keep launch access.
Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.