Filter efficiently with Teradata NOS

tune3

Teradata NOS facilitates querying data in an S3 object store with ease. To attain maximum performance, partitioning external data is crucial for efficient reading. This article outlines the key considerations for optimal efficiency when reading data from the object store.

To begin, we must establish S3 access by obtaining an AUTHORIZATION object. In this instance, we will be utilizing Amazon AWS, which requires the following credentials (note that these are not my personal credentials): user and password.

CREATE AUTHORIZATION authorization_dwhpro AS DEFINER TRUSTED USER 'AKIAU56DUTZWABDFJ7AGC'PASSWORD 'LtOkSzse74ndk9388uL6y8WBnAAP9gGuNlr1/';

In the next step, we create a FUNCTION MAPPING (WRITE_NOS as we want to write parquet files) and link it to our AUTHORIZATION object:

Want more practical data engineering analysis like this?

Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. DWHPro Letters is free. Subscribe to get new issues by email.

Get the next issue

CREATE FUNCTION MAPPING WRITE_NOS_DWHPROFOR WRITE_NOS EXTERNAL SECURITY DEFINER TRUSTED authorization_dwhpro USING LOCATION, STOREDAS, MAXOBJECTSIZE, COMPRESSION, NAMING, INCLUDE_ORDERING, INCLUDE_HASHBY, MANIFEST FILE, MANIFESTONLY, OVERWRITE, ANY IN TABLE;

Next, we will create a FUNCTION MAPPING called WRITE_NOS, allowing us to write parquet files. This mapping will be linked to our AUTHORIZATION object. With these preparations, we can now write a test table to our Amazon S3 bucket as parquet files. Our test table consists of one date column, one INTEGER column, and one DECIMAL column. To enable later filtering, we partition the parquet files by the year of the date column and include this column in the files.

SELECT NodeId, AmpId, Sequence, ObjectName, ObjectSize, RecordCountFROM WRITE_NOS_DWHPRO (ON ( SELECT t01.*, EXTRACT(YEAR FROM TheDate) AS TheYear FROM DWHPRO.Table1 t01) PARTITION BY TheYear ORDER BY TheYear USING LOCATION('/s3/dwhpro.s3.amazonaws.com/data/') STOREDAS('PARQUET') COMPRESSION

This is what our S3 bucket looks like after we exported the test table. Under "data/" there is one directory per year in which the corresponding parquet file is located.

Next, we create a FUNCTION MAPPING (READ_NOS as we want to read parquet files) and link it to our AUTHORIZATION object. We can now read our test table from the parquet files in our Amazon S3 bucket. However, to read parquet files, you need an EXTERNAL table (this is not the case with CSV and JSON).

CREATE FUNCTION MAPPING DWHPRO.READ_NOS_DWHPROFOR READ_NOS EXTERNAL SECURITY DEFINER TRUSTED authorization_dwhpro USING BUFFERSIZE, SAMPLE_PERC, ROWFORMAT, RETURNTYPE, HEADER, MANIFEST, LOCATION, STOREDAS, FULLSCAN,ANY IN TABLE;

Get the next issue by email.

Before creating the EXTERNAL TABLE, we will get the metadata (with NOSREAD_PARQUET_SCHEMA) from parquet files to know what the external table should look like. We decide to take the one for the year 2022:

SELECT * FROM READ_NOS_DWHPRO (USING STOREDAS('PARQUET') FULLSCAN('TRUE')RETURNTYPE('NOSREAD_PARQUET_SCHEMA') LOCATION ('/s3/dwhpro.s3.amazonaws.com/data/2022')) AS D;

Once the query is executed, metadata becomes available, providing insight into our external table's required Teradata data types.

Col	Name	Teradata Datatype	Physical Type	Logical Type	Precision	Scale
1	PK	INTEGER	INT32	NONE	0	0
2	A_Number	DECIMAL	BYTE_ARRAY	DECIMAL	38	18
3	TheDate	DATE	INT32	DATE	0	0
4	TheYear	INTEGER	INTEGER	NONE	0	0

Creating a FOREIGN Table is now effortless with the following statement. It is essential to have a column-partitioned table for parquet files. Alternatively, you could replicate the FOREIGN table data into a Teradata table and select the most appropriate DDL for your workload. When reading data, the PATHPATTERN allows variable usage for folder navigation in the S3 bucket.

CREATE FOREIGN TABLE Table1_Parquet, EXTERNAL SECURITY DEFINER TRUSTED authorization_dwhpro(Location VARCHAR(2048) CHARACTER SET UNICODE CASESPECIFIC, PK INTEGER, A_Number DECIMAL(38,18), TheDate DATE, TheYear INTEGER) USING (LOCATION ('/s3/dwhpro.s3.amazonaws.com/data')PATHPATTERN('$data/$year1/$year') STOREDAS ('PARQUET')) NO PRIMARY INDEX, PARTITION BY COLUMN ;

Column Filtering

Column filtering corresponds to a conventional filter in the WHERE condition. However, all folder/parquet files under /data must be read and brought into Teradata. Depending on how much data we have swapped out, this can have a huge impact on the performance of the query.

SELECT * FROM Table1_Parquet WHERE TheYear = 2022;

Path Filtering

We can pre-determine the paths to be read from the S3 bucket by utilizing path filtering. As the folders have been organized by year, we can optimize the SQL query by using PATHPATTERN to navigate through the folders. This approach yields the same result as column filtering but with increased efficiency.

SELECT * FROM Table1_Parquet WHERE $PATH.$year = 2022;

Performance Results

After executing the queries, we examine the query log for runtime.

SELECT STARTTIME, FIRSTRESPTIME, AND QUERYBAND FROM DBC.DBQLOGTBLWHERE QUERYBAND LIKE '%Filter%'AND STATEMENT TYPE = 'SELECT' ORDER BY STARTTIME DESC;

STARTTIME	FIRSTRESPTIME	QUERYBAND
2022-10-20 15:03:34.22	2022-10-20 15:07:34.35	Column
2022-10-20 14:53:06.03	2022-10-20 14:53:09.33	Path

Path Filtering returns the query in seconds, whereas Column Filtering takes several minutes.

EXPLAIN PLAN Column Filtering, moving all sub-folders to Teradata:

1) First, we lock DWHPRO.Table1_Parquet for read.2) Next, we do a single-AMP RETRIEVE step from DWHPRO.Table1_Parquet metadata by way of an all-rows scan with no residual conditions into Spool 2 (one-amp), built locally on that AMP. Then we do a SORT to order Spool 2 by the sort key. The size of Spool2 is estimated with low confidence to be 202 rows (142,410 bytes). The estimated time for this step is 0.50 seconds.3) We do a single-AMP RETRIEVE step from Spool 2 (Last Use) by way of an all-rows scan into Spool 3 (all_amps), which is bin packed and redistributed by size to all AMPs in TD_Map1. The size of Spool 3 is estimated with low confidence to be 202 rows (144,026 bytes). The estimated time for this step is 0.00 seconds.4) We do an all-AMPs RETRIEVE step in TD_MAP1 from 6-column partitions of DWHPRO.Table1_Parquetby way of external metadata in Spool 3 (Last Use) with a condition of ("DWHPRO.Table1_Parquet.TheYear = 2022") into Spool 1(group_amps), which is built locally on the AMPs. The input table will not be cached in memory but is eligible for synchronized scanning. The result spool file will not be cached in memory. The size of Spool 1 is estimated with no confidence to be 867,775 rows (636,079,075 bytes). The estimated time for this step is 14 minutes and 9 seconds.5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 14 minutes and 10 seconds.

EXPLAIN PLAN Path Filtering, filtering done by only extracting data from the /data/2022 folder:

1) First, we lock DWHPRO.Table1_Parquet for read.2) Next, we do a single-AMP RETRIEVE step from DWHPRO.Table1_Parquetmetadata by way of an all-rows scan with a condition of ("(TD_SYSFNLIB.NosExtractVarFromPath (DWHPRO.Table1_Parquet.Location, '/s3/dwhpro.s3.amazonaws.com', 3)(FLOAT, FORMAT '-9.99999999999999E-999'))= 2.02200000000000E 003")into Spool 2 (one-amp), which is built locally on that AMP. Then we do a SORT to order Spool 2 by the sort key. The size of Spool2 is estimated with no confidence to be 21 rows (14,805 bytes). The estimated time for this step is 0.50 seconds.3) We do a single-AMP RETRIEVE step from Spool 2 (Last Use) by way of an all-rows scan into Spool 3 (all_amps), which is bin packed and redistributed by size to all AMPs in TD_Map1. The size of Spool 3 is estimated with no confidence to be 21 rows (14,973 bytes). The estimated time for this step is 0.00 seconds.4) We do an all-AMPs RETRIEVE step in TD_MAP1 from 6-column partitions of DWHPRO.Table1_Parquetby way of external metadata in Spool 3 (Last Use) into Spool 1(group_amps), built locally on the AMPs. The input table will not be cached in memory but is eligible for synchronized scanning. The result spool file will not be cached in memory. The size of Spool 1 is estimated with low confidence to be 902,142 rows (661,270,086 bytes). The estimated time for this step is 14 minutes and 10 seconds.5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.

Planning or surviving an enterprise data platform migration?

I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.

Subscribe for free and keep launch access.

Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.

Column Filtering

Path Filtering

Performance Results

Subscribe to DWHPro Letters