Share

Implementing Teradata Compression: A Guide to Managing Data Warehouse Capacity and Saving Space

Learn how to use compression as a table column addition to manage your data warehouse's space effectively. Discover constraints and best practices.

Implementing Teradata Compression: A Guide to Managing Data Warehouse Capacity and Saving Space
tune3

When Teradata compression becomes pressing

Have you noticed your data warehouse nearing capacity or a significant increase in Current Perm for specific tables during data loading?

Implement compression as a supplement to your space management strategy.

Compression is a table column addition that designates frequently occurring values as "given" to avoid taking up space with each instance in the row. The value is stored in a "higher order" within the table header definition.


Want more practical data engineering analysis like this?

Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. Early subscribers keep launch access before the paid plan launches.

Get the next issue


Before exploring the opportunities, it is important to acknowledge the constraints:

  • Not all table types can be compressed; volatile tables are exempt.
  • Compression is not supported on index or partitioning columns.
  • Data types such as VARCHAR or TIMESTAMP cannot be compressed depending on the Teradata Release you are working with.
  • You can only add compression values while space remains available in the table header section.

Now, what actions are necessary and advisable?

Ideally, reliable data demographics or forecasts should already be available in tabular form. If not, conduct a row count of the attributes you wish to compress and sort them by descending order of row number.

Compression is effective for attributes with few values, particularly for constant flags or type code columns in large tables with millions of records. We advise against compressing IDs or columns with a unique value for each row. Columns with an even distribution of numerous distinct values do not typically yield significant compression results.

A budget approach for getting it done

Deep technical knowledge is not required for selecting columns, gathering documentation, and making recommendations. Use the following method as a quick guide to getting results efficiently.

Apply the following filters to all columns in every table of your database:

Technical filter1:

Eliminate columns that cannot be compressed due to their technical function, such as primary index or partition columns. This filtering is irreversible at this stage and can only be changed by modifying the physical data model.

Technical filter 2:

Filter out columns with incompatible data types for compression. This criterion is more lenient because data types can be modified if the initial design relied more on speculation than caution.

A global minimum benefit filter can be used to eliminate tables that fall below the current size and have no further growth expected. If small tables are not present in the database or if you prefer not to exclude them immediately, you can rank them based on non-technical criteria such as usage in reporting.

Now, we will assess the remaining table columns to determine the concentration of values in each. A column with a high concentration of only a few values is more suitable for compression than one with numerous values spread uniformly. To measure this concentration, we will use the Herfindahl-Index (HI), a standard market concentration measure. The higher the value of the HI for a column, the more potential there is for compression.

To store the numeric basis of our decision, we keep a record of the HI value in a register table like this:

-- COLUMN DEMOGRAPHICS REGISTER

Get the next issue by email.
CREATE MULTISET TABLE P_1111_A.DMnn_COL_HI_INDEX ,NO FALLBACK ,NO BEFORE JOURNAL,NO AFTER JOURNAL,CHECKSUM = DEFAULT(STATUS_DT DATE FORMAT 'YYYY-MM-DD',TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,HI_INDEX DECIMAL(38,18))PRIMARY INDEX ( TableName, ColumnName );

We insert the HI values by first preparing the insert statement for all relevant tables that we will then execute:

-- HHI INDEX CALCULATION PREPARATION

sel TABLENAME, columnname,'INSERT INTO P_1111_A.DMnn_COL_HI_INDEX SELECT CURRENT_DATE, '''|| trim(tablename) ||''','''|| trim(columnName) || ''','|| trim(columnName) || ',  SUM (CAST( (ANZAHL/GES)**2 AS DECIMAL(38,18)) ) AS HI_INDEXFROM( SELECT ' ||trim(columnName)|| ', CAST(COUNT(*)  AS DECIMAL(38,18))  ANZAHL FROM ' || trim(databasename) ||'.'|| trim(tablename) ||'  GROUP BY 1 ) xJOIN(SELECT CAST(COUNT(*)  AS DECIMAL(38,18)) AS GES FROM ' || trim(databasename) ||'.'|| trim(tablename) || ' ) GESON 1=1; '  AS HHIfrom dbc.columnswhere databasename='P_1111_A'and columnname NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')and substr(TABLENAME,1,1) IN ('P','T')and ColumnType NOT IN ('CV','TS','BO')order by 1,2

;

Note that the query has a few constraints on the dbc column list to remove mere technical fields and exclude non-production tables and columns with non-compressible fields.

Per the table, we select those columns with an HI high enough to prepare an insert statement of compression value candidates. For documentation purposes, we create a second register table first:

-- COLUMN COMPRESSION CANDIDATES

CREATE MULTISET TABLE P_1111_A.DMnn_COL_COMP_CAND ,
NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT
(
STATUS_DT DATE FORMAT 'YYYY-MM-DD',
TableName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
ColumnName CHAR(30) CHARACTER SET LATIN NOT CASESPECIFIC,
COL_VALUE CHAR(100) CHARACTER SET LATIN NOT CASESPECIFIC,
VALUE_SHARE   DECIMAL(18,2)  )
PRIMARY INDEX ( TableName, ColumnName, COL_VALUE ); -- FIND ALL COMPRESSION VALUE CANDIDATES

sel TABLENAME, columnname,'INSERT INTO P_1111_A.DMnn_COL_COMP_CAND  SELECT CURRENT_DATE, '''|| trim(tablename) ||''','|| trim(columnName) || ', CAST( (ANZAHL/GES)*100 AS DECIMAL(38,2)) AS VALUE_SHAREFROM( SELECT ' ||trim(columnName)|| ', CAST(COUNT(*)  AS DECIMAL(38,18))  ANZAHL FROM ' || trim(databasename) ||'.'|| trim(tablename) ||'  GROUP BY 1 ) xJOIN(SELECT  CAST(COUNT(*)  AS DECIMAL(38,18)) AS GES , CAST(COUNT(DISTINCT ' ||trim(columnName)|| ')  AS DECIMAL(38,18)) AS DDVAL FROM ' || trim(databasename) ||'.'|| trim(tablename) || ' ) GESON 1=1WHERE CAST( (ANZAHL/GES)*100 AS DECIMAL(38,2)) > CAST( (1/DDVAL)*100 AS DECIMAL(38,2))ORDER BY 3 DESC;'   AS COMPRESSION_VALUESfrom dbc.columnswhere databasename='P_1111_A'and columnname NOT IN ('X1_ID','X2_ID','X3_ID','X4_ID','X5_DT','X6_DT','X7_DT')and substr(TABLENAME,1,1) IN ('P','T')and ColumnType NOT IN ('CV','TS','BO')order by 1,2;

You can use both register tables to put together a compression recommendation.

Register tables facilitate the periodic insertion of updated demographic information using the STATUS_DT field to record demographic changes.

Before finalizing your deployment list, consider the long-term perspective of your compression project.

What is the frequency of your compression list updates?

How frequently do you anticipate it will be required?

What is the level of volatility of the column value set over time?

As long as they exist, Yes-No flags will hold either Yes or No, while daily account balances or order numbers are temporary.

Teradata cannot answer these questions for you. Instead, the answers should come from discussions with business representatives and management — or from your own investigation if those conversations are not possible.

A success story about Teradata compression to inspire you

A year ago, we reviewed numerous tables with little or no compression applied. The database had reached its capacity limit, and acquiring additional space of the required scale was not feasible. An initial examination of the tables' contents showed that compression was a viable option.

The results of introducing compression systematically were very satisfying:

While

  • No other changes than compression were implemented and
  • working in a production environment with daily growth,
  • Over 1 TB of space was saved in total,
  • an average reduction of 25% and 76.2% on a table level.

The result gave us many months more for daily production!

It paved the way for other necessary activities, leading to even more data being inserted.


Planning or surviving an enterprise data platform migration?

I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.

Subscribe before the paid plan launches and keep launch access.

Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.

Subscribe to DWHPro Letters

Practical field notes on enterprise data engineering, production AI systems, platform migration, and the senior engineering market.
Written by Roland Wenzlofsky Founder of DWHPro Author of Teradata Query Performance Tuning
Get the next issue
Subscribe