Optimizing Performance with Teradata ORDER BY: How Sorting is Done in Parallel Without Bottlenecks
Teradata ORDER BY and Performance
To employ a basic sorting algorithm, all rows must be present in one location for sorting. However, this is not feasible in Teradata, where numerous AMPs each retain a segment of every table. Transporting all the rows to one AMP for sorting would result in a non-scalable and bottlenecked process.
Teradata optimizes the shared-nothing architecture, performing parallel sorting at each level without redistributing rows. Sorting constitutes the final step in the execution plan.
Here we present the algorithm employed in the ORDER BY statement of Teradata:
Want more practical data engineering analysis like this?
Join DWHPro Letters and get field-tested notes on Teradata, Snowflake, AI, migrations, performance, and enterprise data work. Early subscribers keep launch access before the paid plan launches.
- First, the table rows are sorted on each AMP where they reside. A higher number of AMPs permits higher sort performance, as all AMPs are sorted in parallel. The sorted rows are put into a spool table.
- After AMPs finish their local sort step, Teradata will return the local spools to the requesting client. For performance reasons, Teradata creates a global buffer per node and on the parsing engine level and informs the client that the result set is available.
- The client fetches as many rows as needed (recall that we often abort requests after having a specific number of rows available in SQL Assistant).
- Each fetch request causes each AMP to move its top row into the buffer, which is merged globally in sort order. Subsequently, each AMP puts its next row into the buffer where the sort order merges them. This process continues until the global buffer is full and sent via the attached network to the client. The merge process itself is handled by the BYNET software (which manages the merge buffer on its own).
- The above-described process continues as long as the client fetches rows from the result set, or until no more rows are available.

Teradata uses a sorting algorithm to arrange data in a specific order.
This sorting algorithm's exceptional performance comes from the following qualities:
- The AMP local pre-sorting step is done in parallel on all AMPs, and sort performance increases with the system size (good scalability).
- Teradata must sort only the client-requested rows in the last global merge step. Usually, when the client requests the whole table (such as in SQL Assistant), we cancel the request after having a small number of rows on our screen. It would not make sense to perform the global sort for millions of rows that the user will never see.
Planning or surviving an enterprise data platform migration?
I write regularly about the performance, cost, architecture, and project mistakes that show up in real Teradata, Snowflake, Databricks, and enterprise data work.
Subscribe before the paid plan launches and keep launch access.
Written by Roland Wenzlofsky, founder of DWHPro and author of Teradata Query Performance Tuning. DWHPro has helped data warehouse practitioners for 15+ years.