What is Teradata?
Teradata is a petabyte-scale, highly customizable MPP database solution. In existence for over 30 years, Teradata has an extremely mature feature set that caters to the most demanding requirements of some of the largest organizations in the world. Its shared-nothing architecture makes Teradata a powerful database capable of linear scalability.
While Teradata has traditionally been installed on-premises with proprietary hardware, Teradata has recently begun offering its database technology as a managed service in AWS or in their own private cloud.
Choosing Teradata as a Database
Because Teradata is one of the oldest and largest database providers in the world, the company has been able to built a completely comprehensive feature set that makes it possible to make Teradata work for almost any use case. A couple notable features include Teradata QueryGrid, a federation layer that enables access to datasets stored in Teradata and Hadoop, and Teradata’s powerful Workload Management software.
Given that Teradata bundles both the database hardware and software together, it’s possible to select from a large selection of platforms that vary based on workload. This makes it very easy to select hardware that’s optimized for your use case.
Some of Teradata’s features, if smartly implemented, can make Teradata an extremely performant database. For example, Teradata implements a hybrid row store / column store, which allows the user to store data either in row or column format. Another feature that boosts performance is the ability for Teradata to intelligently read data most frequently accessed into in-memory storage.
What is Teradata great for?
Fitting into a Large Enterprise Data Stack
The value of Teradata’s database solution can really only be realized at scale. Teradata’s massive functionality, integration, and concurrency is designed to be able to fit seamlessly into an enterprise data stack and be leveraged by a large number of users.
Building a Custom Data Stack from the Ground Up
Given Teradata’s mature feature set, large number of platform offerings, and ability to exist both on-premise and in the cloud, Teradata’s RDBMS is perfect for competent DBAs that want to customize their data stack from top to bottom.
Pricing for Teradata is going to vary based on whether the database is hosted on-premises or in the cloud. For Teradata deployed in the cloud, the pricing will depend on the the hosting provider (AWS / Azure / Teradata IntelliCloud).
On-Premise pricing data for Teradata will depend on the type, number, and size of the appliances purchased from Teradata. Specific pricing information is not included on the Teradata site.
Teradata on AWS
Teradata on AWS is sold on the AWS Marketplace, and comes with a pricing calculator that you can use to estimate what the monthly rate for an instance of Teradata would look like. Variables include feature set, storage type, instance size, and number of nodes. It’s possible to scale up to 32 nodes on AWS.
Teradata on Azure
Teradata on Azure is sold on the Azure Marketplace. The only plan that is offered on the Azure marketplace is an hourly plan. Teradata also integrates with Azure Blob Storage to manage backups and store additional data. It’s possible to scale up to 32 nodes on Azure.
Teradata also offers a fully managed software and hardware hosting option on their private cloud, called Teradata IntelliCloud. Pricing options vary for different Teradata and can be found here. Hosted by Teradata, the Teradata database is fully scalable.
Like most self-managed MPP databases, Teradata distributes both storage and compute to multiple nodes within its system. This architecture, known as a “shared-nothing-architecture”, means that nodes contain both compute (in the form of CPU and memory), and storage (in the form of disk space).
There are several components to the Teradata Architecture:
- Parsing Engine (PE) - Similar to a “leader node”, the parsing engine receives the request from a user, and uses the SQL from the user to construct a query plan that is then distributed to the BYNET
- BYNET - The communication engine between the parsing engine and the nodes (known as AMPs). BYNETs receive the query plan form the PE, and then decide which AMPs must be utilized.
- Access Module Processor (AMPS) - AMPs function as the “worker nodes” which contain a section of the distributed table, and process the query for the section of the data that resides on their individual disks. The intermediate results are then passed through BYNET to the PE, which in-turn passes that data to the user.
Types of data
Teradata stores the data in either row or columnar format, depending on the table Data Definition Language (DDL*). It can ingest both tabular and nested data structures originating from text (delimited) files, JSON, larger character objects (BLOB and CLOB), as well as geospatial data.
*Note that while Teradata can orient table storage in either direction (row or column), it will operate on the data in row format upon computation. Therefore, typical performance gains from column-based storage may be lower when compared to a native columnar database engine (Amazon Redshift or Vertica).
Maximum Recommended Size of Data
Teradata can support 50+ petabytes in the on-premises capacity. You can attain similar scalability with their private cloud, as it will run on the same hardware that powers on-premises installations. Since Teradata is still figuring out the intricacies of the public cloud, Teradata on AWS is a single-node cluster only with support for up to 48 TB of HDD storage (SSD storage limits are lower).
Database Management for Teradata
Teradata offers step-by-step instructions for implementing Teradata on AWS and Azure. For details about implementing Teradata on-premise, or on Teradata’s private cloud, it’s best to contact them directly.
Teradata also provides implementation services to help you get set-up with their platform.
Process for new data
You can leverage a standard SQL
COPY operation when ingesting data into Teradata. Higher volumes require Teradata’s proprietary ETL tools, called Teradata Parallel Transporter (TPT).
While the hardware component of Teradata can be fully managed, table architecture and query optimization is not. In order to attain acceptable performance out of Teradata, you must apply column encoding, partitioning, indexes, and statistics. General MPP tuning strategy can be applied when tuning a Teradata database.
Teradata has some of the best
EXPLAIN plans we’ve ever seen
If you’ve ever run an
EXPLAIN plan on your database, you know that they can be technically complex and require a fair amount of training to be able to read effectively. Teradata’s explain plans are human-readable, making it very easy to understand where the query is going.
Here’s an example of an
EXPLAIN plan for a full table scan from Teradata (from Teradata Wiki):
EXPLAIN SELECT * FROM Emp_Table; 1. First, we lock a distinct SQL_CLASS."pseudo table" for read on a RowHash to prevent global deadlock for SQL_CLASS.Emp_Table. 2. Next, we lock SQL_CLASS.Emp_Table for read. 3. We do an all-AMPs RETRIEVE step from SQL_CLASS.Emp_Table by way of an all-rows scan with no residual conditions into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 6 rows (342 bytes). The estimated time for this step is 0.03 seconds. 4. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.03 seconds. If you see all-AMPs RETRIEVE by way of an all-rows scan in your plan, that means teradata is doing a Full Table Scan. So it is reading every row in the table.
For contrast, here’s an
EXPLAIN plan for a similar query from Postgres documentation.
EXPLAIN SELECT * FROM tenk1; QUERY PLAN ------------------------------------------------------------- Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)