On-Demand MPP Database Overview
On-Demand MPP Databases are cloud-hosted, analytical data warehouses that dynamically adjust their size depending on the difficulty of their workload.
In order to automatically scale compute resources in line with query size and complexity, the architectures of these databases characteristically separate storage from compute. To deal with storage, these databases leverage massive shared cloud infrastructure that provides essentially limitless storage (e.g. AWS S3, Azure Storage, and Google Cloud Storage).
In many cases, they provide the ability to process semi-structured or unstructured data, as well as the structured data that warehouses can more generally handle.
Even though on-demand MPP databases are enormously complicated, from the end users’ perspective, they’re actually quite simple to operate. This is because the physical hardware and many (or all) of the complex technical processes are handled by the cloud provider. This ensures a seamless user experience for uploading and querying data.
Compared to self-managed MPP databases, where the user is responsible for upsizing their cluster to increase storage or computing power, on-demand databases can be easily, and in some cases automatically, scaled up. As in most cases, outsourcing Ops work makes things easier for you, but can limit customization options for more advanced users.
What are On-Demand MPP Databases really great for?
Consistent Performance no matter the size of your data
On-Demand MPP databases are architected to pull in as many compute resources as necessary to execute a query efficiently, regardless of how large the query or dataset is. For the end user, this generally means queries stay fast even as the data grows.
Ease of use
Much of the hardware and many of the complex technical procedures for these databases are abstracted away from end users, allowing them to spin up and manage these databases without a lot of dev/ops help.
Because storage is basically limitless and compute resources can easily be scaled up or down (if they’re not automatically scaled for you), these systems need much less hand-holding than your average self-managed solution.
Paying for only what you use
Although each on-demand MPP database has its own pricing structure, in general, their approach is to provide variable pricing based on usage, rather than a huge up-front cost. This, combined with their ease of use, makes them particularly well suited to a low-commitment trial.
Popular On-Demand Database Solutions
Database Architecture of On-Demand MPP Databases
Storage and compute resources are separate
A large difference between on-demand MPP databases and self-managed MPP databases is that on-demand databases decouple storage from compute resources.
Self-managed MPP databases are composed of clustered servers (often called nodes) and, for efficiency, colocate storage and compute capabilities in each node. This cuts down on networking costs and latency, but means that as you scale up compute you must also scale up storage (and vice versa). This type of architecture is known as a “Shared-Nothing Architecture”, because each node contains its own computing and storage resources.
On-Demand MPP databases, by contrast, share storage and compute resources across the entire instance, allowing both to scale seamlessly with the number and size of queries. This architecture enables consistently fast performance regardless of data size and can also allow multiple compute clusters to access the same stored data without moving it.
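To make the scaling difference concrete, here is a minimal Python sketch. The per-node figures (2 TB of disk and 4 compute units per node) and the function names are hypothetical, chosen only for illustration; they don’t describe any particular vendor’s hardware.

```python
import math


def coupled_nodes(storage_tb, compute_units, tb_per_node=2, units_per_node=4):
    """Nodes needed when each node bundles fixed storage and compute (shared-nothing)."""
    nodes_for_storage = math.ceil(storage_tb / tb_per_node)
    nodes_for_compute = math.ceil(compute_units / units_per_node)
    # The larger requirement wins, so the other resource ends up over-provisioned.
    return max(nodes_for_storage, nodes_for_compute)


def decoupled_resources(storage_tb, compute_units):
    """With storage and compute separated, each is sized on its own."""
    return {"storage_tb": storage_tb, "compute_units": compute_units}


# A storage-heavy, compute-light workload: 100 TB of data, 8 compute units.
nodes = coupled_nodes(storage_tb=100, compute_units=8)
idle_compute_units = nodes * 4 - 8  # 4 = hypothetical compute units per node
print(nodes, idle_compute_units)
print(decoupled_resources(storage_tb=100, compute_units=8))
```

In this toy model the shared-nothing cluster needs 50 nodes just to hold the data, which leaves 192 of its 200 compute units unused, while the decoupled model provisions exactly the storage and compute that were asked for.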
Storage scales seamlessly
Rather than distributing data tables across a cluster of nodes, on-demand MPP databases leverage massive shared cloud object stores, such as Amazon S3, Microsoft Azure Storage, or Google Cloud Storage, as storage receptacles. One of the benefits of utilizing these object stores is that they’re able to store structured, semi-structured, and unstructured data. Though capabilities for processing this data vary from database to database, all on-demand MPP databases can at least access this non-structured data, meaning they can add functions for handling it in the future.
These object stores are also nearly infinitely scalable. Unlike a self-managed MPP architecture, where storage is limited to the disk space available on each node (and must be manually upsized when the amount of available storage begins to run out), massive distributed object stores such as S3 are designed to always provide additional space for your data, and automatically grow as more data is added, with no discernible effect on performance.
Computing Scales Seamlessly
Decoupling storage from compute resources allows these databases to scale processing power as necessary for individual queries. Elastic databases are able to leverage vast processing infrastructures, composed of hundreds or thousands of individual nodes, and devote the processing power of these nodes to individual queries for seconds at a time.
When the computing power required for these queries is spread over hundreds of nodes, query return time remains quick no matter the size of the query. This means that whether you’re doing a simple count of a few million rows, an expensive REGEX on a 100-billion row table using BigQuery, or a query over exabytes of data stored in S3 using Redshift Spectrum, you can expect your query to return in seconds or minutes, not hours or days.
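The arithmetic behind this can be sketched in a few lines of Python. This is a deliberately idealized model: the per-node throughput of 5 million rows per second is an invented figure, and the model assumes work divides perfectly across nodes, ignoring coordination overhead, data skew, and network time that real systems incur.

```python
def estimated_seconds(rows, nodes, rows_per_node_per_sec=5_000_000):
    """Idealized scan time when work is split evenly across nodes.

    rows_per_node_per_sec is a hypothetical throughput, not a measured one.
    """
    return rows / (nodes * rows_per_node_per_sec)


table_rows = 100_000_000_000  # a 100-billion row table

single_node = estimated_seconds(table_rows, nodes=1)
two_thousand_nodes = estimated_seconds(table_rows, nodes=2_000)

print(f"1 node:      {single_node / 3600:.1f} hours")
print(f"2,000 nodes: {two_thousand_nodes:.0f} seconds")
```

Under these assumptions, a scan that would occupy one node for hours finishes in seconds when the same work is borrowed from a couple of thousand nodes, which is the effect the paragraph above describes.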
On-Demand MPP Database Constraints
On-demand MPP databases trade flexibility and customization for simplicity and efficiency. So, if you’re upset about the lack of customization in an on-demand MPP database, the solution is probably to use a self-managed MPP database. Similarly, if you’re overwhelmed by the dedicated resources required to maintain a self-managed MPP database, you should look at an on-demand database as an alternative.
The stark tradeoff between benefits and constraints arises from the architecture of these databases. Self-managed MPP database plans loan or sell consumers individual servers or portions of servers, which they are then free to configure and customize. On-demand databases, on the other hand, loan consumers the processing power from their massive cluster at query-time, but don’t loan consumers individual machines, which limits customization.
A second constraint is visibility. Because consumers don’t have access to individual nodes within a large object store such as S3 or Google Cloud Storage, it’s very difficult to know exactly where their data actually resides. Beyond structuring individual queries, database administrators also don’t have much control over optimizing clusters or fine-tuning query plans.
Optimizing On-Demand MPP Databases
Because most or all of the optimization tools for these databases are hidden from the end user, and are managed by the database vendors themselves, there is very little performance optimization that must be done on an elastic database.
This means that these databases will stay performant with very little tuning. However, you do need to be careful about costs, because these databases generally charge based on usage. Each of these databases offers different strategies for reducing the size and complexity of workloads (and thus their cost), which we’ll dive into more in the guides to each.