Overview

Vertica is a high-concurrency MPP database platform that’s prioritizes highly performant, advanced analytical workflows on massive datasets. Like many self-managed MPP data platforms, Vertica’s platform is mature and and is architected to fit into an enterprise data stack. Vertica core competency lies in its novel architecture that results in high performance and efficiency for extremely large datasets.

Vertica is particularly well suited for advanced analytical and data science workflows due to its strong integration with R and data federation capabilities with Hadoop.

Choosing Vertica as a Database

Cost-Effective

Vertica is a cost-effective and flexible solution because unlike other self-managed MPP databases, Vertica can be deployed and run on commodity hardware making it relatively inexpensive to provision new nodes and scale the database.

Performance

Vertica’s columnar storage is optimized for speed and efficiency, cutting disk I/O overhead and making indexes and materialized views obsolete.

Advanced Analytic Functions

Like other self-managed MPP databases, Vertica has been around for a long time, and has a mature and stable SQL dialect. However, Vertica also contains native support for geospatial analysis, distributed-R, machine learning and other advanced toolkits that put it ahead of most other MPP platforms on advanced analytical tools.

What is Vertica great for?

MPP Performance in an On-Premises Environment

Vertica can be deployed on-premises on commodity hardware (much like Hadoop), enabling organizations to scale Vertica’s storage and performance predictably and linearly without relying on proprietary hardware. Vertica’s high availability features also enables you to load and analyze data continually, regardless of hardware failures.

Data Science/ Machine Learning Workflows

Organizations that rely heavily on data science and machine learning workflows may want to consider Vertica for its SQL Engine capable of querying HDFS directly (similar to Presto, Hive, and Spark), as well as its integration with distributed-R, allowing analysts and data scientist to use R to query massive datasets

Price Considerations

Vertica licenses are determined by any number of factors including duration terms, and amount of raw data stored in the platform. For an exact price estimate based on your specifications, it’s best to contact the Micro Focus Vertica team directly.

It’s also possible to download Vertica’s Community Edition, which is free up to 1TB on 3 nodes.

Vertica Architecture

Vertica is engineered with the 4 C’s in mind: column storage, compression, clustering, and continuous performance (availability).

Column Storage

Vertica’s stores columns from tables on database objects called projections. Projections operate similar to materialized views, designed to speed up query execution. This means grouping columns that are frequently accessed together into one projection. This could even involve pre-joining some columns together in single projects that dramatically increase query performance on joined tables.

Column Grouping is part of Vertica’s FlexStore™, which is “a combination physical design, database storage, and query execution techniques” that’s designed to optimize query performance. Another important aspect of FlexStore is to ensure that frequently used data can be accessed more quickly by moving such data onto faster and more performant hardware.

Compression

Column stores, by virtue of storing columns as separate objects, can more effectively compress data because data that resides in a particular column is going to be a lot similar than data within a particular row.

Vertica utilizes LZO compression on compressible columns, which is particularly optimized for extremely fast decompression. Vertica also utilizes integer packing on integer values. Integer packing as a compression algorithm is demonstrated here.

Clustering

Vertica is built on a distributed shared-nothing architecture—a staple of analytical MPP databases. However, unlike many MPP distributed databases, Vertica was designed to operate without a leader node. Instead all nodes within a Vertica cluster have the ability to become an Initiator Node, any node within the cluster can receive a connection and form a query execution plan.

Connected to a load balancer, this architecture not only allows for high-concurrency, but also high availability, as there is no single point of failure within a Vertica cluster.

Continuous Performance

Vertica is designed to auto load/trickle data continuously into the database by default. Unlike other MPP databases that rely on regular intervals of batch uploads of data into the database, Vertica continuously loads data with minimal management, a process that minimizes latency.

Types of data

Vertica stores data in a columnar format. It can ingest tabular data as well as support semi-structured data via flex tables such as Common Event Format (CEF) data, delimited data, JSON data and loading matches from regular expressions. Vertica’s SQL on Hadoop functionality also allows for querying data in text, ORC, or Parquet formats.

Maximum Recommended Size of Data

Vertica is a petabyte-level database and is capable of scaling to handle petabyte-level workloads. However, it’s important to note that complexity and price considerations will also scale along with storage and compute capacity.

Using Vertica

Process for new data

Before loading data into your database, you’ll want to create tables and specify the datatype for each column in each table. You’ll then want to think about how those tables will be used so you can create effective projections and group columns across those tables effectively.

Maintenance/What you need to implement

While the hardware component of Vertica can be fully managed, table architecture and query optimization is not.In order to attain acceptable performance out of Vertica, you must apply column encoding, partitioning, indexes, and statistics. General MPP tuning strategy can be applied when tuning a Teradata database.

Distributed-R

Vertica enables R scale to extremely large datasets. This allows you to create user-defined functions (UDFs) in R and load them into the Vertica database. This blog post provides an example of how that’s done.

Vertica’s SQL on Hadoop Engine

The Vertica platform comes with a SQL Engine that rivals other SQL on Hadoop dialects such as Presto, Hive, and Spark. You can thus use Vertica's mature SQL querying capabilities on Hadoop data while fully leveraging the power of the Hadoop cluster. You can take advantage of both Vertica’s performance and analytical functions across both native Vertica and Hadoop environments from one central location.

Random Facts About Vertica

Vertica’s designer Michael Stonebraker knows a thing or two about databases. The A.M. Turing Award Winner, in addition to designing Vertica, also pioneered the development of databases Postgres and Ingres. He currently works at MIT as an adjunct professor and researcher.