Overview of Snowflake Data Warehouse

Snowflake is a fully-managed Cloud MPP data warehouse that enables users to spin up as many virtual warehouses as they need to parallelize and isolate the performance of individual queries. Snowflake runs in the AWS cloud and separates storage and compute, meaning that many warehouses can all access the same data source simultaneously, enabling very high concurrency. Snowflake allows users to interact with its data warehouse through a web browser, the command line, an analytics platform, or via Snowflake’s ODBC, JDBC, or other supported drivers.

Snowflake is a fairly new entrant in the data warehouse market, launched by a group of data warehousing experts in 2014, after two years in stealth mode. Its unique architecture was well-described in a SIGMOD 2016 paper.

Reasons to Choose Snowflake

Flexible Performance

Snowflake allows users to spin up and down “virtual warehouses” (MPP compute clusters) in a matter of minutes in a variety of sizes (which can be changed on-the-fly) to fit any given workload. Queries are isolated, so they consume the resources of the particular virtual warehouse that they’re being run on, and customers only pay for the compute time they use. This means that organizations can elastically scale and partition compute resources depending on the variety of queries being run. Snowflake is also flexible with data types; semi-structured data can be stored and queried right alongside structured data.

Scalable

Snowflake is built on S3, which automatically scales storage resources in-line with data volume. This means users never need to worry about hitting an arbitrary storage limit. Snowflake’s compute resources are also fully based in the cloud, which means that compute resources can also be scaled to ensure that performance doesn’t degrade when the complexity and number of the queries increases.

Unmatched Concurrency

Snowflake allows anyone to create unlimited numbers of virtual warehouses to isolate queries from different users and tools. For organizations using Looker, or other applications with potentially high concurrency, there is also a feature called multi-cluster warehouses which makes it simple to automatically scale a single virtual warehouse to multiple clusters.

Fully Managed

Snowflake has simplified the concept of managing a database cluster and nodes across its platform. What this means is that the most of the aspects of managing a database (that would traditionally take up DevOps resources), such as optimization, data protection, infrastructure monitoring, and encryption, are managed for you by the Snowflake platform.

What use cases is Snowflake great for?

Organizations with high concurrency requirements

Snowflake’s virtual warehouses operate as independent MPP compute clusters, which means that organizations can provision one virtual warehouse to handle ETL processes, and another instance to handle ad hoc analytical queries. This doesn’t require any data movement, and queries run on one virtual warehouse have no impact on other virtual warehouses. Snowflake’s multi-cluster warehouse feature can automatically scale a virtual warehouse (by adding new compute clusters to it) to provide consistent performance at any level of concurrency.

Additionally, for embedded analytic use cases, which need to support hundreds or thousands of simultaneous queries, Snowflake’s virtual warehouses provide an effective way to support that many users via multi-cluster warehouses that automatically increase the number of compute clusters depending on concurrency.

Companies with very large and growing data

Snowflake is a great solution for organizations that are interested in rapidly scaling their data stack with minimal overhead. Because Snowflake uses S3 for storage, storage is basically limitless. Snowflake’s architecture separates storage and compute such that any number of virtual warehouses can access the same data, and the cost of storing data is comparable to the cost of S3.

Analysts needing to work with semi-structured data

Snowflake has a comprehensive set of functions for working with semi-structured data and a range of data formats including JSON, Avro, ORC, Parquet, and XML. Snowflake provides a great deal of flexibility in regards to storing and querying semi-structured data.

One excellent example of Snowflake’s support for semi-structured data is Snowflake capacity to store multiple types of data in a VARIANT data value, a data format capable of storing multiple types of data without changing the structure of the data itself as seen by the user.

Companies that want to share or monetize their data

Snowflake introduced Data Sharing, which allows companies to share their data with customers, partners, or vendors by loading that data into Snowflake and enabling access to any participating party. Unlike traditional methods for sharing data, there’s no data deconstruction, movement, or transmission. The data provider simply grants access through a specialized database object to their data consumers (customers/partners). As they onboard new data consumers, the data provider can continue to share the same data, or subsets of that data, with minimal effort and standard SQL.

Before You Commit

Pricing

Snowflake, like many cloud MPP data warehouses, has two payment options: on-demand and capacity. The on-demand pricing option offers users a “pay-as-you-go” model with no long-term commitment. For organizations seeking deeper discounts, the capacity payment option offers a discounted rate based on a capacity commitment.

Snowflake also has four editions, each of which carries slightly different functionality and SLAs. Those editions are:

More details about Snowflake’s pricing can be found on their pricing page.

It is worth noting that pricing for Snowflake can vary based on deployment location.

Snowflake Architecture

Snowflake uses a patented multi-cluster, shared-data architecture, which makes this data warehouse unique among Cloud MPP providers. “Multi-cluster” means that Snowflake has the ability to spin up multiple individual compute engines, or virtual warehouses, on top of a common storage layer.

Snowflake’s architecture has two distinct levels:

  • Data Storage - stores table data and temp data associated with queries
  • Compute - compute engines that individual users can spin up
  • Services - the management “brains” of Snowflake, responsible for security, metadata, optimization, resource management, and more

Data Storage

Snowflake uses Amazon S3 as a shared data repository for its platform because of S3’s usability, high availability, and durability. Snowflake uses S3 to store table data and temporary data associated with very large and expensive queries. Spilling temporary data into S3 ensures that Snowflake can execute these large queries without throwing an “out-of-memory” or “out-of-disk” error.

Virtual Warehouses

In order to process queries, Snowflake allows the user to provision clusters of Amazon EC2 instances, referred to as virtual warehouses (VW).

The individual EC2 instances that form the VW are abstracted from the user, and the user cannot control exactly how many instances are contained within each virtual warehouse. Rather, the user can select from virtual warehouses ranging in sizes.

Virtual warehouses are purely compute mechanisms which can be created, destroyed, or resized on-demand, with Snowflake encouraging users to shut down their virtual warehouses when there are no queries running (so they’re not charged). This architecture gives Snowflake its elasticity, or the ability to scale queries in-line with demand.

Each of these virtual warehouses operate as completely separate compute clusters, with no queries or resources being shared between them. Queries therefore can only run on one VW and cannot be shared between virtual warehouses, resulting in strong performance isolation across VWs. The power of Snowflake comes when multiple VWs are spun up and used for different tasks, running at the same time, with multiple concurrent queries each. With the “multi-cluster warehouse” feature, VWs can be configured to automatically scale for concurrency of queries.

Services

Snowflake’s cloud services layer handles all of the management and administration of the Snowflake platform. In contrast to the compute resources, which are designed to be spun up and down on-demand, Snowflake’s services layer is always-on, meant to handle management processes for multiple virtual warehouses.

Snowflake’s services layer is responsible for a host of processes, including query management and optimization, security, user management, and metadata management.

Types of Data

Snowflake can ingest both tabular and semi-structured data originating from text (delimited) files, JSON, Avro, Parquet, XML, and columnar formats.

Unlike with many other databases, the requirements for pre-processing, flattening, or denormalizing data sets before loading is less of an issue with Snowflake. This is particularly true for semi-structured data, because Snowflake automatically columnarizes data upon load. Compression is applied automatically once data is loaded into Snowflake.

Maximum Recommended Size of Data

Because Snowflake is built on S3, which scales storage seamlessly with data volume, as well as Snowflake’s compute architecture that separates storage from compute, there is no architectural limit on the amount of data that can be stored and processed.

Implementing Snowflake

Implementing Snowflake

Moving to Snowflake is made easy due to its cloud storage abstraction. Since Snowflake uses AWS resources it is simple for customers storing their data in an Amazon S3 bucket to load it into Snowflake.

This means that if your organization’s data is already in S3, Snowflake just needs to be pointed at your S3 repository to load that data into Snowflake(a process that takes minutes) and you can start querying.

Process for new data

There are multiple ways to upload new data in Snowflake:

  • The Web Interface is a convenient tool to quickly load limited amounts of data into a table from a small set of flat files.
  • The command line interface SnowSQL allows you to load data:
    • From an external Amazon S3 bucket, through a 4 step process (file staging, listing files, copying files into the tables and verifying the loaded data);
      • The files in S3 can be in any supported format including CSV and JSON.

Snowflake also offers parallelized data upload and querying which is paired with the option to run multiple compute clusters to deliver a high performance environment that eliminates traditional issues relating to concurrency.

For assistance building out ETL pipelines, you can utilize an ETL vendor that integrates with Snowflake. Contact Looker for more information.

Maintenance

Snowflake provides a monitoring tool, and charts displaying the warehouse load over a two-week period. This chart allows the user to decide to spin up an additional warehouse to move the queued queries to this warehouse or modify the current one to add clusters and handle higher concurrency. It is also possible to see query execution paths and processes.

Data Protection and Availability

Snowflake’s use of S3 as a storage layer means that data stored in Snowflake inherits [the full resilience](https://aws.amazon.com/s3/details/ of S3). The Snowflake platform also builds on S3’s resilience by adding in features such as time travel (enabling users to access all historical data within a defined period of time), and fail-safe (ensures data is protected in the event of a hardware failure or security breach). Files and traffic network are fully and automatically encrypted by default, in transit and at rest. Snowflake is also fully SOC 2, PCI and HIPAA compliant.

For clients with additional security requirements, Snowflake offers dedicated compute resources and additional security features. Snowflake integrates Security Information and Event Management tools into its platform to help users identify suspicious activity and alert customers.