What’s Hadoop and how does SQL fit in?
Hadoop is an open-source framework that came on the scene in the early 2000s for storing and processing really big data. It uses parallel processing on commodity hardware to distribute computation across huge clusters of machines.
Hadoop jobs are usually written in Java. But because this can be complicated and analysts largely still prefer to write SQL, several computation engines have been developed that allow you to write SQL, and have it run on a Hadoop cluster.
Hadoop’s core strength lies in its modularity. Traditional relational database management systems (RDBMSs) and MPP data warehouses bundle their file storage, compute engines, and data management systems together in one platform. With Hadoop, all of these components operate separately and independently of each other. Hadoop as a software category is massive, and the entire ecosystem consists of dozens of software vendors, support/service providers, and applications.
Unlike many analytical data warehouse offerings, Hadoop’s software is designed to run on commodity hardware, which means the entire software architecture also operates separately and independently of the hardware.
These two features, taken together, mean that a skilled DBA or DevOps team can take steps to customize every part of their data stack with Hadoop. This means that if you possess a Hadoop cluster, it’s going to be important to know the different benefits and drawbacks between vendors.
What are advantages of using SQL on Hadoop?
Cost Effectiveness and Flexibility
One of Hadoop’s major advantages is that the software is open source and designed to run on commodity hardware, or non-proprietary hardware that is not tied to any one software provider.
This feature is significant because it allows organizations to add more machines to their Hadoop cluster at a predictably low cost. Organizations also have the ability to swap out different Hadoop technologies to optimize their processes at any given moment without needing to replace their whole hardware stack.
For example, organizations interested in provisioning their Hadoop cluster as a data warehouse can add a SQL engine such as Hive to query their data via SQL. They can then swap Hive out for another SQL engine such as Presto or Spark, without the need to change their hardware. This flexibility is much greater than with other on-premise data warehouse vendors because of the interoperability of Hadoop-based systems.
Hadoop users don’t need to declare a schema when they store data. This makes the Hadoop Distributed File System (HDFS) a cost effective way for organizations to store massive amounts of unstructured data, and declare a schema only when they want to analyze the data. This eliminates the need to declare a potentially limiting schema up front when writing data to the cluster.
Hadoop’s infrastructure extends far beyond SQL, and is compatible with many different programming languages. It’s preferred by many developers interested in performing machine-learning analysis and running multiple programming languages in parallel on the same cluster.
Prior to the development of SQL Engines, writing code for execution on Hadoop required specialized knowledge. However, the advent of SQL engines on Hadoop allows organizations to run ANSI-compliant SQL against their Hadoop clusters. The introduction of these technologies allows business users to connect any SQL-based reporting tool to their Hadoop cluster.
Reliability and Performance
Hadoop, and specifically the Hadoop Distributed File System (HDFS), is commonly deployed within enterprise data stacks for two reasons: HDFS is designed to be remarkably fault-tolerant (data within the cluster will still be accessible despite machines in the cluster failing) and Hadoop’s MapReduce framework allows developers to process extremely large amounts of data.