We looked at the performance of these engines in the last article; now it’s time to look at how the data got loaded in. There are trade-offs to be aware of when loading data into each of these engines, as they use different mechanisms to accomplish the task.
There is no immediate need for a schema-on-write data load when you are using Hive with your native file format. Your only “load” operation is copying the files from your local file system to HDFS. With schema-on-read functionality, Hive can access data as soon as its underlying file is loaded into HDFS. In our case, the real data load step was converting this schema-on-read external Hive table data into the optimized ORC format, which meant loading it from an external table into a Hive-managed table. This was a relatively short process, coming in well under an hour.
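As a rough sketch of that two-step path, the Python helper below only builds the HiveQL statements involved: an external table over the CSV files, a managed ORC table with Zlib compression, and the copy between them. The table names, column list, and HDFS directory are hypothetical placeholders, not the ones used for this article's data set.

```python
def orc_load_statements(external_table, managed_table, columns, hdfs_dir):
    """Return HiveQL for a schema-on-read table and its ORC-backed copy."""
    column_ddl = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    # Schema-on-read: Hive only records where the CSV files live in HDFS.
    create_external = (
        f"CREATE EXTERNAL TABLE {external_table} ({column_ddl}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        f"STORED AS TEXTFILE LOCATION '{hdfs_dir}'"
    )
    # Hive-managed table stored as ORC with Zlib compression.
    create_managed = (
        f"CREATE TABLE {managed_table} ({column_ddl}) "
        "STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB')"
    )
    # The actual "load": rewrite the CSV rows into ORC files.
    copy_rows = f"INSERT INTO TABLE {managed_table} SELECT * FROM {external_table}"
    return [create_external, create_managed, copy_rows]


# Hypothetical names and columns, for illustration only.
statements = orc_load_statements(
    "events_csv",
    "events_orc",
    [("event_time", "STRING"), ("user_id", "STRING"), ("amount", "DOUBLE")],
    "/data/events_csv",
)
for statement in statements:
    print(statement + ";")
```

Only the third statement moves any data, which is why the conversion to ORC is the step that actually costs time.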
Contrast that with HBase, where a bulk data load for our sample data set of 200M rows (around 30GB on disk in CSV format) took 4+ hours using a single-threaded Java application running in the cluster. In this case, the load took several columns of the CSV data and concatenated them to form a composite row key. This, along with the fact that the inserts were causing hot-spotting within the Region Servers, slowed things down. One way to improve this performance would be to pre-split the regions so your inserts aren’t all going to one region to start with. We could also have parallelized the data load by writing a MapReduce job to distribute the work.
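To make the composite-key and pre-splitting ideas concrete, here is a minimal Python sketch. It adds a hash-derived "salt" prefix to the key, which is one common way to spread writes across pre-split regions; this is an assumption for illustration and not what the load described above did. The bucket count, separator, and column values are all hypothetical.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical number of pre-split regions


def salted_row_key(parts, num_buckets=NUM_BUCKETS):
    """Join column values into a composite key and prefix a hash-derived bucket."""
    composite = "|".join(parts)
    digest = hashlib.sha256(composite.encode("utf-8")).digest()
    bucket = digest[0] % num_buckets
    # Zero-padded prefix so keys sort by bucket first (assumes fewer than 100 buckets).
    return f"{bucket:02d}|{composite}"


def split_points(num_buckets=NUM_BUCKETS):
    """Boundaries between buckets: N buckets need N - 1 split keys."""
    return [f"{bucket:02d}" for bucket in range(1, num_buckets)]


# Hypothetical columns: date, user, channel.
keys = [salted_row_key(["2016-05-01", f"user{i}", "web"]) for i in range(1000)]
buckets_used = {key.split("|", 1)[0] for key in keys}
print(len(split_points()), "split points;", len(buckets_used), "buckets receiving writes")
```

The trade-off is on the read side: a scan over a range of the original key now has to be issued once per bucket, because neighboring keys no longer sit next to each other.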
Let’s also contrast that with the Druid load, which took about 2 hours. Druid bulk loads data using a MapReduce job; this is a fairly efficient way of doing things since it distributes the work across the cluster, and it is why we’re seeing a lower time relative to HBase. Druid still has to do the work of adding its own indexes on top of the data and, optionally, pre-aggregating the data to a user-defined level, so it doesn’t have a trivial path to getting the data in either. Although we didn’t choose to pre-aggregate this data, pre-aggregation is what allows Druid to save a lot of space: instead of storing the raw data, Druid can roll the data up to minute-level granularity if you think your users will not query deeper than that. But remember: once you aggregate the data, you no longer have the raw data.
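The effect of minute-level roll-up can be illustrated with a small Python sketch. It is a simplified stand-in for the idea, not Druid's ingestion code, and the row layout (timestamp, one dimension, one numeric value) and sample events are made up for the example.

```python
from collections import defaultdict
from datetime import datetime


def roll_up_to_minute(events):
    """Aggregate (iso_timestamp, dimension, value) rows per minute and dimension."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for iso_timestamp, dimension, value in events:
        # Truncate the timestamp to the start of its minute.
        minute = datetime.fromisoformat(iso_timestamp).replace(second=0, microsecond=0)
        bucket = totals[(minute.isoformat(), dimension)]
        bucket["count"] += 1
        bucket["sum"] += value
    return dict(totals)


# Made-up sample events.
raw_events = [
    ("2016-05-01T10:00:05", "web", 2.0),
    ("2016-05-01T10:00:40", "web", 3.0),
    ("2016-05-01T10:01:10", "web", 1.0),
    ("2016-05-01T10:00:59", "mobile", 4.0),
]
rolled = roll_up_to_minute(raw_events)
print(len(raw_events), "raw rows ->", len(rolled), "rolled-up rows")
```

The two "web" events in the 10:00 minute collapse into a single row with a count and a sum. The individual values 2.0 and 3.0 cannot be recovered from that row, which is the cost of the space saving.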
Another interesting way to slice this data is by how much space it takes up in each of the three columnar formats.
[Figure: Size on Disk with Replication. Hive (ORC with Zlib); HBase (Snappy compression)]
Hive and Druid have compressed the data very efficiently, considering the initial data size was 90GB with replication (the roughly 30GB of CSV stored three times), while HBase is sitting right around the raw data size.
At this point, we've covered the relative loading times for the three engines as well as their data storage requirements. These numbers may change as you use different compression formats or load different kinds of data into the engines, but this is intended as a general reference for understanding the relative strengths of the three.