I am toying with the idea of re-engineering my legacy in-house built distributed processing system to Hadoop.
Background: The system has been around for about 30 years, and its current incarnation is about 10 years old. At that time we migrated to a column-store database, which gave us a huge boost in speed.
Our processing is driven by numerous complex mathematical models. It therefore occurs externally to our database. We are using our columnar store DB as the authoritative source for data.
We are doing derivative processing. So this means we take one or more columns and derive either a new concept or an improvement on the current concept. Either way the derivative is stored as a new column.
For the most part, when we start the process we know 99% of the columns that we will create, and we create the complete data model needed at the beginning of the process. However, as with anything, there are occasional surprises that dictate data model changes.
Our database is a typical entity-relationship model. It looks very similar to any typical database. The data volume is low. Tables range from as few as 50,000 rows to tens of millions. The smallest tables have just a handful of columns, but the larger tables can have over a thousand columns.
To extract data we typically have to join multiple tables together. The result of the extraction is processed externally. The results of the processing are uploaded to existing columns in the database. A quarter of jobs upload to empty columns; the rest overwrite an existing value. A job can extract as little as a single column of interest or as many as several hundred. Typically we upload 10 or fewer new columns to the database at the end of a job. All loading is done using the keys that were part of the original extraction.
The unique part about this is the lifecycle of extract, process, and load being repeated over and over. The columns that were just loaded may be extracted in the subsequent job. We actually exceed 20,000 hours of CPU time, and just as many hours of human time for analysis of the results.
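To make the extract → process → load cycle concrete, here is a minimal sketch using an in-memory stand-in for the columnar store. The `columns` dict, the column names, and the "model" computation are all invented for illustration; the point is that each column is keyed independently, so loading a derived column never touches existing ones, and a freshly loaded column can feed the next extraction:

```python
# Hypothetical in-memory stand-in for a columnar store: each column is a
# dict keyed by row key, so loading a new column never rewrites rows.
columns = {
    "price":  {1: 10.0, 2: 12.0, 3: 11.0},
    "volume": {1: 100,  2: 150,  3: 90},
}

def extract(names):
    """Join the requested columns on their shared keys (the 'extraction')."""
    keys = set.intersection(*(set(columns[n]) for n in names))
    return {k: tuple(columns[n][k] for n in names) for k in sorted(keys)}

def load(name, values):
    """Upload results keyed by the original extraction keys; a new column
    is simply a new dict, and existing columns are untouched."""
    columns[name] = dict(values)

# One extract -> process -> load cycle: derive a new column from two others.
rows = extract(["price", "volume"])
derived = {k: price * vol for k, (price, vol) in rows.items()}  # the "model"
load("notional", derived)

# The column just loaded can be extracted again in the subsequent job.
next_rows = extract(["notional"])
```

The overwrite case is the same `load` call against an existing column name, which is exactly the operation that turned out to be expensive in the row-based pilot described below.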
Two years ago, we piloted a version of our processing system built on an industry-leading RDBMS that was not columnar. That exercise proved beyond doubt that uploading columns carries a huge cost in a row-based RDBMS: we often found that the complete table had to be rewritten. Our data is coded, and even with a thousand columns a complete row typically fit into a single data block in that implementation.
The question: Do any of the Hadoop columnar stores (Parquet, ORC, CarbonData) work well with a high volume of writes? We are a batch processing system that runs in parallel with OLAP.
Thanks for your detailed requirements; this is a well-posed question.
While it does not hit the nail on the head for your exact situation, read this thread:
Also, check Vamsi's blog: http://www.vamsitalkstech.com/?p=35
Bottom line: I would go with HBase and use Phoenix as the SQL layer on top of it. It satisfies your requirement for a high volume of writes, it provides a flexible schema (add more columns as needed), and it gives you SQL access via Phoenix. Plus, since you have a lot of empty cells, being a key-value store it will only store non-null values. Additionally, for your mathematical models, you could use Spark via Phoenix to access HBase.
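To illustrate why the key-value layout matters for your sparse, thousand-column tables, here is a small sketch (pure Python with invented row and column names, not HBase client code). A row-oriented store materializes a slot for every column in every row, while a key-value store like HBase persists only the cells that actually hold a value:

```python
# Sketch of sparse storage: a key-value store keeps only
# (row_key, column) -> value cells, so empty columns cost nothing.
rows = {
    "r1": {"a": 1, "b": None, "c": None},
    "r2": {"a": 2, "b": 5,    "c": None},
}

# Row-oriented layout: every column slot exists for every row.
dense_slots = sum(len(cols) for cols in rows.values())

# Key-value layout: only non-null cells are stored.
kv_cells = {(rk, col): v
            for rk, cols in rows.items()
            for col, v in cols.items() if v is not None}

print(dense_slots, len(kv_cells))  # 6 dense slots vs. 3 stored cells
```

With a thousand columns and a quarter of your jobs writing into previously empty ones, that gap between dense slots and stored cells is where the storage and rewrite savings come from.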