Created 08-16-2016 09:48 AM
I have been asked to build an ODS (operational data store) in Hadoop for an insurance client. In this regard, I have a few questions.
The ODS should facilitate operational reporting needs, including support for ad hoc queries.
Created 08-17-2016 05:09 PM
Great line of questioning @learninghuman!
Yes - we see a lot of customers attacking the ODS first with Hadoop with respect to EDW offload. There are a few motivations for this.
Perhaps the biggest con, or anti-pattern, here is a "lift and shift" of the current relational ODS architecture and processes to Hadoop. Yes - you can save money with Hadoop, but that doesn't excite the business nearly as much as becoming a more agile partner with them and helping them create new revenue opportunities along the way.
Created 08-18-2016 09:45 AM
@Tom McCuch Thanks for the detailed response. In terms of querying capabilities (from a BI tool, a CLI, or Hue), one way to achieve the fast query response that operational reporting requires is to structure the data (by means of partitions, etc.) for pre-defined queries. But for ad hoc operational reporting queries, what's your take on an ODS in Hadoop achieving the desired performance? One option is to restrict the volume of data in the ODS layer (in addition to the ORC format, Tez, etc.), since it is for operational needs anyway (so history may not be required). Please share your thoughts.
Created 08-19-2016 12:30 PM
@learninghuman For supporting ad hoc reporting queries, we recommend storing the raw data in Apache ORC and using Apache Hive to achieve the desired performance. ORC is a self-describing, type-aware columnar file format designed for Hadoop ecosystem workloads. The columnar format lets the reader read, decompress, and process only the columns that are required for the current query. In addition, it has support for ACID transactions and snapshot isolation, built-in indexes, and complex types. Many large Hadoop deployments rely on ORC, including those at Yahoo! and Facebook.
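To make that concrete, here is a minimal sketch of an ORC-backed Hive table. The table and column names (claims_ods, claim_id, and so on) are hypothetical, invented here for illustration - they are not from your use case:

```sql
-- Minimal ORC-backed Hive table (hypothetical insurance ODS schema;
-- all names are illustrative only).
CREATE TABLE claims_ods (
  claim_id      BIGINT,
  policy_id     BIGINT,
  claim_amount  DECIMAL(12,2),
  status        STRING,
  updated_ts    TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');  -- ZLIB is the default ORC codec
```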
In addition to partitioning and bucketing with Hive, where each partition is physically a separate subdirectory under the table directory and each bucket is physically a separate file within those subdirectories, ORC provides three levels of indexes within each file (a DDL sketch follows the list):
- file level - statistics about the values in each column across the entire file
- stripe level - statistics about the values in each column for each stripe
- row level - statistics about the values in each column for each set of 10,000 rows within a stripe
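As a sketch of how the partitioning and bucketing are declared (same hypothetical claims_ods schema as above; the date partition column and the choice of 32 buckets are assumptions for illustration):

```sql
-- Partitioned + bucketed ORC table: each claim_date value becomes a
-- subdirectory, each of the 32 buckets a file within it, and ORC
-- keeps its file/stripe/row-level indexes inside each of those files.
CREATE TABLE claims_ods (
  claim_id      BIGINT,
  policy_id     BIGINT,
  claim_amount  DECIMAL(12,2),
  status        STRING,
  updated_ts    TIMESTAMP
)
PARTITIONED BY (claim_date STRING)
CLUSTERED BY (policy_id) INTO 32 BUCKETS
STORED AS ORC;
```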
Bucket pruning is not yet available with Hive, so there are two ways to reduce the amount of data that needs to be processed: partition pruning (filtering on the partition columns so that only the relevant subdirectories are read) and predicate push down (using ORC's min/max statistics to skip whole stripes and row groups). An example of both appears below.
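A hedged illustration of those two mechanisms against the hypothetical claims_ods table above (the two settings shown are standard Hive options; their defaults may vary by version):

```sql
SET hive.optimize.ppd = true;           -- enable predicate push down
SET hive.optimize.index.filter = true;  -- use ORC row indexes to skip row groups

SELECT claim_id, claim_amount
FROM claims_ods
WHERE claim_date = '2016-08-01'   -- partition pruning: only one subdirectory is scanned
  AND claim_amount > 100000;      -- push down: ORC min/max stats skip stripes/row groups
```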
Created 08-22-2016 06:14 AM
@Tom McCuch Thanks again. Do you recommend that the data be sorted for the ORC optimizations to work, or does it not really matter? And do you have any benchmark volumes from performance testing done for ad hoc queries with the optimizations mentioned above?
Created 08-23-2016 03:35 PM
To maximize the benefit of predicate push down and the row index, you should apply some type of secondary sorting to the data while inserting it. While the "primary" filter is usually the partition, sorting your data within the partition will increase the effectiveness of the row index structure, ensuring the records are clustered so that large sections of data can be skipped. A sketch of this follows below.
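Here is a minimal sketch of such a secondary sort at insert time, again using the hypothetical claims_ods schema plus an assumed staging_claims source table (the dynamic-partition settings shown are standard Hive options):

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Distribute rows so each reducer handles one claim_date partition,
-- then sort within it by policy_id; the resulting ORC row-group
-- min/max statistics on policy_id become tight, skippable ranges.
-- Note the dynamic partition column (claim_date) comes last in SELECT.
INSERT OVERWRITE TABLE claims_ods PARTITION (claim_date)
SELECT claim_id, policy_id, claim_amount, status, updated_ts, claim_date
FROM staging_claims
DISTRIBUTE BY claim_date
SORT BY policy_id;
```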
Apache Hive is a very fast-moving project, and Hive 2 only recently came out. I expect updated performance benchmarks to be published over the course of the next year. With that said, here is a relevant independent benchmark performed on Hive 0.14 by Yahoo! Japan in September of last year. Hive has only gotten more performant in the year since that benchmark, but the biggest reason I share it with you is that the focus of Yahoo! Japan (and many other enterprise customers of ours) is not flat-out speed on individual queries, but performance across multiple types of queries (ETL, reporting, ad hoc) at high levels of concurrency - which is what you would have if using Hive as an operational data store, as you are thinking.
Created 08-25-2016 08:50 AM
@Tom McCuch Thanks a lot for the views and inputs. It definitely helps.