Hadoop for an Operational Data Store

Expert Contributor

I have been asked to build an ODS (Operational Data Store) in Hadoop for an insurance client. In this regard, a few questions:

  1. First of all, is it recommended to build an ODS in Hadoop?
  2. What are the pros and cons of building an ODS in Hadoop?
  3. Any best practices around this topic?

The ODS should support the business's operational reporting needs, including ad hoc queries.

1 ACCEPTED SOLUTION


Great line of questioning @learninghuman!

Yes - we see a lot of customers tackling the ODS first with Hadoop with respect to EDW offload. There are a few motivations for this:

  1. Not all data is tabular. A lot of areas of the business, such as R&D, aren't able to conform their data to the canonical tabular representations in today's ODS. The ODS needs to manage a variety of source data and represent it in different ways - not just through tabular views, but also through index/search, modeling, and graph views. Hadoop is able to handle the variety of source data that is out there and present it to the analyst through a number of different views, by applying schema on read rather than schema on write. A very common use case that we see in insurance is a 360-degree view of the customer - bringing together structured data from source systems with unstructured data (such as social media) from the outside world to mine for life events that may drive changes in coverage. Another common use case is an actuarial data lake - bringing structured and unstructured data together for modeling / machine learning purposes to better price insurance products at an individualized level.
  2. Current architectures that move the data in batches from source system to ODS to EDW are inefficient and brittle. Time to insight is a real motivation for most businesses we see considering Hadoop. As a business owner, I simply can't wait as long as I have to today to get the insights I need. Hadoop allows me to ingest the data into my ODS more efficiently - in its raw form, as fast as it's produced. Second, ETL-based architectures today are brittle in that they end up dropping a lot of data before it gets into the ODS. This "loss of signal" can be really difficult to overcome when the business starts asking new questions. Hadoop brings agility to the ODS: I can go back and replay my processing over the raw source data - transforming, enriching, and aggregating it to answer new questions. A great example of this is the increased use of geo-location fields embedded in raw source-system data to gain new insights downstream with ad hoc queries.
  3. Data governance is often an afterthought in today's ODS architecture ... a "write-only" database of metadata. With Hadoop, I can profile and tag all the data on ingest - allowing me to address issues of data quality much more easily than today. I can then use that profiling and tagging downstream to drive certain processing (such as consistent handling of nulls, address normalization, encryption of PII fields, etc.), as well as to uniformly restrict downstream access to that data by associating tags with groups and permissions. Finally, Hadoop architectures for the ODS push the processing to the data, rather than the data to the processing. This makes data lineage a lot easier to track within the context of your operational reporting, and eliminates a lot of reconciliation issues caused by today's disjointed approaches to data governance. With less data movement, and by holding and processing more data in one place, I can put real "teeth" into my data governance approach.

Perhaps the biggest con, or anti-pattern, here is a "lift and shift" of the current relational ODS architecture and processes to Hadoop. Yes, you can save money with Hadoop, but that doesn't excite the business nearly as much as becoming a more agile partner and helping them create new revenue opportunities along the way.


REPLIES


Expert Contributor

@Tom McCuch Thanks for the detailed response. In terms of query capabilities (from a BI tool, a CLI, or Hue), one way to achieve the fast response times operational reporting requires is to structure the data (by means of partitioning, etc.) for pre-defined queries. But for ad hoc operational reporting queries, what is your take on how an ODS in Hadoop can achieve the desired performance? One option is to restrict the volume of data in the ODS layer (in addition to using the ORC format, Tez, etc.), since it serves operational needs anyway and history may not be required. Please share your thoughts.


@learninghuman For supporting ad hoc reporting queries, we recommend storing the raw data in Apache ORC and using Apache Hive to achieve the desired performance. ORC is a self-describing, type-aware columnar file format designed for Hadoop ecosystem workloads. The columnar format lets the reader read, decompress, and process only the columns that are required for the current query. In addition, it has support for ACID transactions and snapshot isolation, built-in indexes, and complex types. Many large Hadoop deployments rely on ORC, including those at Yahoo! and Facebook.
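
To make the column-projection point concrete, here is a minimal sketch using a hypothetical policy_events table (the names and schema are illustrative, not from this thread): the aggregate only touches the event_type column stream, so the wide payload column is never read or decompressed.

```sql
-- Hypothetical wide event table stored as ORC (illustrative schema only).
CREATE TABLE policy_events (
  event_id   BIGINT,
  policy_id  BIGINT,
  event_type STRING,
  event_ts   TIMESTAMP,
  payload    STRING          -- large free-text / JSON column
)
STORED AS ORC;

-- Because ORC is columnar, this aggregate reads and decompresses only the
-- event_type column stream; the wide payload column is never touched.
SELECT event_type, COUNT(*) AS events
FROM   policy_events
GROUP  BY event_type;
```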

In addition to partitioning and bucketing with Hive - where each partition is physically a separate subdirectory under the table directory, and each bucket is physically a separate file within those subdirectories (a DDL sketch of such a layout follows the list below) - ORC provides three levels of indexes within each file:

  1. file level - statistics about the values in each column across the entire file
  2. stripe level - statistics about the values in each column for each stripe
  3. row level - statistics about the values in each column for each set of 10,000 rows within a stripe
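
To make that physical layout concrete, here is a hedged sketch of a partitioned, bucketed ORC table. The claims schema is hypothetical, not something from this thread; each claim_date value becomes its own subdirectory, each of the 32 buckets becomes a separate ORC file within it, and every file carries the file-, stripe-, and row-level statistics described above.

```sql
-- Hypothetical insurance-claims table. Each claim_date value becomes a
-- subdirectory (e.g. .../claims/claim_date=2016-06-01/), each of the 32
-- buckets (hashed on policy_id) becomes a separate ORC file inside it, and
-- every ORC file carries file-, stripe-, and row-group-level statistics.
CREATE TABLE claims (
  claim_id     BIGINT,
  policy_id    BIGINT,
  claim_amount DECIMAL(12,2),
  claim_status STRING
)
PARTITIONED BY (claim_date STRING)
CLUSTERED BY (policy_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');
```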

Bucket pruning is not yet available in Hive, so there are two ways to reduce the amount of data that needs to be processed:

  • Partitioning (and partition pruning during joins). The limitation here is that too many partitions are hard on the Hive server and the metastore, so you shouldn't have more than a couple of thousand partitions.
  • Predicate pushdown in ORC. Meant to complement partitioning, this allows the map task to open the ORC index and skip any stripes and row groups that cannot contain relevant data, based on min/max values (always on) and bloom filters (if configured).
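
As a hedged illustration of both pruning paths, reusing the hypothetical claims table from above (the bloom-filter settings are an example configuration, not something this thread prescribes):

```sql
-- Hypothetical example configuration: bloom filters are opt-in and only apply
-- to ORC files written after the property is set.
ALTER TABLE claims SET TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'policy_id',
  'orc.bloom.filter.fpp'     = '0.05'
);

-- Enable predicate pushdown into the ORC indexes (defaults vary by Hive version).
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;

-- claim_date = '...' prunes whole partition directories before any file is
-- opened; policy_id = ... is pushed into the ORC reader, which skips stripes
-- and row groups whose min/max statistics (or bloom filter) cannot match.
SELECT claim_id, claim_amount
FROM   claims
WHERE  claim_date = '2016-06-01'
  AND  policy_id  = 4711;
```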

Expert Contributor

@Tom McCuch Thanks again. Do you recommend that the data be sorted for the ORC optimizations to work, or does it not really matter? And is there any benchmark, with volume and performance testing, for ad hoc queries using the optimizations mentioned above?


@learninghuman

To maximize the benefit of predicate pushdown and the row index, you should apply some type of secondary sorting to the data while inserting it. Since the "primary" filter is usually the partition, sorting your data within each partition will increase the effectiveness of the row index structure, ensuring that the clustering of records allows you to skip large sections of data.
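
Here is a minimal sketch of that secondary sort on ingest, again with hypothetical table names, building on the claims table sketched earlier. One common approach is to declare the sort order on the bucketed table so every load writes sorted ORC files.

```sql
-- Hypothetical: a copy of the claims table that declares a secondary sort on
-- policy_id. Sorted bucket files keep the ORC row-index min/max ranges for
-- policy_id narrow, so predicate pushdown can skip whole stripes/row groups.
CREATE TABLE claims_sorted (
  claim_id     BIGINT,
  policy_id    BIGINT,
  claim_amount DECIMAL(12,2),
  claim_status STRING
)
PARTITIONED BY (claim_date STRING)
CLUSTERED BY (policy_id) SORTED BY (policy_id) INTO 32 BUCKETS
STORED AS ORC;

-- On Hive 1.x these must be enabled; drop the two enforce settings on Hive 2.x,
-- where bucketing and sorting are always enforced on insert.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE claims_sorted PARTITION (claim_date)
SELECT claim_id, policy_id, claim_amount, claim_status, claim_date
FROM claims;
```

If you prefer not to bucket the table, an alternative is to add DISTRIBUTE BY / SORT BY clauses to the load query to get the same within-partition ordering.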

Apache Hive is a very fast-moving project, and Hive 2 only recently came out. I expect updated performance benchmarks to come out over the course of the next year. With that said, here is a relevant independent benchmark performed on Hive 0.14 by Yahoo! Japan in September of last year. Hive has only gotten more performant in the year since that benchmark, but the biggest reason I share this one with you is that the focus of Yahoo! Japan (and many other enterprise customers of ours) is not flat-out speed on individual queries, but performance across multiple types of queries (ETL, reporting, ad hoc) at high levels of concurrency - which is what you would have if using Hive as an Operational Data Store, as you are thinking.

Expert Contributor

@Tom McCuch Thanks a lot for the views and inputs. It definitely helps.