Created 10-05-2015 04:17 PM
The customer wants to use something like Apache Drill to query JSON data on HDP, because JSON is self-describing.
Created 10-05-2015 07:02 PM
One of our prospects recently evaluated Drill. While it worked for structured / self-describing formats without creating a schema, their experience was that data type resolution slowed performance. In any case, HWX does not officially support Drill, so the onus will be on the customer to resolve any Drill-related issues when using it with HDP.
On the other hand, my comment to customers is that Hive provides a consistent approach, with semantics that database developers already know. Additionally, larger community involvement and the maturity of the product have hardened Hive over a number of years.
The JSON SerDe is the easy way to handle JSON in HDP. In return for a one-time table creation, you get better performance than with Drill, which does not seem like a bad trade-off at all.
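To make the "one-time table creation" concrete, here is a minimal sketch that derives a Hive DDL statement from a single sample JSON record. The helper names (`hive_type`, `hive_ddl_for_json`), the table name, the HDFS location, and the type mapping are all illustrative assumptions; the SerDe class is the HCatalog one, which needs the hive-hcatalog-core jar available to Hive. Treat the generated DDL as a starting point to review, not as something to run blindly.

```python
import json

# Hypothetical mapping from JSON (Python) scalar types to Hive column types.
HIVE_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}


def hive_type(value):
    """Return a Hive type for one JSON value, recursing into nested values."""
    if isinstance(value, dict):
        fields = ",".join(f"{k}:{hive_type(v)}" for k, v in value.items())
        return f"STRUCT<{fields}>"
    if isinstance(value, list):
        # Assumes a non-empty, homogeneous array; empty arrays fall back to STRING.
        return f"ARRAY<{hive_type(value[0])}>" if value else "ARRAY<STRING>"
    # Unknown or null values fall back to STRING.
    return HIVE_TYPES.get(type(value), "STRING")


def hive_ddl_for_json(table, sample_line, location):
    """Build a CREATE EXTERNAL TABLE statement from one sample JSON line."""
    record = json.loads(sample_line)
    columns = ",\n".join(
        f"  `{name}` {hive_type(value)}" for name, value in record.items()
    )
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{columns}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Illustrative sample record and location (both made up for this sketch).
sample = '{"id": 1, "name": "a", "score": 2.5, "tags": ["x"], "geo": {"lat": 1.0}}'
ddl = hive_ddl_for_json("events_json", sample, "/data/events")
print(ddl)
```

Because the types come from one record only, fields that are missing or null in the sample will need to be added or corrected by hand before the table is created.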
Created 10-05-2015 04:24 PM
Take a look at Spark (and SparkSQL). It can automatically infer the schema of a JSON dataset:
https://spark.apache.org/docs/1.4.1/sql-programming-guide.html#json-datasets
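As a rough illustration of what schema inference over JSON involves, here is a plain-Python sketch. It is not Spark's implementation; the function names and the merge rules (widen long to double, fall back to string on conflict) are simplifying assumptions made for this example. It does show why inference generally has to read the records before a schema is known.

```python
import json


def infer_type(value):
    """Map one JSON value to a coarse type name (names are illustrative)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "struct"


def merge(a, b):
    """Reconcile two observed types for the same field (assumed rules)."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"  # widen integers to floating point
    return "string"  # conflicting types fall back to string


def infer_schema(lines):
    """Scan every JSON line and return a field -> type mapping."""
    schema = {}
    for line in lines:
        for key, value in json.loads(line).items():
            observed = infer_type(value)
            schema[key] = merge(schema[key], observed) if key in schema else observed
    return schema


# Made-up records with a missing field, a null, and a type conflict on "id".
records = [
    '{"id": 1, "price": 10}',
    '{"id": 2, "price": 9.5, "note": "sale"}',
    '{"id": "3", "price": null}',
]
schema = infer_schema(records)
print(schema)
```

In this sketch the conflicting `id` values end up as a string column, which is the kind of surprise worth checking for when relying on an inferred schema rather than a declared one.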
Created 10-05-2015 04:49 PM
Apache Drill supports JSON as a self-describing data format; you can find the usage here. In Hive, HCatalog supports JSON as a SerDe format for reading and writing data into tables.