Created 07-07-2016 09:20 AM
HDP 2.4 installed using Ambari 2.2.2.0.
On my previous question, I received comprehensive feedback from the community, based on which I assumed the approach would be to import the data from the RDBMS into HDFS (as text/Avro) and then create Hive external tables on top of it.
Then I realized that I had missed/misinterpreted something:
Are my fears justified? If yes, how shall I proceed? If not, what am I missing (say, HCatalog usage)?
Created 07-07-2016 12:20 PM
I don't see a reason for the first import to be a text/uncompressed Avro file. Using HCatalog, you can import directly from Sqoop into a Hive table stored as ORC. That would save you a lot of space because of compression.
Once the initial data import is in Hive as ORC, you can still continue and transform this data as necessary. If the reason for writing as text is to access it from Pig and MR, an HCatalog table can also be accessed from Pig/MR.
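To make that concrete, here is a rough sketch of such a Sqoop-to-HCatalog import (the JDBC URL, credentials, database and table names below are placeholders, not from this thread):

    # Import straight into a Hive/HCatalog table stored as ORC;
    # no intermediate text/Avro files are written to HDFS.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table customers \
      --hcatalog-database staging \
      --hcatalog-table customers \
      --create-hcatalog-table \
      --hcatalog-storage-stanza "stored as orcfile" \
      -m 4

The --hcatalog-storage-stanza text is appended to the CREATE TABLE statement Sqoop generates, which is what makes the table ORC (and therefore compressed) from the very first load.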
Created 07-07-2016 12:50 PM
Can you check whether I have understood correctly:
A drawback of ORC as of this writing is that it was designed specifically for Hive, and so is not a general-purpose storage format that can be used with non-Hive MapReduce interfaces such as Pig or Java, or other query engines such as Impala. Work is under way to address these shortcomings, though.
There will be several RDBMS schemas that will be imported onto HDFS and LATER partitioned, processed, and so on. In this context, can you elaborate on 'Once the initial data import is in Hive as ORC, you can then still continue and transform this data as necessary'?
I have the following questions:
Created 07-07-2016 04:24 PM
I am not sure what they mean by ORC not being a general-purpose format. Anyway, in this case you are still going through HCatalog (there are HCatalog APIs for both MR and Pig).
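As a rough illustration of the Pig side (the database, table and column names are made up for the example), reading the ORC-backed table through HCatalog could look like this, assuming Pig is started with -useHCatalog:

    -- Load the Hive/HCatalog table directly; Pig does not need to know the ORC layout
    raw = LOAD 'staging.customers' USING org.apache.hive.hcatalog.pig.HCatLoader();
    -- Regular Pig processing from here on
    us_only = FILTER raw BY country == 'US';
    DUMP us_only;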
When I said you can transform this data as necessary, I mean things like creating new partitions, buckets, sort orders, bloom filters, and even redesigning tables for better access.
There will be data duplication with any such transform if you want to keep the raw data as well.
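A hypothetical sketch of that kind of transform, continuing the invented staging.customers example: the Sqoop-loaded table is kept as the raw copy, and a partitioned, bucketed, bloom-filtered ORC table is built next to it:

    -- Curated copy: partitioned by country, bucketed/sorted on id, bloom filter on id
    CREATE TABLE curated.customers (
      id         BIGINT,
      name       STRING,
      created_dt STRING
    )
    PARTITIONED BY (country STRING)
    CLUSTERED BY (id) SORTED BY (id) INTO 16 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ("orc.bloom.filter.columns" = "id");

    -- Allow dynamic partitioning and enforce bucketing on insert (Hive 1.2 on HDP 2.4)
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.enforce.bucketing=true;

    -- Populate from the raw Sqoop import; staging.customers stays untouched
    INSERT OVERWRITE TABLE curated.customers PARTITION (country)
    SELECT id, name, created_dt, country
    FROM staging.customers;

The raw import and the query-optimized copy then sit side by side, which is exactly the duplication mentioned above.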