Support Questions

Find answers, ask questions, and share your expertise

Revisited : Import to Hive or HDFS ?

avatar
Super Collaborator

HDP 2.4 installed using Ambari 2.2.2.0.

To my previous question, I received comprehensive feedback from the community based on which I assumed that importing data from RDBMS to HDFS(text/Avro) and then create Hive external tables.

Then I realized that I have missed/misinterpreted something :

  1. The ideas behind importing first to HDFS are :
    1. When stored on HDFS, Hive as well as other tools(Pig, MR) and external/third-party tools can access the files and process in their own ways
    2. Sqoop cannot directly create the EXTERNAL tables, moreover, it is required that you load the data first onto the cluster and after some period, PARTITION the tables(when the db developers are available for business knowledge)
  2. A 1TB RDBMS is imported as text/Avro files onto HDFS, this will occupy approx. 3TB on the HDFS(given the replication factor of 3)
  3. Creating a Hive EXTERNAL table is NOT going to consume much HDFS space, I created 'raw/plain' EXTERNAL tables that merely point to the imported files
  4. NOW the confusion begins - I need to create EXTERNAL PARTITIONED tables from these 'raw/plain' tables. Now, the final EXTERNAL PARTITIONED tables will again occupy space, also due to the point 1.1, we CANNOT delete the original imported files. This will lead to more consumption of HDFS space, due to duplication of data

Are my fears justified ? If yes, how shall I proceed ? If not, what am I missing(say, HCatalog usage) ?

1 ACCEPTED SOLUTION

avatar
Guru

I don't see a reason for the first insert to be a text/uncompressed avro file. Using HCatalog, you can directly import from sqoop to hive table as ORC. That would save you a lot of space because of compression.

Once the initial data import is in Hive as ORC, you can then still continue and transform this data as necessary. If the reason for writing as text is to access from Pig and MR, a HCatalog table also can be accessed from Pig/MR.

View solution in original post

3 REPLIES 3

avatar
Guru

I don't see a reason for the first insert to be a text/uncompressed avro file. Using HCatalog, you can directly import from sqoop to hive table as ORC. That would save you a lot of space because of compression.

Once the initial data import is in Hive as ORC, you can then still continue and transform this data as necessary. If the reason for writing as text is to access from Pig and MR, a HCatalog table also can be accessed from Pig/MR.

avatar
Super Collaborator

Can you check if I have understood correctly :

  • Sqoop import(with HCatalog integration) to Hive
  • Use HCatalog in case someone needs to access and process the data in Pig, MR etc. I came across the following paragraph in O'Reilly(and the same tone reflected in several posts on the Internet)
A drawback of ORC as of this writing is that it was designed specifically for Hive, and so is not a general-purpose storage format that can be used with non-Hive MapReduce interfaces such as Pig or Java, or other query engines such as Impala. Work is under way to address these shortcomings, though

There will be several RDBMS schemas that will be imported onto HDFS and LATER partitioned etc. and processed. In this context, can you elaborate 'Once the initial data import is in Hive as ORC, you can then still continue and transform this data as necessary.'

I have the following questions :

  • Suppose Sqoop import to Hive is done WITHOUT partitions(--hive-partition-key) i.e all tables are Hive 'Managed' tables and , say, this uses 800GB of HDFS space as compared to 1TB in the source RDBMS. The question now is won't more space be occupied when I try to create PARTITIONED tables?
  • Will it be possible for some third-party non-java tool to read the data by relying on HCatalog ?

avatar
Guru

I am not sure what they mean by ORC not being a general purpose format. Anyway, in this case, you are still going through HCatalog (there are HCatalog APIs for MR and Pig).

When I said you can transform this data as necessary, I mean things like creating new Partitions, Buckets, Sorting, Bloom filters and even redesigning tables for better access.

There will be data duplication with any data transforms if you want to keep raw data as well.