Support Questions

Find answers, ask questions, and share your expertise

Primary key partition handling from HAWQ to Hive

Expert Contributor

I understand HAWQ can handle primary key partitions. In a HAWQ to Hive migration, what is the best-suited approach to handle data ingestion?

1 ACCEPTED SOLUTION

Expert Contributor

@Yogeshprabhu

Migrating multiple nested sub-partitions from HAWQ to Hive using Sqoop will be challenging; if you need to implement it, you will need to use the Sqoop2 APIs. I would recommend importing the table as-is, with one parent partition, into HDFS. Then create an external table and migrate it to an internal table with the required partitioning.

Please remember that having a large number of partitions, each with a small amount of data, might hinder performance in Hive.
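As a rough sketch of that flow, assuming the Sqoop import landed the parent table as comma-delimited text under /data/hawq/rank and that year is the partition column (the table, column, and path names here are illustrative, not from the original post):

CREATE EXTERNAL TABLE rank_staging (
  id INT,
  name STRING,
  year INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/hawq/rank';

CREATE TABLE rank_hive (
  id INT,
  name STRING
)
PARTITIONED BY (year INT)
STORED AS ORC;

-- Dynamic partitioning lets a single INSERT create every year partition.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE rank_hive PARTITION (year)
SELECT id, name, year FROM rank_staging;

ORC and dynamic partitioning are just one reasonable choice; the point is to stage the flat import as an external table and let the INSERT build the partitioned managed table.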


8 REPLIES

Expert Contributor
@Yogeshprabhu

Move the HAWQ primary key check constraint into the data ingestion script. For example, in the case of Sqoop, use a custom query to pull only the data matching the check constraint and create a corresponding child table in Hive. This way you can achieve the same schema structure between HAWQ and Hive.

-- Child partition definition in HAWQ; the child table name is implied by the constraint name.
CREATE TABLE "Test".rank_1_prt_2 (
    CONSTRAINT rank_1_prt_2_check CHECK (year >= 2001 AND year < 2002)
)
INHERITS ("Test".rank);



Move this constraint into the Sqoop script's query condition and create a separate Hive table for each HAWQ child table. For example:
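A hedged sketch of what that looks like for the child partition shown above (the column list and the Hive table name are assumptions for illustration):

-- Query to hand to the Sqoop free-form import for this child partition;
-- Sqoop replaces the $CONDITIONS token with its own split predicates.
SELECT id, name, year
FROM "Test".rank
WHERE year >= 2001 AND year < 2002
  AND $CONDITIONS

-- Matching Hive table for the 2001 child partition
CREATE TABLE rank_2001 (
  id INT,
  name STRING,
  year INT
)
STORED AS ORC;

Repeat the pair for each child partition, adjusting the range in the WHERE clause to match that child's check constraint.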

Expert Contributor

Thanks @njayakumar

In case of nested partitions, how will that be handled?

Expert Contributor

@Yogeshprabhu

Migrating multiple nested sub-partitions from HAWQ to Hive using Sqoop will be challenging; if you need to implement it, you will need to use the Sqoop2 APIs. I would recommend importing the table as-is, with one parent partition, into HDFS. Then create an external table and migrate it to an internal table with the required partitioning.

Please remember that having a large number of partitions, each with a small amount of data, might hinder performance in Hive.

Contributor

What is a "Primary key partition"? I've never heard of that before.

- HAWQ doesn't support indexes so it doesn't support a Primary Key constraint.

- HAWQ does have table partitioning, which can be a list or a range of values. Most commonly, a partition is based on a date or timestamp column and covers a range such as an entire month, quarter, or year. This is similar to Hive partitioning, but Hive can't partition on a range of values like HAWQ can.

- HAWQ also has table distribution, which can be either random or a hash of one or more columns. With HAWQ 2.0, random distribution is recommended (see the sketch after this list).
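For reference, a sketch of the kind of HAWQ DDL being described, with range partitioning on a year column and random distribution (table, column names, and ranges are illustrative):

CREATE TABLE "Test".rank (
  id INT,
  name TEXT,
  year INT
)
DISTRIBUTED RANDOMLY
PARTITION BY RANGE (year)
(
  START (2001) END (2005) EVERY (1)
);

-- HAWQ generates child tables such as "Test".rank_1_prt_2, each carrying a
-- CHECK constraint on year like the one quoted earlier in this thread.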

So your question is really how to migrate data from HAWQ to Hive. First off, Sqoop would be pretty slow: it would be a single process unloading the data, and I would never recommend using Sqoop for something like this. Instead, you should use a Writable External Table in HAWQ that writes, in parallel, directly to HDFS.

http://hdb.docs.pivotal.io/20/pxf/PXFExternalTableandAPIReference.html
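A minimal sketch of that approach via PXF, per the docs linked above (the PXF host, port, HDFS path, and delimiter are assumptions to adjust for your cluster):

-- Writable external table in HAWQ that streams rows to HDFS through PXF
CREATE WRITABLE EXTERNAL TABLE rank_ext (LIKE "Test".rank)
LOCATION ('pxf://namenode:51200/data/hawq_export/rank?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- The unload runs in parallel across the HAWQ segments
INSERT INTO rank_ext SELECT * FROM "Test".rank;

The files written under /data/hawq_export/rank can then be exposed to Hive with an external table, as in the earlier example.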

Expert Contributor
@Jon Roberts

Sqoop can be run in parallel by splitting on a column (--split-by) or by explicitly setting the number of mappers.
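For illustration, this is roughly what --split-by does under the hood (the table, the id split column, and the example range are assumptions): Sqoop first asks the database for the bounds of the split column and then gives each mapper its own slice.

-- Boundary query Sqoop issues against the source table for the split column
SELECT MIN(id), MAX(id) FROM "Test".rank;

-- With 4 mappers and ids spanning 1 to 1000000, each mapper imports one slice:
-- mapper 1: WHERE id >= 1      AND id < 250001
-- mapper 2: WHERE id >= 250001 AND id < 500001
-- mapper 3: WHERE id >= 500001 AND id < 750001
-- mapper 4: WHERE id >= 750001 AND id <= 1000000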

In most places, HAWQ is managed by a different team, so creating the external table involves a lot of process changes.

I'm not sure how HAWQ will handle the HDFS write in the case of a secured cluster.

Contributor

HAWQ's external table supports secured clusters. But I would always prefer to just create a single external table and then run a typical "INSERT INTO ext_table SELECT * FROM hawq_table;" because it would be so much faster to work with.

Speaking of speed, I haven't heard of anyone moving data from HAWQ to Hive. It is always the other way round! HAWQ is so much faster and has better SQL support than Hive.

Expert Contributor
@Jon Roberts

Could you please elaborate on how external tables support secured clusters? I'm not sure how HAWQ handles an HDFS write to a different secured Hadoop cluster using a writable external table.

Thanks in advance.

Contributor