Background - We have an existing 10 TB Hive table that is range partitioned on column A. The business case has changed and now requires partitioning on column B in addition to column A. Problem statement - Since the data on HDFS is very large and needs to be restructured to inherit the new partition column B, we are finding it difficult to copy the table to a backup and re-ingest it into the main table with a simple Impala INSERT OVERWRITE.
We want to explore whether there is a more efficient way to add partition columns to such a huge table.
If I understand your situation correctly, you have a table backed by 10 TB of data on HDFS, partitioned on column A, and you want to add column B as an additional partition column.
So, if column B is going to be the sub-partition, the HDFS directory would look like /user/hive/warehouse/database/table/colA/colB, or /colB/colA otherwise (assuming it is a managed table).
Restructuring the HDFS directories manually is not a good idea, because it becomes a nightmare to scan the data in every file and reorganize it into its corresponding folder.
Below is one way of doing it,
1. Create a new table with new structure - i.e., with partitions on Col A and Col B.
CREATE TABLE NEWTABLE ( COLUMNS ... ) PARTITIONED BY ( COL_A INT, COL_B INT )
2.a. Insert data from the old table into the new table (created in step #1) with a dynamic partition insert. Note that the partition columns must be the last columns in the SELECT list:
INSERT INTO NEWTABLE PARTITION (COL_A, COL_B) SELECT ..., COL_A, COL_B FROM OLDTABLE
But yes, this step will consume a lot of resources during execution if not handled properly: HDFS space to hold the results as data for NEWTABLE, and of course time.
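If the insert is run through Hive rather than Impala, dynamic partitioning has to be enabled first; a sketch, using the placeholder table and column names from above (tune the limits to your actual partition counts):

```sql
-- Let Hive derive partitions from the data itself
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Raise the limits if COL_A x COL_B yields many partition combinations
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.dynamic.partitions.pernode=1000;

-- Partition columns must be the last columns in the SELECT list
INSERT INTO NEWTABLE PARTITION (COL_A, COL_B)
SELECT ..., COL_A, COL_B FROM OLDTABLE;
```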
2.b. If you think HDFS will not have enough space to hold all the data, or you expect a resource crunch, I would suggest doing this INSERT in batches, removing the old data after each INSERT operation.
INSERT INTO NEWTABLE PARTITION (COL_A, COL_B) SELECT ..., COL_A, COL_B FROM OLDTABLE WHERE COL_A='abc';
ALTER TABLE OLDTABLE DROP PARTITION (COL_A='abc');
INSERT INTO NEWTABLE PARTITION (COL_A, COL_B) SELECT ..., COL_A, COL_B FROM OLDTABLE WHERE COL_A='def';
ALTER TABLE OLDTABLE DROP PARTITION (COL_A='def');
. . . and so on.
(Note: DELETE FROM only works on ACID tables; for a plain managed table, dropping the partition is what actually removes the old data.)
This way, you free HDFS of the data that has already been handled and keep the space usage balanced.
If you follow step 2.b., you can write a script to automate this process by passing the partition names (derived from SHOW PARTITIONS) dynamically for each run. But try the first couple of batches manually before automating, to make sure things go as expected.
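As a sketch of such a script, using the placeholder table and column names from above; it only prints the statements so you can review them before piping them into hive -f or impala-shell -f:

```shell
# Generate one INSERT + DROP PARTITION pair per COL_A value.
# PARTS would normally be filled from:  hive -e "SHOW PARTITIONS OLDTABLE"
PARTS="abc
def"
SQL=$(echo "$PARTS" | while read -r part; do
  echo "INSERT INTO NEWTABLE PARTITION (COL_A, COL_B) SELECT ..., COL_A, COL_B FROM OLDTABLE WHERE COL_A='$part';"
  echo "ALTER TABLE OLDTABLE DROP PARTITION (COL_A='$part');"
done)
echo "$SQL"
```

Replace "..." with the actual non-partition column list before running, and keep each batch small enough that a failed run is cheap to redo.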
On top of what Gomath has suggested, I also have another idea:
1. Assuming that your current data on HDFS is laid out as:
/user/hive/warehouse/dbname/tablename/columnA=1
/user/hive/warehouse/dbname/tablename/columnA=2
...
and you want it to become:
/user/hive/warehouse/dbname/new_tablename/columnA=1/columnB=1
/user/hive/warehouse/dbname/new_tablename/columnA=2/columnB=1
...
2. Create a new table that has the same columns, but with the extra partition column columnB, assuming it will be created under /user/hive/warehouse/dbname/new_tablename on HDFS.
3. You can then build the new directory structure by hand, in the form:
/user/hive/warehouse/dbname/new_tablename/columnA=<a>/columnB=<b>
4. Create the top-level partition directories for the new table:
hadoop fs -mkdir /user/hive/warehouse/dbname/new_tablename/columnA=1
hadoop fs -mkdir /user/hive/warehouse/dbname/new_tablename/columnA=2
hadoop fs -mkdir /user/hive/warehouse/dbname/new_tablename/columnA=3
hadoop fs -mkdir /user/hive/warehouse/dbname/new_tablename/columnA=4
5. Then MOVE the data from the old table into the new table's partitions:
hadoop fs -mv /user/hive/warehouse/dbname/tablename/columnA=1 /user/hive/warehouse/dbname/new_tablename/columnA=1/columnB=1
hadoop fs -mv /user/hive/warehouse/dbname/tablename/columnA=2 /user/hive/warehouse/dbname/new_tablename/columnA=2/columnB=1
hadoop fs -mv /user/hive/warehouse/dbname/tablename/columnA=3 /user/hive/warehouse/dbname/new_tablename/columnA=3/columnB=1
hadoop fs -mv /user/hive/warehouse/dbname/tablename/columnA=4 /user/hive/warehouse/dbname/new_tablename/columnA=4/columnB=1
Which value to use for columnB is up to you, since it is a new partition. Note that this approach only works if all rows inside a given columnA directory belong to the same columnB value (for example, when columnB is a new attribute rather than something derived from the existing rows); if the files would need to be split by columnB, you are back to the INSERT approach.
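With many columnA values, the mkdir/mv pairs above can be generated in a loop rather than typed by hand; a sketch that only prints the commands for review (columnB=1 assumed for all partitions, as above):

```shell
# Print the hadoop fs commands for every existing columnA partition,
# instead of typing one mkdir/mv pair per value.
OLD=/user/hive/warehouse/dbname/tablename
NEW=/user/hive/warehouse/dbname/new_tablename
CMDS=$(for a in 1 2 3 4; do
  echo "hadoop fs -mkdir -p $NEW/columnA=$a"
  echo "hadoop fs -mv $OLD/columnA=$a $NEW/columnA=$a/columnB=1"
done)
echo "$CMDS"
```

In real use, derive the list of columnA values from hadoop fs -ls on the old table's directory rather than hard-coding 1..4.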
6. Go back to Hive and run the command below:
MSCK REPAIR TABLE dbname.new_tablename;
This will recover all the missing partitions into the metastore. Then run:
SELECT * FROM dbname.new_tablename;
to confirm that the data is correct.
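Rather than eyeballing a full SELECT *, a per-partition row count is a cheaper sanity check (placeholder names as above); record the old table's counts before moving the files and compare:

```sql
-- Run against the old table BEFORE moving the files
SELECT columnA, COUNT(*) FROM dbname.tablename GROUP BY columnA;

-- Run against the new table after MSCK REPAIR; totals should match
SELECT columnA, columnB, COUNT(*)
FROM dbname.new_tablename
GROUP BY columnA, columnB;
```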
This way, you do not need to duplicate any data on HDFS.
Hope above helps.