Created on 12-20-2019 06:53 AM - last edited on 12-20-2019 08:49 AM by VidyaSargur
Hi Guys,
I got a request to copy a large Hive external table with 72,883 partitions from one database to another. I am new to Hive, so I need your help copying this table. I have done the following steps so far and am not sure about the next step.
Do I need to add those 72,883 partitions manually, or is there another way to add them to the target table?
Your early response is very much appreciated.
Steps:
1. Created the external table in the new database with the same definition as the source table, but with a different location.
2. Copied the CSV files from the source table's location to the target table's location.
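A sketch of step 1 might look like the following; the database, table, column, and path names here are hypothetical placeholders, and the real DDL would come from running SHOW CREATE TABLE against the source:

```sql
-- Step 1 sketch: create the target table with the source's definition but a
-- new LOCATION. All names and columns below are hypothetical.
SHOW CREATE TABLE source_db.sales_ext;

CREATE EXTERNAL TABLE target_db.sales_ext (
  id INT,
  amount DOUBLE
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/target_db/sales_ext';
```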
Regards
~Suresh D
Created 12-21-2019 06:47 AM
Best way to duplicate a partitioned table in Hive
HTH
Created 01-01-2020 11:06 AM
Any updates on this thread?
Created 01-02-2020 07:05 AM
Thank you for following up on this. I followed the steps below to copy the Hive external table successfully.
Steps:
1. Created the target external table in the new database with the same definition as the source table, but with a different location. Extracted the definition of the source table with "SHOW CREATE TABLE <Hive external source table>".
2. Copied the CSV files from the source table's location to the target table's location with "hadoop fs -cp".
3. Added the 72,883 partitions with the command below. I had to come up with a script for all 72,883 partitions, though.
"ALTER TABLE <Hive external target table> ADD PARTITION (<partition spec>);"
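Such a script can be sketched in shell. The table name, partition column, and values below are hypothetical; in practice the value list would be derived from the partition directory names under the table's HDFS location:

```shell
# Sketch: emit one ALTER TABLE ... ADD PARTITION statement per partition value
# into an .hql file that can then be run with beeline/hive -f.
# Table, column, and values are hypothetical placeholders.
table="target_db.sales_ext"
col="dt"
values="2019-01-01 2019-01-02 2019-01-03"

: > add_partitions.hql          # truncate/create the output file
for v in $values; do
  printf "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (%s='%s');\n" \
    "$table" "$col" "$v" >> add_partitions.hql
done
```

IF NOT EXISTS makes the generated statements safe to re-run; the resulting file would then be executed with something like `beeline -f add_partitions.hql`.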
Regards
Suresh D
Created 01-02-2020 07:47 AM
Great that it worked out for you. Did you run MSCK REPAIR TABLE table_name; on the target table?
If you found this answer addressed your initial question, please take a moment to log in and click "Accept" on the answer.
Happy hadooping
Created 01-02-2020 08:56 AM
Hi Shelton,
I have not run "MSCK REPAIR TABLE table_name;" on the target table, and I am not sure if it is needed. Why do we have to run that command? My target table is working fine without it.
Regards
~Suresh D
Created 01-02-2020 09:06 AM
Great. If all went well, we usually run MSCK REPAIR TABLE daily once a new partition has been loaded into the HDFS location.
Why do you need to run the MSCK REPAIR TABLE statement after each ingestion?
Hive stores a list of partitions for each table in its metastore. If new partitions are added directly to HDFS, however, the metastore (and hence Hive) will not be aware of them unless the user registers the newly added partitions, either with ALTER TABLE ... ADD PARTITION or with MSCK REPAIR TABLE.
MSCK adds metadata to the Hive metastore for partitions for which such metadata does not already exist.
If you find the need, remember to do that; otherwise, please accept the answer and close the thread.
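As a concrete example (the table name here is hypothetical), after new partition directories land in HDFS the repair is a single statement:

```sql
-- Register any partition directories the metastore doesn't know about yet.
MSCK REPAIR TABLE target_db.sales_ext;
```

On Hive 3, `MSCK REPAIR TABLE target_db.sales_ext SYNC PARTITIONS;` additionally drops metastore entries for partitions whose directories no longer exist on the file system.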
Created 01-02-2020 09:15 AM
Thank you, Shelton, for sharing. I did not know about it. Next time I will run this command instead of adding the partitions with ALTER TABLE.
Thank you again.
Suresh D
Created 01-02-2020 10:57 AM
Hi Shelton,
I did find a note describing another way to add the partitions. Are you aware of it? If so, do you see any issues with it?
Regards
~Suresh D
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive-automate-msck.html
Hive automatically and periodically discovers discrepancies in partition metadata in the Hive metastore and corresponding directories on the file system, and then performs synchronization. Automating this operation for log data or data in Spark and Hive catalogs is especially helpful.
The discover.partitions table property enables and disables synchronization of the file system with partitions. In external partitioned tables, this property is enabled (true) by default when you create the table using Hive in HDP 3.1.4 and later. To a legacy external table (created using an earlier version of Hive), add discover.partitions to the table properties to enable partition discovery. By default, the discovery and synchronization of partitions occurs every 5 minutes, but you can configure the frequency as shown in this task.
ALTER TABLE exttbl SET TBLPROPERTIES ('discover.partitions' = 'true');
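The sync interval mentioned above is a metastore setting. Assuming Hive 3's `metastore.partition.management.task.frequency` property (value in seconds), hive-site.xml would be adjusted along these lines:

```xml
<!-- Hypothetical hive-site.xml fragment: run partition discovery every
     10 minutes instead of the 5-minute default. -->
<property>
  <name>metastore.partition.management.task.frequency</name>
  <value>600</value>
</property>
```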