Contributor
Posts: 49
Registered: ‎04-21-2015

parquet file copy on partitioned tables

Hello,

I am trying to copy Parquet files between two Impala databases that run on two different Cloudera installations.

I created the tables on the target with exactly the same column types and partitions.

I copied the Parquet files from the source to the target file system, moved them into the table's HDFS location, and then ran the refresh <table> command in impala-shell.

For non-partitioned tables, there is no problem: after the refresh command I can see all the data in the target tables.

For partitioned tables, there is no data after the refresh command, so I tried the following procedure:

  • dropped the partitioned tables from the target
  • recreated the tables on the target without partitions, defining the source table's partition columns as normal columns
  • copied the Parquet files into the HDFS location
  • ran the refresh <table> command

When I check the target table, I can see the data, except in the columns that are partition columns on the source table.

What is the proper way to copy Parquet files between two Impala databases when the tables are partitioned?

Thanks

Cloudera Employee
Posts: 307
Registered: ‎10-16-2013

Re: parquet file copy on partitioned tables

For partitioned tables, you also need to re-create the partition metadata in the target table. A refresh alone will not re-create that metadata. Here are a few options for re-creating the partitions in the new table:

1. Use Impala's "ALTER TABLE new_table RECOVER PARTITIONS" (available in Impala 2.3)

2. Use Hive's "msck repair" to recover the partition metadata

3. Re-create the partitions manually with ALTER TABLE new_table ADD PARTITION(...)
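
As a rough illustration of option 3, the Python sketch below derives ADD PARTITION statements from file paths. It assumes the copied files sit in key=value subdirectories under the table location (for example .../year=2015/country=TR/), which is also the layout that options 1 and 2 look for. The paths, the table name new_table, and the helper names are hypothetical; in practice the paths might come from a recursive listing such as hdfs dfs -ls -R, and the literal quoting should be adjusted to the real partition column types.

```python
import re


def partition_spec(file_path):
    """Collect (key, value) pairs from key=value directory names in a file path."""
    directories = file_path.strip("/").split("/")[:-1]  # drop the file name
    pairs = []
    for name in directories:
        if "=" in name:
            key, value = name.split("=", 1)
            pairs.append((key, value))
    return tuple(pairs)


def sql_literal(value):
    """Emit integers bare and quote everything else.

    Assumption: integer-looking values belong to numeric partition columns.
    Adjust this if a STRING partition column holds numeric-looking values.
    """
    if re.fullmatch(r"-?\d+", value):
        return value
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def add_partition_statements(table, file_paths):
    """Build one ALTER TABLE ... ADD PARTITION statement per distinct partition."""
    specs = []
    for path in file_paths:
        spec = partition_spec(path)
        if spec and spec not in specs:  # keep first-seen order, skip duplicates
            specs.append(spec)
    statements = []
    for spec in specs:
        columns = ", ".join(f"{key}={sql_literal(value)}" for key, value in spec)
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({columns});"
        )
    return statements


# Hypothetical file paths, used only to demonstrate the helpers.
example_paths = [
    "/warehouse/new_table/year=2015/country=TR/part-0.parq",
    "/warehouse/new_table/year=2015/country=TR/part-1.parq",
    "/warehouse/new_table/year=2015/country=DE/part-0.parq",
]

for statement in add_partition_statements("new_table", example_paths):
    print(statement)
```

The generated statements could then be saved to a file and run with impala-shell. Files copied flat into the table directory, without the key=value subdirectories, carry no partition values, so none of the three options would find partitions for them.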

Contributor
Posts: 49
Registered: ‎04-21-2015

Re: parquet file copy on partitioned tables

Hello Alex,

Thanks for your reply.

1.) I am running the following Impala and CDH versions. Can I easily upgrade only Impala?
impala.x86_64                  2.2.0+cdh5.4.5+0-1.cdh5.4.5.p0.8.el6
impala-shell.x86_64            2.2.0+cdh5.4.5+0-1.cdh5.4.5.p0.8.el6

 

2.) I tried the following steps, once without and once with the partition columns, but both attempts failed:

  • dropped the table
  • recreated the table without partitions
  • copied the Parquet files into the same directory (I took the directory from the show create table output)
  • ran msck repair table <tablename> in the Hive CLI
  • result: there is no data in the partition columns

  • dropped the table
  • recreated the table with partition columns
  • copied the Parquet files into the same directory
  • ran msck repair table <tablename> in the Hive CLI
  • result: no data in the table

I think there is a mistake somewhere in my steps, but I have not found it yet. Any advice?

3.) I receive many Parquet tables from an external source, with various partitioning schemes (not only date/time columns), and I am trying to automate the process, so I need a somewhat easier way to do it. For now, I am focusing on 1 and 2 :)

Many thanks

Suluhan