Hadoop distcp copy latest file in hive table partition directory

Hi everyone,

I am trying to copy HDFS files from a Hive table that is partitioned on multiple columns. However, the copy also picks up old files within the partitions, so the data in the local path differs from the source table data for the most recent partitions.

Example:

/usr/hive/warehouse/sample_db.db/test_table/country=US/dt=2020-06-28/1739

/usr/hive/warehouse/sample_db.db/test_table/country=US/dt=2020-06-28/1740

/usr/hive/warehouse/sample_db.db/test_table/country=US/dt=2020-06-29/1234

/usr/hive/warehouse/sample_db.db/test_table/country=US/dt=2020-06-29/1235

/usr/hive/warehouse/sample_db.db/test_table/country=US/dt=2020-06-29/1236

In the listing above, there are 3 directories for the dt=2020-06-29 partition, and "/usr/hive/warehouse/sample_db.db/test_table/country=US/dt=2020-06-29/1236" is the latest file, the one the partition points to. So I want to fetch the latest file for each partition before copying the data from HDFS to S3.

I tried the command below, but it returns only a single file (the latest one overall) rather than one per partition.

hdfs dfs -ls -R /usr/hive/warehouse/sample_db.db/test_table/* | awk '{if ($5 > 0) print $8}' | sort -n | tail -1

Is there any way to get the latest file in each partition directory?
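
For reference, here is a rough sketch of the kind of per-partition grouping I am after. It is not a tested solution on a real cluster. It assumes the usual 8-column `hdfs dfs -ls -R` output (permissions, replication, owner, group, size, date, time, path), that the entries such as 1234/1235/1236 are plain files, that "latest" means the newest modification time, and that paths contain no spaces. The function name latest_per_partition is made up for illustration.

```shell
# Reads `hdfs dfs -ls -R` output on stdin and prints, for each parent
# directory, the non-empty file with the newest modification timestamp.
latest_per_partition() {
  awk '
    # Skip directories (permissions start with d) and zero-size files.
    $1 !~ /^d/ && $5 > 0 {
      path = $8
      dir = path
      sub("/[^/]*$", "", dir)      # strip the last path component
      ts = $6 " " $7               # "YYYY-MM-DD HH:MM" compares correctly as a string
      # Keep the first entry seen for a directory, then any strictly newer one.
      if (!(dir in best_ts) || ts > best_ts[dir]) {
        best_ts[dir] = ts
        best_path[dir] = path
      }
    }
    END { for (d in best_path) print best_path[d] }
  ' | sort
}
```

It would be used as: hdfs dfs -ls -R /usr/hive/warehouse/sample_db.db/test_table | latest_per_partition. Because the listing only has minute resolution, two files written in the same minute would tie, and this sketch keeps whichever one is listed first.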