Created 04-06-2017 05:31 PM
I need to bring prod Hive table data into a test Hive table. Since this is Hadoop to Hadoop, I can't use Sqoop, so I plan to use distcp to transfer the data across the clusters. But I have one more scenario to handle while bringing the data over: filtering. Say I have 10 million records in the prod Hive table; I want to filter them using some criteria and bring only the matching records into the test table. Is there a way to give filter parameters in the distcp command on the fly? Or any other suggestions? Thanks in advance.
Created 04-06-2017 05:41 PM
You can use distcp -filters to exclude paths that match certain patterns.
Refer to this:
http://www.ericlin.me/how-to-use-filters-to-exclude-files-when-in-distcp
hadoop distcp -filters /path/to/filterfile.txt hdfs://source/path hdfs://destination/path
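To make the filter file concrete, here is a minimal sketch. As far as I know, the file holds one Java regular expression per line, and any path matching a pattern is excluded from the copy. The file location, the three patterns, and the DRY_RUN switch are only illustrative assumptions; adjust them to your layout. Note that this works at the file/path level, not on the records inside the files.

```shell
#!/usr/bin/env bash
# Hypothetical location for the filter file; override with FILTER_FILE=...
FILTER_FILE="${FILTER_FILE:-/tmp/distcp-filters.txt}"

# One regex per line; paths that match are skipped by distcp.
# These patterns are examples only (temp files, job scratch dirs, Hive staging dirs).
cat > "$FILTER_FILE" <<'EOF'
.*\.tmp$
.*/_temporary/.*
.*/\.hive-staging.*
EOF

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run hadoop distcp -filters "$FILTER_FILE" hdfs://source/path hdfs://destination/path
```

With the default DRY_RUN=1 the script only writes the filter file and prints the distcp command, so you can check both before running anything against the clusters.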
Created 04-06-2017 06:51 PM
Thanks Namit Maheshwari. The data I am bringing into test is Hive data, and I need to filter it using some criteria, like a WHERE condition in a Hive query. distcp -filters only excludes some files, right? It does not work at the data level. I want to filter the Hive data using some criteria in production, and then bring the filtered data into the test region.
Created 04-06-2017 07:12 PM
Filter the data in Hive on prod and write the result to a file in HDFS, then, as mentioned by @Namit Maheshwari, use distcp to transfer it between the environments. If you just want to limit the amount of data without applying any filter criteria, copy only a subset of the files under the table's HDFS folder.
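The steps could look roughly like the sketch below. Everything named here is a placeholder I made up for illustration: the table names, the WHERE clause, the export directory, the namenode addresses, and the DRY_RUN switch. It also assumes the test table already exists with a matching tab-delimited text format. Treat it as an outline, not a tested procedure for your clusters.

```shell
#!/usr/bin/env bash
# Hypothetical names; replace with your own.
SRC_TABLE="prod_db.customers"
DEST_TABLE="test_db.customers"
WHERE_CLAUSE="country = 'US'"
EXPORT_DIR="/tmp/export/customers_filtered"
PROD_NN="hdfs://prod-nn:8020"
TEST_NN="hdfs://test-nn:8020"

# Step 1 query (run on prod): write only the filtered rows to an HDFS directory.
export_query() {
  printf '%s\n' "INSERT OVERWRITE DIRECTORY '${EXPORT_DIR}' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT * FROM ${SRC_TABLE} WHERE ${WHERE_CLAUSE}"
}

# Step 3 query (run on test): move the copied files into the test table.
load_query() {
  printf '%s\n' "LOAD DATA INPATH '${EXPORT_DIR}' INTO TABLE ${DEST_TABLE}"
}

# Print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run hive -e "$(export_query)"                                          # on prod
run hadoop distcp "${PROD_NN}${EXPORT_DIR}" "${TEST_NN}${EXPORT_DIR}"  # prod to test
run hive -e "$(load_query)"                                            # on test
```

Steps 1 and 3 have to be run against the matching cluster, so in practice you would split this into two scripts or point the Hive client at the right environment for each step.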
Created 04-07-2017 12:50 PM
Thank you Bala!
Created 04-07-2017 12:58 PM
@Prabhu Muthaiyan Glad that it helped you. Happy Hadooping!!