Need to bring prod hive table data into test environment using distcp.

I need to bring data from a prod Hive table into a test Hive table. Since this is a Hadoop-to-Hadoop transfer, I can't use Sqoop, so I plan to use distcp to copy the data across the clusters. But I have one more requirement to handle while bringing the data over: filtering. Say I have 10 million records in the prod Hive table; I want to filter them by some criteria and bring only the matching rows into the test table. Is there a way to pass filter parameters to the distcp command on the fly? Or are there any other suggestions? Thanks in advance.

1 ACCEPTED SOLUTION

@Prabhu Muthaiyan

Filter the data in prod Hive and write the result to files in HDFS, then, as mentioned by @Namit Maheshwari, use distcp to transfer those files between environments. If you only want to limit the amount of data without applying any query filter, copy only a subset of the files under the table's HDFS folder.
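
The two steps above could be sketched roughly as follows. This is a minimal sketch, not a tested production script: the table name, WHERE clause, export directory, and NameNode addresses are all hypothetical placeholders. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch only: table, filter, paths, and NameNode addresses are hypothetical.
set -euo pipefail

DEFAULT_FILTER="order_date >= '2017-01-01'"          # hypothetical WHERE clause
SRC_TABLE="${SRC_TABLE:-proddb.orders}"              # hypothetical prod table
FILTER="${FILTER:-$DEFAULT_FILTER}"
EXPORT_DIR="${EXPORT_DIR:-/tmp/export/orders_filtered}"
SRC_NN="${SRC_NN:-hdfs://prod-nn:8020}"              # hypothetical prod NameNode
DST_NN="${DST_NN:-hdfs://test-nn:8020}"              # hypothetical test NameNode
DRY_RUN="${DRY_RUN:-1}"                              # 1 = print commands only

# Build the HiveQL that writes the filtered rows to an HDFS directory.
build_export_query() {
  printf "INSERT OVERWRITE DIRECTORY '%s' SELECT * FROM %s WHERE %s" \
    "$EXPORT_DIR" "$SRC_TABLE" "$FILTER"
}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1 (prod cluster): materialize the filtered rows as files in HDFS.
run hive -e "$(build_export_query)"

# Step 2: copy only those files to the test cluster.
run hadoop distcp "${SRC_NN}${EXPORT_DIR}" "${DST_NN}${EXPORT_DIR}"
```

Without a ROW FORMAT clause, INSERT OVERWRITE DIRECTORY writes delimited text files, so the test-side table that reads the copied directory would need a matching file format and delimiter.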

5 REPLIES

You can use distcp -filters to exclude paths that match certain patterns.

Refer to this:

http://www.ericlin.me/how-to-use-filters-to-exclude-files-when-in-distcp

hadoop distcp -filters /path/to/filterfile.txt hdfs://source/path hdfs://destination/path
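
For illustration, a filter file for the command above might be created like this. It is a sketch with hypothetical patterns and a hypothetical file location: the file holds one Java regular expression per line, and any source path matching a line is excluded from the copy. The patterns start with `.*` on the assumption that each one is matched against the full path.

```shell
#!/usr/bin/env bash
# Sketch only: the file location and the patterns are hypothetical examples.
set -euo pipefail

FILTER_FILE="${FILTER_FILE:-/tmp/distcp-filters.txt}"

# One Java regex per line; paths matching any line are skipped by distcp.
cat > "$FILTER_FILE" <<'EOF'
.*\.tmp$
.*/_temporary/.*
.*/\.hive-staging.*
EOF

# Print the distcp invocation that would use the file (not executed here).
echo "hadoop distcp -filters $FILTER_FILE hdfs://source/path hdfs://destination/path"
```

Note that this only excludes whole files and directories by path; it cannot select individual rows inside a file.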

Thanks Namit Maheshwari. The data I am bringing into test is Hive data, and I need to filter it by some criteria, like a WHERE condition in a Hive query. distcp -filters excludes certain files, right? It does not work at the data level. I want to filter the Hive data by some criteria in production and then bring the filtered data into the test environment.

Thank you Bala!

@Prabhu Muthaiyan Glad that it helped you. Happy Hadooping!!