Need to bring prod Hive table data into a test environment using distcp

I need to copy data from a prod Hive table into a test Hive table. Since this is a Hadoop-to-Hadoop transfer, I can't use Sqoop, so I plan to use distcp to move the data across clusters. I also need to filter the data as it is copied. Say the prod Hive table has 10 million records: I want to select a subset by some criteria and bring only that into the test table. Is there a way to pass filter parameters to the distcp command on the fly? Or are there any other suggestions? Thanks in advance.

1 ACCEPTED SOLUTION

@Prabhu Muthaiyan

Filter the data in Hive on prod and write the result to a file, then, as mentioned by @Namit Maheshwari, use distcp to transfer it between the environments. If you only want to limit the amount of data and don't need a row-level filter, copy just a subset of the files under the table's HDFS folder.
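As a rough sketch of the two steps above (filter in Hive, then copy with distcp), something like the following could work. The table name, filter condition, HDFS paths and NameNode addresses are all hypothetical placeholders, and the delimited text format is an assumption. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only. Every name below (table, filter, paths, NameNode addresses)
# is a hypothetical placeholder, not taken from the thread.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = actually run them

SRC_TABLE="prod_db.orders"
ROW_FILTER="order_date >= '2017-01-01'"
EXPORT_DIR="/tmp/export/orders_filtered"
SRC_NN="hdfs://prod-nn:8020"
DST_NN="hdfs://test-nn:8020"
DST_DIR="/tmp/import/orders_filtered"

# In dry-run mode print the command (without shell quoting), otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1 (prod cluster): write only the matching rows to an HDFS directory.
export_filtered() {
  run hive -e "INSERT OVERWRITE DIRECTORY '${EXPORT_DIR}' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM ${SRC_TABLE} WHERE ${ROW_FILTER}"
}

# Step 2: copy the exported files from the prod cluster to the test cluster.
copy_to_test() {
  run hadoop distcp "${SRC_NN}${EXPORT_DIR}" "${DST_NN}${DST_DIR}"
}

export_filtered
copy_to_test
```

On the test cluster, the copied files would still need to be attached to the test table, for example with LOAD DATA INPATH or an external table over the destination directory, assuming the table is defined with the same delimited text layout.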

REPLIES

You can use distcp -filters to exclude paths that match a set of patterns.

See:

http://www.ericlin.me/how-to-use-filters-to-exclude-files-when-in-distcp

hadoop distcp -filters /path/to/filterfile.txt hdfs://source/path hdfs://destination/path
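For illustration, a filter file might look like the sketch below. The file holds one Java regular expression per line, and any path that matches a pattern is skipped. The three patterns and the file location are hypothetical examples, and the file is assumed to live on the local filesystem of the machine that runs distcp. Note that this excludes whole files or directories, not rows inside them.

```shell
# Hypothetical filter file location; override with FILTER_FILE if needed.
FILTER_FILE="${FILTER_FILE:-/tmp/distcp-filters.txt}"

# One regex per line. Paths matching any pattern are excluded from the copy.
cat > "$FILTER_FILE" <<'EOF'
.*\.tmp$
.*/_temporary/.*
.*/\.hive-staging.*
EOF

# Prints the command only; drop the leading "echo" to actually run it.
echo hadoop distcp -filters "$FILTER_FILE" hdfs://source/path hdfs://destination/path
```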

Thanks, Namit Maheshwari. The data I am bringing into test is Hive data, and I need to filter it by some criteria, like a WHERE condition in a Hive query. distcp -filters excludes files, right? It does not work at the data level. I want to filter the Hive data by some criteria in production and then bring the filtered data into the test environment.

Thank you Bala!

@Prabhu Muthaiyan Glad that it helped. Happy Hadooping!