Hi all, a nooby question here!
I'm facing a challenge and was wondering if I could get some advice here. I have two Hadoop clusters in my organisation: one runs version 2.6.x (hdp2) and the other runs version 3.0.x (hdp3). I want to migrate from the 2.6.x cluster to the 3.0.x cluster in small steps, to avoid downtime for our clients as much as possible. One of the reasons for the migration is to lower the storage overhead by using erasure coding (EC) in Hadoop 3.
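To put a rough number on the expected saving, here is a small sketch. The 100 TB data size, the replication factor of 3 and the RS-6-3 policy are illustrative assumptions only, not measurements from our clusters.

```shell
#!/bin/sh
# Illustrative figures only (assumptions, not real cluster values).
DATA_TB=100        # logical data size
REPLICATION=3      # replication factor on the old cluster (assumed default)
EC_DATA=6          # data units in an RS-6-3 policy (assumed policy)
EC_PARITY=3        # parity units in an RS-6-3 policy

# Raw capacity needed with plain replication: size * replication factor.
raw_replicated=$(( DATA_TB * REPLICATION ))
# Raw capacity needed with erasure coding: size * (data + parity) / data.
raw_ec=$(( DATA_TB * (EC_DATA + EC_PARITY) / EC_DATA ))

echo "replicated:    ${raw_replicated} TB raw"
echo "erasure coded: ${raw_ec} TB raw"
```

Under these assumptions the same data needs half the raw capacity with EC.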
First we want to move data older than X months. But we still want to be able to read that data from the services (Hive etc.) running in the old cluster. Both clusters are kerberized and each can access the other's contents. With this step we face the following issues:
EC is not as transparent as we thought: we have to run distcp from the hdp3 cluster to move data into an EC folder. Running distcp from the hdp2 cluster gives all kinds of failures (failed to close file, next block not found, could not read block x, not enough replicas, etc.). But after copying with the newer distcp, we can no longer read the data from the hdp2 cluster without all kinds of errors related to blocks and block pools.
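A sketch of the kind of copy described above, run from the hdp3 side. The NameNode address, the paths and the RS-6-3-1024k policy name are placeholders, not our real values, and the use of -skipcrccheck is an assumption (checksums of replicated and striped files may not match). The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholders (hypothetical values, replace with real ones).
SRC="hdfs://hdp2-nn.example.com:8020/data/archive"   # source on the old cluster
DST="/data/archive_ec"                               # target folder on hdp3
POLICY="RS-6-3-1024k"                                # assumed EC policy
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Make sure the policy is enabled, then create the folder and attach the policy.
run hdfs ec -enablePolicy -policy "$POLICY"
run hdfs dfs -mkdir -p "$DST"
run hdfs ec -setPolicy -path "$DST" -policy "$POLICY"

# Copy from the old cluster into the EC folder.
run hadoop distcp -update -skipcrccheck "$SRC" "$DST"

# Confirm which policy the target folder ended up with.
run hdfs ec -getPolicy -path "$DST"
```

Files written into the folder after -setPolicy pick up the EC policy; files that were already there keep their original layout.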
Is there any other way to achieve this? Or is a full upgrade of the hdp2 cluster really necessary to move data properly between both clusters? We want to avoid that because of the cluster size and the downtime involved. Or is this just a misconfiguration, and do some parameters need tuning to make an HDFS EC folder compatible with HDFS clients from the hdp2 cluster?
What we have observed so far:
hdfs dfs -ls from hdp2 on an hdp3 EC folder works.
hdfs dfs -put from hdp2 on an hdp3 EC folder works, but ignores EC and writes the file with a replication factor instead.
hdfs dfs -get from hdp2 on an hdp3 EC folder does not work because of block not found errors.
A Hive select query from hdp2 works on a dataset located in an hdp3 non-EC folder, but does not work when the dataset is located in an EC folder.
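For anyone who wants to reproduce the HDFS checks, they look roughly like this when run from an hdp2 node. The NameNode address, folder and file names are hypothetical placeholders; the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholders (hypothetical values, replace with real ones).
HDP3_NN="hdfs://hdp3-nn.example.com:8020"
EC_DIR="$HDP3_NN/data/archive_ec"     # EC-enabled folder on hdp3
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Listing the EC folder (works for us).
run hdfs dfs -ls "$EC_DIR"
# Writing into the EC folder (works, but the file ends up replicated).
run hdfs dfs -put /tmp/sample.txt "$EC_DIR/"
# Reading an EC file back (fails for us with block not found errors).
run hdfs dfs -get "$EC_DIR/ec_file.txt" /tmp/ec_file_copy.txt
```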
The issues occur only with EC folders. Non-EC folders work across clusters for all services (Hive, MapReduce, etc.) without any issues. As hdp2 is our main cluster, we want to keep using services like Hive and MapReduce on this cluster while reading data from hdp3, which would be an EC-enabled, storage-only cluster. All of these problems can be solved by also introducing YARN, MapReduce, Hive, etc. on the new cluster, but we want to avoid that in the short term as much as possible, since most of our compute servers are in the hdp2 cluster.
Thanks in advance!