Member since: 06-26-2013
Posts: 416
Kudos Received: 104
Solutions: 49

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 7736 | 03-23-2016 08:06 AM |
| | 13820 | 10-12-2015 01:56 PM |
| | 4920 | 03-05-2015 11:11 AM |
| | 6147 | 02-19-2015 02:41 PM |
| | 13460 | 01-26-2015 09:55 AM |
02-06-2014
06:53 AM
1 Kudo
You should be able to reinstall the packages via yum and be back in business. Your data (at least HDFS) should not have been removed in this process. yum will remove dependent packages like that if the package you are trying to remove has other packages depending on it. I believe there are command-line options that tell it to ignore the dependencies.
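If it helps, here is a minimal sketch of the reinstall; the package names below are only examples, so substitute whatever yum actually removed on your hosts:
sudo yum clean all
# example CDH package names; replace with the ones removed on this host
sudo yum install hadoop-hdfs-datanode hadoop-0.20-mapreduce-tasktracker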
02-04-2014
07:31 AM
What process is responsible for the open sockets? Is it a local JVM, like an HBase regionserver, or some remote network IP? Can you paste a few example lines from the following output?
sudo netstat -anp | grep CLOSE_WAIT
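If the list is long, a rough one-liner like this can summarize which PID/program owns the CLOSE_WAIT sockets (it assumes netstat prints the PID/Program name as the last field, which is the usual layout with -p run as root):
sudo netstat -anp | grep CLOSE_WAIT | awk '{print $NF}' | sort | uniq -c | sort -rn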
Also, what version of CDH are you on? There can be multiple causes for an issue like the one you're seeing, and bugs of this kind have been fixed in the past, some with workarounds.
Finally, can you give us the "$JAVA_HOME/bin/java -version" output? There was a JVM bug in the ConcurrentMarkSweep GC in versions below 1.6.0_31; the workaround is to add the following JVM option to the runtime settings for the various Hadoop daemons:
-XX:-CMSConcurrentMTEnabled
Newer versions of CM add that for you automatically.
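If you manage the configuration by hand rather than through Cloudera Manager, a minimal sketch of one common place to set it (assuming hadoop-env.sh is used for daemon JVM options; adjust for hbase-env.sh/HBASE_OPTS if HBase daemons are affected):
# in hadoop-env.sh
export HADOOP_OPTS="$HADOOP_OPTS -XX:-CMSConcurrentMTEnabled"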
01-21-2014
10:00 AM
Marc,
These are difficult challenges, to be sure, but I will attempt to address your concerns here:
1) The theoretical limit on how large your replication queue can grow is the size of your HDFS storage, since the items needing replication are stored in HBase WALs on HDFS. I'm not aware of any real-world tests that have been done to prove that limit, though.
In your case it sounds like the replication queue would grow very large very rapidly, so this is something to take into consideration. For an extended outage like you describe (e.g., weeks), you might actually be better off disabling replication on the surviving cluster and starting over from scratch once you have restored the downed cluster. In other words, restore a fresh copy of your data to the new cluster and then enable replication again. Snapshots would be a good way to get a reasonably current image copied to the restored cluster. More on that below...
2) The backup and disaster recovery options I described in the blog do take "restore" into account and provide functionality for it, but they largely leave the implementation up to you. In other words, you will have to come up with a game plan that works for your environment/infrastructure and test it. As I alluded to above, I might recommend that you disable replication on the surviving cluster for an extended outage on the order of weeks.

Once a cluster is available for replication again, you could export a snapshot from your existing cluster and use bulk loads to load the data into the new cluster. This can be done on a per-table basis, so you can choose your most critical tables first and enable replication on them at the same time in order to get your data mostly in sync. Note that some data might still end up out of sync since your cluster is so active; that would require a bit of manual intervention to add/update the rows that differ. However, once they are sync'd up, replication will take over again and you'll be fine. (See the sketch of the snapshot commands after these points.)

As you indicated, the CopyTable and Export functions would not be preferable because they would put a heavy MapReduce and HBase API load on your source cluster. Snapshots do not introduce such a load during their creation, and although running "exportSnapshot" (or even using DistCp to copy the resulting snapshot to the restored cluster) will create a MapReduce job, that job will not burden HBase as much because it does not make API calls to the tables at all.
3) As previously stated, I think your best bet here is to restore the most active tables first. Since you don't want to make the data in the restored cluster available for client requests until it's all sync'd up with the surviving cluster anyway (for data accuracy's sake), there is little point in attempting to restore data in reverse order. In any case, no such functionality exists that I'm aware of.
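Here is a minimal sketch of the snapshot-based approach for one table, shown with clone_snapshot on the destination rather than an explicit bulk load; the table name, snapshot name, peer id '1', and the restored cluster's NameNode address are all illustrative placeholders:
# surviving cluster: pause replication to the downed peer, then snapshot a critical table
echo "disable_peer '1'" | hbase shell
echo "snapshot 'critical_table', 'critical_table_snap'" | hbase shell
# copy the snapshot to the restored cluster (runs a MapReduce job, no HBase API load)
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot critical_table_snap -copy-to hdfs://restored-nn:8020/hbase -mappers 16
# restored cluster: materialize the table from the snapshot, then turn replication back on
echo "clone_snapshot 'critical_table_snap', 'critical_table'" | hbase shell
echo "enable_peer '1'" | hbase shell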
For an interesting read, see how Facebook did it.
01-21-2014
09:30 AM
2 Kudos
Yes, DistCp is usually what people use for that. It has rudimentary functionality for sync'ing data between clusters, but in a very busy cluster where files are being added/deleted frequently and/or other data is changing, replicating those changes between clusters will require custom logic on top of HDFS. Facebook developed its own replication layer, but it is proprietary to their engineering department.
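A minimal sketch of a recurring DistCp sync; the NameNode hosts and paths are only placeholders, and the -update/-delete behavior is worth testing on non-critical data first:
hadoop distcp -update -delete hdfs://primary-nn:8020/data hdfs://dr-nn:8020/data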
01-20-2014
01:04 PM
3 Kudos
Cloudera Enterprise offers a backup and disaster recovery (BDR) tool that handles HDFS replication and other mechanisms like the ones you are seeking. I also wrote this blog entry on the different mechanisms available for HBase backup and disaster recovery. You didn't specify whether you are using HBase, but that might help.
Some customers set up their applications so that data is written simultaneously to two clusters; this is a cheap form of replication, with all data written to cluster A and cluster B up front. You will have to write that code yourself and also make it fault-tolerant, etc.
To answer your other questions, I would definitely recommend you have two independent clusters. One cluster spanning a WAN will not work very well, if at all.
01-19-2014
07:32 PM
Thank you for bringing these issues to our attention. I believe we are already aware of some of them and are in the process of fixing them, but just to be certain, we will reach out to you personally to ensure we fully understand all of the issues. Some of them might be specific to certain environmental conditions you are experiencing, but we definitely want to understand and resolve these user-experience problems as best we can.
01-16-2014
08:30 AM
1 Kudo
I'll take a stab at addressing these questions:
1) Yes, you will need to shut down all Hadoop services on all nodes before you perform a move like this, because HDFS will otherwise attempt to re-replicate all the data that was residing on the 10 datanodes you shut down. Since that would be half your cluster, it's likely some blocks could not be re-replicated because their only copies resided on those 10 nodes, so HDFS would go into safe mode due to under-replicated/missing blocks. No risk of data loss, just not the way you'd like to do it.
2) If you properly shut down all services before doing the move, there is no risk of data loss. Just be sure your move doesn't entail giving the machines new IP addresses/hostnames, as that is an entirely different operation requiring a careful migration process.
3) Yes.
4) As stated in my answer to #1, you will get replication churn in your cluster if you shut down individual datanodes. Cloudera Manager (Enterprise) supports a rolling restart of your services if you'd like to maximize uptime; otherwise the NameNode will try to re-replicate data if you stop even a single node, at least once a certain timeout is reached. I think you have several minutes before the blocks begin to re-replicate to other nodes (see the sketch below for the relevant settings).
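As a rough sketch (treat the numbers as assumptions to verify for your release): the NameNode considers a datanode dead after roughly 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval, which is about 10.5 minutes with stock defaults, and only then begins re-replicating its blocks. You can check the values on your cluster like this:
hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval
hdfs getconf -confKey dfs.heartbeat.interval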
01-15-2014
09:33 AM
OK, it sounds like your yum repo is messed up. So here are the steps to get yum working and also pointing to the CDH4.4.0 packages:
1) Download this repo file and place it in /etc/yum.repos.d (you should move your existing cloudera-cdh4.repo file to some other location for backup)
2) Edit the file and modify the "baseurl" property to point to this URL:
http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4.4.0/
3) Now you should be able to use these instructions to install CDH4.4.0 by running "yum install" commands manually. (Note: you'll have to scroll down the page to the section titled "Step 2: Install CDH4 with MRv1", where you should see the yum commands that will work for you.) A rough sketch of the whole sequence follows.
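# a rough sketch, assuming RHEL/CentOS 6 and MRv1; package names are examples only
sudo mv /etc/yum.repos.d/cloudera-cdh4.repo /tmp/cloudera-cdh4.repo.bak
# place the downloaded repo file in /etc/yum.repos.d and set:
#   baseurl=http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4.4.0/
sudo yum clean all
sudo yum install hadoop-hdfs-namenode                                      # on the NameNode host
sudo yum install hadoop-hdfs-datanode hadoop-0.20-mapreduce-tasktracker    # on worker hosts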
Please let me know if any other questions/concerns arise. Installing CDH manually is a little more involved, which is why the documentation is so long.
01-15-2014
09:00 AM
Thanks for the feedback, I will file a JIRA with our docs team to have this updated.
Much appreciated!