Member since: 04-13-2017
Posts: 46
Kudos Received: 4
Solutions: 3
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 13874 | 01-11-2019 06:26 AM |
 | 8916 | 11-13-2017 11:31 AM |
 | 79986 | 11-13-2017 11:27 AM |
11-03-2017
07:16 AM
Just thought of two quick things to add to the discussion. First, I cross-posted this issue on StackOverflow late yesterday, before receiving any responses on this thread. I will update both posts with the solution to prevent any duplicated effort. (That thread has not received any responses so far.) Second, in the screenshot I noticed that the 'Federation and High Availability' section has an item that controls 'Automatic Failover', and in my case it says it is not enabled. This sheds some light on why my cluster is still down despite all the HA documentation mentioning the automatic failover feature. Should I just try clicking 'Enable' for Automatic Failover? (I have made sure to back up everything in the current/ dir of the other namenode.)
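In case it is useful to anyone later: by "back up" I just mean an archive copy of the current/ dirs on the healthy namenode, roughly along these lines (the destination path is only an example):

sudo tar -czf /tmp/nn-data0-current.tar.gz -C /data0/dfs/nn current   # archive of /data0/dfs/nn/current
sudo tar -czf /tmp/nn-data1-current.tar.gz -C /data1/dfs/nn current   # archive of /data1/dfs/nn/current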
11-03-2017
07:10 AM
Thank you for your response. I think the other namenode may be started, but things are such a mess that I can't be sure. I've attached a screenshot from CM. This is after starting ZooKeeper and HDFS. (I didn't attempt to start the entire cluster this time, but I'm pretty sure the result is the same.) The first line seems to show the other namenode as 'Started' (as well as the data nodes). However, if I go to that node and attempt to run any `hdfs` commands, here is what I get:

ubuntu@ip-10-0-0-154:~$ hdfs dfs -ls /
17/11/03 14:20:15 WARN retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB after 1 fail over attempts. Trying to fail over after sleeping for 1126ms.
17/11/03 14:20:16 WARN retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB after 2 fail over attempts. Trying to fail over after sleeping for 1373ms.
17/11/03 14:20:17 WARN retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB after 3 fail over attempts. Trying to fail over after sleeping for 4470ms.
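For what it's worth, the checks I plan to try next on each namenode host are along these lines (nn1/nn2 are placeholders; the real IDs are whatever dfs.ha.namenodes.<nameservice> lists in hdfs-site.xml):

ps aux | grep -i namenode             # is a NameNode process actually running on this host?
hdfs haadmin -getServiceState nn1     # should report 'active' or 'standby' if that NN is up
hdfs haadmin -getServiceState nn2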
11-02-2017
09:30 AM
After attempting a large "insert as select" operation, I returned this morning to find that the query had failed and I could not issue any commands to my cluster (e.g. hdfs dfs -df -h). When logging into CM, I noticed that most nodes had a health issue related to "clock offset". At this point, I am only concerned about trying to recover the data on HDFS. I am happy to build a new cluster (given that I am on CDH4, anyway) and migrate the data to that new cluster.

I tried to restart the cluster but the start-up step failed. Specifically, it failed to start the HDFS service and reported this error in Log Details:

Exception in namenode join
java.io.IOException: Cannot start an HA namenode with name dirs that need recovery. Dir: Storage Directory /data0/dfs/nn state: NOT_FORMATTED
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:295)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:207)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:741)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:531)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:403)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:445)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:621)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:606)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1177)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1241)

Below are some more details that I have gathered about the situation. I am running CDH4. There are two namenodes in the cluster: one reporting the error above, and another which reports:

Unable to trigger a roll of the active NN
java.net.ConnectException: Call From ip-10-0-0-154.ec2.internal/10.0.0.154 to ip-10-0-0-157.ec2.internal:8022 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

If I log into the first namenode, the one with the initial error, and look at the namenode directories, they are completely empty:

ubuntu@ip-10-0-0-157:~$ sudo ls -a /data0/dfs/nn/
. ..
ubuntu@ip-10-0-0-157:~$ sudo ls -a /data1/dfs/nn/
. ..

If I log into the other namenode, it has data in those directories:

ubuntu@ip-10-0-0-154:~$ sudo ls -lah /data0/dfs/nn/
total 12K
drwx------ 3 hdfs hadoop 4.0K Nov 2 22:20 .
drwxr-xr-x 3 root root 4.0K Jun 6 2015 ..
drwxr-xr-x 2 hdfs hdfs 4.0K Nov 2 09:49 current
ubuntu@ip-10-0-0-154:~$ sudo ls -lah /data1/dfs/nn/
total 12K
drwx------ 3 hdfs hadoop 4.0K Nov 2 22:20 .
drwxr-xr-x 3 root root 4.0K Jun 6 2015 ..
drwxr-xr-x 2 hdfs hdfs 4.0K Nov 2 09:49 current
ubuntu@ip-10-0-0-154:~$ sudo ls -lah /data0/dfs/nn/current
total 13M
drwxr-xr-x 2 hdfs hdfs 4.0K Nov 2 09:49 .
drwx------ 3 hdfs hadoop 4.0K Nov 2 22:20 ..
-rw-r--r-- 1 hdfs hdfs 697 Jun 6 2015 edits_0000000000000000001-0000000000000000013
-rw-r--r-- 1 hdfs hdfs 1.0M Jun 6 2015 edits_0000000000000000014-0000000000000000913
-rw-r--r-- 1 hdfs hdfs 549 Jun 6 2015 edits_0000000000000000914-0000000000000000923
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000000924-0000000000000000937
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000000938-0000000000000000951
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000000952-0000000000000000965
-rw-r--r-- 1 hdfs hdfs 1.8K Jun 6 2015 edits_0000000000000000966-0000000000000000987
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000000988-0000000000000001001
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001002-0000000000000001015
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001016-0000000000000001029
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001030-0000000000000001043
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001044-0000000000000001057
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001058-0000000000000001071
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001072-0000000000000001085
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001086-0000000000000001099
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001100-0000000000000001113
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001114-0000000000000001127
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001128-0000000000000001141
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001142-0000000000000001155
-rw-r--r-- 1 hdfs hdfs 1.3K Jun 6 2015 edits_0000000000000001156-0000000000000001169
-rw-r--r-- 1 hdfs hdfs 1.0M Jun 6 2015 edits_inprogress_0000000000000001170
-rw-r--r-- 1 hdfs hdfs 5.1M Nov 2 08:49 fsimage_0000000000024545561
-rw-r--r-- 1 hdfs hdfs 62 Nov 2 08:49 fsimage_0000000000024545561.md5
-rw-r--r-- 1 hdfs hdfs 5.1M Nov 2 09:49 fsimage_0000000000024545645
-rw-r--r-- 1 hdfs hdfs 62 Nov 2 09:49 fsimage_0000000000024545645.md5
-rw-r--r-- 1 hdfs hdfs 5 Jun 6 2015 seen_txid
-rw-r--r-- 1 hdfs hdfs 170 Nov 2 09:49 VERSION
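To be clear about what I am considering (not something I have run yet, just my understanding of the usual re-seeding path): once one namenode is healthy and active again, the namenode with the empty dirs should be able to pull the namespace back from it, roughly like this:

# on the namenode with the empty name dirs (ip-10-0-0-157), with its NameNode role stopped in CM
sudo -u hdfs hdfs namenode -bootstrapStandby   # copies the fsimage from the active NN into the empty name dirs
# then start that NameNode role again from CM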
Labels:
- Cloudera Manager
- HDFS
06-28-2017
08:27 PM
Yes, thank you. That did the trick. So, basically, my procedure was as follows.

1. Add the Cloudera repository containing the Hadoop binaries:
sudo vim /etc/apt/sources.list.d/cloudera-manager.list
2. Install the binaries. The hadoop-client package was enough:
sudo apt-get install hadoop-client
3. Install Java. This made the error about $JAVA_HOME go away even though I still don't have that env variable set:
sudo apt-get install openjdk-6-jre-headless
4. Copy the config files from a functioning cluster node:
sudo scp user@cluster:/etc/hadoop/conf/* /etc/hadoop/conf
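For anyone following the same steps later, a quick sanity check of the result (assuming the copied configs point at your namenodes):

hdfs dfs -ls /      # should list the cluster's root directory instead of the JAVA_HOME error
hdfs dfs -df -h     # capacity report from the remote cluster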
06-27-2017
07:26 AM
Sorry. I'm really, really new at Hadoop and Cloudera. I actually have no idea where any of the config files are on the cluster nodes. Can you give me the paths to the config files or tell me where I can find them? Or maybe you could tell me the filenames and I'll search with `find`. Whatever is easiest...
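For example, something like this is what I had in mind (guessing that the usual Hadoop config file names apply):

sudo find / -name core-site.xml 2>/dev/null
sudo find / -name hdfs-site.xml 2>/dev/null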
06-26-2017
07:14 AM
@mbigelow this is really exciting. Thanks for following up on this thread. I am way back on CM 4.8.5 and CDH 4. Nevertheless, I downloaded and installed the repos similar to what you mention in step two. Step three is a little foggy for me. Can you elaborate on "updated the configs manually"? Specifically, which configs should I copy over? Thanks again. This would be a huge win for me if this works.
06-16-2017
04:28 PM
Thanks a bunch! That's kind of what I feared. Glad I didn't go down too much of a rabbit hole yet. For what it's worth, getting `impala-shell` working is really easy: I just added the Cloudera repository, installed it with apt-get, and then put the IP address of an Impala node into the commands. I was hoping for something similar for hdfs. Oh well. I'll look into what you said.
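Concretely, that was just the following (the address is a placeholder for one of my Impala daemons):

sudo apt-get install impala-shell
impala-shell -i <impala-node-ip>    # -i points the shell at a remote impalad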
06-16-2017
02:42 PM
I should refine my question. Part of what prompted this is that I noticed on the nodes of the cluster that $JAVA_HOME is not defined either. That makes me think that there might be certain configuration files on the cluster nodes that maybe I can copy over to my development environment.
06-16-2017
12:33 PM
I would like to be able to perform hdfs commands from a computer that is NOT actually part of the Cloudera cluster. For example, performing simple put/get operations or:

hdfs dfs -ls /my/dir

I have installed the correct binaries, I think. I found from CM that I was using CDH 4.7.1, so I installed the binaries from here (sudo apt-get install hadoop-client). If I run:

hdfs dfs -ls /

I get:

Error: JAVA_HOME is not set and could not be found.

I feel that this might just be the beginning of a long tinkering and configuring process, and I unfortunately know nothing about Java. I do, however, know the IPs of the namenodes on my cluster and have all the admin rights I need from beginning to end. Can someone help me get things configured?

P.S. In case it wasn't clear: I can perform all desired functionality on nodes that are part of the cluster. I just want to do something similar from my development environment.
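If it matters, the kind of thing I was expecting to have to do is set JAVA_HOME by hand, e.g. (the JDK path below is only a guess at a typical Ubuntu OpenJDK location, not something I have verified):

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64   # guessed path; use wherever your JDK actually lives
hdfs dfs -ls /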
Labels:
- HDFS
04-18-2017
02:16 PM
Thank you, saranvisa. I think that basically we are a very light system, effectively a 1-to-n relationship because we have so few users right now. I checked the memory usage of all the nodes in the cluster and it is around 2 GB for virtually all of them, except for the one running Cloudera Manager, which is at 10 GB. Here is a screenshot showing about 90% of the nodes. I am not sure why they say Unknown Health; I think there are errors in the DNS checks. Does everything else look OK to you? Thank you so much for your help and quick response. I really, really appreciate it!