Member since: 05-22-2018
Posts: 40
Kudos Received: 0
Solutions: 0
09-17-2018
04:26 PM
I did try to rebalance, but it did not copy data to the new nodes.
09-17-2018
04:15 PM
Right, but that would not ensure all of the data is moved off the decommissioned nodes and onto the live nodes, right?
09-17-2018
01:32 PM
I thought decommissioning all of them would automatically copy the data to the commissioned nodes, but that did not happen. I currently have 4 DataNodes in the cluster. I need to remove all 4 of them and add 3 new ones instead, and of course I need to preserve the data. What's the way forward?
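For reference, here is a minimal sketch of the usual decommission flow, assuming the cluster is set up with a dfs.hosts.exclude file (the file path and hostname below are just examples; use whatever your NameNode configuration points to, or the Decommission action in Cloudera Manager):

# Add each of the 4 old DataNode hostnames to the HDFS exclude file (example path)
echo "old-worker-1.example.internal" >> /etc/hadoop/conf/dfs.exclude

# Ask the NameNode to re-read its include/exclude lists and begin decommissioning
hdfs dfsadmin -refreshNodes

# Watch progress: the old nodes should move from "Decommission in progress" to "Decommissioned"
hdfs dfsadmin -report

The important part is that the 3 new DataNodes are added and healthy before the decommission starts, so the NameNode has somewhere to re-replicate the blocks; the balancer on its own does not move data off decommissioned nodes. Only stop and remove the old DataNodes once all 4 show as Decommissioned.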
Tags:
- data-nodes
09-12-2018
12:48 PM
No configuration was changed when I started getting these alerts:
NameNode Connectivity: This DataNode is not connected to one or more of its NameNode(s).
Web server status: The Cloudera Manager Agent is not able to communicate with this role's web server.
So the DataNode is not connected to one or more of its NameNodes, and the Cloudera Manager Agent is also not getting a response from this role's web server. This is what the log looks like:
dwh-worker-4.c.abc-1225.internal ERROR September 12, 2018 5:33 PM DataNode
dwh-worker-4.c.abc-1225.internal:50010:DataXceiver error processing WRITE_BLOCK operation src: /172.31.10.74:44280 dst: /172.31.10.74:50010
java.io.IOException: Not ready to serve the block pool, BP-1423177047-172.31.4.192-1492091038346.
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAndWaitForBP(DataXceiver.java:1290)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAccess(DataXceiver.java:1298)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:630)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:745)
The DataNodes are also randomly exiting:
dwh-worker-1.c.abc-1225.internal:50010:DataXceiver error processing WRITE_BLOCK operation src: /172.31.10.74:49848 dst: /172.31.4.147:50010
java.io.IOException: Not ready to serve the block pool, BP-1423177047-172.31.4.192-1492091038346.
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAndWaitForBP(DataXceiver.java:1290)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.checkAccess(DataXceiver.java:1298)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:630)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
at java.lang.Thread.run(Thread.java:745)
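As a first pass at narrowing this down, a hedged sketch of checks worth running (the /dfs/nn and /dfs/dn paths are Cloudera Manager defaults and may differ on your cluster):

# Which DataNodes does the NameNode currently consider live/dead?
hdfs dfsadmin -report

# Compare the clusterID recorded on the affected DataNode with the NameNode's;
# a mismatch is one possible cause of "Not ready to serve the block pool"
cat /dfs/dn/current/VERSION
cat /dfs/nn/current/VERSION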
Tags:
- datanodes
- Hadoop Core
Labels:
- Apache Hadoop
09-04-2018
05:26 AM
Also, I can see that I have over 100 connections on the port right now. Where can I remove or raise the limit on the number of allowed connections?
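A small sketch of how I'd look at this, assuming the limit in question is HiveServer2's Thrift worker-thread cap (the property name is the standard one, but verify it for your CDH version and set it via hive-site.xml or the HS2 safety valve):

# Count current client connections to HiveServer2 (default binary port 10000)
netstat -an | grep ':10000' | grep ESTABLISHED | wc -l

# The ceiling on concurrent Thrift connections is normally governed by
# hive.server2.thrift.max.worker.threads
grep -A1 'hive.server2.thrift.max.worker.threads' /etc/hive/conf/hive-site.xml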
09-04-2018
05:19 AM
That's probably the case. I don't see any errors in the HS2 logs, but I do see a jump in HiveServer2's open connections. It would be great if you could share how you resolved it.
09-01-2018
06:51 AM
We recently started using Tableau and allowed Tableau Online to access our Hive server. Since then, for about two hours every morning (10 to 12), our Hive queries fail with:
Connecting to jdbc:hive2://ip-xxx-xx-x-xxx.ap-south-1.compute.internal:10000/default
Unknown HS2 problem when communicating with Thrift server.
Error: Could not open client transport with JDBC Uri: jdbc:hive2://ip-xxx-xx-x-xxx.ap-south-1.compute.internal:10000/default: java.net.SocketException: Connection reset (state=08S01,code=0)
No current connection
Intercepting System.exit(2)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.Hive2Main], exit code [2]
If I try to connect manually, I can connect fine:
beeline -u jdbc:hive2://xxx-xx-x-xxx:10000/
I then have to remove Tableau's whitelisted IP, and after about 10 minutes the server comes back up. We do not have a lot of queries coming from Tableau. What could be the issue? I have removed the connection limit on ZooKeeper just in case it had anything to do with it. Any pointers?
Labels:
- Apache Hive
08-27-2018
09:38 AM
The public DNS of my master node changed. Now Oozie workflows will not trigger, and I cannot submit new coordinators. ec2-12-123-123-12.ap-south-1.compute.amazonaws.com is the old public DNS. The error is:
Failed to create deployment directory: HTTPConnectionPool(host='ec2-12-123-123-12.ap-south-1.compute.amazonaws.com', port=50070): Max retries exceeded with url: /webhdfs/v1/user/hue/oozie/deployments/_%24USER_-oozie-%24JOBID-%24TIME?op=GETFILESTATUS&user.name=hue&doas=hue (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fc5bcde5ed0>: Failed to establish a new connection: [Errno 110] Connection timed out',))
Please suggest what values need to be changed.
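Since the failure is Hue's WebHDFS call going to the old hostname, a hedged starting point is simply to find every config that still references it (paths below are typical CDH locations and may differ):

# Locate configs that still point at the old public DNS name
grep -rl 'ec2-12-123-123-12.ap-south-1.compute.amazonaws.com' /etc/hue/conf /etc/hadoop/conf /etc/oozie/conf 2>/dev/null

# In hue.ini the relevant setting is usually webhdfs_url under [hadoop] -> [[hdfs_clusters]]
grep -n 'webhdfs_url' /etc/hue/conf/hue.ini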
Labels:
- Apache Oozie
08-25-2018
06:12 PM
I cannot format the NameNode since this is my production cluster, but the NameNode is not starting because of this error. Please suggest what needs to be done to fix this safely.
org.apache.hadoop.hdfs.server.namenode.FSNamesystem Encountered exception loading fsimage
java.io.IOException: NameNode is not formatted.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:222)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1097)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:779)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
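"NameNode is not formatted" generally means the NameNode cannot find a valid fsimage in any of its dfs.namenode.name.dir locations, so before anything else it is worth confirming the metadata directory is still present, mounted, and readable. A hedged sketch, using the Cloudera Manager default path /dfs/nn as an example:

# The directory should contain fsimage_*, edits_* and VERSION files
ls -l /dfs/nn/current/ | head
cat /dfs/nn/current/VERSION

# If it is empty or missing, check whether the underlying volume is actually mounted
df -h /dfs/nn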
Tags:
- namenode
Labels:
- Apache Hadoop
07-02-2018
02:17 PM
My Sqoop jobs and Hive queries are randomly getting killed. All I get in the job logs is: Diagnostics: Application killed by a user. I know for sure that these jobs are not being killed by anyone. My RM is running in HA mode, and all my services are up and running without any warnings. I don't think it's a memory issue, since my servers have available memory and it happens even when very few jobs are running. Please help.
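One way to narrow this down is to find out which user and host actually issued the kill: the ResourceManager writes an audit line for every kill request. A rough sketch (the application id is a placeholder and the log path is a typical CDH location):

# Final status and diagnostics for one of the killed applications
yarn application -status application_1234567890123_0001

# Look for the corresponding kill request in the ResourceManager log; the audit line
# includes USER= and IP= fields identifying who sent it
grep -i 'Kill Application Request' /var/log/hadoop-yarn/*RESOURCEMANAGER*.log* | tail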
Labels:
- Apache Hive
- Apache Sqoop
- Apache YARN
06-11-2018
01:00 AM
Alright. But could a higher NameNode heap size possibly result in NodeManager exits? @Geoffrey Shelton Okot
06-10-2018
04:20 PM
I don't see an hs_err_pid log on the host; I checked using sudo find -name *hs_err_pid* @tsokorai The NodeManager log gives no clue at the end, nothing at all.
06-10-2018
03:58 PM
NameNode heap size is 5 GB and DataNode heap size is 2 GB. JVM options for the DataNode: -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled. JVM options for the NameNode: -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled. @Geoffrey Shelton Okot
06-10-2018
03:55 PM
I had to create the myid file and the data directory manually, and the service finally started on my host machines. However, I am still getting: Bad: Canary test failed to create an ephemeral znode. I also had to change the ownership of the /var/lib/zookeeper folder to the zookeeper user; it was previously owned by root. I have a feeling this error is also caused by a permission issue. How do I fix this?
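A minimal sketch of the ownership fix plus a quick check that a client can do roughly what the canary does, assuming the stock zookeeper user and dataDir (adjust to your install; the /canary_test path is just a throwaway example):

# ZooKeeper's data directory (myid, version-2/) should be owned by the user the server runs as
chown -R zookeeper:zookeeper /var/lib/zookeeper
ls -l /var/lib/zookeeper

# Verify a client can create and delete a znode, which is roughly what the canary test exercises
zookeeper-client -server localhost:2181 create /canary_test test
zookeeper-client -server localhost:2181 delete /canary_test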
Labels:
- Apache Zookeeper
06-10-2018
10:22 AM
About all the NodeManagers going down: on restarting the NodeManagers, I realized they were picking up a lot of containers from the yarn-nm-recovery directory, so I got rid of that folder. Now my NodeManagers are not all down, but they still run into continuous exits and I seem to have no way to debug this. I have allocated 2 GB of heap space and I can see they do not need more than a GB. The only thing that looks like it could be a problem is the number of waiting Java threads: about 40-50, with another 50-60 threads running at a time.
06-10-2018
09:10 AM
9 worker nodes. These only have HDFS and NodeManagers installed on them. These shutdowns are the result of continuous exits by the NodeManagers; however, I am not able to understand why my NodeManagers keep exiting. They run into unexpected exits even when there are only a handful of jobs running, and it keeps happening throughout the day. I have tried looking through the logs but I am not seeing any errors there. Any help would be really great. @Geoffrey Shelton Okot
06-10-2018
05:41 AM
My ResourceManagers are active and so is the JobHistory Server. All my worker nodes had been exiting randomly for some time but used to restart automatically; today, all my NodeManagers are down. What could be the reason? My worker nodes are typical, with HDFS and YARN on them. HDFS is running fine. What does it indicate when all the NodeManagers are down? There was no unusual load on the servers. Also, if I restart them, they go down again. Please suggest what could cause this.
Tags:
- node-manager
Labels:
- Apache YARN
- HDFS
06-07-2018
02:47 AM
Is anybody active here at all?
06-05-2018
12:14 PM
I am experiencing slightly odd behaviour from the NodeManagers and ResourceManager. I have about 9-10 NodeManagers in my cluster, and HA mode is enabled for the ResourceManager. Of the two nodes the RM runs on, whenever the active RM is on node1, my NodeManagers keep exiting; this behaviour is rare when the RM is active on node2. Given this, I would simply remove the RM from node1 and install it on another node, but as soon as I do, my Oozie jobs start getting killed with "JobTracker is not whitelisted on the Oozie server". Is that because my JobHistory Server is installed on node1, or something else? Please suggest. I don't see out-of-memory errors in the logs, so I am not sure what's wrong with the RM running on node1. I do have a lot of services like Hue, Oozie, Sentry, Hive, and Sqoop running on node1, none of which show any problems.
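On the "JobTracker is not whitelisted" part: Oozie only accepts JobTracker/ResourceManager and NameNode addresses listed in its whitelist properties, so moving the RM means the new address has to be allowed as well. A hedged sketch of what to check (property names are the standard Oozie ones; on CDH they are usually set via the Oozie safety valve rather than edited by hand):

# Which RM and NameNode endpoints does Oozie currently allow?
grep -A1 'oozie.service.HadoopAccessorService.jobTracker.whitelist' /etc/oozie/conf/oozie-site.xml
grep -A1 'oozie.service.HadoopAccessorService.nameNode.whitelist' /etc/oozie/conf/oozie-site.xml

# An empty value means any address is allowed; otherwise the new RM host:port has to be
# added and the Oozie server restarted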
Labels:
- Cloudera Manager
06-05-2018
07:34 AM
Here are a few things I notice randomly:
1. My Hive queries are really slow at times.
2. A bunch of NodeManagers exit at the same time.
My worker nodes only have YARN and HDFS on them, and my memory is not overcommitted either. The RM is running on a node that has sufficient memory and does not have YARN on it. My nodes are m4.4xlarge instances and I can see they are not being used to full capacity, yet my jobs are slow and get stuck a lot of the time. What could be the issue? My jobs are run through Oozie and I have allocated 2 GB to it. The NM and SNN have 5 GB of memory. All my services are in a healthy state except for the NodeManagers running into unexpected exits, and I am not able to find out why that is happening. Help.
Labels:
- Apache YARN
05-30-2018
06:53 AM
I have been running into container exit code 137, and this is what I got from one of my Sqoop job logs:
18/05/30 11:49:45 INFO mapreduce.Job: Running job: job_1527499476017_12588
18/05/30 11:49:52 INFO mapreduce.Job: Job job_1527499476017_12588 running in uber mode : false
18/05/30 11:49:52 INFO mapreduce.Job: map 0% reduce 0%
18/05/30 11:50:04 INFO ipc.Client: Retrying connect to server: ip-172-31-4-147.ap-south-1.compute.internal/172.31.4.147:43852. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
18/05/30 11:50:05 INFO ipc.Client: Retrying connect to server: ip-172-31-4-147.ap-south-1.compute.internal/172.31.4.147:43852. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
18/05/30 11:50:06 INFO ipc.Client: Retrying connect to server: ip-172-31-4-147.ap-south-1.compute.internal/172.31.4.147:43852. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
18/05/30 11:51:04 INFO mapreduce.Job: Task Id : attempt_1527499476017_12588_m_000009_1000, Status : FAILED
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
18/05/30 11:51:04 INFO mapreduce.Job: Task Id : attempt_1527499476017_12588_m_000008_1000, Status : FAILED
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
18/05/30 11:52:06 INFO mapreduce.Job: map 20% reduce 0%
What does this say about the error? My services are up and running and there have been no unexpected exits either.
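Exit code 137 is 128 + 9, i.e. the container process received SIGKILL from outside the JVM, which matches the "Killed by external signal" line. To get the container-side detail, pulling the aggregated logs for that application is usually the quickest next step; a sketch using the job id from the log above (log aggregation must be enabled):

# All container logs for the failed job
yarn logs -applicationId application_1527499476017_12588 | less

# If nothing useful shows up, check the OS log on the node that ran the attempt
# for the kernel OOM killer
dmesg | grep -i 'killed process'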
Labels:
- Apache Sqoop
- Apache YARN
05-30-2018
02:21 AM
I started getting this error randomly. I am not sure I understand it, and my services look healthy. Any pointers?
Failing Oozie Launcher, Hadoop job Id mismatch, action file [hdfs://ip-172-31-4-192.ap-south-1.compute.internal:8020/user/hue/oozie-oozi/0000615-180528145319004-oozie-oozi-W/sqoop-9885--sqoop/0000615-180528145319004-oozie-oozi-W@sqoop-9885@0] declares Id [null] current Id [job_1527499476017_10822]
Launcher config error Hadoop job Id mismatch, action file [hdfs://ip-172-31-4-192.ap-south-1.compute.internal:8020/user/hue/oozie-oozi/0000615-180528145319004-oozie-oozi-W/sqoop-9885--sqoop/0000615-180528145319004-oozie-oozi-W@sqoop-9885@0] declares Id [null] current Id [job_1527499476017_10822]
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://ip-172-31-4-192.ap-south-1.compute.internal:8020/user/hue/oozie-oozi/0000615-180528145319004-oozie-oozi-W/sqoop-9885--sqoop/action-data.seq
Successfully reset security manager from null to null
Oozie Launcher ends
Tags:
- hadoop
- Hadoop Core
Labels:
- Apache Hadoop
05-28-2018
06:25 AM
I am not using Spark. Both Hive and Sqoop jobs were getting killed. I increased the number of attempts to 5 and the Sqoop jobs are fine now, but the Hive jobs are still getting stuck. Also, instead of the 137 error, all my NodeManagers are now running into unexpected exits. I can see about 181 timed-waiting threads in the ResourceManager, but JVM heap memory usage looks fine.
05-28-2018
05:55 AM
I have 9 NodeManagers in my cluster and all of them are randomly exiting. I can see that the number of Java threads is suddenly quite high, which was not the case before. What could cause all the NodeManagers to exit randomly? To reiterate: this is not specific to one or two NodeManagers.
05-25-2018
05:20 PM
I am using the --warehouse-dir argument to load data into HDFS before Sqoop puts it into Hive, and I am running all my Sqoop jobs through Oozie. If a task fails for some reason it is reattempted, and the problem is that the warehouse directory created by the previous attempt is still there, so the re-attempt fails with "output directory already exists". I understand I could use the --direct argument to skip the intermediate HDFS loading step, but I also need the --hive-drop-import-delims argument, and that combination is not supported. Advice, please? It's important.
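One common workaround, rather than switching to --direct, is to make the retry idempotent by deleting the intermediate directory before each attempt; in an Oozie workflow that can go in a <prepare>/<delete> element on the Sqoop action, or as an explicit cleanup step beforehand. A hedged sketch (the HDFS path is a placeholder for whatever --warehouse-dir points to):

# Remove leftover staging output from a previous failed attempt
# (-f keeps the command from failing when the directory does not exist)
hdfs dfs -rm -r -f -skipTrash /user/etl/sqoop_warehouse/<table_name>

# ...then run the Sqoop import with --warehouse-dir pointing at the same path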
Labels:
- Apache Sqoop
05-25-2018
01:22 PM
@Harald Berghoff: Thank you for your response. I feel like I am in deep trouble and really need some help here. I have checked dmesg and it has not recorded any killed processes. We have all our jobs scheduled through Oozie and we depend heavily on scheduled jobs. RAM on the worker nodes, right? My worker nodes have 64 GB of RAM and I can see free memory on the nodes. From the ResourceManager I can see vCores getting used up before memory; the cluster has 225 GB of memory and 54 vCores. For hosts I am using m4.4xlarge instances. I can share my YARN configuration if you would like. Is there a way I can get some professional help here? I am okay with paid support for this issue.
05-25-2018
12:53 AM
Container exited with a non-zero exit code 137 Killed by external signal
This error randomly kills Hive and Sqoop jobs. Is there anyone here who is willing to help? I have been trying to get an answer but no luck so far. As for checking logs: I have gone through the container logs, ResourceManager logs, and service-specific logs, and there is really nothing that points out why this error would be happening. I am using m4.4xlarge instances from AWS, with:
yarn.nodemanager.resource.memory-mb: 50 GiB
Java Heap Size of ResourceManager in Bytes: 2 GB
yarn.scheduler.maximum-allocation-mb: 25 GB
Java Heap Size of NodeManager in Bytes: 2 GB
yarn.nodemanager.resource.cpu-vcores: 14
yarn.scheduler.maximum-allocation-vcores: 8
The yarn.nodemanager.resource.cpu-vcores and yarn.scheduler.maximum-allocation-vcores values are different because I have NodeManager groups, and some of the instances are m4.2xlarge, which have 8 CPUs available to the NodeManager; I have therefore taken the minimum of the two for yarn.scheduler.maximum-allocation-vcores. Please suggest if there is something off in my configuration. This error happens randomly, even when there are not a lot of jobs running.
Labels:
- Apache YARN