Member since: 03-14-2016
Posts: 4721
Kudos Received: 1111
Solutions: 874
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2722 | 04-27-2020 03:48 AM |
|  | 5283 | 04-26-2020 06:18 PM |
|  | 4447 | 04-26-2020 06:05 PM |
|  | 3570 | 04-13-2020 08:53 PM |
|  | 5377 | 03-31-2020 02:10 AM |
07-17-2017
09:44 AM
@jack jack
If the NameNode is returning incorrect data, then there is not much we can do from the Ambari side. Please check the NameNode UI to see whether the problematic DataNode is listed there: http://$NAMENODE_HOST:50070/dfshealth.html#tab-datanode . If the DataNode still shows up in that list, try restarting the NameNode and see whether that clears the stale DataNode entry.
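As an additional check (not part of the original reply), the same DataNode list that the NameNode reports can also be pulled from the command line, assuming you can run commands as the hdfs user:
# su - hdfs -c "hdfs dfsadmin -report"
Search the output for the problematic DataNode's hostname to see whether the NameNode still lists it and when it last reported in ("Last contact").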
07-17-2017
07:55 AM
1 Kudo
@Anurag Mishra If you make changes to any script or configuration file (like hdfs-site.xml/core-site.xml) manually, those changes will only take effect when you also start those components manually. If you start the components from Ambari, Ambari will push the configuration stored in its own database when starting/stopping them from the Ambari UI, so your manual changes will be overwritten. Hence you should not make manual changes to the scripts or XML files on individual hosts when the cluster is managed via Ambari; otherwise your manual changes will be overwritten (gone) on the next restart of those services.
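If a property really must be changed outside the UI, a safer route is Ambari's own configs helper, so the change lands in the Ambari database. A minimal sketch, assuming the script shipped with Ambari server at /var/lib/ambari-server/resources/scripts/configs.sh (syntax can vary between Ambari versions; the Ambari host, cluster name, property, and value below are placeholders):
# /var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin set ambari.example.com MyCluster hdfs-site "dfs.datanode.max.transfer.threads" "4096"
A restart of the affected components from the Ambari UI is still needed for the new value to take effect.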
07-17-2017
03:22 AM
@andy zhou Looks like you have opened duplicate threads: https://community.hortonworks.com/questions/114193/ambari-server-cant-manager-the-servercant-stop-or-2.html#answer-114195 Can you please close one of them to avoid duplicate HCC threads?
07-16-2017
11:38 AM
@srinivas p 2017-07-15T13:50:33.501-0500: 4009.839: [Full GC (Allocation Failure) 2017-07-15T13:50:33.501-0500: 4009.840: [CMS2017-07-15T13:50:39.567-0500: 4015.905: [CMS-concurrent-mark: 12.833/12.841 secs] [Times: user=20.33 sys=5.59, real=12.84 secs] (concurrent mode failure): 14680064K->14680064K(14680064K), 39.2851287 secs] 24117247K->22948902K(24117248K), [Metaspace: 36771K->36771K(1083392K)], 39.2852865 secs] [Times: user=39.18 sys=0.04, real=39.29 secs]
2017-07-15T13:52:15.250-0500: 4111.588: [Full GC (Allocation Failure) 2017-07-15T13:52:15.250-0500: 4111.588: [CMS2017-07-15T13:52:21.412-0500: 4117.750: [CMS-concurrent-mark: 12.025/12.030 secs] [Times: user=17.74 sys=1.38, real=12.03 secs] (concurrent mode failure): 14680063K->14680063K(14680064K), 39.5266803 secs] 24117247K->23076661K(24117248K), [Metaspace: 36781K->36781K(1083392K)], 39.5268469 secs] [Times: user=39.41 sys=0.05, real=39.53 secs]
We see that out of the 24 GB heap almost all 24 GB is in use by the DataNode, and the garbage collector is barely able to reclaim 1 GB of it:
24117247K->22948902K(24117248K)
AND
24117247K->23076661K(24117248K)
This indicates that the heap size is not sufficient for the DataNode, or that the DataNode cache settings are not set appropriately.
- Can you please share the core-site.xml and hdfs-site.xml?
- Some issues have been reported for similar behavior: https://issues.apache.org/jira/browse/HDFS-11047
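For readability, the heap figures quoted above use the notation before -> after (total capacity):
24117247K -> 22948902K (24117248K) : ~23.0 GB used before the Full GC, ~21.9 GB after, so only about 1.1 GB reclaimed
24117247K -> 23076661K (24117248K) : ~23.0 GB used before the Full GC, ~22.0 GB after, so only about 1.0 GB reclaimed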
07-16-2017
09:29 AM
@srinivas p
Can you please check the following:
1. Is the "hs_err_pid" file for the DataNode being generated?
2. Was anything strange observed in "/var/log/messages" when the DataNode went down?
3. Does your OS have SAR reporting enabled? This will help us look at the historical data of events at the operating-system level and find out whether anything unusual happened (like a spike in memory usage, CPU, or I/O); see the sketch after this list. http://www.thegeekstuff.com/2011/03/sar-examples/
4. Have you recently upgraded your OS (kernel patches, etc.)?
5. Can you please share the DataNode Garbage Collection (GC) logs?
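For point 3, if the sysstat package is collecting data, a minimal sketch of pulling the historical figures for the day the DataNode died (assuming the default RHEL/CentOS location /var/log/sa, with the day of month as the file suffix, here the 15th):
# sar -r -f /var/log/sa/sa15    # memory utilization
# sar -u -f /var/log/sa/sa15    # CPU utilization
# sar -b -f /var/log/sa/sa15    # I/O and transfer rates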
07-16-2017
08:35 AM
@srinivas p Based on the following GC logging:
Detected pause in JVM or host machine (eg GC): pause of approximately 23681ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=23769ms
We see that the GC pause is very high (around 23 seconds). This can happen when the GC is not running aggressively enough: the heap keeps growing over time until it reaches 90+% of the whole DataNode heap, and only then does the (long) collection get triggered.
In this case we can make the GC run more aggressively by adding the following options to the DataNode JVM settings ("HADOOP_DATANODE_OPTS"):
Ambari UI --> HDFS --> Configs (tab) --> Advanced (child tab) --> "hadoop-env template" --> find all occurrences of "HADOOP_DATANODE_OPTS" (including the if .. else blocks) and add the following settings to both blocks: -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction : This sets the old-generation occupancy (here 60%) at which a CMS cycle is started. The Throughput Collector starts a GC cycle only when the heap is full, i.e., when there is not enough space available to store a newly allocated or promoted object. With the CMS Collector it is not advisable to wait that long, because the application keeps running (and allocating objects) during a concurrent GC. Thus, in order to finish a GC cycle before the application runs out of memory, the CMS Collector needs to start a GC cycle much earlier than the Throughput Collector.
-XX:+UseCMSInitiatingOccupancyOnly : We can use the flag -XX:+UseCMSInitiatingOccupancyOnly to instruct the JVM not to base its decision on when to start a CMS cycle on runtime statistics. Instead, when this flag is enabled, the JVM uses the value of CMSInitiatingOccupancyFraction for every CMS cycle, not just for the first one. However, keep in mind that in the majority of cases the JVM does a better job of making GC decisions than us humans. Therefore, we should use this flag only if we have a good reason (i.e., measurements) as well as really good knowledge of the lifecycle of objects generated by the application.
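As a rough sketch (not copied from an actual HDP template; the other options in your template will differ), one of the HADOOP_DATANODE_OPTS lines in the hadoop-env template would end up looking roughly like this after the two flags are appended:
export HADOOP_DATANODE_OPTS="-server -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly ${HADOOP_DATANODE_OPTS}"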
It is also recommended to set the young generation heap size (-XX:MaxNewSize) to about 1/8th of the total max heap (roughly 3 GB for the 24 GB heap above). The young generation uses the parallel collectors, and most short-lived objects are removed there before they are promoted to the old generation, so the old generation retains enough space and threads to handle the promotions.
Also, can you please check your filesystem to see whether the DataNode process is crashing and generating "hs_err_pid" files? If the DataNode JVM process is crashing for some reason, you should see an "hs_err_pid" file, and it can be helpful in understanding why the DataNode process is crashing. Your DataNode process probably has the following option enabled by default, which tells the JVM to generate a crash-dump text file in the following location (just in case it does crash): -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log
07-15-2017
05:31 PM
@Hovo Khachikyan The following seems to be the culprit of the issue; it looks like a network issue or a proxy setup issue:
2017-07-14 16:46:25,425 - Execution of '/usr/bin/yum -d 0 -e 0 -y install hadoop_2_3_2_0_2950' returned 1. Error: Cannot find a valid baseurl for repo: base Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os error was
PYCURL ERROR 7 - "Failed to connect to 2604:1580:fe02:2::10: Network is unreachable"
Can you please check whether you can "wget" the mentioned URL from the problematic host? (This verifies whether the proxy settings are working or not; otherwise the proxy settings need to be defined inside "~/.profile" or at the ENV level. The URL is quoted so that the shell does not interpret the "&" characters.)
# wget "http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os"
Also please check the "/etc/yum.conf" file to see whether the "proxy" setting is mentioned there or not:
# grep 'proxy' /etc/yum.conf
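If a proxy is required, a minimal sketch of the relevant lines in "/etc/yum.conf" (the proxy host, port, and credentials below are placeholders; the username/password lines are only needed if the proxy requires authentication):
proxy=http://proxy.example.com:3128
proxy_username=yum_user
proxy_password=yum_password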
07-15-2017
04:32 PM
@Mahendra More
If the user "mahi" does not have access to "mysql" database (Which is the default Database) then you will get the same error when using t he following command. # mysql -u mahi -p hadoop . You have two options. 1. Grant "mahi" user to access "mysql" (default Database) OR 2. You should provide database name on which the "mahi" user has access. (If you want to login to any specific database) Example: # mysql -u mahi -p mahiDatabaseName
Enter password: hadoop
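A minimal sketch of how the missing grant could be issued, run as a MySQL admin user such as root (whether you grant on the default "mysql" schema for option 1 or on a specific database such as "mahiDatabaseName" for option 2 depends on which option you pick):
# mysql -u root -p
mysql> GRANT ALL PRIVILEGES ON mahiDatabaseName.* TO 'mahi'@'localhost';
mysql> FLUSH PRIVILEGES;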
07-15-2017
01:36 PM
@Mahendra Malpute
The following error indicates that either the credentials you are using for "mahi" are not correct, or the "mahi" user has not been granted access to the database you are trying to connect to: ERROR 1045 (28000): Access denied for user 'mahi'@'localhost' (using password: YES)
Try this: GRANT ALL PRIVILEGES ON *.* TO 'mahi'@'localhost';
GRANT ALL PRIVILEGES ON *.* TO 'mahi'@'%';
GRANT ALL PRIVILEGES ON *.* TO 'mahi'@'<DATABASE_FQDN>';
FLUSH PRIVILEGES;
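To confirm the grants took effect, the privileges assigned to the user can be listed (standard MySQL statement, not from the original reply):
mysql> SHOW GRANTS FOR 'mahi'@'localhost';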
07-09-2017
07:00 AM
@Ashwini Patkar In a kerberized environment the DataNode data-transfer protocol does not use the Hadoop RPC framework, so the DataNode must authenticate itself by binding to the privileged ports specified by dfs.datanode.address and dfs.datanode.http.address. In that case you will see this PID file being used: "/var/run/hadoop/hdfs/hadoop_secure_dn.pid"
Example: jsvc.exec -Dproc_datanode -outfile /var/log/hadoop/hdfs/jsvc.out -errfile /var/log/hadoop/hdfs/jsvc.err -pidfile /var/run/hadoop/hdfs/hadoop_secure_dn.pid
You can find the "HADOOP_SECURE_DN_PID" parameter info inside the "/usr/hdp/current/hadoop-hdfs-datanode/bin/hdfs.distro" file. Please see: https://community.hortonworks.com/articles/90673/why-datanodes-have-two-processes-on-a-kerberized-c.html
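A quick way to confirm the secure DataNode (jsvc) process and its PID file on a DataNode host; the paths are the ones mentioned above, and the commands themselves are just a suggested sketch:
# cat /var/run/hadoop/hdfs/hadoop_secure_dn.pid
# ps -ef | grep jsvc | grep -v grep
# grep HADOOP_SECURE_DN_PID /usr/hdp/current/hadoop-hdfs-datanode/bin/hdfs.distro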