Member since: 09-29-2015
Posts: 123
Kudos Received: 216
Solutions: 47
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 9100 | 06-23-2016 06:29 PM |
| | 3134 | 06-22-2016 09:16 PM |
| | 6227 | 06-17-2016 06:07 PM |
| | 2865 | 06-16-2016 08:27 PM |
| | 6767 | 06-15-2016 06:44 PM |
02-03-2016
05:51 PM
The link to the Checkpoint Node here is not relevant to HDP or any other modern Hadoop distro AFAIK. The Checkpoint Node provided a way to generate periodic checkpoints of the NameNode metadata. It was an evolution of the SecondaryNameNode. The current architecture is to run NameNode HA using QuorumJournalManager with a redundant pair of NameNodes. In this architecture, whichever NameNode is in standby state also takes responsibility for managing checkpoints as was previously done by the SecondaryNameNode.
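If you want to influence how frequently that checkpointing happens on the standby NameNode, the relevant hdfs-site.xml properties are shown below. The values here are the usual defaults as far as I recall, so treat this as a sketch rather than a recommendation:

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
  <description>The number of seconds between two periodic checkpoints.</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>
  <description>The standby NameNode will also create a checkpoint every
  dfs.namenode.checkpoint.txns transactions, regardless of whether
  dfs.namenode.checkpoint.period has expired.</description>
</property>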
02-03-2016
05:45 PM
4 Kudos
Starting in HDP 2.3, the Hadoop shell ships with a find command. Full details are available in the FileSystemShell find documentation in Apache. However, unlike the standard Unix command, the Hadoop version does not yet implement the "maxdepth" or "type" options shown in your example. There are several uncommitted patches still in progress to add these features. HADOOP-10578 implements "maxdepth". HADOOP-10579 implements "type". These features are not yet available in any release of either HDP or Apache Hadoop.

Until these features become generally available, I think your only other option is to use wildcard glob matching as suggested in prior answers. I understand you said that there is some variability to the names because of dates and times embedded into the names. You would need to find a way to stage these files in a predictable way, so that you can effectively use a wildcard to match only the files that you want to match. This might require renaming files or moving them into a different directory structure at time of ingest.

Another possible option could be to script it externally, such as by using bash to run an ls -R command, parse the results, and then call the Hadoop shell again using only the files that you want. However, this would introduce overhead from needing to start a separate Hadoop shell process (a JVM) for each command, which might be unacceptable.
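To make the workaround more concrete, here is a rough sketch. The paths and file name patterns (/data/ingest, events_*.log) are made up for illustration, so substitute your own layout:

# glob matching with the find command that does exist in HDP 2.3
hadoop fs -find /data/ingest -name 'events_*.log' -print

# or plain glob matching with ls
hadoop fs -ls '/data/ingest/2016-06-*/events_*.log'

# or the external scripting approach: list recursively, filter, then feed
# the matching paths back into the Hadoop shell in batches
hadoop fs -ls -R /data/ingest | awk '{print $NF}' | grep 'events_.*\.log$' | \
  xargs -n 100 hadoop fs -ls

Note that the last approach still launches a new JVM for each hadoop invocation, which is the overhead mentioned above.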
01-28-2016
08:02 PM
1 Kudo
The permission denied error is coming from bash, not Hadoop. Is that the full command that you were running? If so, then you ended up trying to execute the jar file directly. Since the jar file won't have the execute bit set in its permissions, this gets reported as permission denied. Instead, you would need to run the jar through the "hadoop jar" command:

hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar
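For example, running the tests jar with no arguments should print the list of available test program names, and then you can pick one to run. The sleep job arguments below are from memory, so double-check them against the usage output first:

# list the available test programs
hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar

# run a small sleep job as a smoke test
hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  sleep -m 1 -r 1 -mt 1000 -rt 1000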
01-14-2016
06:23 PM
1 Kudo
A correctly configured HDFS client will handle the StandbyException by attempting to fail over to the other NameNode in the HA pair, and then it will reattempt the operation. It's possible that the application is misconfigured, so that it is not aware of the NameNode HA pair, and therefore the StandbyException becomes a fatal error.

I recommend reviewing the configuration properties related to NameNode HA described in this Apache documentation:

http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

In particular, note the section about "ConfiguredFailoverProxyProvider". This is the class that enables the automatic failover behavior in the client. HDP clusters that use NameNode HA will set this property.

This error appears to be coming from the metastore, so I recommend checking that the metastore is in fact running with the correct set of configuration files.
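For reference, a minimal client-side sketch of those properties looks something like the following. The nameservice name (mycluster) and hostnames are placeholders, so substitute the values from your cluster:

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

The client would also typically set fs.defaultFS to hdfs://mycluster in core-site.xml, so that paths resolve through the logical nameservice rather than a specific NameNode host.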
01-14-2016
06:16 PM
1 Kudo
The stack trace indicates the DataNode was serving a client block read operation. It attempted to write some data to the client on the socket connection, but the write timed out. This likely indicates a client-side problem, not a DataNode problem. I agree with the assessments to check networking. If you can identify the client application from the "remote" address in the stack trace, then it's also helpful to investigate any logs generated by that application.
01-11-2016
08:45 PM
2 Kudos
A background thread in the NameNode scans a replication queue and schedules work on specific DataNodes to repair under- (or over-) replicated blocks based on the items in that queue. This replication queue is populated by a different background thread that monitors the heartbeat status of every DataNode. If the heartbeat monitor thread detects that a DataNode has entered the "dead" state, then it removes its record of replicas living on that DataNode. If this causes a block to be considered under-replicated, then that block is submitted to the replication queue.

Under typical configuration, a DataNode is considered dead approximately 10 minutes after receipt of its last heartbeat at the NameNode. This is governed by the configuration properties dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval, so if these configuration properties have been tuned for some reason, then my assumption of 10 minutes no longer holds.

<property>
<name>dfs.heartbeat.interval</name>
<value>3</value>
<description>Determines datanode heartbeat interval in seconds.</description>
</property>
<property>
<name>dfs.namenode.heartbeat.recheck-interval</name>
<value>300000</value>
<description>
This time decides the interval to check for expired datanodes.
With this value and dfs.heartbeat.interval, the interval of
deciding the datanode is stale or not is also calculated.
The unit of this configuration is millisecond.
</description>
</property>

Until that time has passed, the NameNode will not queue replication work associated with that DataNode. Bottom line: barring any unusual configuration tunings, it's a race for the node to restart in less than 10 minutes. Replication work will not get queued unless the node fails to restart within that time limit. Whatever your plans for implementing this restart, I recommend testing before a full production roll-out.
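For reference, if I remember the DatanodeManager logic correctly, the dead-node threshold is derived from those two properties as:

2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
= 2 * 300000 ms + 10 * 3000 ms (dfs.heartbeat.interval of 3 seconds expressed in milliseconds)
= 630000 ms, or 10.5 minutes with the default values shown above.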
01-11-2016
07:08 PM
4 Kudos
The Hadoop RPC client is coded to re-login from the keytab automatically if it detects an RPC call has failed due to a SASL authentication failure. There is no requirement for special configuration or for the applications (Solr in this case) to write special code to trigger this re-login. I recently wrote a detailed description of this behavior on Stack Overflow.

If this is not working in your environment, and you start seeing authentication failures after a process runs 24 hours, then I recommend reviewing Apache JIRA HADOOP-10786. This was a bug that impacted the automatic re-login from keytab on certain JDK versions. On the JDK 7 line, I know the problem was introduced in JDK 1.7.0_80. On the JDK 8 line, I'm not certain which exact JDK release introduced the problem.

If after reviewing HADOOP-10786 you suspect this is the root cause, then you can fix it by either downgrading the JDK to 1.7.0_79 or upgrading Hadoop. The HADOOP-10786 patch changes the Hadoop code so that it will work correctly with all known JDK versions. For Apache Hadoop, the fix shipped in versions 2.6.1 and 2.7.0. For HDP, the fix shipped in versions 2.2.8.0 and 2.3.0.0. All subsequent versions would have the fix too.
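As a quick sanity check before digging further, it's worth confirming which JDK the affected process is actually running on and that the keytab contains the expected principal. The keytab path below is only a placeholder:

# confirm the JDK version in use by the process owner
java -version

# confirm the principals and key versions present in the keytab
klist -kt /etc/security/keytabs/solr.service.keytab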
01-06-2016
07:45 PM
4 Kudos
I recommend taking a look at Apache JIRA HDFS-6376. This issue addressed the problem of DistCp across 2 different HA clusters. The solution introduces a new configuration property, dfs.internal.nameservices. This allows you to set up configuration to differentiate between "all known nameservices" and "nameservices that this cluster's DataNodes need to report to."

<property>
<name>dfs.internal.nameservices</name>
<value></value>
<description>
Comma-separated list of nameservices that belong to this cluster.
Datanode will report to all the nameservices in this list. By default
this is set to the value of dfs.nameservices.
</description>
</property>

HDFS-6376 is included in all versions of both HDP 2.2 and HDP 2.3. It is not included in any release of the HDP 2.1 line.
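As a rough sketch, suppose the local cluster's nameservice is clusterA and the remote cluster's nameservice is clusterB (both names are made up for this example). The local hdfs-site.xml would define both nameservices but mark only the local one as internal:

<property>
  <name>dfs.nameservices</name>
  <value>clusterA,clusterB</value>
</property>
<property>
  <name>dfs.internal.nameservices</name>
  <value>clusterA</value>
</property>

You would also need the dfs.ha.namenodes.* and dfs.namenode.rpc-address.* entries for both nameservices in the client configuration, and then the copy can be run against the logical names, e.g. hadoop distcp hdfs://clusterA/src hdfs://clusterB/dst.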
12-30-2015
06:23 PM
1 Kudo
According to the stack trace, there was an IllegalArgumentException while trying to create a ThreadPoolExecutor. This is the relevant source code from the S3AFileSystem class:

int maxThreads = conf.getInt(MAX_THREADS, DEFAULT_MAX_THREADS);
int coreThreads = conf.getInt(CORE_THREADS, DEFAULT_CORE_THREADS);
if (maxThreads == 0) {
maxThreads = Runtime.getRuntime().availableProcessors() * 8;
}
if (coreThreads == 0) {
coreThreads = Runtime.getRuntime().availableProcessors() * 8;
}
long keepAliveTime = conf.getLong(KEEPALIVE_TIME, DEFAULT_KEEPALIVE_TIME);
LinkedBlockingQueue<Runnable> workQueue =
new LinkedBlockingQueue<>(maxThreads *
conf.getInt(MAX_TOTAL_TASKS, DEFAULT_MAX_TOTAL_TASKS));
threadPoolExecutor = new ThreadPoolExecutor(
coreThreads,
maxThreads,
keepAliveTime,
TimeUnit.SECONDS,
workQueue,
newDaemonThreadFactory("s3a-transfer-shared-"));
threadPoolExecutor.allowCoreThreadTimeOut(true);

The various arguments passed to the ThreadPoolExecutor are pulled from Hadoop configuration, such as the core-site.xml file. The defaults for these are defined in core-default.xml:

<property>
<name>fs.s3a.threads.max</name>
<value>256</value>
<description> Maximum number of concurrent active (part)uploads,
which each use a thread from the threadpool.</description>
</property>
<property>
<name>fs.s3a.threads.core</name>
<value>15</value>
<description>Number of core threads in the threadpool.</description>
</property>
<property>
<name>fs.s3a.threads.keepalivetime</name>
<value>60</value>
<description>Number of seconds a thread can be idle before being
terminated.</description>
</property>
<property>
<name>fs.s3a.max.total.tasks</name>
<value>1000</value>
<description>Number of (part)uploads allowed to the queue before
blocking additional uploads.</description>
</property>
Is it possible that you have overridden one of these configuration properties to an invalid value, such as a negative number?
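One quick way to verify the effective values on the node where the job runs, assuming the client picks up the same configuration directory:

hdfs getconf -confKey fs.s3a.threads.max
hdfs getconf -confKey fs.s3a.threads.core
hdfs getconf -confKey fs.s3a.max.total.tasks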
12-28-2015
09:56 PM
Yes, the NodeManager is responsible for launching application containers. The NodeManager also has the capability to monitor resource consumption by the containers it launches, and to terminate them if they exceed their resource allocation.
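The properties that control that enforcement are in yarn-site.xml. The values shown are the usual defaults as far as I recall, so treat this as a sketch:

<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
  <description>Whether physical memory limits will be enforced for containers.</description>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
  <description>Whether virtual memory limits will be enforced for containers.</description>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
  <description>Ratio of virtual memory to physical memory allowed per container.</description>
</property>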