Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2244 | 07-09-2019 12:53 AM |
|  | 12768 | 06-23-2019 08:37 PM |
|  | 9841 | 06-18-2019 11:28 PM |
|  | 10828 | 05-23-2019 08:46 PM |
|  | 5096 | 05-20-2019 01:14 AM |
04-09-2019
10:00 PM
1 Kudo
To add on: if you will not require audits or lineage at all for your cluster, you can also choose to disable their creation:

- Impala → Configuration → "Enable Impala Lineage Generation" (uncheck)
- Impala → Configuration → "Enable Impala Audit Event Generation" (uncheck)

If you are using Navigator with Cloudera Enterprise, then these audit and lineage files should be sent automatically to the Navigator services. If they are not coming through, it may indicate a problem in the pipeline - please raise a support case if so.
04-03-2019
06:53 PM
Is the job submitted to the source cluster, or the destination? A DistCp job should only need to contact the NodeManagers of the cluster it runs on, but if the job is submitted from a remote cluster, then those ports may need to be opened. The HDFS transfer part does not involve YARN service communication at all, so it is not expected to contact a NodeManager. It would be helpful if you could share some more logs leading up to the observed failure.
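As a concrete sketch (the host names and paths below are placeholders, not values from this thread), a pull-style DistCp run from an edge node of the destination cluster keeps the YARN-side traffic local to that cluster:

```shell
# Placeholders - substitute your own NameNode addresses and paths.
SRC="hdfs://nn-source.example.com:8020/data/events"
DST="hdfs://nn-dest.example.com:8020/backups/events"

# -update copies only files that changed; -p preserves file attributes.
CMD="hadoop distcp -update -p $SRC $DST"
echo "$CMD"   # run this on an edge node of the destination cluster
```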
04-01-2019
06:55 PM
1 Kudo
Could you share the full log from this failure, both from the Oozie server for the action ID and from the action's launcher-job map task? Port 8042 is the NodeManager HTTP port, used to serve logs of live containers, among other status details, over REST. It is not directly used by DistCp, but MapReduce and Oozie diagnostics may invoke it in response to a failure, so it is a secondary symptom. Note, though, that running DistCp via Oozie requires you to provide configs that ensure delegation tokens for both kerberized clusters are acquired. Use "mapreduce.job.hdfs-servers" with a value such as "hdfs://namenode-cluster-1,hdfs://namenode-cluster-2" to influence the Oozie server's delegation token acquisition phase. This is only relevant if you use Kerberos on both clusters.
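For illustration only - this is a minimal sketch, and the cluster names simply reuse the placeholder values above - the property can be set directly inside an Oozie distcp action:

```xml
<!-- Sketch of an Oozie distcp action (workflow.xml fragment). -->
<action name="copy-data">
  <distcp xmlns="uri:oozie:distcp-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <!-- Makes the Oozie server acquire HDFS delegation tokens
             for BOTH kerberized clusters before launching the job. -->
        <name>mapreduce.job.hdfs-servers</name>
        <value>hdfs://namenode-cluster-1,hdfs://namenode-cluster-2</value>
      </property>
    </configuration>
    <arg>hdfs://namenode-cluster-1/source/path</arg>
    <arg>hdfs://namenode-cluster-2/target/path</arg>
  </distcp>
</action>
```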
03-07-2019
06:23 PM
1 Kudo
You'll need to use lsof with a PID specifier (lsof -p PID). The PID must be your target RegionServer's Java process (find it via 'ps aux | grep regionserver' or similar). In the output, you should be able to classify the items as network (sockets), filesystem (files), etc., and the interest would be in whichever holds the highest share. For example, if you see a lot more sockets hanging around, check their state (CLOSE_WAIT, etc.); if it is local filesystem files, investigate whether those files appear relevant. If you can pastebin your lsof result somewhere, I can take a look.
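A minimal sketch of that classification step (the pgrep pattern is an assumption about a typical HBase deployment - adjust it for yours):

```shell
# Find the RegionServer's Java process (pattern is an assumption).
RS_PID=$(pgrep -f HRegionServer | head -n1)

# Summarize open descriptors by lsof's TYPE column (5th field):
# IPv4/IPv6 = sockets, REG = regular files, DIR = directories, etc.
lsof -p "$RS_PID" \
  | awk 'NR > 1 {print $5}' \
  | sort | uniq -c | sort -rn
# A dominant IPv4/IPv6 count points at sockets (inspect their state);
# a dominant REG count points at local files such as store files.
```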
03-06-2019
11:42 PM
1 Kudo
MapReduce jobs can be submitted with ease, since mostly all they require is the correct configuration on the classpath (such as under src/main/resources for Maven projects). Spark/PySpark relies heavily on its script tooling to submit to a remote cluster, so achieving this is a little more involved. IntelliJ IDEA has a remote execution option in its run targets that can be configured to copy over the built jar and invoke an arbitrary command on an edge host; this can perhaps be combined with remote debugging to get an experience equal to MapReduce. Another option is to use a web-based editor such as CDSW.
03-06-2019
07:30 PM
1 Kudo
> Can we deploy the HttpFS role on more than one node? Is it best practice?

Yes. The HttpFS service is an endpoint for REST API access to HDFS, so you can deploy multiple instances and also consider load balancing them (you might need sticky sessions for paged data reads).

> We can see that new logs are created under /opt/hadoop/dfs/nn/current on the active NameNode on node01, but no new files on the standby NameNode on node02 - is it OK?

Yes, this is normal. That redundant local copy of the edit logs is written only by the active NameNode; at all times, the edits are primarily written into the JournalNode directories.
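For illustration, a sketch of a WebHDFS call going through HttpFS (the host and user are placeholders; 14000 is the usual HttpFS port):

```shell
# With a load balancer in front of several HttpFS instances, clients
# keep one stable URL regardless of which instance serves the request.
HTTPFS="http://httpfs-lb.example.com:14000"
HDFS_USER="hdfs"

URL="$HTTPFS/webhdfs/v1/tmp?op=LISTSTATUS&user.name=$HDFS_USER"
echo "$URL"
# curl -s "$URL"   # returns a JSON FileStatuses listing of /tmp
```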
03-06-2019
06:33 PM
It is not normal to see the file descriptor limit run out, or come close to the limit, unless you have an overload problem of some form. I'd recommend checking via 'lsof' what the major contributor to the FD count of your RegionServer process is - chances are it is avoidable (a bug, a flawed client, etc.). The number should be proportional to your total region store file count and the number of connecting clients. While the article at https://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/ focuses on DataNode data transceiver threads in particular, the formulae at the end can be applied similarly to file descriptors in general too.
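A back-of-envelope sketch in the same spirit as the article's formulae - every input number below is an illustrative assumption, not a measurement from any real cluster:

```shell
# Rough FD estimate for a RegionServer; all inputs are assumptions.
REGIONS=200
STOREFILES_PER_REGION=5   # each open store file holds a descriptor
CLIENTS=100               # connected client sockets
OVERHEAD=300              # jars, log files, internal sockets, etc.

FDS=$(( REGIONS * STOREFILES_PER_REGION + CLIENTS + OVERHEAD ))
echo "$FDS"   # ~1400 here, far below a typical raised ulimit of 32768
```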
03-06-2019
05:20 PM
The issue appears to crop up when distributing certain configuration files in preparation for installing packages. Could you check, or share, what the failure is via the log files present under /tmp/scm_prepare_node.*/*?
03-06-2019
04:30 PM
Currently, Hive's connections to LDAP do not support the StartTLS extension [1]. This does make sense as a feature request, however - could you please log your request over at https://issues.apache.org/jira/projects/HIVE?

[1] - https://github.com/apache/hive/blob/master/service/src/java/org/apache/hive/service/auth/ldap/LdapSearchFactory.java#L52-L62
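In the meantime, a common workaround is LDAPS (TLS negotiated from the first byte, usually on port 636) rather than StartTLS. A hive-site.xml sketch, with a placeholder host and base DN:

```xml
<!-- Sketch: point HiveServer2's LDAP authentication at an ldaps:// URL.
     Host name and base DN below are placeholders. -->
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldaps://ldap.example.com:636</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>ou=people,dc=example,dc=com</value>
</property>
```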
03-05-2019
09:46 PM
1 Kudo
> Clear Cache
> This is the one I am not too sure what happens

It appears to clear the cached entries within the Hue frontend, so the metadata for the assist panel and views is loaded again from its source (Impala, etc.). I don't see it calling a refresh on the tables, but it is possible I missed some implicit action.

> Perform Incremental Metadata Update
> I assume this issues a refresh command for all tables within the current database that is being viewed? If no database is viewed, does it do it for everything?

This compares the HMS listing against Impala's for the database in context, and runs a targeted "INVALIDATE METADATA [db.]table;" for each one missing in Impala. Yes, if no database is in context, it equates to running a global "INVALIDATE METADATA;".

> Invalidate All Metadata and Rebuild Index

This runs a plain "INVALIDATE METADATA;".
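For reference, hedged sketches of the statements involved (the database and table names are placeholders):

```sql
-- Scoped invalidation of a single table Impala has not yet seen:
INVALIDATE METADATA my_db.new_table;

-- Global invalidation, as run by "Invalidate All Metadata and Rebuild
-- Index"; expensive on large catalogs, since all metadata is reloaded
-- lazily on next access:
INVALIDATE METADATA;

-- REFRESH is the lighter-weight alternative when a known table's
-- files changed (e.g. partitions or files added outside Impala):
REFRESH my_db.existing_table;
```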