Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1542 | 07-09-2019 12:53 AM |
| | 9290 | 06-23-2019 08:37 PM |
| | 8049 | 06-18-2019 11:28 PM |
| | 8676 | 05-23-2019 08:46 PM |
| | 3473 | 05-20-2019 01:14 AM |
05-09-2019
02:39 AM
1 Kudo
Spark running on YARN uses the temporary storage presented to it by the NodeManagers on which its containers run. These directory path lists are configured via Cloudera Manager -> YARN -> Configuration -> "NodeManager Local Directories" and "NodeManager Log Directories". Replace these values to point at your new, larger volume and YARN will stop using your root partition. FWIW, the same applies to HDFS data directories if you use HDFS. Also see: https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html
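For illustration, the underlying YARN properties behind those two Cloudera Manager fields are yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs. A rough sketch of the equivalent yarn-site.xml entries (the /data1 and /data2 mount points are placeholders for your larger volume):

```xml
<!-- Sketch only: point NodeManager scratch and container-log space at larger mounts -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data1/yarn/nm,/data2/yarn/nm</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/data1/yarn/container-logs,/data2/yarn/container-logs</value>
</property>
```

With Cloudera Manager in place you would normally change these only through the configuration fields named above, not by editing the file directly.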
05-09-2019
02:09 AM
Quoting the documentation on using Avro files, at https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_avro_usage.html#topic_26_2:

> Hive (…) To enable Snappy compression on output [avro] files, run the following before writing to the table:
> SET hive.exec.compress.output=true;
> SET avro.output.codec=snappy;

Please try this out. You are missing only the second property mentioned here, which appears specific to Avro serialization in Hive. Avro's default compression codec is deflate, which explains the behaviour you observe without it.
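As a concrete (hypothetical) session, with an Avro-backed table named events_avro and a staging table events_staging standing in for your own tables, the full write sequence would be along these lines:

```sql
-- Hypothetical example: write Snappy-compressed Avro output from Hive
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

INSERT OVERWRITE TABLE events_avro
SELECT * FROM events_staging;
```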
05-09-2019
01:33 AM
Are all of your processes connecting to the same Impala Daemon, or are you using a load balancer / varying connection options? Each Impala Daemon can only accept a finite total number of active client connections, which is likely what you are running into.

Typically, for concurrent access to a DB it is better to use a connection pooling pattern, with a finite set of connections shared between threads of a single application. This avoids overloading the target server. While I haven't used it, pyodbc may support connection pooling and reuse, which you can utilise via threads in Python instead of creating separate processes.

Alternatively, spread the connections around, either by introducing a load balancer or by varying the target options for each spawned process. See https://www.cloudera.com/documentation/enterprise/latest/topics/impala_dedicated_coordinator.html and http://www.cloudera.com/documentation/other/reference-architecture/PDF/Impala-HA-with-F5-BIG-IP.pdf for further guidance and examples on this.
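I cannot vouch for pyodbc's built-in pooling, but a minimal sketch of the shared-pool pattern (the DSN string, query, and pool size below are placeholders) could look like this:

```python
# Minimal sketch: share a fixed number of pyodbc connections across threads
# instead of opening one connection per process. DSN and query are placeholders.
import queue
import threading

import pyodbc

POOL_SIZE = 4
CONN_STR = "DSN=impala"  # placeholder ODBC DSN, e.g. pointing at a load balancer

pool = queue.Queue()
for _ in range(POOL_SIZE):
    pool.put(pyodbc.connect(CONN_STR, autocommit=True))

def run_query(sql):
    conn = pool.get()          # blocks until a pooled connection is free
    try:
        return conn.cursor().execute(sql).fetchall()
    finally:
        pool.put(conn)         # hand the connection back for reuse

threads = [threading.Thread(target=run_query, args=("SELECT 1",)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the pattern is that only POOL_SIZE connections ever exist against the daemon, no matter how many worker threads you run.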
05-08-2019
07:33 PM
1 Kudo
Are you looking for a sequentially growing ID or just a universally unique ID?

For the former, you can use Curator over ZooKeeper with this recipe: https://curator.apache.org/curator-recipes/distributed-atomic-long.html

For the latter, a UUID generator may suffice. For a more 'distributed' solution, check out Twitter's Snowflake: https://github.com/twitter-archive/snowflake/tree/snowflake-2010
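For the UUID route specifically, a trivial sketch in Python (assuming random, non-sequential IDs are acceptable for your use-case) is below; the Curator recipe linked above covers the sequential case.

```python
# Sketch: universally unique (but non-sequential) IDs from the standard library
import uuid

row_id = uuid.uuid4()      # random 128-bit UUID, e.g. usable as a row key
print(str(row_id))         # canonical string form
print(row_id.int)          # same value as a 128-bit integer
```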
05-08-2019
07:15 PM
There's no 'single' query tracking in HBase because of its distributed nature (your scan range may span several different regions, hosted and served independently by several different nodes). Access to data is audited if you enable TRACE-level logging on the AccessController class, or if you use the Cloudera Navigator Audit Service in your cluster. The audit information will capture the requestor and the kind of request, but not the parameters of the request. If it is the request parameters (such as row ranges, filters, etc.) you are interested in, could you explain what the use-case is for recording them?
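For the TRACE-level option, a minimal sketch of the log4j setting, assuming the stock coprocessor class path org.apache.hadoop.hbase.security.access.AccessController (in a CM deployment this would go into the RegionServer logging advanced configuration snippet):

```properties
# Sketch: raise the AccessController coprocessor to TRACE so data access is logged
log4j.logger.org.apache.hadoop.hbase.security.access.AccessController=TRACE
```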
05-08-2019
06:42 PM
1 Kudo
Running over a public IP may not be a good idea if it is open to the internet; consider using a VPC. That said, you can point the HBase Master and RegionServer at the address of a specific interface name (eth0, eth1, etc.) and/or a specific DNS resolver (an IP or name that can answer a dns:// resolution call) via these advanced config properties:

- hbase.master.dns.interface
- hbase.master.dns.nameserver
- hbase.regionserver.dns.interface
- hbase.regionserver.dns.nameserver

By default the services will use the host's default name and resolved address (effectively `getent hosts $(hostname -f)`) and publish that to clients.
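As an illustration only (the interface name and nameserver address below are placeholders), the equivalent hbase-site.xml entries would look roughly like this:

```xml
<!-- Sketch: bind Master/RegionServer hostname resolution to a specific NIC and DNS server -->
<property>
  <name>hbase.master.dns.interface</name>
  <value>eth1</value>
</property>
<property>
  <name>hbase.master.dns.nameserver</name>
  <value>10.0.0.2</value>
</property>
<property>
  <name>hbase.regionserver.dns.interface</name>
  <value>eth1</value>
</property>
<property>
  <name>hbase.regionserver.dns.nameserver</name>
  <value>10.0.0.2</value>
</property>
```

In a Cloudera Manager deployment you would place these in the HBase advanced configuration snippet (safety valve) for hbase-site.xml rather than editing the file directly.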
05-07-2019
09:58 PM
Depends on what you mean by 'storage locations'. If you mean "can other apps use HDFS?" then the answer is yes, as HDFS is an independent system unrelated to YARN and has its own access and control mechanisms not governed by a YARN scheduler. If you mean "can other apps use the scratch space on NM nodes" then the answer is no, as only local containers get to use that. If you're looking to strictly split both storage and compute, as opposed to just some form of compute, then it may be better to divide up the cluster entirely.
05-07-2019
06:25 PM
Our Isilon doc page covers some of your asks, including the differences in security features (as of this posting, the Isilon solution does not support ACLs or transparent encryption, though it does support Kerberos authentication): https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_isilon_service.html

> extending an existed CDH HDFS cluster with Isilon

If by extending you mean "merging" the storage under a common namespace, that is not currently possible (in 5.x/6.x).

> using of Isilon as a backup of an existed CDH HDFS cluster

Cloudera Enterprise BDR (Backup and Disaster Recovery) supports replicating to/from Isilon in addition to HDFS, so this is doable: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_pcm_bdr.html#supported_replication_isilon
05-07-2019
06:01 PM
Could you share snippets of your CM agent logs from right after the parcel was activated and the host inspector reported the missing components/users? The users are typically created (if they do not already exist) by the Cloudera Manager agent when the parcel is activated for the first time. Something may have gone wrong at that step, so the agent logs will help troubleshoot it.
05-07-2019
05:48 PM
HDFS only stores two time points in its INode data structures/persistence: the modification time and the access time [1]. For files, the mtime is effectively the time when the file was last closed (such as when originally written and closed, or when reopened for append and closed). In general use, this does not change very much for most files you'll place on HDFS and can serve as a "good enough" creation time. Is there a specific use-case you have in mind that requires preservation of the original create time?

[1] https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeAttributes.java#L61-L65
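As a quick way to inspect the stored modification time (the path below is a placeholder), the stat subcommand can print it:

```sh
# Sketch: print the mtime HDFS keeps for a file; /user/alice/data.csv is a placeholder path
hdfs dfs -stat "%y" /user/alice/data.csv
```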