Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1542 | 07-09-2019 12:53 AM |
| | 9290 | 06-23-2019 08:37 PM |
| | 8049 | 06-18-2019 11:28 PM |
| | 8676 | 05-23-2019 08:46 PM |
| | 3473 | 05-20-2019 01:14 AM |
05-09-2019
02:39 AM
1 Kudo
Spark running on YARN uses the temporary storage presented to it by the NodeManagers on which its containers run. These directory path lists are configured via Cloudera Manager -> YARN -> Configuration -> "NodeManager Local Directories" and "NodeManager Log Directories". Replace these values to point at your new, larger volume and YARN will stop using your root partition. FWIW, the same applies to HDFS data directories if you use HDFS. Also see: https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html
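For illustration, the underlying YARN properties behind those two Cloudera Manager fields are yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs. A rough sketch of the equivalent yarn-site.xml entries (the /data1 and /data2 mount points are placeholders for your larger volume):

```xml
<!-- Sketch only: point NodeManager scratch and container-log space at larger mounts -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data1/yarn/nm,/data2/yarn/nm</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/data1/yarn/container-logs,/data2/yarn/container-logs</value>
</property>
```

With Cloudera Manager in place you would normally change these only through the configuration fields named above, not by editing the file directly.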
05-09-2019
02:09 AM
Quoting the documentation on using Avro files, at https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_avro_usage.html#topic_26_2:

> Hive (…) To enable Snappy compression on output [avro] files, run the following before writing to the table:
> SET hive.exec.compress.output=true;
> SET avro.output.codec=snappy;

Please try this out. You are missing only the second property mentioned here, which appears specific to Avro serialization in Hive. Avro's default compression codec is deflate, which explains the behaviour you observe without it.
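As a concrete (hypothetical) session, with an Avro-backed table named events_avro and a staging table events_staging standing in for your own tables, the full write sequence would be along these lines:

```sql
-- Hypothetical example: write Snappy-compressed Avro output from Hive
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

INSERT OVERWRITE TABLE events_avro
SELECT * FROM events_staging;
```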
05-09-2019
01:33 AM
Are all of your processes connecting to the same Impala Daemon, or are you using a load balancer / varying connection options? Each Impala Daemon can only accept a finite total number of active client connections, which is likely what you are running into.

Typically, for concurrent access to a DB it is better to use a connection pooling pattern, with a finite set of connections shared between threads of a single application. This avoids overloading the target server. While I haven't used it, pyodbc may support connection pooling and reuse, which you can utilise via threads in Python instead of creating separate processes.

Alternatively, spread the connections around, either by introducing a load balancer or by varying the target options for each spawned process. See https://www.cloudera.com/documentation/enterprise/latest/topics/impala_dedicated_coordinator.html and http://www.cloudera.com/documentation/other/reference-architecture/PDF/Impala-HA-with-F5-BIG-IP.pdf for further guidance and examples on this.
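I cannot vouch for pyodbc's built-in pooling, but a minimal sketch of the shared-pool pattern (the DSN string, query, and pool size below are placeholders) could look like this:

```python
# Minimal sketch: share a fixed number of pyodbc connections across threads
# instead of opening one connection per process. DSN and query are placeholders.
import queue
import threading

import pyodbc

POOL_SIZE = 4
CONN_STR = "DSN=impala"  # placeholder ODBC DSN, e.g. pointing at a load balancer

pool = queue.Queue()
for _ in range(POOL_SIZE):
    pool.put(pyodbc.connect(CONN_STR, autocommit=True))

def run_query(sql):
    conn = pool.get()          # blocks until a pooled connection is free
    try:
        return conn.cursor().execute(sql).fetchall()
    finally:
        pool.put(conn)         # hand the connection back for reuse

threads = [threading.Thread(target=run_query, args=("SELECT 1",)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the pattern is that only POOL_SIZE connections ever exist against the daemon, no matter how many worker threads you run.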
05-08-2019
07:33 PM
1 Kudo
Are you looking for a sequentially growing ID or just a universally unique ID?

For the former, you can use Curator over ZooKeeper with this recipe: https://curator.apache.org/curator-recipes/distributed-atomic-long.html

For the latter, a UUID generator may suffice. For a more 'distributed' solution, check out Twitter's Snowflake: https://github.com/twitter-archive/snowflake/tree/snowflake-2010
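For the UUID route specifically, a trivial sketch in Python (assuming random, non-sequential IDs are acceptable for your use-case) is below; the Curator recipe linked above covers the sequential case.

```python
# Sketch: universally unique (but non-sequential) IDs from the standard library
import uuid

row_id = uuid.uuid4()      # random 128-bit UUID, e.g. usable as a row key
print(str(row_id))         # canonical string form
print(row_id.int)          # same value as a 128-bit integer
```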
05-08-2019
07:15 PM
There's no 'single' query tracking in HBase because of its distributed nature (your scan range may span several different regions, hosted and served independently by several different nodes). Access to data is audited if you enable TRACE-level logging on the AccessController class, or if you use the Cloudera Navigator Audit Service in your cluster. The audit information will capture the requestor and the kind of request, but not the parameters of the request. If it is the request parameters (such as row ranges, filters, etc.) you are interested in, could you explain what the use-case is for recording them?
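For the TRACE-level option, a minimal sketch of the log4j setting, assuming the stock coprocessor class path org.apache.hadoop.hbase.security.access.AccessController (in a CM deployment this would go into the RegionServer logging advanced configuration snippet):

```properties
# Sketch: raise the AccessController coprocessor to TRACE so data access is logged
log4j.logger.org.apache.hadoop.hbase.security.access.AccessController=TRACE
```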
05-08-2019
06:42 PM
1 Kudo
Running over a public IP may not be a good idea if it is open to the internet; consider using a VPC. That said, you can point the HBase Master and RegionServer at the address of a specific interface name (eth0, eth1, etc.) and/or a specific DNS resolver (an IP or name that can answer a dns:// resolution call) via these advanced config properties:

- hbase.master.dns.interface
- hbase.master.dns.nameserver
- hbase.regionserver.dns.interface
- hbase.regionserver.dns.nameserver

By default the services will use the host's default name and resolved address (effectively `getent hosts $(hostname -f)`) and publish that to clients.
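As an illustration only (the interface name and nameserver address below are placeholders), the equivalent hbase-site.xml entries would look roughly like this:

```xml
<!-- Sketch: bind Master/RegionServer hostname resolution to a specific NIC and DNS server -->
<property>
  <name>hbase.master.dns.interface</name>
  <value>eth1</value>
</property>
<property>
  <name>hbase.master.dns.nameserver</name>
  <value>10.0.0.2</value>
</property>
<property>
  <name>hbase.regionserver.dns.interface</name>
  <value>eth1</value>
</property>
<property>
  <name>hbase.regionserver.dns.nameserver</name>
  <value>10.0.0.2</value>
</property>
```

In a Cloudera Manager deployment you would place these in the HBase advanced configuration snippet (safety valve) for hbase-site.xml rather than editing the file directly.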
05-07-2019
09:58 PM
Depends on what you mean by 'storage locations'. If you mean "can other apps use HDFS?" then the answer is yes, as HDFS is an independent system unrelated to YARN and has its own access and control mechanisms not governed by a YARN scheduler. If you mean "can other apps use the scratch space on NM nodes" then the answer is no, as only local containers get to use that. If you're looking to strictly split both storage and compute, as opposed to just some form of compute, then it may be better to divide up the cluster entirely.
05-07-2019
06:25 PM
Our Isilon doc page covers some of your asks, including the differences in security features (as of this posting, the Isilon solution does not support ACLs or transparent encryption, though it does support Kerberos authentication): https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_isilon_service.html

> extending an existed CDH HDFS cluster with Isilon

If by extending you mean "merging" the storage under a common namespace, that is not currently possible (in 5.x/6.x).

> using of Isilon as a backup of an existed CDH HDFS cluster

Cloudera Enterprise BDR (Backup and Disaster Recovery) supports replicating to/from Isilon in addition to HDFS, so this is doable: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_pcm_bdr.html#supported_replication_isilon
05-07-2019
06:01 PM
Could you share snippets of your CM agent logs from right after the parcel was activated and the host inspector reported the missing components/users? The users are typically created (if they do not already exist) by the Cloudera Manager agent when the parcel is activated for the first time. Something may have gone wrong at that step, so the agent logs will help troubleshoot it.
05-07-2019
05:48 PM
HDFS only stores two time points in its INode data structures/persistence: the modification time and the access time [1]. For files, the mtime is effectively the time when the file was last closed (such as when originally written and closed, or when reopened for append and closed). In general use, this does not change very much for most files you'll place on HDFS and can serve as a "good enough" creation time. Is there a specific use-case you have in mind that requires preservation of the original create time?

[1] https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeAttributes.java#L61-L65
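As a quick way to inspect the stored modification time (the path below is a placeholder), the stat subcommand can print it:

```sh
# Sketch: print the mtime HDFS keeps for a file; /user/alice/data.csv is a placeholder path
hdfs dfs -stat "%y" /user/alice/data.csv
```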