Member since: 07-31-2013

Posts: 1924
Kudos Received: 462
Solutions: 311

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1966 | 07-09-2019 12:53 AM |
|  | 11824 | 06-23-2019 08:37 PM |
|  | 9111 | 06-18-2019 11:28 PM |
|  | 10069 | 05-23-2019 08:46 PM |
|  | 4510 | 05-20-2019 01:14 AM |
05-09-2019 02:39 AM
1 Kudo
Spark running on YARN will use the temporary storage presented to it by the NodeManagers where the containers run. These directory path lists are configured via Cloudera Manager -> YARN -> Configuration -> "NodeManager Local Directories" and "NodeManager Log Directories". You can change their values to point to your new, larger volume, and Spark will stop using your root partition. FWIW, the same applies to HDFS if you use it.

Also see: https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html
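For reference, those two Cloudera Manager fields correspond to the yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs properties in yarn-site.xml. A minimal sketch of what the values might look like, assuming hypothetical mount points /data/1 and /data/2 on the new volume:

yarn.nodemanager.local-dirs = /data/1/yarn/nm,/data/2/yarn/nm
yarn.nodemanager.log-dirs   = /data/1/yarn/container-logs,/data/2/yarn/container-logs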
						
					
05-09-2019 02:09 AM
Quoted from the documentation on using Avro files at https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_avro_usage.html#topic_26_2

"""
Hive
(…)
To enable Snappy compression on output [avro] files, run the following before writing to the table:

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
"""

Please try this out. You're missing only the second property mentioned here, which appears specific to Avro serialization in Hive.

The default compression codec for Avro is deflate, which explains the behaviour you observe without it.
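To make the "before writing to the table" part concrete, here is a minimal sketch of a Hive session; the table names my_avro_table and source_table are hypothetical placeholders:

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
INSERT OVERWRITE TABLE my_avro_table SELECT * FROM source_table;

Both SET statements are session-scoped, so they need to be issued again in each new session (or set globally) before the write.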
						
					
05-09-2019 01:33 AM
Are all of your processes connecting to the same Impala Daemon, or are you using a load balancer / varying connection options? Each Impala Daemon can only accept a finite number of active client connections, which is likely what you are running into.

Typically, for concurrent access to a DB it is better to use a connection pooling pattern, with a finite set of connections shared between the threads of a single application. This avoids overloading a target server. While I haven't used it, pyodbc may support connection pooling and reuse, which you can utilise via threads in Python instead of creating separate processes.

Alternatively, spread the connections around, either by introducing a load balancer or by varying the target options for each spawned process. See https://www.cloudera.com/documentation/enterprise/latest/topics/impala_dedicated_coordinator.html and http://www.cloudera.com/documentation/other/reference-architecture/PDF/Impala-HA-with-F5-BIG-IP.pdf for further guidance and examples.
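As an illustration of the pooling pattern (not pyodbc's own pooling), here is a minimal Python sketch that hands a fixed set of connections to worker threads; the DSN name "Impala" and the query are hypothetical placeholders for your ODBC setup:

import queue
import threading
import pyodbc

POOL_SIZE = 4
pool = queue.Queue()
for _ in range(POOL_SIZE):
    # DSN name is a hypothetical placeholder for your Impala ODBC data source
    pool.put(pyodbc.connect("DSN=Impala", autocommit=True))

def run_query(sql):
    conn = pool.get()          # block until a pooled connection is free
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        pool.put(conn)         # return the connection for reuse

threads = [threading.Thread(target=run_query, args=("SELECT 1",)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()

This keeps at most POOL_SIZE open connections per application regardless of how many threads you run, which is much gentler on the daemon than one connection per process.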
						
					
05-08-2019 07:33 PM
1 Kudo
Are you looking for a sequentially growing ID or just a universally unique ID?

For the former, you can use Curator over ZooKeeper with this recipe: https://curator.apache.org/curator-recipes/distributed-atomic-long.html

For the latter, a UUID generator may suffice.

For a more 'distributed' solution, check out Twitter's Snowflake: https://github.com/twitter-archive/snowflake/tree/snowflake-2010
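If you happen to be in Python rather than Java, a minimal sketch of both options; the UUID route needs nothing external, while the counter route assumes a reachable ZooKeeper ensemble (the host and znode path below are hypothetical) and uses kazoo's Counter recipe, the Python-side analogue of the Curator recipe linked above:

import uuid
from kazoo.client import KazooClient

# Universally unique (non-sequential) ID, no coordination needed:
unique_id = uuid.uuid4().hex

# Sequentially growing ID backed by ZooKeeper:
zk = KazooClient(hosts="zkhost:2181")      # hypothetical ensemble address
zk.start()
counter = zk.Counter("/myapp/id-counter")  # znode path is arbitrary
counter += 1                               # increments the shared value stored in ZooKeeper
print(counter.value)
zk.stop()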
						
					
05-08-2019 07:15 PM
There's no 'single' query tracking in HBase because of its distributed nature (your scan range may boil down to several different regions, hosted and served independently by several different nodes).

Access to data is audited if you enable TRACE level logging on the AccessController class, or if you use the Cloudera Navigator Audit Service in your cluster. The audit information will capture the requestor and the kind of request, but not the parameters of the request.

If it is the parameters of your request (such as row ranges, filters, etc.) you're interested in, could you explain what the use-case for recording them is?
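For the logging route, a minimal sketch of the log4j entry that enables it, assuming the stock AccessController coprocessor class (add it to the RegionServer log4j.properties or the equivalent logging safety valve in Cloudera Manager):

log4j.logger.org.apache.hadoop.hbase.security.access.AccessController=TRACE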
						
					
05-08-2019 06:42 PM
1 Kudo
Running over a public IP may not be a good idea if it is open to the internet. Consider using a VPC?

That said, you can point the HBase Master and RegionServer to use the address from a specific interface name (eth0, eth1, etc.) and/or a specific DNS resolver (an IP or name that can answer a dns:// resolving call) via these advanced config properties:

hbase.master.dns.interface
hbase.master.dns.nameserver

hbase.regionserver.dns.interface
hbase.regionserver.dns.nameserver

By default the services will use whatever the host's default name and resolved address is (getent hosts $(hostname -f)) and publish this to clients.
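A minimal sketch of how two of those might be set (in hbase-site.xml, or via the HBase safety valves in Cloudera Manager), assuming the private interface is hypothetically eth1:

<property>
  <name>hbase.master.dns.interface</name>
  <value>eth1</value>
</property>
<property>
  <name>hbase.regionserver.dns.interface</name>
  <value>eth1</value>
</property>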
						
					
05-07-2019 09:58 PM
Depends on what you mean by 'storage locations'.

If you mean "can other apps use HDFS?", then the answer is yes, as HDFS is an independent system unrelated to YARN and has its own access and control mechanisms not governed by a YARN scheduler.

If you mean "can other apps use the scratch space on NM nodes?", then the answer is no, as only local containers get to use that.

If you're looking to strictly split both storage and compute, as opposed to just some form of compute, then it may be better to divide up the cluster entirely.
						
					
05-07-2019 06:25 PM
Our Isilon doc page covers some of your questions, including the differences in security features (as of posting, the Isilon solution did not support ACLs or transparent encryption, but it does support Kerberos authentication): https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_isilon_service.html

> extending an existed CDH HDFS cluster with Isilon

If by extending you mean "merging" the storage under a common namespace, that is not currently possible (in 5.x/6.x).

> using of Isilon as a backup of an existed CDH HDFS cluster

Cloudera Enterprise BDR (Backup and Disaster Recovery) features support replicating to/from Isilon in addition to HDFS, so this is doable: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_pcm_bdr.html#supported_replication_isilon
						
					
05-07-2019 06:01 PM
Could you share snippets of your CM agent logs from right after the parcel was activated and the host inspector reported the missing components/users?

The users are typically created (if they do not pre-exist) by the Cloudera Manager agent when the parcel is activated for the first time. It is possible something went wrong at that step, so having the agent logs will be helpful for troubleshooting.
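If it helps locate them, the agent log is usually found on each host at a path like the one below (assuming a default install; adjust if you have relocated the log directory):

/var/log/cloudera-scm-agent/cloudera-scm-agent.log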
						
					
05-07-2019 05:48 PM
HDFS only stores two time points in its INode data structures/persistence: the modification time and the access time [1].

For files, the mtime is effectively the time at which the file was last closed (such as when originally written and closed, or when reopened for append and closed). In general use this does not change very much for most files you'll place on HDFS, so it can serve as a "good enough" creation time.

Is there a specific use-case you have in mind that requires preservation of the original create time?

[1] https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeAttributes.java#L61-L65
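If you need to read the mtime back for a given path, the stat subcommand can print it; a minimal example with a hypothetical path:

hdfs dfs -stat "%y" /user/foo/myfile.txt

(%y prints the modification time; programmatic access to both fields is available via FileStatus#getModificationTime and FileStatus#getAccessTime in the Java API.)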
						
					