Member since 11-24-2017

- 76 Posts
- 8 Kudos Received
- 5 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3425 | 05-14-2018 10:28 AM |
|  | 6308 | 03-28-2018 12:19 AM |
|  | 3229 | 02-07-2018 02:54 AM |
|  | 3558 | 01-26-2018 03:41 AM |
|  | 4855 | 01-05-2018 02:06 AM |
			
    
	
		
		
12-16-2018 08:52 AM

Hi @csguna, CDH version is 5.13.2
			
    
	
		
		
12-16-2018 01:24 AM

Hi @Jerry, thank you for the reply.

If I understand correctly, you are saying that if no values are explicitly specified for mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, YARN will assign the job the minimum container memory value, yarn.scheduler.minimum-allocation-mb (1 GB in this case)?

From what I can read in the description fields in Cloudera Manager, I thought that if mapreduce.map.memory.mb and mapreduce.reduce.memory.mb are left at zero, the memory assigned to a job should be inferred from the map maximum heap size and the heap-to-container ratio. Could you please explain how this works? The sketch below shows the relationship I have in mind.
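To make sure we are talking about the same thing: the relationship I am assuming (my reading, not something I found documented) is heap ≈ container memory * mapreduce.job.heap.memory-mb.ratio. So if the container really fell back to the 1 GB scheduler minimum, that would amount to roughly this job configuration:

  <!-- Illustrative values only, assuming heap ≈ container size * 0.8 ratio -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>            <!-- container size = scheduler minimum -->
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx819m</value>        <!-- ≈ 1024 MB * 0.8 -->
  </property>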
						
					
12-14-2018 02:41 AM

Hi everyone, I have a cluster where each worker has 110 GB of RAM.

In Cloudera Manager I have configured the following YARN memory parameters:

| Parameter | Value |
|---|---|
| yarn.nodemanager.resource.memory-mb | 80 GB |
| yarn.scheduler.minimum-allocation-mb | 1 GB |
| yarn.scheduler.maximum-allocation-mb | 20 GB |
| mapreduce.map.memory.mb | 0 |
| mapreduce.reduce.memory.mb | 0 |
| yarn.app.mapreduce.am.resource.mb | 1 GB |
| mapreduce.job.heap.memory-mb.ratio | 0.8 |
| mapreduce.map.java.opts | -Djava.net.preferIPv4Stack=true |
| mapreduce.reduce.java.opts | -Djava.net.preferIPv4Stack=true |
| Map Task Maximum Heap Size | 0 |
| Reduce Task Maximum Heap Size | 0 |

One of my goals was to let YARN automatically choose the correct Java heap size for jobs, using the 0.8 ratio against the maximum container size as the upper bound (20 GB * 0.8 = 16 GB), so I have left all the heap and mapper/reducer memory settings at zero.

I have a Hive job which performs some joins between large tables. Just running the job as it is, I get a failure:

Container [pid=26783,containerID=container_1389136889967_0009_01_000002] is running beyond physical memory limits. Current usage: 2.7 GB of 2 GB physical memory used; 3.7 GB of 3 GB virtual memory used. Killing container.

If I explicitly set the memory requirements for the job in the Hive code, it completes successfully:
   
 SET mapreduce.map.memory.mb=8192;
SET mapreduce.reduce.memory.mb=16384;
SET mapreduce.map.java.opts=-Xmx6553m;
SET mapreduce.reduce.java.opts=-Xmx13106m; 
My question: why doesn't YARN automatically give this job enough memory to complete successfully? Since I have specified 20 GB as the maximum container size and 0.8 as the maximum heap ratio, I was expecting YARN to give up to 16 GB to each mapper/reducer without my having to explicitly specify these parameters (the sketch at the end of this post spells out where these numbers come from).
   
 Could someone please explain what's going on? 
   
 Thanks for any information. 
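For reference, the explicit values above are just the container sizes combined with the 0.8 ratio (8192 MB * 0.8 ≈ 6553 MB, 16384 MB * 0.8 ≈ 13106 MB); expressed as job configuration properties they would look roughly like this (an illustrative restatement, not additional settings):

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>8192</value>            <!-- map container size -->
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx6553m</value>       <!-- ≈ 8192 MB * 0.8 -->
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>16384</value>           <!-- reduce container size -->
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx13106m</value>      <!-- ≈ 16384 MB * 0.8 -->
  </property>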
   
						
					
    
	
		
		
11-26-2018 01:42 AM

Thank you very much @Harsh J!

If I understood correctly, these parameters:

- oozie.launcher.mapreduce.map.java.opts
- oozie.launcher.mapreduce.reduce.java.opts
- oozie.launcher.yarn.app.mapreduce.am.command-opts

control the maximum amount of memory allocated to the Oozie launcher. What are the equivalent parameters to control the memory allocated to the action itself (e.g. a Sqoop action), as shown in the image? See the sketch below for the kind of thing I mean.
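In other words, would setting the plain (non-launcher) MapReduce properties in the action's configuration be the right approach? Something like the following, where the values are hypothetical placeholders and I am assuming the action's own MapReduce job honours the standard mapreduce.* properties:

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>           <!-- hypothetical map container size -->
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx3276m</value>      <!-- hypothetical: ≈ 4096 MB * 0.8 -->
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>8192</value>           <!-- hypothetical reduce container size -->
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx6553m</value>      <!-- hypothetical: ≈ 8192 MB * 0.8 -->
  </property>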
						
					
    
	
		
		
11-20-2018 02:19 AM

Hi @Harsh J, thank you very much for this information (I am using Oozie server build version 4.1.0-cdh5.13.2)!

So, if I understand correctly, I need to add two properties to the Oozie action configuration: one specifying the launcher queue and one specifying the job queue. Below is a Sqoop action where I have added these two properties (the first two in the configuration block):

<action name="DLT01V_VPAXINF_IMPORT_ACTION">
   <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
         <property>
            <name>oozie.launcher.mapred.job.queue.name</name>
            <value>oozie_launcher_queue</value>
         </property>
         <property>
            <name>mapred.job.queue.name</name>
            <value>job_queue</value>
         </property>
         
         <property>
            <name>oozie.launcher.mapreduce.map.java.opts</name>
            <value>-Xmx4915m</value>
         </property>
         <property>
            <name>oozie.launcher.mapreduce.reduce.java.opts</name>
            <value>-Xmx9830m</value>
         </property>
         <property>
            <name>oozie.launcher.yarn.app.mapreduce.am.command-opts</name>
            <value>-Xmx4915m</value>
         </property>
      </configuration>
      [...]
   </sqoop>
   [...]
</action>

I have some questions:

- Do I need to define the queues "oozie_launcher_queue" and "job_queue" somewhere in CDH, or can I just use them by providing the names? If they do need to be defined, how should I define them, and are there recommended settings? (See the sketch below for what I imagine this might look like.)
- In the case of a Spark action, do I still need to specify the queue? If yes, with which property (since Spark does not use MapReduce)?
- Does it make sense to specify values for oozie.launcher.mapreduce.map.java.opts, oozie.launcher.mapreduce.reduce.java.opts and oozie.launcher.yarn.app.mapreduce.am.command-opts as I did in the example? I am asking because I've noticed in the YARN ResourceManager that the Oozie launchers take a large amount of memory (about 30 GB each); is this normal?

Thank you for the support!
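For the first question, this is roughly what I imagine a queue definition might look like, sketched as a Fair Scheduler allocation file; the resource figures and weights are placeholders I made up, not recommendations:

<?xml version="1.0"?>
<allocations>
  <queue name="oozie_launcher_queue">
    <!-- small pool reserved for Oozie launcher containers (placeholder figures) -->
    <maxResources>20480 mb, 10 vcores</maxResources>
    <weight>1.0</weight>
  </queue>
  <queue name="job_queue">
    <!-- pool for the actual action jobs (placeholder weight) -->
    <weight>3.0</weight>
  </queue>
</allocations>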
						
					
    
	
		
		
11-19-2018 02:24 AM

Hello everyone!

I have a typical scenario where multiple pipelines run on Oozie, each one with different dependencies and time schedules. These pipelines comprise different kinds of jobs (Hive, Spark, Java, etc.). Many of these jobs are heavy on memory; the cluster has a total of 840 GB of RAM, so let's say the memory is enough to complete any one of these jobs, but may not be enough to let several of them run and complete at the same time.

Sometimes a few of these jobs need to run concurrently, and in that case I've noticed a sort of starvation in YARN: none of the jobs makes progress, there are a lot of heartbeat messages in the logs, and none of them ever completes. YARN is set to use the Fair Scheduler; I would imagine that in a situation like this it should give resources to at least one of the jobs, but it seems that all the jobs are fighting for resources and YARN is not able to resolve the impasse.

I would like to know the best practices for handling this type of scenario. Do I need to define different YARN queues with different resources/priorities (currently all the jobs run in the default queue)? See the sketch below for the kind of setup I am wondering about.
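To make the question concrete, is something like the following the intended approach? This is a hypothetical Fair Scheduler allocation sketch; the queue names, weights and limits are placeholders I have not tested, with maxRunningApps used to cap how many heavy jobs run at once:

<?xml version="1.0"?>
<allocations>
  <queue name="heavy_jobs">
    <weight>2.0</weight>
    <!-- let only a couple of memory-hungry jobs run concurrently,
         so each running job can actually get enough memory to finish -->
    <maxRunningApps>2</maxRunningApps>
  </queue>
  <queue name="default">
    <weight>1.0</weight>
  </queue>
</allocations>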
						
					
Labels: Apache Oozie, Apache YARN
    
	
		
		
08-01-2018 01:05 PM

Hello everyone, I have a Spark application which runs fine with test tables but fails in production, where the tables have 200 million records and about 100 columns. From the logs the error seems related to the Snappy codec, although these tables were saved in Parquet without compression, and at write time I explicitly turned compression off with:

sqlContext.sql("SET hive.exec.compress.output=false")
sqlContext.sql("SET parquet.compression=NONE")
sqlContext.sql("SET spark.sql.parquet.compression.codec=uncompressed")        The error is the following:        2018-08-01 16:19:45,467 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler  - ShuffleMapStage 183 (saveAsTable at Model1Prep.scala:776) failed in 543.126 s due to Job aborted due to stage failure: Task 169 in stage 97.0 failed 4 times, most recent failure: Lost task 169.3 in stage 97.0 (TID 15079, prwor-e414c813.azcloud.local, executor 2): java.io.IOException: FAILED_TO_UNCOMPRESS(5)
	at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
	at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
	at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
	at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
	at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
	at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
	at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
	at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
	at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1280)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:54)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:148)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:416)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:117)
	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170)
	at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
	at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Why is this happening if compression is turned off? Could it be that compression is used anyway during the shuffle phases?

The cluster has the following characteristics:
- 2 master nodes
- 7 worker nodes

Each node has:
- CPU: 16 cores
- RAM: 110 GB
- HDFS disks: 4x1TB

These are the YARN memory settings (values in GB, except vcores):

| Parameter | Value |
|---|---|
| yarn.nodemanager.resource.memory-mb | 84 |
| yarn.scheduler.minimum-allocation-mb | 12 |
| yarn.scheduler.maximum-allocation-mb | 84 |
| mapreduce.map.memory.mb | 6 |
| mapreduce.reduce.memory.mb | 12 |
| mapreduce.map.java.opts | 4.8 |
| mapreduce.reduce.java.opts | 9.6 |
| yarn.app.mapreduce.am.resource.mb | 6 |
| yarn.app.mapreduce.am.command-opts | 4.8 |
| yarn.scheduler.maximum-allocation-vcores | 5 |

Spark on YARN settings:
- spark.shuffle.service.enabled: ENABLED
- spark.dynamicAllocation.enabled: ENABLED

Spark job submission settings:
- --driver-memory 30G
- --executor-cores 5
- --executor-memory 30G

Does anyone have any hint on why this is happening?
						
					
Labels: Apache Spark, Apache YARN
    
	
		
		
07-24-2018 06:59 AM (1 Kudo)

You can use a JSON SerDe. You have to create the table with a structure that maps the structure of the JSON. For example:

data.json:

{"X": 134, "Y": 55, "labels": ["L1", "L2"]}
{"X": 11, "Y": 166, "labels": ["L1", "L3", "L4"]}  create table  CREATE TABLE Point
(
    X INT,
    Y INT,
    labels ARRAY<STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'path/to/table';

Then upload your JSON file to the table's location path, give it the right permissions, and you are good to go.
						
					
    
	
		
		
05-23-2018 01:26 AM

Thanks, I indeed ended up using Maven and the plugins.d folder in Flume. I forgot to update the topic; thank you guys for the help!
						
					
    
	
		
		
05-14-2018 10:28 AM

Thanks @Harsh J, indeed I finally solved it by using hdfs://hanameservice for the name node and yarnrm for the job tracker.
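In workflow terms, that amounts to something like the following (assuming the HDFS HA nameservice is called hanameservice, as it is in my setup):

  <job-tracker>yarnrm</job-tracker>
  <name-node>hdfs://hanameservice</name-node>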
						
					