Member since
11-24-2017
76
Posts
8
Kudos Received
5
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2904 | 05-14-2018 10:28 AM
 | 5344 | 03-28-2018 12:19 AM
 | 2608 | 02-07-2018 02:54 AM
 | 3077 | 01-26-2018 03:41 AM
 | 4449 | 01-05-2018 02:06 AM
12-16-2018
08:52 AM
Hi @csguna, CDH version is 5.13.2
... View more
12-16-2018
01:24 AM
Hi @Jerry, thank you for the reply. If I understand correctly, you are saying that if no values are explicitly specified for mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, YARN will assign to the job the minimum container memory value, yarn.scheduler.minimum-allocation-mb (1 GB in this case)? From what I can read in the description fields in Cloudera Manager, I thought that if the values for mapreduce.map.memory.mb and mapreduce.reduce.memory.mb are left at zero, the memory assigned to a job should be inferred from the map maximum heap and the heap-to-container ratio. Could you please explain how this works?
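To make my question more concrete, this is the behaviour I am assuming from the Cloudera Manager descriptions (just my reading, not verified against the actual code; the fallback to the scheduler minimum is the part I am unsure about):
// Sketch of the inference I have in mind; the values mirror the CM properties named above.
val heapRatio       = 0.8   // mapreduce.job.heap.memory-mb.ratio
val mapMaxHeapMb    = 0     // Map Task Maximum Heap Size (left at 0 here)
val minAllocationMb = 1024  // yarn.scheduler.minimum-allocation-mb
val mapContainerMb =
  if (mapMaxHeapMb > 0) math.ceil(mapMaxHeapMb / heapRatio).toInt  // container derived from heap / ratio
  else minAllocationMb                                             // no heap configured: does YARN fall back to the minimum?
println(s"inferred map container size: $mapContainerMb MB")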
... View more
12-14-2018
02:41 AM
Hi everyone, I have a cluster where each worker has 110 GB of RAM.
In Cloudera Manager I've configured the following YARN memory parameters:
yarn.nodemanager.resource.memory-mb = 80 GB
yarn.scheduler.minimum-allocation-mb = 1 GB
yarn.scheduler.maximum-allocation-mb = 20 GB
mapreduce.map.memory.mb = 0
mapreduce.reduce.memory.mb = 0
yarn.app.mapreduce.am.resource.mb = 1 GB
mapreduce.job.heap.memory-mb.ratio = 0.8
mapreduce.map.java.opts = -Djava.net.preferIPv4Stack=true
mapreduce.reduce.java.opts = -Djava.net.preferIPv4Stack=true
Map Task Maximum Heap Size = 0
Reduce Task Maximum Heap Size = 0
One of my goals was to let YARN automatically choose the correct Java heap size for the jobs, using the 0.8 ratio as the upper bound (20 GB * 0.8 = 16 GB), so I left all the heap and mapper/reducer memory settings at zero.
I have a Hive job which performs some joins between large tables. Running the job as it is, I get a failure:
Container [pid=26783,containerID=container_1389136889967_0009_01_000002] is running beyond physical memory limits. Current usage: 2.7 GB of 2 GB physical memory used; 3.7 GB of 3 GB virtual memory used. Killing container.
If I explicitly set the memory requirements for the job in the Hive code, it completes successfully:
SET mapreduce.map.memory.mb=8192;
SET mapreduce.reduce.memory.mb=16384;
SET mapreduce.map.java.opts=-Xmx6553m;
SET mapreduce.reduce.java.opts=-Xmx13106m;
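(For reference, the -Xmx values above are just the container sizes scaled by the 0.8 heap ratio; a quick sanity check of the arithmetic:)
val heapRatio = 0.8
println((8192  * heapRatio).toInt)  // 6553  -> -Xmx6553m for the mappers
println((16384 * heapRatio).toInt)  // 13107 -> roughly the -Xmx13106m used for the reducers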
My question: why doesn't YARN automatically give this job enough memory to complete successfully?
Since I have specified 20 GB as the maximum container size and 0.8 as the maximum heap ratio, I was expecting YARN to be able to give up to 16 GB to each mapper/reducer without me having to explicitly specify these parameters.
Could someone please explain what's going on?
Thanks for any information.
... View more
11-26-2018
01:42 AM
Thank you very much @Harsh J! If I understood correctly, these parameters: oozie.launcher.mapreduce.map.java.opts, oozie.launcher.mapreduce.reduce.java.opts and oozie.launcher.yarn.app.mapreduce.am.command-opts control the maximum amount of memory allocated for the Oozie launcher. What are the equivalent parameters to control the memory allocated for the action itself instead (e.g. a Sqoop action), as shown in the image?
... View more
11-20-2018
02:19 AM
Hi @Harsh J, thank you very much for this information (I am using Oozie server build version 4.1.0-cdh5.13.2)! So if I understand correctly, I need to add two properties in the Oozie action's configuration, one specifying the launcher queue and one specifying the job queue. Below is a Sqoop action where I have added these two properties (the two queue properties at the top of the configuration):
<action name="DLT01V_VPAXINF_IMPORT_ACTION">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>oozie.launcher.mapred.job.queue.name</name>
<value>oozie_launcher_queue</value>
</property>
<property>
<name>mapred.job.queue.name</name>
<value>job_queue</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx4915m</value>
</property>
<property>
<name>oozie.launcher.mapreduce.reduce.java.opts</name>
<value>-Xmx9830m</value>
</property>
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx4915m</value>
</property>
</configuration>
[...]
</sqoop>
[...]
</action>
I have some questions:
1. Do I need to define the queues "oozie_launcher_queue" and "job_queue" somewhere in CDH, or can I use them just by providing the names? If they must be defined, how should I define these queues? Are there recommended settings?
2. In the case of a Spark action, do I still need to specify the queue? If yes, with which property (since Spark does not use MapReduce)?
3. Does it make sense to specify values for oozie.launcher.mapreduce.map.java.opts, oozie.launcher.mapreduce.reduce.java.opts and oozie.launcher.yarn.app.mapreduce.am.command-opts as I did in the example? I am asking because I've noticed in the YARN ResourceManager that the Oozie launchers take a large amount of memory (about 30 GB each); is this normal?
Thank you for the support!
... View more
11-19-2018
02:24 AM
Hello everyone! I have a typical scenario where multiple pipelines run on Oozie, each one with different dependencies and time schedules. These pipelines comprise different kinds of jobs: Hive, Spark, Java, etc. Many of these jobs are heavy on memory. The cluster has a total of 840 GB of RAM, so let's say the memory is enough to complete any one of these jobs, but may not be enough to let several of them run and complete at the same time. Sometimes a few of these jobs need to run concurrently, and in this case I've noticed a sort of starvation in YARN: none of the jobs makes progress, there are a lot of heartbeats in the logs, and none of them eventually completes. YARN is set to use the Fair Scheduler; I would imagine that in a situation like this it should give resources to at least one of the jobs, but it seems that all the jobs keep fighting for resources and YARN is not able to resolve the impasse. I would like to know what the best practices are for handling this type of scenario. Do I need to define different YARN queues with different resources/priorities (currently all the jobs run in the default queue)?
... View more
Labels:
- Apache Oozie
- Apache YARN
08-01-2018
01:05 PM
Hello everyone, I have a Spark application which runs fine with test tables but fails in production, where there are tables with 200 million records and about 100 columns. From the logs the error seems related to the Snappy codec, although these tables have been saved in Parquet without compression, and at write time I also explicitly turned compression off with:
sqlContext.sql("SET hive.exec.compress.output=false")
sqlContext.sql("SET parquet.compression=NONE")
sqlContext.sql("SET spark.sql.parquet.compression.codec=uncompressed")
The error is the following:
2018-08-01 16:19:45,467 [dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - ShuffleMapStage 183 (saveAsTable at Model1Prep.scala:776) failed in 543.126 s due to Job aborted due to stage failure: Task 169 in stage 97.0 failed 4 times, most recent failure: Lost task 169.3 in stage 97.0 (TID 15079, prwor-e414c813.azcloud.local, executor 2): java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1280)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:54)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:148)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:416)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:117)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Why is this happening if compression is turned off? Could it be that compression is used anyway during the shuffle phases? (See the sketch at the end of this post for what I mean.)
The cluster has the following characteristics:
2 master nodes
7 worker nodes
Each node has:
cpu: 16 cores
ram: 110 GB
hdfs disks: 4x1TB
These are the YARN settings for memory (GB):
yarn.nodemanager.resource.memory-mb = 84
yarn.scheduler.minimum-allocation-mb = 12
yarn.scheduler.maximum-allocation-mb = 84
mapreduce.map.memory.mb = 6
mapreduce.reduce.memory.mb = 12
mapreduce.map.java.opts = 4.8
mapreduce.reduce.java.opts = 9.6
yarn.app.mapreduce.am.resource.mb = 6
yarn.app.mapreduce.am.command-opts = 4.8
yarn.scheduler.maximum-allocation-vcores = 5
SPARK on YARN settings:
spark.shuffle.service.enabled: ENABLED
spark.dynamicAllocation.enabled: ENABLED
SPARK job submission settings:
--driver-memory 30G
--executor-cores 5
--executor-memory 30G
Has anyone any hint as to why this is happening?
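To clarify what I mean by compression during the shuffle phases: as far as I understand, Spark has its own shuffle/spill compression settings that are independent of the Hive/Parquet output settings above, and the stack trace goes through SnappyCompressionCodec while reading spilled sort data. This is only a sketch of how I could disable them to test that theory, assuming these properties apply to my Spark version:
import org.apache.spark.SparkConf
// Sketch only: turn off Spark's own shuffle/spill compression, which is separate
// from hive.exec.compress.output / parquet.compression used for the table data.
val conf = new SparkConf()
  .set("spark.shuffle.compress", "false")        // compression of map output files
  .set("spark.shuffle.spill.compress", "false")  // compression of data spilled to disk during sorts
  // or keep compression but move away from Snappy:
  // .set("spark.io.compression.codec", "lz4")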
... View more
Labels:
- Apache Spark
- Apache YARN
07-24-2018
06:59 AM
1 Kudo
You can use the JSON SerDe. You have to create the table with a structure that maps the structure of the JSON. For example:
data.json
{"X": 134, "Y": 55, "labels": ["L1", "L2"]}
{"X": 11, "Y": 166, "labels": ["L1", "L3", "L4"]}
create table
CREATE TABLE Point
(
X INT,
Y INT,
labels ARRAY<STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'path/to/table';
Then you upload your JSON file to the location path of the table, give it the right permissions, and you are good to go.
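As a usage sketch (assuming you query the table from Spark with a sqlContext backed by the Hive metastore; the queries themselves are plain HiveQL against the example table above):
// Hypothetical usage; column names come from the example table above.
val points = sqlContext.sql("SELECT X, Y, labels FROM Point")
points.show()
// Array columns can be indexed directly, e.g. the first label:
sqlContext.sql("SELECT X, labels[0] AS first_label FROM Point").show()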
... View more
05-23-2018
01:26 AM
Thanks, I indeed ended up using Maven and the plugins.d folder on Flume. I forgot to update the topic; thank you guys for the help!
... View more
05-14-2018
10:28 AM
Thanks @Harsh J, indeed I finally solved it by using hdfs://hanameservice for the name node and yarnrm for the job tracker.
... View more