Member since: 06-03-2014
Posts: 62
Kudos Received: 3
Solutions: 6

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2954 | 11-30-2017 10:32 AM |
 | 4903 | 01-20-2016 05:08 PM |
 | 2266 | 01-13-2015 02:42 PM |
 | 4629 | 11-12-2014 11:09 AM |
 | 10990 | 08-20-2014 09:29 AM |
11-12-2014
11:09 AM
I stumbled upon an article about how to use STORE and DUMP appropriately in a Pig script. It turns out I have been using both a DUMP and a STORE command in our scripts to output some debugging information, when I should only be using STORE. DUMP is meant for debugging only, and combining the two commands makes the script run TWICE!

From the Apache documentation (http://pig.apache.org/docs/r0.12.0/perf.html#store-dump):

Store vs. Dump: With multi-query execution, you want to use STORE to save (persist) your results. You do not want to use DUMP as it will disable multi-query execution and is likely to slow down execution. (If you have included DUMP statements in your scripts for debugging purposes, you should remove them.)

DUMP Example: In this script, because the DUMP command is interactive, multi-query execution is disabled and two separate jobs are created to execute the script. The first job executes A > B > DUMP while the second job executes A > B > C > STORE.

A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';

STORE Example: In this script, multi-query optimization kicks in, allowing the entire script to be executed as a single job. Two outputs are produced: output1 and output2.

A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
STORE B INTO 'output1';
C = FOREACH B GENERATE y, z;
STORE C INTO 'output2';
11-12-2014
10:35 AM
My Pig script (running through Hue) fails to store the results into HDFS on the first attempt. Immediately after attempting to store the data, the entire Pig script restarts. The script then completes successfully on the second attempt. Here is my Pig script:

offers = LOAD '/tmp/file.txt' USING PigStorage AS (tabid:CHARARRAY, offerNum:CHARARRAY);
describe offers;
offers5= LIMIT offers 5;
dump offers5;
STORE offers INTO '/tmp/folder' USING PigStorage();

I think my Pig script is written poorly. Can you identify why the entire script would restart? I can't find anything useful in the logs! Where can I look to try to resolve this issue?
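Following the Store vs. Dump note in the 11-12-2014 follow-up above, here is a minimal sketch of the same script with the interactive debugging statements removed, so that multi-query execution stays enabled and only one job is submitted. The paths and alias names are copied from the question; the change is illustrative, not a tested fix:

-- Sketch: same LOAD and STORE as above, with the DESCRIBE/DUMP debugging removed
-- so Pig's multi-query execution is not disabled.
offers = LOAD '/tmp/file.txt' USING PigStorage() AS (tabid:CHARARRAY, offerNum:CHARARRAY);
STORE offers INTO '/tmp/folder' USING PigStorage();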
Labels:
- Apache Pig
- Cloudera Hue
- HDFS
09-17-2014
11:25 AM
You pointed out the problem and I removed the -Xmx825955249 from where I had entered it in Cloudera Manager. I was using the wrong field to update the value. Thank you so much for sticking with me and helping me resolve this issue! The jobs now succeed! Kevin Verhoeven
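In case it helps anyone else, here is a sketch of what the relevant entries in the generated mapred-site.xml should look like once the stray -Xmx825955249 is no longer appended. The base values are taken from the config posted earlier in this thread; the exact output Cloudera Manager regenerates may differ:

<!-- Sketch: java opts with only the intended heap flag, after removing the extra
     -Xmx825955249 that had been entered in the wrong Cloudera Manager field. -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Djava.net.preferIPv4Stack=true -Xmx768m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Djava.net.preferIPv4Stack=true -Xmx1280m</value>
</property>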
09-16-2014
12:05 PM
I have an easy question: if I increase mapreduce.reduce.shuffle.parallelcopies from 4 to 10, will that increase or decrease the memory used by the node? It seems to me that if this is increased, data is written to files quickly and moved out of memory. But I might be wrong...
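For context, a sketch of the mapred-site.xml properties involved. Raising parallelcopies adds more concurrent copier threads on the reduce side, but the in-memory shuffle footprint is still capped by the reducer heap times the buffer-percent settings below. The two percent values shown are the stock Apache defaults and are assumed, not taken from this cluster's config:

<!-- Sketch only: more parallel copies means more concurrent fetches, while the amount of
     shuffle data held in memory stays bounded by these percentages of the reducer heap. -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>10</value>
</property>
<property>
  <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
  <value>0.70</value>
</property>
<property>
  <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
  <value>0.25</value>
</property>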
09-16-2014
11:42 AM
Here is what I have. Are the map/reduce java opts being overwritten? There are two -Xmx entries:

<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property><name>mapreduce.job.split.metainfo.maxsize</name><value>10000000</value></property>
  <property><name>mapreduce.job.counters.max</name><value>120</value></property>
  <property><name>mapreduce.output.fileoutputformat.compress</name><value>false</value></property>
  <property><name>mapreduce.output.fileoutputformat.compress.type</name><value>BLOCK</value></property>
  <property><name>mapreduce.output.fileoutputformat.compress.codec</name><value>org.apache.hadoop.io.compress.DefaultCodec</value></property>
  <property><name>mapreduce.map.output.compress.codec</name><value>org.apache.hadoop.io.compress.SnappyCodec</value></property>
  <property><name>mapreduce.map.output.compress</name><value>true</value></property>
  <property><name>zlib.compress.level</name><value>DEFAULT_COMPRESSION</value></property>
  <property><name>mapreduce.task.io.sort.factor</name><value>5</value></property>
  <property><name>mapreduce.map.sort.spill.percent</name><value>0.8</value></property>
  <property><name>mapreduce.reduce.shuffle.parallelcopies</name><value>4</value></property>
  <property><name>mapreduce.task.timeout</name><value>600000</value></property>
  <property><name>mapreduce.client.submit.file.replication</name><value>4</value></property>
  <property><name>mapreduce.job.reduces</name><value>2</value></property>
  <property><name>mapreduce.task.io.sort.mb</name><value>512</value></property>
  <property><name>mapreduce.map.speculative</name><value>false</value></property>
  <property><name>mapreduce.reduce.speculative</name><value>false</value></property>
  <property><name>mapreduce.job.reduce.slowstart.completedmaps</name><value>0.8</value></property>
  <property><name>mapreduce.jobhistory.address</name><value>blvdevhdp05.ds-iq.corp:10020</value></property>
  <property><name>mapreduce.jobhistory.webapp.address</name><value>blvdevhdp05.ds-iq.corp:19888</value></property>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
  <property><name>yarn.app.mapreduce.am.staging-dir</name><value>/user</value></property>
  <property><name>yarn.app.mapreduce.am.resource.mb</name><value>1536</value></property>
  <property><name>yarn.app.mapreduce.am.resource.cpu-vcores</name><value>1</value></property>
  <property><name>mapreduce.job.ubertask.enabled</name><value>false</value></property>
  <property><name>yarn.app.mapreduce.am.command-opts</name><value>-Djava.net.preferIPv4Stack=true -Xmx825955249</value></property>
  <property><name>mapreduce.map.java.opts</name><value>-Djava.net.preferIPv4Stack=true -Xmx768m -Xmx825955249</value></property>
  <property><name>mapreduce.reduce.java.opts</name><value>-Djava.net.preferIPv4Stack=true -Xmx1280m -Xmx825955249</value></property>
  <property><name>mapreduce.map.memory.mb</name><value>1280</value></property>
  <property><name>mapreduce.map.cpu.vcores</name><value>1</value></property>
  <property><name>mapreduce.reduce.memory.mb</name><value>1792</value></property>
  <property><name>mapreduce.reduce.cpu.vcores</name><value>1</value></property>
  <property><name>mapreduce.application.classpath</name><value>$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH</value></property>
  <property><name>mapreduce.admin.user.env</name><value>LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH</value></property>
  <property><name>mapreduce.shuffle.max.connections</name><value>80</value></property>
</configuration>
09-16-2014
10:38 AM
Thanks bcwalrus, what if I increased the mapreduce.task.io.sort.factor, which is currently set to 5? Also, do you know if it would be helpful to increase the mapreduce.reduce.java.opts.max.heap from the current setting of 787.69 MiB? Or is this not helpful?
09-16-2014
10:23 AM
From the YARN logs I can see that YARN reports a huge amount of virtual memory in use before the job is killed. Why is it using so much virtual memory, and where is this limit set?

2014-09-16 10:18:30,803 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 51870 for container-id container_1410882800578_0001_01_000001: 797.0 MB of 2.5 GB physical memory used; 1.8 GB of 5.3 GB virtual memory used
2014-09-16 10:18:33,829 INFO ... org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1410882800578_0005_01_000048
2014-09-16 10:18:34,431 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=admin IP=192.168.210.251 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1410882800578_0005 CONTAINERID=container_1410882800578_0005_01_000048
2014-09-16 10:18:34,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410882800578_0005_01_000048 transitioned from RUNNING to KILLING
2014-09-16 10:18:34,433 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1410882800578_0005_01_000048
2014-09-16 10:18:34,462 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1410882800578_0005_01_000048 is : 143
2014-09-16 10:18:34,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410882800578_0005_01_000048 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2014-09-16 10:18:34,553 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /space1/yarn/nm/usercache/admin/appcache/application_1410882800578_0005/container_1410882800578_0005_01_000048
2014-09-16 10:18:34,556 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /space2/yarn/nm/usercache/admin/appcache/application_1410882800578_0005/container_1410882800578_0005_01_000048
2014-09-16 10:18:34,558 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=admin OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1410882800578_0005 CONTAINERID=container_1410882800578_0005_01_000048
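For reference, the 2.5 GB physical / 5.3 GB virtual limits in the first log line are roughly consistent with YARN's default virtual-to-physical ratio of 2.1. A sketch of the yarn-site.xml properties that control this check; the values shown are the stock Apache defaults and are assumed, not taken from this cluster:

<!-- Sketch: NodeManager settings governing the virtual-memory check seen in the log. -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value> <!-- virtual memory allowed per unit of physical container memory -->
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value> <!-- set to false to stop killing containers for vmem overuse -->
</property>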
09-16-2014
09:36 AM
Thanks for your help with this problem, I didn't know the default was unlimited. The max number of reducers for each TT was set at 2. I don't have the job counters from a big MR1 job, but I might be able to look them up. Where would I find them?
09-16-2014
08:32 AM
I don't think we ever changed the -Xmx on the reducers in MR1; it would have remained at the default. Do you know what the default is for MR1?
09-15-2014
10:30 AM
Thanks bcwalrus, very good question: in MRv1, we set the "Java Heap Size of TaskTracker in Bytes" to 600 MB. Do you think I've set this too high in MRv2? I'll cut the AM memory down to 1 GB; that is good advice and will save me some memory on the node. Kevin