Member since: 06-03-2014
Posts: 62
Kudos Received: 3
Solutions: 6

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2954 | 11-30-2017 10:32 AM |
 | 4903 | 01-20-2016 05:08 PM |
 | 2266 | 01-13-2015 02:42 PM |
 | 4629 | 11-12-2014 11:09 AM |
 | 10990 | 08-20-2014 09:29 AM |
11-12-2014
11:09 AM
I stumbled upon an article about how to use STORE and DUMP appropriately in a Pig script. It turns out I have been using both a DUMP and a STORE command in our scripts to output some debugging information, when I should only be using STORE. DUMP is meant for debugging only, and combining the two commands makes the script run TWICE!

From the Apache documentation (http://pig.apache.org/docs/r0.12.0/perf.html#store-dump):

Store vs. Dump: With multi-query execution, you want to use STORE to save (persist) your results. You do not want to use DUMP as it will disable multi-query execution and is likely to slow down execution. (If you have included DUMP statements in your scripts for debugging purposes, you should remove them.)

DUMP Example: In this script, because the DUMP command is interactive, multi-query execution is disabled and two separate jobs are created to execute the script. The first job executes A > B > DUMP while the second job executes A > B > C > STORE.

A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';

STORE Example: In this script, multi-query optimization kicks in, allowing the entire script to be executed as a single job. Two outputs are produced: output1 and output2.

A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
STORE B INTO 'output1';
C = FOREACH B GENERATE y, z;
STORE C INTO 'output2';
11-12-2014
10:35 AM
My Pig script (running through Hue) fails to store the results into HDFS on the first attempt. Immediately after attempting to store the data, the entire Pig script restarts. The script then completes successfully on the second attempt. Here is my Pig script:

offers = LOAD '/tmp/file.txt' USING PigStorage AS (tabid:CHARARRAY, offerNum:CHARARRAY);
describe offers;
offers5= LIMIT offers 5;
dump offers5;
STORE offers INTO '/tmp/folder' USING PigStorage();

I think my Pig script is written poorly. Can you identify why the entire script would restart? I can't find anything useful in the logs! Where can I look to try to resolve this issue?
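Following the Store vs. Dump note in the 11-12-2014 follow-up above, here is a minimal sketch of the same script with the interactive debugging statements removed, so that multi-query execution stays enabled and only one job is submitted. The paths and alias names are copied from the question; the change is illustrative, not a tested fix:

-- Sketch: same LOAD and STORE as above, with the DESCRIBE/DUMP debugging removed
-- so Pig's multi-query execution is not disabled.
offers = LOAD '/tmp/file.txt' USING PigStorage() AS (tabid:CHARARRAY, offerNum:CHARARRAY);
STORE offers INTO '/tmp/folder' USING PigStorage();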
Labels:
- Apache Pig
- Cloudera Hue
- HDFS
09-17-2014
11:25 AM
You pointed out the problem and I removed the -Xmx825955249 from where I had entered it in Cloudera Manager. I was using the wrong field to update the value. Thank you so much for sticking with me and helping me resolve this issue! The jobs now succeed! Kevin Verhoeven
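In case it helps anyone else, here is a sketch of what the relevant entries in the generated mapred-site.xml should look like once the stray -Xmx825955249 is no longer appended. The base values are taken from the config posted earlier in this thread; the exact output Cloudera Manager regenerates may differ:

<!-- Sketch: java opts with only the intended heap flag, after removing the extra
     -Xmx825955249 that had been entered in the wrong Cloudera Manager field. -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Djava.net.preferIPv4Stack=true -Xmx768m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Djava.net.preferIPv4Stack=true -Xmx1280m</value>
</property>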
09-16-2014
12:05 PM
I have an easy question: if I increase mapreduce.reduce.shuffle.parallelcopies from 4 to 10, will that increase or decrease the memory used by the node? It seems to me that if this is increased, data is written to files quickly and moved out of memory. But I might be wrong...
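For context, a sketch of the mapred-site.xml properties involved. Raising parallelcopies adds more concurrent copier threads on the reduce side, but the in-memory shuffle footprint is still capped by the reducer heap times the buffer-percent settings below. The two percent values shown are the stock Apache defaults and are assumed, not taken from this cluster's config:

<!-- Sketch only: more parallel copies means more concurrent fetches, while the amount of
     shuffle data held in memory stays bounded by these percentages of the reducer heap. -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>10</value>
</property>
<property>
  <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
  <value>0.70</value>
</property>
<property>
  <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
  <value>0.25</value>
</property>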
09-16-2014
11:42 AM
Here is what I have. Are the map/reduce java opts being overwritten? There are two -Xmx entries:

<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property><name>mapreduce.job.split.metainfo.maxsize</name><value>10000000</value></property>
  <property><name>mapreduce.job.counters.max</name><value>120</value></property>
  <property><name>mapreduce.output.fileoutputformat.compress</name><value>false</value></property>
  <property><name>mapreduce.output.fileoutputformat.compress.type</name><value>BLOCK</value></property>
  <property><name>mapreduce.output.fileoutputformat.compress.codec</name><value>org.apache.hadoop.io.compress.DefaultCodec</value></property>
  <property><name>mapreduce.map.output.compress.codec</name><value>org.apache.hadoop.io.compress.SnappyCodec</value></property>
  <property><name>mapreduce.map.output.compress</name><value>true</value></property>
  <property><name>zlib.compress.level</name><value>DEFAULT_COMPRESSION</value></property>
  <property><name>mapreduce.task.io.sort.factor</name><value>5</value></property>
  <property><name>mapreduce.map.sort.spill.percent</name><value>0.8</value></property>
  <property><name>mapreduce.reduce.shuffle.parallelcopies</name><value>4</value></property>
  <property><name>mapreduce.task.timeout</name><value>600000</value></property>
  <property><name>mapreduce.client.submit.file.replication</name><value>4</value></property>
  <property><name>mapreduce.job.reduces</name><value>2</value></property>
  <property><name>mapreduce.task.io.sort.mb</name><value>512</value></property>
  <property><name>mapreduce.map.speculative</name><value>false</value></property>
  <property><name>mapreduce.reduce.speculative</name><value>false</value></property>
  <property><name>mapreduce.job.reduce.slowstart.completedmaps</name><value>0.8</value></property>
  <property><name>mapreduce.jobhistory.address</name><value>blvdevhdp05.ds-iq.corp:10020</value></property>
  <property><name>mapreduce.jobhistory.webapp.address</name><value>blvdevhdp05.ds-iq.corp:19888</value></property>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
  <property><name>yarn.app.mapreduce.am.staging-dir</name><value>/user</value></property>
  <property><name>yarn.app.mapreduce.am.resource.mb</name><value>1536</value></property>
  <property><name>yarn.app.mapreduce.am.resource.cpu-vcores</name><value>1</value></property>
  <property><name>mapreduce.job.ubertask.enabled</name><value>false</value></property>
  <property><name>yarn.app.mapreduce.am.command-opts</name><value>-Djava.net.preferIPv4Stack=true -Xmx825955249</value></property>
  <property><name>mapreduce.map.java.opts</name><value>-Djava.net.preferIPv4Stack=true -Xmx768m -Xmx825955249</value></property>
  <property><name>mapreduce.reduce.java.opts</name><value>-Djava.net.preferIPv4Stack=true -Xmx1280m -Xmx825955249</value></property>
  <property><name>mapreduce.map.memory.mb</name><value>1280</value></property>
  <property><name>mapreduce.map.cpu.vcores</name><value>1</value></property>
  <property><name>mapreduce.reduce.memory.mb</name><value>1792</value></property>
  <property><name>mapreduce.reduce.cpu.vcores</name><value>1</value></property>
  <property><name>mapreduce.application.classpath</name><value>$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH</value></property>
  <property><name>mapreduce.admin.user.env</name><value>LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH</value></property>
  <property><name>mapreduce.shuffle.max.connections</name><value>80</value></property>
</configuration>
09-16-2014
10:38 AM
Thanks bcwalrus, what if I increased the mapreduce.task.io.sort.factor, which is currently set to 5? Also, do you know if it would be helpful to increase the mapreduce.reduce.java.opts.max.heap from the current setting of 787.69 MiB? Or is this not helpful?
09-16-2014
10:23 AM
From the YARN logs I can see that YARN reports a huge amount of virtual memory in use before the job is killed. Why is it using so much virtual memory, and where is this limit set?

2014-09-16 10:18:30,803 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 51870 for container-id container_1410882800578_0001_01_000001: 797.0 MB of 2.5 GB physical memory used; 1.8 GB of 5.3 GB virtual memory used
2014-09-16 10:18:33,829 INFO ... org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1410882800578_0005_01_000048
2014-09-16 10:18:34,431 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=admin IP=192.168.210.251 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1410882800578_0005 CONTAINERID=container_1410882800578_0005_01_000048
2014-09-16 10:18:34,432 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410882800578_0005_01_000048 transitioned from RUNNING to KILLING
2014-09-16 10:18:34,433 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1410882800578_0005_01_000048
2014-09-16 10:18:34,462 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1410882800578_0005_01_000048 is : 143
2014-09-16 10:18:34,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410882800578_0005_01_000048 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2014-09-16 10:18:34,553 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /space1/yarn/nm/usercache/admin/appcache/application_1410882800578_0005/container_1410882800578_0005_01_000048
2014-09-16 10:18:34,556 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /space2/yarn/nm/usercache/admin/appcache/application_1410882800578_0005/container_1410882800578_0005_01_000048
2014-09-16 10:18:34,558 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=admin OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1410882800578_0005 CONTAINERID=container_1410882800578_0005_01_000048
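For reference, the 2.5 GB physical / 5.3 GB virtual limits in the first log line are roughly consistent with YARN's default virtual-to-physical ratio of 2.1. A sketch of the yarn-site.xml properties that control this check; the values shown are the stock Apache defaults and are assumed, not taken from this cluster:

<!-- Sketch: NodeManager settings governing the virtual-memory check seen in the log. -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value> <!-- virtual memory allowed per unit of physical container memory -->
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value> <!-- set to false to stop killing containers for vmem overuse -->
</property>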
09-16-2014
09:36 AM
Thanks for your help with this problem, I didn't know the default was unlimited. The max number of reducers for each TT was set at 2. I don't have the job counters from a big MR1 job, but I might be able to look them up. Where would I find them?
09-16-2014
08:32 AM
I don't think we ever changed the -Xmx on the reducers in MR1; it would have remained at the default. Do you know what the default is for MR1?
09-15-2014
10:30 AM
Thanks bcwalrus, very good question: in MRv1, we set the "Java Heap Size of TaskTracker in Bytes" to 600 MB. Do you think I've set this too high in MRv2? I'll cut the AM memory down to 1 GB; that is good advice and will save me some memory on the node. Kevin