Support Questions


best way to do mapreduce

Master Collaborator

Hi:

For ETL, which is the best non-streaming technology for writing a script? I am currently using Pig, but I don't know whether it is a good fit for batch processing.

Any suggestions, please?

In the future I will do it with Spark.

Thanks

1 ACCEPTED SOLUTION

Master Collaborator

Hi:

I have resolved the problem by removing the google-collections jar from my script (commented out below):

--register /usr/lib/piggybank/google-collections-1.0.jar;
register /usr/lib/piggybank/piggybank.jar;
register /usr/lib/piggybank/elephant-bird-core-4.1.jar;
register /usr/lib/piggybank/elephant-bird-pig-4.1.jar;
register /usr/lib/piggybank/elephant-bird-hadoop-compat-4.1.jar;
register /usr/lib/piggybank/json-simple.jar;
register /usr/lib/piggybank/libfb303.jar;


SET mapreduce.input.fileinputformat.split.minsize 107520;
SET mapreduce.input.fileinputformat.split.maxsize 276480;
SET default_parallel 5;
SET exectype tez;


View solution in original post

8 REPLIES


@Roberto Sancho Pig is a good tool for ETL and data-warehouse-style processing of your data. It provides an abstraction layer over the underlying processing engine (MapReduce or Tez). You can use Tez as the execution engine to speed up processing. This Pig Tutorial has additional information.
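As a minimal sketch of switching engines (the script name is illustrative), Tez can be selected either inside the script, as the original poster does further down in this thread, or when launching Pig:

```
-- inside the Pig script:
set exectype tez;
```

From the shell, the equivalent is to launch Pig in Tez mode: `pig -x tez etl_script.pig`.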


Hi @Roberto Sancho,

You can use Hive or Pig for ETL. In HDP, Hive and Pig run on Tez rather than on MapReduce, which gives you much better performance.

You can use Spark too as you stated.

Master Collaborator

Hi:

I ran Pig with Hive and received this error. Could anyone please help me?

Container exited with a non-zero exit code 255
]]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:2 killedTasks:62, Vertex vertex_1461739406783_0064_1_00 [scope-67] killed/failed due to:OWN_TASK_FAILURE]
Vertex killed, vertexName=scope-94, vertexId=vertex_1461739406783_0064_1_04, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1461739406783_0064_1_04 [scope-94] killed/failed due to:OTHER_VERTEX_FAILURE]
Vertex killed, vertexName=scope-92, vertexId=vertex_1461739406783_0064_1_03, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1461739406783_0064_1_03 [scope-92] killed/failed due to:OTHER_VERTEX_FAILURE]
Vertex killed, vertexName=scope-82, vertexId=vertex_1461739406783_0064_1_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1461739406783_0064_1_02 [scope-82] killed/failed due to:OTHER_VERTEX_FAILURE]
Vertex killed, vertexName=scope-73, vertexId=vertex_1461739406783_0064_1_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:45, Vertex vertex_1461739406783_0064_1_01 [scope-73] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:4
        at org.apache.pig.tools.pigstats.tez.TezPigScriptStats.accumulateStats(TezPigScriptStats.java:193)
        at org.apache.pig.backend.hadoop.executionengine.tez.TezJob.run(TezJob.java:198)
        at org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher$1.run(TezLauncher.java:195)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


Expert Contributor

Can you share more information on the submitted query, and possibly the application logs? Which version of Hive are you using?

Master Collaborator

Hi:

I am using a Pig script with this parameter:

SET exectype tez;

Master Collaborator

Hi:

The error from the log:

2016-04-28 11:18:22,146 [PigTezLauncher-0] INFO  org.apache.tez.client.TezClient - The url to track the Tez Session: http://lnxbig05.cajarural.gcr:8088/proxy/application_1461739406783_0071/
2016-04-28 11:18:26,770 [PigTezLauncher-0] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob - Submitting DAG PigLatin:mario.pig-0_scope-0
2016-04-28 11:18:26,770 [PigTezLauncher-0] INFO  org.apache.tez.client.TezClient - Submitting dag to TezSession, sessionName=PigLatin:mario.pig, applicationId=application_1461739406783_0071, dagName=PigLatin:mario.pig-0_scope-0, callerContext={ context=PIG, callerType=PIG_SCRIPT_ID, callerId=PIG-mario.pig-b0f06568-7cba-4e19-aab2-128bc7afb536 }
2016-04-28 11:18:26,778 [PigTezLauncher-0] INFO  org.apache.tez.dag.api.DAG - Inferring parallelism for vertex: scope-92 to be 5 from 1-1 connection with vertex scope-73
2016-04-28 11:18:27,618 [PigTezLauncher-0] INFO  org.apache.tez.client.TezClient - Submitted dag to TezSession, sessionName=PigLatin:mario.pig, applicationId=application_1461739406783_0071, dagName=PigLatin:mario.pig-0_scope-0
2016-04-28 11:18:27,772 [PigTezLauncher-0] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://lnxbig06.cajarural.gcr:8188/ws/v1/timeline/
2016-04-28 11:18:27,772 [PigTezLauncher-0] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at lnxbig05.cajarural.gcr/10.1.246.19:8050
2016-04-28 11:18:27,782 [PigTezLauncher-0] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob - Submitted DAG PigLatin:mario.pig-0_scope-0. Application id: application_1461739406783_0071
2016-04-28 11:18:28,007 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher - HadoopJobId: job_1461739406783_0071
2016-04-28 11:18:28,783 [Timer-0] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 102 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=null


Log Type: syslog
Log Upload Time: Thu Apr 28 11:24:42 +0200 2016
Log Length: 1552
2016-04-28 11:24:41,246 [ERROR] [main] |app.DAGAppMaster|: Error starting DAGAppMaster
java.lang.NoSuchMethodError: com.google.common.collect.MapMaker.keyEquivalence(Lcom/google/common/base/Equivalence;)Lcom/google/common/collect/MapMaker;
	at com.google.common.collect.Interners$WeakInterner.<init>(Interners.java:68)
	at com.google.common.collect.Interners$WeakInterner.<init>(Interners.java:66)
	at com.google.common.collect.Interners.newWeakInterner(Interners.java:63)
	at org.apache.hadoop.util.StringInterner.<clinit>(StringInterner.java:49)
	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2600)
	at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
	at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
	at org.apache.hadoop.conf.Configuration.get(Configuration.java:1232)
	at org.apache.hadoop.yarn.factory.providers.RecordFactoryProvider.getRecordFactory(RecordFactoryProvider.java:49)
	at org.apache.hadoop.yarn.util.Records.<clinit>(Records.java:32)
	at org.apache.hadoop.yarn.api.records.ApplicationId.newInstance(ApplicationId.java:49)
	at org.apache.hadoop.yarn.api.records.ContainerId.toApplicationAttemptId(ContainerId.java:249)
	at org.apache.hadoop.yarn.api.records.ContainerId.toApplicationAttemptId(ContainerId.java:244)
	at org.apache.hadoop.yarn.api.records.ContainerId.fromString(ContainerId.java:223)
	at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:179)
	at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:2040)
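For anyone hitting the same `NoSuchMethodError`: it usually means two different Guava/google-collections versions are on the classpath, and an old `MapMaker` class is shadowing the one Hadoop's `StringInterner` expects. A quick way to check which registered jar bundles the offending class is a scan like this sketch (the path pattern is illustrative, matching the `register` lines above):

```python
# Sketch: find which jars on a path bundle a given class entry,
# e.g. the Guava MapMaker class that conflicts with the version
# Hadoop/Tez expects. Adjust the pattern to your environment.
import glob
import zipfile

def jars_containing(entry, pattern="/usr/lib/piggybank/*.jar"):
    hits = []
    for jar in glob.glob(pattern):
        with zipfile.ZipFile(jar) as zf:
            # jar files are zip archives; look for the exact class entry
            if any(name == entry for name in zf.namelist()):
                hits.append(jar)
    return hits

# e.g. jars_containing("com/google/common/collect/MapMaker.class")
```

Any jar this reports, other than the Guava version your Hadoop distribution ships, is a candidate to drop from the `register` list, which is effectively what the accepted solution does with google-collections-1.0.jar.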

Master Collaborator

Hi:

I have resolved the problem by removing the google-collections jar from my script (commented out below):

--register /usr/lib/piggybank/google-collections-1.0.jar;
register /usr/lib/piggybank/piggybank.jar;
register /usr/lib/piggybank/elephant-bird-core-4.1.jar;
register /usr/lib/piggybank/elephant-bird-pig-4.1.jar;
register /usr/lib/piggybank/elephant-bird-hadoop-compat-4.1.jar;
register /usr/lib/piggybank/json-simple.jar;
register /usr/lib/piggybank/libfb303.jar;


SET mapreduce.input.fileinputformat.split.minsize 107520;
SET mapreduce.input.fileinputformat.split.maxsize 276480;
SET default_parallel 5;
SET exectype tez;


Master Collaborator

Hi:

Finally, it works with this:

register /usr/lib/piggybank/piggybank.jar;
register /usr/lib/piggybank/elephant-bird-core-4.1.jar;
register /usr/lib/piggybank/elephant-bird-pig-4.1.jar;
register /usr/lib/piggybank/elephant-bird-hadoop-compat-4.1.jar;
register /usr/lib/piggybank/json-simple.jar;
register /usr/lib/piggybank/libfb303.jar;

SET mapreduce.input.fileinputformat.split.minsize 107520;
SET mapreduce.input.fileinputformat.split.maxsize 276480;
SET exectype tez;
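For context on the split settings in the working script above: the min/max split sizes are in bytes, so these values cap each input split at roughly 270 KB. A back-of-envelope sketch of how many splits a file yields (assuming the HDFS block size exceeds the max split size, since FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize))):

```python
# Rough sketch using the values from the script above (bytes).
minsize = 107520   # mapreduce.input.fileinputformat.split.minsize, ~105 KB
maxsize = 276480   # mapreduce.input.fileinputformat.split.maxsize, ~270 KB

def approx_splits(file_bytes, maxsize=maxsize):
    # With blockSize > maxsize, each split is capped at maxsize,
    # so the split count is roughly the ceiling of size / maxsize.
    return -(-file_bytes // maxsize)  # ceiling division

print(approx_splits(10 * 1024 * 1024))  # a 10 MB input file -> 38
```

Smaller max split sizes mean more, smaller tasks; this is a common way to increase parallelism on small inputs, at the cost of per-task overhead.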