
Execute a pig job on all nodes


Champion Alumni

Hello,

 

I have a Pig job that I schedule with Oozie. This Pig job reads data from a Hive table and writes into 3 HBase tables (via a UDF).

 

The problem is that only one node is working. 

 

I notice that this job has only mappers and no reducers. Is this the problem? 

 

I'm asking this because of the thread:

https://community.cloudera.com/t5/Batch-Processing-and-Workflow/Execute-Shell-script-through-oozie-j...

 

 

where @Sue said "The Oozie shell action is run as a Hadoop job with one map task and zero reduce tasks - the job runs on one arbitrary node in the cluster."
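(For reference, a minimal shell action definition looks roughly like the sketch below; the action name, script name, and transition targets are only placeholders. As described in that thread, Oozie runs such an action as a launcher job with a single map task, on one arbitrary node.)

<action name="my-shell-action">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- the script runs on whichever node receives the single launcher map task -->
        <exec>my_script.sh</exec>
        <file>my_script.sh#my_script.sh</file>
    </shell>
    <ok to="end"/>
    <error to="kill"/>
</action>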

 

Is there a way to force the cluster to use all the nodes?

 

Thank you!

 

 

GHERMAN Alina
3 REPLIES

Re: Execute a pig job on all nodes

Rising Star

If you are using the Pig action, as opposed to a shell action, then your Pig job should run in a distributed fashion.

 

Here is a great blog post (recently updated for CDH 5) that shows a step by step example for running a Pig script as an Oozie Pig action:

https://blog.cloudera.com/blog/2013/03/how-to-use-oozie-shell-and-java-actions/
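For comparison, a Pig action looks roughly like the sketch below (the action name, script name, and parameter are placeholders). The Oozie launcher still occupies only one node, but the MapReduce jobs that the Pig script itself generates are scheduled across the cluster:

<action name="my-pig-action">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- the Pig script; the MR jobs it compiles to run on the whole cluster -->
        <script>my_script.pig</script>
        <param>INPUT_TABLE=my_database.my_table</param>
    </pig>
    <ok to="end"/>
    <error to="kill"/>
</action>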

Re: Execute a pig job on all nodes

Champion Alumni

Hello,

 

Thank you for your answer. 

 

The problem is that classic Pig scripts (with no access to Hive tables or HBase) run in a distributed way (they have both mappers and reducers).

 

However, this one runs on only one node
(in Cloudera Manager -> Hosts, all the other hosts have a Load Average of 0.* while one host has a Load Average of 9.*).

 

Since you say that the script should normally run in a distributed way even if only mappers are created, I am posting an anonymized version of my script below.

 

SET mapreduce.fileoutputcommitter.marksuccessfuljobs false;
SET output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
SET hbase.zookeeper.quorum '${ZOOKEEPER_QUORUM}';
SET oozie.use.system.libpath true;
SET oozie.libpath '${PATH_LIB_OOZIE}';
------------------------------------------------------------


-- hcat
register 'hive-hcatalog-core-0.13.1-cdh5.3.0.jar';
register 'hive-hcatalog-core.jar';
register 'hive-hcatalog-pig-adapter-0.13.1-cdh5.3.0.jar';
register 'hive-hcatalog-pig-adapter.jar';
register 'hive-metastore-0.13.1-cdh5.3.0.jar';
register 'datanucleus-core-3.2.10.jar';
register 'datanucleus-api-jdo-3.2.6.jar';
register 'datanucleus-rdbms-3.2.9.jar';
register 'commons-dbcp-1.4.jar';
register 'commons-pool-1.5.4.jar';
register 'jdo-api-3.0.1.jar';

-- UDF
REGISTER 'MyStoreUDF-0.3.8.jar';

------------------------------------------------------------------------------------------------------------
----------------------------------------------- input data -------------------------------------------------

var_a = LOAD 'my_database.my_table' USING org.apache.hcatalog.pig.HCatLoader() AS (
            a:chararray,
            b:chararray,
            c:chararray,
            d:chararray,
            e:chararray,
            f:long,
            g:chararray,
            h:chararray,
            i:long,
            j:chararray,
            k:bag{(name:chararray, value:chararray)},
            l:chararray,
            m:chararray );

var_a_filtered = FILTER var_a BY (a == 'abcd');

var_a_proj = FOREACH var_a_filtered GENERATE
            a,
            b,
            c,
            d;

STORE var_a_proj INTO 'hbaseTableName'
    USING MyStoreUDF('-hbaseTableName1 hbaseTableName1 -hbaseTableName2 hbaseTableName2');

Thank you!

 

Alina GHERMAN

 

GHERMAN Alina

Re: Execute a pig job on all nodes

Champion Alumni

I found out why this was happening.

 

Since I was on a DEV cluster, I stopped and started the services every day.

Also, the data in the table I was writing to ended up on a single machine from time to time (due to services failing, stops and starts, etc.).

 

After I balanced the HBase table, the script was distributed across the nodes.
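(Roughly, the rebalancing can be triggered from the HBase shell as sketched below; note that if the table has only one region, the balancer cannot spread it, and the table would have to be split or pre-split first.)

hbase shell
balance_switch true    # make sure the balancer is enabled (returns the previous state)
balancer               # ask the HBase master to rebalance regions across the region servers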

GHERMAN Alina