
Execute a Pig job on all nodes

Champion Alumni

Hello,

 

I have a Pig job that I schedule with Oozie. This Pig job reads data from a Hive table and writes into 3 HBase tables (through a UDF).

 

The problem is that only one node in the cluster is doing any work.

 

I notice that this job has only mappers and no reducers. Is this the problem? 

 

I'm asking this because of the thread:

https://community.cloudera.com/t5/Batch-Processing-and-Workflow/Execute-Shell-script-through-oozie-j...

 

 

where @Sue said "The Oozie shell action is run as a Hadoop job with one map task and zero reduce tasks - the job runs on one arbitrary node in the cluster."

 

Is there a way to force the cluster to use all the nodes?

 

Thank you!

 

 

GHERMAN Alina
1 ACCEPTED SOLUTION

Champion Alumni

I found out why this was happening.

 

Since I was on a DEV cluster, I stopped and started the services every day.

Also, from time to time the data in the table I was writing to ended up concentrated on a single machine (due to services failing, stopping and starting, etc.).

 

After I balanced the HBase table, the job was distributed across the nodes again.
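
In practice, the per-RegionServer load can be checked and the balancer triggered from the HBase shell. A minimal sketch for a CDH 5-era HBase (these are standard shell commands, not taken from this thread):

status 'simple'        # per-RegionServer request load; one hot server suggests skewed regions
balance_switch true    # make sure the balancer is enabled (returns the previous setting)
balancer               # ask the Master to run the balancer now; returns true if it ran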

GHERMAN Alina


3 REPLIES

Expert Contributor

If you are using the Pig action, as opposed to a shell action, then your Pig job should run in a distributed fashion.

 

Here is a great blog post (recently updated for CDH 5) that shows a step-by-step example of running a Pig script as an Oozie Pig action:

https://blog.cloudera.com/blog/2013/03/how-to-use-oozie-shell-and-java-actions/
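
For comparison, here is a minimal sketch of a Pig action in a workflow.xml (the workflow name, script name, and parameter below are placeholders, not taken from this thread). Unlike a shell action, the launcher job only coordinates; the script's map tasks are scheduled across the cluster:

<workflow-app name="pig-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="pig-node"/>
    <action name="pig-node">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>my_script.pig</script>
            <!-- placeholder parameter, passed through to the Pig script -->
            <param>ZOOKEEPER_QUORUM=${zkQuorum}</param>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>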

Champion Alumni

Hello,

 

Thank you for your answer. 

 

The problem is that classic Pig scripts (with no access to Hive tables or HBase) do run in a distributed way (they have mappers and reducers).

 

However, this one is running on only one node
(in Cloudera Manager -> Hosts, all hosts have a Load Average of 0.* while one node has a load average of 9.*).

 

Since you say that normally, even if only mappers are created, the script should still run in a distributed way, I will post an anonymized version of my script.

 

SET mapreduce.fileoutputcommitter.marksuccessfuljobs false;
SET output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
SET hbase.zookeeper.quorum '${ZOOKEEPER_QUORUM}';
SET oozie.use.system.libpath true;
SET oozie.libpath '${PATH_LIB_OOZIE}';
------------------------------------------------------------


-- hcat
register 'hive-hcatalog-core-0.13.1-cdh5.3.0.jar';
register 'hive-hcatalog-core.jar';
register 'hive-hcatalog-pig-adapter-0.13.1-cdh5.3.0.jar';
register 'hive-hcatalog-pig-adapter.jar';
register 'hive-metastore-0.13.1-cdh5.3.0.jar';
register 'datanucleus-core-3.2.10.jar';
register 'datanucleus-api-jdo-3.2.6.jar';
register 'datanucleus-rdbms-3.2.9.jar';
register 'commons-dbcp-1.4.jar';
register 'commons-pool-1.5.4.jar';
register 'jdo-api-3.0.1.jar';

-- UDF
REGISTER 'MyStoreUDF-0.3.8.jar';

------------------------------------------------------------------------------------------------------------
----------------------------------------------- input data -------------------------------------------------

var_a = LOAD 'my_database.my_table' USING org.apache.hcatalog.pig.HCatLoader() AS
        (
            a:chararray,
            b:chararray,
            c:chararray,
            d:chararray,
            e:chararray,
            f:long,
            g:chararray,
            h:chararray,
            i:long,
            j:chararray,
            k:bag{(name:chararray, value:chararray)},
            l:chararray,
            m:chararray
        );

var_a_filtered = FILTER var_a BY (a == 'abcd');

var_a_proj = FOREACH var_a_filtered GENERATE
        a,
        b,
        c,
        d;

STORE var_a_proj INTO 'hbaseTableName'
    USING MyStoreUDF('-hbaseTableName1 hbaseTableName1 -hbaseTableName2 hbaseTableName2');

Thank you!

 

GHERMAN Alina
