Created on 11-04-2015 11:29 PM - edited 09-16-2022 02:47 AM
Hello,
I have a Pig job that I schedule with Oozie. This Pig job reads data from a Hive table and writes into 3 HBase tables (via a UDF).
The problem is that only one node is doing any work.
I noticed that this job has only mappers and no reducers. Is this the problem?
I'm asking this because of the thread where @Sue said: "The Oozie shell action is run as a Hadoop job with one map task and zero reduce tasks - the job runs on one arbitrary node in the cluster."
Is there a way to force the cluster to use all the nodes?
Thank you!
Created 11-05-2015 08:15 AM
If you are using the Pig action, as opposed to a shell action, then your Pig job should run in a distributed fashion.
Here is a great blog post (recently updated for CDH 5) that shows a step-by-step example of running a Pig script as an Oozie Pig action:
https://blog.cloudera.com/blog/2013/03/how-to-use-oozie-shell-and-java-actions/
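In case it helps, here is a minimal sketch of what a Pig action can look like in the Oozie workflow.xml (the action name, script name and parameter below are placeholders, not taken from your job):

    <action name="pig-job">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>my_script.pig</script>
            <!-- pass values the script expects, e.g. the ZooKeeper quorum -->
            <param>ZOOKEEPER_QUORUM=${ZOOKEEPER_QUORUM}</param>
        </pig>
        <ok to="end"/>
        <error to="kill"/>
    </action>

With a Pig action, the single-map job you see in the job browser is only the Oozie launcher; the Pig script itself is submitted as its own MapReduce job(s), and those tasks run wherever the input splits are, so they should spread across the cluster.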
Created 11-05-2015 09:43 AM
Hello,
Thank you for your answer.
The problem is that classic Pig scripts (with no access to Hive tables or HBase) run in a distributed way (they have mappers and reducers).
However, this one runs on only one node
(in Cloudera Manager -> Hosts, all nodes show a load average of 0.x while one node is at 9.x).
Since you say that the script should normally run in a distributed way even if only mappers are created, here is an anonymised version of my script:
SET mapreduce.fileoutputcommitter.marksuccessfuljobs false;
SET output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
SET hbase.zookeeper.quorum '${ZOOKEEPER_QUORUM}';
SET oozie.use.system.libpath true;
SET oozie.libpath '${PATH_LIB_OOZIE}';

-- hcat
register 'hive-hcatalog-core-0.13.1-cdh5.3.0.jar';
register 'hive-hcatalog-core.jar';
register 'hive-hcatalog-pig-adapter-0.13.1-cdh5.3.0.jar';
register 'hive-hcatalog-pig-adapter.jar';
register 'hive-metastore-0.13.1-cdh5.3.0.jar';
register 'datanucleus-core-3.2.10.jar';
register 'datanucleus-api-jdo-3.2.6.jar';
register 'datanucleus-rdbms-3.2.9.jar';
register 'commons-dbcp-1.4.jar';
register 'commons-pool-1.5.4.jar';
register 'jdo-api-3.0.1.jar';

-- UDF
REGISTER 'MyStoreUDF-0.3.8.jar';

-- input data
var_a = LOAD 'my_database.my_table' USING org.apache.hcatalog.pig.HCatLoader() AS (
    a:chararray,
    b:chararray,
    c:chararray,
    d:chararray,
    e:chararray,
    f:long,
    g:chararray,
    h:chararray,
    i:long,
    j:chararray,
    k:bag{(name:chararray, value:chararray)},
    l:chararray,
    m:chararray
);

var_a_filtered = FILTER var_a BY (a == 'abcd');

var_a_proj = FOREACH var_a_filtered GENERATE a, b, c, d;

STORE var_a_proj INTO 'hbaseTableName' USING MyStoreUDF('-hbaseTableName1 hbaseTableName1 -hbaseTableName2 hbaseTableName2');
Thank you!
Alina GHERMAN
Created 02-05-2016 04:49 AM
I found out why this was happening.
Since this was a DEV cluster, I stopped and started the services every day.
Because of that, the regions of the HBase table I was writing to would end up on a single machine from time to time (due to services failing, stops and starts, etc.).
After I balanced the HBase table, the work of the script was distributed across the nodes.
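For anyone who hits the same symptom: when all the regions of the table you write to sit on a single RegionServer, every write goes to that one node, no matter how many mappers the job has. A quick way to check and rebalance from the HBase shell (just a sketch, assuming the default balancer is in use):

    # show each RegionServer and the regions it currently hosts
    status 'detailed'

    # make sure the balancer is enabled, then ask the master to run it
    balance_switch true
    balancer

Once the regions are spread out again, the writes (and the load average) should be distributed across the nodes.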