Created on 05-28-2015 04:41 PM - edited 09-16-2022 02:30 AM
I'm using a Java MapReduce job to write data to a directory that will be interpreted as a Hive table in RCFile format.
In order to do this, I need to reference the org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable class,
which can be found in hive-serde-0.13.1-cdh5.3.3.jar. So far, so good.
I've included the jar in my command line like this:
/usr/bin/hadoop jar /path/lib/awesome-mapred-0.9.6.jar com.awesome.HiveLoadController -libjars /path/lib/postgresql-8.4-702.jdbc4.jar,/path/lib/hive-serde-0.13.1-cdh5.3.3.jar
I know for certain that it is loading the PostgreSQL library, because it prints correctly retrieved information before it throws the error.
I know that it is grabbing and transferring that jar file because it throws a fit if I move it from the /path/lib directory.
I know that the object exists in the jar because I've unpacked it and looked.
Is there something in the rest of the lib path that might be interfering with it finding that class in the jar?
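For reference, -libjars is only honored when the driver parses generic options through ToolRunner/GenericOptionsParser; here is a minimal sketch of such a controller (the job wiring and names are hypothetical, not my actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HiveLoadController extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() carries the -libjars and -D options that ToolRunner parsed.
        Job job = Job.getInstance(getConf(), "rcfile-load");
        job.setJarByClass(HiveLoadController.class);
        // ... mapper, reducer, input/output paths and formats go here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options before handing args to run().
        System.exit(ToolRunner.run(new Configuration(), new HiveLoadController(), args));
    }
}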
Created on 07-06-2015 03:36 PM - edited 07-06-2015 03:40 PM
This error was thrown during the execution of the job controller within the MapReduce job. Here's a similar one with the same root problem.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/io/orc/OrcNewOutputFormat
at com.who.bgt.logloader.schema.OrcFileLoader.run(OrcFileLoader.java:94)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.who.bgt.logloader.schema.OrcFileLoader.main(OrcFileLoader.java:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 8 more
The specific line it is complaining about is here:
job.setOutputFormatClass(OrcNewOutputFormat.class);
The obvious problem is that it's failing to find the OrcNewOutputFormat class definition, which lives in hive-exec-0.13.1-cdh5.3.5.jar.
I pushed the jar to hdfs://lib/hive-exec..., and within my main function, I call the following before I run the job:
DistributedCache.addFileToClassPath(new Path("/lib/hive-exec-0.13.1-cdh5.3.5.jar"), lConfig);
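For context, that call sits in the driver just before job setup, roughly like this (a sketch; only the lConfig name and the jar path are from my actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat;
import org.apache.hadoop.mapreduce.Job;

Configuration lConfig = new Configuration();
// Registers the HDFS-hosted jar so it is shipped to the task classpaths at submit time.
DistributedCache.addFileToClassPath(new Path("/lib/hive-exec-0.13.1-cdh5.3.5.jar"), lConfig);
Job job = Job.getInstance(lConfig, "orc-file-load");
// The class reference below is the line the NoClassDefFoundError points at:
job.setOutputFormatClass(OrcNewOutputFormat.class);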
Can you be more explicit on how I go about making sure my distributed-cache configs work?
Optimally, I shouldn't have to stuff this one in the distributed cache since it sits in /opt/cloudera/parcels/CDH-5.3.5-1.cdh5.3.5.p0.4/jars/hive-exec-0.13.1-cdh5.3.5.jar on all of my slave nodes, but I also can't figure out how to tell MapReduce to look there.
Created 07-06-2015 10:15 PM
Thank you for the additional details!
> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
This indicates a problem on the driver end, or, as you say, 'during the execution of the job controller'.
The issue is that even though you add the jar to the MR distributed cache classpath, your driver class also references the same class locally, and the act of adding a jar to the distributed tasks' classpath does not also add it to the local JVM's classpath.
Here's how you can ensure the class is also available locally, if you use 'hadoop jar' to execute your job:
~> export HADOOP_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar
~> hadoop jar your-app.jar your.main.Class [arguments]
This adds the jar to your local JVM classpath as well, while your code continues to add it to the remote execution classpaths.
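In driver code, the remote-side half can equally be done through the Job API rather than the older DistributedCache calls; a sketch, assuming the jar has already been uploaded to that HDFS path:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(getConf(), "orc-file-load");
// Ships the HDFS-hosted jar to every task and adds it to the task classpath.
job.addFileToClassPath(new Path("/lib/hive-exec-0.13.1-cdh5.3.5.jar"));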
> Optimally, I shouldn't have to stuff this one in the distributed cache since it sits in /opt/cloudera/parcels/CDH-5.3.5-1.cdh5.3.5.p0.4/jars/hive-exec-0.13.1-cdh5.3.5.jar on all of my slave nodes, but I also can't figure out how to tell MapReduce to look there.
The MR remote execution classpath is governed by the classpath entries defined in mapred-site.xml and yarn-site.xml, plus the additional elements you add to the DistributedCache. Tasks do not use the entire /opt/cloudera/parcels/CDH/jars/* path; this is for isolation and flexibility, as that area may carry multiple versions of the same dependencies.
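If you want to inspect what your cluster actually puts on that remote classpath, you can print the governing properties from any client with the cluster configs loaded (these are the standard Hadoop 2 property keys):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Classpath handed to MR tasks, as defined in mapred-site.xml:
System.out.println(conf.get("mapreduce.application.classpath"));
// Classpath handed to YARN containers, as defined in yarn-site.xml:
System.out.println(conf.get("yarn.application.classpath"));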
Does this help?
Created 09-07-2017 11:24 PM
Hi,
I'm facing a similar issue with RCFileInputFormat.
I'm executing a simple job that reads from an RCFile in the mapper (using RCFileInputFormat) and does an aggregation on the reducer side.
I am able to compile the code, but at runtime it fails with a ClassNotFoundException for the class org.apache.hadoop.hive.ql.io.RCFileInputFormat.
I tried adding the jar to the Hadoop classpath, but no luck.
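For reference, the input format is wired in through the old mapred API, roughly like this (a simplified sketch, not the exact code):

import org.apache.hadoop.hive.ql.io.RCFileInputFormat;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MRJobRCFile.class);
// Tasks re-resolve this class name from the job config on the cluster side,
// so the jar containing RCFileInputFormat must be on the task classpath too.
conf.setInputFormat(RCFileInputFormat.class);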
Below are the command and the stack trace.
--> hadoop jar MRJobRCFile.jar MRJobRCFile /apps/hive/warehouse/7360_0609_rx/day=06-09-2017/hour=13/quarter=2/ /test_9
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.ql.io.RCFileInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1649)
at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:620)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.ql.io.RCFileInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1617)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java
Should I investigate the job configuration (jobconf.xml)? If so, what do I need to check?
Created on 09-08-2017 01:16 AM - edited 09-08-2017 01:17 AM
I was able to make the job run by adding the hive-exec jar to HADOOP_CLASSPATH as well as adding the jar to the distributed cache.
Can you shed some light on why we need to both export the jar to the classpath and add it to the distributed cache?