
Trouble with classes while using s3a filesystem in oozie coordinator input dataset


For our Hadoop clusters (versions 2.6.0 and 2.7.1) we use S3 buckets as the source of a number of input files. We have successfully created an Oozie workflow with a distcp action that copies data from the bucket to where we want it on the cluster. We chose the s3a filesystem for this because we expect large files.

Now we want to schedule this workflow, using a location on the bucket as the URI for the input dataset. Below is the dataset definition (shown truncated):

		<dataset name="s3bucket" initial-instance="${coorStartTime}" timezone="Europe/Amsterdam" frequency="15">
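The element above is shown truncated; a complete dataset declaration normally also carries a uri-template and, optionally, a done-flag. A sketch of what the full element could look like — the bucket path and flag name here are hypothetical placeholders, not values from the original post:

```xml
<dataset name="s3bucket" frequency="15" initial-instance="${coorStartTime}"
         timezone="Europe/Amsterdam">
    <!-- hypothetical values; the original element is truncated -->
    <uri-template>s3a://my-favourite-s3-bucket/${YEAR}${MONTH}${DAY}</uri-template>
    <done-flag>_SUCCESS</done-flag>
</dataset>
```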

Since this is a coordinator dataset and no Oozie workflow action has started yet, the hadoop-aws-&lt;version&gt;.jar in the sharelib is not picked up. As a result we get an error saying that the class org.apache.hadoop.fs.s3a.S3AFileSystem is missing.

I tried adding the hadoop-aws jar to &lt;path-to-oozie&gt;/libext and then running

./bin/oozie-setup.sh prepare-war

That gave us an error when starting the Oozie service back up (the same on both clusters). Directly placing the jar in /libserver produced the same error.

Note: without the added hadoop-aws jar, this part of the oozie.log instead lists all the sharelib jars it can find.

2016-01-27 12:33:29,948 ERROR ShareLibService:540 - SERVER[host] USER[-] GROUP[-] org.apache.oozie.service.ServiceException: E0104: Could not fully initialize service [org.apache.oozie.service.ShareLibService], Not able to cache sharelib. An Admin needs to install the sharelib with and issue the 'oozie admin' CLI command to update the sharelib
org.apache.oozie.service.ServiceException: E0104: Could not fully initialize service [org.apache.oozie.service.ShareLibService], Not able to cache sharelib. An Admin needs to install the sharelib with and issue the 'oozie admin' CLI command to update the sharelib
	at org.apache.oozie.service.ShareLibService.init(...)
	at org.apache.oozie.service.Services.setServiceInternal(...)
	at org.apache.oozie.service.Services.setService(...)
	at org.apache.oozie.service.Services.loadServices(...)
	at org.apache.oozie.service.Services.init(...)
	at org.apache.oozie.servlet.ServicesLoader.contextInitialized(...)
	... (Tomcat/Catalina startup frames omitted)
Caused by: java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
	at java.util.ServiceLoader.access$100(...)
	at java.util.ServiceLoader$...
	at org.apache.hadoop.fs.FileSystem.loadFileSystems(...)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(...)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(...)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(...)
	at org.apache.hadoop.fs.FileSystem$Cache.get(...)
	at org.apache.hadoop.fs.FileSystem.get(...)
	at org.apache.oozie.service.ShareLibService.init(...)
	... 29 more
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/services/s3/AmazonS3
	at java.lang.Class.getDeclaredConstructors0(Native Method)
	at java.lang.Class.privateGetDeclaredConstructors(...)
	at java.lang.Class.getConstructor0(...)
	at java.lang.Class.newInstance(...)
	at java.util.ServiceLoader$...
	... 39 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.s3.AmazonS3
	at org.apache.catalina.loader.WebappClassLoader.loadClass(...)
	... 44 more
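One hint in the trace is the final NoClassDefFoundError: hadoop-aws declares org.apache.hadoop.fs.s3a.S3AFileSystem as a FileSystem service provider, but that class depends on the AWS SDK (com.amazonaws.services.s3.AmazonS3), which ships in a separate jar. This suggests the libext procedure needs the SDK jar alongside hadoop-aws. A hedged sketch, where all paths are assumptions for illustration and must be adjusted to the actual install:

```shell
# Assumption: OOZIE_HOME points at the Oozie server install, and the matching
# Hadoop distribution ships both jars under tools/lib; adjust paths as needed.
cp /usr/lib/hadoop/share/hadoop/tools/lib/hadoop-aws-*.jar   "$OOZIE_HOME/libext/"
# hadoop-aws alone is not enough: S3AFileSystem also needs the AWS SDK classes
cp /usr/lib/hadoop/share/hadoop/tools/lib/aws-java-sdk-*.jar "$OOZIE_HOME/libext/"

# Rebuild the Oozie web app so the new jars land on the server classpath,
# then restart the Oozie server.
"$OOZIE_HOME/bin/oozie-setup.sh" prepare-war
```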

Other classes seem to be missing as well. The error mentions the sharelib, but I changed nothing there, and it works fine with the workflow. My questions are: why did adding a library to Oozie's libext upset the sharelib? And above all, how can I get the coordinator to use the s3a filesystem class when polling the bucket?

Edited after comment by Artem Ervits

The workflow I am talking about is a regular workflow-app that uses a distcp action to move .txt files from S3 to a folder on the cluster.

<workflow-app xmlns="uri:oozie:workflow:0.5" name="100-load-full">

	<start to="start-distcp"/>

	<action name="start-distcp">
		<distcp xmlns="uri:oozie:distcp-action:0.2">
			<prepare>
				<mkdir path="${nameNode}${dropzoneMgm}/${loadDate}"/>
			</prepare>
			<!-- job-tracker/name-node and distcp arguments omitted -->
		</distcp>
		<ok to="create-success"/>
		<error to="fail"/>
	</action>

	<action name="fail">
		<email xmlns="uri:oozie:email-action:0.1">
			<!-- to/subject omitted -->
			<body>The workflow ended with error(s) at table ${tableName}</body>
		</email>
		<ok to="fail-end"/>
		<error to="fail-end"/>
	</action>

	<action name="create-success">
		<fs>
			<touchz path="${nameNode}${dropzoneMgm}/${loadDate}/_SUCCESS"/>
		</fs>
		<ok to="end"/>
		<error to="fail"/>
	</action>

	<kill name="fail-end">
		<message>job failed</message>
	</kill>

	<end name="end"/>

</workflow-app>

Starting this workflow via a coordinator that polls for a dataset works when the URI is on the cluster itself (an hdfs URI).

For example, using the dataset (shown truncated):

<dataset name="s3bucket" initial-instance="${coorStartTime}" timezone="Europe/Amsterdam" frequency="15">

This coordinator-workflow combination does what we expect: it copies everything in s3a://my-favourite-s3-bucket/ to ${nameNode}${dropzoneMgm}/${loadDate} if _TEST exists in ${nameNode}/dropzone.

When I change the dataset URI to a path on the bucket, it goes wrong.
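For context, the failing setup would look roughly like the sketch below. The uri-template, done-flag, and app-path values are hypothetical placeholders, since the original elements are shown truncated in this post:

```xml
<coordinator-app name="s3-poll-coord" frequency="15" start="${coorStartTime}"
                 end="${coorEndTime}" timezone="Europe/Amsterdam"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="s3bucket" frequency="15" initial-instance="${coorStartTime}"
                 timezone="Europe/Amsterdam">
            <!-- hypothetical: polling a bucket path triggers the S3AFileSystem lookup -->
            <uri-template>s3a://my-favourite-s3-bucket/${YEAR}${MONTH}${DAY}</uri-template>
            <done-flag>_TEST</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="s3bucket">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

It is this dataset polling, done by the Oozie server itself before any workflow action launches, that needs the s3a classes on the server classpath rather than in the sharelib.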

By the way, we did set the property oozie.service.HadoopAccessorService.supported.filesystems to hdfs,s3a in oozie-site.xml (the default is hdfs only).
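For completeness, that setting looks like this in oozie-site.xml:

```xml
<property>
    <name>oozie.service.HadoopAccessorService.supported.filesystems</name>
    <value>hdfs,s3a</value>
</property>
```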



@Esther Schreuders, if I understand your problem, you are modifying the coordinator workflow. I usually leave the coordinator alone and first make sure the regular workflow works. Once it works, I place the whole workflow directory within the coordinator directory, modify the schedule, and launch. Try a sample coordinator before complicating your workflow with S3, and confirm that works. Then try my suggested approach.


Thanks for the reply. I had already tested that the workflow and coordinator work without the S3 dataset; I have added some information about them to the question.


Have you looked at Falcon? It has S3 support and works on top of Oozie. @bsaini @Esther Schreuders