

Getting Oozie Coordinator datasets working with S3 (after a lost afternoon)


(There's no question in here: after a lost afternoon I managed to get this working, and I find it useful to document it for anyone else interested in using it, or, even better, for someone to follow up with corrections to the source code...)


My scenario involves running some processes in EC2, which store data into S3 once ready (also writing a _SUCCESS file).

My Hadoop cluster, running outside AWS, should load the results of this computation into Hive.


I got distcp working early on, but I wasn't happy with having to rely on an assumption of completion
rather than using the Coordinator's own dataset done-flag to wait for job completion.


This article was the starting point.


Start by making sure that you add




to your oozie-site.xml configuration. By default only hdfs is supported.
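The property I believe is involved here is `oozie.service.HadoopAccessorService.supported.filesystems` (property name taken from the Oozie configuration defaults; the exact value list is my assumption). A sketch of the addition:

```xml
<!-- oozie-site.xml: allow Oozie to poll S3 filesystems for dataset instances -->
<property>
  <name>oozie.service.HadoopAccessorService.supported.filesystems</name>
  <value>hdfs,s3,s3n</value>
</property>
```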


My coordinator.xml has something like:


<dataset name="ix" frequency="${coord:days(1)}" initial-instance="2015-05-10T15:25Z" timezone="Europe/Zurich">
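Spelled out in full, the dataset definition looks something like the sketch below. The bucket name and path are placeholders, not my real values; the done-flag matches the _SUCCESS file that the EC2 processes write:

```xml
<dataset name="ix" frequency="${coord:days(1)}" initial-instance="2015-05-10T15:25Z" timezone="Europe/Zurich">
    <!-- placeholder bucket/path; ${YEAR}${MONTH}${DAY} are standard coordinator EL variables -->
    <uri-template>s3n://my-bucket/output/${YEAR}${MONTH}${DAY}</uri-template>
    <!-- coordinator waits until this flag file appears in the resolved directory -->
    <done-flag>_SUCCESS</done-flag>
</dataset>
```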

If you try it now, you'll see an error stating that your bucket is not a whitelisted name node server!
To me this is the first bug in the code: it doesn't recognize that S3 has no name node. To fool Oozie I add
to oozie-site.xml:
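I believe the property in question is `oozie.service.HadoopAccessorService.nameNode.whitelist` (name assumed from the error message and the Oozie docs; an empty value disables the check, and the exact format for an S3 authority is a guess on my part):

```xml
<!-- oozie-site.xml: whitelist of accepted name node authorities -->
<property>
  <name>oozie.service.HadoopAccessorService.nameNode.whitelist</name>
  <!-- leave empty to accept any name node, or list the bucket authority explicitly -->
  <value>s3n://my-bucket</value>
</property>
```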




Having solved this, the next hurdle comes in the form of an error:


java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException in the Oozie logs.


In the end this turns out to be an inconsistency between the jets3t library versions used by Hadoop and by Oozie.


You can solve this by updating the jets3t jar to a 0.9.x version.

Download the latest version from the jets3t project site.


I spent a lot of time trying to get this working via the sharelib, but never managed it. I'm not sure whether it should work...


I ended up updating the jar in Oozie's own libserver directory: back up the soft link to the 0.6 jets3t jar,
and copy in the jets3t 0.9.x version.
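On my install the steps looked roughly like this (the libserver path and the jar file names vary per distribution and jets3t release, so treat them as placeholders):

```shell
# Move aside the existing 0.6 jets3t symlink in Oozie's libserver directory
cd /usr/lib/oozie/libserver            # path varies per install
mv jets3t-0.6.1.jar jets3t-0.6.1.jar.bak

# Drop in the newer jar downloaded earlier
cp /tmp/jets3t-0.9.4.jar .

# Restart Oozie so it picks up the new jar
oozied.sh stop && oozied.sh start
```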

Restart Oozie.


Now things work!


In my case I pass ${coord:dataIn('input')} directly to Hive in a CREATE EXTERNAL TABLE ... LOCATION <___> statement. After that I do a CTAS to bring the data in.
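Sketched with placeholder table and column names (my real schema differs), the Hive side looks something like this, where ${INPUT} is a workflow parameter fed from ${coord:dataIn('input')}:

```sql
-- External table pointing at the S3 directory resolved by the coordinator
CREATE EXTERNAL TABLE staging_ix (id STRING, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${INPUT}';

-- CTAS to materialize the data into a managed table on the cluster
CREATE TABLE ix AS SELECT * FROM staging_ix;
```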

Note that the load time through Hive is of course longer than if you go through distcp...


It shouldn't take this much work to configure such a useful setup...
