Explorer
Posts: 23
Registered: 02-07-2015

Getting Oozie Coordinator datasets working with S3 (after a lost afternoon)

(There's no question in here: after a lost afternoon I managed to get this working, and I find it useful to document it here for anyone else interested in the same setup, or better yet, for anyone who wants to follow up with some corrections to the source code...)

 

My scenario involves running some processes in EC2 that store their results in S3 when ready (also writing a _SUCCESS file).

My Hadoop cluster, which runs outside AWS, should then load the results of that computation into Hive.

 

I got distcp working early on, but I wasn't happy relying on an assumption of completion rather than using the Coordinator's own dataset done-flag to wait for the job to finish.

 

The JIRA issue https://issues.apache.org/jira/browse/OOZIE-426 was my starting point.

 

Start by making sure that you add

 

<property>
    <name>oozie.service.HadoopAccessorService.supported.filesystems</name>
    <value>*</value>
</property>

 

to your oozie-site.xml configuration. By default only HDFS is supported.
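
If you'd rather not open Oozie up to every filesystem with the wildcard, this property also accepts an explicit comma-separated list of schemes. A more restrictive variant (a sketch, assuming you only need HDFS and s3n) would be:

<property>
    <name>oozie.service.HadoopAccessorService.supported.filesystems</name>
    <value>hdfs,s3n</value>
</property>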

 

My coordinator.xml has something like:

 

<datasets>
    <dataset name="ix" frequency="${coord:days(1)}" initial-instance="2015-05-10T15:25Z" timezone="Europe/Zurich">
        <uri-template>s3n://mybucket/a/b/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
</datasets>
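
For completeness, this dataset is then consumed through an input event in the same coordinator, roughly like the sketch below (the name "input" matches the ${coord:dataIn('input')} I use later; the coord:current(0) instance is illustrative):

<input-events>
    <data-in name="input" dataset="ix">
        <instance>${coord:current(0)}</instance>
    </data-in>
</input-events>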


If you try it now, you'll see an error stating that your bucket is not a whitelisted name node server! To me this is the first bug in the code: Oozie doesn't recognize that S3 URIs have no name node. To fool Oozie, I added this to oozie-site.xml:

 

<property>
    <name>oozie.service.HadoopAccessorService.nameNode.whitelist</name>
    <value>__your_name_node_server_here_,mybucket</value>
</property>

 

Having solved this, the next hurdle comes in the form of an error:

 

java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException in Oozie's logs.

 

This turns out to be an inconsistency between the jets3t library versions shipped with Hadoop and with Oozie.

 

You can solve this by updating the jets3t jar to something like 0.9.x.

Download the latest version from http://jets3t.s3.amazonaws.com/downloads.html.

 

I spent a lot of time trying to get this working through the sharelib, but never managed to. I'm not sure whether it's supposed to work...

 

I ended up updating the jar in Oozie's own libserver directory: back up the symlink to the jets3t 0.6 jar, copy the jets3t 0.9.x jar in its place, and restart Oozie.

 

Now things work!

 

In my case I pass ${coord:dataIn('input')} directly to Hive in a CREATE EXTERNAL TABLE ... LOCATION <___> statement, after which I do a CTAS to bring the data in.

Note that the load time in Hive is of course longer than if you go through distcp first....
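
To make the wiring concrete, here is roughly how the dataset instance gets handed to the workflow that runs the Hive step (a sketch only; the INPUT_DIR property name and the ${wfAppPath} variable are hypothetical, not from my actual files):

<action>
    <workflow>
        <app-path>${wfAppPath}</app-path>
        <configuration>
            <property>
                <name>INPUT_DIR</name>
                <value>${coord:dataIn('input')}</value>
            </property>
        </configuration>
    </workflow>
</action>

Inside the Hive script, INPUT_DIR then becomes the LOCATION of the external table.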

 

It shouldn't take this much work to configure such a useful setup...

Posts: 607
Kudos: 68
Solutions: 35
Registered: 04-06-2015

Re: Getting Oozie Coordinator datasets working with S3 (after a lost afternoon)

Thank you for sharing the information, balta.

 




Cy Jervis, Community Manager - I'm not an expert but will supply relevant content from time to time. :)


New Contributor
Posts: 1
Registered: 10-26-2015

Re: Getting Oozie Coordinator datasets working with S3 (after a lost afternoon)

Hi,

has there been any progress on this issue? Will the proposed changes be applied to future Cloudera or Apache versions of Oozie?

cheers
Andre

New Contributor
Posts: 3
Registered: 11-28-2015

Re: Getting Oozie Coordinator datasets working with S3 (after a lost afternoon)

Hi balta, that post was pretty helpful for me. But when I try to use EL functions for S3 in Oozie, they aren't working. Please let me know how to overcome this. For instance, when I use fs:exists(s3://somelocation), it's not working. How can I achieve this? Please help.
New Contributor
Posts: 1
Registered: 06-28-2017

Re: Getting Oozie Coordinator datasets working with S3 (after a lost afternoon)

Hi Balta,

 

This post has saved me a lot of time, thanks a lot for writing it.

 

I am facing the following error:

 

Coord Action Input Check Error: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

1) The thing is, our admin policies will not allow adding this to core-site.xml.

 

2) My dataset is as follows:

 

<datasets>
    <dataset name="input1" frequency="30" initial-instance="${DATA_SET_START_DATE}" timezone="UTC">
        <uri-template>s3n://useast1-nlsn-w-digital-enghar-dcr-crediting-dev01/dcr_outputfiles_135/dataset_triggers/eventscensus/hourly/2017/06/21/21/00/</uri-template>
        <done-flag>dataset_complete.dat</done-flag>
    </dataset>
</datasets>

 

Could you please suggest how I can pass access & secret keys in my <URI>?

 

Note: I tried s3n://accesskey:secretkey@useast1-nlsn-w-digital-enghar-dcr-crediting-dev01/dcr_outputfiles_135/dataset_triggers/eventscensus/hourly/2017/06/21/21/00/

 

but no luck

 
