Created on 08-13-2013 06:55 AM - edited 08-13-2013 07:04 AM
My input path is in s3n:
s3n://xxx-xxx/20130813/08
My Oozie configuration shows it as:
hdfs://xxx.internal:8020/s3n://xxx-xxx/20130813/08
Created 08-13-2013 07:06 AM
The link hdfs://xxx.internal:8020/s3n://xxx-xxx/20130813/08 requires a login.
Created 08-13-2013 07:33 AM
Sorry, to clarify: I'm running the Oozie job through Hue in Cloudera Manager, and I can access HDFS fine. My question is how to connect to the other Amazon location as s3n://xxx.
Created 08-13-2013 09:19 AM
The input path is required to be on HDFS, not S3. S3 is not the same as HDFS.
Created 08-18-2013 01:39 PM
@dvohra wrote: The input path is required to be on HDFS, not S3. S3 is not the same as HDFS.
This isn't true. Depending on what you're doing with Oozie, S3 is supported just fine as an input or output location.
Created 08-19-2013 03:13 PM
This isn't true. Depending on what you're doing with Oozie, S3 is supported just fine as an input or output location.
Doesn't the coordinator expect the input path to be on HDFS, since hdfs://{nameNode} is prepended automatically? And isn't the workflow.xml required to be on HDFS?
Created 08-28-2013 04:09 AM
@dvohra wrote: This isn't true. Depending on what you're doing with Oozie, S3 is supported just fine as an input or output location.
Doesn't the coordinator expect the input path to be on HDFS, since hdfs://{nameNode} is prepended automatically? And isn't the workflow.xml required to be on HDFS?
Yes, unfortunately coordinators currently poll inputs over HDFS alone, which is a limitation. However, writing simple WF actions that work over S3 is still possible.
Yes, WFs should reside on HDFS, as Oozie views it as its central DFS, similar to how MR requires a proper DFS to run. But this shouldn't impair simple I/O operations done over an external FS such as S3.
I think Romain has covered the relevant JIRAs for tracking removal of this limitation.
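To illustrate that last point, here is a minimal sketch of a map-reduce action that reads its input directly from S3 and writes its output to HDFS. The bucket, paths, and action name below are made-up placeholders, the mapper/reducer properties and the end/fail nodes are omitted, and the S3 credentials (fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey) would still have to be available to the job via the action configuration or core-site.xml:

<action name="mr-over-s3">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Input read straight from S3; no ${nameNode} prepended -->
            <property>
                <name>mapred.input.dir</name>
                <value>s3n://example-bucket/20130813/08</value>
            </property>
            <!-- Output written to the cluster's HDFS -->
            <property>
                <name>mapred.output.dir</name>
                <value>${nameNode}/user/example/output/20130813/08</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>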
Created 01-20-2014 01:54 PM
Thank you. However, I think my question is a little different from what that addresses. I am trying to specify an input or output directory in the form "s3://..." in an Oozie workflow itself (as an input to a Hadoop MapReduce job). Do you know if this should work? I get an error that says the path can't have "s3" in it.
Created 08-18-2013 01:41 PM
@Ashok wrote: My input path is in s3n:
s3n://xxx-xxx/20130813/08
My Oozie configuration shows it as:
hdfs://xxx.internal:8020/s3n://xxx-xxx/20130813/08
Can you share your workflow.xml for us to validate?
If you're passing an S3 input or output path, simply ensure your workflow does not template it as ${nameNode}/${input} or similar; otherwise you're prepending an HDFS URI to a path that is already a URI. This is most likely your issue.
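For example, the difference would look roughly like this in the action configuration (the property name and the input value are illustrative; substitute whatever your action actually uses):

<!-- Problematic: an HDFS URI gets prepended to a value that is already a URI -->
<property>
    <name>mapred.input.dir</name>
    <value>${nameNode}/${input}</value>
</property>

<!-- Fine: the s3n:// URI is passed through untouched -->
<property>
    <name>mapred.input.dir</name>
    <value>${input}</value>
</property>

with something like input=s3n://example-bucket/20130813/08 set in the job properties.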
Created 08-18-2013 11:14 PM
In coordinator jobs I'm passing the dataset URI template as
s3n://xxx-xxx/${YEAR}${MONTH}${DAY}/${HOUR}
and the coord:dataOut property as
<property>
<name>in_folder</name>
<value>${coord:dataOut('in_folder')}</value>
</property>
and my workflow.xml input as
${in_folder}
When I submit the coordinator job, it automatically prepends the configuration like:
${nameNode}s3n://xxx-xxx/${YEAR}${MONTH}${DAY}/${HOUR}
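For completeness, putting those pieces together the coordinator looks roughly like this (the app name, frequency, start/end, schema version, and workflow app path are placeholders; the dataset, data-out, and property wiring are the parts that matter here):

<coordinator-app name="s3-input-coord" frequency="${coord:hours(1)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <datasets>
        <dataset name="in_folder" frequency="${coord:hours(1)}"
                 initial-instance="${start}" timezone="UTC">
            <uri-template>s3n://xxx-xxx/${YEAR}${MONTH}${DAY}/${HOUR}</uri-template>
        </dataset>
    </datasets>
    <output-events>
        <data-out name="in_folder" dataset="in_folder">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
    <action>
        <workflow>
            <app-path>${workflowAppUri}</app-path>
            <configuration>
                <property>
                    <name>in_folder</name>
                    <value>${coord:dataOut('in_folder')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>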
Created 08-18-2013 11:57 PM
Good to know; Hue Coordinators currently prepend only HDFS to dataset paths.
Is https://issues.apache.org/jira/browse/OOZIE-426 finished?
Created 08-19-2013 01:46 AM
FWIW, the same job works fine as a workflow when submitted via Hue. In that case we manually pass the input (S3) and output (HDFS) locations and the job runs successfully, which establishes that the problem is not with S3 support. The problem is that when we let the coordinator pass this input (via a computed datasource), it automatically prepends hdfs://{nameNode} in front of the s3n://<> URI. Hope this clarifies.
Created 08-19-2013 10:28 AM
Ok this clarifies a lot! I updated https://issues.cloudera.org/browse/HUE-1501.
Created on 08-19-2013 09:21 PM - edited 08-19-2013 09:23 PM
Thanks. Is this considered a bug? If so, what workarounds can we follow for now? Any help is appreciated.