Why does Oozie pick up scripts and libraries from HDFS?

Hello,

My company and I are new users of Hortonworks. We plan to use Oozie as our scheduler, but most of us find it strange to put scripts and libraries into HDFS. Isn't the main role of HDFS to store data, and only data?

Since Oozie clients can be installed on several nodes of the cluster, I think it makes sense to keep these libraries somewhere accessible to all nodes. So HDFS is in fact the best place (and the files are replicated).

Can someone tell me why Oozie made this choice?

Gratefully,

Mathieu

1 ACCEPTED SOLUTION



1. Oozie has its own database where it maintains the coordinator information (time to trigger, workflow path, lib path, etc.) and the bundle ID.
2. When the time arrives for the coordinator to run, the Oozie server spawns a map-only job, known as the launcher job. (This map-only job also has its own application master.)
3. The launcher job starts the job named in the workflow action, so it needs access to the libraries required to spawn that job (MapReduce, Spark, Java, shell).
4. Because it is a mapper, the launcher job can be scheduled on any node in the cluster.
5. HDFS is the most suitable place, because a launcher running anywhere in the cluster can access it.
6. It also gives you the flexibility to update your jar in HDFS without having to inform Oozie about the change.
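To make the points above concrete, here is a rough sketch of how a workflow application is usually laid out in HDFS so that a launcher on any node can find it. It assumes the common convention of a `workflow.xml` at the application root with jars in a `lib/` subdirectory and scripts alongside the workflow. The function name `plan_oozie_app`, the paths, and the host name are made up for illustration; the property names `oozie.wf.application.path` and `oozie.use.system.libpath` are standard Oozie job properties.

```python
from pathlib import PurePosixPath


def plan_oozie_app(name_node, app_dir, jars, scripts):
    """Hypothetical helper: plan where files go in HDFS for one workflow app.

    Returns (uploads, properties):
      uploads    - dict mapping local file name -> HDFS destination path
      properties - text of a job.properties file pointing Oozie at the app
    """
    root = PurePosixPath(app_dir)

    # The workflow definition sits at the root of the application directory.
    uploads = {"workflow.xml": str(root / "workflow.xml")}

    # Jars go under lib/, which Oozie adds to the action's classpath.
    for jar in jars:
        uploads[jar] = str(root / "lib" / jar)

    # Scripts sit next to workflow.xml (assumed layout; a shell action
    # would reference them with a <file> element in the workflow).
    for script in scripts:
        uploads[script] = str(root / script)

    properties = "\n".join([
        f"nameNode={name_node}",
        f"oozie.wf.application.path=${{nameNode}}{root}",
        "oozie.use.system.libpath=true",
    ])
    return uploads, properties


if __name__ == "__main__":
    uploads, properties = plan_oozie_app(
        "hdfs://namenode:8020",      # assumed NameNode address
        "/user/demo/apps/etl",       # assumed application directory
        jars=["etl.jar"],
        scripts=["run.sh"],
    )
    for local, remote in uploads.items():
        print(f"{local} -> {remote}")
    print(properties)
```

Because everything the launcher needs lives under a single HDFS directory, replacing `lib/etl.jar` there is enough for the next run to pick up the new version, which is point 6 above.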


2 REPLIES



Thanks a lot for your response, @kgautam.

That's what I thought, but now I have arguments to convince my colleagues.