12-25-2013 05:21 AM
We are trying to invoke an external web service from MapReduce. It works for a while, but after some time it blocks the whole M/R job. Our requirements are:
1) Within MapReduce we call the web service and, based on its result, we need to insert specific data into an HBase table
2) If we comment out the web service call, everything works
3) The web service is a key part of the business logic
4) We are calling the web service in the mapper
What are the best practices for invoking external web services in M/R jobs? Can we set a parameter so that only one data node invokes the web service? What is the solution? Any help is greatly appreciated.
12-25-2013 07:03 AM
Calling an external web-service in M/R is generally a bad idea. You may have hundreds of mappers, and most web services can't handle this many concurrent connections.
You already realized that you only want one node to connect to the web service. That's excellent.
Since it's only one node, it doesn't have to be part of the map-reduce job.
Write a small Java app that connects to the web service, gets the data it needs, and places it on HDFS.
Run the app before starting the MR job. You can run it from any node, since the data will end up on HDFS.
Then your map-reduce job can read the data from HDFS. Much more scalable and the way Hadoop is meant to work.
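A minimal sketch of such a pre-fetch app, under some assumptions: the class name WebServicePrefetch, the 10-second timeout, and the command-line arguments are all made up for illustration, and a local file stands in for HDFS so the example runs with only the JDK. On a cluster the output stream would instead come from Hadoop's FileSystem API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical pre-fetch step: call the web service once, outside MapReduce,
// and save the response so the MR job can read it as ordinary input.
class WebServicePrefetch {

    // Copies the response body of `url` to `out` and returns the byte count.
    // Timeouts make a slow service fail fast instead of hanging the workflow.
    static long fetchTo(URL url, OutputStream out, int timeoutMillis) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        long total = 0;
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: WebServicePrefetch <service-url> <output-file>");
            return;
        }
        // Assumption: a local file stands in for HDFS. With Hadoop on the
        // classpath, the stream would come from
        // FileSystem.get(conf).create(new Path(...)) instead.
        try (OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            long bytes = fetchTo(URI.create(args[0]).toURL(), out, 10_000);
            System.out.println("wrote " + bytes + " bytes");
        }
    }
}
```

Because the service is called from a single process, the number of concurrent connections no longer depends on how many mappers the job starts.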
I would recommend using Oozie to tie the jobs together. Use Java Action for the web-service connection part, and then MR action for your map-reduce job. Oozie will make it easy to skip the MR job if the web-service step failed.
12-25-2013 11:12 AM
The problem is that we are already using Oozie and had thought of doing this through a Java action. But data from the M/R job needs to be passed to this Java class. Can we accomplish this through Oozie?
That is: the MR job outputs a variable -> that variable is passed to the Java class that invokes the web service. Can I accomplish this through Oozie?
12-25-2013 05:06 PM
Mmm... passing a variable from a Java action to an MR action is easy: you use capture-output for that (http://blog.cloudera.com/blog/2013/03/how-to-use-oozie-shell-and-java-actions/)
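For reference, here is a rough sketch of the Java side of capture-output. When the action declares <capture-output/>, Oozie passes the path of a properties file in the oozie.action.output.properties system property, and the Java main class writes its key=value pairs there. The class name and the serviceStatus key are made up for illustration.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

// Hypothetical main class for an Oozie Java action that uses <capture-output/>.
class CaptureOutputExample {

    // System property through which Oozie passes the output file location.
    static final String OUTPUT_PROP = "oozie.action.output.properties";

    // Writes one key=value pair to the properties file Oozie will read back.
    static void writeOutput(String key, String value) throws IOException {
        String path = System.getProperty(OUTPUT_PROP);
        if (path == null) {
            throw new IllegalStateException(OUTPUT_PROP + " is not set");
        }
        Properties props = new Properties();
        props.setProperty(key, value);
        try (OutputStream os = new FileOutputStream(path)) {
            props.store(os, null);
        }
    }

    public static void main(String[] args) throws IOException {
        if (System.getProperty(OUTPUT_PROP) == null) {
            System.err.println("not running under Oozie; nothing written");
            return;
        }
        // "serviceStatus" is an illustrative key, not a name Oozie defines.
        writeOutput("serviceStatus", "OK");
    }
}
```

A later action in the workflow could then refer to the value with an EL expression such as ${wf:actionData('java-node')['serviceStatus']}, where java-node is a placeholder for whatever the Java action is named.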
But MR jobs don't really output variables. They are for reading lots of data and writing lots of data out. With many possible reducers, it's not even clear which output to capture. If you need MR to pass data to a Java action, I'd have MR include it in the output it's writing to HDFS and have the Java action read this output.
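A hedged sketch of what the reading side might look like. It assumes the MR job wrote plain text with tab-separated key/value lines into part-* files, one per reducer, and it reads from a local directory using only the JDK so it can run anywhere. On a cluster the same loop would list the files through Hadoop's FileSystem API (for example globStatus on a part-* pattern) and open each one from HDFS. The class name MrOutputReader is made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for a Java action that consumes the output of an MR job.
class MrOutputReader {

    // Reads every part-* file in outputDir and collects key<TAB>value lines.
    // Marker files such as _SUCCESS do not match the glob and are ignored.
    static Map<String, String> readAll(Path outputDir) throws IOException {
        Map<String, String> result = new TreeMap<>();
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    int tab = line.indexOf('\t');
                    if (tab < 0) {
                        continue; // not a key/value line
                    }
                    result.put(line.substring(0, tab), line.substring(tab + 1));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: MrOutputReader <mr-output-dir>");
            return;
        }
        // Each entry could then drive one web service call from this single process.
        readAll(Paths.get(args[0])).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Reading all part files, rather than picking one, sidesteps the question of which reducer's output holds the value.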
But I'd also double-check if you really need MR for whatever you are doing that generates this variable.