Created 01-01-2014 10:29 AM
I would like to write MapReduce code, preferably in Python, on my Apple Mac and stream it to the QuickStart VM.
Ideally my development setup would be the Python environment on my Mac plus the QuickStart VM (later to be expanded to a cluster).
While there are many descriptions of how to connect or stream code from within a node of the Hadoop cluster or sandbox (e.g. from the NameNode), I am unclear on what to do to connect purely as a client.
E.g. I assume I need to install some Hadoop client libraries on OS X to talk to the sandbox HDFS? Where do I get these libraries from?
How do I install them?
Which Python package works best?
What IP address should I use to stream my python code?
Any help – and any link to a tutorial covering this – would be great!
Created 01-02-2014 09:40 AM
Created 01-02-2014 09:57 AM
I do not intend to install Hadoop on OS X. I would just like to install the client libraries that, as I understand it, certain packages need.
My idea is to write, test and debug the code on the Mac, and then execute it on the VM, ideally launching it from OS X.
As for Python, I mean libraries like mrjob, pydoop and similar.
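For the "write, test and debug on the Mac" part, here is a minimal sketch of a word count written in the Hadoop Streaming style (tab-separated key/value records) that runs locally with no Hadoop install at all. The function names and the word-count task are only illustrative assumptions, not something taken from this thread; `run_locally` just imitates the sort that Hadoop would do between the map and reduce phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reducer(sorted_records):
    """Sum the counts per key; input must already be sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_locally(lines):
    """Imitate map -> sort -> reduce in one process, for debugging only."""
    return list(reducer(sorted(mapper(lines))))


print(run_locally(["a b a", "B c"]))
```

Once the logic behaves as expected locally, the same two functions could be wired to stdin/stdout in separate mapper and reducer scripts for a real streaming job on the VM.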
Created 01-02-2014 11:02 AM
I believe Manikumar is correct and you will have to install Hadoop on your Mac in order to execute client application code that connects to the QuickStart VM. If you prefer not to do this, you could easily create a second CentOS virtual machine and add it to your QuickStart VM as a gateway machine, which will set up all the environment properties needed to execute your code from there.
Created on 01-03-2014 02:23 AM - edited 01-03-2014 02:26 AM
Why would it be necessary to install *all* of Hadoop on the client?
My understanding is that installing the client files is all I need on the client side (e.g. my Apple Mac)?
For example, Cloudera Manager provides client files, I assume for exactly this use case?
Created 01-03-2014 08:16 AM
Created 01-08-2014 11:32 AM
This is helpful. To be clear, I have no issue with installing client software on the Mac. I just do not want to use it as a cluster node.
If I understand the Cloudera terminology, it would be just a Gateway. I'm looking at WebHDFS (thanks for pointing me to it!) and it looks great. The only issue so far is that it seems a good tool to do "things" in the cluster (like creating directories, files etc.), but I haven't seen any example of using WebHDFS to launch a .jar file with code...
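To make the "create directories" case concrete, here is a rough sketch of calling the WebHDFS REST API from plain Python, with no Hadoop libraries on the client. The hostname `quickstart.cloudera`, the user `cloudera` and the NameNode HTTP port 50070 are assumptions about a typical QuickStart VM setup; the helper names are made up for this example.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def webhdfs_url(host, path, op, user, port=50070, **params):
    """Build a WebHDFS REST URL; host, port and user are assumptions."""
    query = urlencode([("op", op), ("user.name", user)] + sorted(params.items()))
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, quote(path), query)


def mkdirs(host, path, user):
    """Send the MKDIRS operation (an HTTP PUT) and return the boolean result."""
    request = Request(webhdfs_url(host, path, "MKDIRS", user), method="PUT")
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["boolean"]


# Only builds and prints the URL; no network access is needed for this line.
print(webhdfs_url("quickstart.cloudera", "/user/cloudera/demo", "MKDIRS", "cloudera"))
```

Calling `mkdirs("quickstart.cloudera", "/user/cloudera/demo", "cloudera")` would need the VM to be reachable from the Mac under that name. WebHDFS only covers file system operations, so submitting a job would need a different route.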
I've also started to research mrjob, which looks quite promising.
I wonder if anybody has used mrjob from a Gateway-type (client-type) node with Cloudera...