
Running Python Scripts on data in HDFS

Contributor

Hi,

I am using the Sandbox on a VM within the Azure cloud. I am neither a developer nor a scientist, so excuse my ignorance.

I have loaded my CSV file into HDFS and I am currently trying to get it into a Hive table so that I can run some queries against it.

Using MapReduce, I need to run a TF-IDF algorithm on a number of entries in the table. I believe I need to write the TF-IDF algorithm in Python, but I am unsure how to go about this. Do I need to install a Python interpreter on the Azure VM, or can I write the code locally on my laptop and query the Hive table?

1 ACCEPTED SOLUTION

Contributor

You've mentioned Python to implement TF-IDF, but unless you absolutely have to use Python for some other reason, consider implementing the same algorithm in Hive SQL instead. That way, it'll run in parallel without any extra work. Take a look at the Wikipedia article on TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

Here's one sample SQL implementation of TF-IDF that you could build Hive SQL from by ignoring all the index-related stuff: https://gist.github.com/sumanthprabhu/8067221
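
For a feel of it, the core of the calculation might be sketched in Hive SQL like this (the table and column names are placeholders of mine, and it assumes you've already exploded your text into one row per (doc_id, word) occurrence):

-- assumes a table words(doc_id STRING, word STRING), one row per word occurrence
-- note: WITH clauses need Hive 0.13 or later
WITH tf AS (
  SELECT doc_id, word, COUNT(*) AS term_count
  FROM words
  GROUP BY doc_id, word
),
df AS (
  SELECT word, COUNT(DISTINCT doc_id) AS doc_count
  FROM words
  GROUP BY word
),
corpus AS (
  SELECT COUNT(DISTINCT doc_id) AS n_docs
  FROM words
)
SELECT tf.doc_id,
       tf.word,
       tf.term_count * LN(corpus.n_docs / df.doc_count) AS tf_idf
FROM tf
JOIN df ON tf.word = df.word
CROSS JOIN corpus;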


5 REPLIES


I'm surely not going to give you the best answer on this one, but "Hadoop Streaming", as described at http://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html, is a way to run a MapReduce job that executes your Python code. In this case, you'll need to have Python installed on all the cluster nodes, but since you're starting out on the Sandbox, that makes it easy (just one place!).
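
For example, here's a bare-bones word-count pipeline just to show the plumbing (the file names, HDFS paths, and streaming jar location are placeholders of mine, and the jar path varies by distribution); TF-IDF would use the same mechanics with more interesting map and reduce logic.

mapper.py:

#!/usr/bin/env python
# mapper.py -- read raw lines from stdin, emit "word<TAB>1" for each word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

reducer.py:

#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so sum counts per word as they stream by
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Then submit it to the cluster with something like:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /user/me/input.csv \
    -output /user/me/wordcount_out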

Yes, you could also run a Python app that queries Hive, but then only the query itself will run on the cluster. In that case, you'll obviously just need Python wherever you are running the app from.
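
For instance, a minimal sketch using the pyhive client library (my choice of library, not the only option; the host, port, user, and table names are placeholders):

# query_hive.py -- run a Hive query from your laptop and process the results locally
# requires: pip install 'pyhive[hive]'
from pyhive import hive

conn = hive.Connection(host="sandbox.hortonworks.com", port=10000, username="maria_dev")
cursor = conn.cursor()
cursor.execute("SELECT col1, col2 FROM my_table LIMIT 10")
for row in cursor.fetchall():
    print(row)  # the query ran on the cluster; this loop runs locally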

Master Mentor

I have an example from a book by Ofer and Casey of Hortonworks that implements TF-IDF using PySpark (Python on Spark). It needs more work, but it can put you on the right path. https://github.com/dbist/datamunging
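
To give you a feel for it, the core of TF-IDF is only a few lines with Spark's MLlib (a rough sketch of mine; the input path is a placeholder and it assumes one document per line):

# tfidf_spark.py -- minimal TF-IDF with the MLlib RDD API; run with spark-submit
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext(appName="tfidf-sketch")
# one document per line, whitespace-separated words
documents = sc.textFile("hdfs:///user/me/docs.txt").map(lambda line: line.split())

tf = HashingTF().transform(documents)  # term-frequency vectors (hashed)
tf.cache()
idf = IDF().fit(tf)                    # inverse document frequencies over the corpus
tfidf = idf.transform(tf)              # final TF-IDF vectors
print(tfidf.take(2))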

You certainly picked a pretty sophisticated use case to start learning Hadoop.


In case you haven't already created your Hive table, this will help you do so.
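
As a rough sketch, an external table over the HDFS directory holding your CSV can look like this (the table name, columns, and path are placeholders of mine):

CREATE EXTERNAL TABLE my_table (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/me/csv_dir';
-- if your CSV has a header row, add TBLPROPERTIES ('skip.header.line.count'='1')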

Assuming your cluster is running on Linux VMs, Python is already installed. I can't comment on TF-IDF itself, but the steps below should help you understand a generic approach to integrating Python functions with Hive queries.

Once you have a Hive table, running Python scripts against its records is straightforward. You'll need to SSH into your "edge" node, one that has the Hive CLI installed. To start the CLI, type "hive" at the command prompt.

hive> add file my_script.py;

hive> select transform(col1, col2) using 'python my_script.py' as (result1, result2) from my_table;

See the transform docs, but essentially this will run the equivalent of an MR streaming job against every record in my_table. Records are passed to your script delimited by newlines, and fields are delimited by tabs. Anything you print to standard out will be interpreted the same way (print one line per output record, fields separated by tabs).
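
For instance, a hedged sketch of what my_script.py could look like for the query above (the upper-casing is just a stand-in for your real logic):

#!/usr/bin/env python
# my_script.py -- Hive TRANSFORM script: tab-delimited records in, tab-delimited out
import sys

for line in sys.stdin:
    col1, col2 = line.rstrip("\n").split("\t")
    # emit one output record per input record: result1 and result2, tab-separated
    print("%s\t%s" % (col1.upper(), col2))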

Here's an example that uses a Python script to validate datatypes. Feel free to ping back if you need additional help.

Super Collaborator

I'd say that, in general, whenever you need to parallelize your algorithm (and I suppose TF-IDF is a good candidate for that), you need to submit the job to the cluster in some way.

That can be the Hadoop Streaming approach mentioned by @Lester Martin, or the PySpark approach mentioned by @Artem Ervits (just note that Spark is not MapReduce, so if you want to learn MapReduce first, the streaming option is the best for you).

And if you have a lightweight algorithm that can run on a client machine (your laptop, an application server, etc.), you can just submit a Hive query to the Hadoop cluster and then process the results locally.
