Does anyone have example code showing how to use an MR input format to call a Python script?

1 ACCEPTED SOLUTION

@Scott Shaw

What about using Hive TRANSFORM to call your Python code? You can use whatever input format you want for your Hive table.
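As a rough sketch of what such a script could look like: Hive TRANSFORM passes each row to the script on standard input as tab-separated columns and reads tab-separated rows back from standard output. The file name `transform.py`, the table `my_table`, and the columns `user_id`, `message`, and `word_count` are all made up for illustration, and the script assumes every row has exactly those two input columns.

```python
#!/usr/bin/env python
"""transform.py: hypothetical row-wise script for Hive TRANSFORM.

Hypothetical HiveQL that would invoke it (placeholder table and columns):

    ADD FILE transform.py;
    SELECT TRANSFORM (user_id, message)
           USING 'python transform.py'
           AS (user_id STRING, word_count INT)
    FROM   my_table;
"""
import sys


def transform_row(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    user_id, message = line.rstrip("\n").split("\t", 1)
    # Hive serializes NULL as the literal string \N by default.
    word_count = 0 if message == "\\N" else len(message.split())
    return "%s\t%d" % (user_id, word_count)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive streams rows in on stdin and collects whatever is written to stdout.
    for line in stdin:
        if line.strip():
            stdout.write(transform_row(line) + "\n")


if __name__ == "__main__":
    main()
```

Because the Hive table sits on top of the source files, the input format is handled by the table definition and the script only ever sees plain text rows.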

A simple example is here:

http://andreyfradkin.com/posts/2013/06/15/combining-hive-and-python/

A more detailed example (but using R instead of Python):

http://hortonworks.com/blog/using-r-and-other-non-java-languages-in-mapreduce-and-hive/

You can also follow the map/reduce paradigm if you use DISTRIBUTE BY clauses.
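To illustrate the DISTRIBUTE BY idea, here is a minimal reduce-style sketch. DISTRIBUTE BY sends all rows with the same key to the same script instance, and SORT BY makes them arrive consecutively, so the script can group adjacent rows. The names `reduce.py`, `my_table`, `user_id`, `amount`, and `total` are placeholders, and the script assumes `amount` is never NULL.

```python
#!/usr/bin/env python
"""reduce.py: hypothetical reduce-style script for Hive TRANSFORM.

Hypothetical HiveQL that would invoke it (placeholder table and columns):

    ADD FILE reduce.py;
    FROM (
        SELECT user_id, amount
        FROM   my_table
        DISTRIBUTE BY user_id
        SORT BY user_id
    ) t
    SELECT TRANSFORM (t.user_id, t.amount)
           USING 'python reduce.py'
           AS (user_id STRING, total DOUBLE);
"""
import sys
from itertools import groupby


def parse(lines):
    """Yield (key, value) pairs from tab-separated 'key<TAB>number' rows."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            key, value = line.split("\t", 1)
            yield key, float(value)


def reduce_rows(lines):
    """Sum the values of consecutive rows that share the same key."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield "%s\t%s" % (key, sum(value for _, value in group))


if __name__ == "__main__":
    for row in reduce_rows(sys.stdin):
        sys.stdout.write(row + "\n")
```

Grouping only adjacent rows keeps memory use flat, but it relies on the SORT BY in the query; without it, the same key could show up in several separate groups.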


REPLIES

Guru
@sshaw@hortonworks.com

In this scenario your best bet is going to be MR-Streaming (Hadoop Streaming). It reads the data from the files in HDFS and presents each input record to your Python script on standard input. I'm assuming a line-delimited TextInputFormat, so each record is one line of the file. This is handy in your scenario because it keeps you from having to invoke Python scripts from native Java MR code. A really simple example needs little more than a 'map.py' file: you can adjust it to contain any logic you desire, or even use subprocess to call an existing Python script if desired.
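A minimal sketch of what such a 'map.py' could look like, not the code from the original post. The mapping logic (emit the first word of each line and the line length) is a placeholder, and the jar location and HDFS paths in the comment are assumptions that differ between installations.

```python
#!/usr/bin/env python
"""map.py: hypothetical map-only script for Hadoop Streaming.

One possible invocation (jar location and HDFS paths are placeholders):

    hadoop jar /path/to/hadoop-streaming.jar \
        -files map.py \
        -mapper map.py \
        -input /path/to/input \
        -output /path/to/output \
        -numReduceTasks 0
"""
import sys


def map_line(line):
    """Return a 'key<TAB>value' record for one input line, or None to skip it."""
    line = line.strip()
    if not line:
        return None
    # Placeholder logic: key is the first word, value is the line length.
    return "%s\t%d" % (line.split()[0], len(line))


def main(stdin=sys.stdin, stdout=sys.stdout):
    # With TextInputFormat, Streaming feeds the mapper one line per record.
    for line in stdin:
        record = map_line(line)
        if record is not None:
            stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

The script can be tried locally before submitting a job, for example with `cat sample.txt | python map.py`, where `sample.txt` is any small text file.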

Thanks for the response but, unfortunately, this won't work because we need to wrap a Hive table over the source files.

Master Mentor

@Scott Shaw are you still having issues with this? Can you accept the best answer or provide your own solution?