Created 09-29-2015 12:38 PM
Created 11-26-2015 11:44 PM
What about using Hive's TRANSFORM clause to call your Python code? You can use whatever input format you want for your Hive table.
A simple example is here:
http://andreyfradkin.com/posts/2013/06/15/combining-hive-and-python/
A more detailed example (using R instead of Python):
http://hortonworks.com/blog/using-r-and-other-non-java-languages-in-mapreduce-and-hive/
You can also follow the map/reduce paradigm if you use DISTRIBUTE BY clauses.
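To make the TRANSFORM idea concrete, here is a minimal sketch of a script Hive could call. By default Hive sends each row to the script on stdin as tab-separated strings (with NULL written as \N) and reads tab-separated rows back from stdout. The file name transform.py, the table source_table, and the columns user_id, raw_text, and token_count are all made up for illustration, and the word-count logic is only a placeholder.

```python
#!/usr/bin/env python
"""Hypothetical transform.py for use with Hive's TRANSFORM clause.

Assumed invocation (table and column names are illustrative only):
    ADD FILE transform.py;
    SELECT TRANSFORM (user_id, raw_text)
           USING 'python transform.py'
           AS (user_id, token_count)
    FROM source_table;
"""
import sys

HIVE_NULL = "\\N"  # Hive passes NULL column values to the script as \N


def transform_row(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    fields = line.rstrip("\n").split("\t")
    user_id, raw_text = fields[0], fields[1]
    # Placeholder logic: count whitespace-separated tokens, 0 for NULL.
    count = 0 if raw_text == HIVE_NULL else len(raw_text.split())
    return "\t".join([user_id, str(count)])


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    for line in stream_in:
        if line.strip():  # skip blank lines
            stream_out.write(transform_row(line) + "\n")


if __name__ == "__main__":
    main()
```

Because the Hive table sits on top of the source files, the table's input format decides how the files are read, and the script only ever sees the selected columns.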
Created 09-29-2015 01:42 PM
In this scenario your best bet is going to be to use MR-Streaming. MR-Streaming will read the data from the files in HDFS and present each input record (I'm assuming TextInputFormat, which is line-delimited, so each record is one line of the file) to your Python script. This is handy in your scenario because it keeps you from having to invoke Python scripts from native Java MR code. Here is a really simple example. You can adjust the 'map.py' file to contain any logic you desire, or even use subprocess to call an existing Python script if desired.
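As a rough sketch of what such a 'map.py' could look like (this is not the original poster's file): the mapper reads one line per record from stdin and writes key, tab, value lines to stdout, which is the default record layout Hadoop Streaming expects. The comma-separated input and the line-length value are assumptions chosen purely for illustration.

```python
#!/usr/bin/env python
"""Hypothetical map.py for Hadoop Streaming (placeholder logic only)."""
import sys


def map_line(line):
    """Return a (key, value) pair for one input line.

    Assumption: records are comma-separated and the first field is a
    usable key. Replace this body with whatever logic you need.
    """
    text = line.rstrip("\n")
    key = text.split(",", 1)[0]
    return key, len(text)


def run(lines, out):
    """Apply map_line to every non-blank line and emit key<TAB>value."""
    for line in lines:
        if not line.strip():
            continue
        key, value = map_line(line)
        out.write("%s\t%d\n" % (key, value))


if __name__ == "__main__":
    run(sys.stdin, sys.stdout)
```

The script would be shipped with the job and named as the mapper, for example via the streaming jar's -files and -mapper options; the jar location and the input and output paths depend on your cluster.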
Created 10-06-2015 12:31 PM
Thanks for the response but, unfortunately, this won't work because we need to wrap a Hive table over the source files.
Created 02-03-2016 03:52 PM
@Scott Shaw are you still having issues with this? Can you accept the best answer or provide your own solution?