01-08-2016 03:26 PM
I am writing a Python UDF that uses data from a file located on HDFS.
If I copy the file to the local file system on the cluster nodes, I can use it in the following way:
add file /home/user/pythonUDFs/my_udf.py;
add file /home/user/data/some_data_needed_in_my_udf.tsv;
But I would rather not copy "some_data_needed_in_my_udf.tsv" to the local file system of the Hadoop nodes, and instead read it directly from HDFS.
How can I do that?
01-13-2016 04:07 PM
You can use a third-party Python package that talks to HDFS via WebHDFS (for example, pywebhdfs: https://pypi.python.org/pypi/pywebhdfs) and import and use it within your UDF. The package must either be shipped along with the job or be installed across the cluster for the import to succeed.
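For instance, reading the TSV over WebHDFS from inside the UDF might look roughly like the sketch below. The host name, port, user name, HDFS path, and file layout are all made-up placeholders, not values from the original question, and this only works where the UDF can reach the NameNode's WebHDFS port:

```python
# Rough sketch: load a lookup table from HDFS via WebHDFS inside a Python UDF.
# Host, port, user and path are placeholders -- adjust for your cluster.

def parse_tsv(text):
    """Turn TSV text into a {first_column: remaining_columns} lookup dict."""
    lookup = {}
    for line in text.splitlines():
        fields = line.split('\t')
        if fields and fields[0]:
            lookup[fields[0]] = fields[1:]
    return lookup

def load_lookup_from_hdfs():
    # Imported lazily so the script still loads where pywebhdfs is absent.
    from pywebhdfs.webhdfs import PyWebHdfsClient
    client = PyWebHdfsClient(host='namenode.example.com', port='50070',
                             user_name='hdfs')
    # pywebhdfs paths are relative to the HDFS root, without a leading slash.
    raw = client.read_file('user/hive/data/some_data_needed_in_my_udf.tsv')
    return parse_tsv(raw.decode('utf-8'))
```

Loading the table once at script start-up (rather than per row) keeps the WebHDFS round-trip out of the per-record path.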
Note, however, that Python UDFs have no context (such as the Configuration object accessible to some methods in Java UDFs). They therefore have no access to the security tokens used to talk to HDFS from task containers across the cluster. On a secure cluster this approach would fail unless you find a way to pass tokens, or re-authenticate via a keytab from within the UDF code, before using WebHDFS.
IMHO, it is better to continue with your current ADD FILE approach, since that also works on secure clusters without requiring any changes.
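For completeness, the UDF side of the ADD FILE approach might look like the sketch below, assuming the script is invoked via Hive's TRANSFORM. Because ADD FILE ships the data file to each task's working directory, the script can open it by its bare name; the key column and the NULL placeholder are illustrative assumptions:

```python
# Sketch of a TRANSFORM script using a file shipped with ADD FILE.
# Hive places the shipped file in the task's working directory, so it is
# opened by its bare name. Column layout here is an assumption.
import sys

def load_lookup(path='some_data_needed_in_my_udf.tsv'):
    """Read the shipped TSV into a {first_column: remaining_columns} dict."""
    lookup = {}
    with open(path) as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            if fields and fields[0]:
                lookup[fields[0]] = fields[1:]
    return lookup

def main():
    lookup = load_lookup()
    for line in sys.stdin:
        key = line.rstrip('\n').split('\t')[0]
        # Emit the input key joined with the looked-up columns (or \N for misses).
        print('\t'.join([key] + lookup.get(key, ['\\N'])))

# In the real script, call main() so Hive's TRANSFORM can stream rows through it.
```

No HDFS client, token handling, or extra package distribution is needed, which is why this stays the simpler option on a secured cluster.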