
Read HDFS files from Python UDF

Hi,

 

I am writing a Python UDF that uses data from a file located on HDFS.

If I copy the file to the local file system on the cluster node, I can use it in the following way:

 

add file /home/user/pythonUDFs/my_udf.py;
add file /home/user/data/some_data_needed_in_my_udf.tsv;

FROM some_table
SELECT TRANSFORM (var1, var_2)
USING 'my_udf.py'
AS (out_var1, out_var2);
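
For reference, the TRANSFORM script consumes the setup roughly like the minimal sketch below; the TSV column layout and the lookup logic are just for illustration, but files shipped with ADD FILE really do appear under their base name in the script's working directory.

#!/usr/bin/env python
# Minimal sketch of my_udf.py: Hive streams the selected columns to stdin as
# tab-separated lines, and resources shipped with ADD FILE are localized into
# the script's working directory under their base name.
import sys

# Load the side data shipped via ADD FILE (hypothetical key/value layout).
lookup = {}
with open('some_data_needed_in_my_udf.tsv') as f:
    for line in f:
        key, value = line.rstrip('\n').split('\t', 1)
        lookup[key] = value

# Process the input rows and emit two tab-separated output columns.
for line in sys.stdin:
    var1, var_2 = line.rstrip('\n').split('\t')
    out_var1 = var1
    out_var2 = lookup.get(var_2, 'NULL')
    print('%s\t%s' % (out_var1, out_var2))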

 

But I would like to avoid copying "some_data_needed_in_my_udf.tsv" to the local file system of the Hadoop node, and instead read it directly from HDFS.

 

How can I do that?

 

Thanks


Re: Read HDFS files from Python UDF

You can use a third-party Python package that talks to HDFS via WebHDFS (for example, https://pypi.python.org/pypi/pywebhdfs) and import and use it within your UDF. The package must be shipped along with the script, or be installed across the cluster, for the import to succeed.
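
A minimal sketch of that idea, assuming pywebhdfs is importable on the task nodes; the NameNode host, WebHDFS port and HDFS path below are placeholders for your environment:

# Sketch only: read the side file straight from HDFS via WebHDFS inside the
# UDF, instead of shipping it with ADD FILE.
from pywebhdfs.webhdfs import PyWebHdfsClient

hdfs = PyWebHdfsClient(host='namenode.example.com', port='50070',
                       user_name='someuser')

# read_file() fetches the whole file; pywebhdfs takes paths without the
# leading slash, and returns raw bytes on Python 3.
data = hdfs.read_file('user/someuser/data/some_data_needed_in_my_udf.tsv')

lookup = {}
for line in data.decode('utf-8').splitlines():
    key, value = line.split('\t', 1)
    lookup[key] = value
# ...then process sys.stdin rows exactly as in the ADD FILE version.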

 

Note, however, that Python UDFs have no context (such as the Configuration object access available to some methods in Java UDFs). Therefore, they have no access to the security tokens used to talk to HDFS from task containers across the cluster. On a secure cluster this approach would fail unless you find a way to pass tokens or re-authenticate via a keytab from within the UDF code before using WebHDFS.
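
As a rough illustration of the keytab route (not something I would recommend over ADD FILE), the UDF could ship a keytab as another resource and kinit before touching HDFS. The keytab name and principal below are placeholders, and the HTTP client you use for WebHDFS still has to be able to perform SPNEGO authentication for the resulting ticket to be of any use:

# Rough sketch of the keytab idea: obtain a Kerberos ticket inside the task
# before any HDFS access. Keytab name and principal are placeholders.
import os
import subprocess

KEYTAB = 'my_user.keytab'          # shipped alongside the script via ADD FILE
PRINCIPAL = 'my_user@EXAMPLE.COM'  # placeholder principal

# Write the ticket cache somewhere the task container can write to.
os.environ['KRB5CCNAME'] = '/tmp/krb5cc_udf_%d' % os.getpid()
subprocess.check_call(['kinit', '-kt', KEYTAB, PRINCIPAL])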

 

IMHO, it is better to continue with your current ADD FILE approach, since that also works on secure clusters without requiring any changes.

Backline Customer Operations Engineer