Support Questions
Find answers, ask questions, and share your expertise

How can I run a (python) UDF from Hive using the ADD ARCHIVE option?

Has anyone successfully run a custom UDF (in Python, or any other language) using the "add archive" option?

I've created a Python function within a virtual environment.

I now want to run this function as a Hive UDF using the "ADD ARCHIVE /tmp/python_venv.zip" syntax. I have zipped up my virtual environment both in .zip format and as a .tar.gz, but Hive continues to throw errors...

My HiveQL script looks like this:

ADD ARCHIVE /tmp/hiveudf.zip;

select transform(model) using 'python ./hiveudf/hiveudf1.py' as (word STRING) from trucks limit 10;

I also tried to run:

select transform(model) using 'python hiveudf1.py' as (word STRING) from trucks limit 10;

My (partial) error message looks like this: Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:0, Vertex vertex_1470355361002_0012_4_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0

NOTE: I am using Anaconda Python, and I am running Hive from the command line (Beeline and the Ambari query editor throw authentication errors and disallow transform operations).

Thanks for your help!

1 ACCEPTED SOLUTION


Re: How can I run a (python) UDF from Hive using the ADD ARCHIVE option?

select transform(inputCol1, inputCol2) using 'myScript.py' as myCol1, myCol2 from myTable;

'myScript.py' should be in the root of the archive file you're adding.
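For context, a script used with Hive's TRANSFORM is just an executable that reads tab-separated input rows on stdin and writes tab-separated output rows on stdout. A minimal sketch (the column logic here is illustrative, not from this thread):

```python
# Minimal sketch of a Hive TRANSFORM script: read tab-separated rows
# from stdin, emit one tab-separated row per input row on stdout.
import sys


def transform_line(line):
    """Split an input row on tabs and return the output columns."""
    cols = line.rstrip("\n").split("\t")
    # Illustrative logic only: upper-case the first column.
    return [cols[0].upper()]


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        stdout.write("\t".join(transform_line(line)) + "\n")
```

In the real script you would end the file with a bare `main()` call (or an `if __name__ == "__main__":` guard) so Hive can run it directly.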

Alternatively (and, I think, better), ship a virtualenv as an archive and add your UDFs as separate files. That way you can build one commonly used virtualenv archive and share it across many separate UDFs that depend on it, and you won't have to produce a new archive every time you create a new UDF:

add archive virtualenv.tgz;

add file myUDF1.py;

add file myUDF2.py;

select transform(inputCol1, inputCol2) using 'myUDF1.py' as myCol1, myCol2 from myTable;

select transform(inputCol1, inputCol2) using 'myUDF2.py' as myCol1, myCol2 from myTable;
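Errors like the one in the question often come down to the script not sitting where the `using` clause expects it inside the archive. A quick way to build such an archive with the script at the root, and to inspect its entries, from Python (file names are illustrative, not from this thread):

```python
# Sketch: build a .zip with the UDF script at the archive root, then
# return the entry names so you can verify the layout before ADD ARCHIVE.
import os
import zipfile


def build_udf_archive(script_path, archive_path, arcname=None):
    """Zip a single UDF script; arcname controls its path inside the zip."""
    arcname = arcname or os.path.basename(script_path)
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.write(script_path, arcname=arcname)
    # Re-open and list entries so the caller can confirm the layout.
    with zipfile.ZipFile(archive_path) as zf:
        return zf.namelist()
```

If `namelist()` shows `myScript.py` at the top level (no leading directory), the `using 'myScript.py'` form should find it; a nested entry like `hiveudf/hiveudf1.py` would instead need the `./hiveudf/hiveudf1.py` path.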



Re: How can I run a (python) UDF from Hive using the ADD ARCHIVE option?


Hi Randy, is this everything you need to submit an archived virtual environment with a Hive query? Just `add archive virtualenv.tgz;` and that's it? Will the Python UDF then have all required dependencies? I am having difficulties with this: in particular, a non-default Python library cannot be imported on some of the Hadoop nodes, and submitting the virtual env with my Hive script doesn't seem to help. I would appreciate any comment on that!
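For what it's worth, one pattern for making a shipped virtualenv's packages importable (an assumption on my part, not something stated in this thread) is to prepend its `site-packages` directories to `sys.path` at the top of the UDF, before importing any third-party libraries:

```python
# Sketch (assumption): after ADD ARCHIVE unpacks the virtualenv into the
# task's working directory, walk it and prepend every site-packages
# directory to sys.path so third-party imports in the UDF resolve.
import os
import sys


def add_venv_site_packages(venv_root):
    """Prepend each site-packages dir under venv_root to sys.path."""
    added = []
    for dirpath, _dirnames, _filenames in os.walk(venv_root):
        if os.path.basename(dirpath) == "site-packages":
            sys.path.insert(0, dirpath)
            added.append(dirpath)
    return added
```

The UDF would call something like `add_venv_site_packages("./virtualenv.tgz")` (the unpacked archive name, per the ADD ARCHIVE convention) before its imports; whether the native parts of a library work still depends on the nodes having a compatible Python.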

Thanks!