
Pig - custom UDF - Jython + Pyasn


Hi all, I'm putting together a log parser in Pig, and I'm trying to use "pyasn", a Python extension that allows offline querying of an ASN database, to extract Autonomous System Number (ASN) information from IP addresses.


The link to the project is here:


What happens is that:


1) I installed pyasn successfully (in a previous attempt via pip install; currently I have built it manually, but it still doesn't work)


2) I wrote a custom UDF to be imported into Pig later, wrapped in Jython:



import pyasn

def asnLookup(ip):
	# Open the offline ASN database and look up the ASN for an IP address
	asndb = pyasn.pyasn('asn.dat')
	asn = asndb.lookup(ip)
	return asn

def asnGetAsPrefixes(nbr):
	# Return the prefixes announced by the given AS number
	asndb = pyasn.pyasn('asn.dat')
	asn_prefix = asndb.get_as_prefixes(nbr)
	return asn_prefix


3) But when I try to register my UDF, I get the following exception:


grunt> register 'hdfs:///user/xxxxxx/LIB/PYASN/' using jython as pythonPyasn;
2017-11-23 17:18:10,468 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - is deprecated. Instead, use fs.defaultFS
2017-11-23 17:18:10,939 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=/tmp/pig_jython_8271942503558994412
2017-11-23 17:18:12,468 [main] WARN  org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.
2017-11-23 17:18:13,236 [main] ERROR - ERROR 1121: Python Error. Traceback (most recent call last):
  File "/tmp/pig6864734086775637011tmp/", line 8, in <module>
    import pyasn
  File "/usr/lib64/python2.6/site-packages/pyasn-1.6.0b1-py2.6-linux-x86_64.egg/pyasn/", line 20
SyntaxError: future feature print_function is not defined
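If I read the error correctly, this particular SyntaxError is raised when the interpreter executing the import does not implement the print_function future feature (it was introduced in CPython 2.6; the Jython 2.5.x bundled with many Pig releases does not have it, while Jython 2.7 does). A quick way to check what the running interpreter supports, as a pure-Python sketch:

```python
import sys
import __future__

# An interpreter that implements the feature exposes it on the
# __future__ module; one that doesn't raises exactly the
# "future feature print_function is not defined" SyntaxError
# when a module does "from __future__ import print_function".
has_print_function = hasattr(__future__, 'print_function')
print(sys.version_info[:2], has_print_function)
```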

4) The puzzling thing is that I'm doing exactly the same thing with another Python extension for geolocation (PyGeoIP), and it works smoothly: same concept, I wrote a UDF, wrapped it in Jython, imported it into Pig, and I can call it successfully!


5) Just to check that things are formally OK, I opened a PySpark shell and used the extension there: it works without any problems. But I don't want to (and can't) use Spark in this case, for a number of reasons.


Any ideas/insight would be very much appreciated!




