UDF issue in Pig (CSV file referenced in a UDF called from Pig)


New Contributor

I am facing an issue while calling a UDF in Pig. The UDF refers to a CSV file. When I call that UDF in Pig, I get the error below.

 

Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: pig_lib.FN_LOCMATCH [./CITY_DATA.csv (No such file or directory)]

The CSV file is present in the same path where the jar is placed.

 

REGISTER '/home/proj/RESOURCE/pig_lib_new.jar';  

 

The CSV file is placed in this path: /home/proj/RESOURCE

 

I tried hard-coding the CSV file path in the Java class, but I still get the same error below.

 

Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: pig_lib.FN_LOCMATCH [/home/proj/RESOURCE/CITY_DATA.csv (No such file or directory)]


Re: UDF issue in Pig (CSV file referenced in a UDF called from Pig)

Master Guru
Have you also thought about these?

1. Is the UDF running on the same machine as your file? In Pig's context, a UDF is a function applied atop its MR jobs (except in specific cases), and the jobs themselves execute across the cluster's NMs/TTs, not on the invocation node.
2. Is the file-opening code designed to run in the front-end, or the back-end? The errors indicate the code runs in the back-end, i.e. within tasks of a fired Pig job (see 1).
3. Are you using a java.io.File API to load the file, or an org.apache.hadoop.fs.FileSystem one instead? The latter may need special config to be asked to look locally. (See the sketch after this list.)
4. Often a silly question, but does your program's executing user actually have rights to access the path? Remember also that in MR tasks, the user running the real task JVMs may not be the same user who fired the job from the grunt shell.
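
To illustrate the difference in point 3, here is a minimal sketch; the class name, method names, and paths are placeholders, not your actual pig_lib.FN_LOCMATCH code. java.io.File resolves against the local disk of whichever node runs the code, while the Hadoop FileSystem API resolves against the cluster's default filesystem (normally HDFS), so it behaves the same on every task node once the CSV has been copied into HDFS.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CsvLookup {

    // java.io approach: resolves on the local disk of the node the code runs on.
    // In the back-end that node is a NodeManager/TaskTracker host, so a path such
    // as /home/proj/RESOURCE/CITY_DATA.csv will usually not exist there.
    static BufferedReader openLocal(String localPath) throws Exception {
        return new BufferedReader(new java.io.FileReader(localPath));
    }

    // Hadoop FileSystem approach: resolves against the default filesystem (HDFS),
    // so the same code works inside any task as long as the CSV is in HDFS,
    // e.g. after an "hdfs dfs -put" of the file.
    static BufferedReader openFromHdfs(String hdfsPath) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        return new BufferedReader(new InputStreamReader(fs.open(new Path(hdfsPath)), "UTF-8"));
    }
}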

Re: UDF issue in Pig (CSV file referenced in a UDF called from Pig)

New Contributor

Hi Harsh J,

 

Thanks for your immediate reply. I have used java.io.File to open the file. I am able to use the function in Hive and it works fine.

I ran the commands below before using the function in Hive.

 

add file hdfs:////home/proj/RESOURCE/CITY_DATA.csv;

create temporary function FN_LOCMATCH as 'hive_lib.FN_LOCMATCH';

 

The user has access to this path. Please shed some light on how to resolve the issue.

 

Re: UDF issue in Pig (CSV file referenced in a UDF called from Pig)

Master Guru
That operation in Hive causes Hive to ship CITY_DATA.csv in the distributed cache of all the query's jobs, and to place the file (or a symlink to it) in the task JVM's runtime working directory (allowing you to open it locally via a relative ./CITY_DATA.csv reference).

Are you making sure to do the very same thing in Pig (the equivalent of ADD FILE in Hive)? It is tied to your requirement in the same way as it is in Hive.

Pig does not have an equivalent as easy as ADD FILE, but you can set some similar properties to get the very same effect; a sketch of the UDF side follows below. A bit of web searching turns up this blog post that illustrates how, as one example:
https://ragrawal.wordpress.com/2014/03/25/apache-pig-and-distributed-cache/
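
Once the file has been shipped via the distributed-cache properties the post describes (e.g. mapred.cache.files with a #CITY_DATA.csv symlink fragment; exact property names vary by Hadoop/MapReduce version), the UDF can open the symlinked copy via the relative path, exactly as in the Hive case. Below is a rough, illustrative skeleton of that pattern; the class name, the CSV layout (two comma-separated columns), and the lookup logic are assumptions, not your actual pig_lib.FN_LOCMATCH.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Illustrative skeleton only; not the original pig_lib.FN_LOCMATCH.
public class LocMatch extends EvalFunc<String> {

    private Map<String, String> cityLookup;

    // Load the CSV lazily inside exec(), i.e. on the back-end task, where the
    // distributed cache has symlinked CITY_DATA.csv into the task's working dir.
    private void loadCities() throws IOException {
        cityLookup = new HashMap<String, String>();
        BufferedReader reader = new BufferedReader(new FileReader("./CITY_DATA.csv"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",", 2);
                if (cols.length == 2) {
                    cityLookup.put(cols[0].trim(), cols[1].trim());
                }
            }
        } finally {
            reader.close();
        }
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        if (cityLookup == null) {
            loadCities();
        }
        return cityLookup.get(input.get(0).toString());
    }
}

With that in place, the REGISTER of pig_lib_new.jar stays as it is; only how the CSV reaches the tasks changes.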