Error Handling during Pig LOAD Function

Contributor

Does anyone have experience with how Pig handles malformed tuples during the LOAD function?

E.g., if we LOAD 10 comma-delimited lines using PigStorage(',') but the 9th line of the input data is pipe-delimited, what controls do we have over how those tuples are parsed and which relation (variable) they are assigned to?

Ideally, I'd like one relation loaded with the rows that parsed successfully and another relation holding the rows that were not parsed properly.

1 ACCEPTED SOLUTION

Unfortunately there is no real exception handling in Pig. The usual tip is to use UDFs.

If the logic becomes too complicated for Artem's approach (see the replies), you could write a valid function in Java and combine it with SPLIT. Adding a Java UDF is really simple in Pig.

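-- valid is a user-defined function; its jar has to be REGISTERed first.
-- With the default loader (tab-delimited), $0 holds the whole line if the data contains no tabs.
DATA = LOAD '/my/input/folder';

SPLIT DATA INTO GOODDATA IF valid($0), BADDATA OTHERWISE;

STORE GOODDATA INTO '/tmp/good';

STORE BADDATA INTO '/tmp/bad';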

The valid function would be a simple Java eval function (a class extending EvalFunc), similar to the example linked below. It could check that the data has the expected number of pipe symbols, the correct datatypes, etc.

https://pig.apache.org/docs/r0.7.0/udf.html#How+to+Use+a+Simple+Eval+Function
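
As a rough illustration of what such a valid function might check, here is a minimal Java sketch. It assumes that the whole raw line arrives as one string and that every field should be an integer; the class name LineValidator and the isValid signature are made up for this example and are not part of Pig. In an actual UDF the same logic would sit inside exec() of a class extending Pig's FilterFunc, which is left out here so the sketch compiles without the Pig jars.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the validation logic a Pig "valid" UDF could delegate to.
// Assumption for this sketch: every field of a good row is an integer.
final class LineValidator {
    private LineValidator() {}

    /** Returns true if the raw line has exactly expectedFields integer fields. */
    static boolean isValid(String line, char delimiter, int expectedFields) {
        if (line == null) {
            return false;
        }
        // Pattern.quote makes '|' a literal; limit -1 keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(String.valueOf(delimiter)), -1);
        if (fields.length != expectedFields) {
            return false; // wrong delimiter or wrong number of columns
        }
        for (String field : fields) {
            try {
                Integer.parseInt(field.trim());
            } catch (NumberFormatException e) {
                return false; // wrong datatype
            }
        }
        return true;
    }
}
```

With this, isValid("1,2,3", ',', 3) accepts the line, while a pipe-delimited "1|2|3" checked against ',' is rejected because it splits into a single field.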

4 REPLIES

Mentor

@Wes Floyd you can use an if statement, or, since you use PigStorage(',') in your case, you can filter for the rows that contain commas and filter out the rows that contain pipes. Then load the pipe-delimited rows again, this time with PigStorage('|').

Here's an example with SPLIT:

A = LOAD 'data' AS (f1:int,f2:int,f3:int);

DUMP A;                
(1,2,3)
(4,5,6)
(7,8,9)        

SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);

DUMP X;
(1,2,3)
(4,5,6)

DUMP Y;
(4,5,6)

DUMP Z;
(1,2,3)
(7,8,9)

Mentor

@Wes Floyd @Benjamin Leonhardi I was also thinking of loading with PigStorage() without specifying the delimiter, and then using a regex, SPLIT, or FILTER to route the rows to separate output files.

It would be really convenient if the PigStorage SerDe also existed as a Pig function. Then one could load each line as a string, check whether it is valid with SPLIT, and then parse it into a tuple.

Something like:

A = LOAD 'myfile';

SPLIT A INTO GOODDATA IF PigStorage_valid($0), BADDATA OTHERWISE;

C = FOREACH GOODDATA GENERATE PigStorage_parse($0);

...

But since this doesn't exist, I think the only options are to write these functions yourself or, as Artem says, to use a regex, filter, etc. to verify correctness, write the data out, and load it again with PigStorage.
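
For the write-it-yourself route, the check and the parse can be combined in plain Java. The sketch below is only an illustration: it assumes exactly three comma-separated fields with no embedded commas or pipes, and RawLineRouter, good and bad are hypothetical names that mirror the GOODDATA/BADDATA relations rather than anything Pig provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a validate-then-parse step on raw lines.
final class RawLineRouter {
    // Assumed layout: three fields separated by commas, none containing ',' or '|'.
    private static final Pattern THREE_COMMA_FIELDS =
            Pattern.compile("[^,|]*,[^,|]*,[^,|]*");

    final List<String[]> good = new ArrayList<>(); // parsed rows
    final List<String> bad = new ArrayList<>();    // raw lines that failed the check

    /** Parses the line into fields if it matches the layout, otherwise keeps it raw. */
    void route(String line) {
        if (line != null && THREE_COMMA_FIELDS.matcher(line).matches()) {
            good.add(line.split(",", -1)); // -1 keeps empty fields
        } else {
            bad.add(line);
        }
    }
}
```

Keeping the rejected lines in their raw form means they can be written out unchanged and inspected or reloaded later with a different delimiter.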
