Error Handling during Pig LOAD Function
Labels: Apache Pig
Created ‎02-01-2016 04:52 PM
Does anyone have experience with how Pig handles malformed tuples during LOAD?
E.g. suppose we LOAD 10 comma-delimited lines using PigStorage(','), but the 9th line of the input is pipe-delimited. What control do we have over how such lines are parsed and which relation they are assigned to?
Ideally, I'd like one relation loaded with the successfully parsed rows and another relation holding the rows that were not parsed properly.
Created ‎02-01-2016 05:09 PM
Unfortunately, there is no real exception handling in Pig. The usual advice is to use UDFs.
If the logic becomes too complicated for Artem's approach, you could write a valid() function in Java and combine it with SPLIT. Adding a Java UDF to a Pig script is straightforward.
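-- valid() is the Java UDF described below; its jar has to be REGISTERed first
DATA = LOAD '/my/input/folder';
-- rows that pass valid() go to GOODDATA, everything else to BADDATA
SPLIT DATA INTO GOODDATA IF valid($0), BADDATA OTHERWISE;
STORE GOODDATA INTO '/tmp/good';
STORE BADDATA INTO '/tmp/bad';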
The valid function would be a simple Java EvalFunc similar to the example linked below. It could check, for instance, whether the data has the expected number of delimiter symbols, the correct datatypes, etc.
https://pig.apache.org/docs/r0.7.0/udf.html#How+to+Use+a+Simple+Eval+Function
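As a rough illustration of the checks mentioned above, here is a minimal sketch of the validation logic in plain Java. The class name `LineValidator`, the method `isValid`, and the assumed record layout (exactly three comma-separated integers) are hypothetical and not from the thread. To call this from Pig, the method would still have to be wrapped in an EvalFunc as described in the linked documentation.

```java
// Hypothetical helper: the checks a valid() UDF could delegate to.
final class LineValidator {
    // Assumption: a good row has exactly three comma-separated integer fields.
    private static final int EXPECTED_FIELDS = 3;

    private LineValidator() {}

    static boolean isValid(String line) {
        if (line == null) {
            return false;
        }
        // A limit of -1 keeps trailing empty fields, so "1,2," still yields three fields.
        String[] fields = line.split(",", -1);
        if (fields.length != EXPECTED_FIELDS) {
            return false; // wrong field count, e.g. a pipe-delimited row
        }
        for (String field : fields) {
            try {
                Integer.parseInt(field.trim());
            } catch (NumberFormatException e) {
                return false; // field is empty or not an integer
            }
        }
        return true;
    }
}
```

With this layout, a pipe-delimited row such as `1|2|3` is rejected because it contains no commas and therefore splits into a single field.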
Created ‎02-01-2016 04:52 PM
@Wes Floyd you can use a conditional, or, since you load with PigStorage(','), you can keep the rows that contain commas and filter out the rows that contain pipes. Then load the pipe-delimited rows again, this time with PigStorage('|').
Here's an example with SPLIT:
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
Created ‎02-01-2016 05:12 PM
@Wes Floyd @Benjamin Leonhardi I was also thinking of loading each line whole, using PigStorage() without the real delimiter, and then using a regex, SPLIT, or FILTER to route the rows to separate output files.
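To make the regex idea concrete, here is a small sketch in plain Java that routes whole lines into a good and a bad bucket. The class `LineRouter`, its fields, and the pattern (three fields separated by commas, none of them containing a comma or a pipe) are assumptions for illustration only. In a Pig script, the same kind of pattern would be applied to the single loaded column inside a SPLIT or FILTER condition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: route raw lines by matching the whole line against a pattern.
final class LineRouter {
    // Assumption: a good row is three fields separated by commas,
    // and no field contains a comma or a pipe.
    private static final Pattern GOOD_LINE = Pattern.compile("[^,|]*(,[^,|]*){2}");

    final List<String> good = new ArrayList<>();
    final List<String> bad = new ArrayList<>();

    void route(String line) {
        // matches() requires the entire line to fit the pattern.
        if (line != null && GOOD_LINE.matcher(line).matches()) {
            good.add(line);
        } else {
            bad.add(line);
        }
    }
}
```

Because the rows are only classified and not yet parsed, the good bucket would still need to be written out and loaded again with PigStorage(',') to get typed fields.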
Created ‎02-01-2016 05:24 PM
It would be really convenient if the PigStorage parsing logic also existed as a Pig function. Then one could load each line as a string, check whether it is valid with SPLIT, and then parse it into a tuple.
Something like:
A = LOAD 'myfile';
SPLIT A INTO GOODDATA IF PigStorage_valid($0), BADDATA OTHERWISE;
C = FOREACH GOODDATA GENERATE PigStorage_parse($0);
...
But since this doesn't exist, I think the only options are to write these functions yourself or, as Artem says, to use a regex, FILTER, etc. to verify correctness, write the result out, and load it again with PigStorage.
