Pig processing incorrect data using piggy bank jars

I have a file whose structure is as shown below:





Obviously, if I use PigStorage(','), the three fields get split into four and the data spills over into the next column. Alternatives I have tried:

  1. I tried the Piggy Bank jars, but the issue still exists and the data still spills over. Please find the script below:

    A11 = LOAD 'File.csv.gz' USING as (column:type)

  2. I tried the REPLACE function as well. I have 35k rows, and the change does not take effect for all of them. In any case, the data still spills over here too: column values get shifted to the next column. The link I referred to is below.

    how can i ignore " (double quotes) while loading file in PIG?

  3. I tried CSVExcelStorage and CSVLoader as well.

Please suggest what I can do here. I want to have the name value in a single column.



I would do it in the following steps (pseudocode):

1. Regex-replace all commas inside quotes with a temporary placeholder (dropping the quotes)

e.g. 1,"Amrit,kumar",India -> 1,Amrit$$kumar,India

2. Regex-replace all remaining commas with the new delimiter

e.g. 1,Amrit$$kumar,India -> 1|Amrit$$kumar|India

3. Regex-replace all temporary placeholders back to commas

e.g. 1|Amrit$$kumar|India -> 1|Amrit,kumar|India

4. Reload the data using | as the delimiter

Note that step 1 could be split into two steps using CSVLoader: use the regex only to create the temporary placeholder while leaving the quotes in place, then let CSVLoader remove the quotes.
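
To make the substitutions concrete, here is a minimal sketch of steps 1 to 3 in plain Python rather than Pig. The function name `convert_line` and the constant names are hypothetical, and the sketch assumes that quoted fields contain no escaped quotes and that neither `$$` nor `|` appears in the data. The same three substitutions would still need to be expressed with Pig's own string functions inside the script.

```python
import re

PLACEHOLDER = "$$"  # temporary stand-in for commas inside quotes (assumed absent from the data)
NEW_DELIM = "|"     # new field delimiter (assumed absent from the data)

# Matches one double-quoted field such as "Amrit,kumar"; escaped quotes are not handled.
QUOTED_FIELD = re.compile(r'"([^"]*)"')


def convert_line(line: str) -> str:
    """Turn a comma-delimited line with quoted fields into a pipe-delimited line."""
    # Step 1: inside each quoted field, swap commas for the placeholder and drop the quotes.
    protected = QUOTED_FIELD.sub(
        lambda m: m.group(1).replace(",", PLACEHOLDER), line
    )
    # Step 2: every comma that is left is a real field separator.
    delimited = protected.replace(",", NEW_DELIM)
    # Step 3: restore the commas that were protected in step 1.
    return delimited.replace(PLACEHOLDER, ",")


if __name__ == "__main__":
    print(convert_line('1,"Amrit,kumar",India'))  # 1|Amrit,kumar|India
```

Splitting the result on `|` then yields three fields, with the name kept in a single column.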

@Greg Keys I cannot alter my file here not even temp file,permission issues.


Not sure exactly what you mean by "cannot alter my file here not even temp file,permission issues." Please elaborate (e.g. which file, and how does this relate to the answer above?). Looking forward to following up once I have more information.

@Vaibhav Kumar

What @Greg Keys has described should work. You are not actually altering the file: when Pig processes it, you transform the data as it passes through Pig, while the base file stays the same.

Alternatively, try using " " for loading the CSV file. It should handle the quotes by default. Hope this helps!