Created 05-10-2017 07:48 AM
I have a file with the following structure:
ID,Name,Address
1,"Amrit,kumar",India
2,"Vaibhav,arora",USA
3,"Deepika,kumar",Germany
Obviously, if I use PigStorage(',') the three fields get split into four and the data spills over into the wrong columns. One alternative:
A11 = LOAD 'File.csv.gz' USING org.apache.pig.piggybank.storage.CSVLoader() AS (column:type);
How can I ignore the " (double quotes) while loading the file in Pig?
Please suggest what I can do here. I want the name value to end up in a single column.
Created 05-10-2017 11:23 AM
I would do it in the following steps (pseudocode):
1. Regex-replace all commas inside quotes with a temporary placeholder
e.g. 1,"Amrit,kumar",India -> 1,Amrit$$kumar,India
2. Regex-replace all remaining commas with the new delimiter
e.g. 1,Amrit$$kumar,India -> 1|Amrit$$kumar|India
3. Regex-replace all temporary placeholders back to commas
e.g. 1|Amrit$$kumar|India -> 1|Amrit,kumar|India
4. Reload the data using | as the delimiter
Note that step 1 could be split into two steps using CSVLoader: use the regex only to create the temporary placeholder while leaving the quotes in place, then let CSVLoader remove the quotes.
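The steps above can be tried outside Pig first. Here is a minimal Python sketch of the same idea, not Pig code. The function name to_pipe_delimited is made up for illustration, and the sketch assumes that neither $$ nor | occurs in the data and that quoted fields contain no escaped quotes.

```python
import re

PLACEHOLDER = "$$"  # temporary marker from the example; assumed absent from the data
NEW_DELIM = "|"     # new delimiter; also assumed absent from the data


def to_pipe_delimited(line):
    """Hypothetical helper: apply steps 1-3 to one line of text."""
    # Step 1: inside each "..." run, swap commas for the placeholder
    # and drop the surrounding quotes.
    step1 = re.sub(
        r'"([^"]*)"',
        lambda m: m.group(1).replace(",", PLACEHOLDER),
        line,
    )
    # Step 2: every comma that is left is a real field separator.
    step2 = step1.replace(",", NEW_DELIM)
    # Step 3: turn the placeholders back into commas.
    return step2.replace(PLACEHOLDER, ",")


rows = [
    '1,"Amrit,kumar",India',
    '2,"Vaibhav,arora",USA',
    '3,"Deepika,kumar",Germany',
]

for row in rows:
    converted = to_pipe_delimited(row)
    # Step 4: splitting on the new delimiter gives three fields per row.
    print(converted, "->", converted.split(NEW_DELIM))
```

If the real data can contain $$ or |, pick a different placeholder and delimiter before relying on this approach.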
Created 05-10-2017 03:06 PM
@Greg Keys I cannot alter my file here, not even a temp file, due to permission issues.
Created 05-10-2017 03:58 PM
Not sure exactly what you mean by not being able to alter the file, or even a temp file, because of permission issues. Please elaborate (e.g. which file, and how does this relate to the answer above?). I look forward to following up once I have more information.
Created 05-12-2017 11:15 AM
What @Greg Keys has described should work. You are not actually altering the file: while processing it in Pig you transform the data as it passes through Pig, but the base file remains the same.
Alternatively, try using org.apache.pig.piggybank.storage.CSVExcelStorage() to load the CSV file. It should handle quoted fields by default. Hope this helps!
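To show what a quote-aware loader does with the sample rows compared with a plain comma split, here is a small Python sketch. It is only an illustration, not Pig code: Python's standard csv module stands in for the loader, and the sample string is simply the data from the question.

```python
import csv
import io

# The sample data from the question, held in memory for the illustration.
sample = (
    'ID,Name,Address\n'
    '1,"Amrit,kumar",India\n'
    '2,"Vaibhav,arora",USA\n'
    '3,"Deepika,kumar",Germany\n'
)

# Plain split on commas: the quoted name is cut in two.
naive = [line.split(",") for line in sample.splitlines()]

# Quote-aware parsing: a comma inside double quotes stays in its field,
# and the quotes themselves are removed.
quote_aware = list(csv.reader(io.StringIO(sample)))

print(len(naive[1]), naive[1])
print(len(quote_aware[1]), quote_aware[1])
```

The first data row comes out as four fields with the plain split and three with the quote-aware parser, which is the behaviour you want from the loader.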