I have a file with the structure shown below:
Obviously, if I use PigStorage(','), the three fields will be split into four and the data will spill over. Alternatives:
A11 = LOAD 'File.csv.gz' USING org.apache.pig.piggybank.storage.CSVLoader() AS (column:type);
Please suggest what I can do here. I want the name value to stay in a single column.
I would do it in the following steps (pseudocode):
1. regex replace all commas inside quotes with a temporary placeholder
e.g. 1,"Amrit,kumar",India -> 1,Amrit$$kumar,India
2. regex replace all remaining commas with the new delimiter
e.g. 1,Amrit$$kumar,India -> 1|Amrit$$kumar|India
3. regex replace all temporary placeholders back to commas
e.g. 1|Amrit$$kumar|India -> 1|Amrit,kumar|India
4. reload the data using | as the delimiter
Note that step 1 could be done in two stages using CSVLoader: a regex that only creates the temporary placeholder but leaves the quotes in place, then CSVLoader to remove the quotes.
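The steps above can be sketched outside Pig to check the string transformations. This is a minimal illustration in Python, not Pig code: the function name `requote_line` and the constant `PLACEHOLDER` are made up for the example, and it assumes the data never contains the `$$` marker or escaped quotes inside a quoted field. In Pig the same idea would presumably go through the built-in REPLACE function, whose escaping rules are not shown here.

```python
import re

# Temporary marker from the steps above; assumed never to occur in the real data.
PLACEHOLDER = "$$"


def requote_line(line: str, new_delim: str = "|") -> str:
    """Turn a comma-delimited line with quoted fields into a new_delim-delimited line."""
    # Step 1: inside each double-quoted field, swap commas for the placeholder
    # and drop the surrounding quotes.
    step1 = re.sub(
        r'"([^"]*)"',
        lambda m: m.group(1).replace(",", PLACEHOLDER),
        line,
    )
    # Step 2: every comma that is left is a real field separator.
    step2 = step1.replace(",", new_delim)
    # Step 3: put the protected commas back.
    return step2.replace(PLACEHOLDER, ",")


if __name__ == "__main__":
    print(requote_line('1,"Amrit,kumar",India'))  # prints 1|Amrit,kumar|India
```

The result can then be loaded with | as the delimiter (step 4), so the name stays in one column. A replacement function is used in step 1 so that the `$` characters in the placeholder are inserted literally.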
Not sure exactly what you mean by "cannot alter my file here not even temp file,permission issues." Please elaborate (e.g. which file, and how does this relate to the answer above?). Looking forward to following up once I have more information.
What @Greg Keys mentioned should work. You are not actually altering the file: when you process it in Pig, you transform the data as it passes through Pig, but the base file remains the same.
Alternatively, try using "org.apache.pig.piggybank.storage.CSVExcelStorage()" to load the CSV file. It should handle the quotes by default. Hope that helps!