Pig processing incorrect data using piggy bank jars

Rising Star

I have a file structured as follows:

ID,Name,Address

1,"Amrit,kumar",India

2,"Vaibhav,arora",USA

3,"Deepika,kumar",Germany

Obviously, if I load it with PigStorage(','), the three fields are split into four and the data spills over into the next column. Alternatives I have tried:

  1. I tried the piggybank jars, but the issue still exists and the data still spills. The script is below:

    A11 = LOAD 'File.csv.gz' USING org.apache.pig.piggybank.storage.CSVLoader() as (column:type)

  2. I tried the REPLACE function as well. The file has 35k rows and the change does not take effect for all of them, so the data still spills in this case too: column values get shifted to the next column. The link I referred to is below:

    how can i ignore " (double quotes) while loading file in PIG?

  3. I tried CSVExcelStorage and CSVLoader as well.

Please suggest what I can do here. I want the Name value in a single column.

4 REPLIES

Re: Pig processing incorrect data using piggy bank jars

Guru

I would do it in the following steps (pseudocode):

1. Regex-replace all commas inside quotes with a temporary placeholder.

e.g. 1,"Amrit,kumar",India -> 1,Amrit$$kumar,India

2. Regex-replace all remaining commas with a new delimiter.

e.g. 1,Amrit$$kumar,India -> 1|Amrit$$kumar|India

3. Regex-replace the temporary placeholders back to commas.

e.g. 1|Amrit$$kumar|India -> 1|Amrit,kumar|India

4. Reload the data using | as the delimiter.

Note that step 1 could be done in two steps using CSVLoader: a regex that only creates the temporary placeholder and leaves the quotes in place, then CSVLoader to remove the quotes.
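A condensed variant of the steps above can be sketched directly in Pig, without writing any intermediate file. This is only a sketch under assumptions: it skips the temporary placeholder and instead converts just the commas outside quotes to |, the relation names (raw, body, piped, fields) are made up for illustration, and it assumes that quotes are balanced on every line and that no field contains a | character.

```pig
-- Sketch only: relation names and the '|' delimiter are illustrative assumptions.
-- Read every line as one chararray; the source file itself is never rewritten.
raw = LOAD 'File.csv.gz' USING TextLoader() AS (line:chararray);

-- Drop the header row shown in the question.
body = FILTER raw BY line != 'ID,Name,Address';

-- Turn only the commas outside double quotes into '|'.
-- The lookahead accepts a comma when an even number of quotes follows it.
piped = FOREACH body GENERATE
        REPLACE(line, ',(?=(?:[^"]*"[^"]*")*[^"]*\\z)', '|') AS line;

-- Remove the quotes, then split on '|' into at most three fields.
fields = FOREACH piped GENERATE
         FLATTEN(STRSPLIT(REPLACE(line, '"', ''), '\\|', 3))
         AS (id:chararray, name:chararray, address:chararray);

DUMP fields;
```

Keep in mind that DUMP prints tuple fields separated by commas, so a name that contains a comma can look as if it were split even when it sits in a single field. Running DESCRIBE fields; or projecting only the name column is a more reliable check.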


Re: Pig processing incorrect data using piggy bank jars

Rising Star

@Greg Keys I cannot alter my file here, not even a temp file, because of permission issues.

Re: Pig processing incorrect data using piggy bank jars

Guru

Not sure exactly what you mean by "cannot alter my file here not even temp file,permission issues." Please elaborate (e.g. which file? how does this relate to the answer above?). Looking forward to following up once I have more information.


Re: Pig processing incorrect data using piggy bank jars

@Vaibhav Kumar

What @Greg Keys has mentioned should work. You are not really altering the file: while processing it in Pig you transform the data passing through Pig, but the base file remains the same.

Alternatively, try using org.apache.pig.piggybank.storage.CSVExcelStorage() to load the CSV file. It should handle the quotes by default. Hope this helps!
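For reference, a load with CSVExcelStorage might look like the sketch below. The jar path is a placeholder for wherever piggybank.jar lives on your cluster, the field names and types are guesses based on the sample rows in the question, and the four-argument constructor with SKIP_INPUT_HEADER assumes a piggybank build recent enough to include header handling. With an older build, the no-argument constructor plus a FILTER on the header row would be the fallback.

```pig
-- Hypothetical path: point this at the piggybank.jar on your installation.
REGISTER piggybank.jar;

-- Arguments: field delimiter, multiline handling, end-of-line style, header handling.
-- Quoted fields such as "Amrit,kumar" are kept together as one value.
A11 = LOAD 'File.csv.gz'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage(
            ',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
      AS (id:int, name:chararray, address:chararray);

-- Project the name column alone to check that it was not split.
names = FOREACH A11 GENERATE name;
DUMP names;
```

Projecting a single column before DUMP avoids misreading the output, since DUMP separates tuple fields with commas.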
