How to store the output to a variable based file in pig?

Hi,

I am using Pig to process some data, and I can successfully store the output in a folder whose name I specify. Afterwards, with the command below, I can move the file to the desired location under the desired name.

fs -mv /usr/local/hadoop/data/result/tmp/tmp8/part-m-00000 /usr/local/hadoop/data/result/Final_KPI/femto.json;   

What I want is to use dynamic names instead of femto.json, where the names are based on the output of a Pig script such as the one below:

XML_MAIN_1 = LOAD '/usr/local/hadoop/data/tar/*' using org.apache.pig.piggybank.storage.XMLLoader('md') as (x:chararray);

SERIAL_AND_DATE = FOREACH XML_MAIN_1 GENERATE XPath(x, 'md/neun'), XPath(x, 'md/mts');

SERIAL_T = FOREACH SERIAL_AND_DATE GENERATE REGEX_EXTRACT(REGEX_EXTRACT($0, '(.*),bSRName=(.*)', 1), '(.*)Fsn=(.*)', 2);

SERIAL_F = LIMIT SERIAL_T 1;   

STORE SERIAL_F INTO '/usr/local/hadoop/data/result/tmp/tmp1';

SERIAL_C = LOAD '/usr/local/hadoop/data/result/tmp/tmp1/part-r-00000' as (Serial:chararray);

Suppose I want the file under Final_KPI to be named after the value that a dump of SERIAL_C returns. How can I do that?

5 REPLIES

Re: How to store the output to a variable based file in pig?

I'm not 100% following your request, but if you have an attribute in a relation that you want to use as the basis of the dynamic file names you create, then check out https://community.hortonworks.com/questions/46839/creating-a-iterativa-loop-using-apache-pig.html#an... to see if MultiStorage is what you need.

Re: How to store the output to a variable based file in pig?

Thanks for the answer, Lester. I think it might be helpful, but I wasn't able to produce what I wanted.

This is the JSON file I have after moving my file to the desired location:

fs -mv /usr/local/hadoop/data/result/tmp/tmp8/part-m-00000 /usr/local/hadoop/data/result/Final_KPI/femto.json;

My initial idea is to have a generic fs -mv command whose target ends in something like '$KPI_DATE'-'$SERIAL'.json

The serial and date are stored inside the JSON file itself, and they are also produced separately during the script.

My json file:

{"SERIAL":"5450406299"},{"VS_MeanNumCSCall":"0"},...{"KPI_DATE":"20160630"}

So I want to produce a file named 20160630_5450406299.json
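One way to get that name, sketched outside Pig as a small shell helper: read the first line of the result file, pull out SERIAL and KPI_DATE, and build the target name from them. The function names extract_field and kpi_filename are made up for illustration, and the sketch assumes both values are plain quoted strings on a single line, as in the sample above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the value of a quoted key found in a one-line record.
# Assumes the value is a simple quoted string without embedded quotes.
extract_field() {
  local key="$1" line="$2"
  printf '%s\n' "$line" | sed -n "s/.*\"$key\":\"\([^\"]*\)\".*/\1/p"
}

# Hypothetical helper: build <KPI_DATE>_<SERIAL>.json from one record line.
kpi_filename() {
  local line="$1" serial kpi_date
  serial="$(extract_field SERIAL "$line")"
  kpi_date="$(extract_field KPI_DATE "$line")"
  # Fail instead of producing a half-empty name.
  if [ -z "$serial" ] || [ -z "$kpi_date" ]; then
    echo "missing SERIAL or KPI_DATE" >&2
    return 1
  fi
  printf '%s_%s.json\n' "$kpi_date" "$serial"
}

# Intended use against HDFS (not run here):
#   src=/usr/local/hadoop/data/result/tmp/tmp8/part-m-00000
#   name="$(kpi_filename "$(hadoop fs -cat "$src" | head -n 1)")"
#   hadoop fs -mv "$src" "/usr/local/hadoop/data/result/Final_KPI/$name"
```

Keeping the name construction in its own function makes it easy to check on a sample line before any file is moved.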

Re: How to store the output to a variable based file in pig?

If I understand the question correctly, this is a classic case of passing a parameter to the Pig script. See https://wiki.apache.org/pig/ParameterSubstitution

In your case, you would pass into the script a param like

PATH=/usr/local/hadoop/data/result/Final_KPI/femto.json

and in your script you would have

SERIAL_C = LOAD '$PATH' as(Serial:chararray);

Re: How to store the output to a variable based file in pig?

No, actually it should be the other way around: I want to store my files under dynamic names based on the output of the scripts.

SERIAL_C outputs 5450406299 for one folder and 5450406200 for another, for example. I want to use this output as a parameter when storing the output to a file, as in:

/usr/local/hadoop/data/result/$SERIAL_C/part-m-00000

Then I can move the file to the folder I want using the same parameter.
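Since Pig substitutes $parameters before the script starts running, a value computed inside a script cannot feed that same script's STORE path. A possible workaround is a two-step wrapper: let the first script write the serial, read it back in the shell, and pass it to a second script with -param. This is only a sketch; store_kpi.pig, run_store_for_serial and PIG_CMD are hypothetical names, and the digits-only check is an assumption based on the sample serials.

```shell
#!/usr/bin/env bash

# PIG_CMD can be overridden for a dry run; it defaults to the pig launcher.
PIG_CMD="${PIG_CMD:-pig}"

# store_kpi.pig is a hypothetical second script that uses $SERIAL in its STORE path,
# e.g. STORE ... INTO '/usr/local/hadoop/data/result/$SERIAL';
run_store_for_serial() {
  local serial="$1"
  # Assumption: serials are digits only. Reject anything else, because the
  # value ends up inside an HDFS path.
  case "$serial" in
    ''|*[!0-9]*)
      echo "unexpected serial: '$serial'" >&2
      return 1
      ;;
  esac
  "$PIG_CMD" -param "SERIAL=$serial" store_kpi.pig
}

# Intended use after the first script has written tmp1 (not run here):
#   serial="$(hadoop fs -cat /usr/local/hadoop/data/result/tmp/tmp1/part-r-00000 | head -n 1)"
#   run_store_for_serial "$serial"
```

The same $SERIAL value can then be reused in a later fs -mv, so the STORE path and the move stay in sync.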

Re: How to store the output to a variable based file in pig?

@Erdal Kucuk

I believe what you are looking for is the org.apache.pig.piggybank.storage.MultiStorage function, whose constructor is

MultiStorage(String parentPathStr, String splitFieldIndex)

where splitFieldIndex is the zero-based index of the field to split on, whose values become the names of the separate output files, and parentPathStr is the directory that will hold these files.

Thus you would store your data in HDFS as follows:

STORE SERIAL_F INTO '/usr/local/hadoop/data/result/Final_KPI' USING org.apache.pig.piggybank.storage.MultiStorage('/usr/local/hadoop/data/result/Final_KPI','5');

(where I am assuming the field used to name the file is in the 6th position, i.e. index 5).

Some references are below.

(Note: these references use the constructor MultiStorage(String parentPathStr, String splitFieldIndex, String compression, String fieldDel), but you may be able to get by with the simpler one shown.)

http://stackoverflow.com/questions/9314449/how-to-store-grouped-records-into-multiple-files-with-pig

http://margus.roo.ee/2014/12/18/apache-pig-how-to-save-output-into-different-places/

https://community.hortonworks.com/questions/20487/store-output-file-as-3-files-using-pig.html