Support Questions

Find answers, ask questions, and share your expertise

Pig - Store a complex relation schema in a hive table

avatar
Expert Contributor

Here is my issue today. I have created a relation as the result of a couple of transformations after reading the original relation from Hive. The problem is that after the analysis I want to store the final relation back into Hive, but I can't. Let me show the code to make this clearer.

This first snippet is where I LOAD from Hive and transform the result:

july = LOAD 'POC.july' USING org.apache.hive.hcatalog.pig.HCatLoader;
july_cl = FOREACH july GENERATE GetDay(ToDate(start_date)) AS day:int, start_station, duration;
jul_cl_fl = FILTER july_cl BY day == 31;
july_gr = GROUP jul_cl_fl BY (day, start_station);
july_result = FOREACH july_gr {
    total_dura = SUM(jul_cl_fl.duration);
    avg_dura = AVG(jul_cl_fl.duration);
    qty_trips = COUNT(jul_cl_fl);
    GENERATE FLATTEN(group), total_dura, avg_dura, qty_trips;
};

So now, when I try to store the relation july_result, I can't, because the schema has changed and I suppose it is no longer compatible with the Hive table:

STORE july_result INTO 'poc.july_analysis' USING org.apache.hive.hcatalog.pig.HCatStorer();
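A quick way to see why HCatStorer rejects a relation is to compare the schema Pig has inferred with the target table's columns. This is just a diagnostic sketch; the exact output depends on your data:

```pig
-- Inspect the schema Pig inferred for the final relation
DESCRIBE july_result;
-- Then compare it field by field, name and type, with the target table
-- (in Hive, run: DESCRIBE poc.july_analysis;)
```

Any unnamed or bytearray field left over after the GROUP/FLATTEN is a likely mismatch candidate.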

Even when I tried to set an explicit schema for the final relation, I couldn't get it to work:

july_result = FOREACH july_gr {
    total_dura = SUM(jul_cl_fl.duration);
    avg_dura = AVG(jul_cl_fl.duration);
    qty_trips = COUNT(jul_cl_fl);
    GENERATE FLATTEN(group) as (day:int), total_dura as (total_dura:int),
             avg_dura as (avg_dura:int), qty_trips as (qty_trips:int);
};

PS: the table already exists in Hive!

1 ACCEPTED SOLUTION

avatar

@Andres Urrego I faced a similar issue. It happens when the implicit cast does not work correctly, so you must cast explicitly.

Also, I recommend making avg_dura a float to avoid loss of precision.

Replace the GENERATE FLATTEN statement in your query with explicit casts, as shown below. Note that group here is a tuple of (day, start_station), so FLATTEN must be given a two-field schema:

GENERATE FLATTEN(group) AS (day:int, start_station:chararray),
         (int)total_dura AS total_dura:int,
         (float)avg_dura AS avg_dura:float,
         (int)qty_trips AS qty_trips:int;

If this does not help, then please share sample data and the schema of the existing Hive table, and I will look into it.

View solution in original post

7 REPLIES 7


avatar
Expert Contributor

Thanks so much @Dinesh Chitlangia. I finally set the output format like this:

GENERATE FLATTEN(group) AS (day, code_station), (int)total_dura as total_dura:int,
         (float)avg_dura as avg_dura:float, (int)qty_trips as qty_trips:int;

(screenshot attached: fotmat.png, showing the resulting schema)

Now before storing the output in HIVE I have created the table below:

CREATE TABLE july_analysis
(day int,code_station int, total_dura double,avg_dura float,qty_trips int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

My problem now is when I try to store the data: I get back a message saying:

STORE july_result INTO 'poc.july_analysis' USING org.apache.hive.hcatalog.pig.HCatStorer ();
2017-08-14 09:56:55,712 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias july_result

I saved the output as a file to confirm that everything was coming out right, and that worked. Also, when I open my Pig console I type pig -x tez -useHCatalog.

Thanks for any info you can provide, I appreciate it.

Andres U,

avatar

@Andres Urrego - Sounds good. I would love to replicate this scenario and help you. Can you share the POC.july file that you are loading in Pig? Even a sample, like the first few lines of the input file, would help.

avatar
Expert Contributor

I couldn't upload it here because it's a bit too large, so please download it from my OneDrive just by clicking here.

Thanks buddy!

avatar

Thanks for sharing the data. The issue has been resolved; see the new answer. Please consider accepting and upvoting it.

avatar
@Andres Urrego

There were multiple issues in the script you posted. Also, thank you for sharing the test data; it helped me spot further issues.

1. Your input file in Pig is a CSV, but you are using HCatLoader.
This is incorrect; HCatLoader is only used when you want to load data into Pig from a Hive table.
In your case, you can put the data in a folder in HDFS and then load it from that path in Pig.

2. Since you loaded data into Pig without specifying a schema, you must refer to the columns by position, like $0, $1 and so on, instead of referring to them by name like start_date, duration, etc.

Refer to this page to understand the LOAD statement in Pig.
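As a minimal sketch of positional references (the path and column order are assumptions based on this thread's sample data):

```pig
-- No schema in the LOAD, so columns are addressed by position
july = LOAD '/hdfspath/july.csv' USING PigStorage(',');
-- $0 is assumed to be start_date and $4 the duration, per the sample data
trimmed = FOREACH july GENERATE (chararray)$0 AS start_date, (int)$4 AS duration;
```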

3. When using the ToDate function in Pig, either your data must match the default date format, or you must specify the date format that your data uses. Refer to the script I have added below to see this.

Click here to understand the syntax of ToDate() in Pig.

Click here to understand how to create a date format string.
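For example, a timestamp such as 2017-07-31 08:15 (a made-up value in the format this thread's data appears to use) needs an explicit pattern:

```pig
-- ToDate(value, pattern) parses the string with the given format;
-- GetDay then extracts the day of the month from the resulting datetime
day_of = FOREACH july GENERATE
    GetDay(ToDate((chararray)$0, 'yyyy-MM-dd HH:mm')) AS day:int;
```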

4. When storing data from Pig into Hive using HCatStorer, ensure that the Pig alias has field names and data types, to avoid any mismatch or type-casting issues.

After going through the sample data and replicating the issue, I was able to solve and load the data correctly.

Here is the Pig script:

july = LOAD '/hdfspath/july.csv' USING PigStorage(',');
july = FILTER july BY $0 != 'start_date';  -- remove the header line
july_cl = FOREACH july GENERATE
    GetDay(ToDate((chararray)$0, 'yyyy-MM-dd HH:mm')) AS day:int,
    $1 AS start_station:chararray,
    (int)$4 AS duration:int;
jul_cl = FILTER july_cl BY day == 31;
july_gr = GROUP jul_cl BY (day, start_station);
july_result = FOREACH july_gr {
    total_dura = SUM(jul_cl.duration);
    avg_dura = AVG(jul_cl.duration);
    qty_trips = COUNT(jul_cl);
    GENERATE FLATTEN(group) AS (day, code_station),
             (double)total_dura AS total_dura:double,
             (float)avg_dura AS avg_dura:float,
             (int)qty_trips AS qty_trips:int;
};
STORE july_result INTO 'poc.july_analysis' USING org.apache.hive.hcatalog.pig.HCatStorer();

In the above script the STORE goes INTO 'poc.july_analysis'. Ensure there is a database in Hive called 'poc' and that the 'july_analysis' table exists in that database.

If the poc database does not exist, then either create it or avoid using it (create july_analysis in the default Hive database).
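If you need to create the database first, a minimal Hive snippet (safe to re-run) would be:

```sql
-- IF NOT EXISTS avoids an error when the database is already there
CREATE DATABASE IF NOT EXISTS poc;
USE poc;
```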

Here is the Hive table DDL I used:

CREATE TABLE july_analysis (
    day int,
    code_station string,
    total_dura double,
    avg_dura float,
    qty_trips int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

avatar
Expert Contributor

You are amazing; I really appreciate each of your comments and the time you have put in. Thanks so much. Just to let you know, the part I forgot to mention is that before going to Pig, I load the file into a Hive table within the POC database, which is why I used:

july = LOAD 'POC.july' USING org.apache.hive.hcatalog.pig.HCatLoader;

The data coming from Hive already has a schema, and the relation in Pig will match that same schema.

The problem is that even after setting a schema for the output, I'm not able to store the outcome in a Hive table 😞. So to reproduce my real scenario you should:

1. Load the CSV file into HDFS without headers (I delete them beforehand to avoid filtering):

run: tail -n +2 OD_XXX.csv >> july.csv

2. Create the table and load the file:

Hive:

create table july (
    start_date string, start_station int, end_date string,
    end_station int, duration int, member_s int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/andresangel/datasets/july.CSV'
OVERWRITE INTO TABLE july;

3. Follow my script posted above to the end, to try to store the final outcome in a Hive table 🙂


thanks buddy @Dinesh Chitlangia