Support Questions

Find answers, ask questions, and share your expertise

Hive partitioning

avatar
Explorer

I have a huge dataset and I want to split it into partitioned table, but I'm wondering, the second column is date in int and I wonder is it possible to split table based on this, because I don't want to create three more columns with year, month and day. 

1 ACCEPTED SOLUTION

avatar
Master Collaborator

@Anitauzanna you could do it exactly the same way.

e.g.

CREATE TABLE final_table(col1 int, col3 int) PARTITIONED BY (year int, month int, day int) stored as orc;

 

INSERT OVERWRITE TABLE final_table PARTITION(year,month,day) select col1, col3, substring(col2,1,4), substring(col2,5,6), substring(col2,7,8) from orig_table;

View solution in original post

8 REPLIES 8

avatar
Guru

@Anitauzanna could you explain the meaning of below

 

 the second column is date in int

 

Could you please provide an example for the scenario??

avatar
Explorer

Yeah, i mean my table definition is something like this: fact1,fact2, date eg. (20190213), fact 4. What I want to do, is to create partitioned table based on year, month and day, but without creating three more fields with year, month and day. I worry that in approach of first creating "working" table and then putting it into final table it might get too slow, as everyday it's huge amount of data.

avatar
Master Collaborator

@Anitauzanna sure  you could do that.

 

I believe you have date in the int format, e.g. 20210927 . Correct me if I am wrong.

You could create a non-partitioned table and load the dataset into it.

Once done, create the final partitioned table, with the 2nd column as a partition.

 

e.g.

 

CREATE TABLE table_final (col1 int, col3 int) PARTITIONED BY (col2 int) STORED AS orc;

 

Set the following Hive properties, and then load the data to the new partitioned table:

 

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE table_final PARTITION(col2) SELECT col1,col3,col2 FROM <first table>;

I hope this answers your question. If it does please accept this as the solution.

 

avatar
Explorer

Yeah I think it should work, but correct me if I'm wrong, in the new table, date will be at the end of table? Also, I'm trying to partition it based on year then month then day, and I wonder if I can do it just by substring, and without dividing data field into 3 fields. 

avatar
Master Collaborator

@Anitauzanna partition column will appear at the end of the table if you query it from hive cli or beeline.

But partition col data will not be part of the actual table data when you check in hdfs. They will appear as directories in hdfs filesystem.

 

Answering your second question, yes, you could use substring(or regexp_extract)  to take part of a column and use it as a partition. check : https://community.cloudera.com/t5/Support-Questions/Hive-partitions-based-on-date-from-timestamp/td-...

avatar
Explorer

Yeah I saw that, but I'm not really sure if it will help me, as I want to first do partition by a yeah then month then day so the path would be like this /year/month/day/file so for example if person that runs query only know year and month it won't go through entire database.

avatar
Master Collaborator

@Anitauzanna you could do it exactly the same way.

e.g.

CREATE TABLE final_table(col1 int, col3 int) PARTITIONED BY (year int, month int, day int) stored as orc;

 

INSERT OVERWRITE TABLE final_table PARTITION(year,month,day) select col1, col3, substring(col2,1,4), substring(col2,5,6), substring(col2,7,8) from orig_table;

avatar
Explorer

Could help to explain what's the meaning of first 2 sentence? 

Set Hive.....

 

What would happen if we don't have 2 sentences? How would that impact our query?

 

Thanks.