Created 09-27-2021 01:49 AM
I have a huge dataset and I want to split it into partitioned table, but I'm wondering, the second column is date in int and I wonder is it possible to split table based on this, because I don't want to create three more columns with year, month and day.
Created 09-28-2021 08:11 AM
@Anitauzanna you could do it exactly the same way.
e.g.
CREATE TABLE final_table(col1 int, col3 int) PARTITIONED BY (year int, month int, day int) stored as orc;
INSERT OVERWRITE TABLE final_table PARTITION(year,month,day) select col1, col3, substring(col2,1,4), substring(col2,5,6), substring(col2,7,8) from orig_table;
Created 09-27-2021 10:46 AM
@Anitauzanna could you explain the meaning of below
the second column is date in int
Could you please provide an example for the scenario??
Created 09-28-2021 12:25 AM
Yeah, i mean my table definition is something like this: fact1,fact2, date eg. (20190213), fact 4. What I want to do, is to create partitioned table based on year, month and day, but without creating three more fields with year, month and day. I worry that in approach of first creating "working" table and then putting it into final table it might get too slow, as everyday it's huge amount of data.
Created 09-27-2021 02:22 PM
@Anitauzanna sure you could do that.
I believe you have date in the int format, e.g. 20210927 . Correct me if I am wrong.
You could create a non-partitioned table and load the dataset into it.
Once done, create the final partitioned table, with the 2nd column as a partition.
e.g.
CREATE TABLE table_final (col1 int, col3 int) PARTITIONED BY (col2 int) STORED AS orc;
Set the following Hive properties, and then load the data to the new partitioned table:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE table_final PARTITION(col2) SELECT col1,col3,col2 FROM <first table>;
I hope this answers your question. If it does please accept this as the solution.
Created 09-28-2021 12:18 AM
Yeah I think it should work, but correct me if I'm wrong, in the new table, date will be at the end of table? Also, I'm trying to partition it based on year then month then day, and I wonder if I can do it just by substring, and without dividing data field into 3 fields.
Created 09-28-2021 03:50 AM
@Anitauzanna partition column will appear at the end of the table if you query it from hive cli or beeline.
But partition col data will not be part of the actual table data when you check in hdfs. They will appear as directories in hdfs filesystem.
Answering your second question, yes, you could use substring(or regexp_extract) to take part of a column and use it as a partition. check : https://community.cloudera.com/t5/Support-Questions/Hive-partitions-based-on-date-from-timestamp/td-...
Created 09-28-2021 07:36 AM
Yeah I saw that, but I'm not really sure if it will help me, as I want to first do partition by a yeah then month then day so the path would be like this /year/month/day/file so for example if person that runs query only know year and month it won't go through entire database.
Created 09-28-2021 08:11 AM
@Anitauzanna you could do it exactly the same way.
e.g.
CREATE TABLE final_table(col1 int, col3 int) PARTITIONED BY (year int, month int, day int) stored as orc;
INSERT OVERWRITE TABLE final_table PARTITION(year,month,day) select col1, col3, substring(col2,1,4), substring(col2,5,6), substring(col2,7,8) from orig_table;
Created 07-15-2022 07:25 PM
Could help to explain what's the meaning of first 2 sentence?
Set Hive.....
What would happen if we don't have 2 sentences? How would that impact our query?
Thanks.