- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Hive partitioning
- Labels:
-
Apache Hive
Created 09-27-2021 01:49 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have a huge dataset and I want to split it into partitioned table, but I'm wondering, the second column is date in int and I wonder is it possible to split table based on this, because I don't want to create three more columns with year, month and day.
Created 09-28-2021 08:11 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Anitauzanna you could do it exactly the same way.
e.g.
CREATE TABLE final_table(col1 int, col3 int) PARTITIONED BY (year int, month int, day int) stored as orc;
INSERT OVERWRITE TABLE final_table PARTITION(year,month,day) select col1, col3, substring(col2,1,4), substring(col2,5,6), substring(col2,7,8) from orig_table;
Created 09-27-2021 10:46 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Anitauzanna could you explain the meaning of below
the second column is date in int
Could you please provide an example for the scenario??
Created 09-28-2021 12:25 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yeah, i mean my table definition is something like this: fact1,fact2, date eg. (20190213), fact 4. What I want to do, is to create partitioned table based on year, month and day, but without creating three more fields with year, month and day. I worry that in approach of first creating "working" table and then putting it into final table it might get too slow, as everyday it's huge amount of data.
Created 09-27-2021 02:22 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Anitauzanna sure you could do that.
I believe you have date in the int format, e.g. 20210927 . Correct me if I am wrong.
You could create a non-partitioned table and load the dataset into it.
Once done, create the final partitioned table, with the 2nd column as a partition.
e.g.
CREATE TABLE table_final (col1 int, col3 int) PARTITIONED BY (col2 int) STORED AS orc;
Set the following Hive properties, and then load the data to the new partitioned table:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE table_final PARTITION(col2) SELECT col1,col3,col2 FROM <first table>;
I hope this answers your question. If it does please accept this as the solution.
Created 09-28-2021 12:18 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yeah I think it should work, but correct me if I'm wrong, in the new table, date will be at the end of table? Also, I'm trying to partition it based on year then month then day, and I wonder if I can do it just by substring, and without dividing data field into 3 fields.
Created 09-28-2021 03:50 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Anitauzanna partition column will appear at the end of the table if you query it from hive cli or beeline.
But partition col data will not be part of the actual table data when you check in hdfs. They will appear as directories in hdfs filesystem.
Answering your second question, yes, you could use substring(or regexp_extract) to take part of a column and use it as a partition. check : https://community.cloudera.com/t5/Support-Questions/Hive-partitions-based-on-date-from-timestamp/td-...
Created 09-28-2021 07:36 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yeah I saw that, but I'm not really sure if it will help me, as I want to first do partition by a yeah then month then day so the path would be like this /year/month/day/file so for example if person that runs query only know year and month it won't go through entire database.
Created 09-28-2021 08:11 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Anitauzanna you could do it exactly the same way.
e.g.
CREATE TABLE final_table(col1 int, col3 int) PARTITIONED BY (year int, month int, day int) stored as orc;
INSERT OVERWRITE TABLE final_table PARTITION(year,month,day) select col1, col3, substring(col2,1,4), substring(col2,5,6), substring(col2,7,8) from orig_table;
Created 07-15-2022 07:25 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Could help to explain what's the meaning of first 2 sentence?
Set Hive.....
What would happen if we don't have 2 sentences? How would that impact our query?
Thanks.
