We have some fact tables which contain a large number of rows. We currently partition them by month. It is quite likely that in the future we will need to partition by week number. Since Hive has no UPDATE command, whenever we need to update historical data we simply drop the affected partition and create a new one, so partitioning is necessary for us.
I am wondering: is it possible to apply partitioning on the existing columns of a Hive table?
How do we handle the situation where we have to apply partitioning dynamically, based on the load?
I think dropping and recreating the table for most requirements is not a good approach.
Partitions can be added, renamed, exchanged, dropped, or (un)archived by using the PARTITION clause in an ALTER TABLE statement, as described below. To make the metastore aware of partitions that were added directly to HDFS, you can use the metastore check command (MSCK), or on Amazon EMR you can use the RECOVER PARTITIONS option of ALTER TABLE. See Alter Table/Partition below for more ways to alter partitions.
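For reference, the partition-maintenance statements mentioned above look like this (the table name `sales` and the partition values are placeholders, not from the thread):

```sql
-- Add a partition for data already written to a folder on HDFS
ALTER TABLE sales ADD IF NOT EXISTS PARTITION (month=201501)
  LOCATION '/warehouse/sales/month=201501';

-- Rename a partition
ALTER TABLE sales PARTITION (month=201501) RENAME TO PARTITION (month=201502);

-- Drop a partition (this is the drop-and-recreate pattern from the question)
ALTER TABLE sales DROP IF EXISTS PARTITION (month=201502);

-- Make the metastore aware of folders added directly to HDFS
MSCK REPAIR TABLE sales;
```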
You cannot change the partitioning scheme of a table in Hive. That would require rewriting the complete dataset, since partitions are mapped to folders in HDFS.
What you need to do is create a new table with the new partitioning scheme and load the data into it from the old table:
CREATE TABLE newpartitioning ( columns ... ) PARTITIONED BY ( month INT, day INT );
INSERT OVERWRITE TABLE newpartitioning PARTITION ( month, day ) SELECT columns ..., month, day FROM tablewitholdpartitioning;
(Classic Hive does not allow PARTITIONED BY in a CREATE TABLE ... AS SELECT, hence the create-then-insert two-step.)
Loading a large number of partitions at a time can result in bad loading patterns so be careful and follow the guidelines in my doc:
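One common way to avoid such bad loading patterns is to redistribute the rows so that each reducer writes only a few partitions. A sketch of the reload, assuming the month/day scheme above (column names col1, col2 are placeholders for the table's real columns):

```sql
-- Dynamic partitioning must be enabled for this kind of load
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- DISTRIBUTE BY routes all rows of one (month, day) to the same reducer,
-- so each reducer writes one partition at a time instead of all of them
INSERT OVERWRITE TABLE newpartitioning PARTITION (month, day)
SELECT col1, col2, month, day
FROM tablewitholdpartitioning
DISTRIBUTE BY month, day;
```

Note that in a dynamic-partition insert the partition columns must be the last columns in the SELECT list.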
Would it be good if I put one extra column in the PARTITIONED BY clause, like PARTITIONED BY ( month INT, day INT, loadbalancer INT )? The loadbalancer column in the source database, which is SQL Server, would have the value 1 for a normal load. If a source table carries a heavier load, the loadbalancer column would take more values, like 1, 2, .... We could create a stored procedure in SQL Server that updates the OLTP loadbalancer column values in case we feel we need to partition the data further, beyond month and day.
How would that work out in the long run, compared to dropping the existing dataset and recreating it?
How long will a reload take? It depends on your load settings (how many reducers; see the ppt) and the data volume. As I said, loading a large partitioned table is a bit tricky:
Regarding the partitioning scheme: you can do month and day, or just day. It depends on which columns you want to filter on. (Having just day is a bit easier for roll-in/roll-out, but multi-level partitioning is also possible.)
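The roll-in/roll-out mentioned here is just adding and dropping partitions at the edges of the retained time range; with a single-level day partition it might look like this (table name and day values are illustrative):

```sql
-- Roll in the newest day of data
ALTER TABLE fact_sales ADD IF NOT EXISTS PARTITION (day=20150103);

-- Roll out (age off) the oldest day
ALTER TABLE fact_sales DROP IF EXISTS PARTITION (day=20141201);
```

With multi-level month/day partitioning the same statements need both keys, e.g. PARTITION (month=201412, day=1), which is why single-level is a bit easier for this.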
Loadbalancer idea in a sec
We are now loading our existing historical data into Hive. The major fact tables have around 2 million or more rows. Loading 1.7 million rows took 3 hours in a VirtualBox VM with 6 cores, 24 GB RAM and a 128 GB disk.
I got your point: the loadbalancer column should be some dimension column that is mostly used for filtering.
I do not understand the LOADBALANCER idea. Partitions only help you if you can filter on the partition column. But you will never run a query like SELECT * FROM table WHERE loadbalancer = 1, right? Then Hive needs to read all folders anyway, which will actually result in worse performance in HiveServer (lots of partitions create overhead in split generation). The day partitioning only helps with queries like SELECT * FROM table WHERE day = 3;
I got your point; the loadbalancer idea does not make sense. I was just thinking of breaking the data into small datasets so that a query only has to scan a smaller dataset to prepare its output. I am moving data from one Hive table (staging, unpartitioned) to another (production, partitioned). The staging table has 1.7 million rows, but the query is failing with the error "Error: Java heap space". Do I need to increase the memory allocated to the JVM? Staging tables may grow to 5 million rows or more, so what would be a sensible value for the memory allocated to the JVM?
Look at the presentation for settings to increase your task RAM. OOM can occur, for example, during ORC creation. But normally it only happens when a task needs to write to too many partitions at a time, so using one of the redistribution mechanisms from the ppt helps. How do you load them? Sqoop directly? Is your target table ORC? Are you loading one partition (day, month) at a time or a full time range?
As explained in the ppt, dynamically loading into dozens of partitions easily leads to OOM exceptions. Redistribute your data to increase load performance and avoid OOM.
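The task-RAM knobs being referred to are the standard MapReduce memory settings, which can be set per Hive session. A sketch with illustrative values (these are assumptions to adapt to your cluster, not recommendations):

```sql
-- Raise the reducer container size and the JVM heap inside it;
-- the heap (-Xmx) is conventionally ~80% of the container size
SET mapreduce.reduce.memory.mb=4096;
SET mapreduce.reduce.java.opts=-Xmx3276m;

-- Same pair for map tasks, if the OOM happens on the map side
SET mapreduce.map.memory.mb=4096;
SET mapreduce.map.java.opts=-Xmx3276m;
```

More memory only masks the problem, though; redistributing the data so each task writes fewer partitions is the real fix.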
1. Can you tell me the URL of the presentation so that I can increase the RAM?
2. I have imported data from SQL Server into a Hive table without specifying any file format, and the import succeeded. Now I am trying to copy the data from that Hive table into another table that had the Parquet format defined at table creation, loading into all the partitions that are possible based on the combinations of the three columns.
I have used INSERT INTO TABLE t1 PARTITION (c1, c2, c3) SELECT * FROM t2 to copy from one table to the other (Parquet).
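One thing to check with that statement: in a dynamic-partition insert, Hive binds the partition columns by position, so SELECT * only works if c1, c2, c3 happen to be the last three columns of t2, and a fully dynamic insert needs nonstrict mode. A sketch of the safer explicit form (colA, colB are placeholders for t2's non-partition columns):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns must come last in the SELECT list
INSERT INTO TABLE t1 PARTITION (c1, c2, c3)
SELECT colA, colB, c1, c2, c3
FROM t2
DISTRIBUTE BY c1, c2, c3;  -- each reducer then writes only a few partitions,
                           -- which helps avoid the Java heap space error
```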