I'm new to this technology. We receive the file which may contain backdated records. From which we have to load the data into the hive table which will be insert only(partitioned on trans_Date). I need to know what should be the mechanism to insert the records in the table, where trans_date is backdated. Trans_date(column) is the transaction date and Record_date(column) is the date on which the record is inserted in the table.
Hey @Prathamesh H!
My suggestion for backfill in Hive, is to use partition on hive tables.
For each new (or old) data, you will create a new partition and they will be separately from hot/cold data.
BTW: you two types of partition: dynamic and static. Usually when you want to assume that your partition will be a value from a source column ca use dynamic, or if you intended to add manually, use static.
Here's some good link with further details.
And if you're planning to test you hive tables, I'd suggest to use EXTERNAL TABLES, they give you freedom to drop the schema of the table whenever you want, without deleting the data from the table.
Here's an example:
#My data [hive@hwc1225-node4 ~]$ cat simple_ds Vinicius,27 John,28 Somebody,0 Death,99999 WalkingDead, #Creating the table (i'm using external table) hive> CREATE EXTERNAL TABLE simple_ds( > name STRING, > age INT > ) > PARTITIONED BY(year string) > LOCATION '/user/hive/warehouse/simple_ds'; hive> ALTER TABLE simple_ds ADD PARTITION(year=2017); OK Time taken: 0.321 seconds hive> show partitions simple_ds; OK year=2017 Time taken: 0.569 seconds, Fetched: 1 row(s) #CREATING THE PARTITION OUTSIDE OF HIVE [hive@hwc1225-node4 ~]$ hdfs dfs -mkdir /user/hive/warehouse/simple_ds/year=2002 #Backing to Hive to refresh its partitions [hive@hwc1225-node4 ~]$ hive hive> msck repair table simple_ds; OK Partitions not in metastore: simple_ds:year=2002 Repair: Added partition to metastore simple_ds:year=2002 Time taken: 4.479 seconds, Fetched: 2 row(s) hive> show partitions simple_ds; OK year=2002 year=2017 Time taken: 0.904 seconds, Fetched: 2 row(s) #Feeding the simple_ds table hdfs dfs -put simple_ds /user/hive/warehouse/simple_ds/year=2002/ #Oh damn i forgot to put some comments on the table and the delimiter [hive@hwc1225-node4 ~]$ hive hive> drop table simple_ds; hive > CREATE EXTERNAL TABLE simple_ds( > name STRING, > age INT > ) COMMENT 'Table Simple_DS for HCC' > PARTITIONED BY(year string) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > LOCATION '/user/hive/warehouse/simple_ds'; hive> msck repair table simple_ds; OK Partitions not in metastore: simple_ds:year=2002 simple_ds:year=2017 Repair: Added partition to metastore simple_ds:year=2002 Repair: Added partition to metastore simple_ds:year=2017 Time taken: 0.689 seconds, Fetched: 3 row(s) hive> select * from simple_ds; OK Vinicius 27 2002 John 28 2002 Somebody 0 2002 Death 99999 2002 WalkingDead NULL 2002 Time taken: 1.119 seconds, Fetched: 5 row(s)
Hope this helps!