I have a daily ingestion of data in to HDFS .
From data into HDFS I generate Hive external tables partitioned by date .
My qestion is as follows , should I run MSCK REPAIR TABLE tablename after each data ingestion , in this case I have to run the command each day.
Or running it just one time at the table creation is enough .
Thanks a lot for your answers
Yes, you need to run msck repair table daily once you have loaded a new partition in HDFS location.
Why we need to run msck Repair table statement everytime after each ingestion?
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS , the metastore (and hence Hive) will not be aware of these partitions unless the user runs either of below ways to add the newly add partitions.
1.Adding each partition to the table
hive> alter table <db_name>.<table_name> add partition(`date`='<date_value>') location '<hdfs_location_of the specific partition>';
2.Run metastore check with repair table option
hive> Msck repair table <db_name>.<table_name>
which will add metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in metastore to the metastore.
In addition if you are loading dynamic/static partitions to the final table from other temp table with hive statement(like insert into final table partition(..) select * from temp table), then you don't need to do any of the above methods because as you are using hive statement to load a partition then hive will update the metadata of the final table. All the above mentioned ways we have to do if you are directly adding a new directory in hdfs or other ways instead of hive.
Please refer to this link for more details regarding refresh hive metadata.