Created 03-23-2021 12:10 PM
Hi All,
I have created an external table pointing to an HDFS location where data gets stored for everyday logs.
Details:
location: /user/data/year=2021/
partitions: month and day
hdfs dfs -ls /user/data/year=2021/
Found 5 items
drwxr-xr-x - user user 0 2021-03-19 16:53 /user/data/year=2021/month=03/day=18
drwxr-xr-x - user user 0 2021-03-20 16:04 /user/data/year=2021/month=03/day=19
drwxr-xr-x - user user 0 2021-03-21 16:59 /user/data/year=2021/month=03/day=20
drwxr-xr-x - user user 0 2021-03-22 16:57 /user/data/year=2021/month=03/day=21
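For reference, the table DDL looks roughly like this (the columns and row format are illustrative placeholders, not my actual schema):
CREATE EXTERNAL TABLE logs (log_time STRING, message STRING)
PARTITIONED BY (month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/data/year=2021/';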
Is there a way for my external table's partitions to be updated automatically when new files are added to the HDFS path?
Currently the table only gets updated when I run the repair manually:
hive> msck repair table <table_name>;
Please let me know if there is any way to update the table automatically when the location gets updated.
Thank You!
Created 03-30-2021 01:48 PM
Hi,
Can someone please help me with this?
Created 03-31-2021 07:59 AM
The "msck repair table ..." command does not really read new data files, but adds new partitions (subdirectories in HDFS) in table metadata.
What you could do is create all the partition directories in advance (for a month or more), initially empty, and run the "repair" command just once:
hdfs dfs -mkdir -p /user/data/year=2021/month=04/day=01
...
hdfs dfs -mkdir -p /user/data/year=2021/month=04/day=30
hive> msck repair table <table_name>;
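To avoid typing thirty mkdir commands by hand, a small shell loop can create them all; this is just a sketch, assuming the zero-padded day names shown in your listing:
for d in $(seq -w 1 30); do
  hdfs dfs -mkdir -p /user/data/year=2021/month=04/day=$d
done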
When you put your log files inside one of these directories, they will be immediately visible from Hive (just set the correct permissions using Ranger or the hdfs CLI).
You can repeat these operations (create directories and "repair table") as part of your regular log maintenance, since you presumably already have a policy for removing old logs; a sketch of such a job follows.
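If you want this fully automated, the same two steps can be wrapped in a script and scheduled with cron, for example on the first of every month. This is only a sketch; the script path and table name are placeholders:
#!/bin/bash
# Pre-create this month's day directories, then refresh Hive metadata.
# Naively always creates 31 days; the extra empty partitions are harmless.
ym=$(date +year=%Y/month=%m)
for d in $(seq -w 1 31); do
  hdfs dfs -mkdir -p /user/data/$ym/day=$d
done
hive -e "msck repair table <table_name>;"
Crontab entry to run it at 01:00 on the first of each month:
0 1 1 * * /path/to/create_partitions.sh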
Hope this helps