Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar
Expert Contributor

When an object gets created in Hive metastore, we expect to see the entry in Atlas almost instantaneously. Often times, might be because of disruption in availability of some of the core components on which Atlas is dependent on or the methodology that was used sometimes to create metastore objects(ex: Directly use metastore API to create Hive objects) where Atlas hook doesn't get invoked, we some times see missing tables in Atlas.

This article aims at identifying the missing table entries in Atlas. Once the missing table entries are identified, we can post them to Atlas. The easiest method of doing that will be explained in a later article(Will update the reference link in this article when I am done writing the second article).

Step#1: Identify the Hive tables from the metastore.

There are multiple ways of gathering this information. The easiest way is to write an SQL over "hive" database to get all the information. In my example, the metastore is based on MySQL. From command line on Hive metastore

mysql -u root -e "use hive;SELECT NAME, TBL_NAME FROM DBS as a, TBLS as b where a.DB_ID=b.DB_ID;" > tables.txt

sed -i '1d' tables.txt 
(The above command will get rid of the heading)

Step#2: Download the attached findMissingTablesInAtlas script to your host into your home directory or any directory of your choice. Make sure the tables.txt file with the table names is present in the same directory as well.

Step#2.1: Make sure you have the comma separated user id and password in you ~/.cred file. The user id should have nesessary previliges to post to Atlas. if you are missing those, you can grant them using Ranger policy.

$ cat ~/.cred 
<my user id>,<my password> 

Step#3: Edit the script and make changes as per your environment. Change the atlasServer and clusterName values. The script will by default read the tablex.txt from the same location as in python script.

Step#4: Thats it! Run the script using the below command:

python findMissingTablesInAtlas.txt

Step#5: After the script execution was successful, you should see the below two files in the same directory:

list_of_tables_exists_in_atlas

list_of_tables_not_exists_in_atlas

In the next posting, instructions for posting these tables to Atlas will be provided.

3,233 Views
Comments
avatar
Rising Star

If Hive is running on postgresql then :

To get tables.txt from postgresql run below command:-

psql -d hive -c "SELECT \"NAME\", \"TBL_NAME\" FROM \"DBS\" as a, \"TBLS\" as b where a.\"DB_ID\"=b.\"DB_ID\";" > /tmp/tables1.txt

Then to make tables1.txt compatible with the python script run below command.

awk '{print $1" " $3}' tables1.txt >> tables.txt 

Now open the file tables.txt and delete the first line which should be something like "----".
Now press escape key on keyboard and type below command to get it ready to be used by the findmissingtablesinatlas.txt file

Please note to get ^Icharacter you need to use the tab key.

:%s/ /^I/g 
avatar

Hi,

did you happen to reach to the end of this? How did you add the tables to atlas once you know what tables are missing?