Support Questions

Find answers, ask questions, and share your expertise

How to find missing partitions on a hive table.?

avatar
Super Collaborator

Hi,

I have a Hive Table partitioned by process_dt . so if my data ingestion process is creating a partition per day.

if i want to findout all the missing partitions at the end of the week or month etc..how can i find.?

is there a SQL or command to find that.?

Regards,

Sai

9 REPLIES 9

avatar
Expert Contributor

@Saikrishna Tarapareddy

Is it about finding missing partitions in Hive Metastore or in HDFS directories ?

You can execute " msck repair table <table_name> " command to find out missing partition in Hive Metastore and it will also add partitions if underlying HDFS directories are present.

But it will not delete partitions from hive Metastore if underlying HDFS directories are not present .

hive> msck repair table mytable;
OK
Partitions missing from filesystem:

avatar
Super Collaborator

@rtrivedi

I was asking about finding missing HDFS Directories.?? If everything goes well I should have 7 HDFS folders and Hive Partitions for a week if I partition by day. at the end of the week I want to run a command\process to check if I got all the folders\partitions and if any missing...this can be a week or month etc.,

avatar
Expert Contributor

@Saikrishna Tarapareddy

You are correct Hive Partitions are literally directories under the table name in the hive warehouse. (HDFS) You can simply inspect HDFS for that table and you will see the partition folders, and see if there is data inside them.

@rtrivedi pointed you to correct command to see if there are missing partitions that have not been added to the Hive Metastore. See msck for more detail.

If you feel this answer was helpful please mark it as the best answer. If you feel @rtrivedi was more helpful please mark it as the best answer.

avatar
Super Collaborator

@Matt Andruff

sorry , if I confused both of you. But I am not looking a way to find missing partitions when HDFS folders exists.

I am looking to find missing HDFS folders (which otherwise are Hive Partitions). I am trying to see if I can find out from a Hive command.

Regards,

Sai

avatar
Expert Contributor

@Saikrishna Tarapareddy

Got it.

You want to use the following hive command:

MSCK REPAIR TABLE table_name;

from the documentation:

...which will add metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in metastore to the metastore.

If you are hardcore and don't like that command you could figure out the existing paritions with:

SHOW PARTITIONS table_name;

and then explore the data to find what isn't added and add the partitions manually:

ALTER TABLE table_name ADD PARTITION partition_spec

avatar
Expert Contributor

I reread, you response... you can compare

SHOW PARTITIONS table_name;

with

select distinct(partion_column) from table_name;

That should work out... but I'm not really sure how you inserted the data in the first place without specifying the partition.

avatar
Super Collaborator

@Matt Andruff

both will not work. That will compare partitions. i guess there is no way to find what i am trying to do without writing a script.

I create a partition when a file arrives and on some days if files didn't come thru those partitions wont exits.

at the end of the month I want to find out on which dates I didn't get the files (so partitions wont exist) , I have to manually go and check HDFS or check show partitions command to findout missing.

I was checking to see if that can be found from hive partitions..

Regards,

Sai

avatar
Expert Contributor

Additional information that a new partition *could* be added per day. If a day was missed how would it be detected?

You could write some very esoteric SQL:

do some date math, create a table, add the values for {start-date} to {end-date},

Write a UDF that created an array for that start-date to end date

Use Explode with Lateral View to expand that into a table of values

Join this table to SHOW PARTITIONS table_name; -- never tried this but it should work, or like I said before joining on "select distinct(partition_column) from table" should work.

But even if you had a SQL script that did this... so what... it won't alert you that something is wrong. It's sql, it doesn't have scripting functions like alerts. You'd still be manually running it....so why not write a script that does alert you?


This really is a programming issue not a SQL issue. I think writing a script is likely the best idea. You already probably have a script that runs daily to insert this data... why not add a condition that if there isn't data to add, that it writes to a "No data for this day table"? Just a thought. Hope this helps.

avatar
Super Collaborator

@Matt Andruff ,

that's true , I already came up with a process.

i did that using a couple of tables , one to write all successful files with names and dates and another for missing files and dates. i populate these on the file insertion NiFi flow. i join those to find if the missing files ever come back and landed success table.

Thanks anyway for your time on this.

Regards,

Sai