Created 07-15-2016 08:26 AM
I have an 8-node cluster running HDP 2.4. I currently have 4 regions on a large table that are stuck in the FAILED_OPEN state. When I check the logs for the region servers I see a FileNotFoundException, indicating (I believe) that the HFile does not exist. I have tried an OfflineMetaRepair in order to remove the entries, but this did not help. The directories for these regions exist, but they do not contain any data. Can anybody suggest a way to repair this? If I need to perform manual surgery on the META table, can someone guide me to do this correctly?
Created 07-15-2016 08:39 AM
Here you can find more detailed information:
The FileNotFoundException comes from the split daughters not being able to find the files of the parent region. The parent region's files might already have been deleted at this point. HBCK has a flag to fix this, but if only a handful of regions/files are affected, I usually prefer to manually move the reference files out of the HBase root directory.
For reference, here is the high-level flow:
1. Go to the region server's log, find the file name in the FileNotFoundException, and copy it.
2. Check HDFS to see whether the file is really not there.
3. Figure out whether this is an actual HFile or a reference file. HFiles are named like <region_name>/<column_family>/<UUID>, while reference files are named like <region_name>/<column_family>/<UUID>.<parent_region_name>.
4. If the missing file does not belong to the region that is throwing the exception, then the cause is a reference file referring to the missing file. Find that reference file (which should be very small) and move it out of the daughter region's directory. Note that the reference file name should contain the actual UUID of the referred file and the parent region's name.
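To make the naming check in the flow above concrete, here is a rough Python sketch that classifies a path copied from the log as an HFile or a reference file, and tells you whether it belongs to a different region than the one failing to open. It only looks at the string, it does not talk to HDFS or HBase. The function names, the example paths, and the assumption that both the UUID and the parent region name are plain lowercase hex strings are mine, so adjust the patterns if your file names look different.

```python
import re
from typing import NamedTuple, Optional

# Assumption: store file names are lowercase hex strings, and a reference
# file is named "<UUID>.<parent_region_name>" as described in the flow above.
HFILE_RE = re.compile(r"^[0-9a-f]+$")
REFERENCE_RE = re.compile(r"^([0-9a-f]+)\.([0-9a-f]+)$")


class StoreFile(NamedTuple):
    kind: str                     # "hfile", "reference" or "unknown"
    region: str                   # region directory the file lives in
    family: str                   # column family directory
    referred_file: Optional[str]  # UUID of the referred HFile (references only)
    parent_region: Optional[str]  # parent region name (references only)


def classify_store_file(path: str) -> StoreFile:
    """Classify a path ending in <region_name>/<column_family>/<file_name>."""
    parts = path.rstrip("/").split("/")
    if len(parts) < 3:
        raise ValueError("expected .../<region_name>/<column_family>/<file_name>")
    region, family, name = parts[-3:]
    ref = REFERENCE_RE.match(name)
    if ref:
        return StoreFile("reference", region, family, ref.group(1), ref.group(2))
    if HFILE_RE.match(name):
        return StoreFile("hfile", region, family, None, None)
    return StoreFile("unknown", region, family, None, None)


def belongs_to_other_region(missing_path: str, failing_region: str) -> bool:
    """True if the missing file lives under a region other than the failing one."""
    return classify_store_file(missing_path).region != failing_region


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    missing = "/hypothetical/root/mytable/9999ffff/cf/0123abcd"
    reference = "/hypothetical/root/mytable/1111aaaa/cf/0123abcd.9999ffff"
    print(classify_store_file(missing))
    print(classify_store_file(reference))
    print(belongs_to_other_region(missing, "1111aaaa"))
```

If the missing file is an HFile under another region, the file to sideline is the reference file in the failing (daughter) region whose name starts with the same UUID and ends with that other region's name.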
Created 07-15-2016 08:35 AM
This can happen if the RS went down during region splitting (this has been fixed in the latest versions). You need to sideline the reference files of the region that is FAILED_OPEN and restart the RS. If you share the logs, we can suggest which files should be sidelined.
Thanks,
Rajeshbabu.
Created 07-15-2016 09:36 AM
Thank you so much for your help. The FileNotFound was referencing a different region to the one it was loading, and the issue was due to reference files. I moved each of these directories out of /apps/hbase (there were only four, so it was easy). After that I ran OfflineMetaRepair. Once I started HBase it loaded every region as it should. As a precaution I ran hbase hbck -repair and hbase hbck -repairHoles after this, and everything is fine now. Data is available for both reading and writing, and there are no regions in transition. Once again, thank you for your help.
Created 07-15-2016 05:18 PM
What Rajesh said above is right. You can use
hbase hbck -repair
to automatically fix the issue. In recent versions of HDP-2.4, you should not have experienced the bug that might cause this, but there may be something else wrong. Did you check whether HDFS is healthy?
Created 07-18-2016 04:37 AM
I had already tried hbase hbck -repair as well as -repairHoles prior to posting the question, with no success. We had some problems with HDFS preceding this issue. HDFS showed itself as healthy, but it had previously been corrupt. I believe this was the underlying cause of the issue. We now have HBase stable again. I added a comment to the accepted answer explaining how I solved the issue on my side. Thanks for the help.