Member since
04-22-2016
67
Posts
6
Kudos Received
2
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1291 | 11-14-2017 11:43 AM |
 | 246 | 10-21-2016 05:14 AM |
05-24-2018
06:04 PM
I have a table in Phoenix used for ETL auditing. This table has a composite key of three columns. I recently added a secondary index on a fourth column to speed up queries on that column. However, I have noticed an odd behaviour: for all rows that were in the table before the index was created, every column other than the key columns and the indexed column is <null>. For all rows inserted after the index was created, all columns are populated. Since this is a write-heavy table (~500 000 rows inserted per day), I have used a local index. I have two questions: 1) Is this expected behaviour for a Phoenix index, or does it indicate an error on my part? 2) How can I get the <null> fields populated? I am willing to recreate the index if necessary.
- Tags:
- Phoenix
02-16-2018
12:48 PM
I am currently experiencing an issue with HDFS storage. We have (intentionally) deleted the majority of the data on our cluster. According to hdfs du, the total usage on HDFS is approximately 1 TB. However, Ambari reports the DFS used as 238.9 TB. I could understand a small discrepancy, for blocks that have not yet been deleted and the like, but a difference this large is worrying. On top of this, a huge number of the underlying disks are 100% full, and no amount of HDFS balancing changes this. HDFS has been incredibly unstable over the past few weeks, and it's possible that this is the underlying cause. Is there any way I can safely clear this space? I don't mind losing data (we are repurposing this cluster) as long as HDFS remains stable. The full disks are my bigger concern, but fixing that should also clear a lot of the falsely reported space on HDFS. Any advice will be appreciated.
11-20-2017
06:43 AM
Unfortunately the only way I could get this to work was by reverting to batch processing. With Spark streaming it remained very slow.
11-14-2017
11:43 AM
@Josh Elser Disabling hbase backups did not improve the situation. After sifting through the logs for the cleaner, I have identified the following series of warnings:
2017-11-13 06:29:11,808 WARN [server01,16000,1510545850564_ChoreService_1] master.ReplicationHFileCleaner: ReplicationHFileCleaner received abort, ignoring. Reason: Failed to get stat of replication hfile references node.
2017-11-13 06:29:11,808 WARN [server01,16000,1510545850564_ChoreService_1] master.ReplicationHFileCleaner: Failed to read hfile references from zookeeper, skipping checking deletable files
2017-11-13 06:29:11,808 WARN [server01,16000,1510545850564_ChoreService_1] zookeeper.ZKUtil: replicationHFileCleaner-0x15fb38de0a0007a, quorum=server01:2181,server02:2181,server03:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/replication/hfile-refs
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/hfile-refs
These repeat multiple times. So it appears that the replication HFile cleaner is failing due to an issue with zookeeper. We recently had some fairly severe zookeeper issues, but things have returned to a completely stable state now, apart from this. Do you have any advice for how I can move forward, either with forcing the HFile cleaner to run or with repairing the state of zookeeper?
11-12-2017
08:14 AM
Our cluster recently started having a problem with one of our ZooKeeper servers - this server consistently refuses any connections made to ZooKeeper, which has led to some problems in our cluster. This is not a network issue, as all other connections are successful. Our investigation has identified that ZooKeeper CPU usage on this server is absurdly high - ranging from 300% to 2200%. For comparison, the other servers in the quorum rarely show ZooKeeper in top at all, and when they do, the CPU usage is <1%. A misconfiguration of this ZooKeeper server seems unlikely, since the cluster is managed through Ambari - all ZooKeeper servers should have exactly the same configuration. We have restarted ZooKeeper on this machine multiple times with no improvement. Even restarting the physical host did not help. We are having authentication issues on that machine, which may contribute to the problem; however, the zookeeper user is accessible, and all commands through Ambari are successful, with no permission-denied errors. Some possibly relevant information: we are running HDP-2.6.0.3 with ZooKeeper 3.4.6, and our quorum size is three. Does anybody have any suggestions for how this can be improved or resolved?
11-03-2017
04:51 AM
Thanks for the reply, Josh. hbase.backup.enable is not defined on our cluster, so it defaults to true. I'll set it to false and see whether things get to a more reasonable level. If that doesn't work, I will turn on TRACE logging and update with extra information. In case it changes anything, we're running HDP 2.6.0.3.
11-02-2017
11:53 AM
We are having an issue with running out of disk space on HDFS. A little investigation has shown that the largest directory, by far, is /apps/hbase/data/archive. As I understand it, this directory keeps HFiles that still need to be retained, typically because of snapshots. I know that having too many snapshots is the usual culprit for a large archive directory. However, snapshots do not seem to be the issue here: /apps/hbase/data/archive is a little larger than 110 TB, while the sum of all of our snapshots is <50 TB. We have not set hbase.master.hfilecleaner.ttl, but I have read that the default is 5 minutes - this is definitely not the explanation for many of the HFiles we have, which frequently date back many months. What steps can I follow to try to reduce this usage?
10-12-2017
04:35 AM
I set that configuration through Ambari, which should propagate the config to all nodes in the cluster if I understand correctly. Should I perhaps include a trailing / in the config? I am only using /tmp for testing purposes in replicating from our dev cluster. Once this goes to our production and DR clusters, I will choose a more suitable location for the configuration files.
10-11-2017
12:52 PM
I am currently in the process of setting up replication on HBase. Normal HBase replication (based on the WALs) is set up and working well. However, I am struggling to set up replication for bulk loads as specified in [HBASE-13153]. I am running HDP 2.6. I initially thought the version of HBase was the issue, but according to the release notes this was patched in HDP 2.4.3, so the HBase version is not the issue here. I have followed the instructions provided in the link as well as I can understand them, but bulk-loaded rows are not being replicated. I have set the following configurations:
- On the source cluster: hbase.replication.bulkload.enabled=true
- On the source cluster: hbase.replication.cluster.id=source
- On the peer cluster: hbase.replication.conf.dir=/tmp/fs_conf
- On the peer cluster: hbase.replication.source.fs.conf.provider=org.apache.hadoop.hbase.replication.regionserver.DefaultSourceFSConfigurationProvider
I have copied core-site.xml, hdfs-site.xml, yarn-site.xml and hbase-site.xml from the source cluster to all of the region servers on the peer cluster, under /tmp/fs_conf/source. These are the only changes specified in the JIRA (at least as far as I understand it), yet the bulk-loaded rows are not being replicated. I suspect I have missed or misinterpreted part of the instructions. Any help will be appreciated.
07-20-2017
01:27 PM
As I understand it, you're correct - the number of partitions should match the number of regions. However, I discovered that my DataFrame was defaulting to 200 partitions, even though it comes from an RDD with only 1 partition. Coalescing into fewer partitions doesn't significantly improve performance.
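For reference, a minimal sketch of how I checked and reduced the partition count (the DataFrame here is a toy stand-in for the one built from my single-partition RDD):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="partition-check-sketch")
sqlContext = SQLContext(sc)

# Toy stand-in for the DataFrame I am writing to Phoenix
df = sqlContext.createDataFrame(sc.parallelize([(1, "a")], 1), ["ID", "VAL"])

print(df.rdd.getNumPartitions())   # in the real job this reported 200
df = df.coalesce(1)                # coalescing did not noticeably speed up the save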
07-19-2017
08:48 AM
Hi. Yes, that is what I mean. I am using Spark 1.6.2 and Phoenix 4.7. The Phoenix table has no salt buckets, and only 1 region.
07-18-2017
12:14 PM
I am using Spark-Phoenix integration to load data from a DataFrame into a Phoenix table. Unfortunately this is ridiculously slow - pushing 23 rows of 25 columns each takes 7-8 seconds. This is with two executors, meaning it's effectively twice as slow. This makes it unusable in my case, since it is planned for use in a streaming application - the records received in a 15-second window take up to a minute to load. When I look at the Spark History Server, I see two really strange things: the slowest part, by far, is 'saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55', so the problem is not in my code; and for 23 rows, 200 tasks are launched, which seems excessive. Does anybody have experience with how I could improve these loading speeds? Ideally I want to keep using Phoenix in some way, since we have secondary indexes on the table.
07-07-2017
06:00 AM
Hi Matt. I switched "Remove Trailing Newlines" to false and the number of fragments came to 66443 as you suggested. This is a little confusing to me, as when I check the original file the number of lines is 66430. However, your point is 100% correct. Thank you for opening the Jira request. While I wait for this, do you know of any useful workaround I can use in the meantime to get the number of actually emitted fragments? It would be slower, but would it be possible, after the split, to merge the fragments (which would now include no newlines) and split them again? Thanks, Mark
07-06-2017
09:52 AM
I have a flow in NiFi which splits a file into individual lines, inserts those lines into a database and, after they have all been inserted, updates a control table. The control table must only be updated after every line has been inserted. To achieve this, fragment.index is compared to fragment.count - if these are equal, then I know that every line has been processed and we can move on to updating the control table. However, recently some of our files failed to update the control table. I have written the attributes of the flow files to disk, and they show something that confuses me: the number of flow files that comes out of the SplitText processor is 66430, which matches the number of lines in the file, yet the fragment.count attribute is 66443. Does anybody know why the fragment count would be incorrect, and how I can fix this?
06-30-2017
12:53 PM
I'm experimenting with broadcast variables in PySpark at the moment, and I've noticed that whenever I create an explicit Java object using sc._jvm, I get errors when I try to broadcast these variables. Looking at the stack trace, the problem seems to be related to pickling. Does anybody know how I can broadcast such variables?
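For reference, a stripped-down reproduction of what I mean (a sketch assuming a plain SparkContext; the HashMap is just a stand-in for the real object I create through sc._jvm):

from pyspark import SparkContext

sc = SparkContext(appName="broadcast-jvm-object-sketch")

# Broadcasting ordinary, picklable Python data works fine
bc_ok = sc.broadcast({"key": "value"})

# Creating an explicit Java object through the py4j gateway...
java_map = sc._jvm.java.util.HashMap()
java_map.put("key", "value")

# ...and trying to broadcast it fails, because the py4j proxy cannot be pickled
bc_fail = sc.broadcast(java_map)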
06-30-2017
05:10 AM
It is the final step (saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55) that is slow. All the previous steps - two subtracts and a distinct - are reasonably fast. However, saving the data to Phoenix is the slow part.
06-29-2017
07:45 AM
I have a Spark ETL process which reads from a CSV file into an RDD, performs some transformations and data quality checks, converts the result into a DataFrame and pushes the DataFrame into Phoenix using Spark-Phoenix integration. Unfortunately, actually pushing the data to Phoenix is ridiculously slow - the 'saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55' portion kicks off 120 tasks (which seems quite high for a file of ~3.5 GB), and each task then runs for between 15 minutes and 1 hour. I have 16 executors with 4 GB of RAM each, which should be more than sufficient for the job. The job has currently been running for over two hours and will probably run for another hour or more, which is very long to push only 5.5 million rows. Does anybody have any insight into a) why this is so slow and b) how to speed it up? Thanks in advance.
Edit 1: The job completed in 4 hours.
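For reference, a simplified sketch of the job (the table name, ZooKeeper quorum, schema and transformations are anonymised placeholders, and this assumes the phoenix-spark plugin is on the classpath):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="phoenix-etl-sketch")
sqlContext = SQLContext(sc)

# Read the CSV, apply the (anonymised) transformations and quality checks,
# then convert the result into a DataFrame
rdd = sc.textFile("/data/input.csv").map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rdd, ["COL1", "COL2", "COL3"])

# The slow part: writing the DataFrame to Phoenix. This is what shows up as
# 'saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55' in the History Server.
df.write \
    .format("org.apache.phoenix.spark") \
    .mode("overwrite") \
    .option("table", "MY_TABLE") \
    .option("zkUrl", "zk-host:2181") \
    .save()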
06-19-2017
09:51 AM
In the hopes that this will help somebody else in the future, I will post what I have discovered. The first important realisation I had was that each file was being loaded into its own partition within the underlying RDD, and that the RDD's debug string shows the source file for each of these partitions. Knowing that, it's fairly straightforward to get the filename for each element:

filenames = []

def mapIndex(index, iterator):
    # Pair each partition's contents with the filename recorded for that partition index
    values = list(iterator)
    yield (filenames[index], values)

def f(rdd):
    # The debug string lists the source file backing each partition; skip the
    # first two lines and pull the filename out of each remaining line
    filenames[:] = []   # reset for each micro-batch so indexes match this batch's files
    debug = rdd.toDebugString()
    lines = debug.split("\n")[2:]
    for l in lines:
        filenames.append(l.split()[1].split("/")[-1])
    if not rdd.isEmpty():
        rdd = rdd.mapPartitionsWithIndex(mapIndex)
        # ... process the (filename, lines) pairs here

if __name__ == "__main__":
    # Set up SparkContext and StreamingContext
    stream = streamingContext.textFileStream(<your_directory>)
    stream.foreachRDD(f)
    streamingContext.start()
    streamingContext.awaitTermination()
The RDD initially contains only the lines of the files. After the execution of mapPartitionsWithIndex, it contains pairs of the form <filename, (line0, line1, line2, ..., linen)>. If you want the lines split out again, do the following:

rdd = rdd.flatMapValues(lambda x: x)

The RDD then contains pairs of the form <filename, line> for each line in each file. I hope this will help at least one other person.
06-15-2017
12:58 PM
I am currently developing a Spark Streaming application in Python to watch a directory and load new files into a Phoenix table. This part is fairly straightforward using Spark-Phoenix integration. However, for auditing purposes I need to keep track of which files are being loaded, and how many records come from each of them. textFileStream does not record the filename in the elements of the RDD, and the only way I have found so far to get the filenames is from the RDD's debug string. While this does give me the filenames, it in no way indicates which elements belong to which files. What would be ideal is something like SparkContext's wholeTextFiles(...), which gives back an RDD of the form (filename, contents), whereas textFileStream acts more like SparkContext's textFile(...). Is there any reasonable way to determine which elements belong to which files?
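For reference, a minimal sketch of the difference I mean (the directory path and batch interval are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="filename-tracking-sketch")
streamingContext = StreamingContext(sc, 30)

# Batch API: each element is a (filename, whole_file_contents) pair
pairs = sc.wholeTextFiles("/landing/dir")

# Streaming API: each element is a single line of text, with no filename attached
lines = streamingContext.textFileStream("/landing/dir")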
06-12-2017
05:15 AM
My problem went away on its own, strangely. I left it without restarting and, about 5 hours after the most recent restart, everything was back to normal. For some reason the start-up was very slow. Since then I've had to restart once, and that restart was quick. Sorry I can't help.
06-06-2017
09:45 AM
Today I encountered an error with NiFi that I have never seen before. We have been successfully running NiFi, managed by Ambari, for a number of months. This morning, while working in NiFi, the UI suddenly died. When refreshing the UI, I get ERR_CONNECTION_REFUSED. I have tried this in both Chrome and Internet Explorer, so it is not a browser-specific issue. I have now spent the morning trying to solve this issue. I can give the following information:
- NiFi is managed by Ambari. According to Ambari, NiFi is up.
- NiFi is SSL secured. This is not a security issue, since when switching off SSL authentication and restarting through Ambari, the non-secure UI gives the same issue.
- Our network team has confirmed that this is not a firewall issue - port 9091 is not blocked, but is refusing all connections.
- selinux is disabled.
- ps -ef | grep nifi shows NiFi is up and running on the machine.
- bin/nifi.sh status shows: 2017-06-06 11:34:47,657 INFO [main] org.apache.nifi.bootstrap.Command Apache NiFi is currently running, listening to Bootstrap on port 51190, PID=13513
- nifi-app.log stops writing anything shortly after startup. The last line is: 2017-06-06 11:16:28,561 INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded a total of 96 properties. Including precedence overrides effective accessible registry key size is 96 (That was nearly half an hour ago. Nothing since.)
I have restarted NiFi a number of times trying to solve this, but it has had no effect. Any help would be greatly appreciated.
06-02-2017
06:07 AM
hbck gives one inconsistency - a single Empty REGIONINFO_QUALIFIER. I know hbck has a tool to fix this, but I haven't run it yet. That's the only inconsistency shown. So would this indicate that the offline regions are normal? Thanks.
05-25-2017
05:27 AM
Thanks for the response. No, the 700 regions were still up, but there were also 600 offline regions. If these are the child regions from the merge, do you know when they would be cleaned up? We had to do a restore_snapshot because of some system instability, and now the online region count is correct but the offline region count is over 2000. Will these clear up on their own, or only after a master restart? We can't restart the cluster because this is a customer-facing system.
05-24-2017
09:06 AM
Yesterday I performed a number of region merges on one of our larger tables. The table had ~1400 regions to start with, but many of these were small - we wanted to get the average region size closer to our region size limit of 15 GB. The merges went well, leaving us with just over 700 regions. However, since then we have had a huge number of offline regions - nearly 600 are currently listed as offline. Does anybody know the cause of these offline regions, and how to fix them? I understand regions do sometimes go offline legitimately, but this many seems to me to indicate a problem.
05-12-2017
02:17 PM
We recently had a failure of all of the region servers in our cluster, although the active and standby masters stayed up. When the region servers were brought back up, regions were reassigned relatively quickly. However, two regions have not come back, and according to the UI they are stuck in the OFFLINE state.
I have tried running hbase hbck -repair a number of times, as well as various other options that I hoped would help (-fixAssignments, -fixSplitParents). None of these successfully brought the regions online. I checked the logs of the region servers for these regions, and there is no reference to them after they were closed prior to the region server failure.
When I checked the master logs, however, I found the following:
master.AssignmentManager: Skip assigning table_name,13153,1485460927890.3d68e485cb6294345fe1469097fa5aca., it is on a dead but not processed yet server: server05,16020,1494493877392
The server listed as dead is alive and well, with over 200 regions already assigned to it. This error message led me to HBASE-13605, HBASE-13330 and HBASE-12440, which all describe pretty much the same issue. Unfortunately, none of these JIRAs describe any way to fix the issue once it occurs. Does anybody have any advice for resolving this? This is a production system, so shutting down the master is a last resort.
04-25-2017
12:46 PM
We have a cluster running HDP 2.4 with 8 worker nodes. Recently, two of our datanodes have been going down frequently - usually both go down at least once a day, often more than that. While they can be started up again without any difficulty, they will usually fail again within 12 hours. There is nothing out of the ordinary in the logs except very long GC pauses before failure. For example, shortly before a failure this morning, I saw the following in the logs:
2017-04-25 03:49:27,529 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(192)) - Detected pause in JVM or host machine (eg GC): pause of approximately 23681ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=23769ms
I checked the free memory on that node, and it was slightly more than the free memory on a similar node that isn't shutting down. Since it is the same two nodes repeatedly, I assume the problem is something to do with the nodes themselves. Does anybody have any advice for this problem?
04-04-2017
06:26 AM
I am currently having a serious problem with the output file for Zookeeper. I recently noticed that /var/log was using 15 GB, but du -h /var/log only reported less than 1 GB.
When I checked for deleted files (lsof | grep deleted | grep /var/log) I noticed that there are a number of log files that have been deleted but are still held open. The most concerning of these is /var/log/zookeeper/zookeeper-zookeeper-server-xxxx.out, which is over 13 GB in size. In spite of being deleted, the file is still open.
Our system admin suggested restarting Zookeeper to release the file handle - unfortunately this is a production cluster, so restarting Zookeeper is an absolute last resort. I have come up with some possible solutions:
- Since we have a quorum of 3 servers, would it be realistic to restart only the Zookeeper server on the machine that is giving the problem?
- If that is not realistic, can I truncate the file (as per this article) to clear the space without causing problems for Zookeeper?
If neither of these is possible, what are my other options? Thanks in advance.
03-22-2017
12:49 PM
I am using HBase snapshots for backups in my cluster: I take weekly snapshots to facilitate recovery from HBase failure. However, something concerns me. I was under the impression that HBase snapshots store only metadata, without replicating any data, making them ideal for low-footprint backups. However, after a short time (3+ weeks) a snapshot will often be exactly the same size as the source table, sharing 0% of its data with the source table. This is a problem, since it means that keeping even a few weeks of backups can consume 25+ TB of space. Can anybody explain why this happens, and whether there is any way to avoid it?
03-17-2017
11:13 AM
I have set up my NiFi instance (NiFi 1.0) to communicate over SSL and authenticate the admin user with a certificate, based on this article. This works correctly. Now I would like to add new users to NiFi and have those users authenticate with a username/password. I know this is possible with LDAP, as well as with Ranger and Kerberos, but I would prefer to manage my users directly through NiFi. Adding new users is straightforward, but I can see no way to set a password for a user. Is there any way to achieve this? I know it is possible to create my own LoginIdentityProvider. Does anybody have an example of code that could do what I want?
03-17-2017
09:09 AM
I am currently hitting a timeout on a Hive query over an HBase table, similar to this article. As per the instructions in that article, I intend to increase the HBase timeouts, including the RPC timeout. However, before going ahead I would like to understand the potential consequences of this change for HBase and for external services that access HBase, since this is a production cluster. If someone could give me an idea of what negative consequences this might have, it would help a lot.