Member since: 03-27-2017
Posts: 6
Kudos Received: 0
Solutions: 1
My Accepted Solutions

Title | Views | Posted
---|---|---
 | 5279 | 03-28-2017 01:06 PM
06-30-2017 05:08 AM
Yep, I was using the Web UI. The Web UI never reported the missing blocks; only an "hdfs fsck /" noted them. HDFS eventually copied the missing blocks over from the decom'd server on its own.
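For anyone hitting the same thing, something like this is what surfaced them for me; the -list-corruptfileblocks flag is an extra I'd add to narrow the output to just the affected files:

    # summary report; the totals at the end include missing, corrupt, and under-replicated blocks
    hdfs fsck /

    # list only the corrupt/missing blocks and the files they belong to
    hdfs fsck / -list-corruptfileblocks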
06-30-2017 05:08 AM
The directory I copied it to was already a known data directory. The subdirs already existed, in fact, so I assume HDFS was aware of them. In the end, HDFS actually copied the missing blocks over from the decommissioned node. It's just annoying that an fsck reports those blocks as "MISSING" when it knows where they are and that it's going to copy them eventually.
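If anyone else tries the manual copy, a possible way to nudge the NameNode without a restart is to force a block report from the DataNode that now holds the block. This assumes a Hadoop release new enough to have the command; the host and port below are placeholders (50020 is the usual DataNode IPC port on Hadoop 2):

    # ask the DataNode to send a full block report so the NameNode learns about the copied block
    hdfs dfsadmin -triggerBlockReport datanode01.example.com:50020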
06-30-2017 05:06 AM
The cluster's active, so restarting that datanode isn't in the cards. In the end, the decom process actually copied the missing blocks over from the decom'd node. Not sure why it doesn't do that immediately, as soon as it discovers the blocks aren't replicated elsewhere. To be fair, the Cloudera UI only reported under-replicated blocks; it never mentioned the missing ones, and I was able to "hdfs dfs -cat" one of the files that was reported as corrupted. The only thing that mentioned the missing blocks was an "hdfs fsck /". I'm assuming HDFS is aware of the decom process and will look for the blocks on the decom'ing server, but it doesn't note that in the fsck output, which is pretty annoying.
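In case it helps others, these are the fsck flags that give a per-block view of which DataNodes hold each replica (the path is a placeholder):

    # per-file report: block IDs plus the DataNodes currently holding each replica
    hdfs fsck /path/to/affected/file -files -blocks -locations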
06-29-2017 01:32 PM
We're in the process of decommissioning some of our older datanodes. Today, after decommissioning a node, HDFS is reporting a bunch of missing blocks. Checking HDFS, it looks like the files in question are RF1 (replication factor 1); I'm assuming someone manually set them that way for some reason. Since we're decommissioning, the actual blocks are still available in the data directories on the old node, so I happily copied one of them, along with its meta file, over to an active node. It's a different data directory, but the "subdirs" underneath "finalized" are the same. The NameNode still can't see the block, though. Is there a way for me to tell the NameNode "Hey, that block's over here now!" without actually restarting it? I know I can probably recommission the node I took down, fix the RF on the files, and then decom it again, but these are big nodes (each holds about 2 TB of HDFS data) and decommissioning takes several hours.
Labels:
- HDFS
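For context, roughly how I'd check and then fix the replication factor once the blocks are safe again (paths are placeholders):

    # for files, the second column of -ls is the replication factor; the RF1 files show "1"
    hdfs dfs -ls /path/to/suspect/dir

    # raise replication to 3 and wait until the extra replicas are actually in place
    hdfs dfs -setrep -w 3 /path/to/suspect/dir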
03-28-2017 01:06 PM
Problem solved! You pointed me in the right direction. A check of the agent log showed this error:

[28/Mar/2017 11:28:09 +0000] 7731 MainThread agent ERROR Error, CM server guid updated, expected 240da00c-05c4-4053-b8a1-5ba957dfab5f, received 46d4b8a7-c2ac-4eae-8ce6-758d94046a26

Googling that message suggested wiping out /var/lib/cloudera-scm-agent/cm_guid. Did that, and now things seem to be working fine. Thanks!
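For anyone finding this later, the fix boiled down to this, run on the affected agent host (service name is the standard one on CentOS 6):

    # remove the stale GUID so the agent re-registers with the current CM server
    rm /var/lib/cloudera-scm-agent/cm_guid
    # restart the agent so it picks up the change
    service cloudera-scm-agent restart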
03-27-2017 12:20 PM
Hi all! I'm rebuilding my sandbox cluster to use an external MySQL database, and I think I'm following Cloudera's step-by-step instructions. I created the various databases, including one for the management service itself, on my MySQL server outside the cluster. Then I wiped out all the server and agent software in my cluster so I could do a fresh yum install (I'm using CentOS). On the manager server, I reinstalled the cloudera-manager-server software and then used the scm_prepare_database.sh script to set up the connection to the external db; I got "Success".

Then I fired up cloudera-scm-server, waited for it to come fully online, logged into the web UI, and was prompted to go through the usual steps. It successfully installed the agents on all 4 of my CentOS nodes, but when it tried to distribute the parcels, it complained that all the hosts had bad health. I clicked out to the main dashboard to look at the hosts, and sure enough, all of them have "unknown" health. I assume that's because there's no management service set up yet, so I went to set that up, but it refuses to test the connection to the database because my manager server's health is bad: "Unable to test database connection for host not in good state." The server log shows more or less the same message:

2017-03-27 15:14:04,466 INFO 266434700@scm-web-16:com.cloudera.cmf.model.DbCommand: Command null(RepMgrTestDatabaseConnection) has completed. finalstate:FINISHED, success:false, msg:Unable to test database connection for host not in good state.

I don't think it's the database connection itself, because I can connect from my manager server using the mysql command-line client with my user and password. So it won't build the management service because it can't test the database; it can't test the database because the manager server has "bad" health; and all the hosts have "unknown" health because there's no management service tracking them. Argh! Any idea how I break this circle?

I don't think I screwed up the initial SCM database setup, because when I connect to that database I see a bunch of tables that must've been created by Cloudera, since I didn't create them. Other details: these are all VMware VMs running CentOS 6.8, and I'm attempting to install CM 5.10 and CDH 5.10.
Labels:
- Cloudera Manager
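For reference, the connectivity test and the prepare-database call I mean look roughly like this. The hostname, user, and password are placeholders, and the script path is where it usually lives in CM 5; the syntax is [options] <db-type> <db-name> <db-user> [<password>]:

    # confirm the CM host can reach the external MySQL server and the scm database
    mysql -h db.example.com -u scm -p scm

    # point the SCM server at the external database
    /usr/share/cmf/schema/scm_prepare_database.sh -h db.example.com mysql scm scm scm_password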