Corrupted HDFS on Restart of Cluster Built with Director and CM

Contributor

I have built a 6-node cluster for a Proof of Concept on AWS using Director. The intent of this set of tests is to determine whether we can build a Cloudera cluster with Director, run a series of jobs, use CM and EC2 to stop all the servers when done, and then restart them when further jobs need to be executed, i.e., a predefined lab cluster that is charged only when in use.

 

I am having problems with corrupt HDFS files on restart. I am trying to pinpoint the right place to look for clues. Is it a 'connection' between Director and the cluster (something defined when the cluster was built), or something about using EBS and restarting the servers? Any ideas or places to look for clues would be helpful. Here is some background:

 

Setup :

 

1 Director, 1 CM, 1 Master, 3 Workers, and 1 Gateway. I am using CM/CDH 5.4.7.

 

CM and Master are m3.xlarge.

Gateway and Director are m3.medium.

Workers are m3.large.

 

I built a new AMI based on ami-8767d1ec, but upgraded to Java 1.8.0_45. The root drive has been adjusted to use 250 GB of EBS.

 

Problem:

 

The original cluster is built and starts fine. CM shows 100% green. We can add data, run MapReduce jobs on YARN, examine results, etc.

 

I then stop the CM cluster services, stop the CM Management services, and then go into EC2 to stop all servers.

 

I then restart the machines, go into the CM server, and restart the CM cluster and management services.
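
For what it is worth, both of those CM steps could also be driven from the CM REST API instead of the UI, which is how I would script the cycle if this approach pans out. The calls below are only illustrative; the host, credentials, cluster name, and API version (v10 for CM 5.4) are placeholders for your own values:

curl -u admin:admin -X POST 'http://<cm-host>:7180/api/v10/clusters/<cluster-name>/commands/stop'
curl -u admin:admin -X POST 'http://<cm-host>:7180/api/v10/cm/service/commands/stop'
# ...stop the EC2 instances, restart them later, then...
curl -u admin:admin -X POST 'http://<cm-host>:7180/api/v10/cm/service/commands/start'
curl -u admin:admin -X POST 'http://<cm-host>:7180/api/v10/clusters/<cluster-name>/commands/start'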

 

All services come up fine initially, with the exception of HDFS, which starts with an error on the canary test.

 

Then eventually HDFS goes into safe mode, Hive also triggers a metastore canary test error, both services are flagged red, and the NameNode goes down. Leaving safe mode and restarting the NameNode brings me back to the original error, but it eventually finds itself back in safe mode without a NameNode.

 

Here are some messages from the NameNode UI:

 

Safe mode is ON.
The reported blocks 193 needs additional 192 blocks to reach the threshold 0.9990 of total blocks 385.
The number of live datanodes 3 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.

878 files and directories, 385 blocks = 1263 total filesystem object(s).

Heap Memory used 166.71 MB of 990.75 MB Heap Memory. Max Heap Memory is 990.75 MB.

Non Heap Memory used 46.81 MB of 47.38 MB Committed Non Heap Memory. Max Non Heap Memory is 130 MB.

 

I examined the missing blocks and the majority are blocks for files that were put there by CDH: Oozie directories, etc.
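
(To be concrete about what I mean by examining the blocks: the commands below are the standard HDFS client checks for this, run as the hdfs user from a node with the client configuration. They are illustrative rather than a transcript of my session.)

sudo -u hdfs hdfs dfsadmin -report                  # confirm all three DataNodes have re-registered
sudo -u hdfs hdfs dfsadmin -safemode get            # check whether the NameNode is still in safe mode
sudo -u hdfs hdfs fsck / -list-corruptfileblocks    # list files with missing or corrupt blocks
sudo -u hdfs hdfs fsck / -files -blocks -locations  # show which DataNodes each block should live on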

 

Any ideas on how I might track this down?

 

Thanks

 

-  rd 

1 ACCEPTED SOLUTION

Expert Contributor

Start/stop of instances is not supported via Director. You are probably seeing corrupted blocks because ephemeral storage is lost during the stop/start cycle, and per the HDFS Durability section of the reference architecture, HDFS on EBS is not a supported configuration either: http://www.cloudera.com/content/www/en-us/documentation/other/reference-architecture/PDF/cloudera_re...
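
A quick way to confirm that on your side is to check, on one of the workers, whether the DataNode data directories sit on instance-store (ephemeral) volumes; the m3 instance types all carry instance store, and anything on it is discarded when the instance is stopped. Something along these lines (illustrative only; device names and the dfs.datanode.data.dir layout depend on your setup):

lsblk                                                                   # block devices attached after the restart
curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/   # EC2's view of the instance's device mapping
df -h /dfs/dn                                                           # or wherever dfs.datanode.data.dir points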

 

A different way to achieve your goal of paying for instances only when you need them would be to define your cluster in a configuration file, use Director to create the cluster when you need it, and tear it down afterward.
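
If you go that route, the flow with the standalone Director client is roughly the following sketch. Here my-cluster.conf stands in for a cluster definition in the format of the sample configuration files that ship with Director (for example aws.simple.conf), and any data you want to keep between runs would need to be reloaded after each bootstrap:

cloudera-director bootstrap my-cluster.conf    # provision CM and the cluster described in the file
# ... run your jobs ...
cloudera-director terminate my-cluster.conf    # tear the whole environment down when you are done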

 


2 REPLIES


Contributor

Jadair,

 

Your suggestion is an option on our list. We were hoping to save the trouble of reloading large amounts of data between uses. I will look at your references. Amazon's EMR service provides the capability I am looking for; however, I wanted very much to use Cloudera on EC2 for our solution, which is why I included these tests as part of our Proof of Concept. Thanks for the valuable input you have provided over the last couple of weeks.

 

-  rd