I am wondering what Cloudera's recommendations are for JournalNode hardware.
1: Do they need the same CPU and memory as the NameNode?
2: Since JournalNodes write to their own hard disks first, I am pretty sure SSDs should be used.
3: I am pretty sure they need 10 GbE NICs.
4: Is it a good idea to run a JournalNode on the same host as the NameNode? The NN and JN can write to different disks in this case.
We have very busy NameNodes; our data change rate is > 1 PB/day.
1. No, they don't. JournalNodes just write the edits from the active NN to the journal.
They can operate with far less RAM and CPU than an NN (which keeps the whole namespace metadata in memory).
2. Disk writes to the journal are sequential. An SSD may improve latency, but that depends on your cluster's workloads (interactive vs. batch, etc.).
3. Same as above. Consider this: an edit is written to the journal when:
- a new file/directory is created
- a file size is updated (only when a new block is appended, or the file is closed/synced)
- a file/directory is removed
If you have 128 MB HDFS blocks on average, then 1 PB is about 8 million blocks, which spread over a day is roughly 100 new blocks per second, not that much. Of course, if you are creating/deleting lots of small files, that will add more edits.
4. If you have beefy hardware, you can do that, as long as the NN and JN write to strictly different disks.
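
As a rough check on the block estimate in point 3, here is a small Python sketch. It assumes binary units (1 PB = 2^50 bytes, 128 MB = 2^27 bytes), that all changed data lands in full-size blocks, and that the load is spread evenly over the day. The function names are made up for illustration, and real edit counts will be higher because each file also generates create/close edits.

```python
# Back-of-the-envelope estimate of how many HDFS blocks a daily data
# change volume implies. Assumptions (not from the thread): binary units,
# every block is full, and the load is evenly spread over 24 hours.

PIB = 2**50                      # bytes in 1 PB (binary)
MIB = 2**20                      # bytes in 1 MB (binary)
SECONDS_PER_DAY = 24 * 60 * 60


def blocks_per_day(bytes_per_day: int, block_size: int) -> int:
    """Number of blocks needed to hold the daily volume (rounded up)."""
    return -(-bytes_per_day // block_size)  # ceiling division


def blocks_per_second(bytes_per_day: int, block_size: int) -> float:
    """Average block allocations per second over one day."""
    return blocks_per_day(bytes_per_day, block_size) / SECONDS_PER_DAY


daily_blocks = blocks_per_day(1 * PIB, 128 * MIB)
rate = blocks_per_second(1 * PIB, 128 * MIB)
print(f"{daily_blocks:,} blocks/day, about {rate:.0f} blocks/s")
```

Passing a smaller size as the second argument (for example `1 * MIB` as an average small-file size) shows how quickly the rate grows when the same volume is made up of small files.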
In general, are you experiencing performance issues that could be related to this?
In CM 5.x, there is a metric called journal_rate. It shows you the number of journal operations per second.
You can check that to see whether you would actually benefit from such hardware :).
Thanks for the suggestions. We are planning to enable NameNode HA with JNs and would like to see suggestions from Cloudera. We do have lots of small files, and the NN is quite busy. We plan to install the JNs on the active and standby NameNode hosts to take advantage of the powerful hardware those boxes have. Of course, we will let the JNs write to their own disks.