Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Views | Posted |
---|---|
5432 | 08-12-2016 01:02 PM |
2205 | 08-08-2016 10:00 AM |
2613 | 08-03-2016 04:44 PM |
5522 | 08-03-2016 02:53 PM |
1430 | 08-01-2016 02:38 PM |
02-12-2016
11:30 PM
3 Kudos
Q5: The JournalNodes are for the transaction log, but there is still the fsimage, which is the completely built filesystem image. This is the "checkpoint". If a NameNode restarted, it would take forever to rebuild this image from the JournalNode transaction log. Instead it reads the last version of the fsimage it has on disk and then applies any transactions it is still missing. Writing the fsimage takes time, so it is not done by the active NameNode. In the old (non-HA) setup, the secondary NameNode would copy over the transaction log of the active NameNode (scp or similar), merge it with the last fsimage, write it out again, and then copy it back to the active NameNode. In the HA setup this is similar, but the failover NameNode already has a current version of the image in memory; it just needs to save it to disk and copy it over to the fsimage folder of the active NameNode once in a while.

Q3/Q2: Chris might have a better idea. I think it's clear that RAID is better for master nodes, to reduce the likelihood of failure in the first place. In other words, assuming you have only three master nodes and have to colocate a JournalNode and a NameNode, I would rather have the NameNode and JournalNode point to the same RAIDed disk than to two unRAIDed ones. Regarding performance, I have seen problems with big NameNodes during the rebuild of the fsimage after a failure, but this was not due to disk performance; the bottleneck was in the NameNode memory, building up the hashmap.
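For reference, how often the failover NameNode writes that checkpoint is controlled by two HDFS settings. Below is a minimal sketch of reading them via the Hadoop `Configuration` API; the fallback values shown are just the usual Hadoop defaults, not anything specific to this thread.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml"); // make sure HDFS settings are picked up

        // The standby/failover NameNode writes a new fsimage after this many seconds ...
        long periodSecs = conf.getLong("dfs.namenode.checkpoint.period", 3600);

        // ... or after this many uncheckpointed transactions, whichever comes first.
        long txns = conf.getLong("dfs.namenode.checkpoint.txns", 1000000);

        System.out.println("Checkpoint every " + periodSecs + "s or " + txns + " transactions");
    }
}
```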
02-12-2016
11:16 PM
1 Kudo
That is great! Thanks Chris
02-12-2016
06:14 PM
1 Kudo
If you can run the query in one go, i.e. if you have 50 task slots free in your cluster, it would theoretically be fastest to run 50 tasks at the same time. So for small data volumes, smaller block sizes result in more tasks and more parallelism, and therefore more speed. Smaller blocks give you high parallelism and fast response times, but they come with a task-creation overhead. I tried to give examples of the concrete cases in which you would see better results with small or big blocks.
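To make the tradeoff concrete, here is a small back-of-the-envelope sketch; the file size and block sizes are made-up numbers, and "one map task per block" is the usual rule of thumb, not an exact guarantee.

```java
public class BlockSizeTasks {
    public static void main(String[] args) {
        long fileBytes = 10L * 1024 * 1024 * 1024; // a hypothetical 10 GB input file

        // Roughly one map task is created per block:
        for (long blockMb : new long[]{64, 128, 256}) {
            long blockBytes = blockMb * 1024 * 1024;
            long tasks = (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
            System.out.printf("%d MB blocks -> %d map tasks%n", blockMb, tasks);
        }
        // 64 MB blocks  -> 160 tasks: high parallelism, but more task-startup overhead
        // 256 MB blocks -> 40 tasks : fewer, longer tasks; better when slots are scarce
    }
}
```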
02-12-2016
06:06 PM
1 Kudo
I need to find that paper sometime. I have actually had that question from a couple of customers. I was sure 3x replication is safe, but I didn't have any data to back me up.
02-12-2016
05:39 PM
Are you sure that you are using the new sandbox and the Spark version is actually 1.3.1 or higher? It sounds like an error you would get in Spark 1.2
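If in doubt, you can print the running Spark version directly; a minimal sketch with the Java API is below (the "local" master is just a placeholder, use whatever the sandbox provides).

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PrintSparkVersion {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("version-check").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("Spark version: " + sc.version()); // should be 1.3.1 or higher
        sc.stop();
    }
}
```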
02-12-2016
02:37 PM
I would normally say Pentaho or BIRT. Both have Hive support.
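Both tools talk to Hive through the standard HiveServer2 JDBC driver; below is a minimal connectivity sketch you can use to sanity-check the connection first (host, port, database and credentials are placeholders for your environment, and the hive-jdbc standalone jar needs to be on the classpath).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        // Same JDBC URL you would configure in Pentaho or BIRT.
        String url = "jdbc:hive2://hiveserver2.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```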
02-12-2016
10:43 AM
2 Kudos
This is actually well explained in the Ambari docs. Be sure to add the service extensions so the two servers don't duplicate jobs:

http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_Ambari_Users_Guide/content/_adding_an_oozie_server_component.html
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_Ambari_Users_Guide/content/_hive_high_availability.html

Just a few comments: without a load balancer, Oozie will continue to schedule jobs even if one server is down, but administration (submitting jobs, ...) would require you to point to the surviving Oozie server. Also, what the documents do not describe is how to configure the underlying database for DR; that is your job. Here is an example for Postgres:

https://www.digitalocean.com/community/tutorials/how-to-set-up-master-slave-replication-on-postgresql-on-an-ubuntu-12-04-vps

However, databases like MySQL and Postgres tend to be very stable, so making, for example, the Oozie server HA is much more important than doing the same for the underlying database. You could decide to just back up the database regularly.
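"Pointing to the surviving Oozie" just means using that server's URL when you submit; a hedged sketch with the Oozie Java client is below (the server URL, HDFS paths and hostnames are placeholders, not values from this thread).

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitToSurvivingOozie {
    public static void main(String[] args) throws Exception {
        // Without a load balancer you explicitly pick the Oozie server that is still up.
        OozieClient oozie = new OozieClient("http://oozie-server-2.example.com:11000/oozie");

        Properties props = oozie.createConfiguration();
        props.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/me/my-workflow");
        props.setProperty("nameNode", "hdfs://namenode:8020");
        props.setProperty("jobTracker", "resourcemanager:8050");

        String jobId = oozie.run(props); // submit and start the workflow
        System.out.println("Submitted job " + jobId);
    }
}
```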
02-12-2016
01:43 AM
5 Kudos
"As I understand it, the failover NN will be reading all those changes, via the JNs, That is true for file system changes. For block reports etc. the Datanodes communicate directly with both namenodes. They essentially duplicate every message to both instances. Thats the reason they have an almost identical in-memory image. "Now the primary fails. The failover NN starts up and reads in the fsimage file and starts accepting client requests as normal. It now starts to write edits to the JNs. But the formally primary NN is still down so it is NOT reading updates from the JNs. So, it's fsimage remains empty, essentially." The failover NN continuously reads the journalnode changes. So he has an almost current instance of the fsimage in memory just like the formerly active namenode as well. "Q1 - Is it true that the failover NN will NEVER have to apply any edit log changes at start up but simply loads its fsimage and starts running because it assumes fsimage is already 100% up to date via recent JN reads?" As written above. The failover NN does not start up. He is running in parallel and has an almost identical in-memory image as the active namenode. So when he takes over its practically instantaneous. He just has to read some changes from the journalnodes he didn't yet apply. "Q2 - In a setup with 3 JNs as a quorum, what should the disk layout look like on the three servers hosting those JNs? Because the edits are now distributed x3, should I just have a single disk per JN host dedicated to the JNs? No need for the one RAID and second NFS type arrangement used in non-HA mode? Specifically, the disk resources typically used for non-HA NN, where the NN writes edit log changes, now become disk resources used exclusively by the JNs, right? If possible the Journalnodes like the Namenodes should have raided data discs. It just reduces the chance that the journalnodes will die. In contrast to HDFS the volumes are not huge and the costs low. You can however colocate them with the Namenodes since they are pretty lightweight. No need for NFS though. "Meaning, the NNs never read/write anything directly to disk (except for configuration, I assume) but rather ALL goes through the JNs." The namenodes still checkpoint. The Journalnodes only write an edit log ( similar to a transaction log in a database ) The fsImage ( which is essentially a replica of the inmemory store ) is still written to disc regularly by the failover namenode who takes the job of the standby namenode in this. "Q3 - I believe I still should have one dedicated disk for each JN on each host to isolate the unique work load of the NN for other processes. So, for example, there might be one disk for the OS, one for JNs, and another for the ZK instances that are sharing the same server to support the ZKFC. Correct? Hmmm good question. I actually never heard of performance problems because of Journalnode IO. Not that it can hurt to separate them. But even assuming a huge cluster the number of transactions per second should be well below the write speed of a modern disc or SSD. Perhaps someone else has some numbers. "Q4 - Because JNs are distributed, it makes me think I should treat these disks like I do disks on the DNs, meaning no RAID, just plain old JBOD. Does that sound right? As said I would use RAID. It reduces the chances of a journalnode dying significantly ( which would then put you in danger of a second dying until the first JN is fixed) . It also doesn't seem to be a high cost. 
You do not use RAID for HDFS because of the high cost ( thousands of discs ) and because HDFS fixes discs automatically by recreating block replicas on different nodes. You have to fix the journalnode yourself. So RAID seems to be worth it. Q5 - Is it the NN on the failover server that actually does the JN reads and fsimage updates now in HA mode given that there is no SNN in such a configuration?" Yes the failover namenode doesn't need to read any fsimage anymore, he already has a carbon copy. So he writes a checkpoint regularly and distributes it to the active namenode. Architecture: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html Some transaction numbers for huge clusters: https://developer.yahoo.com/blogs/hadoop/scalability-hadoop-distributed-file-system-452.html
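For reference, the quorum itself is just the qjournal:// URI shared by the NameNodes plus a local edits directory on each JN host (the directory that benefits from the RAID discussed above). A minimal sketch of checking those settings with the Hadoop `Configuration` API; the hostnames in the comment are placeholders.

```java
import org.apache.hadoop.conf.Configuration;

public class QjmSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml"); // make sure the HDFS settings are picked up

        // Where the NameNodes write/read the shared edit log, typically something like
        // qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster
        String sharedEdits = conf.get("dfs.namenode.shared.edits.dir");

        // Local directory on each JournalNode host -- the path worth putting on a RAIDed disk.
        String jnEditsDir = conf.get("dfs.journalnode.edits.dir");

        System.out.println("shared edits URI: " + sharedEdits);
        System.out.println("JN local edits dir: " + jnEditsDir);
    }
}
```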
02-11-2016
04:30 PM
A poltergeist? I don't have a line 98, neither on macOS nor on Linux.
02-11-2016
04:24 PM
Seriously, no idea. As a last attempt, here is my script: generate-logspy.zip. But I swear I just downloaded it from the URL you added.