Member since: 09-15-2015
Posts: 457
Kudos Received: 507
Solutions: 90
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 15651 | 11-01-2016 08:16 AM
 | 11075 | 11-01-2016 07:45 AM
 | 8535 | 10-25-2016 09:50 AM
 | 1915 | 10-21-2016 03:50 AM
 | 3797 | 10-14-2016 03:12 PM
11-18-2015
02:21 PM
The PID-file problem is actually quite common. I have seen this a couple of times already, and not just with ZooKeeper (e.g. services not starting because an existing PID file points to a stale or invalid process ID). A quick check before restarting is shown below.
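A minimal shell sketch for detecting and clearing a stale PID file; the path is an assumption based on a typical HDP layout, so verify it on your cluster:

```bash
# Assumed location of the ZooKeeper PID file; adjust for your install
PID_FILE=/var/run/zookeeper/zookeeper_server.pid

# kill -0 only tests whether the process exists; it sends no signal
if [ -f "$PID_FILE" ] && ! kill -0 "$(cat "$PID_FILE")" 2>/dev/null; then
    echo "Stale PID file found, removing it before restart"
    rm -f "$PID_FILE"
fi
```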
11-18-2015
06:03 AM
Thanks @gopal! In this case we should definitely use ORC + (new) ZLib. I'll edit my answer 🙂
11-17-2015
09:01 PM
You could try restarting the Ambari Agents on the ZooKeeper nodes, or the complete Ambari Server. What is the current state of the ZooKeeper service and its host_components? Could you check via the REST API: <ambari-server>/api/v1/clusters/<clustername>/services/ZOOKEEPER? For example:
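A minimal curl sketch for that check, assuming the default Ambari port 8080 and admin credentials; the angle-bracket placeholders stay as placeholders for your values:

```bash
# GET the ZooKeeper service state and its host_components from Ambari
curl -u admin:admin \
  "http://<ambari-server>:8080/api/v1/clusters/<clustername>/services/ZOOKEEPER"
```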
11-17-2015
06:35 PM
3 Kudos
As far as I know, this is currently not possible. I am not sure why this feature was not pushed in the last couple of years; maybe multi-tenancy wasn't really an issue. I don't think anyone is working on HDFS-199 at the moment. I have seen a couple of requests in our internal Jira regarding this, so if you open a new feature enhancement with our support team, we might be able to get the ball rolling again. Your workaround looks good; I'd keep it for now.
11-17-2015
04:36 PM
I'd definitely open a feature enhancement, so that we can get engineering's input on that as well. Please keep me in the loop.
11-17-2015
04:36 PM
Makes sense, but this is going to be difficult to implement. Do you mean the % of available capacity or the % of total network capacity? I assume you mean the % of currently available capacity, which changes depending on the jobs that are running. We would need a way to predict the volume of files that are going to be transferred, and the result would be an ever-changing bandwidth limit. Maybe it makes sense to specify a minimum and a maximum bandwidth, where YARN gets priority and can use the full capacity. I need to think about that a bit more 🙂
11-17-2015
10:56 AM
Good question! The balancer has a configurable limit that ensures it does not utilize too much network bandwidth. You'll find the parameter dfs.datanode.balance.bandwidthPerSec in hdfs-site.xml; the default value is 1048576 bytes per second (1 MB/s). From https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml: "Specifies the maximum amount of bandwidth that each datanode can utilize for the balancing purpose in term of the number of bytes per second."
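As an illustration, the limit can be raised either permanently in hdfs-site.xml or at runtime; the 10 MB/s value below is just an example, not a recommendation:

```bash
# Permanent: set in hdfs-site.xml on the DataNodes (value in bytes per second)
#   <property>
#     <name>dfs.datanode.balance.bandwidthPerSec</name>
#     <value>10485760</value>   <!-- 10 MB/s, example value -->
#   </property>

# Or at runtime, without a DataNode restart (reverts to the configured
# value once the DataNodes are restarted):
hdfs dfsadmin -setBalancerBandwidth 10485760
```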
11-17-2015
10:39 AM
Great write-up. Thanks!
11-16-2015
09:18 PM
Thanks for sharing! How many datasets were in the Links table? Is the dataset in Links a subset of the ABC dataset?
11-16-2015
09:15 PM
5 Kudos
ORC + ZLib seems to have the better performance. ZLib is also the default compression option; however, there are definitely valid cases for Snappy. I like the comment from David (2014, before the ZLib update): "SNAPPY for time based performance, ZLIB for resource performance (Drive Space)." Make sure you check out David's post: https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance As @gopal pointed out in the comment, we have switched to a new ZLib algorithm, hence the combination ORC + (new) ZLib is the way to go. The performance difference between ZLib and Snappy regarding disk writes is rather small. By the way, ZLib is not always the better option; when it comes to HBase, Snappy is usually better 🙂 If you want to compare the codecs on your own data, see the sketch below.
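A rough sketch for comparing the two codecs on your own data; the table and source names are made up for illustration, and orc.compress is the table property that selects the codec:

```bash
# Create one copy of the data per codec, then compare footprint and query times
hive -e "
  CREATE TABLE test_orc_zlib   STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB')
    AS SELECT * FROM source_table;
  CREATE TABLE test_orc_snappy STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY')
    AS SELECT * FROM source_table;
"

# Compare the on-disk size of both tables (warehouse path may differ per install)
hdfs dfs -du -h /apps/hive/warehouse | grep -i test_orc
```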