Member since: 06-07-2016
Posts: 923
Kudos Received: 319
Solutions: 115
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1215 | 10-18-2017 10:19 PM
 | 1193 | 10-18-2017 09:51 PM
 | 4984 | 09-21-2017 01:35 PM
 | 306 | 08-04-2017 02:00 PM
 | 360 | 07-31-2017 03:02 PM
07-23-2016
04:43 AM
@Aman Poonia I think what you are asking for is N+2 redundancy for the namenode. This feature will be available in Hadoop 3.0, which allows 3-5 namenodes. Please see the following Jira. https://issues.apache.org/jira/browse/HDFS-6440
07-22-2016
10:48 PM
@Ravi Mutyala Is there more than one rule that hdfs-xyz@EXDOMAIN.COM might evaluate to? Could you share the hadoop.security.auth_to_local setting from your core-site.xml?
07-22-2016
09:27 PM
1 Kudo
@Ravi Mutyala The HDFS balancer must run as a user with the same capabilities as the hdfs superuser. Does the user you are running it as have those capabilities? What command are you using? In a kerberized cluster, you need to kinit first and then run the balancer, like this:
kinit -kt <keytab> <principal>
hdfs balancer -threshold <threshold>
Hope this helps.
07-22-2016
04:36 AM
@vinay kumar I am not a hundred percent sure, but I think you need to reduce your reducer size. The error is thrown in the following file, which expects the number of rows handled by a given reducer to be less than Integer.MAX_VALUE (see line 99). I think you have more than 2147483647 rows being processed by this one reducer. If you reduce the size of the reducers so that no reducer processes more than 2147483647 records, you should not run into this issue. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java (check line 99) I hope this helps.
07-19-2016
06:36 PM
1 Kudo
@alain TSAFACK I think what you are looking for is the actual machine learning code rather than an example that uses it. Here is one place to look (Spark MLlib): https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark You can also look at the H2O GitHub repo: https://github.com/h2oai/h2o-3
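In case a small, concrete example of calling MLlib helps as well, here is a rough Java sketch (not taken from the repos above; it assumes Spark 1.x with spark-mllib on the classpath, and the toy data points and cluster count are made up purely for illustration):
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("KMeansSketch").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Toy 2-D points; in practice you would parse these from a file on HDFS.
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
                Vectors.dense(0.0, 0.0),
                Vectors.dense(0.1, 0.1),
                Vectors.dense(9.0, 9.0),
                Vectors.dense(9.1, 9.1)));
        // Train a model with 2 clusters and 20 iterations.
        KMeansModel model = KMeans.train(points.rdd(), 2, 20);
        for (Vector center : model.clusterCenters()) {
            System.out.println("Cluster center: " + center);
        }
        sc.stop();
    }
}
The algorithm implementations themselves live in the Spark and H2O repositories linked above; this only shows how application code calls them.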
07-19-2016
06:14 PM
Check configuring Flume and then starting Flume. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/configuring_flume.html https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/starting_flume.html Once you have your conf file, you basically have to run the following command:
/usr/hdp/current/flume-server/bin/flume-ng agent -c /etc/flume/conf -f /etc/flume/conf/flume.conf -n agent
Or, if Flume is configured as a service (run "chkconfig" to see if it is), simply use:
service flume-agent start
07-19-2016
01:22 AM
Hi @sujitha sanku The administration tool is Ambari. You can pull as much detail as you want to share from the Ambari docs. Thanks
07-18-2016
01:40 PM
1 Kudo
Can you check /usr/hdp/current/flume-server? Check inside the bin folder.
07-13-2016
06:22 PM
Those DBs are likely for the Hive metastore as well as for Ambari. These services are often run on master or edge nodes.
07-13-2016
05:57 PM
@Kumar Veerapan It is not true that the namenode performs all admin functions. You need Ambari to manage the cluster; the namenode only stores the metadata for Hadoop files. As for gateways, you need them because in a large cluster you don't want clients connecting directly to the cluster nodes and opening the cluster up. You would rather have gateway nodes that clients use to access the cluster.
07-06-2016
10:02 PM
Check the link I just added to my answer.
07-06-2016
09:58 PM
Hi @Qi Wang Which user is running the sqoop command? Can you verify that the file /etc/hive/2.5.0.0-817/0/xasecure-audit.xml exists? Does the user running the sqoop import have read access to this file? Also, check the following link; it might be your issue. https://community.hortonworks.com/questions/369/installed-ranger-in-a-cluster-and-running-into-the.html
07-06-2016
05:12 PM
@Sunile Manjee Yes. Here is what I did. Let me know if you have any questions.
// Imports needed (roughly):
// import java.security.PrivilegedExceptionAction;
// import org.apache.hadoop.security.UserGroupInformation;
// import org.apache.spark.SparkConf;
// import org.apache.spark.api.java.JavaSparkContext;
// import org.apache.spark.sql.DataFrame;
// import org.apache.spark.sql.SQLContext;
try {
    // Log in from the keytab and run all Spark work as that user.
    UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(kerberos_principal, kerberos_keytab);
    objectOfMyType = ugi.doAs(new PrivilegedExceptionAction<MyType>() {
        @Override
        public MyType run() throws Exception {
            System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
            System.setProperty("spark.kryo.registrator", "fire.util.spark.Registrator");
            System.setProperty("spark.akka.timeout", "900");
            System.setProperty("spark.worker.timeout", "900");
            System.setProperty("spark.storage.blockManagerSlaveTimeoutMs", "3200000");
            // Create the Spark context inside doAs so it is created as the keytab user.
            SparkConf sparkConf = new SparkConf().setAppName("MyApp");
            sparkConf.setMaster("local");
            sparkConf.set("spark.broadcast.compress", "false");
            sparkConf.set("spark.shuffle.compress", "false");
            JavaSparkContext ctx = new JavaSparkContext(sparkConf);
            SQLContext sqlContext = new SQLContext(ctx);
            DataFrame tdf = sqlContext.read().format("com.databricks.spark.csv")
                    .option("header", String.valueOf(header)) // use first line of all files as header
                    .option("inferSchema", "true")            // automatically infer data types
                    .option("delimiter", delimiter)
                    .load(path);
            // some more application specific code here
            return objectOfMyType;
        }
    });
}
catch (Exception exception) {
    exception.printStackTrace();
}
07-06-2016
03:28 PM
1 Kudo
I figured this out. I changed the master to local and simply loaded remote HDFS data. It was still throwing an exception because it is a kerberized cluster. While I was using UserGroupInformation and creating a proxy user with a valid keytab to access my cluster, it was failing because I was creating the JavaSparkContext outside of the "doAs" method. Once I created the JavaSparkContext as the right proxy user inside doAs, everything worked.
07-01-2016
07:36 PM
The Hive JDBC jar should be at the following location. You can copy it from here.
/usr/hdp/current/hive-client/lib/hive-jdbc.jar
07-01-2016
08:26 AM
Hi, I am trying to run an application from Eclipse so I can set breakpoints and monitor the changing values of my variables. I create a JavaSparkContext, which uses a SparkConf object. This object needs access to my yarn-site.xml and core-site.xml so it knows how to connect to the cluster. I have these files under /etc/hadoop/conf, and the two environment variables HADOOP_CONF_DIR and YARN_CONF_DIR are set on my Mac (where I run Eclipse) using ~/Library/LaunchAgents/environment.plist. I have verified these variables are available when I boot the Mac, I can read them in my app in Eclipse using System.getenv("HADOOP_CONF_DIR"), and they point to the right location. I have also tried adding the environment variables in my build configuration in Eclipse. After doing all this, my code consistently fails because it is unable to read yarn-site.xml or core-site.xml, and I run into the following:
INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/07/01 00:57:16 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
As you can see, it is not trying to connect to the correct location of the resource manager. Here is what the code looks like in create(). Please let me know what you think, as this is blocking me.
public static JavaSparkContext create() {
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    System.setProperty("spark.kryo.registrator", "fire.util.spark.Registrator");
    System.setProperty("spark.akka.timeout", "900");
    System.setProperty("spark.worker.timeout", "900");
    System.setProperty("spark.storage.blockManagerSlaveTimeoutMs", "3200000");
    // create spark context
    SparkConf sparkConf = new SparkConf().setAppName("MyApp");
    // if (clusterMode == false)
    {
        sparkConf.setMaster("yarn-client");
        sparkConf.set("spark.broadcast.compress", "false");
        sparkConf.set("spark.shuffle.compress", "false");
    }
    JavaSparkContext ctx = new JavaSparkContext(sparkConf); // <- fails here
    return ctx;
}
06-30-2016
08:41 PM
@hoda moradi You will have to do some research, but you might be missing a jar file. Are you sure you have the JDBC jar files on the classpath? See the following two links. https://community.hortonworks.com/questions/19396/oozie-hive-action-errors-out-with-exit-code-12.html https://community.hortonworks.com/articles/9148/troubleshooting-an-oozie-flow.html
06-30-2016
08:22 PM
Hi @hoda moradi Here is the issue you are running into:
User: hive is not allowed to impersonate anonymous at org.apache.hive.service.cli.session.SessionManager.openSession(SessionManager.java:266)
I am assuming this is simple development work and you are not too concerned about policies. If you are, then only your organization's security team can tell you which users Hive may impersonate. Basically, you need to enable Hive impersonation. Can you check whether the following is set to true in your hive-site.xml?
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>
Also check the following link to set up the proxyuser settings for the hive user in core-site.xml: http://docs.hortonworks.com/HDPDocuments/Ambari-2.1.1.0/bk_ambari_views_guide/content/_setup_HDFS_proxy_user.html You need to set the following. Remember, these definitely should not be * if this is for work; that is where your security team comes in, since they will tell you whom the hive user can impersonate.
hadoop.proxyuser.hive.groups=*
hadoop.proxyuser.hive.hosts=*
06-30-2016
08:04 PM
1 Kudo
Hi @bigdata.neophyte I think Sunile has explained this well enough, but in case you are still confused, I'll try to rephrase it.

First, let's talk about the Hive metastore, which, from your comment on Sunile's answer, I believe you already understand. When you create tables in Hive, you have to record somewhere the location of the data files, the file format of the data, the table name, the columns and so on. You need a place to store this information, and that place is the Hive metastore. It is a database, usually MySQL (or Postgres or Oracle). Why do you need HA for this metastore database? For the same reason you need HA for anything else: if the MySQL instance holding the Hive metastore goes down, you want to be able to fail over to your standby so your users are not impacted. You also need HA for the metastore service, because even if the DB is healthy the metastore service itself can fail, and again you want to fail over to a standby without impacting your users.

Now let's talk about HCatalog. When Hive was created, you could run HiveQL, which is pretty much SQL, on top of your tabular data in Hadoop. That is great, but it is not where all of Hadoop's power lies. One of the most significant differences between Hadoop and traditional platforms is its ability to run different engines against the same data. For your tabular/structured data in Hadoop, you can not only create Hive tables and run SQL queries, but also read the same data in your MapReduce jobs or Pig scripts. How would you do that without HCatalog? You would have to write custom code in your MapReduce jobs and Pig scripts to read the table structure from the Hive metastore, which is what most people did before HCatalog. With HCatalog, those jobs have access to the same information that is in the Hive metastore, so they can quickly and easily read those Hive tables instead of relying on custom code. Check slide 4 of the following link and see how Hive can go directly to the Hive metastore while other services need some other way to talk to it; that way is HCatalog. http://www.slideshare.net/Hadoop_Summit/future-of-hcatalog
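To make the HCatalog part concrete, here is a rough Java sketch of a MapReduce job that reads a Hive table through HCatInputFormat rather than hard-coding the file location and format itself. The database/table names and output path are hypothetical, and it assumes the HCatalog client jars are on the job classpath:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

// Reads rows of an existing Hive table through HCatalog instead of
// hard-coding the HDFS location and SerDe in the MapReduce job.
public class HCatReadSketch {

    public static class RowMapper
            extends Mapper<WritableComparable, HCatRecord, Text, Text> {
        @Override
        protected void map(WritableComparable key, HCatRecord value, Context context)
                throws IOException, InterruptedException {
            // Column 0 of the table, whatever its storage format happens to be.
            context.write(new Text(String.valueOf(value.get(0))), new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hcat-read-sketch");
        job.setJarByClass(HCatReadSketch.class);

        // HCatalog looks up the schema, location and SerDe from the Hive metastore.
        HCatInputFormat.setInput(job, "default", "my_hive_table");
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(RowMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hcat_read_sketch_out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The point is simply that HCatalog hands the job the same schema and location information Hive itself reads from the metastore.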
06-28-2016
05:26 PM
Do you have Ambari running? You should be able to check the status of your JobHistory Server from Ambari. Otherwise, this should bring up the UI, assuming you haven't modified the default ports: http://<host>:19888
06-28-2016
04:22 PM
@hoda moradi Can you please share your log? Is your job history server running? Thanks
06-26-2016
05:18 AM
This is quite a custom requirement: you are converting some rows to columns and other rows to both rows and columns. You will have to write a lot of your own code, but you can take advantage of the pivot functionality in Spark. Check the following link: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html For a plain transpose you can also do something like:
sc.parallelize(rdd.collect.toSeq.transpose)
See the link for more details.
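If your data fits the DataFrame model, here is a rough Java sketch of the pivot approach described in that blog post. It assumes Spark 1.6 or later with the spark-csv package on the classpath, and the input path and the id/metric/value column names are just placeholders for your own schema:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PivotSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PivotSketch").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Hypothetical input with columns: id, metric, value.
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("/tmp/input.csv");

        // Turn each distinct value of "metric" into its own column,
        // aggregating "value" for every id.
        DataFrame pivoted = df.groupBy("id").pivot("metric").sum("value");
        pivoted.show();

        sc.stop();
    }
}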
06-26-2016
01:54 AM
@Akash Mehta So even the following won't work for you? If not, I think there is currently no other way, given that we have looked at all the other possible options.
// a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
06-25-2016
08:44 PM
@Sri Bandaru Since you are not running in a sandbox, what does --master yarn resolve to?
06-23-2016
11:51 PM
load will infer the schema and convert the data to rows. The question is whether it will accept an HTTP URL. Can you try?
06-23-2016
10:58 PM
@Akash Mehta Can you do something like this?
dataframe = sqlContext.read.format("json").load(your json here)
06-23-2016
09:59 PM
I am assuming you have 141 partitions by default (your number of blocks), but you have only 4 or maybe 8 executors. See if you can increase this to 16 executors with 1 GB each. I would also use coalesce to reduce the number of partitions so it is not too high compared to the number of executors, and assign more cores using --executor-cores. I hate to give up, but in the end, doing a count on a 20 GB file with your hardware might just take 20 minutes.
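As a rough Java sketch of what I mean (the input path and numbers are only placeholders; executor count, memory and cores are normally passed on the spark-submit command line):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CountSketch {
    public static void main(String[] args) {
        // Submitted with something like:
        // spark-submit --num-executors 16 --executor-memory 1g --executor-cores 2 ...
        SparkConf conf = new SparkConf().setAppName("CountSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Roughly one partition per HDFS block by default.
        JavaRDD<String> lines = sc.textFile("/data/big_file.txt"); // placeholder path
        // Coalesce so the partition count is not far larger than the total executor cores.
        long count = lines.coalesce(32).count();
        System.out.println("Line count: " + count);
        sc.stop();
    }
}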
06-23-2016
08:51 PM
In this case, I would suggest that rather than doing a direct import into a Hive table, you first stage the data, then do the cleansing, and then do the final import into Hive. You can also import data in one of the supported file formats, such as "--as-sequencefile" or "--as-avrodatafile". I recommend you read the following link to tailor your import strategy. http://getindata.com/blog/post/surprising-sqoop-to-hive-gotchas/
06-23-2016
06:17 PM
1 Kudo
Assuming you have Hive 0.14 or later:
ALTER TABLE MAGNETO.SALES_FLAT_ORDER SET SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
06-23-2016
05:47 PM
@Simran Kaur Check your data by doing a "cat" to see what it looks like and how the fields are separated, whether by a space or something else. You can also instead create a table, specify in the CREATE TABLE statement what you want your fields to be terminated by, and then do the import using Sqoop.