Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4105 | 10-18-2017 10:19 PM |
| | 4347 | 10-18-2017 09:51 PM |
| | 14855 | 09-21-2017 01:35 PM |
| | 1842 | 08-04-2017 02:00 PM |
| | 2425 | 07-31-2017 03:02 PM |
01-30-2017
06:17 PM
1 Kudo
@SBandaru Please see my replies inline below:

1. Can I install and configure only the ZooKeeper and HBase services, without installing HDFS, YARN, etc.? You can do without YARN but not without HDFS; HDFS is where HBase stores its data. Without YARN, though, you cannot run any Spark or MapReduce jobs against HBase. Pretty much nothing except your HBase API calls can access the data.

2. If "Yes" to the above, what are the pros and cons of installing HBase with and without the NameNode and ResourceManager? You cannot do it without the NameNode; that is a must for any Hadoop cluster. I can't think of any pros to dropping YARN. It doesn't take many resources or much space by itself, and it is absolutely required to run anything on top of HBase, like Hive, Spark, MapReduce, and so on. There are a bunch of cons to skipping it, and maybe the only pro is that you have a much simpler environment, without any components beyond the required minimum.

3. Can anyone share "best practices" for an HBase cluster? What are your application requirements? Depending on whether you want to optimize for reads or writes, there are different ways to go about setting things up. One thing that remains consistent across use cases is good row key design (see the sketch below); I cannot overemphasize this.
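For example, one common key-design pattern is to salt a monotonically increasing key with a small bucket prefix so writes spread across regions instead of hotspotting a single one. A minimal sketch in the HBase shell, where the 'events' table, column family, and bucket count are all hypothetical:

```
# Pre-split on the salt prefixes so each bucket starts in its own region
echo "create 'events', 'cf', SPLITS => ['1', '2', '3']" | hbase shell

# Row key = <salt>|<timestamp>-<id>, where salt = hash(id) % 4;
# sequential writes now land on different regions instead of one hot region
echo "put 'events', '2|20170130-txn-00042', 'cf:payload', 'some value'" | hbase shell
```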
01-29-2017
04:09 PM
1 Kudo
This sounds pretty simple. Here is how I would do it, but you can follow your own path.

1. Import the XML archive data into Hadoop.

2. My next step is optional, but to me it's the right way to do it: flatten the XML into Avro and then ORC (a lot of material is available on this). I would use nested types to retain the XML structure, and it's going to be more efficient when reading. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-ComplexTypes https://orc.apache.org/docs/types.html Like I said, this is optional; you can keep your data in XML and read the XML directly from Hive. A sketch of the ORC table is below.

3. I would initially keep compression enabled with Snappy, but might disable it if the data set is not too large and queries bottleneck on CPU.

That's pretty much it. It's a pretty straightforward use case.
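A minimal sketch of the ORC end state in step 2. The table and column names here are hypothetical; the nested struct should mirror whatever your actual XML schema looks like:

```
# Hypothetical ORC table using nested types to retain the XML structure;
# 'orc.compress' = SNAPPY matches the compression suggestion in step 3
hive -e "
CREATE TABLE orders_orc (
  order_id STRING,
  customer STRUCT<name:STRING, city:STRING>,
  items    ARRAY<STRUCT<sku:STRING, qty:INT>>
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
"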
01-27-2017
07:11 PM
@Joe Harvy There are many use cases where you will need to fetch a file, change its format, extract or drop records, filter JSON, and so on. Your use case does not seem to be one of them. But fair enough, if you don't agree and would still like to go down your chosen path, I am sure somebody can give you a better answer that validates your approach.
01-27-2017
06:49 PM
@Joe Harvy NiFi is best used for ingesting live streaming data with thousands of records per second. For your use case, why not simply import the file into a staging area in Hadoop, create a temp table over it, and then do an insert-select using Hive? While inserting, simply change the format to ORC (see the sketch below).
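A minimal sketch of that flow, with hypothetical paths, table names, and columns:

```
# Land the raw file in a staging directory
hadoop fs -mkdir -p /staging/mydata
hadoop fs -put data.csv /staging/mydata/

# External temp table over the staged file, then insert-select into an ORC table
hive -e "
CREATE EXTERNAL TABLE mydata_staging (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/staging/mydata';

CREATE TABLE mydata_orc (id INT, name STRING) STORED AS ORC;

INSERT INTO TABLE mydata_orc SELECT * FROM mydata_staging;
"
```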
01-26-2017
10:34 PM
But after you did this, did you refresh ZooKeeper?
01-26-2017
10:22 PM
@Karan Alang Something is missing. Your HMaster log is pointing to the following location:

hbase.rootdir=hdfs://sandbox.hortonworks.com:8020/encrypt_hbase2/hbase/data

but your hbase-site.xml points to the following:

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://sandbox.hortonworks.com:8020/encrypt_hbase1/hbase/data</value>
</property>

Let's try this. Shut down everything, then run:

hadoop fs -rm -r /encrypt_hbase1/hbase/data/*   # should have no effect, as this directory should be empty
echo "rmr /hbase-unsecure" | zookeeper-client   # this should clean up everything in ZooKeeper

Then start only ZooKeeper and make sure it is green in Ambari. After that, start only the HMaster; there is no need to restart the region servers until the HMaster has started successfully.
01-26-2017
06:24 PM
@Karan Alang I am assuming you stopped HBase when you cleaned up ZooKeeper and restarted it afterwards. For now, shut down your HBase; we need to look into ZooKeeper. Can you please share the ZooKeeper logs (/var/log/zookeeper/)? Since this is the sandbox, do you have any data there? Can you try running "zookeeper-client" and share what your output is?
01-26-2017
03:02 PM
1 Kudo
We have a Storm topology that usually runs fine. We have no errors in the logs and times are fast. However, sometimes we get spikes in the "complete latency". Here are the individual bolts: (screenshot of the per-bolt latencies). So my question is: what causes the complete latency to be much higher than the individual bolts' latencies? (We are using the Microsoft spout and Event Hubs, if that helps.)
Labels: Apache Storm
01-26-2017
03:01 AM
Can you please share the HBase Master server logs? First start the Master and then start the region servers. When you run the same zkCli "ls /" command, do you see /hbase-unsecure back? You should, because the Master should recreate this znode and everything under it. It might take a while, so start the Master and give it some time, then check that /hbase-unsecure exists and also check the subfolders. See the following link: does the structure in your /hbase-unsecure match what's explained there? https://community.hortonworks.com/articles/73627/hbase-zookeeper-znodes-explained.html
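If it helps, here is a quick way to check from the sandbox shell (a minimal sketch, assuming the default unsecure znode name):

```
# List the HBase znodes; after a successful Master start you should see
# children such as master and rs under /hbase-unsecure
echo "ls /hbase-unsecure" | zookeeper-client
```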
01-25-2017
05:06 PM
@Avijeet Dash I'll try to answer in detail, but before I do, let me give you some context.

Think about the traditional database world (aka the legacy world). Imagine you have a large Oracle/MySQL/DB2 database where you bring in your transaction data. These are live transactions and you have thousands of transactions per second. This system is very time sensitive for you and has been tuned and sized precisely, down to the millisecond level. You know exactly how many transactions happen every second, and any change to the volume of data ingested or the type of queries run can impact your system. You monitor it very closely.

Now imagine, for a second, that you don't have an EDW. Your business comes and says, "We would like to run some queries to gain business insights from this data." You say, "Hold on. I can't let you run those kinds of queries against my transactional system. It is tuned precisely and sized appropriately for what it's doing. If you start running the kinds of queries you want to run (multiple joins, aggregations, etc.), you are going to take away resources meant for my transactions and blow all my SLAs. Sorry, I can't let you do that." Instead, you suggest that what the business needs is a separate database where they import all this data and model it differently (maybe with a lot more indexes, or more denormalized). Then they can move data on a nightly basis (ETL) from your transactional system when load is low and run whatever queries they want on their own separate database. Let's call it an EDW.

HBase is that transactional system, and Hive is very similar to, if not exactly, that data warehouse. HBase today does not run under YARN (yes, there is Slider, but I haven't seen a production deployment yet). That means managing resources so that HBase SLAs are not impacted when someone runs a big Hive query (think Tableau generating a ridiculous query) is a difficult task.

So the answer to your question is: how sensitive is your HBase? Do you care if someone slows down your HBase, or vice versa (HBase slowing down Hive)? If you can manage this aspect, then there is nothing wrong with running both in the same cluster; a lot of customers do that, as long as they know what they are doing. However, if you have tight SLAs, then maybe you want to consider separate clusters. It really depends on your use case.