Member since: 06-07-2016
Posts: 923
Kudos Received: 319
Solutions: 115
09-19-2017
03:08 AM
1 Kudo
@Gobi Subramani I would solve this with an HBase tall-and-narrow table. I have worked on an application that stored ticker data in HBase and recorded every change. Our HBase row key was the stock symbol plus a timestamp plus a few more fields we needed to search on, which gave us keys like AAPL<epoch time>, AAPL<epoch time - 1>, AAPL<epoch time - 2>, and so on. That was a trillion-plus-row table. Now, given a symbol, AAPL in this case, you run a prefix scan and limit it to 100 rows (see the sketch below). Alternatively, you can build a short-and-wide table where all data for AAPL sits in one row and then do a get that reads only the first hundred columns. Either way is easy to implement in HBase.
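For illustration, here is a minimal sketch of that prefix scan using the HBase client API in Scala; the table name "ticks" and the plain symbol+epoch key layout are assumptions, not your actual schema:

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical table holding rows keyed as "<symbol><epoch>".
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("ticks"))

// A prefix scan returns every AAPL row in key order; many designs store an
// inverted timestamp (Long.MaxValue - epoch) so the newest tick sorts first.
val scan = new Scan().setRowPrefixFilter(Bytes.toBytes("AAPL"))
val scanner = table.getScanner(scan)
val first100 = scanner.iterator().asScala.take(100).toList  // keep only 100 rows

scanner.close()
connection.close()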
... View more
09-19-2017
03:01 AM
@Jon Page It depends on the environment the cluster is in. Since you are asking here and the drive will take a couple of days to arrive, we can reasonably assume this is a dev/sandbox-type environment. Here is what is going to happen: because of the lost node, you have lost some data blocks. Once a DataNode is marked dead, Hadoop starts re-replicating the lost blocks onto the remaining nodes. That creates network traffic which can be unnecessary in some cases (as it seems to be here). To avoid it, you can increase dfs.namenode.heartbeat.recheck-interval. Together with dfs.heartbeat.interval, this setting (in milliseconds) determines how long the NameNode waits before declaring a DataNode dead (the timeout is 2 x recheck-interval + 10 x heartbeat interval, roughly 10.5 minutes with the defaults). Increasing it buys you more time to replace a drive before re-replication kicks in. The problem is that the setting requires a restart, so it is a little late for that now. For now, just run with under-replicated blocks. Unless you lose more disks you should not lose data, but there is of course a risk of data loss if the two disks holding the remaining replicas of a lost block also fail.
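For reference, this is the sort of hdfs-site.xml change that would buy more time before a DataNode is declared dead; the 30-minute value is purely illustrative:

<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <!-- default is 300000 ms (5 minutes); raising it delays the dead-node verdict -->
  <value>1800000</value>
</property>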
... View more
09-19-2017
02:13 AM
@Sami Ahmad
As the error message says, you cannot do incremental imports with the HCatalog integration. See the example on this page: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_dataintegration/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html Use a JDBC connection instead (see the example command below). Also, read the accepted answer on the following page to clear up any remaining confusion (there is a lot of redundancy on that page; focus on the accepted answer and the comments on it). https://community.hortonworks.com/questions/10710/sqoop-incremental-import-working-fine-now-i-want-k.html
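As a rough sketch only (the connection string, table, column, and target directory below are made up, not taken from your job), an incremental import over plain JDBC looks something like this:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/sami/orders_incr \
  --incremental lastmodified \
  --check-column last_update_ts \
  --last-value "2017-09-18 00:00:00"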
... View more
09-16-2017
03:54 PM
@adam chui Cassandra runs on EXT4 or a similar Linux-compatible filesystem. Unlike HBase, it does not run on Hadoop, and it is not part of the HDP 2.6 stack, so you cannot install Cassandra through HDP 2.6. What is your use case? Why not just use HBase?
... View more
09-11-2017
03:03 AM
@Ashish Arora Can you please try starting a PuTTY session and logging into the Docker container with the following command: ssh -p 2222 root@127.0.0.1
... View more
09-11-2017
02:54 AM
@Bhaskar Das So you want to know how many copies occur when the mappers have completed and data is being transferred to the reducers, right? After the mappers complete, data is sent to the reducers based on keys. Data for a given key lands on one particular reducer and only that reducer, no matter which mapper it comes from. One reducer may handle more than one key, but one key always lives on exactly one reducer. So imagine mapper output on node 1, node 2, and node 3, and assume there is a key "a" that appears in the mapper output on all three nodes. Imagine one reducer running on each of the three nodes (three reducers total), and suppose the data for key "a" goes to node 3. Then the data for "a" from node 1 and node 2 is copied to node 3 as reducer input. In fact, the data from node 3 is also copied into a folder where the reducer can pick it up (a local copy, unlike the over-the-network copies from node 1 and node 2). So three copies really occurred for 3 mappers and 1 reducer. If you follow that logic for how the copy is driven by keys, you arrive at m*n copies for m mappers and n reducers (for example, 10 mappers and 4 reducers means 40 copy operations). Please see the "MapReduce data flow" picture at the following link; it should visually answer what I have described above. Hope this helps. https://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow
... View more
09-06-2017
02:57 PM
@Edgar Daeds I want to make sure I understand this correctly. Please let me know if I am wrong. 1. You have configured LDAP group mapping. 2. Your HBase Region server cannot reach the LDAP server due to security reasons. 3. Once LDAP timeout expires, your query works. If my understanding is correct, then you need to disable LDAP integration until you can actually query the LDAP server for group mappings. What's the point in configuring LDAP when you cannot actually reach out to it?
... View more
09-06-2017
02:52 PM
2 Kudos
@Bin Ye I can only guess here, but Phoenix clients cache 100 sequence values by default. So, if you run "sqlline.py" first, it caches values 1 through 100, and a new client then starts from the next value, which is 101. That is most likely what is going on. The only other explanation is that your CREATE SEQUENCE statement contains "START WITH 101", which I assume it does not. Try changing the sequence cache size described at the following link to 50 and see whether your JDBC client then starts at 51. https://phoenix.apache.org/sequences.html
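For illustration, the per-client cache can also be set explicitly when the sequence is declared (the sequence name here is hypothetical):

CREATE SEQUENCE my_schema.my_sequence START WITH 1 INCREMENT BY 1 CACHE 50;
SELECT NEXT VALUE FOR my_schema.my_sequence;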
... View more
09-06-2017
03:54 AM
@Karan Alang One thing that jumps out here is that you are using "-alias localhost". That cannot be right when you are communicating between two physically different servers. Use the proper DNS name for node04 and node05; this should be the same name with which you can successfully run "ping <node04/05>" and get a reply.
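Assuming you are generating the keystores with keytool (the hostnames, keystore name, and -dname fields below are placeholders, not your values), the alias and CN should carry the real host name, something like:

keytool -genkeypair -alias node04.example.com -keyalg RSA -keysize 2048 \
  -keystore kafka.server.keystore.jks -validity 365 \
  -dname "CN=node04.example.com, OU=IT, O=Example, L=City, ST=State, C=US"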
... View more
08-30-2017
12:30 PM
@heta desai You already have a good idea of how to implement this, but I will suggest an easier design.
1. Download the latest HDF 3.0 and HDP 2.6.1 from the Hortonworks website. After installation, create the Kafka topics that will store the data ingested from Twitter.
2. Use NiFi to ingest data from Twitter. Here is a link to the NiFi GetTwitter processor: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-social-media-nar/1.3.0/org.apache.nifi.processors.twitter.GetTwitter/index.html
3. Use the NiFi PublishKafka processor to push the data ingested from Twitter into a Kafka topic: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-0-10-nar/1.3.0/org.apache.nifi.processors.kafka.pubsub.PublishKafka_0_10/additionalDetails.html
4. Use Streaming Analytics Manager to build a flow with simple drag and drop that reads from the Kafka topic, performs sentiment analysis using the processors Streaming Analytics Manager already provides, and then pushes the results to HBase. All of this is done without writing a single line of code. Streaming Analytics Manager uses Apache Storm rather than Spark Streaming under the hood, but do you care which tool is used as long as your problem is solved?
If you cannot use Streaming Analytics Manager, then you will have to write Spark Streaming code that reads from Kafka and pushes to HBase (see the sketch below). Here is the doc for integrating Spark Streaming with Kafka: https://spark.apache.org/docs/latest/streaming-kafka-integration.html The following link has an example of the Java HBaseContext being used to write to HBase: https://github.com/tmalaska/SparkOnHBase/blob/master/src/main/scala/org/apache/hadoop/hbase/spark/JavaHBaseContext.scala
If you follow my suggestion to use Streaming Analytics Manager, you are done at step 4 without writing any code.
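If you do end up on the Spark Streaming path, a minimal sketch of the Kafka direct stream (spark-streaming-kafka-0-10) looks like the following; the broker address, topic name, and group id are assumptions, and the sentiment scoring plus the HBaseContext write are left as comments:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Hypothetical broker list, topic, and consumer group.
val conf = new SparkConf().setAppName("TwitterSentiment")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafkabroker1:6667",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "tweet-sentiment",
  "auto.offset.reset" -> "latest")

val tweets = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("tweets"), kafkaParams))

// Score each tweet here and write the result to HBase (e.g. via HBaseContext.bulkPut).
tweets.map(record => record.value).print()

ssc.start()
ssc.awaitTermination()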
... View more
08-18-2017
04:58 AM
H. L. Will you run an Ubuntu VM on your Windows 7 box? If yes, you should be able to install a cluster without issues (meaning a mix of VMs and physical nodes). You cannot, however, have Windows machines as cluster nodes. Finally, you shouldn't use the sandbox for this. Maybe it would work, but I cannot say because I have personally never tried it. Installing HDP with Ambari on four machines is literally a matter of an hour; since you are doing it for the first time it might take 2-3 hours, but it is much easier and better to install an HDP cluster with Ambari than to try to stretch the sandbox, which is designed to run as a single dedicated VM.
... View more
08-17-2017
10:01 PM
@Qi Wang Have you set up a truststore and then trusted SAM as an application that can connect to Ambari? I have not set this up myself, but not setting up a truststore and "trusting" SAM could be the reason for your error. Check the troubleshooting section in the following link: https://community.hortonworks.com/articles/39865/enabling-https-for-ambariserver-and-troubleshootin.html
... View more
08-04-2017
07:28 PM
@Kent Brodie I am assuming you run major compactions on a regular schedule, probably once a week, so that is not the issue. Do you have a lot of snapshots? Here is how snapshots work: when you create a snapshot, it only captures metadata at that point in time, so if you ever have to restore to that point, the snapshot knows through that metadata which data files to bring back. Now, as HBase runs, you delete data, and normally once a major compaction runs the deleted data is gone for good and the disk space is recovered. However, if a snapshot still references data that is being deleted, HBase cannot simply remove it, because you might want to restore to that point in time later. Instead, that data is moved to the archive folder, and the more snapshots you have, the more the archive folder grows to satisfy them. I can only guess, but a reasonable guess for what you are seeing is that you have too many snapshots.
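A quick way to check this theory (the /apps/hbase/data path is the usual HDP default; adjust it to your hbase.rootdir):

# From the HBase shell: see how many snapshots exist
list_snapshots

# From the command line: see how much space the archive directory holds
hdfs dfs -du -s -h /apps/hbase/data/archive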
... View more
08-04-2017
02:00 PM
@Mohammedfahim Pathan You assign queues when you submit your jobs; in Spark, for example, you can pass the --queue parameter (see the spark-submit sketch after the XML below). In the YARN configuration you set ACLs that control who can submit to a queue, and those can be the users running the "tools" in question. So you cannot say that a queue is reserved for "Hive", but when you limit a queue to a group that only uses Hive, you achieve the same effect in practice. You can also map users and groups to queues automatically:
<property>
<name>yarn.scheduler.capacity.queue-mappings</name>
<value>u:user1:queue1,g:group1:queue2,u:%user:%user,u:user2:%primary_group</value>
<description>
Here, <user1> is mapped to <queue1>, <group1> is mapped to <queue2>,
maps users to queues with the same name as user, <user2> is mapped
to queue name same as <primary group> respectively. The mappings will be
evaluated from left to right, and the first valid mapping will be used.
</description>
</property>
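As a quick illustration of the job-side half (the queue name, class, and jar below are placeholders):

spark-submit \
  --master yarn \
  --queue hive_users \
  --class com.example.MyApp \
  my-app.jar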
... View more
07-31-2017
03:02 PM
2 Kudos
@younes kafi Please see replies inline below.

1/ Should Kafka brokers be located on the same data nodes, or should they be on separate nodes? Which way is better in terms of performance? Is it possible to have Kafka on a DataNode when Kafka is installed using HDF?
Is this for production? Before answering, I would suggest you engage someone from your local Hortonworks account team to help with these questions. Depending on your ingest volume you might need dedicated Kafka servers; in other cases you may co-locate Kafka on data nodes (this rarely happens in production unless the deployment is very small). Even when you co-locate Kafka on data nodes, make sure you give it dedicated disks and its own Zookeeper. Kafka must have its own Zookeeper, and that Zookeeper should have its own disks; they do not need to be large-capacity disks, but they should be dedicated.

2/ Can Kafka and NiFi share the same Zookeeper, or should Kafka have its own ZK used exclusively by Kafka?
Ideally you do not want Zookeeper to be shared; Kafka should get its own. That said, in my personal opinion sharing Zookeeper with NiFi will be okay. Just do not add any component beyond those two to the Zookeeper dedicated to Kafka.

3/ Does installing NiFi through HDF (Ambari) apply the needed system requirements such as max file handles and max forked processes, or should these requirements be handled before installing through Ambari?
No. When Ambari manages NiFi it lets you configure NiFi, but it is not going to make OS-level changes. Imagine making OS-level changes from NiFi that affect everything else on that server; you do not want that.

4/ Is it possible to have a node that belongs to both an HDF and an HDP cluster at the same time, with the same Ambari agent running on the node?
Two things here. The new version of Ambari manages both HDP and HDF, and yes, you can install HDF services on an HDP cluster. Please see the following link. https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.0/bk_installing-hdf-and-hdp/content/install-hdp.html
... View more
07-30-2017
10:10 PM
@Himanshu Mishra Go to the HBase shell and run "describe 'test'". That will show you how the table was created in HBase. A better way to create tables in Phoenix is to follow this convention: CREATE TABLE TEST (MYKEY VARCHAR NOT NULL PRIMARY KEY, A.COL1 VARCHAR, A.COL2 VARCHAR, B.COL3 VARCHAR) The statement above creates two column families, "A" and "B", with qualifiers "COL1" and "COL2" in column family A and "COL3" in column family B. When you create tables from Phoenix, it also adds an empty key-value for each row so queries work in a SQL-like fashion without requiring you to list all projections in your query. The following link describes how columns are mapped from Phoenix to HBase. https://phoenix.apache.org/faq.html#How_I_map_Phoenix_table_to_an_existing_HBase_table
... View more
07-30-2017
07:15 PM
2 Kudos
@Bala Vignesh N V I am looking into something very similar, and what I have found is that using Hive LLAP/ACID (the merge feature) is the right way to go. Here is what I know so far, from talking to a coworker who has done a couple of successful POCs for SCD Type 2. Before reading on, please see the following link (the video starts at 20:20, which is where the presenter walks through an SCD Type 2 example): https://www.youtube.com/watch?v=EjkOIhdOqek#t=20m20s Here is the approach for implementing SCD Type 2: the initial load was done by exporting the existing data to a landing zone and then doing a CTAS to create optimized ORC tables with table and column stats. Another option is to use the new MERGE statement to load incremental data (see the YouTube link above and the sketch below). One approach is to use a CDC tool (Attunity) plus HDF to stream changes into a diff table and then, at a regular interval, use MERGE to update the SCD Type 2 tables. Hive ACID is different from what people are used to with RDBMS ACID. The transaction scope is only per table or partition, and there are no begin/.../commit statements; everything is essentially auto-commit. Under the covers, delta files are created for the table with the changes, and Hive then runs minor and major compactions to merge the data. This is no doubt slower than a traditional RDBMS, but performance can be improved by increasing the number of compaction threads and running the updates across different tables or partitions. If you have a scenario where new data is only being added (no deletes or updates, just inserts), you can insert multiple rows per statement, which speeds up the ingest. Once the transaction commits, the new data is immediately available to consumers. There is also the concept of batch transactions, which can be used to increase performance for high-velocity transactions, but you would have to code for it; the Hive streaming Storm bolt and NiFi processors use it, but your SQL GUI tools won't.
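To make the merge idea concrete, here is a simplified sketch of the Hive MERGE syntax against hypothetical dim_customer (ACID/ORC) and customer_updates staging tables. It only expires the current row for changed keys and inserts brand-new keys; a real SCD Type 2 flow also inserts the new version of each changed row, typically via a staged source or a second statement:

MERGE INTO dim_customer AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id AND t.is_current = true
WHEN MATCHED AND t.address <> s.address THEN
  UPDATE SET is_current = false, end_date = current_date
WHEN NOT MATCHED THEN
  INSERT VALUES (s.customer_id, s.address, true, current_date, null);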
... View more
07-25-2017
10:12 PM
@PJ These directories exist on the JournalNodes, if that is what you are using, or on whatever disks you specify for the NameNode in Ambari when you do your install. I think you will find the following link helpful. https://hortonworks.com/blog/hdfs-metadata-directories-explained/
... View more
07-25-2017
08:18 PM
2 Kudos
@PJ
If you are just looking for redundancy, it is achieved by writing the NameNode metadata (the edit log) to the JournalNodes (typically three), with both the standby and the active NameNode pointing at the same JournalNodes. When the active NameNode goes down, ZooKeeper simply has to make the standby active, and it is already pointing at the same metadata, which is replicated on the three JournalNodes. If you do not have JournalNodes and you have only one NameNode, then your NameNode metadata is written only once; in that case it is recommended that you use a RAID 10 array so a single disk failure does not result in data loss. To answer your question about whether two copies of the metadata are present: it depends. If you are using RAID 10, the disk array mirrors blocks, but that is not really a copy in the sense you are asking about. If High Availability is enabled and you are using JournalNodes, then you do have three copies of the metadata available on three different nodes.
... View more
07-24-2017
01:59 PM
@Ashis Panigrahi Can you please elaborate on your question? NiFi is event driven. What are you looking for? Each flowfile can represent an individual ingested event.
... View more
07-20-2017
09:13 PM
1 Kudo
@Brad Penelli This looks like a Schema Registry issue. Is the schema name specified in the registry correct (i.e. no typos)? I would also avoid dashes or special characters in the schema name. If everything else is right, simply restart Schema Registry; that seemed to solve my problem.
... View more
07-20-2017
08:11 PM
@Ir Mar You definitely need to use port 2222. I am not sure about your IP, but when I run the sandbox I am able to do "ssh -p 2222 127.0.0.1".
... View more
07-20-2017
08:07 PM
1 Kudo
@Dhiraj What is your question? There is plenty of material available online if you just want to know the differences between the two. The following article summarizes the two approaches and helps guide which one to use when: https://community.hortonworks.com/articles/2473/rolling-upgrade-express-upgrade-in-ambari.html
... View more
07-19-2017
06:38 PM
@Bala Vignesh N V Have you tried groupByKey(), reduceByKey(), or aggregate()? See the sketch below.
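A generic illustration (a hypothetical pair RDD, not your actual data) of the idea:

// reduceByKey combines values per key on the map side before the shuffle,
// which is usually cheaper than groupByKey for simple aggregations.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val counts = pairs.reduceByKey(_ + _)
counts.collect()  // e.g. Array((a,2), (b,1))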
... View more
07-19-2017
06:11 PM
@Jobin George Can you shut down the cluster, delete the flow file from node 4, add the node in Ambari, verify the new flow file is not there before starting, and then start the cluster? I know this is not ideal, because we should be able to add a node without bringing the cluster down, but I just want to see what might make it work.
... View more
07-19-2017
05:48 PM
@Manikandan Jeyabal Are you able to ping the destination host from source?
... View more
07-19-2017
05:43 PM
@Bala Vignesh N V It may be your first line and not the subtract function. Try fixing the HDFS URI. Basically, rely on the configured default filesystem: sc.textFile("hdfs:///data/spark/genome-tags.csv") or, if you haven't provided the Hadoop config, spell out the NameNode URI: sc.textFile("hdfs://<namenode uri>:8020/data/spark/genome-tags.csv")
... View more
07-19-2017
05:38 PM
@Suhel How many users connect to your HiveServer2 concurrently? That determines your memory. From the Hortonworks recommendations, 20 concurrent users need a mere 6 GB; for 10 concurrent connections 4 GB is enough, and for a single connection 2 GB, so you definitely don't want to go below that. When you allocate too much memory, you run into what are called "stop-the-world" garbage collection pauses. You can google the details, but basically the JVM needs to move objects and update the references to them. If it moved an object before updating the references and the running application accessed it through the old reference, there would be trouble; if it updated the reference first, the reference would be wrong until the object was actually moved. For both the CMS and Parallel collectors the young-generation collection is stop-the-world, meaning the application is paused while collection happens. When you allocate a very large heap, like 24 GB, those stop-the-world pauses take longer, and that is why your application fails. Your metastore does not need the same memory as HiveServer2; they are two different processes. If the metastore is running into similar issues, you can set it to 8 GB or less, which is still a lot of memory for just the metastore.
... View more
07-19-2017
05:21 PM
1 Kudo
@Bala Vignesh N V Why not use filter like the following?
val header = data.first
val rows = data.filter(line => line != header)
... View more
07-19-2017
05:16 PM
@Jobin George On your new node, do you have flow.xml.gz? If yes, can you delete it and try adding the node again.
... View more