Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3989 | 10-18-2017 10:19 PM |
| | 4253 | 10-18-2017 09:51 PM |
| | 14625 | 09-21-2017 01:35 PM |
| | 1769 | 08-04-2017 02:00 PM |
| | 2356 | 07-31-2017 03:02 PM |
05-11-2017
03:28 PM
@Jon Page Is your Ambari able to see the disks? Basically, you need to add new entries to dfs.datanode.data.dir that point to the missing disks. The property takes a comma-separated list of directories, so add one line per new mount point, as in this screenshot. screen-shot-2017-05-11-at-102807-am.png
05-11-2017
04:40 AM
4 Kudos
@Venkatesan G One namenode object uses about 150 bytes to store metadata. Assume a 128 MB block size; you should increase the block size if you have a lot of data (PB scale, or even 500+ TB in some cases).

Assume a 150 MB file. The file will be split into two blocks: the first of 128 MB and the second of 22 MB. For this file the namenode stores 1 file inode and 2 blocks, that is, 3 namenode objects, taking about 450 bytes.

For comparison, at a 1 MB block size the same file would have 150 blocks. That is one inode plus 150 block entries, or 151 namenode objects for the same data: 151 x 150 bytes = 22,650 bytes. Even worse would be 150 files of 1 MB each: 150 inodes and 150 blocks = 300 x 150 bytes = 45,000 bytes. See how this all changes; that's why we don't recommend small files for Hadoop. With 128 MB blocks, on average 1 GB of namenode memory is required per 1 million blocks.

Now let's do this calculation at PB scale. Assume 6000 TB of data - that's a lot. Imagine 30 TB of capacity per node; that requires 200 nodes. At a 128 MB block size and a replication factor of 3:

Cluster capacity in MB = 30 x 1000 (convert to GB) x 1000 (convert to MB) x 200 nodes = 6,000,000,000 MB (6000 TB)

How many blocks can we store in this cluster? 6,000,000,000 MB / 128 MB = 46,875,000 (about 47 million blocks). At 1 GB of memory per million blocks, that is 46,875,000 / 1,000,000 = about 47 GB of memory. Namenodes with 64-128 GB of memory are quite common.

You can do a few things here:
1. Increase the block size to 256 MB; that will save you quite a bit of namenode space, and at large scale you should do it regardless.
2. Get more memory for the namenode, probably 256 GB (I have never had a customer go this far - maybe someone else can chime in).

Finally, read https://issues.apache.org/jira/browse/HADOOP-1687 and the document in your link (notice that for 40-50 million files only 24 GB is recommended - about half of our calculation, probably because the block size assumed at that scale is 256 MB rather than 128 MB).
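A small Java sketch of the same estimate, assuming the rules of thumb above (about 150 bytes per namenode object, about 1 GB of heap per million blocks) and the hypothetical 200-node, 30 TB-per-node cluster used in this example:

```java
public class NamenodeMemoryEstimate {
    // Rules of thumb from the discussion above
    static final long BYTES_PER_NN_OBJECT = 150;     // rough metadata cost per inode or block
    static final double GB_PER_MILLION_BLOCKS = 1.0; // rough heap need per million blocks

    public static void main(String[] args) {
        long clusterCapacityMb = 30L * 1000 * 1000 * 200; // 30 TB/node x 200 nodes = 6,000,000,000 MB
        long blockSizeMb = 128;

        long blocks = clusterCapacityMb / blockSizeMb;                 // ~46,875,000 blocks
        double heapGb = blocks / 1_000_000.0 * GB_PER_MILLION_BLOCKS;  // ~47 GB

        System.out.printf("Blocks: %,d -> estimated namenode heap: ~%.0f GB%n", blocks, heapGb);

        // Per-file object count: one inode plus one object per block
        long fileSizeMb = 150;
        long fileBlocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division -> 2 blocks
        long objects = 1 + fileBlocks;                                  // 3 namenode objects
        System.out.printf("A %d MB file: %d objects, ~%d bytes of metadata%n",
                fileSizeMb, objects, objects * BYTES_PER_NN_OBJECT);
    }
}
```

Running it with blockSizeMb = 256 halves the block count and the heap estimate, which is the first recommendation above.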
05-08-2017
05:19 PM
@Arkaprova Saha
First of all, NiFi is far more mature and easier to use and will do this job much better. That being said, let me try to answer your questions.

1. If I have millions of rows in the database, how can I proceed with the multi-threading option? It depends on what you mean by multi-threading. Are you getting data from one table? In that case, use tasks.max in your property file (for example, tasks.max=4 lets the connector run up to four parallel tasks). Check this link; it describes how to use it.

2. Can we use multiple brokers in Kafka Connect? If you are getting data from different tables and you submit the JDBC connector to a distributed cluster, it will automatically divide the work into a number of tasks equal to the number of tables, spread across the different workers (see the sketch below for submitting such a connector to a distributed cluster).

3. How can we implement security in this offload from RDBMS to a Kafka topic? Your RDBMS security is enforced by your database and requires authentication from your program as well as authorization of what that user can read. As for Kafka, the security model is described in the link here: you can authenticate clients via SSL or Kerberos and authorize operations. There is nothing special about this scenario; you should have that in place regardless of this offload, and it works the same way here.

4. During the data offload, let's say my server goes down. How will it behave after the Kafka server restarts? Please take a look at this discussion; it should answer your question.
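To illustrate points 1 and 2, here is a hedged Java sketch that submits a JDBC source connector to a distributed Connect cluster through its REST API. The worker host, the Confluent JdbcSourceConnector class, the connection details, and the table list are assumptions for the example, not values from your environment:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SubmitJdbcConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical Connect worker; distributed mode listens on port 8083 by default
        URL url = new URL("http://connect-worker-1:8083/connectors");

        // tasks.max caps how many parallel tasks the connector may spawn;
        // with multiple tables, Connect spreads those tasks across the workers.
        String payload = "{"
                + "\"name\": \"rdbms-offload\","
                + "\"config\": {"
                + "  \"connector.class\": \"io.confluent.connect.jdbc.JdbcSourceConnector\","
                + "  \"connection.url\": \"jdbc:mysql://dbhost:3306/sales\","
                + "  \"connection.user\": \"etl_user\","
                + "  \"connection.password\": \"secret\","
                + "  \"table.whitelist\": \"orders,customers,payments\","
                + "  \"mode\": \"incrementing\","
                + "  \"incrementing.column.name\": \"id\","
                + "  \"topic.prefix\": \"rdbms-\","
                + "  \"tasks.max\": \"3\""
                + "}}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Connect REST response: " + conn.getResponseCode());
    }
}
```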
05-08-2017
03:13 PM
@Dinesh Chitlangia You can use PhoenixRuntime.AUTO_COMMIT_ATTRIB, which is defined in the following class: https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/util/PhoenixRuntime.java Please check the following JIRA to make sure you are using a version of Phoenix that supports it: https://issues.apache.org/jira/browse/PHOENIX-1559
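A minimal sketch of setting that attribute on the JDBC connection properties; the ZooKeeper quorum in the URL is a placeholder, not a real host:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

import org.apache.phoenix.util.PhoenixRuntime;

public class PhoenixAutoCommitExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Enable auto-commit for every statement on this connection
        props.setProperty(PhoenixRuntime.AUTO_COMMIT_ATTRIB, "true");

        // Hypothetical ZooKeeper quorum; replace with your own
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181", props)) {
            System.out.println("autoCommit = " + conn.getAutoCommit()); // should print true
        }
    }
}
```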
05-08-2017
03:08 PM
@saravanan gopalsamy Please check your /etc/sysconfig/network file. What is the HOSTNAME value? It should be the fully qualified domain name. Then compare it with your /etc/hosts file.

@saravanan gopalsamy Ambari is not able to resolve reverse DNS. What is the value of your hostname in /etc/sysconfig/network? It should be your fully qualified domain name. Is this also what your /etc/hosts entry points to?
05-05-2017
07:41 PM
1 Kudo
@Smart Data What is the value of hive.server2.authentication in hive-site.xml? The default is NONE. Assuming it is NONE and you don't specify a username, the driver will default to using "anonymous" as the username.
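A minimal sketch, assuming hive.server2.authentication is NONE and a hypothetical HiveServer2 host: no username or password is supplied, so the driver falls back to "anonymous" as described above.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class HiveAnonymousJdbc {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver; the hive-jdbc (standalone) jar must be on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 host. With no credentials and authentication=NONE,
        // the session user defaults to "anonymous".
        String url = "jdbc:hive2://hs2-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}
```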
05-05-2017
05:18 PM
@Ahmad Debbas Your exception is java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver. Are you passing the JDBC driver to your program? Also, you are specifying com.sqlserver.jdbc.Driver while it is looking for com.microsoft.sqlserver.jdbc.SQLServerDriver. I am not sure where that class name is coming from, but your issue is that the SQL Server JDBC driver is not available to your program.
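A minimal check, with placeholder host, database, and credentials, that fails with exactly this ClassNotFoundException when the Microsoft JDBC jar (sqljdbc/mssql-jdbc) is missing from the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class SqlServerJdbcCheck {
    public static void main(String[] args) throws Exception {
        // Throws ClassNotFoundException if the Microsoft JDBC jar is not on the classpath
        Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");

        // Hypothetical host, database, and credentials for illustration only
        String url = "jdbc:sqlserver://sqlhost:1433;databaseName=mydb";
        try (Connection conn = DriverManager.getConnection(url, "sa", "password")) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}
```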
05-03-2017
02:17 AM
@Kumar Veerappan I actually disagree that 1 TB on MS SQL Server equals 3 TB on Hadoop.

When you store data in MS SQL Server, what kind of storage layer are you using? If it's SAN, the SAN already replicates data for resiliency; imagine if SQL Server kept only one copy of the data - without a doubt, you would have lost some of it by now. If you are not on SAN, then you probably have a RAID array, and depending on the type of RAID you again have multiple copies of the data to protect against disk failure. You might also be using a technique like erasure coding, which you can use with Hadoop as well; here is a link for more details.

Next point: you probably also have a DR SQL Server or backups. You will need that for Hadoop too, yet here we are comparing just one MS SQL Server environment against one Hadoop cluster.

In Hadoop you will, and should, compress data. There is a good chance your current SQL Server environment is also compressed, in which case assume compression in Hadoop does not buy you much. If your data is currently not compressed, then you'll be happy to see the results with Snappy or ZLIB in Hadoop.

Going back to your original question: data is data. If you have 100 TB of data, Hadoop will keep three copies of it, but if compression shrinks the data by a factor of about 3, that replication footprint effectively disappears.

As for the size of the cluster, that depends on a number of factors. Are you buying hardware for the next three years (the right thing to do)? Then plan accordingly. Give each node 12 x 2 TB = 24 TB of capacity, or go higher - these days a number of factors come into this. See the following link (it ignores the power consumption of the extra disks, which can matter a lot to your infrastructure team): https://community.hortonworks.com/content/kbentry/48878/hadoop-data-node-density-tradeoff.html On each node, also set aside about 20% of capacity for temporary files and space used by running jobs.
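A rough Java sketch of that sizing arithmetic; the 100 TB of source data, 3x compression ratio, 24 TB node capacity, and 20% temp-space reserve are illustrative assumptions, not measurements:

```java
public class ClusterSizingSketch {
    public static void main(String[] args) {
        double rawDataTb = 100;         // data to migrate (assumed)
        double replicationFactor = 3;   // HDFS default
        double compressionRatio = 3;    // assumed Snappy/ZLIB gain; measure on your own data
        double nodeCapacityTb = 12 * 2; // 12 x 2 TB disks per node
        double tempOverhead = 0.20;     // scratch space reserved for running jobs

        // Storage consumed on HDFS after compression and replication
        double hdfsFootprintTb = rawDataTb / compressionRatio * replicationFactor;

        // Usable capacity per node after reserving temp space
        double usablePerNodeTb = nodeCapacityTb * (1 - tempOverhead);

        int nodes = (int) Math.ceil(hdfsFootprintTb / usablePerNodeTb);
        System.out.printf("HDFS footprint: %.1f TB, usable per node: %.1f TB, nodes needed: %d%n",
                hdfsFootprintTb, usablePerNodeTb, nodes);
    }
}
```

With a 3x compression ratio the replication overhead roughly cancels out, which is the point made above; a cluster sized this way still needs headroom for growth.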
05-03-2017
01:55 AM
@Karan Alang To run count(*) you need the SELECT privilege on the whole table. You can still run "select count(<column name>) from <table name>" and that will work if you have SELECT on that column, but to run count(*) you need SELECT permission on the entire table. This is working as expected.
05-02-2017
02:59 PM
1 Kudo
@Sanaz Janbakhsh Do you mean SOLR on HDF or SOLR on HDFS? As for SOLR on HDFS, the answer is yes: we provide support for it through our partnership with Lucidworks. Since you are starting new, I suggest you use either HDP 2.5.5 or HDP 2.6.0 (and later update to a maintenance release): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.5/bk_solr-search-installation/content/ch_hdp-search-install-ambari.html