Member since: 11-12-2018
Posts: 218
Kudos Received: 179
Solutions: 35

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 337 | 08-08-2025 04:22 PM |
 | 405 | 07-11-2025 08:48 PM |
 | 620 | 07-09-2025 09:33 PM |
 | 1121 | 04-26-2024 02:20 AM |
 | 1476 | 04-18-2024 12:35 PM |
12-04-2018
07:26 AM
3 Kudos
@Michael Bronson The NameNode stores metadata about the data held in the DataNodes, whereas the DataNodes store the actual data. The NameNode also requires RAM roughly proportional to the number of data blocks in the cluster; a good rule of thumb is to assume 1 GB of NameNode heap for every 1 million blocks stored in the distributed file system. With 100 DataNodes in a cluster, 64 GB of RAM on the NameNode provides plenty of room to grow the cluster. So a single NameNode can handle thousands of DataNodes, but there are many factors to consider: NameNode memory size, number of blocks to be stored, block replication factor, how the cluster will be used, and so on. In short, the number of DataNodes a single NameNode can handle depends on the size of the NameNode, i.e. how much metadata it can hold. Please accept the answer you found most useful.
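As a rough sizing check, here is a hedged sketch based on the rule of thumb above (the block count shown is illustrative):

# Report the current block count in the cluster
hdfs fsck / | grep 'Total blocks'
# Apply the rule of thumb: e.g. 60,000,000 blocks / 1,000,000 ≈ 60 GB of NameNode heap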
12-01-2018
10:48 AM
3 Kudos
@Gulshan Agivetova You can force Ambari Server to start by skipping this check with the following option: ambari-server start --skip-database-check
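A minimal hedged sketch of that workaround, plus checking what the consistency check complained about (the log location assumes a default Ambari installation):

ambari-server stop
ambari-server start --skip-database-check
# Review the server log for the underlying database inconsistency
tail -n 200 /var/log/ambari-server/ambari-server.log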
11-27-2018
06:03 AM
3 Kudos
@Amit Mishra Knox can be configured with authentication options other than LDAP. Here are links to the list of supported authentication providers for Knox (e.g., LDAP, PAM, Kerberos): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_security/content/authentication_providers.html and https://knox.apache.org/books/knox-1-1-0/user-guide.html#HadoopAuth+Authentication+Provider Please accept the answer you found most useful.
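Once a provider is configured in the topology, a quick hedged way to verify authentication end to end is a curl call through the gateway (the hostname, port 8443, and the "default" topology name below are assumptions):

# LDAP or PAM (HTTP Basic authentication)
curl -ku myuser:mypassword 'https://knox-host.example.com:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS'
# A Kerberos-backed topology (HadoopAuth provider) would use SPNEGO instead, e.g. curl --negotiate -u :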
11-26-2018
02:13 PM
2 Kudos
@vamsi valiveti Shuffling is the process of transferring data from the mappers to the reducers, so it is clearly necessary for the reducers; otherwise they would not have any input (or input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That is why you can see a reduce status greater than 0% (but less than 33%) while the map status is not yet 100%. Sorting saves time for the reducer by making it easy to tell where one key's values end and the next key's begin: a new reduce() call simply starts when the next key in the sorted input differs from the previous one. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method, which takes a key and a list of values, so it has to group values by key. That is easy to do if the input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers). A great source of information on these steps is the Yahoo! Hadoop tutorial. Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)); the MapReduce job then stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster). Please accept the answer you found most useful.
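As a hedged illustration of that last point, a map-only run can be requested from the command line with mapreduce.job.reduces=0, the configuration equivalent of setNumReduceTasks(0) (the examples-jar path and input/output directories below are assumptions):

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
  wordcount -D mapreduce.job.reduces=0 /tmp/input /tmp/output-map-only
# The job stops after the map phase: mapper output is written straight to HDFS,
# with no shuffle and no sort.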
11-25-2018
10:25 AM
3 Kudos
@raja reddy
You can copy the HDFS files from your dev cluster to the prod cluster, re-create the Hive tables on the prod cluster, and then rebuild the partition metadata with the MSCK REPAIR TABLE command. To re-create the Hive tables, you can get the CREATE statement for each table by running show create table <table_name> in your dev cluster.
Following are the high-level steps involved in a Hive migration (a command-level sketch follows this list):
1. Use the distcp command to copy the complete database directories under the Hive warehouse (/user/hive/warehouse) from the dev cluster to the prod cluster. https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/administration/content/using_distcp.html
2. Once the files are on the new prod cluster, take the DDL from the dev cluster (i.e., show create table <table_name>) and create the Hive tables on the prod cluster. https://community.hortonworks.com/articles/107762/how-to-extract-all-hive-tables-ddl.html
3. Run a metastore check with repair table (MSCK REPAIR TABLE), which adds metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)

If the clusters are Kerberized, you can refer to the link below for distcp between secure clusters. https://community.hortonworks.com/content/supportkb/151079/configure-distcp-between-two-clusters-with-kerbero.html

Note: There's no need for Hive export because you can copy the data directly over HDFS between the two clusters. Please accept the answer you found most useful.
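A hedged command-level sketch of the three steps (the cluster hostnames, database/table names, and the create_tables.hql file are assumptions):

# 1. Copy the warehouse data from dev to prod
hadoop distcp hdfs://dev-nn.example.com:8020/user/hive/warehouse/mydb.db \
  hdfs://prod-nn.example.com:8020/user/hive/warehouse/mydb.db

# 2. Extract the DDL on dev and replay it on prod
beeline -u jdbc:hive2://dev-hs2.example.com:10000/default -e 'SHOW CREATE TABLE mydb.mytable;'
beeline -u jdbc:hive2://prod-hs2.example.com:10000/default -f create_tables.hql

# 3. Rebuild partition metadata for the migrated tables on prod
beeline -u jdbc:hive2://prod-hs2.example.com:10000/default -e 'MSCK REPAIR TABLE mydb.mytable;'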
11-24-2018
01:17 AM
2 Kudos
@Philip Reilly To define your choice of port for the NameNode web UI, set the dfs.namenode.http-address property in conf/hdfs-site.xml. Please accept the answer you found most useful.
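To confirm which value is currently in effect, a quick hedged check (the output shown is illustrative):

hdfs getconf -confKey dfs.namenode.http-address
# e.g. namenode.example.com:50070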
11-22-2018
02:38 AM
3 Kudos
@Arindam Choudhury You can use the -n and -p options to specify the username and password. For instance: beeline -u jdbc:hive2://follower-2.europe-west3-b.c.XXXXXX.internal:10000/default -n username -p password Instead of passing the password in plaintext with -p, you can also read it from a permission-protected password file with -w. For instance: beeline -u jdbc:hive2://follower-2.europe-west3-b.c.XXXXXX.internal:10000/default -n username -w password_file In short, to answer your question: when beeline-hs2-connection.xml is present and no other arguments are provided, Beeline automatically connects using the URL generated from the configuration files. When connection arguments (-u, -n or -p) are provided, Beeline uses them and does not use beeline-hs2-connection.xml to connect automatically. For more details you can refer to the links below. https://issues.apache.org/jira/browse/HIVE-14063 https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients Please accept the answer you found most useful.
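A minimal hedged sketch of the password-file approach (the file path and password are assumptions; the connection URL matches your example):

# Store the password without a trailing newline and lock down the file
echo -n 'MySecretPassword' > ~/.beeline_pass
chmod 600 ~/.beeline_pass
beeline -u jdbc:hive2://follower-2.europe-west3-b.c.XXXXXX.internal:10000/default \
  -n username -w ~/.beeline_pass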
11-21-2018
01:39 PM
3 Kudos
@Gayathri Reddy G Step 1: Submit the first HTTP PUT request to the NameNode; it responds with a 307 TEMPORARY_REDIRECT whose Location header points to the DataNode where the data will be written (no file data is sent in this step). Step 2: Submit another HTTP PUT request, with the file data, to the URL from that Location header. The client then receives a 201 Created response with zero content length and the WebHDFS URI of the file in the Location header. Please accept the answer you found most useful.
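A hedged curl sketch of the two-step write (the NameNode host, port, user, and file names are assumptions):

# Step 1: ask the NameNode where to write; note the DataNode URL in the
# Location header of the 307 TEMPORARY_REDIRECT response
curl -i -X PUT 'http://namenode.example.com:50070/webhdfs/v1/tmp/test.txt?op=CREATE&user.name=hdfs'

# Step 2: send the file data to the URL returned in step 1
curl -i -X PUT -T test.txt '<Location URL from step 1>'
# A successful write returns 201 Created with the WebHDFS URI in the Location header.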
11-21-2018
12:10 PM
3 Kudos
@Gayathri Reddy G No specific header is needed other than Content-Type and charset, which you already mentioned in your command. I tried to replicate the same command and was able to write the file to HDFS using curl via WebHDFS. It looks like there is a space in the path; could you please verify your command again? Reference: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
11-21-2018
06:14 AM
3 Kudos
@Gulshan Agivetova From the ambari-server log, I can see Ambari could not load the version definition for HDP-2.6. I would recommend cleaning up all the old Python libraries, doing a yum cleanup, and reinstalling fresh. Could you please try the steps below, adjusted as appropriate for your host machine?
1. Clean up the old Ambari installation, including the old Python libraries:
yum remove ambari-server ambari-agent -y
rm -f /usr/sbin/ambari*
rm -rf /usr/lib/python2.6/site-packages/ambari_commons
rm -rf /usr/lib/python2.6/site-packages/resource_management
rm -rf /usr/lib/python2.6/site-packages/ambari_jinja2
rm -rf /usr/lib/ambari-server
rm -rf /usr/lib/ambari-agent
2. Get the new repo file appropriate for your host machine:
wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.6.5.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
3. Perform a yum cleanup:
yum clean all
4. Do a fresh install of the Ambari binaries:
yum install ambari-server -y
yum install ambari-agent -y
5. Perform the ambari-server setup again:
/usr/sbin/ambari-server.py setup --databasehost=localhost --databasename=ambari --databaseusername=ambari --postgresschema=ambari --databasepassword=ambari --databaseport=5432 --database=postgres -s
These steps should resolve your issue.