Member since: 10-02-2017
Posts: 112
Kudos Received: 71
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
| 2040 | 08-09-2018 07:19 PM
| 2271 | 03-16-2018 09:21 AM
| 2593 | 03-07-2018 10:43 AM
| 616 | 02-19-2018 11:42 AM
| 3111 | 02-02-2018 03:58 PM
09-10-2018
02:04 PM
1. It is the Ambari alert framework that is creating the ticket, as it does a regular health check by connecting to HS2 using beeline; beeline needs a Kerberos ticket to connect to HS2. 2. The frequency is every 2 minutes. 3. HiveServer2 always generates its TGT in memory and never in the disk cache. 4. Even if you delete the ticket, your HS2 will work fine. 5. If you have more than one HiveServer2, you will see the ticket being generated on only one of the instances.
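If you want to verify point 4 yourself, a minimal sketch (run as the hive service user on the HS2 host; credential cache locations vary by setup):
klist        # show the ticket in the current credential cache
kdestroy     # remove it; HS2 keeps working because its TGT lives in memory, not in this cache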
09-10-2018
02:01 PM
1. Can you please log in to the host running the NameNode. 2. "id username" shows the groups the user belongs to. Do you see those groups present in Ranger for the user? There is also a possibility that LDAP is configured directly and the groups are being pulled from LDAP.
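For example (the user name below is a placeholder), you can compare OS-level and Hadoop-level group resolution on the NameNode host:
id someuser
hdfs groups someuser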
09-10-2018
01:52 PM
Preemption of a query is not enabled up to HDP 2.6.4. Kindly use 2.6.5 or a higher version to utilize the preemption feature of LLAP.
09-10-2018
01:50 PM
Try setting the parameters before running the query. Play with mapred.min.split.size and mapred.max.split.size for optimal performance: set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; set mapred.min.split.size=100000000;
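A minimal sketch of such a session, assuming a roughly 100 MB minimum and 256 MB maximum split size (tune both to your data; the table name is a placeholder):
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=100000000;
set mapred.max.split.size=256000000;
select count(*) from your_table;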
09-10-2018
11:45 AM
1. Spark dynamic allocation: I believe your Zeppelin is configured to spawn as many executors as possible for Spark. Kindly enable dynamic allocation for Spark in Zeppelin. 2. YARN queue user limit: can you also check what your YARN queue configuration is? You can limit the number of containers that can be used by a given user via the user limit factor.
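A minimal sketch of the Spark properties involved for point 1 (set in the Zeppelin Spark interpreter or spark-defaults; the executor bounds are illustrative values, not recommendations):
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=10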
09-10-2018
11:42 AM
By default your Spark job will spawn one task for each file, hence it will be highly parallel. This will also be inefficient, as tasks take time to spawn for every file if there are not enough executors to run 2500 tasks in parallel (or 2500 * X tasks, where X is the number of days). Various approaches: 1. Try writing a CombineParquetFileInputFormat (http://bytepadding.com/big-data/spark/combineparquetfileinputformat/) so that one task can read multiple files located on the same host or rack. 2. Run a merge job before reading the files.
09-10-2018
11:35 AM
1 Kudo
Kindly find the heap sizing of all the services for an 80-node HDP cluster (HDP 2.6.4): https://community.hortonworks.com/content/kbentry/209792/tuning-heap-sizing-for-a-80-node-hdp-cluster.html https://community.hortonworks.com/articles/209789/making-hiveserver2-instance-handle-more-than-500-c.html
09-10-2018
11:29 AM
1. Spark dynamic allocation: I believe your Zeppelin is configured to spawn as many executors as possible for Spark. Kindly enable dynamic allocation for Spark in Zeppelin. 2. YARN queue user limit: can you also check what your YARN queue configuration is? You can limit the number of containers that can be used by a given user via the user limit factor.
09-10-2018
11:22 AM
Can you try running MSCK REPAIR TABLE tablename and then re-execute your query?
08-31-2018
03:37 PM
12 Kudos
Thanks to Christoph Gutsfeld, Matthias von Görbitz and Rene Pajta for all their valuable pointers for writing this article. The article provides a detailed and thorough understanding of Hive LLAP.

Understanding YARN

YARN is essentially a system for managing distributed applications. It consists of a central Resource Manager, which arbitrates all available cluster resources, and a per-node Node Manager, which takes direction from the Resource Manager. The Resource Manager and Node Manager follow a master-slave relationship. The Node Manager is responsible for managing the available resources on a single node. YARN defines its unit of work in terms of a container, which is available on each node. The Application Master negotiates containers with the scheduler (one of the components of the Resource Manager); containers are launched by the Node Manager.

Understanding YARN memory configuration
Memory allocated for all YARN containers on a node: the total amount of memory that can be used by the Node Manager on each node for allocating containers.
Minimum container size: the minimum amount of RAM that will be allocated to a requested container. Any container requested will be allocated memory in a multiple of the minimum container size.
Maximum container size: the maximum amount of RAM that can be allocated to a single container. Maximum container size <= memory allocated for all YARN containers on a node.
LLAP daemons run as YARN containers, hence the LLAP daemon size should be >= the minimum container size but <= the maximum container size.
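As an illustration, these three settings map to the following yarn-site.xml properties (the values are only example numbers for a worker node with roughly 200 GB usable for YARN, not recommendations):
<property><name>yarn.nodemanager.resource.memory-mb</name><value>204800</value></property>
<property><name>yarn.scheduler.minimum-allocation-mb</name><value>4096</value></property>
<property><name>yarn.scheduler.maximum-allocation-mb</name><value>204800</value></property>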
Understanding CPU configuration

Percentage of physical CPU allocated for all containers on a node: X% of the total CPU that can be used by containers. The value should never be 100%, as CPU is also needed by the DataNodes, the Node Manager and the OS.
Minimum container vcores: the minimum number of vcores that will be allocated to a given container.
Maximum container vcores: the maximum number of vcores that can be allocated to a container.
CPU isolation: this enables cgroups, forcing containers to use exactly the number of CPUs allocated to them. If this option is disabled, a container is free to occupy all the CPUs available on the machine.
The LLAP daemon runs as one big YARN container, hence always ensure that the maximum container vcores is set equal to the number of vcores available to run YARN containers (roughly 80% of the total number of CPUs available on that host). If CPU isolation is enabled it becomes even more important to set the maximum container vcores to an appropriate value.

Hive LLAP Architecture

https://cwiki.apache.org/confluence/display/Hive/LLAP
Known as Live Long
and Process, LLAP provides a hybrid execution model. It consists of a
long-lived daemon which replaces direct interactions with the HDFS Data Node,
and a tightly integrated DAG-based framework.
Functionality such as caching, pre-fetching, some query processing and access
control are moved into the daemon. Small/short queries are largely
processed by this daemon directly, while any heavy lifting will be performed in
standard YARN containers. Similar to the Data
Node, LLAP daemons can be used by other applications as well, especially if a
relational view on the data is preferred over file-centric processing. The
daemon is also open through optional APIs (e.g., Input Format) that can be
leveraged by other data processing frameworks as a
building block.

Hive LLAP consists of the following components:

Hive Interactive Server: a Thrift server which provides a JDBC interface to connect to Hive LLAP.
Slider AM: the Slider application which spawns, monitors and maintains the LLAP daemons.
Tez AM query coordinator: the Tez AM which accepts the incoming requests of the users and executes them in the executors available inside the LLAP daemons (JVMs).
LLAP daemons: to facilitate caching and JIT optimization, and to eliminate most of the startup costs, a daemon runs on the worker nodes of the cluster. The daemon handles I/O, caching, and query fragment execution.

LLAP configuration in detail
Component | Parameter | Conf section | Rule and comments
---|---|---|---
Slider AM size | slider_am_container_mb | hive-interactive-env | = yarn.scheduler.minimum-allocation-mb
Tez AM coordinator size | tez.am.resource.memory.mb | tez-interactive-site | = yarn.scheduler.minimum-allocation-mb
Number of coordinators | hive.server2.tez.sessions.per.default.queue | Settings | Number of concurrent queries LLAP supports; this results in spawning an equal number of Tez AMs
LLAP daemon size | hive.llap.daemon.yarn.container.mb | hive-interactive-site | yarn.scheduler.minimum-allocation-mb <= daemon size <= yarn.scheduler.maximum-allocation-mb; rule of thumb: always set it to yarn.scheduler.maximum-allocation-mb
Number of daemons | num_llap_nodes_for_llap_daemons | hive-interactive-env | Number of LLAP daemons running; this determines the total cache and executors available to run any query on LLAP
Executor size | hive.tez.container.size | hive-interactive-site | 4-6 GB is the recommended value; for each executor you need to allocate one vcore
Number of executors | hive.llap.daemon.num.executors | | Determined by the "Maximum VCores in YARN"
LLAP daemon configuration in detail

Component | Parameter | Section | Rule and comments
---|---|---|---
Maximum YARN container size | yarn.scheduler.maximum-allocation-mb | YARN settings | This is the maximum amount of memory a container can be allocated; it is recommended to run the LLAP daemon as one big container on a node
Daemon size | hive.llap.daemon.yarn.container.mb | hive-interactive-site | yarn.scheduler.minimum-allocation-mb <= daemon size <= yarn.scheduler.maximum-allocation-mb; rule of thumb: always set it to yarn.scheduler.maximum-allocation-mb
Headroom | llap_headroom_space | hive-interactive-env | MIN(5% of daemon size, 6 GB); off-heap, but part of the LLAP daemon
Heap size | llap_heap_size | hive-interactive-env | Number of executors * hive.tez.container.size
Cache size | hive.llap.io.memory.size | hive-interactive-site | Daemon size - heap size - headroom; off-heap, but part of the LLAP daemon

LLAP queue size = Slider AM size + number of Tez containers * hive.tez.container.size + LLAP daemon size * number of LLAP daemons.
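A worked example of these formulas, purely illustrative (all numbers are assumptions, not recommendations): assume yarn.scheduler.maximum-allocation-mb = 64 GB, hive.tez.container.size = 4 GB, 12 executors per daemon, 10 daemons, 3 coordinators and a 1 GB Slider AM.
Daemon size = yarn.scheduler.maximum-allocation-mb = 64 GB
Heap size = 12 executors * 4 GB = 48 GB
Headroom = MIN(5% of 64 GB, 6 GB) = ~3 GB
Cache size = 64 GB - 48 GB - 3 GB = ~13 GB
LLAP queue size = 1 GB + 3 * 4 GB + 10 * 64 GB = ~653 GB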
LLAP on YARN: interactive query configuration

LLAP YARN queue configuration. Key configurations to set:
1. User limit factor = 1
2. Capacity and max capacity = 100

[Screenshots in the original article: LLAP daemon in detail, sizing rules of thumb, parameter tuning, cache]

Managing LLAP through the command line utility (Flex)
List Slider jobs: slider list
Get Slider status: slider status llap0
Get the diagnostic status of a Slider app: slider diagnostics --application --name llap0 --verbose
Scale down an LLAP daemon: slider flex llap0 --component LLAP -1
Scale up a new LLAP daemon: slider flex llap0 --component LLAP +1
Stop the Slider app: slider stop llap0
(llap0 is the default Slider application name for LLAP; substitute your own application name.)

Troubleshooting: finding which hosts the LLAP daemons are running on
Ambari -> Hive -> HiveServer2 Interactive UI -> Running Instances

Behavior
1. In HDP 2.6.4 preemption of queries is not supported.
2. If multiple concurrent queries have exhausted the queue, then any incoming query will be in a waiting state.
3. All queries running on Hive LLAP can be seen in the Tez UI.
08-30-2018
04:46 PM
1 Kudo
Why use spark-submit if spark-shell is there? 1. spark-shell spawns executors on random nodes, hence the chances of data locality are very low. 2. spark-submit spawns the executors based on the nodes where the data is stored, hence spark-submit will be more performant compared to spark-shell. 3. spark-shell is good when data exploration needs to be done, as it provides an interactive CLI to run your code.
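For reference, the two invocations look like this (the class name, jar name and executor count are placeholders):
spark-shell --master yarn --num-executors 4
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp --num-executors 4 myapp.jar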
08-30-2018
03:42 PM
3 Kudos
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark's power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop. The Hadoop YARN-based architecture provides the foundation that enables Spark and other applications to share a common cluster and dataset while ensuring consistent levels of service and response. Spark is now one of many data access engines that work with YARN in HDP. https://hortonworks.com/apache/spark/ Spark code samples: http://bytepadding.com/spark/

Take away
1. Spark is a library and not a service.
2. Spark interacts with multiple services like HDFS and YARN to process data.
3. The Spark client also has a YARN client wrapped within it.
4. Spark can be configured to run both locally and on the cluster.
5. The Spark context is the entry point to interact with a Spark process.
6. Spark is a JVM-based execution engine.

Take away
1. Each line of your code is parsed to prepare a Spark plan.
2. sc.textFile => results in fetching the meta info from the NameNode about where the file blocks are located, and requesting YARN for containers on those hosts. The text file format also provides the information about the record delimiter used (the newline character in the case of text).
3. The transformations are all grouped together in a task. The transformations are serialized on the driver and sent to the executors. Do appreciate that all transformation and object creation happens on the driver and is subsequently sent to the executors.
4. reduceBy results in data shuffling, known as stages in Spark.
5. saveAsTextFile interacts with the NameNode to get information about where to save the file, and saves the file on HDFS.
08-30-2018
02:51 PM
3 Kudos
Hadoop Distributed File System
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will "just work" under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.

Take away
1. HDFS is based on a master-slave architecture, with the NameNode (NN) being the master and the DataNodes (DN) being the slaves.
2. The NameNode stores only the meta information about the files; the actual data is stored on the DataNodes.
3. Both the NameNode and DataNode are processes and not any super fancy hardware.
4. The DataNode uses the underlying OS file system to save the data.
5. You need to use an HDFS client to interact with HDFS. The HDFS client always talks to the NameNode for meta info and subsequently talks to the DataNodes to read/write data. No data I/O happens through the NameNode.
6. HDFS clients never send data to the NameNode, hence the NameNode never becomes a bottleneck for any data I/O in the cluster.
7. The HDFS client has the "short-circuit" feature enabled, hence if the client is running on a node hosting a DataNode it can read the file directly from that DataNode, making the complete read/write local.
8. To make it even simpler, imagine the HDFS client as a web client and HDFS as a whole as a web service with predefined operations such as GET, PUT, COPYFROMLOCAL etc.

How is a 400 MB file saved on HDFS with an HDFS block size of 100 MB? The diagram shows how the first block is saved. In the case of replication, each block will be saved on 3 different DataNodes. The meta info is saved on the NameNode (a replication factor of 3 is used, hence each block is saved thrice).
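You can see the block layout for yourself on a real file (the file name and path below are placeholders); fsck prints every block and the DataNodes holding its replicas:
hdfs dfs -put big_file.dat /tmp/big_file.dat
hdfs fsck /tmp/big_file.dat -files -blocks -locations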
Block placement strategy
Place the first replica somewhere: either on a random node (if the HDFS client is outside the Hadoop/DataNode cluster) or on the local node (if the HDFS client is running on a node where a DataNode is running, using the "short-circuit" optimization). Place the second replica in a different rack (this ensures that if the power supply of one rack goes down, the block can still be read from the other rack). Place the third replica in the same rack as the second replica (this ensures that if a YARN container can be allocated on a given host, the data will be served from a host in the same rack; data transfer within the same rack is faster than across racks). If there are more replicas, spread them across the rest of the racks.

YARN (Yet Another Resource Negotiator): does it ring a bell? 'Yet Another Hierarchically Organized Oracle', YAHOO.
YARN is essentially a system for managing distributed applications. It consists of a central Resource Manager (RM), which arbitrates all available cluster resources, and a per-node Node Manager (NM), which takes direction from the Resource Manager. The Node Manager is responsible for managing the available resources on a single node. http://hortonworks.com/hadoop/yarn/

Take away
1. YARN is based on a master-slave architecture, with the Resource Manager being the master and the Node Managers being the slaves.
2. The Resource Manager keeps the meta info about which jobs are running on which Node Manager and how much memory and CPU are consumed, and hence has a holistic view of the total CPU and RAM consumption of the whole cluster.
3. The jobs run on the Node Managers and never execute on the Resource Manager, hence the RM never becomes a bottleneck for any job execution. Both RM and NM are processes, not some fancy hardware.
4. A container is a logical abstraction for CPU and RAM.
5. YARN schedules containers (CPU and RAM) over the whole cluster. Hence, if an end user needs CPU and RAM in the cluster, they need to interact with YARN.
6. While requesting CPU and RAM, you can specify the host on which you need it.
7. To interact with YARN you need to use a YARN client.

How HDFS and YARN work in tandem
1. The NameNode and Resource Manager processes are hosted on two different hosts, as they hold key meta information.
2. The DataNode and Node Manager processes are co-located on the same hosts.
3. A file is saved onto HDFS (DataNodes), and to access a file in a distributed way one can write a YARN application (MR2, Spark, Distributed Shell, Slider application) using the YARN client, and use the HDFS client to read the data.
4. The distributed application can fetch the file locations (meta info from the NameNode) and ask the Resource Manager (YARN) to provide containers on the hosts which hold the file blocks.
5. Do remember the short-circuit optimization provided by HDFS: if the distributed job gets a container on a host which holds the file block and tries to read it, the read will be local and not over the network.
6. The same file, which if read sequentially would have taken 4 seconds (at 100 MB/sec), can be read in 1 second, as the distributed process runs in parallel on different YARN containers (Node Managers), reading 4 x 100 MB in one second.
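A couple of standard YARN client commands that make this visible from the command line:
yarn node -list            # Node Managers and the containers running on each
yarn application -list     # applications currently holding containers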
08-30-2018
01:31 PM
check_mk is what most use. It is easy to configure and provides you with a nice UI with history saved. The check_mk agents consume very little CPU and RAM, hence avoiding any negative impact on other applications running on the host.
08-18-2018
06:08 PM
3 Kudos
Cluster: in Solr, a cluster is a set of Solr nodes operating in coordination with each other via ZooKeeper, and managed as a unit. A cluster may contain many collections. See also SolrCloud.
Collection: in Solr, one or more documents grouped together in a single logical index using a single configuration and schema. In SolrCloud a collection may be divided up into multiple logical shards, which may in turn be distributed across many nodes; in a single-node Solr installation, a collection may be a single core.
Commit: to make document changes permanent in the index. In the case of added documents, they become searchable after a commit.
Core: an individual Solr instance (represents a logical index). Multiple cores can run on a single node. See also SolrCloud.

Key take away
1. Solr works on a non master-slave architecture; every Solr node is master of its own. Solr nodes use ZooKeeper to learn about the state of the cluster.
2. A Solr node (JVM) can host multiple cores.
3. The core is the place where the Lucene (index) engine is running. Every core has its own Lucene engine.
4. A collection is divided into shards.
5. A shard is represented as a core (a part of the JVM) in a Solr node (JVM).
6. Every Solr node keeps sending heartbeats to ZooKeeper to inform it about its availability.
7. Usage of the local FS provides the most stable and best I/O for Solr.
8. A replication factor of 2 should be maintained in local mode to avoid any data loss.
9. Do remember that every replica will have a core attached to it, and also occupies space on disk.
10. If a collection is divided into 3 shards with a replication factor of 3, in total 9 cores will be hosted across the Solr nodes, and the data saved on the local FS will be 3x.
11. A Solr node doesn't publish data to Ambari Metrics by default. A Solr metrics process (a separate process from the Solr node) needs to run on every host where a Solr node is hosted. The metrics process fetches data from the Solr node and pushes it to Ambari Metrics.

Solr on HDFS
1. Solr nodes should be co-located with DataNodes for best performance.
2. Because the DataNodes are also used by Spark and HBase, this setup can easily result in an unstable SolrCloud.
3. Because of heavy CPU consumption on the DataNodes, Solr nodes can fail to maintain their heartbeat connection to ZooKeeper, resulting in the Solr node being removed from the Solr cluster.
4. Watch the Solr logs to make sure short-circuit writes are being used.
5. At the collection level you are compelled to use a replication factor of 2, else a restart of one node will result in the collection being unavailable.
6. A replication factor of 2 at the collection level and a replication factor of 3 at the HDFS level can significantly impact write performance.
7. Ensure the replication factor of the Solr HDFS directory is set to 1.

Lucene indexing on a single core: picture taken from https://www.quora.com/What-is-the-internal-architecture-of-Apache-solr
Reference: https://lucene.apache.org/solr/guide/6_6/solr-glossary.html#solrclouddef
08-10-2018
10:15 AM
Can you please check whether the jar that you have added has classes that clash with the classes of Hive. 1. I can see Jersey and Eclipse Jetty. 2. Can you please ensure the version of these jars (and any shared jar) is the same as what is present in your platform.
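One way to compare, as a sketch (the jar name is a placeholder; the lib path is the typical HDP client location and may differ in your install):
jar tf your-udf.jar | grep -Ei 'jersey|jetty'
ls /usr/hdp/current/hive-client/lib | grep -Ei 'jersey|jetty'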
08-09-2018
07:24 PM
Let's have a look at the write path without WAL. 1. Data written by a client is written into the heap (RAM) of the RegionServer and flushed to disk only after certain criteria are met, not immediately. 2. Data held in the heap of the RegionServer will be lost if HBase crashes before it is written to disk. 3. Disabling WAL has the same effect whether you use Phoenix or interact with HBase directly through the API or the hbase shell, as WAL is an HBase feature and not a Phoenix one. 4. Disabling WAL will improve write throughput, but it comes at the cost of potential data loss.
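For reference, from the Phoenix side this trade-off is usually made with the DISABLE_WAL table option (table and column names below are placeholders):
CREATE TABLE my_table (id BIGINT NOT NULL PRIMARY KEY, val VARCHAR) DISABLE_WAL=true;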
08-09-2018
07:19 PM
1. COUNT will result in a full table scan and hence the query is slow. 2. A WHERE on the primary key will be fast, as it does a lookup and not a scan. 3. A WHERE used on any column other than the primary key will result in an HBase full table scan. 4. Analyze the table once to speed up count queries, but it will not affect a WHERE on a non-primary key.
08-09-2018
07:17 PM
1. What is the concurrency set for LLAP? How many Tez AMs can you see in the Resource Manager UI? 2. Can you kindly look at the Tez UI and the Grafana LLAP dashboard to find out whether the first query is occupying all the LLAP executors? As your LLAP size is quite small, I believe even preemption is not helping to launch two jobs simultaneously.
08-09-2018
06:57 PM
1 Kudo
Sqoop is used for importing data from multiple sources onto HDFS. One of the most common use cases is a Hive import:
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers
If you want to specify a particular queue for the sqoop job, -Dmapred.job.queuename=queueName needs to be added immediately after the import keyword:
sqoop import -Dmapred.job.queuename=queueName --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers
This will launch the sqoop job in the specified queue, but the hive job will still be launched in the default queue. To launch the hive job in a specific queue, make a copy of tez-site.xml and set the queue name to the queue in which you want the hive job to be executed. Property of tez-site.xml:
<property> <name>tez.queue.name</name> <value>customQueueName</value> </property>
Then export HIVE_CONF_DIR=PATH OF DIR WHERE CUSTOM tez-site.xml is placed, and run the sqoop job with that export in effect. Do remember to add -Dmapred.job.queuename=queueName (immediately after import) to set the sqoop queue name, and the custom tez-site.xml for the hive queue name.
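Putting the two pieces together, a minimal sketch (the conf directory, queue name "etl" and connection details are placeholders; /etc/tez/conf is the typical HDP location of tez-site.xml):
mkdir -p /tmp/custom-hive-conf
cp /etc/tez/conf/tez-site.xml /tmp/custom-hive-conf/     # then edit tez.queue.name inside the copy
export HIVE_CONF_DIR=/tmp/custom-hive-conf
sqoop import -Dmapred.job.queuename=etl --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers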
08-03-2018
06:31 PM
Option 2 is not the perfect approach but it is foolproof, as it is never possible for the HDFS directory to be empty while Hive has data.
08-03-2018
06:12 PM
Various scenarios are possible: 1. The first job is consuming all vcores or memory, hence there are no resources left for the next job to spawn. 2. Preemption has been disabled for the queue. 3. The user limit is 100% for a given user, hence the job is allowed to occupy the complete queue.
08-03-2018
06:12 PM
Various scenarios are possible: 1. The first job is consuming all vcores or memory, hence there are no resources left for the next job to spawn. 2. Preemption has been disabled for the queue. 3. The user limit is 100% for a given user, hence the job is allowed to occupy the complete queue.
08-03-2018
06:09 PM
1. select * from database.table limit 1 will never perform a full table scan. You can verify this on the Resource Manager: no new job is spawned. 2. You can find the HDFS path corresponding to the table and run hdfs dfs -du on that path to know the size of the directory.
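A quick sketch of point 2 (the warehouse path below is only a typical default location; use the Location value reported for your table):
DESCRIBE FORMATTED database.table;          -- the Location field shows the HDFS path
hdfs dfs -du -s -h /apps/hive/warehouse/database.db/table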
07-31-2018
07:40 PM
1 Kudo
The general perception is that one needs to thoroughly know the code base for understanding and debugging an HDP service. Do appreciate the fact that HDP services, including the execution engines (MR, Tez, Spark), are all JVM-based processes. The JVM, along with the operating system, provides various knobs to know what the process is doing and how it is performing at run-time. The steps mentioned below can be applied to understanding and debugging any JVM process in general. Let's take the example of how HiveServer2 works, assuming one is not much acquainted with, or does not have a deep understanding of, the code base. We are assuming one knows what the service does, but is not aware of how it works internally.

1. Process resource usage
Gives a complete overview of the CPU and memory usage pattern of the process, providing a quick insight into the health of the process.

2. How to figure out which services the process interacts with
Add JMX parameters to the service, allowing you to visualize what is happening within the JVM at run-time using jconsole or jvisualvm. This ensures the JVM is broadcasting its metrics on port 7001 and can be connected to using jconsole or jvisualvm. No security is enabled; one can add SSL certificates too for authentication.

What can we infer from the kind of threads we see in jconsole?
- The number of threads has peaked at 461; currently only 170 are active.
- The abandoned-connection cleaner is cleaning up lost connections.
- kafka-kerberos-refresh-thread: as HS2 supports Kerberos, the TGT needs to be refreshed after the renewal period. It means we don't have to refresh the Kerberos ticket manually.
- HS2 is interacting with Kafka and Atlas, as can be seen from the threads above.
- CuratorFramework is the class used for talking to ZooKeeper, which means HS2 is interacting with ZooKeeper.
- HS2 is interacting with Solr and Hadoop services (RMI).
- HS2 is sending audits to Ranger, which means HS2 is interacting with Ranger.
- HS2 has HiveServer2-Handler threads which are used for reading from the Thrift socket (these are the threads used for responding to connections).

This view provides an overall picture of what is happening inside the JVM. GC: ParNew GC has been happening on the young generation and 10 minutes have been spent on minor GC; ConcurrentMarkSweep is used for the tenured generation and 1 minute has been spent there. If you look at the VM summary (total uptime of 3 days 3 hours) you can find when the VM was started; since then 10 minutes have been spent on minor GC and 1 minute on major GC, providing a good overview of how the process is tuned to handle GC and whether the heap space is sufficient.

Hence the above information shows HS2 interacting with: 1. Ranger 2. Solr 3. Kafka 4. Hadoop components (NameNode, DataNode) 5. Kerberos 6. As it is running Tez, it should also be interacting with the Resource Manager and Node Managers 7. Heap and GC performance.

Let's take a deeper dive into how exactly the JVM is performing over time. Use jvisualvm to find real-time info on threads and sampling of CPU and RAM. All of the above information can be derived from the command line too:
1. Find the pid of the process: ps -eaf | grep hiveserver2 (the process name) will fetch you the pid of the process.
2. Find memory usage and GC in real time: total heap = Eden space (S0C + S1C + EC) + tenured gen (OC); currently used heap = Eden space (S0U + S1U + EU) + tenured gen (OU). YGC = 15769 says how many times young GC has run so far; FGC = 83 says how many times full GC has run. If you see the counts increasing too frequently, it is time to optimize.
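Those columns come from jstat; a minimal sketch (the pid is a placeholder, 5000 is the sampling interval in milliseconds):
jstat -gc <pid> 5000       # raw sizes: S0C/S1C/EC/OC, S0U/S1U/EU/OU, YGC, FGC
jstat -gcutil <pid> 5000   # same data as percentages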
Use jmap to find the instances present in the heap, along with their count and memory footprint. To know the state of the running threads, use jstack. With top -p pid, RES is the memory the process consumes in RAM, which should roughly equal the heap consumed. Find the ports the process is listening on and which clients it is connected to. To find any errors being reported by the service to a client, scan the network packets: tcpdump -D to list the interfaces the machine has, then tcpdump -i <network interface> port 10000 -A (e.g. tcpdump -i eth0 port 10000 -A). Scan port 10000, on which HS2 is listening, and look at the packets exchanged to find which client is hitting the exceptions; any GSS exception, or any other exception reported by the server to a client, can be seen in the packets. To know the CPU and memory consumed by the process: top -p pid. To know the disk I/O reads/writes done by the process: iotop -p pid.

Some OS commands to know how the host running the process is doing:
1. iostat -c -x 3 : display the CPU and disk I/O
2. mpstat -A ALL : get utilization of individual CPUs
3. iostat -d -x 5 : disk utilization
4. ifstat -t -i <interface> : network utilization

Take away
1. Any exception you see in the process logs for clients can be tracked in the network packets, hence you don't need to enable debug logs (use tcpdump).
2. A process consumes exceptionally high memory under heavy GC, especially frequent full and minor GC.
3. In Hadoop, every service client has a retry mechanism (no client fails after one retry), hence search for retries in the log and try to optimize for that.
4. jconsole and jvisualvm reveal all the important info on threads, memory, CPU utilization and GC.
5. Keep a check on the CPU, network, disk and memory utilization of the process to get a holistic overview of the process.
6. In case of failure, take a heap dump and analyse it using jhat for deeper debugging. jhat will generally consume 6x the memory of the size of the dump (a 20 GB heap dump will need 120 GB of heap for jhat to run: jhat -J-mx20g hive_dump.hprof).
7. Always refer to the code to correlate the process behavior with its memory footprint.
8. The jmap output will help you understand which data structures are consuming most of the heap, and you can point to where in the code those data structures are used. jhat will help you get the tree of a data structure.
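As a sketch, the jmap/jstack/heap-dump commands referenced above look like this (the pid and output file names are placeholders):
jmap -histo:live <pid> | head -50                              # top classes by instance count and footprint
jstack <pid> > hs2_threads.txt                                 # thread states
jmap -dump:live,format=b,file=hive_dump.hprof <pid>            # heap dump for later analysis with jhat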
07-30-2018
05:15 PM
6 Kudos
Heap sizing and garbage collection tuning play a central role in deciding how healthily and efficiently the cluster resources will be utilized. One of the biggest challenges is to fine-tune the heap to make sure you are neither underutilizing nor overutilizing the resources. The following heap sizing has been made after an in-depth analysis of the health of the individual services. Do remember this is the baseline; you can add more heap to your resources depending on the kind of workload you execute.

Key take away
1. All the services present in HDP are JVM-based, and all need appropriate heap sizing and GC tuning.
2. HDFS and YARN are the base of all the services, hence make sure NN, DN, RM and NM are given sufficient heap.
3. Rule of thumb: heap sizing up to 20-30 GB doesn't require any special tuning; anything above that value needs fine tuning.
4. 99% of the time your cluster is not working efficiently because the services are suffering GC and appear RED in the UI.
5. Most of the services have short-lived connections and interactions, hence always provide enough space to the young generation, in the order of 5-10 GB (depending on load and concurrency).
6. HiveServer2, Spark Thrift Server and the LLAP daemons need special attention, as these services interact with each and every component in the cluster; slowness of any component will impact the connection establishment time of these services.
7. Look for "retries/retry" in your logs to know which services are slow, and fine-tune them.

[Per-service heap recommendations in the original article's tables: HDFS, YARN, MapReduce2, HBase, Oozie, ZooKeeper, Ambari Infra, Ambari Metrics (HBase), Atlas, Spark Thrift Server, Hive]
https://community.hortonworks.com/articles/209789/making-hiveserver2-instance-handle-more-than-500-c.html
Ambari and Ambari Views should run with a heap of at least 15 GB for them to be fast.
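As an illustration of the kind of settings involved (the sizes are example values only, set through the *-env templates in Ambari), a NameNode heap and GC configuration might look like:
export HADOOP_NAMENODE_OPTS="-Xms16g -Xmx16g -XX:NewSize=4g -XX:MaxNewSize=4g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 ${HADOOP_NAMENODE_OPTS}"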
07-30-2018
04:27 PM
8 Kudos
HiveServer2 is a Thrift server which is a thin service layer to interact with the HDP cluster in a seamless fashion. It supports both JDBC and ODBC drivers to provide a SQL layer to query the data. An incoming SQL query is converted to either a Tez or MR job, and the results are fetched and sent back to the client. No heavy lifting work is done inside HS2; it just acts as the place that holds the Tez/MR driver, scans metadata info and applies Ranger policies for authorization. Configurations: Tez parameters
Tez Application Master Waiting Period (in seconds) -- Specifies the amount of time in seconds that the Tez Application Master (AM) waits for a DAG to be submitted before shutting down. For example, to set the waiting period to 15 minutes (15 minutes x 60 seconds per minute = 900 seconds): tez.session.am.dag.submit.timeout.secs=900
Enable Tez Container Reuse -- This configuration parameter determines whether Tez will reuse the same container to run multiple queries. Enabling this parameter improves performance by avoiding the memory overhead of reallocating container resources for every query. tez.am.container.reuse.enabled=true
Tez Container Holding Period (in milliseconds) -- Specifies the amount of time in milliseconds that a Tez session will retain its containers. For example, to set the holding period to 15 minutes (15 minutes x 60 seconds per minute x 1000 milliseconds per second = 900000 milliseconds): tez.am.container.session.delay-allocation-millis=900000 A holding period of a few seconds is preferable when multiple sessions are sharing a queue. However, a short holding period negatively impacts query latency.

HiveServer2 parameters: heap configuration, GC tuning, database scan disabling and session initialization parameters (shown in the original article's screenshots).

Tuning OS parameters (on the nodes where HS2, the metastore and ZooKeeper are running):
net.core.somaxconn=16384
net.core.netdev_max_backlog=16384
net.ipv4.tcp_fin_timeout=10

Disconnecting idle connections to lower the memory footprint (values can be set in minutes and seconds). Defaults:
hive.server2.session.check.interval=6h
hive.server2.idle.operation.timeout=5d
hive.server2.idle.session.check.operation=true
hive.server2.idle.session.timeout=7d

Proactively closing connections:
hive.server2.session.check.interval=60000 (1 min)
hive.server2.idle.session.timeout=300000 (5 mins)
hive.server2.idle.session.check.operation=true

Connection pool (change it depending on how concurrent the connections are):
hive.server2.thrift.max.worker.threads=1000 (if these get exhausted, no incoming request will be served)

Things to watch out for:
1. Making more than 60 connections to HS2 from a single machine will result in failures, as ZooKeeper will rate-limit it. https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
2. Don't forget to disable the DB scan for every connection.
3. Watch out for memory leak bugs (HIVE-20192, HIVE-19860); make sure the version of HS2 you are using is patched.
4. Watch out for the connection limit on your backend RDBMS. https://community.hortonworks.com/articles/208966/optimized-mysql-db-config-for-ambari-hive-to-handl.html
5. Depending on your usage you need to fine-tune heap and GC, so keep an eye on the full GC and minor GC frequency.
6. Usage of "add jar" leads to a class-loader memory leak in some versions of HS2, so please keep an eye on it.
7. Do remember that in Hadoop, for any service, the client always retries; hence look for retry logs in HS2 and optimize the service to handle connections seamlessly.
8. HS2 has no upper threshold in terms of the number of connections it can accept; it is limited by the heap and by the responsiveness of the other services it talks to.
9. Keep an eye on CPU consumption on the machine where HS2 is hosted to make sure the process has enough CPU to work with high concurrency.
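The OS values above are applied with sysctl, for example (run as root; persist them in /etc/sysctl.conf so they survive a reboot):
sysctl -w net.core.somaxconn=16384
sysctl -w net.core.netdev_max_backlog=16384
sysctl -w net.ipv4.tcp_fin_timeout=10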
07-27-2018
07:12 PM
The article provides an optimized MySQL config (usage of an enterprise RDBMS is encouraged) that can be used in case MySQL is the backend DB for Ambari, Hive and Ranger. The configuration is good enough to handle 1000 concurrent users on HS2 and Ambari. The config file my.cnf is generally located at /etc/my.cnf.

my.cnf
******************************************************************************************************************************
# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.6/en/server-configuration-defaults.html

[mysqld]
#
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M
#
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
#
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
max_allowed_packet = 32M
max_connections = 2000
open_files_limit = 10000
tmp_table_size = 64M
max_heap_table_size = 64M
query_cache_type = ON
query_cache_size = 128M
table_open_cache_instances = 16
back_log = 1000
default_storage_engine = InnoDB
innodb-buffer-pool-size = 10G
innodb_log_file_size = 1024M
innodb_log_buffer_size = 16M
innodb_flush_log_at_trx_commit = 0
innodb_flush_method = O_DIRECT
innodb_buffer_pool_instances = 4
innodb_thread_concurrency = 10
innodb_checksum_algorithm=crc32
# MyISAM, in case there is one
key-buffer-size = 64M
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Recommended in standard MySQL setup
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
*****************************************************************************************************

Key take aways:
1. innodb-buffer-pool-size = 10G: this is one of the most important parameters, as it says how much data can be cached in RAM.
2. mysqldump -u root --all-databases -p > dump.sql will dump all the databases to a file (which can be used for recovery) and, in the process, will fetch all the data from disk and cache it in RAM, making all queries lightning fast.
3. max_connections = 2000 concurrent connections, as the database is connected to by all HS2 instances. This can become a limiting factor and may result in connection timeouts for clients connecting to HS2.
4. Do keep in mind that the number of connections MySQL can handle directly impacts the number of users the HS2 server can handle concurrently.
5. For details of all MySQL config parameters kindly follow this excellent file: http://www.speedemy.com/files/mysql/my.cnf
6. https://community.hortonworks.com/articles/80635/optimize-ambari-performance-for-large-clusters.html
07-27-2018
06:54 PM
1 Kudo
Ambari schedules a connection check to HS2 every 2 minutes. The beeline client is executed to connect to the HS2 JDBC string with a timeout on the client side. If beeline is not able to connect within the timeout period, the beeline process is killed. This shows your HS2 is not configured properly, as it is taking more than 60 seconds to connect, which is quite unusual. HS2 should connect within 4-30 seconds max for it to be usable by external tools like Power BI, Alation and Ambari Hive Views. Please follow these blogs in detail to know more about how to debug the issue: https://community.hortonworks.com/content/kbentry/202479/hiveserver2-connection-establishment-phase-in-deta.html https://community.hortonworks.com/content/kbentry/202485/hiveserver2-configurations-deep-dive.html Some steps to find where the bottleneck is: 1. Enable debug mode for beeline. 2. Execute beeline with the HS2 JDBC string; it will provide a detailed view of the time taken to connect to AD, ZooKeeper, HS2 and MySQL. 3. Tune the HS2 parameters (start Tez sessions at initialization = true, disable the database scan for each connection). 4. Bump up the heap of HS2. This blog has all that you need: https://community.hortonworks.com/content/kbentry/209789/making-hiveserver2-instance-handle-more-than-500-c.html
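For steps 1 and 2, a minimal sketch of a manual probe (the JDBC URL below is a placeholder for your own HS2 connection string):
beeline --verbose=true -u "jdbc:hive2://hs2-host.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "select 1"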
07-15-2018
04:58 PM
7 Kudos
HiveServer2 is a Thrift server which is a thin service layer to interact with the HDP cluster in a seamless fashion. It supports both JDBC and ODBC drivers to provide a SQL layer to query the data. An incoming SQL query is converted to either a Tez or MR job, and the results are fetched and sent back to the client. No heavy lifting work is done inside HS2; it just acts as the place that holds the Tez/MR driver, scans metadata info and applies Ranger policies for authorization.

HiveServer2 maintains two types of pools
1. Connection pool: any incoming connection is handled by a HiveServer2-Handler thread and is kept in this pool, which is unlimited in nature but restricted by the number of threads available to service an incoming request. An incoming request will not be accepted if there is no free HiveServer2-Handler thread to service it. The total number of threads that can be spawned within HS2 is controlled by the parameter hive.server2.thrift.max.worker.threads.
2. Session pool (per queue): this is the number of concurrent sessions that can be active. Whenever a connection in the connection pool executes a SQL query, an empty slot in the session queue is found and the SQL statement is executed. The queue is maintained for each queue defined in "Default query queue". In this example only the default queue is defined, with sessions per queue = 3. More than 3 concurrent queries (not connections) will have to wait until one of the slots in the session pool is empty. There can be N connections alive in HS2, but at any point in time it can only execute 3 queries; the rest of the queries will have to wait.

HiveServer2 connection overview
1. HS2 listens on port 10000 and accepts connections from clients on this port.
2. HS2 has connections opened to Kafka to write metric and lineage information, which is consumed by the Ambari Metrics service and Atlas.
3. HS2 also connects directly to DataNodes from time to time to service requests like "select * from table limit N".
4. HS2 also connects to the NodeManagers where the Tez AMs are spawned to service the SQL query.
5. In the process it is talking to the NameNode and Resource Manager, and a switch during an HA failover will impact the connections.
6. HiveServer2 overall is talking to Atlas, Solr, Kafka, NameNode, DataNode, ResourceManager, NodeManager and MySQL.

For HiveServer2 to be more interactive and faster, the following parameters are set:
1. Start Tez sessions at initialization = true: at the time of bootstrapping (start, restart) HS2 allocates a Tez AM for every session. (Default query queue = default, sessions per queue = 3, hence 3 Tez AMs will be spawned. If there were 2 queues, default and datascience, with sessions per queue = 3, then 6 Tez AMs would be spawned.)
2. Every Tez AM can also be forced to spawn a fixed number of containers (pre-warm) at the time of HS2 bootstrapping. Hold containers to reduce latency = true with number of containers held = 3 will ensure every Tez AM holds 2 containers to execute a SQL query.
3. With default query queue = default, sessions per queue = 3, hold containers to reduce latency = true and number of containers held = 3, HS2 will spawn 3 Tez AMs, one per session, and for every Tez AM 2 containers will be spawned; hence overall 9 containers (3 Tez AMs + 6 containers) can be seen running in the Resource Manager UI.
4. For Tez to be more efficient and save bootstrapping time, the following parameters need to be configured in the Tez configs. These parameters prevent the AM from dying immediately after job execution.
The AM waits between the min and max timeout to service another query, else it kills itself, freeing the YARN resources.
tez.am.container.reuse.enabled=true
tez.am.container.idle.release-timeout.min.millis=10000
tez.am.container.idle.release-timeout.max.millis=20000

Every connection has a memory footprint in HS2, as:
1. Every connection scans the metastore to know about the tables.
2. Every connection reserves memory to hold metrics and the Tez driver.
3. Every connection reserves memory to hold the output of the Tez execution.
Hence if a lot of connections are alive, HS2 might become slower due to lack of heap space, so the heap needs to be tuned according to the number of connections. 40 connections will need somewhere around 16 GB of HS2 heap; anything more than this needs fine tuning of GC and horizontal scaling.

The client connections can be disconnected based on their idle state using the following parameters:
hive.server2.session.check.interval=6h
hive.server2.idle.operation.timeout=5d
hive.server2.idle.session.check.operation=true
hive.server2.idle.session.timeout=7d

HiveServer2 has an embedded metastore which interacts with the RDBMS to store the table schema info. The database is available to any other service through the Hive Metastore service; the Hive Metastore service is not used by HiveServer2 directly.

Problems and solutions
1. Clients have frequent timeouts: HS2 can never deny a connection, hence the timeout is a parameter set on the client and not on HS2. Most tools have 30 seconds as the timeout; increase it to 90-150 seconds depending on your cluster usage pattern.
2. HS2 becomes unresponsive: this can happen if HS2 is undergoing frequent full GC or has run short of heap. It is time to bump up the heap and scale horizontally.
3. HS2 has a large number of connections: HS2 accepts connections on port 10000 and also connects to the RDBMS, DataNodes (for direct queries) and NodeManagers (to connect to the AMs), hence always try to analyse whether it is incoming or outgoing connections.
4. As HS2 is an interactive service, it is always good to give HS2 a good amount of young generation space.
5. In case SAS is interacting with HS2, a 20 GB heap where 6-8 GB is given to the young generation is good practice. Use ParNewGC for the young generation and ConcurrentMarkSweepGC for the tenured generation.

To know in detail about the connection establishment phase of HiveServer2: https://community.hortonworks.com/articles/202479/hiveserver2-connection-establishment-phase-in-deta.html
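A sketch of what the heap layout from point 5 could look like as JVM options (the sizes are the example values from that point, not universal recommendations; where these flags go depends on your setup, typically the hive-env template or the HiveServer2 heap settings in Ambari):
-Xms20g -Xmx20g -Xmn6g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC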