Member since: 09-18-2015
Posts: 3274
Kudos Received: 1159
Solutions: 426
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2627 | 11-01-2016 05:43 PM |
| | 8767 | 11-01-2016 05:36 PM |
| | 4943 | 07-01-2016 03:20 PM |
| | 8274 | 05-25-2016 11:36 AM |
| | 4439 | 05-24-2016 05:27 PM |
01-24-2016
02:50 PM
@Michel Sumbul Very good question. Please see this to start with (it is about 3 years old now): http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2 Also see http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
01-24-2016
01:36 PM
Labels:
- Apache Hadoop
01-24-2016
01:06 PM
@Pavel Hladík A couple of things; take a look at the support matrix: 1) HDP 2.1 is deprecated, which means you may not get support for it. 2) You cannot go beyond Ambari 2.1.2 without upgrading HDP. 3) It looks like you are having an issue with the History Server (introduced in HDP 2.0). There is a high probability that you will end up hitting more issues because of this version mismatch.
01-23-2016
11:30 PM
3 Kudos
Node labels enable you to partition a cluster into sub-clusters so that jobs can be run on nodes with specific characteristics. For example, you can use node labels to run memory-intensive jobs only on nodes with a larger amount of RAM. Node labels can be assigned to cluster nodes and specified as exclusive or shareable. You can then associate node labels with Capacity Scheduler queues. Each node can have only one node label.
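As a sketch of that last point, the queue-to-label association is configured in capacity-scheduler.xml. The fragment below is only illustrative and not taken from the original post: the queue names match the demo, and the capacity values are example numbers (a label's shares must sum to 100 across queues at the same level).

<!-- Illustrative capacity-scheduler.xml fragment: let the spark and default queues use both demo labels -->
<property>
  <name>yarn.scheduler.capacity.root.spark.accessible-node-labels</name>
  <value>node1,node2</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.accessible-node-labels</name>
  <value>node1,node2</value>
</property>
<!-- Example per-queue capacity for label node1 (shares of a label sum to 100 across sibling queues) -->
<property>
  <name>yarn.scheduler.capacity.root.spark.accessible-node-labels.node1.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.accessible-node-labels.node1.capacity</name>
  <value>50</value>
</property>
<!-- ...and similarly for node2 -->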
Demo:
Use case: 2 node labels (node1 & node2) plus the default & spark queues; submit jobs to specific node labels via those queues.
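Before the labels can be added and assigned (the next two steps), node labels must be enabled on the ResourceManager. A minimal yarn-site.xml sketch, assuming an HDFS-backed label store; the path is illustrative, not from the original post:

<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- illustrative HDFS location for the node-label store -->
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://namenode:8020/yarn/node-labels</value>
</property>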
Node labels added: yarn rmadmin -addToClusterNodeLabels "node1(exclusive=true),node2(exclusive=false)"
Labels assigned to nodes: yarn rmadmin -replaceLabelsOnNode "phdns02.cloud.hortonworks.com=node2 phdns01.cloud.hortonworks.com=node1"
Job Submission:
Job sent to node1 only, assigned to queue spark:
hadoop jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command "sleep 100" -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -queue spark -node_label_expression node1
Job sent to node2 only, assigned to queue spark:
hadoop jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command "sleep 100" -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -queue spark -node_label_expression node2
Job sent to node1 only, assigned to queue default:
hadoop jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command "sleep 100" -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -queue default -node_label_expression node1
More details - Doc link
SlideShare
01-23-2016
09:22 PM
@Michel Meulpolder Perfect! 🙂 Do you have the link handy? Would love to read it and upvote.
01-23-2016
08:29 PM
1 Kudo
@Ram D First of all, nice work on node labels. We should connect; my contact information is in my profile. Re: cache, there is a very good article on the same topic: http://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/
Configuration for resource localization: administrators can control various aspects of resource localization by setting or changing the following configuration parameters in yarn-site.xml when starting a NodeManager.
- yarn.nodemanager.local-dirs: A comma-separated list of local directories used for copying files during localization. The idea behind allowing multiple directories is to spread localization across multiple disks, which helps both fail-over (one or a few bad disks do not affect all containers) and load balancing (no single disk is bottlenecked with writes). Individual directories should therefore be configured on different local disks where possible.
- yarn.nodemanager.local-cache.max-files-per-directory: Limits the maximum number of files localized in each of the localization directories (separately for PUBLIC / PRIVATE / APPLICATION resources). The default is 8192, and it should not typically be set to a large value (configure a value sufficiently below the per-directory maximum file limit of the underlying file system, e.g. ext3).
- yarn.nodemanager.localizer.address: The network address on which the ResourceLocalizationService listens for the various localizers.
- yarn.nodemanager.localizer.client.thread-count: Limits the number of RPC threads in the ResourceLocalizationService used for handling localization requests from localizers. Defaults to 5, which means that at any point in time only 5 localizers are processed while the others wait in the RPC queues.
- yarn.nodemanager.localizer.fetch.thread-count: Configures the number of threads used for localizing PUBLIC resources. Recall that localization of PUBLIC resources happens inside the NodeManager address space, so this property limits how many threads the NodeManager spawns for that work. Defaults to 4.
- yarn.nodemanager.delete.thread-count: Controls the number of threads used by the DeletionService for deleting files. The DeletionService is used all over the NodeManager to delete log files as well as local cache files. Defaults to 4.
- yarn.nodemanager.localizer.cache.target-size-mb: The maximum disk space to be used for localized resources. (At present there is no individual limit for the PRIVATE / APPLICATION / PUBLIC caches; see YARN-882.) Once the total size of the cache exceeds this value, the DeletionService tries to remove files that are not used by any running container. The limit applies to all disks as a total, not on a per-disk basis.
- yarn.nodemanager.localizer.cache.cleanup.interval-ms: The interval after which the resource localization service tries to delete unused resources if the total cache size exceeds the configured maximum. Unused resources are those not referenced by any running container: every time a container requests a resource, the container is added to that resource's reference list and stays there until it finishes, which avoids accidental deletion of resources still in use. When the container finishes, it is removed from the reference list, so a reference count of zero makes the resource an ideal candidate for deletion. Resources are then deleted on an LRU basis until the current cache size drops below the target size.
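To make the above concrete, here is a minimal yarn-site.xml sketch using a few of the properties listed; the directories and values are purely illustrative examples, not recommendations:

<property>
  <!-- one localization directory per local disk (illustrative paths) -->
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/0/yarn/local,/grid/1/yarn/local</value>
</property>
<property>
  <!-- example: cap the total local cache (across all disks) at ~10 GB -->
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>
</property>
<property>
  <!-- example: check every 10 minutes whether the cache exceeds the target size -->
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>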
01-23-2016
07:51 PM
@Ancil McBarnett Impressive response!!!
01-23-2016
06:33 PM
@Alessio Ubaldi Just to test: can you run the query without vectorization? set hive.vectorized.execution.enabled=false
01-23-2016
05:10 PM
1 Kudo
@Robin Dong I am glad it helped. You can accept the best answer to close the thread, or we can close it assuming both answers are good ("best practice").