Member since: 09-18-2015
Posts: 3274
Kudos Received: 1159
Solutions: 426
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2627 | 11-01-2016 05:43 PM |
| | 8767 | 11-01-2016 05:36 PM |
| | 4943 | 07-01-2016 03:20 PM |
| | 8274 | 05-25-2016 11:36 AM |
| | 4439 | 05-24-2016 05:27 PM |
01-24-2016
02:50 PM
@Michel Sumbul Very good question. Please see this to start with (it is about 3 years old now): http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2 Also see http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
01-24-2016
01:36 PM
Labels:
- Apache Hadoop
01-24-2016
01:06 PM
@Pavel Hladík A couple of things; take a look at the support matrix: 1) HDP 2.1 is deprecated, which means you may not get support for it. 2) You cannot go beyond Ambari 2.1.2 without upgrading HDP. 3) It looks like you are having an issue with the History Server (introduced in HDP 2.0). There is a high probability that you will end up hitting more issues because of this version mismatch.
01-23-2016
11:30 PM
3 Kudos
Node labels enable you to partition a cluster into sub-clusters so that jobs can be run on nodes with specific characteristics. For example, you can use node labels to run memory-intensive jobs only on nodes with a larger amount of RAM. Node labels can be assigned to cluster nodes and specified as exclusive or shareable. You can then associate node labels with Capacity Scheduler queues. Each node can have only one node label.
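As a sketch of that last point, the queue-to-label association is configured in capacity-scheduler.xml. The fragment below is only illustrative and not taken from the original post: the queue names match the demo, and the capacity values are example numbers (a label's shares must sum to 100 across queues at the same level).

<!-- Illustrative capacity-scheduler.xml fragment: let the spark and default queues use both demo labels -->
<property>
  <name>yarn.scheduler.capacity.root.spark.accessible-node-labels</name>
  <value>node1,node2</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.accessible-node-labels</name>
  <value>node1,node2</value>
</property>
<!-- Example per-queue capacity for label node1 (shares of a label sum to 100 across sibling queues) -->
<property>
  <name>yarn.scheduler.capacity.root.spark.accessible-node-labels.node1.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.accessible-node-labels.node1.capacity</name>
  <value>50</value>
</property>
<!-- ...and similarly for node2 -->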
Demo:
Use case: 2 node labels (node1 & node2) plus the default & spark queues; submit jobs to specific node labels via those queues.
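Before the labels can be added and assigned (the next two steps), node labels must be enabled on the ResourceManager. A minimal yarn-site.xml sketch, assuming an HDFS-backed label store; the path is illustrative, not from the original post:

<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- illustrative HDFS location for the node-label store -->
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://namenode:8020/yarn/node-labels</value>
</property>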
Node labels added: yarn rmadmin -addToClusterNodeLabels "node1(exclusive=true),node2(exclusive=false)"
Labels assigned to nodes: yarn rmadmin -replaceLabelsOnNode "phdns02.cloud.hortonworks.com=node2 phdns01.cloud.hortonworks.com=node1"
Job Submission:
Job sent to node1 only, assigned to queue spark:
hadoop jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command "sleep 100" -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -queue spark -node_label_expression node1
Job sent to node2 only, assigned to queue spark:
hadoop jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command "sleep 100" -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -queue spark -node_label_expression node2
Job sent to node1 only, assigned to queue default:
hadoop jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -shell_command "sleep 100" -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar -queue default -node_label_expression node1
More details - Doc link
SlideShare
01-23-2016
09:22 PM
@Michel Meulpolder Perfect! 🙂 Do you have the link handy? Would love to read it and upvote.
01-23-2016
08:29 PM
1 Kudo
@Ram D First of all, nice work on node labels. We should connect; my contact information is in my profile. Re: cache, there is a very good article on the same topic: http://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/
Configuration for resource localization: administrators can control various aspects of resource localization by setting or changing the following configuration parameters in yarn-site.xml when starting a NodeManager.
- yarn.nodemanager.local-dirs: A comma-separated list of local directories used for copying files during localization. The idea behind allowing multiple directories is to spread localization across multiple disks, which helps both fail-over (one or a few bad disks do not affect all containers) and load balancing (no single disk is bottlenecked with writes). Individual directories should therefore be configured on different local disks where possible.
- yarn.nodemanager.local-cache.max-files-per-directory: Limits the maximum number of files localized in each of the localization directories (separately for PUBLIC / PRIVATE / APPLICATION resources). The default is 8192, and it should not typically be set to a large value (configure a value sufficiently below the per-directory maximum file limit of the underlying file system, e.g. ext3).
- yarn.nodemanager.localizer.address: The network address on which the ResourceLocalizationService listens for the various localizers.
- yarn.nodemanager.localizer.client.thread-count: Limits the number of RPC threads in the ResourceLocalizationService used for handling localization requests from localizers. Defaults to 5, which means that at any point in time only 5 localizers are processed while the others wait in the RPC queues.
- yarn.nodemanager.localizer.fetch.thread-count: Configures the number of threads used for localizing PUBLIC resources. Recall that localization of PUBLIC resources happens inside the NodeManager address space, so this property limits how many threads the NodeManager spawns for that work. Defaults to 4.
- yarn.nodemanager.delete.thread-count: Controls the number of threads used by the DeletionService for deleting files. The DeletionService is used all over the NodeManager to delete log files as well as local cache files. Defaults to 4.
- yarn.nodemanager.localizer.cache.target-size-mb: The maximum disk space to be used for localized resources. (At present there is no individual limit for the PRIVATE / APPLICATION / PUBLIC caches; see YARN-882.) Once the total size of the cache exceeds this value, the DeletionService tries to remove files that are not used by any running container. The limit applies to all disks as a total, not on a per-disk basis.
- yarn.nodemanager.localizer.cache.cleanup.interval-ms: The interval after which the resource localization service tries to delete unused resources if the total cache size exceeds the configured maximum. Unused resources are those not referenced by any running container: every time a container requests a resource, the container is added to that resource's reference list and stays there until it finishes, which avoids accidental deletion of resources still in use. When the container finishes, it is removed from the reference list, so a reference count of zero makes the resource an ideal candidate for deletion. Resources are then deleted on an LRU basis until the current cache size drops below the target size.
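To make the above concrete, here is a minimal yarn-site.xml sketch using a few of the properties listed; the directories and values are purely illustrative examples, not recommendations:

<property>
  <!-- one localization directory per local disk (illustrative paths) -->
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/0/yarn/local,/grid/1/yarn/local</value>
</property>
<property>
  <!-- example: cap the total local cache (across all disks) at ~10 GB -->
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>
</property>
<property>
  <!-- example: check every 10 minutes whether the cache exceeds the target size -->
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>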
01-23-2016
07:51 PM
@Ancil McBarnett Impressive response!!!
01-23-2016
06:33 PM
@Alessio Ubaldi Just to test: can you run the query without vectorization? set hive.vectorized.execution.enabled=false
01-23-2016
05:10 PM
1 Kudo
@Robin Dong I am glad it helped. You can accept the best answer to close the thread, or we can close it assuming both answers are good ("best practice").