Member since: 09-28-2015
Posts: 48
Kudos Received: 117
Solutions: 8
07-28-2016
04:07 AM
1 Kudo
Labels: Hortonworks Data Platform (HDP)
07-07-2016
11:25 PM
7 Kudos
Overview
I recently encountered a question where someone asked how to do preemption across YARN queues when a Spark job grows beyond its queue's minimum guarantee. They had seen this before with the Fair Scheduler and MapReduce, but wanted the same experience with Spark and the Capacity Scheduler. This how-to article describes how to set that up.
Goal: Run large Spark jobs in two separate capacity queues to produce an equal share of resources for both jobs.
Hardware: 5 nodes of AWS EC2 r3.xlarge
Cluster Configuration: HDP 2.4.2, Spark 1.6.1, 5 Node Managers, 20GB (20480MB) YARN containers
yarn.scheduler.maximum-allocation-mb=20480
yarn.scheduler.minimum-allocation-mb=2560
High Level Setup:
1. Add preemption properties as per documentation
2. Create two YARN queues with fair ordering
- Child queue "test1" with a min capacity of 50% and a max of 100%
- Child queue "test2" with a min capacity of 50% and a max of 100%
- Root queue with a fair ordering policy
3. Run Spark jobs
- Run a Spark job on test1 with max-size containers for as many Spark executors as possible
- Run a Spark job on test2 with max-size containers using dynamic resource allocation
1) Add YARN Preemption Properties
The following parameters should be applied to the yarn-site.xml file. This can be done manually or through Ambari. These are the default preemption properties as provided in the Hortonworks documentation:
yarn.resourcemanager.scheduler.monitor.enable=true
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
Option 1: Manual
Back up /etc/hadoop/conf/yarn-site.xml, then update /etc/hadoop/conf/yarn-site.xml with the parameters above. Note: you must put these settings in XML format. Restart YARN.
Option 2: Ambari
The parameters can be added to yarn-site.xml through Ambari -> YARN -> Configs. You can turn preemption on in the Settings tab; this sets yarn.resourcemanager.scheduler.monitor.enable=true. The remaining properties need to be added in the Advanced config tab under "Custom yarn-site". Click "Add Property" and add the following:
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
Restart YARN.
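For the manual route (Option 1), each of those properties goes into yarn-site.xml as a standard Hadoop <property> element. A minimal sketch of the first two entries, assuming you are editing the file by hand (the remaining properties follow the same pattern):
<!-- preemption settings from the list above, in yarn-site.xml format -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>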
2) Create Two YARN Queues with Fair Ordering
The following parameters are then added to the capacity-scheduler.xml file. You can do this manually or through the Ambari View (YARN Queue Manager).
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.queues=test1,test2
yarn.scheduler.capacity.root.ordering-policy=fair
yarn.scheduler.capacity.root.ordering-policy.fair.enable-size-based-weight=true
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.test1.acl_submit_applications=*
yarn.scheduler.capacity.root.test1.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test1.maximum-capacity=100
yarn.scheduler.capacity.root.test1.user-limit-factor=1
yarn.scheduler.capacity.root.test1.state=RUNNING
yarn.scheduler.capacity.root.test1.capacity=50
yarn.scheduler.capacity.root.test1.ordering-policy=fifo
yarn.scheduler.capacity.root.test2.acl_administer_queue=*
yarn.scheduler.capacity.root.test2.acl_submit_applications=*
yarn.scheduler.capacity.root.test2.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test2.maximum-capacity=100
yarn.scheduler.capacity.root.test2.user-limit-factor=1
yarn.scheduler.capacity.root.test2.state=RUNNING
yarn.scheduler.capacity.root.test2.capacity=50
yarn.scheduler.capacity.root.test2.ordering-policy=fifo
Option 1: Manual
Back up the original file: /etc/hadoop/conf/capacity-scheduler.xml
Update /etc/hadoop/conf/capacity-scheduler.xml with the settings above.
Run the following command from the /etc/hadoop/conf directory to refresh the queues:
yarn rmadmin -refreshQueues
Option 2: Ambari View
Using the YARN Queue Manager view in Ambari, you can apply the same settings to capacity-scheduler.xml through the GUI. Set up the YARN queues so that the test1 and test2 queues look exactly the same and roll up to the root queue; both should have the configuration listed above. The root queue should have an Ordering Policy of Fair with Enable Size Based Weight Ordering checked.
Restart YARN.
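As a quick check after the refresh or restart, the YARN CLI in HDP 2.4 (Hadoop 2.7) includes a queue status subcommand; a minimal sketch, assuming it is available on your cluster (this check is not part of the original steps):
# print the configured capacity and state of each queue
yarn queue -status test1
yarn queue -status test2
# expect roughly: Capacity : 50.0%, Maximum Capacity : 100.0%, State : RUNNING for both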
3) Run Spark Jobs
sudo su - hdfs
cd /usr/hdp/current/spark-client
Run the following Spark job and make sure it runs over-capacity on the test1 queue. Notice how we specify 5 executors with large containers:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test1 --num-executors 5 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000
Confirm in the Resource Manager UI (http://resource-manager-node:8088/cluster) that it is running over-capacity in the test1 queue.
Run a second Spark job on the test2 queue. Notice that this job does not specify the number of executors, because it relies on Spark dynamic resource allocation:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test2 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000
Watch the queues in the Resource Manager UI: within a few seconds, YARN preempts containers and the second Spark job takes some of them from the first job, giving you a fair balance of resources across the root queue.
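One assumption worth calling out: dynamic resource allocation only takes effect if it is enabled for the job and the NodeManagers run Spark's external shuffle service. If that is not already configured cluster-wide in spark-defaults.conf, a sketch of passing it on the command line could look like this (the min/max executor bounds are illustrative, not values from the original test):
# hypothetical variant of the second submit with the dynamic allocation flags made explicit
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test2 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000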
07-07-2016
03:40 AM
1 Kudo
There's a JIRA and PR already out there for this. You might want to track it here. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-987?jql=project%20%3D%20ZEPPELIN and Pull Request: https://github.com/apache/zeppelin/pull/993
05-12-2016
12:06 PM
@drussell Thanks. I saw this list too, but I'm not sure it's least access. I can see places where you can create conditions in the policy so that you can only work in a specific VPC, such as:
"Condition": {
  "StringEquals": {
    "ec2:vpc": "arn:aws:ec2:us-east-1:############:vpc/vpc-XXXXXX"
  }
}
Or you can get really specific on resources. Instead of using "Resource": "*" in the policy, you can scope it down to the instances in a certain region, for example:
"Resource": "arn:aws:ec2:us-east-1::instance/*"
The list of roles is good, but the resources and conditions on those roles are just as important, especially as we discuss this with the InfoSec team and have to justify the why.
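To make that concrete, here is a hedged sketch of how a full statement might combine an action list, a scoped resource, and the VPC condition. The account ID, VPC ID, and the two actions are placeholders for illustration, not a vetted least-privilege set for Cloudbreak:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "IllustrativeOnlyInstanceOpsLimitedToOneVpc",
      "Effect": "Allow",
      "Action": ["ec2:RunInstances", "ec2:TerminateInstances"],
      "Resource": "arn:aws:ec2:us-east-1:############:instance/*",
      "Condition": {
        "StringEquals": {
          "ec2:vpc": "arn:aws:ec2:us-east-1:############:vpc/vpc-XXXXXX"
        }
      }
    }
  ]
}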
05-12-2016
04:15 AM
2 Kudos
Many financial and highly regulated institutions use least-privilege access models when rolling out software features to end users. With Cloudbreak, what is the least-privilege access model you would have to supply in AWS to make this work? Additionally, how can you ensure that the IAM accounts can manage their own VPC deployment while guaranteeing that they can't change the configuration of other VPCs under the same account?
Labels: Hortonworks Cloudbreak
04-28-2016
12:54 PM
2 Kudos
Alex – This makes sense to me. If you're tailing files or listening to syslog, NiFi has easy-to-use processors that can forward this information to a downstream search tool (SOLR, ES, etc.) or even persist it in long-term storage (HDFS). You can encrypt and compress the data as you capture it, send it over a secure wire, and do the simple event processing you need in order to route the information to the appropriate endpoint. There are also processors such as ScanContent or RouteOnContent that let you route messages based on patterns (think regex and whitelists) found in the message payload (i.e. route errors here, info there) or create priorities for those messages.

The other place where NiFi helps tremendously is data conversion, for example ConvertAvroToJSON, ConvertCSVToAvro, or AttributesToJSON. These help you get messages into the proper file streams to be indexed by your search tool.

The one place I would look at closely is the amount of log parsing you need to do. For unique formats, you may need to create a custom processor in NiFi to help extract log attributes. Processors such as EvaluateXPath or EvaluateXQuery let you pull attribute-value pairs out of XML (and EvaluateJsonPath does the same for JSON), which is extremely helpful and may be all you need. Otherwise, it's really easy to get started and play around with your use case to see if there's a fit.
04-19-2016
04:09 AM
5 Kudos
This was tested on Yosemite 10.10.5.

1) Install NiFi on your MacOS machine: http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_HDF_InstallSetup/content/ch_HDF_installing.html

2) Set up your machine to forward syslog messages to port 1514.

Back up your current syslog configuration:
sudo mv /etc/syslog.conf /etc/syslog.conf.bkp

Edit your syslog.conf file to send all messages to UDP localhost port 1514:
sudo vi /etc/syslog.conf

Add the following entry to /etc/syslog.conf:
*.* @127.0.0.1:1514

Restart syslogd:
sudo launchctl unload /System/Library/LaunchDaemons/com.apple.syslogd.plist
sudo launchctl load /System/Library/LaunchDaemons/com.apple.syslogd.plist

Confirm syslogd is running; the result should display a process id (PID) for /usr/sbin/syslogd:
ps -ef | grep syslogd

3) Test with NiFi. Add a ListenSyslog processor to the canvas with the following settings:
Protocol: UDP
Port: 1514
Local Network Interface: lo0

Connect the ListenSyslog processor to an output port and set the relationship to "success". Start the ListenSyslog processor. You should see data get queued up, and the Out statistics should show bytes flowing through the processor. Sometimes you need to help it along and send some messages to the syslogd server; if so, try typing this on the command line and then verify the data flowing in NiFi:
syslog -s test message
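If one message is not enough to see movement, a quick loop of test messages works too; a minimal sketch using the same syslog command (the message text is arbitrary):
# send a burst of test messages so the ListenSyslog "Out" stats visibly increment
for i in {1..10}; do syslog -s "NiFi syslog test message $i"; done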
04-13-2016
02:32 AM
4 Kudos
Hi Babu - The more common approach is to write out a new file. HDFS is essentially an append-only system, so creating a new file that's a derivative of the original is a very common practice. You can write a MapReduce program to output a file or use a Hive query to write query results to a new file. For example: INSERT OVERWRITE DIRECTORY '/user/me/output' SELECT UPPER(myColumn) FROM myTable. This creates new file(s) with the modification applied, which acts like an update. In this case, we're upper-casing the myColumn column of the myTable table.
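Formatted as a runnable statement, a minimal sketch of that same example (the directory, table, and column names are just the placeholders from above):
-- write the "updated" copy of the data to a new HDFS directory
-- instead of modifying the original file in place
INSERT OVERWRITE DIRECTORY '/user/me/output'
SELECT UPPER(myColumn)
FROM myTable;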
04-11-2016
08:41 PM
2 Kudos
You should use the Kylin UI for operations and the cube build. This includes selecting tables, defining dimensions, and deploying the cube build. For querying, you should use the Kylin ODBC driver to access the cubes. You can grab that driver here: http://kylin.apache.org/docs15/tutorial/odbc.html. You can use it with your BI tools (Tableau, PowerBI, etc.). MDX is not widely used, so I don't believe it is supported with Kylin.
03-25-2016
07:01 PM
1 Kudo
If you don't have important data in Cloudbreak, you can drop the db as root and run:
cbd delete
Switch back to the cloudbreak user and edit the 'Profile' file in your Cloudbreak deployment directory (i.e. /var/lib/cloudbreak-deployment). Then export the username and password you would like to use as the default user:
export UAA_DEFAULT_USER_EMAIL=admin@example.com
export UAA_DEFAULT_USER_PW=mypass
This will override the default settings. Then:
cbd init
cbd start
On startup, the changed default settings should be displayed.