Member since: 09-28-2015
Posts: 48
Kudos Received: 117
Solutions: 8
07-28-2016
04:07 AM
1 Kudo
Labels: Hortonworks Data Platform (HDP)
07-07-2016
11:25 PM
7 Kudos
Overview
I recently encountered a question where someone asked how to do preemption across YARN queues when a Spark job grows beyond its queue's minimum guarantee. They had seen this before with the Fair Scheduler and MapReduce, but wanted the same experience with Spark and the Capacity Scheduler. This how-to article describes how to set that up.
Goal: Run large Spark jobs in two separate capacity queues to produce an equal share of resources for both jobs.
Hardware: 5 nodes of AWS EC2 r3.xlarge
Cluster Configuration: HDP 2.4.2, Spark 1.6.1, 5 Node Managers, 20GB (20480MB) YARN containers
yarn.scheduler.maximum-allocation-mb=20480
yarn.scheduler.minimum-allocation-mb=2560
High Level Setup:
1. Add preemption properties as per documentation
2. Create two YARN queues with fair ordering
- Child queue "test1" with a min capacity of 50% and a max of 100%
- Child queue "test2" with a min capacity of 50% and a max of 100%
- Root queue with a fair ordering policy
3. Run Spark jobs
- Run a Spark job on test1 with max-size containers for as many Spark executors as possible
- Run a Spark job on test2 with max-size containers using dynamic resource allocation
1) Add YARN Preemption Properties
The following parameters should be applied to the yarn-site.xml file. This can be done manually or through Ambari. These are the default preemption properties as provided in the Hortonworks documentation:
yarn.resourcemanager.scheduler.monitor.enable=true
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
Option 1: Manual
Back up /etc/hadoop/conf/yarn-site.xml, then update /etc/hadoop/conf/yarn-site.xml with the parameters above. Note: you must put these settings in XML format. Restart YARN.
Option 2: Ambari
The parameters can be added to yarn-site.xml through Ambari -> YARN -> Configs. You can turn preemption on in the Settings tab; this sets yarn.resourcemanager.scheduler.monitor.enable=true. The remaining properties need to be added in the Advanced config tab under "Custom yarn-site". Click "Add Property" and add the following:
yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=3000
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=15000
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.1
Restart YARN.
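For the manual route (Option 1), each of those properties goes into yarn-site.xml as a standard Hadoop <property> element. A minimal sketch of the first two entries, assuming you are editing the file by hand (the remaining properties follow the same pattern):
<!-- preemption settings from the list above, in yarn-site.xml format -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>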
2) Create Two YARN Queues with Fair Ordering
The following parameters are then added to the capacity-scheduler.xml file. You can do this manually or through the Ambari View (YARN Queue Manager).
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.queues=test1,test2
yarn.scheduler.capacity.root.ordering-policy=fair
yarn.scheduler.capacity.root.ordering-policy.fair.enable-size-based-weight=true
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.test1.acl_submit_applications=*
yarn.scheduler.capacity.root.test1.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test1.maximum-capacity=100
yarn.scheduler.capacity.root.test1.user-limit-factor=1
yarn.scheduler.capacity.root.test1.state=RUNNING
yarn.scheduler.capacity.root.test1.capacity=50
yarn.scheduler.capacity.root.test1.ordering-policy=fifo
yarn.scheduler.capacity.root.test2.acl_administer_queue=*
yarn.scheduler.capacity.root.test2.acl_submit_applications=*
yarn.scheduler.capacity.root.test2.minimum-user-limit-percent=100
yarn.scheduler.capacity.root.test2.maximum-capacity=100
yarn.scheduler.capacity.root.test2.user-limit-factor=1
yarn.scheduler.capacity.root.test2.state=RUNNING
yarn.scheduler.capacity.root.test2.capacity=50
yarn.scheduler.capacity.root.test2.ordering-policy=fifo
Option 1: Manual
Back up the original file: /etc/hadoop/conf/capacity-scheduler.xml
Update /etc/hadoop/conf/capacity-scheduler.xml with the settings above.
Run the following command from the /etc/hadoop/conf directory to refresh the queues:
yarn rmadmin -refreshQueues
Option 2: Ambari View
Using the YARN Queue Manager view in Ambari, you can apply the same settings to capacity-scheduler.xml through the GUI. Set up the YARN queues so that the test1 and test2 queues look exactly the same and roll up to the root queue; both should have the configuration listed above. The root queue should have an Ordering Policy of Fair with Enable Size Based Weight Ordering checked.
Restart YARN.
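As a quick check after the refresh or restart, the YARN CLI in HDP 2.4 (Hadoop 2.7) includes a queue status subcommand; a minimal sketch, assuming it is available on your cluster (this check is not part of the original steps):
# print the configured capacity and state of each queue
yarn queue -status test1
yarn queue -status test2
# expect roughly: Capacity : 50.0%, Maximum Capacity : 100.0%, State : RUNNING for both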
3) Run Spark Jobs
sudo su - hdfs
cd /usr/hdp/current/spark-client
Run the following Spark job and make sure it runs over-capacity on the test1 queue. Notice how we specify 5 executors with large containers:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test1 --num-executors 5 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000
Confirm in the Resource Manager UI (http://resource-manager-node:8088/cluster) that it is running over-capacity in the test1 queue.
Run a second Spark job on the test2 queue. Notice that this job does not specify the number of executors, because it relies on Spark dynamic resource allocation:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test2 --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000
Watch the queues in the Resource Manager UI: within a few seconds, YARN preempts containers and the second Spark job takes some of them from the first job, giving you a fair balance of resources across the root queue.
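One assumption worth calling out: dynamic resource allocation only takes effect if it is enabled for the job and the NodeManagers run Spark's external shuffle service. If that is not already configured cluster-wide in spark-defaults.conf, a sketch of passing it on the command line could look like this (the min/max executor bounds are illustrative, not values from the original test):
# hypothetical variant of the second submit with the dynamic allocation flags made explicit
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --queue test2 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --executor-memory 18G --executor-cores 2 lib/spark-examples*.jar 1000000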
07-07-2016
03:40 AM
1 Kudo
There's a JIRA and PR already out there for this. You might want to track it here. JIRA: https://issues.apache.org/jira/browse/ZEPPELIN-987?jql=project%20%3D%20ZEPPELIN and Pull Request: https://github.com/apache/zeppelin/pull/993
05-12-2016
12:06 PM
@drussell Thanks. I saw this list too, but I'm not sure it's least access. I can see places where you can create conditions in the policy so that you can only work in a specific VPC, such as:
"Condition": {
  "StringEquals": {
    "ec2:vpc": "arn:aws:ec2:us-east-1:############:vpc/vpc-XXXXXX"
  }
}
Or you can get really specific on resources. Instead of using "Resource": "*" in the policy, you can scope it down to the instances in a certain region, for example:
"Resource": "arn:aws:ec2:us-east-1::instance/*"
The list of roles is good, but the resources and conditions on those roles are just as important, especially as we discuss this with the InfoSec team and have to justify the why.
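To make that concrete, here is a hedged sketch of how a full statement might combine an action list, a scoped resource, and the VPC condition. The account ID, VPC ID, and the two actions are placeholders for illustration, not a vetted least-privilege set for Cloudbreak:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "IllustrativeOnlyInstanceOpsLimitedToOneVpc",
      "Effect": "Allow",
      "Action": ["ec2:RunInstances", "ec2:TerminateInstances"],
      "Resource": "arn:aws:ec2:us-east-1:############:instance/*",
      "Condition": {
        "StringEquals": {
          "ec2:vpc": "arn:aws:ec2:us-east-1:############:vpc/vpc-XXXXXX"
        }
      }
    }
  ]
}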
05-12-2016
04:15 AM
2 Kudos
Many financial and highly regulated institutions use least-privilege access models when rolling out software features to end users. With Cloudbreak, what is the least-privilege access model you would have to supply in AWS to make this work? Additionally, how can you ensure that the IAM accounts can manage their own VPC deployment while guaranteeing that they can't change the configuration of other VPCs under the same account?
Labels: Hortonworks Cloudbreak
04-28-2016
12:54 PM
2 Kudos
Alex – This makes sense to me. If you're tailing files or listening to syslog, NiFi has easy-to-use processors that can forward this information to a downstream search tool (SOLR, ES, etc.) or even persist it in long-term storage (HDFS). You can encrypt and compress the data as you capture it, send it over a secure wire, and do the simple event processing you need in order to route the information to the appropriate endpoint. There are also processors such as ScanContent or RouteOnContent that let you route messages based on patterns (think regex and whitelists) found in the message payload (i.e. route errors here, info there) or create priorities for those messages.

The other place where NiFi helps tremendously is data conversion, for example ConvertAvroToJSON, ConvertCSVToAvro, or AttributesToJSON. These help you get messages into the proper file streams to be indexed by your search tool.

The one place I would look at closely is the amount of log parsing you need to do. For unique formats, you may need to create a custom processor in NiFi to help extract log attributes. Processors such as EvaluateXPath or EvaluateXQuery let you pull attribute-value pairs out of XML (and EvaluateJsonPath does the same for JSON), which is extremely helpful and may be all you need. Otherwise, it's really easy to get started and play around with your use case to see if there's a fit.
04-19-2016
04:09 AM
5 Kudos
This was tested on Yosemite 10.10.5.

1) Install NiFi on your MacOS machine: http://docs.hortonworks.com/HDPDocuments/HDF1/HDF-1.2/bk_HDF_InstallSetup/content/ch_HDF_installing.html

2) Set up your machine to forward syslog messages to port 1514.

Back up your current syslog configuration:
sudo mv /etc/syslog.conf /etc/syslog.conf.bkp

Edit your syslog.conf file to send all messages to UDP localhost port 1514:
sudo vi /etc/syslog.conf

Add the following entry to /etc/syslog.conf:
*.* @127.0.0.1:1514

Restart syslogd:
sudo launchctl unload /System/Library/LaunchDaemons/com.apple.syslogd.plist
sudo launchctl load /System/Library/LaunchDaemons/com.apple.syslogd.plist

Confirm syslogd is running; the result should display a process id (PID) for /usr/sbin/syslogd:
ps -ef | grep syslogd

3) Test with NiFi. Add a ListenSyslog processor to the canvas with the following settings:
Protocol: UDP
Port: 1514
Local Network Interface: lo0

Connect the ListenSyslog processor to an output port and set the relationship to "success". Start the ListenSyslog processor. You should see data get queued up, and the Out statistics should show bytes flowing through the processor. Sometimes you need to help it along and send some messages to the syslogd server; if so, try typing this on the command line and then verify the data flowing in NiFi:
syslog -s test message
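If one message is not enough to see movement, a quick loop of test messages works too; a minimal sketch using the same syslog command (the message text is arbitrary):
# send a burst of test messages so the ListenSyslog "Out" stats visibly increment
for i in {1..10}; do syslog -s "NiFi syslog test message $i"; done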
04-13-2016
02:32 AM
4 Kudos
Hi Babu - The more common approach is to write out a new file. HDFS is essentially an append-only system, so creating a new file that's a derivative of the original is a very common practice. You can write a MapReduce program to output a file or use a Hive query to write query results to a new file. For example: INSERT OVERWRITE DIRECTORY '/user/me/output' SELECT UPPER(myColumn) FROM myTable. This creates new file(s) with the modification applied, which acts like an update. In this case, we're upper-casing the myColumn column of the myTable table.
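Formatted as a runnable statement, a minimal sketch of that same example (the directory, table, and column names are just the placeholders from above):
-- write the "updated" copy of the data to a new HDFS directory
-- instead of modifying the original file in place
INSERT OVERWRITE DIRECTORY '/user/me/output'
SELECT UPPER(myColumn)
FROM myTable;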
04-11-2016
08:41 PM
2 Kudos
You should use the Kylin UI for operations and the cube build. This includes selecting tables, defining dimensions, and deploying the cube build. For querying, you should use the Kylin ODBC driver to access the cubes. You can grab that driver here: http://kylin.apache.org/docs15/tutorial/odbc.html. You can use it with your BI tools (Tableau, PowerBI, etc.). MDX is not widely used, so I don't believe it is supported with Kylin.
03-25-2016
07:01 PM
1 Kudo
If you don't have important data in Cloudbreak, you can drop the db as root and run:
cbd delete
Switch back to the cloudbreak user and edit the 'Profile' file in your Cloudbreak deployment directory (i.e. /var/lib/cloudbreak-deployment). Then export the username and password you would like to use as the default user:
export UAA_DEFAULT_USER_EMAIL=admin@example.com
export UAA_DEFAULT_USER_PW=mypass
This will override the default settings. Then:
cbd init
cbd start
On startup, the changed default settings should be displayed.