Member since: 09-24-2015
Posts: 98
Kudos Received: 76
Solutions: 18
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1077 | 08-29-2016 04:42 PM
 | 1849 | 08-09-2016 08:43 PM
 | 520 | 07-19-2016 04:08 PM
 | 772 | 07-07-2016 04:05 PM
 | 2681 | 06-29-2016 08:25 PM
11-12-2019
06:16 AM
Hi, could you please let us know what you mean by "not yet deployed"? Do you mean the jobs that have not been kicked off after you run the spark-submit command? Or could you please explain in more detail. Thanks, Akr
09-15-2016
08:35 PM
3 Kudos
Repo Description Here is a new Zeppelin notebook, part of the Hortonworks Gallery on GitHub, which can be used as a template for analysing web server log files using Spark and Zeppelin. This notebook was ported from an original Jupyter notebook that was part of an edX online course, "Introduction to Apache Spark", sponsored by Databricks. It is written using "pyspark", the Python interpreter for Spark. You can import this notebook into your own instance of Zeppelin using the "Import Note" button on the home page; then copy the URL below and paste it into the "Add from URL" box. Here is the URL link to the actual Zeppelin notebook (note.json) on hortonworks-gallery: https://github.com/hortonworks-gallery/zeppelin-notebooks/blob/master/2BXSE1MV8/note.json Here is the link to view the notebook on Zeppelin Hub: ZeppelinHub Notebook The source data is an actual HTTP web server log taken from the NASA Apollo website.
Repo Info
Github Repo URL: https://github.com/hortonworks-gallery/zeppelin-notebooks
Github account name: hortonworks-gallery
Repo name: zeppelin-notebooks
- Find more articles tagged with:
- Data Science & Advanced Analytics
- Spark
- spark-sql
- zeppelin
- zeppelin-notebook
09-13-2016
08:25 PM
1 Kudo
@Kirk Haslbeck Michael is correct; you will get 5 total executors.
01-31-2018
09:20 AM
How do I set the JAVA_HOME path?
08-16-2016
11:00 AM
3 Kudos
Apache Zeppelin (version 0.6.0) includes the ability to securely authenticate users and require logins. It uses the Apache Shiro security framework to accomplish this objective. Note: prior versions of Zeppelin did not force users to login. After launching the HDP 2.5 Tech Preview Sandbox on a virtual machine, make sure the Zeppelin service is up and running via Ambari. Next, open the Zeppelin UI either by clicking on: Services (tab) -> Zeppelin notebook (left-hand panel) -> Quick Links (tab) -> "Zeppelin UI" (button) or just by opening a browser at: http://sandbox.hortonworks.com:9995/ (or http://127.0.0.1:9995/) The Zeppelin welcome page should show in the browser, and you should notice a "Login" button in the upper right-hand corner. This will bring up a pop-up window with text entries for username and password. Enter one of the username/password pairs below (these are the defaults listed in the "shiro.ini" file located in the "conf" sub-directory of zeppelin): Username/Password pairs:
admin/password1
user1/password2
user2/password3
user3/password4
If you want to change these passwords or add more users, you can use the "Credentials" tab of the Zeppelin notebook to create additional usernames. After entering the credentials, you will be logged in and the existing notebooks will display on the left-hand side of the Zeppelin screen. If you enter the wrong username or password, you will be directed back to the Welcome page. FYI: For more information about Zeppelin security, see this link: https://github.com/apache/zeppelin/blob/master/SECURITY-README.md FYI: For more detailed information about Apache Shiro configuration options, see this link: http://shiro.apache.org/configuration.html#Configuration-INISections
- Find more articles tagged with:
- authentication
- Data Science & Advanced Analytics
- How-ToTutorial
- Security
- Spark
- zeppelin
08-09-2016
09:22 PM
4 Kudos
Just a few months ago, the Apache Storm project announced release 1.0 of the distribution. The bullet points below summarize the new features available. For more detailed descriptions, you can go to this link to read the full release notes: http://storm.apache.org/2016/04/12/storm100-released.html
Apache Storm 1.0 Release: Apache Storm 1.0 is "up to 16 times faster than previous versions, with latency reduced up to 60%."
- Pacemaker (Heartbeat Server): Pacemaker is an optional Storm daemon designed to process heartbeats from workers (it overcomes the scaling problems of ZooKeeper).
- Distributed Cache API: Files in the distributed cache can be updated at any time from the command line, without the need to redeploy a topology.
- HA Nimbus: Multiple instances of the Nimbus service run in a cluster and perform leader election when a Nimbus node fails.
- Native Streaming Window API: Storm has support for sliding and tumbling windows based on time duration and/or event count.
- Automatic Backpressure: Storm now has an automatic backpressure mechanism based on configurable high/low watermarks expressed as a percentage of a task's buffer size.
- Resource Aware Scheduler: The new resource-aware scheduler (AKA "RAS Scheduler") allows users to specify the memory and CPU requirements for individual topology components.
Storm also makes it easier to debug, with: Dynamic Log Levels, Tuple Sampling and Debugging, and Dynamic Worker Profiling.
- Find more articles tagged with:
- Data Ingestion & Streaming
- FAQ
- realtime
- Storm
- stream-processing
- streaming
08-09-2016
08:43 PM
First, check whether your data is stored in a splittable format (snappy, LZO, bzip2, etc.). If so, instruct Spark to split the data into multiple partitions when it reads the file. In Scala, you can do this: val file = sc.textFile(path, numPartitions) You will also need to tune your YARN container sizes to work with your executor allocation. Make sure your maximum YARN memory allocation ('yarn.scheduler.maximum-allocation-mb') is bigger than what you are asking for per executor (this includes the default overhead of 384 MB). The following parameters are used to allocate Spark executors and driver memory: spark.executor.instances -- number of Spark executors
spark.executor.memory -- memory per Spark executor (plus 384 MB overhead)
spark.driver.memory -- memory for the Spark driver
A 6 MB file is pretty small, much smaller than the HDFS block size, so you are probably getting a single partition until you do something to repartition it. You can also set the numPartitions parameter on read, as shown above. I would probably call one of these repartition methods on your DataFrame: def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions. The resulting DataFrame is hash partitioned.
OR this: def repartition(numPartitions: Int): DataFrame
Returns a new DataFrame that has exactly numPartitions partitions.
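To make that concrete, here is a minimal sketch; the file paths and partition counts below are assumptions for illustration, not values taken from your job:
// Hedged sketch only; paths and partition counts are made up
val raw = sc.textFile("hdfs:///data/input.txt", 30)        // ask for ~30 partitions at read time
println(raw.partitions.length)                             // verify how many partitions you actually got
val df = sqlContext.read.json("hdfs:///data/input.json")   // hypothetical DataFrame source
val spread = df.repartition(30)                            // hash-partition into exactly 30 partitions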
07-28-2017
08:19 AM
We ran into the same scenario, where Zeppelin was always launching 3 containers in YARN even after the dynamic allocation parameters were enabled in Spark, because Zeppelin was not picking those parameters up.
To get Zeppelin to launch more than the 3 containers it launches by default, we need to configure the following in the Zeppelin Spark interpreter: spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.initialExecutors=0
spark.dynamicAllocation.minExecutors=2 --> Start this with a low value; otherwise it will launch the specified minimum number of containers, use only the containers (memory and vCores) it actually needs, and mark the rest of the memory and vCores as reserved, which causes memory issues.
spark.dynamicAllocation.maxExecutors=10
It is also generally better to start with less executor memory (e.g. 10-15 GB) and more executors (20-30). In our scenario we observed that with 50-100 GB of executor memory and 5-10 executors the query took 3 min 48 sec (228 sec), which makes sense because the parallelism is very low; reducing the executor memory to 10-15 GB and increasing the executors to 25-30, the same query took only 54 sec. Please note that the number of executors and the executor memory are use-case dependent, and we ran a few trials before finding the optimal settings for our scenario.
07-07-2016
03:51 PM
So far this is the best answer I can find, but it doesn't fully satisfy my requirement. df.select($"tradeId", $"assetClass", $"transType", $"price", $"strikePrice", $"contractType", $"stockAttributes.*", $"account.*").printSchema The schema is now flat with minimal coding, but it did require statically typing each field; a sketch of a more generic, schema-driven approach appears after the printed schema below. root
|-- tradeId: string (nullable = true)
|-- assetClass: string (nullable = true)
|-- transType: string (nullable = true)
|-- price: string (nullable = true)
|-- strikePrice: string (nullable = true)
|-- contractType: string (nullable = true)
|-- 52weekHi: string (nullable = true)
|-- 5avg: string (nullable = true)
|-- accountType: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- zip: string (nullable = true)
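For reference, here is a hedged sketch of how the same flattening could be driven off the DataFrame schema instead of typing each field by hand. This is my own illustration, not part of the answer above; it assumes every struct column should simply be expanded with ".*":
// Sketch: expand every struct column with ".*" and keep the other columns as-is
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
val flatCols: Array[Column] = df.schema.fields.map { f =>
  f.dataType match {
    case _: StructType => col(f.name + ".*")   // e.g. stockAttributes.* , account.*
    case _             => col(f.name)
  }
}
df.select(flatCols: _*).printSchema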
07-02-2016
08:24 AM
This is going to be one of two issues: disk I/O or executors. With count() you will not be doing any swapping. Set the number of partitions to (cores * nodes) - 2. Let's assume 8 cores per node; then that's 30 for you, i.e. val rdd = sc.textFile("some file", 30) That being said, I don't see how a shuffle is going to help a simple count: it executes on each partition without a shuffle and returns the result to the driver. You can run a test by changing count() to saveAsTextFile(), but I suspect you are bound by disk I/O. Are you in a cloud environment? Try a reduceByKey() followed by count(). If your processing time is still about the same, that further points to disk I/O.
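A minimal sketch of that reduceByKey-then-count test, assuming an RDD of text lines named rdd; the key below is made up purely to force a shuffle:
// Hedged sketch: bucket lines by an arbitrary key so reduceByKey forces a shuffle
val pairs = rdd.map(line => (line.length % 100, 1L))
val keyCount = pairs.reduceByKey(_ + _).count()   // shuffle happens here, then the distinct keys are counted
println(keyCount)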
06-18-2016
05:30 AM
When Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by the Hadoop InputFormat used to read the file. For instance, if you use textFile() it would be TextInputFormat in Hadoop, which would return a single partition for a single HDFS block (but the split between partitions would be done on line boundaries, not at the exact block boundary), unless you have a compressed text file. In the case of a compressed file you would get a single partition for a single file (as compressed text files are not splittable). The actual partition size is defined by FileInputFormat.computeSplitSize using the formula below (a worked example with assumed values follows the parameter list): return Math.max(minSize, Math.min(goalSize, blockSize))
where,
minSize is the hadoop parameter mapreduce.input.fileinputformat.split.minsize
blockSize is the value of the dfs.block.size in cluster mode and fs.local.block.size in the local mode
goalSize=totalInputSize/numPartitions
where,
totalInputSize is the total size in bytes of all the files in the input path.
numPartitions is the custom parameter provided to the method sc.textFile(inputPath, numPartitions)
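Here is a worked example of that formula with assumed values (128 MB block size, 1 MB minimum split size, 10 GB of input requested as 100 partitions), just to show how the pieces interact:
// All values below are assumptions for illustration
val blockSize = 128L * 1024 * 1024                    // dfs.block.size
val minSize   = 1L * 1024 * 1024                      // mapreduce.input.fileinputformat.split.minsize
val goalSize  = (10L * 1024 * 1024 * 1024) / 100      // totalInputSize / numPartitions, about 102 MiB
val splitSize = math.max(minSize, math.min(goalSize, blockSize))
// goalSize is below blockSize here, so each split is about 102 MiB and you end up with roughly 100 partitions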
05-19-2016
08:05 AM
Very big thank you!
05-06-2016
07:06 PM
Thanks for your help. And do you know what the DAG visualization of the jobs executed after we run a query actually represents? Does that visualization show the physical plan or the logical plan?
04-18-2016
05:33 PM
Thanks @Bernhard Walter - yes, that's exactly what I did.
03-31-2016
08:08 AM
I'm running HDP 2.4. I restarted Spark and it appears in the list of my notebook's interpreters, but the problem still exists!
03-24-2016
08:26 PM
From the error message in the stack trace, it looks like you may have mistyped the spark-submit command line. The main class definition is provided by the --class <main-class> parameter, as shown in this syntax definition: ./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments] If you have put the string "Dhdp.version=2.3.4.1-10" on the command line, then it could lead to the error. The other possibility is that you have entered this string into the "spark-env.sh" file within the $SPARK_HOME/conf directory. Double-check this file and look for any parameter ending in "OPTS", such as "SPARK_DAEMON_JAVA_OPTS". This could be adding something wrong to the spark-submit argument list and lead to the error.
03-18-2016
12:36 PM
1 Kudo
Pay attention to the format of the shell action arguments; it should be like <exec>java</exec>
<argument>-classpath</argument>
<argument>$CLASSPATH</argument>
<argument>Hello</argument> instead of a single command. Also be aware that the shell command is executed on an arbitrary node of the cluster, so all the tools you're using have to be preinstalled on all the nodes. That's not an issue in your case for now, since you're using a single-node sandbox, but it could be a problem in production. Regards
11-02-2017
03:15 PM
This problem was noted while running HDP 2.5. Apparently it was fixed in version 2.6; however, this also required updating Oracle VM VirtualBox to version 5.1.30. The problem has not reappeared.
03-31-2017
12:21 PM
Hi, I am planning to create an Ambari Hadoop Storm cluster, and as this is brand new to me I have some doubts about the best way to set it up. Here is what I have for resources: - Platform: AWS (8 EC2 instances: 1 master, 4 slaves, 3 workers (zookeepers)) - Tools: As I want to automate the setup, I will use Terraform, Ansible, and a Blueprint to set up the whole environment. I have done a bit of research and drawn some conclusions, and I need some advice/opinion on whether this is a good path. Thanks
MASTER
SLAVE
ZOO
NAMENODE
SECONDARY_NAMENODE
DATANODE
NIMBUS
RESOURCE_MANAGER
NODEMANAGER
DRPC_SERVER
SUPERVISOR
ZOOKEEPER_SERVER
STORM_UI_SERVER
ZOOKEEPER_CLIENT
METRICS_MONITOR
ZOOKEEPER_CLIENT
METRICS_MONITOR
MAPREDUCE2_CLIENT
HDFS_CLIENT
HDFS_CLIENT
HDFS_CLIENT
PIG
PIG
PIG
TEZ_CLIENT
TEZ_CLIENT
TEZ_CLIENT
YARN_CLIENT
YARN_CLIENT
YARN_CLIENT
METRICS_COLLECTOR
HISTORY_SERVER
METRICS_GRAFANA
MAPREDUCE2_CLIENT
APP_TIMELINE_SERVER
HIVE_SERVER
HCAT
HIVE_METASTORE
WEBHCAT_SERVER
MYSQL_SERVER
HIVE_CLIENT
01-07-2016
01:37 PM
Performance really isn't slow when executing the query. This is interesting. I figured that, because the query had utilized the CBO in the tutorial I linked in the original question, it would still work now. I guess my thinking is incorrect?
10-16-2017
06:43 AM
This can be achieved by setting the following property in Spark: sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true") Note that the property is set using the sqlContext instead of the sparkContext: do not set this using the Spark context; use the sqlContext for DataFrames created out of Hive tables. I tested this in Spark 1.6.2.
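For example (a hedged sketch; the table name below is hypothetical):
// Enable recursive directory listing on the SQLContext, then query a Hive table whose partitions contain nested directories
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
val df = sqlContext.sql("SELECT * FROM some_db.some_nested_table")
df.count()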
06-27-2016
07:37 PM
The referenced JIRA above is now resolved. I have successfully tested the new version of the Hive ODBC Driver on Mac OS X version 10.11 (El Capitan). However, please note that you must install the new Hive ODBC driver version 2.1.2, as shown through the iODBC Administration tool. Please also note that the location of the driver file has changed. Here is the new odbcinst.ini file (stored in ~/.odbcinst.ini), showing the old driver location commented out and the new driver location below it: [ODBC Drivers]
Hortonworks Hive ODBC Driver=Installed
[Hortonworks Hive ODBC Driver]
Description=Hortonworks Hive ODBC Driver
; old driver location
; Driver=/usr/lib/hive/lib/native/universal/libhortonworkshiveodbc.dylib
; new driver location below
Driver=/opt/hortonworks/hiveodbc/lib/universal/libhortonworkshiveodbc.dylib
04-27-2018
06:33 AM
Set num.partitions=x in server.properties (this is the default number of partitions for newly created topics).
02-15-2017
11:11 AM
Thank you @Ali Bajwa for the good tutorial. I am trying this example with one difference: my NiFi is local, and I am trying to put tweets into a remote Solr. Solr is in a VM that contains the Hortonworks sandbox. Unfortunately I am getting this error on the PutSolrContentStream processor: PutSolrContentStream[id=f6327477-fb7d-4af0-ec32-afcdb184e545] Failed to send StandardFlowFileRecord[uuid=9bc39142-c02c-4fa2-a911-9a9572e885d0,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1487148463852-14, container=default, section=14], offset=696096, length=2589],offset=0,name=103056151325300.json,size=2589] to Solr due to org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://172.17.0.2:8983/solr/tweets_shard1_replica1; routing to connection_failure: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://172.17.0.2:8983/solr/tweets_shard1_replica1; Could you help me? Thanks, Shanghoosh