Member since: 09-24-2015
Posts: 98
Kudos Received: 76
Solutions: 18
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2857 | 08-29-2016 04:42 PM |
 | 5698 | 08-09-2016 08:43 PM |
 | 1743 | 07-19-2016 04:08 PM |
 | 2468 | 07-07-2016 04:05 PM |
 | 7415 | 06-29-2016 08:25 PM |
05-23-2016
06:13 PM
1 Kudo
Hello Alex: You can access Hive tables via Zeppelin in two ways:
1) Use Zeppelin's native Hive interpreter directly, by starting a code block with the '%sql' interpreter command and issuing commands like 'show tables' or 'select * from table'.
2) Via Spark, by creating a HiveContext and then loading the Hive table into a DataFrame, like this:
%spark
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
... View more
05-19-2016
10:04 PM
Please note, I modified the original comment above since it allocated too much PermGen space. I changed the value from 8192M to the following, which requires a total of 3 GB of RAM to run spark-shell: "-XX:MaxPermSize=1024M -Xmx2048m"
... View more
05-19-2016
09:56 PM
Several others fixed this problem by setting this parameter:
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.3.0.0-2557
...but please make sure you use the correct version string for your cluster. You can retrieve it on an HDP node by running this command:
hdp-select | grep hadoop-yarn-resourcemanager
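If you would rather set this programmatically than in spark-defaults.conf or on the command line, a minimal sketch (assuming you build your own SparkContext, e.g. in a submitted application rather than spark-shell, and substituting your own version string) could look like:
import org.apache.spark.{SparkConf, SparkContext}

// The AM options must be in place before the SparkContext is created
val conf = new SparkConf()
  .setAppName("hdp-version-example") // hypothetical application name
  .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.3.0.0-2557") // substitute your own version string
val sc = new SparkContext(conf)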
... View more
05-19-2016
08:02 PM
1 Kudo
Reviewing the stack traces above, it appears you hit an Out-Of-Memory error, based on the kill command dump message: -XX:OnOutOfMemoryError
You can increase the Java heap memory available to Spark with the option spark.driver.extraJavaOptions, which can also be set via the command-line parameter --driver-java-options, like this:
./bin/spark-shell --verbose --master yarn-client \
  --driver-java-options "-XX:MaxPermSize=1024M -Xmx2048m"
In addition, you can set the corresponding option in the "spark-defaults.conf" config file located under the $SPARK_HOME/conf directory; the parameter there is also named "spark.driver.extraJavaOptions". To summarize, properties are applied to a job submission in the following order, with later entries overriding earlier ones:
1. Defaults from spark-defaults.conf
2. Command-line arguments
3. Program-embedded overrides via the org.apache.spark.SparkConf API
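As a minimal sketch of the last level (programmatic overrides), assuming you construct your own SparkContext: note that in yarn-client mode the driver JVM options themselves cannot be changed this way (the driver JVM is already running), so the example sets the executor-side equivalent instead, with an illustrative value:
import org.apache.spark.{SparkConf, SparkContext}

// Values set here override both spark-defaults.conf and command-line arguments
val conf = new SparkConf()
  .setAppName("conf-precedence-example") // hypothetical application name
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512M") // illustrative value
val sc = new SparkContext(conf)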
... View more
05-13-2016
08:30 PM
@Andrew Sears' answer is correct. Once you bring up the Spark History Server URL (http://{driver-node}:4040), you can navigate to the Executors tab, which shows detailed statistics about the driver and each executor, as shown below. Note that when running Hortonworks Data Platform (HDP), you can get there from the Spark service page by clicking "Quick Links" and then the "Spark History Server UI" button. After that, you will need to find your specific job under "App ID".
... View more
05-13-2016
06:34 PM
Hi Puneet: I'm not 100% certain I understand your question, but let me suggest the following:
If you have a DataFrame or RDD (Resilient Distributed Dataset) and you want to see the before/after state for a given transformation, you can run a relatively low-cost action like take() or sample() to print a few elements of your DataFrame; these actions return only a small number of elements to the driver. Full documentation for DataFrame.take() is here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
Excerpt here:
DataFrame class:
def take(n: Int): Array[Row]
Returns the first n rows in the DataFrame.
Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with an OutOfMemoryError.
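For example, a minimal sketch of peeking at a DataFrame before and after a transformation (the table and column names are hypothetical, and sqlContext is assumed to be a SQLContext or HiveContext, as in spark-shell):
// Hypothetical table and column names, for illustration only
val before = sqlContext.table("web_logs")
before.take(5).foreach(println)     // returns only 5 rows to the driver

val after = before.filter("status = 500")
after.take(5).foreach(println)

// sample() is another cheap way to look at a small fraction of the data
after.sample(false, 0.01).show(5)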
... View more
05-10-2016
09:25 PM
Hello Jasper:
It looks like your Zeppelin lib directory is missing a lot of files. For instance, here is mine under the HDP 2.4 Sandbox:
[root@sandbox lib]# cd /usr/hdp/current/zeppelin-server/lib
[root@sandbox lib]# ls -l
total 1384
drwxr-xr-x 2 zeppelin hadoop 4096 2016-03-14 14:34 bin
lrwxrwxrwx 1 zeppelin hadoop 18 2016-03-14 14:34 conf -> /etc/zeppelin/conf
drwxr-xr-x 17 zeppelin hadoop 4096 2016-03-14 14:35 interpreter
drwxr-xr-x 2 zeppelin hadoop 4096 2016-03-14 14:35 lib
-rw-r--r-- 1 zeppelin hadoop 13540 2016-02-10 10:40 LICENSE
drwxr-xr-x 5 zeppelin hadoop 4096 2016-05-04 23:36 local-repo
drwxr-xr-x 21 zeppelin hadoop 4096 2016-04-19 13:36 notebook
-rw-r--r-- 1 zeppelin hadoop 6675 2016-02-10 10:40 README.md
drwxr-xr-x 3 zeppelin hadoop 4096 2016-04-14 16:30 webapps
-rw-r--r-- 1 zeppelin hadoop 66393 2016-02-10 10:48 zeppelin-server-0.6.0.2.4.0.0-169.jar
-rw-r--r-- 1 zeppelin hadoop 1297455 2016-02-10 10:48 zeppelin-web-0.6.0.2.4.0.0-169.war
Did you already perform the Add Service step? After Ambari restarts and the service indicators turn green, add the Zeppelin service:
1. At the bottom left of the Ambari dashboard, choose Actions -> Add Service.
2. On the Add Service screen, select the Zeppelin service.
3. Step through the rest of the installation process, accepting all default values.
4. On the Review screen, make a note of the node selected to run the Zeppelin service; call this ZEPPELIN_HOST. Click Deploy to complete the installation process.
... View more
05-05-2016
03:54 PM
1 Kudo
This is the image showing the phases of the optimizer (from @Rajkumar Singh's link above).
... View more
05-04-2016
08:19 PM
2 Kudos
First of all, you can show the EXPLAIN PLAN with this syntax:
spark-sql> EXPLAIN SELECT * FROM mytable WHERE key = 1;
Yes, Spark SQL always uses the Catalyst optimizer. In addition, DataFrame operations now go through it as well. This is shown in the diagram, where both the SQL query (AST parser output) and DataFrames feed into the Analysis phase of the optimizer.
Also, be aware that there are two types of contexts: SQLContext and HiveContext. HiveContext provides a superset of the functionality of the basic SQLContext, including the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables.
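If you are using the DataFrame API rather than the SQL shell, a minimal sketch of inspecting the Catalyst output directly (the table name is just an example, and sqlContext is assumed to be a HiveContext, as in spark-shell):
// 'mytable' is a hypothetical table name
val df = sqlContext.table("mytable").filter("key = 1")

// Prints the physical plan chosen by the Catalyst optimizer
df.explain()

// Pass true to also print the parsed, analyzed, and optimized logical plans
df.explain(true)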
... View more
05-04-2016
07:12 PM
Hi Henry: Since you are requesting 15G for each executor, you may want to increase the Java heap space available to the Spark executors, which is allocated using this parameter: spark.executor.extraJavaOptions='-Xmx24g'
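As an aside, here is a minimal sketch of sizing the executor heap programmatically; it uses spark.executor.memory (Spark's dedicated setting for the executor heap) rather than raw JVM flags, and the value is illustrative only:
import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.memory sets the executor JVM heap (the executor's -Xmx)
val conf = new SparkConf()
  .setAppName("executor-heap-example") // hypothetical application name
  .set("spark.executor.memory", "24g") // illustrative; must fit within the YARN container limits
val sc = new SparkContext(conf)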
... View more