Member since 09-24-2015
98 Posts
76 Kudos Received
18 Solutions
My Accepted Solutions
Views | Posted |
---|---|
2908 | 08-29-2016 04:42 PM |
5771 | 08-09-2016 08:43 PM |
1775 | 07-19-2016 04:08 PM |
2515 | 07-07-2016 04:05 PM |
7508 | 06-29-2016 08:25 PM |
05-23-2016
06:13 PM
1 Kudo
Hello Alex: You can access Hive tables via Zeppelin in two ways:
1) Use a Zeppelin SQL interpreter directly by starting a code block with an interpreter directive ('%hive' for the native Hive interpreter, or '%sql' for Spark SQL, which can also query Hive tables when Zeppelin's Spark interpreter uses a HiveContext) and issuing commands like 'show tables' or 'select * from table'.
2) Via Spark, by creating a HiveContext and then loading the Hive table into a DataFrame, like this:
%spark
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
05-19-2016
10:04 PM
Please note, I modified the original comment above since it allocated too much PermGen space. I changed the value from 8192M to this, which would require a total of 3 GB RAM to run spark-shell: "-XX:MaxPermSize=1024M -Xmx2048m"
05-19-2016
09:56 PM
Several others fixed this problem by setting this parameter:
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.3.0.0-2557
...but please make sure you use the correct version string for your cluster. You can retrieve it on an HDP node by running this command:
hdp-select | grep hadoop-yarn-resourcemanager
05-19-2016
08:02 PM
1 Kudo
By reviewing the stack traces above, it appears you got an out-of-memory error, based on the kill command in the dump message: -XX:OnOutOfMemoryError
You can increase the Java heap memory available to Spark by using the option spark.driver.extraJavaOptions, which can also be set with the command-line parameter --driver-java-options, like this:
./bin/spark-shell --verbose --master yarn-client \
--driver-java-options "-XX:MaxPermSize=1024M -Xmx2048m"
In addition, you can set the corresponding option in the "spark-defaults.conf" config file located under the $SPARK_HOME/conf directory. The corresponding parameter is named "spark.driver.extraJavaOptions".
To summarize, properties are applied to a job submission in the following order, with later sources overriding earlier ones:
1) Defaults from spark-defaults.conf
2) Command-line arguments
3) Program-embedded overrides via the org.apache.spark.SparkConf APIs
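To make the ordering concrete, here is a minimal sketch. It assumes spark-core is on the classpath, and it simulates a submit-time value with a system property rather than running spark-submit; the application name is hypothetical.

```scala
import org.apache.spark.SparkConf

// Assumption for this sketch: spark-submit hands values from spark-defaults.conf
// and the command line to the driver as "spark.*" system properties.
// Here we simulate one such submitted value by hand.
System.setProperty("spark.executor.memory", "1g")

// new SparkConf() (loadDefaults = true) picks up every "spark.*" system property.
val conf = new SparkConf().setAppName("precedence-sketch") // hypothetical app name
val submitted = conf.get("spark.executor.memory")

// An explicit set() in the program overrides the submitted value.
conf.set("spark.executor.memory", "2g")
val effective = conf.get("spark.executor.memory")

println(s"submitted=$submitted effective=$effective")
// prints: submitted=1g effective=2g
```

One caveat: driver JVM options such as spark.driver.extraJavaOptions cannot take effect through SparkConf in client mode, because the driver JVM has already started by the time the program runs. Those belong on the command line or in spark-defaults.conf.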
05-13-2016
08:30 PM
@Andrew Sears' answer is correct. Once you bring up the Spark web UI (http://{driver-node}:4040 while the application is running), you can navigate to the Executors tab, which shows detailed statistics about the driver and each executor, as shown below. Note that port 4040 serves the UI of a live application; the Spark History Server, which covers completed applications, listens on port 18080 by default. When running Hortonworks Data Platform (HDP), you can reach the History Server from the Spark services page by clicking "Quick Links" and then the "Spark History Server UI" button. Following that, you will need to find your specific job under "App ID".
05-13-2016
06:34 PM
Hi Puneet: I'm not 100% certain I understand your question, but let me suggest:
If you have a DataFrame or RDD (resilient distributed dataset) and you want to see the before/after state for a given transformation, you could run a relatively low-cost action such as take() on each side of it to print a few elements. take() only returns a few elements to the driver. sample() can narrow the data first, but it is itself a transformation and still needs an action such as take() or collect() to return results. Full documentation for DataFrame.take() is here:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
Excerpt here:
DataFrame class:
def take(n: Int): Array[Row]
Returns the first n rows in the DataFrame.
Running take requires moving data into the application's driver process, and doing so with a very large 'n' can crash the driver process with OutOfMemoryError.
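As a rough illustration of the before/after check, here is a sketch written against the Spark 1.6-era API. It creates its own local SparkContext so it can run standalone; in spark-shell or Zeppelin you would reuse the existing sc instead. The sample data, column names, and filter are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Local context for a standalone run; in spark-shell or Zeppelin reuse the existing `sc`.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("take-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical sample data: 1000 (key, value) rows.
val before = sc.parallelize(1 to 1000).map(i => (i, s"val_$i")).toDF("key", "value")

// The transformation under inspection: keep only even keys.
val after = before.filter($"key" % 2 === 0)

// take(5) is an action that ships only five rows to the driver.
val firstBefore = before.take(5)
val firstAfter = after.take(5)

println("Before:"); firstBefore.foreach(println)
println("After:"); firstAfter.foreach(println)

sc.stop()
```

Keeping n small is the point: the driver only ever holds a handful of rows, no matter how large the underlying DataFrame is.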
05-10-2016
09:25 PM
Hello Jasper:
It looks like your Zeppelin lib directory is missing many files. For instance, here is mine under the HDP 2.4 Sandbox:
[root@sandbox lib]# cd /usr/hdp/current/zeppelin-server/lib
[root@sandbox lib]# ls -l
total 1384
drwxr-xr-x 2 zeppelin hadoop 4096 2016-03-14 14:34 bin
lrwxrwxrwx 1 zeppelin hadoop 18 2016-03-14 14:34 conf -> /etc/zeppelin/conf
drwxr-xr-x 17 zeppelin hadoop 4096 2016-03-14 14:35 interpreter
drwxr-xr-x 2 zeppelin hadoop 4096 2016-03-14 14:35 lib
-rw-r--r-- 1 zeppelin hadoop 13540 2016-02-10 10:40 LICENSE
drwxr-xr-x 5 zeppelin hadoop 4096 2016-05-04 23:36 local-repo
drwxr-xr-x 21 zeppelin hadoop 4096 2016-04-19 13:36 notebook
-rw-r--r-- 1 zeppelin hadoop 6675 2016-02-10 10:40 README.md
drwxr-xr-x 3 zeppelin hadoop 4096 2016-04-14 16:30 webapps
-rw-r--r-- 1 zeppelin hadoop 66393 2016-02-10 10:48 zeppelin-server-0.6.0.2.4.0.0-169.jar
-rw-r--r-- 1 zeppelin hadoop 1297455 2016-02-10 10:48 zeppelin-web-0.6.0.2.4.0.0-169.war
Did you already perform the Add Service step? After Ambari restarts and the service indicators turn green, add the Zeppelin service:
1) At the bottom left of the Ambari dashboard, choose Actions -> Add Service.
2) On the Add Service screen, select the Zeppelin service.
3) Step through the rest of the installation process, accepting all default values.
4) On the Review screen, make a note of the node selected to run the Zeppelin service; call this ZEPPELIN_HOST.
5) Click Deploy to complete the installation process.
05-05-2016
03:54 PM
1 Kudo
This is the image showing the phases of the optimizer (from @Rajkumar Singh's link above).
05-04-2016
08:19 PM
2 Kudos
First of all, you can show the query plan with the EXPLAIN statement:
spark-sql> EXPLAIN SELECT * FROM mytable WHERE key = 1;
Yes, Spark SQL will always use the Catalyst optimizer. In addition, DataFrame operations now use it as well. This is shown in the diagram, where the SQL query (AST parser output) and DataFrames both feed into the Analysis phase of the optimizer.
Also, be aware that there are two types of contexts: SQLContext and HiveContext. HiveContext provides a superset of the functionality of the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables.
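To see both entry points go through the optimizer, the sketch below runs the same filter once as SQL and once through the DataFrame API and prints the extended plan for each. It assumes the Spark 1.6-era API and uses a plain SQLContext with a local SparkContext so it runs without a Hive installation; the tiny in-memory table is a hypothetical stand-in for mytable.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Local context for a standalone run; in spark-shell or Zeppelin reuse the existing `sc`.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("explain-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical stand-in for "mytable".
val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("key", "value")
df.registerTempTable("mytable")

// The same query expressed as SQL and as DataFrame operations.
val viaSql = sqlContext.sql("SELECT * FROM mytable WHERE key = 1")
val viaApi = df.filter($"key" === 1)

// explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan.
viaSql.explain(true)
viaApi.explain(true)

val sqlRows = viaSql.collect()
val apiRows = viaApi.collect()
sc.stop()
```

Comparing the two printed plans is a quick way to check that a DataFrame expression and its SQL equivalent are being optimized the same way.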
05-04-2016
07:12 PM
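Hi Henry: Since you are requesting 15G for each executor, you may want to increase the Java heap space available to the Spark executors. Note that the executor heap size is set with spark.executor.memory (or --executor-memory on the command line), for example spark.executor.memory=24g. Spark rejects heap settings such as -Xmx when they are passed through spark.executor.extraJavaOptions.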