Member since
09-24-2015
98
Posts
76
Kudos Received
18
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2090 | 08-29-2016 04:42 PM
 | 4276 | 08-09-2016 08:43 PM
 | 1105 | 07-19-2016 04:08 PM
 | 1581 | 07-07-2016 04:05 PM
 | 5523 | 06-29-2016 08:25 PM
09-15-2016
08:35 PM
3 Kudos
Repo Description
Here is a new Zeppelin notebook, part of the Hortonworks Gallery on GitHub, which can be used as a template for analysing web server log files using Spark and Zeppelin. This notebook was ported from an original Jupyter notebook that was part of an edX online course, "Introduction to Apache Spark", sponsored by Databricks. It is written using "pyspark", the Python interpreter for Spark.
You can import this notebook into your own instance of Zeppelin using the "Import Note" button on the home page, then copy the URL below and paste it into the "Add from URL" box. Here is the URL of the actual Zeppelin notebook (note.json) on hortonworks-gallery: https://github.com/hortonworks-gallery/zeppelin-notebooks/blob/master/2BXSE1MV8/note.json Here is the link to view the notebook on Zeppelin Hub: ZeppelinHub Notebook
The source data is an actual HTTP web server log taken from the NASA Apollo website.
Repo Info
Github Repo URL: https://github.com/hortonworks-gallery/zeppelin-notebooks
Github account name: hortonworks-gallery
Repo name: zeppelin-notebooks
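For a flavor of the kind of analysis the notebook performs, here is a minimal pyspark sketch of parsing an Apache-style access log; the regex, path, and field choices are illustrative assumptions, not taken from the notebook itself (in Zeppelin the SparkContext 'sc' already exists, so the explicit creation below would be skipped):
import re
from pyspark import SparkContext

# Hypothetical path; the notebook ships its own NASA log file.
LOG_PATH = "/tmp/access_log"
# Apache Common Log Format, e.g.:
# 127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839
LOG_PATTERN = re.compile(r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)')

def parse_line(line):
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    # host, method, endpoint, HTTP status code
    return (m.group(1), m.group(5), m.group(6), int(m.group(8)))

sc = SparkContext(appName="log-parsing-sketch")
logs = sc.textFile(LOG_PATH).map(parse_line).filter(lambda r: r is not None)
# Count responses by HTTP status code.
status_counts = logs.map(lambda r: (r[3], 1)).reduceByKey(lambda a, b: a + b).collect()
print(status_counts)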
- Find more articles tagged with:
- Data Science & Advanced Analytics
- Spark
- spark-sql
- zeppelin
- zeppelin-notebook
09-13-2016
03:27 PM
2 Kudos
@Kirk Haslbeck Good question, and thanks for the diagrams. Here are some more details to consider.
It is a good point that each JVM-based executor can run multiple "cores" (task threads) in a multi-threaded environment. There are benefits to running multiple cores within a single executor JVM: you take advantage of the node's multi-core processing power while reducing the total JVM overhead, since each JVM has to start up and initialize certain data structures before it can begin running tasks.
From the Spark docs, the number of cores is configured with these parameters:
spark.driver.cores = number of cores to use for the driver process
spark.executor.cores = number of cores to use on each executor
You also want to watch out for this parameter, which can be used to limit the total cores used by Spark across the cluster (i.e., not per worker):
spark.cores.max = the maximum number of CPU cores to request for the application across the whole cluster (not from each machine)
Finally, here is a description from Databricks aligning the terms "cores" and "slots": "Terminology: We're using the term 'slots' here to indicate threads available to perform parallel work for Spark. Spark documentation often refers to these threads as 'cores', which is a confusing term, as the number of slots available on a particular machine does not necessarily have any relationship to the number of physical CPU cores on that machine."
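For a concrete illustration of those settings, here is a hedged pyspark sketch; the numeric values are arbitrary examples, not recommendations:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("core-settings-example")
        .set("spark.driver.cores", "1")     # cores for the driver process (used in cluster deploy mode)
        .set("spark.executor.cores", "2")   # task slots ("cores"/threads) per executor
        .set("spark.cores.max", "8"))       # cap on total cores for this app across the cluster (standalone/Mesos)
sc = SparkContext(conf=conf)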
08-30-2016
03:43 PM
Yep, this worked for me as well. Thanks.
08-29-2016
04:42 PM
1 Kudo
Hello @Rendiyono Wahyu Saputro Yes, you can import Python libraries and use them in Spark, which supports a full Python API via the pyspark shell. For instance, if you wanted to load and use the Python scikit-fuzzy library to run fuzzy logic, you would just:
1) Get the library onto the cluster, either by pip-installing it on each node or by fetching it from GitHub and distributing it to Spark (for a pure-Python package, use --py-files rather than --jars, e.g. $ pyspark --py-files /path/to/scikit-fuzzy.zip)
2) Kick off the job with the pyspark shell
3) Import the library in your code (Example: "import skfuzzy as fuzz")
4) Use the library
More information about the scikit-fuzzy library here: https://pypi.python.org/pypi/scikit-fuzzy
Hints about dependencies and install: Scikit-Fuzzy depends on NumPy >= 1.6, SciPy >= 0.9, and NetworkX >= 1.9, and is available on PyPI. The latest stable release can always be obtained and installed simply by running $ pip install -U scikit-fuzzy
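As a hedged illustration of steps 3 and 4, here is a small pyspark sketch that evaluates a triangular fuzzy membership function on the cluster; it assumes scikit-fuzzy and NumPy are pip-installed on every worker node, and the values are made up:
import numpy as np
import skfuzzy as fuzz
from pyspark import SparkContext

sc = SparkContext(appName="skfuzzy-example")
universe = np.arange(0, 11, 0.1)               # crisp universe of discourse
membership = fuzz.trimf(universe, [0, 5, 10])  # triangular membership function
values = sc.parallelize([1.0, 4.5, 7.2, 9.9])
# Interpolate the membership degree of each crisp value in parallel.
degrees = values.map(lambda v: float(fuzz.interp_membership(universe, membership, v))).collect()
print(degrees)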
08-23-2016
07:53 PM
8 Kudos
First, you should go to the Apache Spark downloads web page to download Spark 2.0. Link to the Spark downloads page: http://spark.apache.org/downloads.html
Set your download options (shown in the image below), and click on the link next to "Download Spark" (i.e. "spark-2.0.0-bin-hadoop2.7.tgz"). This will download the gzipped tarball to your computer.
Next, start up the HDP 2.5 Sandbox image within your virtual machine (either VirtualBox or VMware Fusion). Once the image is booted, start a Terminal session on your laptop and copy the tarball to the VM. Here is an example using the 'scp' (secure copy) command, although you can use any file copy mechanism (note the capital -P for the port):
scp -P 2222 spark-2.0.0-bin-hadoop2.7.tgz root@127.0.0.1:~
This will copy the file to the 'root' user's home directory on the VM. Next, log in (via ssh) to the VM:
ssh -p 2222 root@127.0.0.1
Once logged in, unzip the tarball with this command:
tar -xvzf spark-2.0.0-bin-hadoop2.7.tgz
You can now navigate to the "seed" directory already created for Spark 2.0, and move the contents of the unzipped tar file into the current directory:
cd /usr/hdp/current/spark2-client
mv ~/spark-2.0.0-bin-hadoop2.7/* .
Next, change the ownership of the new files to match the local directory:
chown -R root:root *
Now, set up the SPARK_HOME environment variable for this session (or permanently, by adding it to ~/.bash_profile):
export SPARK_HOME=/usr/hdp/current/spark2-client
Let's create the config files in the "conf" directory so that we can edit them to configure Spark:
cd conf
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
Edit the config files with a text editor (like vi or vim) and make sure the environment variables and parameters below are set. Add the following lines to the file 'spark-env.sh' and then save the file:
HADOOP_CONF_DIR=/etc/hadoop/conf
SPARK_EXECUTOR_INSTANCES=2
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_MEMORY=512M
SPARK_DRIVER_MEMORY=512M
Now, replace the lines in the "spark-defaults.conf" file to match this content, and then save the file:
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native
spark.driver.extraJavaOptions -Dhdp.version=2.5.0.0-817
spark.yarn.am.extraJavaOptions -Dhdp.version=2.5.0.0-817
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
# Required: setting this parameter to 'false' turns off ATS timeline server for Spark
spark.hadoop.yarn.timeline-service.enabled false
#spark.history.fs.logDirectory hdfs:///spark-history
#spark.history.kerberos.keytab none
#spark.history.kerberos.principal none
#spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
#spark.history.ui.port 18080
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 200
spark.yarn.executor.memoryOverhead 200
#spark.yarn.historyServer.address sandbox.hortonworks.com:18080
spark.yarn.max.executor.failures 3
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3
spark.ui.port 4041
Now that your config files are set up, change directory back to your $SPARK_HOME:
cd /usr/hdp/current/spark2-client
Before running a Spark application, you need to change two YARN settings so that YARN can allocate enough memory to run the jobs on the Sandbox. To change the YARN settings, log in to the Ambari console (http://127.0.0.1:8080/) and click on the "YARN" service along the left-hand side of the screen. Once the YARN Summary page loads, find the "Config" tab along the top and click on it. Scroll down until you see the "Settings" section (not "Advanced"), and change the settings described below. Note: use the Edit/pencil icon to set each parameter to these exact values:
1) Memory Node (memory allocated for all YARN containers on a node) = 7800MB
2) Container (maximum container size, memory) = 2500MB
Alternately, if you click the "Advanced" tab next to Settings, here are the exact config parameter names you want to edit:
yarn.scheduler.maximum-allocation-mb = 2500MB
yarn.nodemanager.resource.memory-mb = 7800MB
After editing these parameters, click on the green "Save" button above the settings in Ambari. You will now need to restart all affected services (note: a yellow "Restart" icon should appear once the config settings are saved by Ambari; click on that button and select "Restart all affected services"). It may be faster to navigate to the Hosts page via the Hosts tab, click on the single host, and look for the "Restart" button there. Make sure that YARN is restarted successfully. Below is an image showing the new YARN settings.
Finally, you are ready to run the packaged SparkPi example using Spark 2.0.
In order to run SparkPi on YARN (yarn-client mode), run the command below, which switches user to "spark" and uses spark-submit to launch the precompiled SparkPi example program: su spark --command "bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-memory 2g --executor-memory 2g --executor-cores 1 examples/jars/spark-examples*.jar 10"
You should see many lines of debug/stderr output, followed by a result line similar to this:
Pi is roughly 3.144799144799145
Note: To run the SparkPi example locally, without the use of YARN, you can run this command:
./bin/run-example SparkPi
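As an optional, hedged smoke test of the new install, you can also open the Spark 2.0 Python shell from /usr/hdp/current/spark2-client with bin/pyspark and run something like the following (in 2.0 the shell pre-creates a SparkSession named 'spark'):
>>> spark.version                                            # expect '2.0.0'
>>> spark.range(1000).selectExpr("sum(id) AS total").show()  # tiny DataFrame job; expect total = 499500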
- Find more articles tagged with:
- Data Science & Advanced Analytics
- FAQ
- hdp-2.5
- Sandbox
- Spark
08-16-2016
11:00 AM
3 Kudos
Apache Zeppelin (version 0.6.0) includes the ability to securely authenticate users and require logins. It uses the Apache Shiro security framework to accomplish this. Note: prior versions of Zeppelin did not force users to log in.
After launching the HDP 2.5 Tech Preview Sandbox on a virtual machine, make sure the Zeppelin service is up and running via Ambari. Next, open the Zeppelin UI either by clicking on: Services (tab) -> Zeppelin notebook (left-hand panel) -> Quick Links (tab) -> "Zeppelin UI" (button), or just by opening a browser at http://sandbox.hortonworks.com:9995/ (or http://127.0.0.1:9995/).
The Zeppelin welcome page should appear in the browser, and you should notice a "Login" button in the upper right-hand corner. Clicking it brings up a pop-up window with text entries for username and password. Enter one of the username/password pairs below (these are the defaults listed in the "shiro.ini" file located in the "conf" sub-directory of Zeppelin). Username/Password pairs:
admin/password1
user1/password2
user2/password3
user3/password4
If you want to change these passwords or add more users, you can use the "Credentials" tab of the Zeppelin notebook to create additional usernames. After entering the credentials, you will be logged in and the existing notebooks will display on the left-hand side of the Zeppelin screen. If you enter the wrong username or password, you will be directed back to the Welcome page. FYI: For more information about Zeppelin security, see this link: https://github.com/apache/zeppelin/blob/master/SECURITY-README.md FYI: For more detailed information about Apache Shiro configuration options, see this link: http://shiro.apache.org/configuration.html#Configuration-INISections
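For reference, the default username/password pairs above correspond to a [users] block in conf/shiro.ini roughly like the following (standard Shiro INI format: username = password, optional roles; the exact entries and roles shipped in the Sandbox's file may differ):
[users]
admin = password1
user1 = password2
user2 = password3
user3 = password4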
- Find more articles tagged with:
- authentication
- Data Science & Advanced Analytics
- How-ToTutorial
- Security
- Spark
- zeppelin
08-09-2016
09:22 PM
4 Kudos
Just a few months ago, the Apache Storm project announced release 1.0 of the distribution. The bullet points below summarize the new features available. For more detailed descriptions, you can go to this link to read the full release notes: http://storm.apache.org/2016/04/12/storm100-released.html
Apache Storm 1.0 Release: Apache Storm 1.0 is "up to 16 times faster than previous versions, with latency reduced up to 60%."
- Pacemaker (Heartbeat Server): Pacemaker is an optional Storm daemon designed to process heartbeats from workers (overcomes the scaling problems of ZooKeeper).
- Distributed Cache API: Files in the distributed cache can be updated at any time from the command line, without the need to redeploy a topology.
- HA Nimbus: Multiple instances of the Nimbus service run in a cluster and perform leader election when a Nimbus node fails.
- Native Streaming Window API: Storm has support for sliding and tumbling windows based on time duration and/or event count.
- Automatic Backpressure: Storm now has an automatic backpressure mechanism based on configurable high/low watermarks expressed as a percentage of a task's buffer size.
- Resource Aware Scheduler: The new resource aware scheduler (AKA "RAS Scheduler") allows users to specify the memory and CPU requirements for individual topology components.
- Easier debugging: Dynamic Log Levels, Tuple Sampling and Debugging, and Dynamic Worker Profiling.
- Find more articles tagged with:
- Data Ingestion & Streaming
- FAQ
- realtime
- Storm
- stream-processing
- streaming
08-09-2016
08:43 PM
First, you should take advantage of splittable formats if your data is stored in one (bzip2, indexed LZO, etc.; note that a plain Snappy-compressed file is not splittable unless it is inside a container format such as SequenceFile or ORC). If so, instruct Spark to split the data into multiple partitions upon read. In Scala, you can do this:
val file = sc.textFile(path, numPartitions)
You will also need to tune your YARN container sizes to work with your executor allocation. Make sure your maximum YARN allocation ('yarn.scheduler.maximum-allocation-mb') is bigger than what you are asking for per executor (including the default overhead of 384 MB). The following parameters are used to allocate Spark executors and driver memory:
spark.executor.instances -- number of Spark executors
spark.executor.memory -- memory per Spark executor (plus 384 MB overhead)
spark.driver.memory -- memory for the Spark driver
A 6 MB file is pretty small, much smaller than the HDFS block size, so you are probably getting a single partition until you do something to repartition it. Beyond setting the numPartitions parameter on read as shown above, I would probably call one of these repartition methods on your DataFrame:
def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions. The resulting DataFrame is hash partitioned.
OR this:
def repartition(numPartitions: Int): DataFrame
Returns a new DataFrame that has exactly numPartitions partitions.
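As a hedged pyspark equivalent (assuming the interactive shell, where sc and sqlContext already exist; the paths and partition counts are illustrative only):
rdd = sc.textFile("/data/small_input.txt", 16)        # minPartitions hint at read time
df = sqlContext.read.json("/data/small_input.json")   # a 6 MB file typically lands in a single partition
df16 = df.repartition(16)                             # hash-partition into 16 partitions
print(df16.rdd.getNumPartitions())                    # expect 16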
08-09-2016
08:28 PM
The issue is that the input data files to Spark are very small, about 6 MB (<100,000 records). However, the required processing/calculations are heavy and would benefit from running on multiple executors. Currently, all processing runs on a single executor even when specifying multiple executors to spark-submit.
Labels:
- Apache Spark
07-21-2016
04:55 PM
1 Kudo
@Rene Rene The documentation about dynamic resource allocation says the following (emphasis mine):
There are two requirements for using this feature. First, your application must set spark.dynamicAllocation.enabled to true. Second, you must set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application. The purpose of the external shuffle service is to allow executors to be removed without deleting shuffle files written by them (more detail described below). The way to set up this service varies across cluster managers:
- In standalone mode, simply start your workers with spark.shuffle.service.enabled set to true.
- In Mesos coarse-grained mode, run $SPARK_HOME/sbin/start-mesos-shuffle-service.sh on all slave nodes with spark.shuffle.service.enabled set to true. For instance, you may do so through Marathon.
- In YARN mode, start the shuffle service on each NodeManager as follows:
  1) Build Spark with the YARN profile. Skip this step if you are using a pre-packaged distribution.
  2) Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/network/yarn/target/scala-<version> if you are building Spark yourself, and under lib if you are using a distribution.
  3) Add this jar to the classpath of all NodeManagers in your cluster.
  4) In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService.
  5) Restart all NodeManagers in your cluster.
All other relevant configurations are optional and live under the spark.dynamicAllocation.* and spark.shuffle.service.* namespaces. For more detail, see the configuration page.
Reference link: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
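On the application side, here is a hedged pyspark sketch of the two required settings (the YARN shuffle service must already be configured as described above; the executor bounds are illustrative):
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("dynamic-allocation-example")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "10"))
sc = SparkContext(conf=conf)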
07-20-2016
10:43 PM
Since you are using Jupyter with Spark, you might consider looking at Livy. Livy is an open source REST server for Spark. When you execute a code cell in a PySpark notebook, it creates a Livy session to execute your code. Livy allows multiple users to share the same Spark server through "impersonation support". This should hopefully allow you to access objects using your logged in username. The link below documents the REST commands you can use (for instance, you can use the %%info magic to display the current Livy session information): https://github.com/cloudera/livy/tree/6fe1e80cfc72327c28107e0de20c818c1f13e027#post-sessions
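As a hedged sketch of that REST interaction (the Livy host/port and proxy username below are assumptions; the endpoints are documented in the link above):
import json
import requests

livy = "http://livy-server:8998"                              # assumed Livy endpoint
payload = {"kind": "pyspark", "proxyUser": "your_username"}   # impersonate the logged-in user
r = requests.post(livy + "/sessions", data=json.dumps(payload),
                  headers={"Content-Type": "application/json"})
print(r.status_code, r.json())                                # returns the new session id and its state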
07-19-2016
04:08 PM
1 Kudo
Spark has a GraphX component library (soon to be upgraded to GraphFrames) which can be used to model graph type relationships. These relationships are modeled by combining a vertex table (vertices) with an edge table (edges). Read here for more info: http://spark.apache.org/docs/latest/graphx-programming-guide.html#example-property-graph
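Here is a minimal, hedged pyspark sketch using GraphFrames, assuming the graphframes package is available (e.g. launched with --packages); the vertex and edge data are made up:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from graphframes import GraphFrame

sc = SparkContext(appName="graphframes-example")
sqlContext = SQLContext(sc)
vertices = sqlContext.createDataFrame(
    [("1", "Alice"), ("2", "Bob"), ("3", "Carol")], ["id", "name"])
edges = sqlContext.createDataFrame(
    [("1", "2", "follows"), ("3", "1", "follows")], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)   # vertex table + edge table = property graph
g.inDegrees.show()                # in-degree per vertex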
07-12-2016
08:36 PM
Please note that there are also convenience functions provided in pyspark.sql.functions, such as dayofmonth: pyspark.sql.functions.dayofmonth(col) extracts the day of the month of a given date as an integer. Example:
>>> from pyspark.sql.functions import dayofmonth
>>> df = sqlContext.createDataFrame([('2015-04-08',)], ['a'])
>>> df.select(dayofmonth('a').alias('day')).collect()
[Row(day=8)]
07-11-2016
06:35 PM
@xrcs blue Looks like you are using the Spark Python API. The pyspark documentation says for join:
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
So: do the join columns exist in both tables? Also, I wonder if you can encode the "condition" separately and then pass it to the join() method, like this:
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer')
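If the join columns do exist on both sides, the simpler string-list form of on should also perform an equi-join, for example (column names here are assumed from your snippet):
>>> df.join(df3, ['name', 'age'], 'outer')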
07-07-2016
04:05 PM
@R Pul Yes, that is a common problem. The first thing I would try, at the Spark configuration level, is enabling Dynamic Resource Allocation. Here is a description (from the link below):
"Spark 1.2 introduces the ability to dynamically scale the set of cluster resources allocated to your application up and down based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster. If a subset of the resources allocated to an application becomes idle, it can be returned to the cluster’s pool of resources and acquired by other applications. In Spark, dynamic resource allocation is performed on the granularity of the executor and can be enabled through spark.dynamicAllocation.enabled ." And in particular, the Remove Policy: The policy for removing executors is much simpler. A Spark application removes an executor when it has been idle for more than spark.dynamicAllocation.executorIdleTimeout seconds. Web page:
https://spark.apache.org/docs/1.2.0/job-scheduling.html
Also, check out the paragraph entitled "Graceful Decommission of Executors" for more information.
07-01-2016
02:27 PM
@Zach Kirsch The problem is more likely a mismatch between Spark's request for RAM (driver memory + executor memory) and YARN's container sizing configuration. YARN settings determine min/max container sizes and should be based on available physical memory, number of nodes, etc. As a rule of thumb, try making the minimum YARN container size 1.5 times the requested driver/executor memory (in this case, 1.5 GB).
06-29-2016
08:25 PM
1 Kudo
The methods you mention will not alter sort order for a join operation, since data is always shuffled for a join. For ways to enforce sort order, you can read this post on HCC: https://community.hortonworks.com/questions/42464/spark-dataframes-how-can-i-change-the-order-of-col.html
To answer your questions about coalesce() and repartition(): both are used to modify the number of partitions stored by the RDD. The repartition() method can increase or decrease the number of partitions, and allows shuffles across nodes, meaning data stored on one node can be moved to another; this makes it expensive for large RDDs. The coalesce() method can only decrease the number of partitions and does not perform a full shuffle; this makes it more efficient than repartition(), but it may result in asymmetric partitions since no data is moved across nodes.
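A small, hedged pyspark illustration (assuming the interactive shell where sc exists; the counts are arbitrary):
rdd = sc.parallelize(range(1000), 8)
more = rdd.repartition(16)     # full shuffle; can increase or decrease the partition count
fewer = rdd.coalesce(4)        # no full shuffle; can only decrease the partition count
print(rdd.getNumPartitions(), more.getNumPartitions(), fewer.getNumPartitions())   # 8 16 4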
06-28-2016
09:46 PM
You can designate either way by setting --master and --deploy-mode arguments correctly. By designating --master=yarn, the Spark executors will be run on the cluster; --master=local[*] will place the executors on the local machine. The Spark driver location will then be determined by one of these modes: --deploy-mode=cluster runs driver on cluster, --deploy-mode=client runs driver on client (VM where it is launched). More info here: http://spark.apache.org/docs/latest/submitting-applications.html
06-28-2016
07:03 PM
You probably need to install the spark-client on your VM, which will include all the proper jar files and binaries to connect to YARN. There is also a chance that the version of Spark used by Titan DB was built specifically without YARN dependencies (to avoid duplicates). You can always rebuild your local Spark installation with YARN dependencies, using the instructions here:
http://spark.apache.org/docs/latest/building-spark.html
For instance, here is a sample build command using maven:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
06-27-2016
07:37 PM
The referenced JIRA above is now resolved. I have successfully tested the new version of the Hive ODBC Driver on Mac OS X version 10.11 (El Capitan). However, please note that you must install the new Hive ODBC driver version 2.1.2, as shown in the iODBC Administration tool. Please also note that the location of the driver file has changed. Here is the new odbcinst.ini file (stored in ~/.odbcinst.ini), showing the old driver location commented out and the new driver location below it:
[ODBC Drivers]
Hortonworks Hive ODBC Driver=Installed
[Hortonworks Hive ODBC Driver]
Description=Hortonworks Hive ODBC Driver
; old driver location
; Driver=/usr/lib/hive/lib/native/universal/libhortonworkshiveodbc.dylib
; new driver location below
Driver=/opt/hortonworks/hiveodbc/lib/universal/libhortonworkshiveodbc.dylib
06-27-2016
05:07 PM
@Sri Bandaru Okay, so now I'm wondering if you should include the Spark assembly jar; that is where the referenced class lives. Can you try adding this to your command line (assuming your current directory is the spark-client directory, or $SPARK_HOME for your installation)?
--jars lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar
Note: If running on HDP, you can use the soft link to this file named "spark-hdp-assembly.jar".
06-27-2016
04:51 PM
@alain TSAFACK
I think you need the --files option to pass the Python script to all executor instances. So, for example:
./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    --files return.py \
    my-main-jar.jar \
    app_arg1 app_arg2
06-24-2016
09:28 PM
I was able to run your example on the Hortonworks 2.4 Sandbox (a slightly newer version than your 2.3.2). However, it appears you have drastically increased the memory requirements between your two examples. You allocate only 512m to the driver and executor in "yarn-client" mode, but 4g and 2g in the second example; with 3 executors requested, you will need over 10 GB of RAM. Here is the command I actually ran to replicate the "cluster" deploy mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --num-executors 1 --driver-memory 1024m --executor-memory 1024m --executor-cores 1 lib/spark-examples*.jar 10
... and here is the result in the Yarn application logs:
Log Type: stdout
Log Upload Time: Fri Jun 24 21:19:42 +0000 2016
Log Length: 23
Pi is roughly 3.142752
Therefore, it is possible your job was never scheduled to run because it requested too many resources. Please check in the ResourceManager UI that it was not stuck in the 'ACCEPTED' state.
06-23-2016
06:31 PM
Agreed, you should at least upgrade the lower HDP version (...2.3.0...) to the newer HDP version (2.3.4.0-3485). It is best to get the default Spark version from the HDP install. Please see Table 1.1 at this link which describes the version associations for HDP, Ambari, and Spark: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_spark-guide/content/ch_introduction-spark.html
06-16-2016
06:48 PM
3 Kudos
Spark includes some Jackson libraries as its own dependencies, including this one: <fasterxml.jackson.version>2.6.5</fasterxml.jackson.version> Therefore, if your additional third-party library also includes Jackson at a different version, the classloader will run into conflicts. You can use the Maven Shade plugin to "relocate" the classes in the third-party jar, as described here: https://maven.apache.org/plugins/maven-shade-plugin/examples/class-relocation.html Here is an example of relocating the "com.fasterxml.jackson" packages: http://stackoverflow.com/questions/34764732/relocating-fastxml-jackson-classes-to-my-package-fastxml-jackson
06-06-2016
07:15 PM
@Timothy Spann Be aware that Henning's post, while architecturally sound, relies on the "Hive Streaming API", which implies reliance on Hive transaction support. Current advice is not to rely on transactions, at least until the Hive LLAP Tech Preview comes out at the end of June 2016.
05-27-2016
04:04 PM
@Sean Glover The Apache Spark download allows you to build Spark in multiple ways, using various build flags to include/exclude components: http://spark.apache.org/docs/latest/building-spark.html Without Hive, you can still create a SQLContext, but it will be native to Spark and not a HiveContext. Without a HiveContext, you cannot reference the Hive Metastore, use Hive UDFs, etc. Other tools like the Zeppelin data science notebook also default to creating a HiveContext (configurable), so they will need the Hive dependencies.
05-25-2016
01:45 PM
1 Kudo
Actually, if you don't specify local mode (--master "local"), then you will be running in standalone mode, described here:
Standalone mode: By default, applications submitted to the standalone-mode cluster will run in FIFO (first-in-first-out) order, and each application will try to use all available nodes. You can limit the number of nodes an application uses by setting the spark.cores.max configuration property in it, or change the default for applications that don't set this setting through spark.deploy.defaultCores. Finally, in addition to controlling cores, each application's spark.executor.memory setting controls its memory use.
Also, I think you have the port wrong for the monitoring web interface; try using port 4040 instead of 8080, like this: http://<driver-node>:4040
05-24-2016
04:43 PM
If you are running on YARN (--master yarn; previously, master set to "yarn-client" or "yarn-cluster"), then you can discover the state of the Spark job by bringing up the YARN ResourceManager UI. In Ambari, select the YARN service from the left-hand panel, choose "Quick Links", and click on "ResourceManager UI". It will open a web page on port 8088. Here is an example (click on 'Applications' in the left panel to see all states):
05-23-2016
06:27 PM
FYI: Here is the quickest way to discover if you have access to your Hive "default" database tables:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tables = sqlContext.sql("show tables")
tables: org.apache.spark.sql.DataFrame = [tableName: string, isTemporary: boolean]
tables.show()
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|sample_07|      false|
|sample_08|      false|
+---------+-----------+