Member since: 09-24-2015
Posts: 98
Kudos Received: 76
Solutions: 18
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1077 | 08-29-2016 04:42 PM
 | 1849 | 08-09-2016 08:43 PM
 | 520 | 07-19-2016 04:08 PM
 | 772 | 07-07-2016 04:05 PM
 | 2681 | 06-29-2016 08:25 PM
11-12-2019
06:16 AM
Hi, could you please let us know what you mean by "not yet deployed"? Do you mean the jobs that have not been kicked off after you run the spark-submit command? Or could you please explain in more detail. Thanks, Akr
09-15-2016
08:35 PM
3 Kudos
Repo Description Here is a new Zeppelin notebook, part of the Hortonworks Gallery on GitHub, which can be used as a template for analysing web server log files using Spark and Zeppelin. This notebook was ported from an original Jupyter notebook that was part of an edX online course, "Introduction to Apache Spark", sponsored by Databricks. It is written using "pyspark", the Python interpreter for Spark. You can import this notebook into your own instance of Zeppelin using the "Import Note" button on the home page; then copy the URL below and paste it into the "Add from URL" box. Here is the URL link to the actual Zeppelin notebook (note.json) on hortonworks-gallery: https://github.com/hortonworks-gallery/zeppelin-notebooks/blob/master/2BXSE1MV8/note.json Here is the link to view the notebook on Zeppelin Hub: ZeppelinHub Notebook The source data is an actual HTTP web server log taken from the NASA Apollo website.
Repo Info
Github Repo URL: https://github.com/hortonworks-gallery/zeppelin-notebooks
Github account name: hortonworks-gallery
Repo name: zeppelin-notebooks
- Find more articles tagged with:
- Data Science & Advanced Analytics
- Spark
- spark-sql
- zeppelin
- zeppelin-notebook
09-13-2016
08:25 PM
1 Kudo
@Kirk Haslbeck Michael is correct; you will get 5 total executors.
01-31-2018
09:20 AM
How do I set the JAVA_HOME path?
08-16-2016
11:00 AM
3 Kudos
Apache Zeppelin (version 0.6.0) includes the ability to securely authenticate users and require logins. It uses the Apache Shiro security framework to accomplish this objective. Note: prior versions of Zeppelin did not force users to login. After launching the HDP 2.5 Tech Preview Sandbox on a virtual machine, make sure the Zeppelin service is up and running via Ambari. Next, open the Zeppelin UI either by clicking on: Services (tab) -> Zeppelin notebook (left-hand panel) -> Quick Links (tab) -> "Zeppelin UI" (button) or just by opening a browser at: http://sandbox.hortonworks.com:9995/ (or http://127.0.0.1:9995/) The Zeppelin welcome page should show in the browser, and you should notice a "Login" button in the upper right-hand corner. This will bring up a pop-up window with text entries for username and password. Enter one of the username/password pairs below (these are the defaults listed in the "shiro.ini" file located in the "conf" sub-directory of zeppelin): Username/Password pairs:
admin/password1
user1/password2
user2/password3
user3/password4
If you want to change these passwords or add more users, you can use the "Credentials" tab of the Zeppelin notebook to create additional usernames. After entering the credentials, you will be logged in and the existing notebooks will display on the left-hand side of the Zeppelin screen. If you enter the wrong username or password, you will be directed back to the Welcome page. FYI: For more information about Zeppelin security, see this link: https://github.com/apache/zeppelin/blob/master/SECURITY-README.md FYI: For more detailed information about Apache Shiro configuration options, see this link: http://shiro.apache.org/configuration.html#Configuration-INISections
- Find more articles tagged with:
- authentication
- Data Science & Advanced Analytics
- How-ToTutorial
- Security
- Spark
- zeppelin
08-09-2016
09:22 PM
4 Kudos
Just a few months ago, the Apache Storm project announced release 1.0 of the distribution. The bullet points below summarize the new features available. For more detailed descriptions, you can go to this link to read the full release notes: http://storm.apache.org/2016/04/12/storm100-released.html
Apache Storm 1.0 Release: Apache Storm 1.0 is "up to 16 times faster than previous versions, with latency reduced up to 60%."
- Pacemaker (Heartbeat Server): Pacemaker is an optional Storm daemon designed to process heartbeats from workers (it overcomes the scaling problems of ZooKeeper).
- Distributed Cache API: Files in the distributed cache can be updated at any time from the command line, without the need to redeploy a topology.
- HA Nimbus: Multiple instances of the Nimbus service run in a cluster and perform leader election when a Nimbus node fails.
- Native Streaming Window API: Storm has support for sliding and tumbling windows based on time duration and/or event count.
- Automatic Backpressure: Storm now has an automatic backpressure mechanism based on configurable high/low watermarks expressed as a percentage of a task's buffer size.
- Resource Aware Scheduler: The new resource-aware scheduler (AKA "RAS Scheduler") allows users to specify the memory and CPU requirements for individual topology components.
Storm also makes it easier to debug, with: Dynamic Log Levels, Tuple Sampling and Debugging, and Dynamic Worker Profiling.
- Find more articles tagged with:
- Data Ingestion & Streaming
- FAQ
- realtime
- Storm
- stream-processing
- streaming
08-09-2016
08:43 PM
First, check whether your data is stored in a splittable format (snappy, LZO, bzip2, etc.). If so, instruct Spark to split the data into multiple partitions when it reads the file. In Scala, you can do this: val file = sc.textFile(path, numPartitions) You will also need to tune your YARN container sizes to work with your executor allocation. Make sure your maximum YARN memory allocation ('yarn.scheduler.maximum-allocation-mb') is bigger than what you are asking for per executor (this includes the default overhead of 384 MB). The following parameters are used to allocate Spark executors and driver memory: spark.executor.instances -- number of Spark executors
spark.executor.memory -- memory per Spark executor (plus 384 MB overhead)
spark.driver.memory -- memory for the Spark driver
A 6 MB file is pretty small, much smaller than the HDFS block size, so you are probably getting a single partition until you do something to repartition it. You can also set the numPartitions parameter on read, as shown above. I would probably call one of these repartition methods on your DataFrame: def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
Returns a new DataFrame partitioned by the given partitioning expressions into numPartitions. The resulting DataFrame is hash partitioned.
OR this: def repartition(numPartitions: Int): DataFrame
Returns a new DataFrame that has exactly numPartitions partitions.
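To make that concrete, here is a minimal sketch; the file paths and partition counts below are assumptions for illustration, not values taken from your job:
// Hedged sketch only; paths and partition counts are made up
val raw = sc.textFile("hdfs:///data/input.txt", 30)        // ask for ~30 partitions at read time
println(raw.partitions.length)                             // verify how many partitions you actually got
val df = sqlContext.read.json("hdfs:///data/input.json")   // hypothetical DataFrame source
val spread = df.repartition(30)                            // hash-partition into exactly 30 partitions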
07-28-2017
08:19 AM
We ran into the same scenario, where Zeppelin was always launching 3 containers in YARN even after the dynamic allocation parameters were enabled in Spark, because Zeppelin was not picking those parameters up.
To get Zeppelin to launch more than the 3 containers it launches by default, we need to configure the following in the Zeppelin Spark interpreter: spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.initialExecutors=0
spark.dynamicAllocation.minExecutors=2 --> Start this with a low value; otherwise it will launch the specified minimum number of containers, use only the containers (memory and vCores) it actually needs, and mark the rest of the memory and vCores as reserved, which causes memory issues.
spark.dynamicAllocation.maxExecutors=10
It is also generally better to start with less executor memory (e.g. 10-15 GB) and more executors (20-30). In our scenario we observed that with 50-100 GB of executor memory and 5-10 executors the query took 3 min 48 sec (228 sec), which makes sense because the parallelism is very low; reducing the executor memory to 10-15 GB and increasing the executors to 25-30, the same query took only 54 sec. Please note that the number of executors and the executor memory are use-case dependent, and we ran a few trials before finding the optimal settings for our scenario.
07-07-2016
03:51 PM
So far this is the best answer I can find, but it doesn't fully satisfy my requirement. df.select($"tradeId", $"assetClass", $"transType", $"price", $"strikePrice", $"contractType", $"stockAttributes.*", $"account.*").printSchema The schema is now flat with minimal coding, but it did require statically typing each field; a sketch of a more generic, schema-driven approach appears after the printed schema below. root
|-- tradeId: string (nullable = true)
|-- assetClass: string (nullable = true)
|-- transType: string (nullable = true)
|-- price: string (nullable = true)
|-- strikePrice: string (nullable = true)
|-- contractType: string (nullable = true)
|-- 52weekHi: string (nullable = true)
|-- 5avg: string (nullable = true)
|-- accountType: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- zip: string (nullable = true)
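For reference, here is a hedged sketch of how the same flattening could be driven off the DataFrame schema instead of typing each field by hand. This is my own illustration, not part of the answer above; it assumes every struct column should simply be expanded with ".*":
// Sketch: expand every struct column with ".*" and keep the other columns as-is
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
val flatCols: Array[Column] = df.schema.fields.map { f =>
  f.dataType match {
    case _: StructType => col(f.name + ".*")   // e.g. stockAttributes.* , account.*
    case _             => col(f.name)
  }
}
df.select(flatCols: _*).printSchema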
07-02-2016
08:24 AM
This is going to be one of two issues: disk I/O or executors. With count() you will not be doing any swapping. Set the number of partitions to (cores * nodes) - 2. Let's assume 8 cores per node; then that's 30 for you, i.e. val rdd = sc.textFile("some file", 30) That being said, I don't see how a shuffle is going to help a simple count: it executes on each partition without a shuffle and returns the result to the driver. You can run a test by changing count() to saveAsTextFile(), but I suspect you are bound by disk I/O. Are you in a cloud environment? Try a reduceByKey() followed by count(). If your processing time is still about the same, that further points to disk I/O.
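A minimal sketch of that reduceByKey-then-count test, assuming an RDD of text lines named rdd; the key below is made up purely to force a shuffle:
// Hedged sketch: bucket lines by an arbitrary key so reduceByKey forces a shuffle
val pairs = rdd.map(line => (line.length % 100, 1L))
val keyCount = pairs.reduceByKey(_ + _).count()   // shuffle happens here, then the distinct keys are counted
println(keyCount)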
06-18-2016
05:30 AM
When Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is set by the Hadoop InputFormat used to read the file. For instance, if you use textFile() it would be TextInputFormat in Hadoop, which would return a single partition for a single HDFS block (but the split between partitions would be done on line boundaries, not at the exact block boundary), unless you have a compressed text file. In the case of a compressed file you would get a single partition for a single file (as compressed text files are not splittable). The actual partition size is defined by FileInputFormat.computeSplitSize using the formula below (a worked example with assumed values follows the parameter list): return Math.max(minSize, Math.min(goalSize, blockSize))
where,
minSize is the hadoop parameter mapreduce.input.fileinputformat.split.minsize
blockSize is the value of the dfs.block.size in cluster mode and fs.local.block.size in the local mode
goalSize=totalInputSize/numPartitions
where,
totalInputSize is the total size in bytes of all the files in the input path.
numPartitions is the custom parameter provided to the method sc.textFile(inputPath, numPartitions)
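Here is a worked example of that formula with assumed values (128 MB block size, 1 MB minimum split size, 10 GB of input requested as 100 partitions), just to show how the pieces interact:
// All values below are assumptions for illustration
val blockSize = 128L * 1024 * 1024                    // dfs.block.size
val minSize   = 1L * 1024 * 1024                      // mapreduce.input.fileinputformat.split.minsize
val goalSize  = (10L * 1024 * 1024 * 1024) / 100      // totalInputSize / numPartitions, about 102 MiB
val splitSize = math.max(minSize, math.min(goalSize, blockSize))
// goalSize is below blockSize here, so each split is about 102 MiB and you end up with roughly 100 partitions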
05-19-2016
08:05 AM
Very big thank you!
05-06-2016
07:06 PM
Thanks for your help. And do you know what the DAG visualization of the jobs executed after we run a query actually represents? Does that visualization show the physical plan or the logical plan?
04-18-2016
05:33 PM
Thanks @Bernhard Walter - yes, that's exactly what I did.
03-31-2016
08:08 AM
I'm running HDP 2.4. I restarted Spark and it appears in the list of my notebook's interpreters, but the problem still exists!
03-24-2016
08:26 PM
From the error message in the stack trace, it looks like you may have mistyped the spark-submit command line. The main class definition is provided by the --class <main-class> parameter, as shown in this syntax definition: ./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments] If you have put the string "Dhdp.version=2.3.4.1-10" on the command line, then it could lead to the error. The other possibility is that you have entered this string into the "spark-env.sh" file within the $SPARK_HOME/conf directory. Double-check this file and look for any parameter ending in "OPTS", such as "SPARK_DAEMON_JAVA_OPTS". This could be adding something wrong to the spark-submit argument list and lead to the error.
03-18-2016
12:36 PM
1 Kudo
Pay attention to the format of the shell action arguments; it should be like <exec>java</exec>
<argument>-classpath</argument>
<argument>$CLASSPATH</argument>
<argument>Hello</argument> instead of a single command. Also be aware that the shell command is executed on an arbitrary node of the cluster, so all the tools you're using have to be preinstalled on all the nodes. That's not an issue in your case for now, since you're using a single-node sandbox, but it could be a problem in production. Regards
11-02-2017
03:15 PM
This problem was noted while running HDP 2.5. Apparently it was fixed in version 2.6; however, this also required updating Oracle VM VirtualBox to version 5.1.30. The problem has not reappeared.
03-31-2017
12:21 PM
Hi, I am planning to create an Ambari Hadoop Storm cluster, and as this is brand new to me I have some doubts about the best way to set it up. Here is what I have for resources: - Platform: AWS (8 EC2 instances: 1 master, 4 slaves, 3 workers (zookeepers)) - Tools: As I want to automate the setup, I will use Terraform, Ansible, and a Blueprint to set up the whole environment. I have done a bit of research and drawn some conclusions, and I need some advice/opinion on whether this is a good path. Thanks
MASTER
SLAVE
ZOO
NAMENODE
SECONDARY_NAMENODE
DATANODE
NIMBUS
RESOURCE_MANAGER
NODEMANAGER
DRPC_SERVER
SUPERVISOR
ZOOKEEPER_SERVER
STORM_UI_SERVER
ZOOKEEPER_CLIENT
METRICS_MONITOR
ZOOKEEPER_CLIENT
METRICS_MONITOR
MAPREDUCE2_CLIENT
HDFS_CLIENT
HDFS_CLIENT
HDFS_CLIENT
PIG
PIG
PIG
TEZ_CLIENT
TEZ_CLIENT
TEZ_CLIENT
YARN_CLIENT
YARN_CLIENT
YARN_CLIENT
METRICS_COLLECTOR
HISTORY_SERVER
METRICS_GRAFANA
MAPREDUCE2_CLIENT
APP_TIMELINE_SERVER
HIVE_SERVER
HCAT
HIVE_METASTORE
WEBHCAT_SERVER
MYSQL_SERVER
HIVE_CLIENT
01-07-2016
01:37 PM
Performance really isn't slow when executing the query. This is interesting. I figured that, because the query had utilized the CBO in the tutorial I linked in the original question, it would still work now. I guess my thinking is incorrect?
10-16-2017
06:43 AM
This can be achieved by setting the following property in Spark: sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true") Note that the property is set using the sqlContext instead of the sparkContext: do not set this using the Spark context; use the sqlContext for DataFrames created out of Hive tables. I tested this in Spark 1.6.2.
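For example (a hedged sketch; the table name below is hypothetical):
// Enable recursive directory listing on the SQLContext, then query a Hive table whose partitions contain nested directories
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
val df = sqlContext.sql("SELECT * FROM some_db.some_nested_table")
df.count()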
06-27-2016
07:37 PM
The referenced JIRA above is now resolved. I have successfully tested the new version of the Hive ODBC Driver on Mac OS X version 10.11 (El Capitan). However, please note that you must install the new Hive ODBC driver version 2.1.2, as shown through the iODBC Administration tool. Please also note that the location of the driver file has changed. Here is the new odbcinst.ini file (stored in ~/.odbcinst.ini), showing the old driver location commented out and the new driver location below it: [ODBC Drivers]
Hortonworks Hive ODBC Driver=Installed
[Hortonworks Hive ODBC Driver]
Description=Hortonworks Hive ODBC Driver
; old driver location
; Driver=/usr/lib/hive/lib/native/universal/libhortonworkshiveodbc.dylib
; new driver location below
Driver=/opt/hortonworks/hiveodbc/lib/universal/libhortonworkshiveodbc.dylib
04-27-2018
06:33 AM
Set num.partitions=x in server.properties (this is the default number of partitions for newly created topics).
02-15-2017
11:11 AM
Thank you @Ali Bajwa for the good tutorial. I am trying this example with one difference: my NiFi is local, and I am trying to put tweets into a remote Solr. Solr is in a VM that contains the Hortonworks sandbox. Unfortunately I am getting this error on the PutSolrContentStream processor: PutSolrContentStream[id=f6327477-fb7d-4af0-ec32-afcdb184e545] Failed to send StandardFlowFileRecord[uuid=9bc39142-c02c-4fa2-a911-9a9572e885d0,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1487148463852-14, container=default, section=14], offset=696096, length=2589],offset=0,name=103056151325300.json,size=2589] to Solr due to org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://172.17.0.2:8983/solr/tweets_shard1_replica1; routing to connection_failure: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://172.17.0.2:8983/solr/tweets_shard1_replica1; Could you help me? Thanks, Shanghoosh