Member since: 09-25-2015
Posts: 24
Kudos Received: 8
Solutions: 3

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 268 | 12-14-2015 05:59 PM |
| | 327 | 12-13-2015 05:28 PM |
| | 853 | 12-12-2015 10:04 PM |
01-19-2016
01:32 AM
3 Kudos
Repo Info:
- GitHub Repo URL: https://github.com/DhruvKumar/stocks-dashboard-lab
- GitHub account name: DhruvKumar
- Repo name: stocks-dashboard-lab
Tags: dashboard, Data Ingestion & Streaming, NiFi, sample-apps, solr, stock data
01-07-2016
05:34 PM
@jspeidel Thanks John, your solution worked. I was indeed missing the Oozie HA property.
01-07-2016
05:33 PM
@rnettleton Yes, John's solution worked. Thanks a lot for your help!
01-05-2016
08:15 PM
Ah, thanks. Let me try it and see if it works.
01-05-2016
07:41 PM
Hi Ancil - I'm not using an external DB for Ambari. I set up Ambari using "ambari-server setup -s -j /path/to/jdk", which accepts all defaults and only uses my custom JDK path, so the Ambari server DB is the default embedded Postgres. The Blueprint Processor class is responsible for substituting hostnames, and that should just be a string replace once the topology has been correctly resolved, so I'm not sure the external-DB choice would affect it.
01-05-2016
07:27 PM
Ambari Server log Gist: https://gist.github.com/DhruvKumar/e2c06a94388c51e...
01-05-2016
07:27 PM
Hi John, the blueprint and the cluster creation template are linked from the question's description; please see the links at the end. I've also just added the Ambari Server log.
01-05-2016
06:59 PM
Not sure if that matters. To the best of my knowledge, the name of a host group can be anything; it is just a string, and the blueprint processor should substitute the correct hosts as long as the names match up with the cluster creation template. See Sean's HA blueprint here, which doesn't use the "host_groups" suffix: https://github.com/seanorama/ambari-bootstrap/blob/master/api-examples/blueprints/blueprint-hdfs-ha.json
01-05-2016
06:46 PM
I'm using Ambari 2.1.2 to install a highly available HDP 2.3.4 cluster. The service installation succeeds on the nodes, but the services fail to start. Digging into the logs and the config files, I found that the %HOSTGROUP::...% placeholders didn't get replaced with the actual hostnames defined in the cluster creation template. As a result, the config files contain invalid URIs like these:
<property>
  <name>dfs.namenode.http-address</name>
  <value>%HOSTGROUP::master_2%:50070</value>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>%HOSTGROUP::master_2%:50070</value>
</property>
Sure enough, the errors while starting the services also pointed to the same reason:
[root@worker1 azureuser]# cat /var/log/hadoop/hdfs/hadoop-hdfs-datanode-worker1.log
2016-01-05 02:24:22,601 INFO datanode.DataNode (LogAdapter.java:info(45)) - STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = worker1.012g3iyhe01upgbu35npgl5l4a.gx.internal.cloudapp.net/10.0.0.9
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.7.1.2.3.4.0-3485
.
.
.
2016-01-05 02:34:27,068 FATAL datanode.DataNode (DataNode.java:secureMain(2533)) - Exception in secureMain
java.lang.IllegalArgumentException: Does not contain a valid host:port authority: %HOSTGROUP::master_2%:8020
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:198)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:164)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:153)
at org.apache.hadoop.hdfs.DFSUtil.getAddressesForNameserviceId(DFSUtil.java:687)
at org.apache.hadoop.hdfs.DFSUtil.getAddressesForNsIds(DFSUtil.java:655)
at org.apache.hadoop.hdfs.DFSUtil.getNNServiceRpcAddressesForCluster(DFSUtil.java:872)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.refreshNamenodes(BlockPoolManager.java:155)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1152)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:430)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2411)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2298)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2345)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2526)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2550)
2016-01-05 02:34:27,072 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-01-05 02:34:27,076 INFO datanode.DataNode (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at worker1.012g3iyhe01upgbu35npgl5l4a.gx.internal.cloudapp.net/10.0.0.9
************************************************************/
Interestingly, the Ambari server log reports that the hostname mapping was successful for master nodes, but I didn't find it for worker nodes.
05 Jan 2016 02:13:40,271 INFO [pool-2-thread-1] TopologyManager:598 - TopologyManager.ConfigureClusterTask areHostGroupsResolved: host group name = master_5 has been fully resolved, as all 1 required hosts are mapped to 1 physical hosts.
05 Jan 2016 02:13:40,272 INFO [pool-2-thread-1] TopologyManager:598 - TopologyManager.ConfigureClusterTask areHostGroupsResolved: host group name = master_1 has been fully resolved, as all 1 required hosts are mapped to 1 physical hosts.
05 Jan 2016 02:13:40,273 INFO [pool-2-thread-1] TopologyManager:598 - TopologyManager.ConfigureClusterTask areHostGroupsResolved: host group name = master_2 has been fully resolved, as all 1 required hosts are mapped to 1 physical hosts.
05 Jan 2016 02:13:40,273 INFO [pool-2-thread-1] TopologyManager:598 - TopologyManager.ConfigureClusterTask areHostGroupsResolved: host group name = master_3 has been fully resolved, as all 1 required hosts are mapped to 1 physical hosts.
05 Jan 2016 02:13:40,274 INFO [pool-2-thread-1] TopologyManager:598 - TopologyManager.ConfigureClusterTask areHostGroupsResolved: host group name = master_4 has been fully resolved, as all 1 required hosts are mapped to 1 physical hosts.
(But even the master nodes had service startup failures.)

Here's the config Blueprint Gist: https://gist.github.com/DhruvKumar/355af66897e584b...
And here's the cluster creation template: https://gist.github.com/DhruvKumar/9b971be81389317...
Here's the blueprint exported from the Ambari server after installation (using /api/v1/clusters/clusterName?format=blueprint): https://gist.github.com/DhruvKumar/373cd7b05ca818c...
Edit: Ambari Server log: https://gist.github.com/DhruvKumar/e2c06a94388c51e...

Note that my non-HA Blueprint, which doesn't contain the %HOSTGROUP% syntax, works without an issue on the same infrastructure. Can someone please help me debug why the hostnames aren't being mapped correctly? Is it a problem in the HA Blueprint? I have all the logs from the installation and I'll keep the cluster alive for debugging. Thanks.
12-17-2015
05:18 PM
Are you using VirtualBox? This might help: http://www.howtogeek.com/187535/how-to-copy-and-paste-between-a-virtualbox-host-machine-and-a-guest-machine/
12-14-2015
05:59 PM
To add the values of two RDDs element-wise, the general approach is:
1. Convert each RDD to a pair RDD (key-value). You can use zipWithIndex() if your RDD doesn't have implicit keys.
2. Take the union of the two pair RDDs.
3. Call reduceByKey(_ + _) on the unioned RDD.
Don't use collect(): it is slow, and you'll be limited by the driver's memory anyway. Edit: see here for an example in Scala which you can adapt to Python: http://stackoverflow.com/questions/27395420/concatenating-datasets-of-different-rdds-in-apache-spark-using-scala
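A minimal sketch of those three steps in Scala; the RDD names and values here are illustrative, not taken from the original question:
```scala
import org.apache.spark.{SparkConf, SparkContext}

object AddTwoRdds {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("add-two-rdds"))

    val a = sc.parallelize(Seq(1.0, 2.0, 3.0))
    val b = sc.parallelize(Seq(10.0, 20.0, 30.0))

    // 1. Key each element by its position to get pair RDDs
    val aKeyed = a.zipWithIndex().map { case (v, i) => (i, v) }
    val bKeyed = b.zipWithIndex().map { case (v, i) => (i, v) }

    // 2. Union the two pair RDDs, then 3. sum the values sharing a key
    val sums = aKeyed.union(bKeyed).reduceByKey(_ + _)

    // collect() is fine here only because the demo data is tiny
    sums.sortByKey().values.collect().foreach(println) // 11.0, 22.0, 33.0

    sc.stop()
  }
}
```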
12-14-2015
05:43 PM
I recommend launching the HDP 2.3 Sandbox directly on Azure as mentioned in the blog. You'll get a CentOS VM with the HDP services running on it. It is well tested and supported.
12-14-2015
05:31 PM
@Raghavendran Chellappa Tableau, or any other BI tool for that matter, can't connect directly to Spark Streaming. Spark Streaming only processes the data; you still need to persist it in HDFS or somewhere else before Tableau or anything else can connect to it.

If you need to do interactive analysis with a very short SLA, you need a system which can index the data; pure row scans won't cut it. One example would be to connect Spark Streaming to Solr. Solr will index the data as it is inserted. You can then build a read-only dashboard using Banana, or build a custom app which queries Solr with user-defined queries.

So the flow is: Streaming Data -> Spark Streaming -> Solr -> Banana Dashboard (or a custom app if interactivity is desired).

Look here for an example of streaming tweets from Spark into Solr: https://doc.lucidworks.com/lucidworks-hdpsearch/2....
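As a hedged illustration of the Spark Streaming -> Solr hop (not part of the original answer), here is one way to index a DStream of tweets with SolrJ; the Solr URL, collection name "tweets", and field names are assumptions, and the snippet targets the SolrJ 5.x client API:
```scala
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument
import org.apache.spark.streaming.dstream.DStream

// tweets: (tweet id, tweet text) pairs produced by the streaming job
def indexToSolr(tweets: DStream[(String, String)]): Unit = {
  tweets.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // Create one client per partition; SolrJ clients are not serializable
      val client = new HttpSolrClient("http://solr-host:8983/solr/tweets")
      partition.foreach { case (id, text) =>
        val doc = new SolrInputDocument()
        doc.addField("id", id)
        doc.addField("text_t", text)
        client.add(doc)
      }
      client.commit()
      client.close()
    }
  }
}
```
Banana (or a custom app) can then query that collection directly as the documents arrive.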
12-13-2015
05:49 PM
Spark is meant for application development. Tez is a library which is used by tools such as Hive to speed things up. Tez isn't suitable for end-user programming.
12-13-2015
05:45 PM
@Ram Sriharsha @Sushant Bhargav I implemented a Baum-Welch HMM trainer on Mahout in the raw MapReduce API a few years ago. That's an Expectation-Maximization algorithm, and it can be adapted to Spark fairly easily with some work. See here: https://issues.apache.org/jira/secure/attachment/1...
12-13-2015
05:28 PM
2 Kudos
@Cary Walker The HDP repo is located on GitHub. For the 2.3.0 dependencies, see here: https://github.com/hortonworks/hadoop-release/blob... You can find the RPMs in our public Maven repo; search for "hadoop" here: http://repo.hortonworks.com/index.html
12-12-2015
10:16 PM
1 Kudo
Apache Phoenix is currently the only way to query HBase using SQL.
12-12-2015
10:04 PM
In addition to Vectors, you need to import the Spark Vector class explicitly, since Scala brings its own built-in Vector type into scope by default. Try this: import org.apache.spark.mllib.linalg.{Vector, Vectors}
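For illustration, a minimal snippet showing the disambiguated import in use (the values are arbitrary):
```scala
// With the explicit import, "Vector" refers to the MLlib type rather than the
// scala.collection.immutable.Vector alias that Scala puts in scope by default.
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val v: Vector = Vectors.dense(1.0, 2.0, 3.0) // a dense MLlib vector
println(v.size)                              // 3
```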
12-11-2015
05:18 AM
Which version of Spark and HDP are you using?
12-04-2015
06:19 PM
@bsaini Iterative computations in Spark work best for large data sets, not for CPU-bound processes that use a small data set repeatedly.
12-04-2015
06:16 PM
1 Kudo
@Peter Coates
Why do you need Spark if the data is very small and can fit on a single node? There are other excellent Monte Carlo simulation packages, open source or otherwise, which can do this efficiently; even Excel has an add-in for it. Edit: If you need more horsepower for Monte Carlo simulations than one node can provide, you can look at MPI. MPICH is pretty good: https://www.mpich.org/ There's even a YARN adapter for MPICH: https://github.com/alibaba/mpich2-yarn
12-04-2015
06:06 PM
As @Ali Bajwa wrote above, use the Zeppelin Service to install Zeppelin on HDP.
11-23-2015
09:40 PM
Sentiment Analysis of Live Twitter Stream Using Apache Spark ============================================================
This application analyzes live tweets and predicts whether their sentiment is positive or negative. It works by connecting
to the Twitter stream and applying a model, built offline with Spark's machine learning library (MLlib), to classify
each tweet's sentiment. Using the instructions on this page, you will be able to build the model on the HDP Sandbox and then
apply it to a live Twitter stream.
Prerequisites
-------------
* Download the HDP 2.3 Sandbox from [here](http://hortonworks.com/products/hortonworks-sandbox/#install)
* Start the Sandbox, and add its IP address into your local machine's /etc/hosts file:
```bash
$ echo "172.16.139.139 sandbox.hortonworks.com" | sudo tee -a /etc/hosts
```
* Log into the sandbox, and clone this repository:
```bash
$ ssh root@sandbox.hortonworks.com
$ cd
$ git clone https://github.com/DhruvKumar/spark-twitter-sentiment
```
* Download the labeled training data into the sandbox.
```bash
$ wget -O dataset.csv "https://www.dropbox.com/s/1k355mod4p70jiq/dataset.csv?dl=0"
```
* Put the tweet data into HDFS at /tmp/tweets:
```bash
$ hadoop fs -mkdir -p /tmp/tweets
$ hadoop fs -put dataset.csv /tmp/tweets/
```
* Sign up for a Twitter developer account and get the OAuth credentials [here for free](https://apps.twitter.com/).
Build and package the code
-----------------------------------------
Compile the code using maven:
```bash
$ cd
$ cd spark-twitter-sentiment
$ mvn clean package
```
This will build and place the uber jar "twittersentiment-0.0.1-jar-with-dependencies.jar" under the target/ directory.
We're now ready to train the model.
Train the Model
-----------------------------------------
```bash
$ spark-submit --master yarn-client \
  --driver-memory 1g \
  --executor-memory 2g \
  target/twittersentiment-0.0.1-jar-with-dependencies.jar \
  hdfs:///tmp/tweets/dataset.csv \
  trainedModel
```
This will train and test the model, and put it under the trainedModel directory. You should see the results of the
testing, with predicted sentiments like this:
```bash
********* Training **********
Elapsed time: 13868063ms
********* Testing **********
Elapsed time: 16326ms
Training and Testing complete. Accuracy is = 0.6536062932423466
Some Predictions:
---------------------------------------------------------------
Text = i think mi bf is cheating on me!!! T_T
Actual Label = negative
Predicted Label = negative
---------------------------------------------------------------
Text = handed in my uniform today . i miss you already
Actual Label = positive
Predicted Label = negative
---------------------------------------------------------------
Text = I must think about positive..
Actual Label = negative
Predicted Label = negative
---------------------------------------------------------------
Text = thanks to all the haters up in my face all day! 112-102
Actual Label = positive
Predicted Label = positive
---------------------------------------------------------------
Text = <-------- This is the way i feel right now...
Actual Label = negative
Predicted Label = positive
---------------------------------------------------------------
Text = HUGE roll of thunder just now...SO scary!!!!
Actual Label = negative
Predicted Label = positive
---------------------------------------------------------------
Text = You're the only one who can see this cause no one else is following me this is for you because you're pretty awesome
Actual Label = positive
Predicted Label = positive
---------------------------------------------------------------
Text = BoRinG ): whats wrong with him?? Please tell me........ :-/
Actual Label = negative
Predicted Label = negative
---------------------------------------------------------------
Text = I didn't realize it was THAT deep. Geez give a girl a warning atleast!
Actual Label = negative
Predicted Label = negative
---------------------------------------------------------------
Text = i miss you guys too i think i'm wearing skinny jeans a cute sweater and heels not really sure what are you doing today
Actual Label = negative
Predicted Label = negative
********* Stopped Spark Context succesfully, exiting ********
```
Predict sentiment of live tweets using the model
-------------------------------------------------
Now that the model is trained and saved, let's apply it to a live Twitter stream and see if we can classify sentiment
accurately. Launch the following command with your Twitter dev keys:
```bash
$ cd
$ spark-submit \
--class com.dhruv.Predict \
--master yarn-client \
--num-executors 2 \
--executor-memory 512m \
--executor-cores 2 \
target/twittersentiment-0.0.1-jar-with-dependencies.jar \
trainedModel \
--consumerKey {your Twitter consumer key} \
--consumerSecret {your Twitter consumer secret} \
--accessToken {your Twitter access token} \
--accessTokenSecret {your Twitter access token secret}
```
This command will set up Spark Streaming, connect to Twitter using your dev credentials, and start printing tweets
with their predicted sentiment. A label of 1.0 is positive sentiment and 0.0 is negative. Each tweet and its predicted label
are displayed like this:
```bash
(#Listened Chasing Pavements by Adele on #MixRadio #ListenNow #Pop #NowPlaying #19 http://t.co/qLXGoq8B8u,1.0)
(Work isn't going so bad, but if I did fix it wtf,0.0)
(RT @RandyAlley: Come on let's have a win Rovers! Good luck lads @Shaun_Lunt @TREVORBC83 http://t.co/tsDEZPrJIO,1.0)
(RT @ribbonchariots: dress shirt ftw!!!! & the v gets deeper and deeper....... http://t.co/qAL3zIteVF,1.0)
```
Where to go from here?
-------------------------------------------------
I've used a very simple feature extractor: a bigram model, hashed down to 1000 features. This can be vastly improved.
Experiment with removing stop words (in, the, and, etc.) from the tweets before training, as they don't add any information.
Consider lemmatizing the tweets, which collapses multiple forms of a word into one token (train and trains become the same).
I've included the NLP pipeline parsers and lemmatizers from the Stanford NLP library, so you can start from there.
Also consider using TF-IDF, and experiment with other classifiers in Spark MLlib such as Random Forest.
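A hedged sketch of some of these ideas with MLlib (this is not code from the repo): stop-word removal, hashed bigram features, and TF-IDF reweighting. The stop-word list and feature count below are illustrative.
```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val stopWords = Set("in", "the", "and", "a", "to", "of") // tiny illustrative list

// Lowercase, drop stop words, and emit space-joined bigrams
def bigrams(tweet: String): Seq[String] = {
  val tokens = tweet.toLowerCase.split("\\s+").filterNot(stopWords.contains)
  tokens.sliding(2).map(_.mkString(" ")).toSeq
}

// Hash bigrams into 1000 features, then reweight with TF-IDF
def featurize(tweets: RDD[String]): RDD[Vector] = {
  val hashingTF = new HashingTF(numFeatures = 1000)
  val tf = tweets.map(t => hashingTF.transform(bigrams(t)))
  tf.cache()                 // IDF makes a second pass over the term frequencies
  val idf = new IDF().fit(tf)
  idf.transform(tf)          // TF-IDF vectors, ready to feed a classifier
}
```
The resulting vectors can be wrapped in LabeledPoint instances and fed to classifiers such as NaiveBayes or RandomForest in MLlib.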
Tags: Data Science & Advanced Analytics, how-to-tutorial, IoT, machine-learning, Scala, Spark
10-09-2015
07:20 PM
1 Kudo
As a general rule, checking for malformed records and doing something about them should be the first step in any data processing pipeline. For example, in Cascading it's standard practice to plug a Filter function in at the very beginning of the main input pipe, and in Spark we create the input RDD and immediately filter out bad records.

For the Hive CSV case, column inference happens at read time, so make sure your SerDe implementation can handle missing columns, or better, remove bad records from the data stored in HDFS before creating the Hive table. If you used Flume to move the CSV into HDFS, you can add the filtering logic to Interceptors; otherwise, implement the error handling in your SerDe.
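As a hedged sketch of the Spark case described above (the input path, quarantine path, and column count are made up for illustration):
```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("filter-bad-records"))

// Create the input RDD and immediately separate malformed rows
val raw = sc.textFile("hdfs:///data/input.csv")

val expectedColumns = 5
val rows = raw.map(_.split(",", -1))                 // -1 keeps trailing empty fields
val good = rows.filter(_.length == expectedColumns)  // well-formed records
val bad  = rows.filter(_.length != expectedColumns)  // everything else

// Quarantine the bad records for inspection and continue the pipeline with `good`
bad.map(_.mkString(",")).saveAsTextFile("hdfs:///data/quarantine")
```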