About dkumar1

dkumar1 · ‎12-14-2015

I recommend launching the HDP 2.3 Sandbox directly on Azure, as mentioned in the blog. You'll get a CentOS VM with HDP services running on it. It is well tested and supported.

dkumar1 · ‎12-14-2015

@Raghavendran Chellappa Tableau, like any other BI tool, can't connect directly to Spark Streaming. Spark Streaming only processes the data; you still need to persist it in HDFS or somewhere else before Tableau or anything else can connect to it. If you need interactive analysis with a very short SLA, you need a system that can index the data. Pure row scans won't cut it. One option is to connect Spark Streaming to Solr, which indexes the data as it is inserted. You can then build a read-only dashboard using Banana, or build a custom app which queries Solr for user-defined queries. So the flow is:

Streaming Data -> Spark Streaming -> Solr -> Banana Dashboard (or a custom app if interactivity is desired)

Look here for an example of streaming Tweets from Spark into Solr: https://doc.lucidworks.com/lucidworks-hdpsearch/2....

dkumar1 · ‎12-13-2015

Spark is meant for application development. Tez is a library which is used by tools such as Hive to speed things up. Tez isn't suitable for end-user programming.

dkumar1 · ‎12-13-2015

@Cary Walker The HDP repo is located on GitHub. For 2.3.0 dependencies, see here: https://github.com/hortonworks/hadoop-release/blob... You can find the RPM in our public Maven repo. Search for "hadoop" here: http://repo.hortonworks.com/index.html

dkumar1 · ‎12-12-2015

Apache Phoenix is currently the only way to query HBase using SQL.

dkumar1 · ‎12-12-2015

In addition to Vectors, you need to import Spark's Vector type explicitly, because Scala brings its built-in Vector (scala.collection.immutable.Vector) into scope by default. Try this: import org.apache.spark.mllib.linalg.{Vector, Vectors}
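
For context, here is a minimal sketch of how that import is typically used. It assumes Spark MLlib 1.x is on the classpath (for example inside spark-shell); the label and feature values are made up for illustration and are not taken from the original question.

```scala
// Assumes Spark MLlib is on the classpath (e.g. run inside spark-shell).
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// With the explicit import, `Vector` refers to MLlib's type rather than
// scala.collection.immutable.Vector.
val features: Vector = Vectors.dense(1.0, 2.0, 3.0)

// LabeledPoint expects an MLlib Vector; the label 1.0 is illustrative.
val point = LabeledPoint(1.0, features)
```

Without the explicit import, a type annotation such as `features: Vector` would resolve to Scala's collection type, and the compiler would likely report a type mismatch.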

dkumar1 · ‎12-11-2015

Which version of Spark and HDP are you using?

dkumar1 · ‎12-04-2015

@bsaini Spark is best at iterative computations over large data sets, not at CPU-bound processes that repeatedly use a small data set.

dkumar1 · ‎12-04-2015

@Peter Coates Why do you need Spark if the data is very small and can fit on a single node? There are other excellent Monte Carlo simulation packages, open source or otherwise, which can do this efficiently. Even Excel has an add-in for this.

Edit: If you need more horsepower for Monte Carlo simulations than one node can provide, you can look at MPI. MPICH is pretty good: https://www.mpich.org/ There's even a YARN adapter for MPICH: https://github.com/alibaba/mpich2-yarn
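
To illustrate the single-node case, here is a rough sketch of a Monte Carlo simulation in plain Scala with no cluster involved. The pi-estimation problem, the object name `MonteCarloPi`, the sample count and the seed are all assumptions chosen for illustration; they are not from the original question.

```scala
import scala.util.Random

// Hypothetical example: estimate pi by sampling random points in the unit
// square and counting how many fall inside the quarter circle.
object MonteCarloPi {
  def estimatePi(samples: Int, seed: Long): Double = {
    val rng = new Random(seed) // fixed seed makes the run repeatable
    var inside = 0
    var i = 0
    while (i < samples) {
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      if (x * x + y * y <= 1.0) inside += 1
      i += 1
    }
    // Ratio of areas (quarter circle / unit square) is pi / 4.
    4.0 * inside / samples
  }

  def main(args: Array[String]): Unit = {
    println(estimatePi(samples = 1000000, seed = 42L))
  }
}
```

A loop like this is CPU-bound and touches almost no data, which is why a single JVM (or MPI, if one node is not enough) tends to be a better fit than Spark.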

dkumar1 · ‎12-04-2015

As @Ali Bajwa wrote above, use the Zeppelin Service to install Zeppelin on HDP.

Last Visited	‎03-09-2018 06:11 AM

Member Since	‎09-25-2015 04:31 AM
Posts	24
Kudos received	6

Cloudera Community

Re: Averaging RandomForest votes in Spark 1.3.1

Re: Specific version of source code (2.7.1.2.3.0.0...

Re: Type Error when attempting Linear Regression

Re: Problem starting VirtualBox Sandbox on Microso...

Re: Use of Spark Streaming for interactive Reporti...

Re: Spark vs Tez?

Re: Specific version of source code (2.7.1.2.3.0.0...

Re: Hbase query best practice

Re: Type Error when attempting Linear Regression

Re: Type Error when attempting Linear Regression

Re: What's the best way to do Monte Carlo simulati...

Re: What's the best way to do Monte Carlo simulati...

Re: No output from Zeppelin on HDP 2.3 using Spark...