Member since: 03-24-2016
Posts: 184
Kudos Received: 239
Solutions: 39
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2541 | 10-21-2017 08:24 PM |
| | 1555 | 09-24-2017 04:06 AM |
| | 5580 | 05-15-2017 08:44 PM |
| | 1680 | 01-25-2017 09:20 PM |
| | 5508 | 01-22-2017 11:51 PM |
03-29-2016
02:40 PM
3 Kudos
Has anyone tried to use Apache Ignite on Yarn with HDFS? Specifically, the HDFS acceleration feature (I am guessing it is similar to Tachyon).
Labels:
- Apache Hadoop
03-29-2016
02:28 PM
2 Kudos
The Azure HDInsight service provides the capability to create a Hadoop cluster that can be torn down and brought back up without losing any data (including the metastore). Can this setup be achieved with OpenStack Swift and Cloudbreak? If so, what are the steps and considerations to implement this architecture?
Labels:
- Hortonworks Cloudbreak
03-29-2016
01:08 PM
2 Kudos
For complex tasks like facial or object recognition, what are the best enabling frameworks?
03-29-2016
12:44 PM
2 Kudos
There are some great articles and threads here on HCC about using Spark to query data from other JDBC sources and mash them up with anything else you can get into an RDD. Has anyone seen this pattern (Spark as a federated DB including JDBC sources) actually used in production (with the JDBC thrift server)? What is the right configuration within a secure, multi-tenant Hadoop cluster?
Labels:
- Apache Spark
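To make the pattern concrete, here is a minimal Spark 1.6 (Scala) sketch of the federation idea: pull a table over JDBC, register it next to data already on HDFS, and join the two with Spark SQL. The JDBC URL, credentials, table names, and paths are placeholders, and the JDBC driver jar has to be on the executor classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: join an external RDBMS table with data already on the cluster.
object JdbcFederationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-federation"))
    val sqlContext = new SQLContext(sc)

    // External JDBC source; connection details are placeholders.
    val customers = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:postgresql://dbhost:5432/sales",
      "dbtable"  -> "public.customers",
      "user"     -> "etl",
      "password" -> "secret"
    )).load()
    customers.registerTempTable("customers")

    // Data already on HDFS.
    sqlContext.read.parquet("hdfs:///data/orders").registerTempTable("orders")

    // Federated query across both sources.
    sqlContext.sql(
      """SELECT c.region, count(*) AS order_cnt
        |FROM orders o JOIN customers c ON o.customer_id = c.id
        |GROUP BY c.region""".stripMargin).show()
  }
}
```

If the registered tables need to be served to BI tools, the Spark thrift server can share the same context (HiveThriftServer2.startWithContext); securing that in a multi-tenant cluster is the harder part of the question.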
03-29-2016
03:27 AM
You asked why Data Science teams claim that they cannot do most of their work on the cluster with R. My point is that it is due to the fact that R is mainly a client-side studio, not that different from Eclipse (but with more tools). The article I suggested points out the hoops you have to jump through to run R across a cluster. SparkR does not really address this just yet, since SparkR is simply an R interpreter that turns the instruction sets into RDDs and then executes them as Spark on the cluster. SparkR does not actually use any of the R packages to execute logic. Take a look at the SparkR page (https://spark.apache.org/docs/1.6.0/sparkr.html); it mainly talks about creating data frames using R syntax. The section on machine learning covers Gaussian and Binomial GLM, and that's it; that is SparkR at this point. If the requirements of your project can be satisfied using these techniques, then great, you can now do your work on the cluster. If not, you will need to learn Spark and Scala. Until Spark has all of the functions and algorithms that R is capable of, SparkR will not completely solve the problem. That is why data scientists who do not have a strong dev background continue to sample data to make it fit on their workstation, so that they can continue to use all of the packages that R provides.
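As a rough illustration of what "learn Spark and Scala" means in practice, here is a minimal Spark 1.6 (spark.ml) sketch of the analogue of a gaussian GLM fit; the HDFS path and column names (y, x1..x3) are placeholders for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SQLContext

// Sketch only: gaussian GLM (ordinary least squares) on the cluster in Scala.
object GaussianGlmSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("glm-sketch"))
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("hdfs:///data/training")

    // R's formula interface builds the feature vector implicitly;
    // in Scala you assemble it yourself.
    val features = new VectorAssembler()
      .setInputCols(Array("x1", "x2", "x3"))
      .setOutputCol("features")
      .transform(df)

    // Least squares == gaussian GLM with the identity link.
    val model = new LinearRegression()
      .setLabelCol("y")
      .setFeaturesCol("features")
      .fit(features)

    println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")
  }
}
```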
03-29-2016
12:33 AM
Not generally. You can use a framework like rmr or foreach to simulate parallelism, but there is a bunch of extra work required. Most data scientists just run R on their workstation and sample data to fit within the resource constraints of their system. Once Spark catches up in terms of available algorithms, I think more and more data science will be done on the cluster. Check out this blog about the different modes of R: http://blog.revolutionanalytics.com/2015/06/using-hadoop-with-r-it-depends.html
03-28-2016
05:10 PM
Yes, and that is how I am applying my models to the demos that I have built so far. I was interested in whether it is possible to create a Spark context in a Storm Bolt. Sounds like the answer might be no. Is it?
03-28-2016
05:01 PM
1 Kudo
Isn't the issue that SparkR support simply honors the syntax of R and still relies on Spark MLlib for any distributed processing? I don't believe the R libraries were designed to run in distributed mode. I think as Scala gains more prominence with the data science community, Spark MLlib acquires more algorithms, and Zeppelin acquires more useful visualizations, much more data science will be done directly on the cluster. The advantages are self-explanatory. I know of at least two large companies whose Data Science outfits no longer sample data; they just go after what is on HDFS and leverage the compute of their Yarn queue. Of course, they have Scala/Spark expertise in house.
03-28-2016
01:53 PM
1 Kudo
Need to see the query statement and your current Hive/Tez/Yarn settings in order to find the issue. It looks like one of your containers is running out of memory, but 4GB is a decent size. You may need to address your query structure, or you may simply need more reducers or a larger maximum container heap size for Tez to request. You will need to check your Yarn container sizing as well, since Tez cannot ask for a larger container than Yarn will allow.
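For reference, these are the kinds of knobs that usually matter for Tez container sizing. The values below are illustrative starting points only, not recommendations for this particular query, and the container size must stay within yarn.scheduler.maximum-allocation-mb.

```sql
-- Illustrative Hive session settings only; tune to your node memory and workload.
SET hive.tez.container.size=4096;                     -- memory per Tez task container (MB)
SET hive.tez.java.opts=-Xmx3276m;                     -- JVM heap, roughly 80% of the container
SET tez.runtime.io.sort.mb=1024;                      -- sort buffer, must fit well inside the heap
SET hive.exec.reducers.bytes.per.reducer=268435456;   -- lower this to get more reducers
```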
03-28-2016
06:21 AM
Just put your JDBC driver in the classpath and then write the connection to the DB just like you would from any Java program. Storm is not dependent on HDFS; in fact, you don't need a Hadoop cluster to run Storm. You can read from and write to the DB2 database on every event that comes through Storm.
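As a rough sketch of that approach (Scala, assuming Storm 1.x package names), a bolt can open a plain JDBC connection on the worker and insert each tuple. The DB2 URL, driver class, table, and field names below are placeholders.

```scala
import java.sql.{Connection, DriverManager}

import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.tuple.Tuple

// Writes each incoming tuple to DB2 over plain JDBC.
// The DB2 driver jar (e.g. db2jcc4.jar) must be on the topology classpath.
class Db2WriterBolt(jdbcUrl: String, user: String, password: String) extends BaseBasicBolt {

  // Opened lazily once per worker JVM (the bolt object itself is serialized to the workers).
  @transient private lazy val conn: Connection = {
    Class.forName("com.ibm.db2.jcc.DB2Driver")
    DriverManager.getConnection(jdbcUrl, user, password)
  }

  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
    // Table and field names are placeholders for illustration.
    val stmt = conn.prepareStatement("INSERT INTO events (event_id, payload) VALUES (?, ?)")
    try {
      stmt.setString(1, tuple.getStringByField("id"))
      stmt.setString(2, tuple.getStringByField("payload"))
      stmt.executeUpdate()
    } finally {
      stmt.close()
    }
  }

  // Terminal bolt: nothing is emitted downstream.
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {}

  override def cleanup(): Unit = conn.close()
}
```

For higher throughput you would typically batch inserts or use a connection pool rather than a single statement per tuple, but the basic wiring is the same.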