Member since: 03-24-2016
Posts: 184
Kudos Received: 239
Solutions: 39
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2541 | 10-21-2017 08:24 PM |
| | 1555 | 09-24-2017 04:06 AM |
| | 5580 | 05-15-2017 08:44 PM |
| | 1680 | 01-25-2017 09:20 PM |
| | 5508 | 01-22-2017 11:51 PM |
03-29-2016
02:40 PM
3 Kudos
Has anyone tried to use Apache Ignite on Yarn with HDFS? Specifically, the HDFS acceleration feature (I am guessing it is similar to Tachyon).
Labels:
- Apache Hadoop
03-29-2016
02:28 PM
2 Kudos
The Azure HDInsight service provides the capability to create a Hadoop cluster that can be torn down and brought back up without losing any data (including the metastore). Can this setup be achieved with OpenStack Swift and Cloudbreak? If so, what are the steps and considerations to implement this architecture?
Labels:
- Hortonworks Cloudbreak
03-29-2016
01:08 PM
2 Kudos
For complex tasks like facial or object recognition, what are the best enabling frameworks?
03-29-2016
12:44 PM
2 Kudos
There are some great articles and threads here on HCC about using Spark to query data from other JDBC sources and mash them up with anything else you can get into an RDD. Has anyone seen this pattern (Spark as a federated DB including JDBC sources) actually used in production (with the JDBC thrift server)? What is the right configuration within a secure, multi-tenant Hadoop cluster?
Labels:
- Apache Spark
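To make the pattern concrete, here is a minimal Spark 1.6 (Scala) sketch of the federation idea: pull a table over JDBC, register it next to data already on HDFS, and join the two with Spark SQL. The JDBC URL, credentials, table names, and paths are placeholders, and the JDBC driver jar has to be on the executor classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: join an external RDBMS table with data already on the cluster.
object JdbcFederationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-federation"))
    val sqlContext = new SQLContext(sc)

    // External JDBC source; connection details are placeholders.
    val customers = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:postgresql://dbhost:5432/sales",
      "dbtable"  -> "public.customers",
      "user"     -> "etl",
      "password" -> "secret"
    )).load()
    customers.registerTempTable("customers")

    // Data already on HDFS.
    sqlContext.read.parquet("hdfs:///data/orders").registerTempTable("orders")

    // Federated query across both sources.
    sqlContext.sql(
      """SELECT c.region, count(*) AS order_cnt
        |FROM orders o JOIN customers c ON o.customer_id = c.id
        |GROUP BY c.region""".stripMargin).show()
  }
}
```

If the registered tables need to be served to BI tools, the Spark thrift server can share the same context (HiveThriftServer2.startWithContext); securing that in a multi-tenant cluster is the harder part of the question.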
03-29-2016
03:27 AM
You asked why Data Science teams claim that they cannot do most of their work on the cluster with R. My point is that it is due to the fact that R is mainly a client-side studio, not that different from Eclipse (but with more tools). The article I suggested points out the hoops you have to jump through to run R across a cluster. SparkR does not really address this just yet, since SparkR is simply an R interpreter that turns the instruction sets into RDDs and then executes them as Spark on the cluster. SparkR does not actually use any of the R packages to execute logic. Take a look at the SparkR page (https://spark.apache.org/docs/1.6.0/sparkr.html); it mainly talks about creating data frames using R syntax. The section on machine learning covers Gaussian and Binomial GLM, and that's it; that is SparkR at this point. If the requirements of your project can be satisfied using these techniques, then great, you can now do your work on the cluster. If not, you will need to learn Spark and Scala. Until Spark has all of the functions and algorithms that R is capable of, SparkR will not completely solve the problem. That is why data scientists who do not have a strong dev background continue to sample data to make it fit on their workstation, so that they can continue to use all of the packages that R provides.
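As a rough illustration of what "learn Spark and Scala" means in practice, here is a minimal Spark 1.6 (spark.ml) sketch of the analogue of a gaussian GLM fit; the HDFS path and column names (y, x1..x3) are placeholders for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SQLContext

// Sketch only: gaussian GLM (ordinary least squares) on the cluster in Scala.
object GaussianGlmSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("glm-sketch"))
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("hdfs:///data/training")

    // R's formula interface builds the feature vector implicitly;
    // in Scala you assemble it yourself.
    val features = new VectorAssembler()
      .setInputCols(Array("x1", "x2", "x3"))
      .setOutputCol("features")
      .transform(df)

    // Least squares == gaussian GLM with the identity link.
    val model = new LinearRegression()
      .setLabelCol("y")
      .setFeaturesCol("features")
      .fit(features)

    println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")
  }
}
```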
03-29-2016
12:33 AM
Not generally. You can use a framework like rmr or foreach to simulate parallelism, but there is a bunch of extra work required. Most data scientists just run R on their workstation and sample data to fit within the resource constraints of their system. Once Spark catches up in terms of available algorithms, I think more and more data science will be done on the cluster. Check out this blog about the different modes of R: http://blog.revolutionanalytics.com/2015/06/using-hadoop-with-r-it-depends.html
03-28-2016
05:10 PM
Yes, and that is how I am applying my models to the demos that I have built so far. I was interested in whether it is possible to create a Spark context in a Storm Bolt. Sounds like the answer might be no. Is it?
03-28-2016
05:01 PM
1 Kudo
Isn't the issue that SparkR support simply honors the syntax of R and still relies on Spark MLlib for any distributed processing? I don't believe the R libraries were designed to run in distributed mode. I think as Scala gains more prominence with the data science community, Spark MLlib acquires more algorithms, and Zeppelin acquires more useful visualizations, much more data science will be done directly on the cluster. The advantages are self-explanatory. I know of at least two large companies whose Data Science outfits no longer sample data; they just go after what is on HDFS and leverage the compute of their Yarn queue. Of course, they have Scala/Spark expertise in house.
03-28-2016
01:53 PM
1 Kudo
Need to see the query statement and your current Hive/Tez/Yarn settings in order to find the issue. It looks like one of your containers is running out of memory, but 4GB is a decent size. You may need to address your query structure, or you may simply need more reducers or a larger maximum container heap size for Tez to request. You will need to check your Yarn container sizing as well, since Tez cannot ask for a larger container than Yarn will allow.
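For reference, these are the kinds of knobs that usually matter for Tez container sizing. The values below are illustrative starting points only, not recommendations for this particular query, and the container size must stay within yarn.scheduler.maximum-allocation-mb.

```sql
-- Illustrative Hive session settings only; tune to your node memory and workload.
SET hive.tez.container.size=4096;                     -- memory per Tez task container (MB)
SET hive.tez.java.opts=-Xmx3276m;                     -- JVM heap, roughly 80% of the container
SET tez.runtime.io.sort.mb=1024;                      -- sort buffer, must fit well inside the heap
SET hive.exec.reducers.bytes.per.reducer=268435456;   -- lower this to get more reducers
```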
03-28-2016
06:21 AM
Just put your JDBC driver in the classpath and then write the connection to the DB just like you would from any Java program. Storm is not dependent on HDFS; in fact, you don't need a Hadoop cluster to run Storm. You can read from and write to the DB2 database on every event that comes through Storm.
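As a rough sketch of that approach (Scala, assuming Storm 1.x package names), a bolt can open a plain JDBC connection on the worker and insert each tuple. The DB2 URL, driver class, table, and field names below are placeholders.

```scala
import java.sql.{Connection, DriverManager}

import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.tuple.Tuple

// Writes each incoming tuple to DB2 over plain JDBC.
// The DB2 driver jar (e.g. db2jcc4.jar) must be on the topology classpath.
class Db2WriterBolt(jdbcUrl: String, user: String, password: String) extends BaseBasicBolt {

  // Opened lazily once per worker JVM (the bolt object itself is serialized to the workers).
  @transient private lazy val conn: Connection = {
    Class.forName("com.ibm.db2.jcc.DB2Driver")
    DriverManager.getConnection(jdbcUrl, user, password)
  }

  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
    // Table and field names are placeholders for illustration.
    val stmt = conn.prepareStatement("INSERT INTO events (event_id, payload) VALUES (?, ?)")
    try {
      stmt.setString(1, tuple.getStringByField("id"))
      stmt.setString(2, tuple.getStringByField("payload"))
      stmt.executeUpdate()
    } finally {
      stmt.close()
    }
  }

  // Terminal bolt: nothing is emitted downstream.
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit = {}

  override def cleanup(): Unit = conn.close()
}
```

For higher throughput you would typically batch inserts or use a connection pool rather than a single statement per tuple, but the basic wiring is the same.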