Member since
05-30-2018
1322
Posts
715
Kudos Received
148
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4021 | 08-20-2018 08:26 PM |
| | 1927 | 08-15-2018 01:59 PM |
| | 2357 | 08-13-2018 02:20 PM |
| | 4069 | 07-23-2018 04:37 PM |
| | 4986 | 07-19-2018 12:52 PM |
03-29-2016
03:27 AM
You asked why Data Science teams claim that they cannot do most of their work on the cluster with R. My point is that this is because R is mainly a client-side studio, not that different from Eclipse (but with more tools). The article I suggested points out the hoops you have to jump through to run R across a cluster. SparkR does not really address this yet, since SparkR is simply an R front end that turns the instruction sets into RDDs and then executes them as Spark jobs on the cluster. SparkR does not actually use any of the R packages to execute logic. Take a look at the SparkR page (https://spark.apache.org/docs/1.6.0/sparkr.html): it mainly talks about creating data frames using R syntax. The section on machine learning covers Gaussian and Binomial GLMs; that's it, that is SparkR at this point. If the requirements of your project can be satisfied with these techniques, then great: you can now do your work on the cluster. If not, you will need to learn Spark and Scala. Until Spark has all of the functions and algorithms that R is capable of, SparkR will not completely solve the problem. That is why data scientists who do not have a strong dev background continue to sample data to make it fit on their workstations, so that they can keep using all of the packages that R provides.
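To make that narrow ML surface concrete, here is a minimal sketch against the Spark 1.6 SparkR API linked above. It assumes a SparkR shell where `sqlContext` is already defined, and uses the built-in `iris` dataset purely for illustration:

```r
# Spark 1.6 SparkR sketch: the only ML entry point is glm() on a Spark
# DataFrame, limited to the "gaussian" and "binomial" families.
df <- createDataFrame(sqlContext, iris)

# Gaussian GLM (linear regression), one of the two supported families.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(model)
```

Anything beyond these two GLM families (clustering, trees, the broader CRAN ecosystem) is not available through this API, which is the gap described above.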
03-18-2016
07:41 PM
I gave you a couple of choices in your other thread https://community.hortonworks.com/questions/23666/resynchronize-the-hbase-data-betweentwo-clusters.html
03-18-2016
08:35 PM
2 Kudos
The SyncTool is not in HDP releases yet, but we are tracking it so we can bring the tool into a released version.
03-16-2016
07:06 AM
3 Kudos
In general, ZooKeeper doesn't actually require huge drives, because it only stores metadata for the services it coordinates. I have seen customers using 100 GB to 250 GB partitions for the ZooKeeper data directory and logs, which is fine for many cluster deployments. In addition, the administrator needs to configure an automatic purging policy for the snapshot and transaction log directories, so that they don't end up filling all the local storage. Please refer to the doc below for more info. http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html
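As a sketch, the purging policy mentioned above is controlled by two settings in zoo.cfg, documented in the admin guide linked above (the values here are illustrative, not recommendations):

```
# zoo.cfg: keep the 3 most recent snapshots (and their transaction logs)
# and run the purge task every 24 hours
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
```

Setting autopurge.purgeInterval to 0 (the default) disables automatic purging entirely, which is how disks end up filling.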
03-15-2016
06:33 PM
1 Kudo
I have installed only ZooKeeper.
03-31-2016
09:12 PM
2 Kudos
Ambari doesn't support that yet. We have a Jira for Ambari 3.0.0 (https://issues.apache.org/jira/browse/AMBARI-14714). It will allow you to have multiple instances of the same service, potentially at different stack versions, e.g., Spark 1.6.1, 1.7.0, etc.
03-14-2016
05:49 PM
@Sunile Manjee To start with https://community.hortonworks.com/questions/2408/ranger-implementation-hive-impersonation-false.html
03-14-2016
06:51 PM
@Neeraj Sabharwal I am not sure I completely follow. The SQL is being run from the Phoenix command line. That being the case, isn't it the client, and shouldn't it use epoch? If not, how do I validate this?
05-05-2016
06:06 PM
1 Kudo
We had similar issues with the Hive interpreter while trying to run aggregations and group by columns:

1. The Hive interpreter cannot be declared directly in the notebook by using %hive. The interpreter must already be set to Hive.
2. The first line in the editor must be blank and the Hive QL statement must start on the second line, otherwise a NullPointerException will be thrown after you submit the job.

This threw us off. Somehow we started the statement on the second line and it executed without errors. Then, when we went back and put the statement on the first line, it failed again. We moved the statement back to the second line, with the first line blank, and it executed without any errors. Strange.
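For illustration, the layout from the second point looks like this in the Zeppelin editor (my_table and col1 are placeholder names, and the first line is intentionally left blank; this assumes a notebook whose default interpreter is already set to Hive):

```sql

select col1, count(*) from my_table group by col1
```

The only difference between the failing and working paragraphs is that blank first line.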
06-09-2016
12:17 PM
4 Kudos
I'm getting the same error in the HDP 2.4 sandbox: if I use %hive on Zeppelin, aggregate functions do not work.

%hive
select count(*) from health_table
java.lang.NullPointerException
at org.apache.zeppelin.hive.HiveInterpreter.getConnection(HiveInterpreter.java:184)
at org.apache.zeppelin.hive.HiveInterpreter.getStatement(HiveInterpreter.java:204)
at org.apache.zeppelin.hive.HiveInterpreter.executeSql(HiveInterpreter.java:233)
at org.apache.zeppelin.hive.HiveInterpreter.interpret(HiveInterpreter.java:328)
at org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:295)
at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
at org.apache.zeppelin.scheduler.ParallelScheduler$JobRunner.run(ParallelScheduler.java:162)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

This issue was resolved when I used %sql instead. I know your issue is not related to the HDP 2.4 sandbox, but maybe this comment will help someone using %hive on the HDP 2.4 sandbox.
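For reference, the working variant of the same query simply binds the paragraph to the %sql interpreter instead of %hive:

```sql
%sql
select count(*) from health_table
```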