Member since: 03-16-2016
707 Posts
1753 Kudos Received
203 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6986 | 09-21-2018 09:54 PM |
| | 8751 | 03-31-2018 03:59 AM |
| | 2628 | 03-31-2018 03:55 AM |
| | 2758 | 03-31-2018 03:31 AM |
| | 6188 | 03-27-2018 03:46 PM |
09-24-2016
01:20 AM
@Vinay Sharma What is your reason to avoid yum? I'd like to know what's driving that requirement.
09-24-2016
01:08 AM
@Randy Gelhausen Yes, with HDP 2.5. I missed that memo. Before that, the status was "installed as a Spark Ambari service, not yet exposed outside of Zeppelin", and the following was needed: https://github.com/hortonworks-gallery/ambari-zeppelin-service
09-23-2016
05:05 PM
6 Kudos
@Sunile Manjee No without Livy; yes with Livy (@vshukla). However, for now it is exposed only to Zeppelin. Code examples: https://github.com/romainr/hadoop-tutorials-examples/tree/master/notebook/shared_rdd
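For illustration, a minimal sketch of the idea behind those examples, assuming Zeppelin's %livy.spark interpreter is configured so that both notebooks bind to the same Livy session and use its pre-defined sqlContext; the path and table name are made up:

```scala
// Notebook A -- %livy.spark paragraph
// Cache a dataset and register it under a name inside the Livy-owned SparkContext.
// The path and table name are illustrative.
val events = sqlContext.read.json("/tmp/events.json")
events.cache()
events.registerTempTable("shared_events")

// Notebook B -- %livy.spark paragraph bound to the same Livy session
// Because the SparkContext lives in Livy, this notebook sees the cached data
// and the table registered by notebook A.
sqlContext.sql("SELECT count(*) FROM shared_events").show()
```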
09-23-2016
04:57 PM
4 Kudos
@Vasilis Vagias I just checked the same on my sandbox and it works fine. Dumb question maybe, but are all the needed services running, e.g. HiveServer2? Can you connect with beeline and access your database successfully? An Ambari restart can occasionally help.
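If you'd rather script that check than type it into beeline, here is a rough Scala/JDBC equivalent (a sketch only; it assumes the Hive JDBC driver is on the classpath, and the host, port, and user are illustrative):

```scala
import java.sql.DriverManager

// Quick HiveServer2 reachability check -- roughly what beeline does under the hood.
// Host, port, and credentials are illustrative; adjust them for your sandbox.
val conn = DriverManager.getConnection(
  "jdbc:hive2://sandbox.hortonworks.com:10000/default", "hive", "")
val rs = conn.createStatement().executeQuery("SHOW DATABASES")
while (rs.next()) println(rs.getString(1))
conn.close()
```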
09-23-2016
04:50 PM
6 Kudos
@Viraj Vekaria Yes. HDFS does the sharding automatically and you have no control over it. One rarely thinks about sharding for a file system like HDFS, though; sharding usually comes up for an actual database such as an RDBMS, MongoDB or HBase. It may be semantics, but you asked how to use sharding in HDFS, which implies manual control, and there is none: it is done automatically. At most, you could change the global replication factor or the replication factor per file (a small sketch of that per-file knob follows this reply), but you can't influence what is replicated where; there is no data-locality control.

Since @Justin Watkins mentioned traditional RDBMS, I also mentioned MongoDB, and you asked about HDFS, here is a summary of how these three approach scalability, with an HBase touch added.

Traditional RDBMS often run into scalability and data-replication bottlenecks when handling large data sets. There are creative ways to set up master-slave configurations that achieve some scalability and performance, but they all come from design work, not from out-of-the-box sharding.

MongoDB sharding can be applied to distribute data across multiple systems for horizontal scalability as needed.

Like MongoDB, Hadoop's HBase database achieves horizontal scalability through database sharding. Distribution of the data storage is handled by HDFS, with an optional data structure implemented with HBase, which allocates data into columns (versus the two-dimensional allocation of an RDBMS into columns and rows). Data can then be indexed (with software like Solr), queried with Hive, or processed with analytics or batch jobs, with choices available from the Hadoop ecosystem or your business intelligence platform of choice.

If any of the responses is helpful, please don't forget to vote and accept the best answer. Thanks.
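The sketch mentioned above: the replication factor is really the only per-file knob HDFS exposes, and even that says nothing about placement. A minimal example using the Hadoop FileSystem API (the path is illustrative; `hdfs dfs -setrep` does the same from the shell):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Lower the replication factor of one file to 2 copies (the cluster default is usually 3).
// Which DataNodes end up holding those copies is still the NameNode's decision, not yours.
val fs = FileSystem.get(new Configuration())
fs.setReplication(new Path("/data/events.csv"), 2.toShort)
```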
09-23-2016
03:02 PM
3 Kudos
G1 vs CMS? The article from @Vedant Jain seems to recommend G1 in its Java/JVM tuning section: https://community.hortonworks.com/articles/49789/kafka-best-practices.html LinkedIn Engineering concludes in favor of CMS: https://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications Opinions and reasons? What is your field experience?
Labels:
- Apache Kafka
09-21-2016
08:47 PM
1 Kudo
@Rahul Reddy Kamuru I answered the broad question with a broad answer :). I assume that you have a ton of specific questions. One could ask what you are trying to achieve, but that would reveal the ton of questions. I assume that you already know how to install WebSphere and Tomcat and that this is not in scope for your question. I assume that you already know how to write web services and that this is not in scope either.

If you wish to set up a JDBC connection to your service of choice in the Hadoop ecosystem, check that specific service's documentation. It requires a service-specific JDBC driver, and the connection string is pretty much standard with minor specifics for each service (see the sketch at the end of this reply). If you wish to learn more about writing applications that access the various services, aside from reading the developer guides for the services in the Hadoop ecosystem, I recommend this link with tutorials: http://hortonworks.com/tutorials/

I'd like to reiterate that it is not a good idea to install those web servers inside the Hadoop cluster, for the reasons I presented in my original response. You'd rather install them on separate servers that remotely access the services in the Hadoop cluster.

+++ If you feel that I answered the original question and my response helped, please don't forget to accept it as the best answer. Thanks.
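The sketch mentioned above is a hedged illustration of the JDBC pattern from application code. The driver class and URL shown are for HiveServer2; other JDBC-capable services follow the same shape with their own driver jar and URL format, and all host names and credentials here are made up:

```scala
import java.sql.DriverManager

// The pattern is the same for any JDBC-capable service in the stack: put the
// service's driver jar on the web application's classpath, then open a
// connection using that service's URL format.
Class.forName("org.apache.hive.jdbc.HiveDriver")          // HiveServer2, for example
val conn = DriverManager.getConnection(
  "jdbc:hive2://hive.example.com:10000/default", "appuser", "")
val rs = conn.createStatement().executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))
conn.close()
// Phoenix, for comparison, would use org.apache.phoenix.jdbc.PhoenixDriver and a
// URL like jdbc:phoenix:zk1.example.com:2181:/hbase -- the surrounding code stays the same.
```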
09-21-2016
08:21 PM
@Justin Watkins I would not associate SHARDING with traditional RDBMS databases, where it is mostly an exception, but with NoSQL databases like MongoDB, etc., where it is mostly the rule. @Viraj Vekaria What are you trying to achieve?
09-21-2016
08:17 PM
4 Kudos
@Raja Sekhar Chintalapati If you want your password to be encrypted, you need to have SSL enabled and use the ssl=true parameter in the connect string (a short illustration follows this reply). ++++ If a response was helpful to address your matter, please vote and accept the best answer. If you have a better answer, please add it and a moderator will review it and accept it if it stands correct. Thanks.
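The illustration below is only a sketch: the host, truststore path, and passwords are made up. The point is that the SSL settings are appended to the JDBC URL itself:

```scala
import java.sql.DriverManager

// SSL-enabled HiveServer2 connect string. The key part is ssl=true plus the
// truststore parameters in the URL, so the password travels over an encrypted channel.
val url = "jdbc:hive2://hs2.example.com:10000/default;" +
  "ssl=true;sslTrustStore=/etc/security/hive-truststore.jks;trustStorePassword=changeit"
val conn = DriverManager.getConnection(url, "raja", "secret")
```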
09-21-2016
07:56 PM
6 Kudos
@Sandeep Nemuri Without knowing the problem you are trying to resolve, there is no fair comparison between two architectures that are meant to address different functionality. It depends on the type of problem you are trying to resolve. Sorry to add this caveat and not provide a silver bullet.

#1: If you have massive data stored and you need the distributed power of Spark to fetch the data and generate the data frames that are then consumed by the R application, then the following architecture is recommended: R client -> R Server -> SparkR -> Spark -> Data Source (usually Hadoop)

#2: If you are building Spark applications that eventually need access to some R functions delivered by the R server, then your architecture looks more like this: Spark client -> Spark on YARN -> SparkR -> R Server

The first case is what you would call R on Spark. The second case is what you would call Spark on R. My observation across multiple customers is that Spark applications use #2 and R applications use #1. A data scientist who uses R as his/her tool and only needs Spark to handle the massive data back and forth will use #1. A data scientist or an application developer who needs to deliver a Spark application and wants to leverage some existing R functionality will use #2.

Due to its distributed nature and its ability to take advantage of resources from multiple nodes, an architecture like #1 benefits an R application that needs Spark's muscle in the cluster, while #2 already has the Spark muscle and brain and also needs some of the R brain. Your R server has lots of brain but not a lot of muscle when compared with Spark on YARN over a large cluster with lots of CPU and RAM.

If a response helped to shed some light, please don't forget to vote or accept the best answer. If you have a better answer, please add it and a moderator will review it and eventually accept it.