Member since: 03-16-2016
707 Posts
1753 Kudos Received
203 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6986 | 09-21-2018 09:54 PM |
| | 8751 | 03-31-2018 03:59 AM |
| | 2628 | 03-31-2018 03:55 AM |
| | 2758 | 03-31-2018 03:31 AM |
| | 6188 | 03-27-2018 03:46 PM |
09-21-2016
02:36 PM
@Sandeep Nemuri Could you answer @Mats Johansson? I am interested in what your question meant ... The thread seems abandoned, and the community needs to understand both the question and the answer.
09-21-2016
02:32 PM
4 Kudos
@Kumar Veerappan 1.3.1 is the Spark version supported by HDP 2.3.0. Is it possible that someone installed a newer version of Spark outside of Ambari and then uninstalled it, and Ambari is somehow still caching that version? Did you restart the Ambari server and check again?
09-21-2016
02:28 PM
@RAMESH K If the response was helpful, please vote and accept it as the best answer.
09-21-2016
02:24 PM
4 Kudos
@henryon wen As you already know, Spark 1.6.1 is part of HDP 2.4.2. While it is technically possible to upgrade to 1.6.2, it is not supported by Hortonworks. There may also be implications for Zeppelin and other tools in the ecosystem, depending on how your applications are built and executed. If you have paid support, make sure you contact support before proceeding.
09-19-2016
05:32 PM
2 Kudos
@Shiva Nagesh I agree with @hkropp. While you can, it does not mean you should as-is. You need to account for shortcomings, both architecturally and in resource management, not to mention the security concerns of bringing more services onto the edge nodes than is usually manageable. I get that you have spare capacity on those edge nodes and would like to use it as a burst option in case of need. You could consider Docker containers on your edge servers: that way you can separate the true edge nodes from on-demand workers. Those Docker containers would use a worker template and could be spun up quickly as additional nodes, similar to what you would do in a cloud.
09-19-2016
05:20 PM
2 Kudos
@srinivasa rao I guess you read that when you perform a "select * from <tablename>", Hive fetches the whole data set from the file as a FetchTask rather than a MapReduce job; it just dumps the data as-is without doing anything to it, similar to "hadoop dfs -text <filename>". However, that does not take advantage of true parallelism. In your case, with 1 GB, it will not make a difference, but imagine a 100 TB table read by a single-threaded task in a cluster with 1000 nodes. A FetchTask is not a good use of parallelism. Tez provides options to split the data set and allow true parallelism: tez.grouping.max-size and tez.grouping.min-size are the split parameters. Ref: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html If any of the responses was helpful, please don't forget to vote/accept the answer.
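As a rough sketch of how those settings can be applied per session, here is a hedged Hive JDBC example; the HiveServer2 host, credentials, table name, and byte values are placeholders, not tuned recommendations:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TezSplitExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder connection details -- replace with your HiveServer2 host and user.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hs2-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Disable the single-threaded FetchTask conversion so the scan
            // runs as a real Tez job instead of streaming through one thread.
            stmt.execute("SET hive.fetch.task.conversion=none");
            // Bound the Tez split grouping: smaller groups -> more parallel tasks.
            stmt.execute("SET tez.grouping.min-size=16777216");   // 16 MB (example value)
            stmt.execute("SET tez.grouping.max-size=134217728");  // 128 MB (example value)
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_table")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        }
    }
}
```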
09-17-2016
01:46 AM
5 Kudos
@Rahul Reddy Kamuru Yes, you can install and integrate them. However, you can also keep your web and application servers on their own dedicated infrastructure and have them access services from the Hadoop ecosystem via JDBC, ODBC, or REST APIs. In Hadoop terminology, they would reside on the edge nodes and act as Hadoop clients. While installing them on the Hadoop cluster is possible, it means that operations on the Hadoop cluster also impact those web and application servers, complicating everything, including upgrades. Separation of concerns produces a clean architecture. That said, it is not unusual to have Tomcat installed on the cluster to support development of services that access data stored in HDFS, Hive, HBase, etc., or even submit jobs to Spark. These are mainly "glue" services that help build data pipelines or expose data services to BI tools residing outside the cluster, e.g. Tableau, MicroStrategy, ZoomData, etc. If this response helped, please vote and accept it as the best answer.
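As an illustration of the REST option, below is a minimal sketch of an application server reading a file over WebHDFS; the NameNode host, file path, and user name are hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode host and file path -- adjust for your cluster.
        // OPEN first answers with a 307 redirect to a DataNode, which
        // HttpURLConnection follows automatically for http -> http.
        URL url = new URL(
            "http://namenode.example.com:50070/webhdfs/v1/tmp/sample.txt"
            + "?op=OPEN&user.name=hdfs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```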
09-16-2016
11:35 PM
@Jitendra Yadav Good question. It shouldn't. Occasionally, a feature may slip in. As long as it is not a major change, it is probably tolerated. It could also be a Tech Preview, which is probably the case with Grafana, and that is probably fine. Did that feature break any existing functionality?
09-16-2016
09:01 PM
1 Kudo
@P D The Ambari repo includes only sources right now: http://www.apache.org/dist/ambari/ambari-2.4.1/ As soon as the binaries are posted, you can find them at that Ambari link or in the HDP Ambari public repo. It is a matter of days before they are published; Ambari 2.4.1 was just released this week.
09-16-2016
08:42 PM
@srivatsan chakravarti You can also read all the messages that are within the retention period for your topic. That way you don't have to run your producer while you test your consumer: you can consume as many times as you want from what was produced, as long as it is still retained (7 days by default). You would have to use the low-level SimpleConsumer API to implement Java code that emulates what you can do from the CLI with: bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
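For reference, here is a minimal sketch of that SimpleConsumer approach; the broker host, topic, and partition are assumptions, and real code would also discover the partition leader and check the fetch response for errors:

```java
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;

public class FromBeginningExample {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumed broker/topic/partition; production code should find the
        // partition leader first and iterate over all partitions.
        SimpleConsumer consumer = new SimpleConsumer(
                "localhost", 9092, 100000, 64 * 1024, "from-beginning-client");
        FetchRequest req = new FetchRequestBuilder()
                .clientId("from-beginning-client")
                .addFetch("test", 0, 0L, 100000)  // offset 0 == from the beginning
                .build();
        FetchResponse resp = consumer.fetch(req);  // should also check resp.hasError()
        for (MessageAndOffset mo : resp.messageSet("test", 0)) {
            ByteBuffer payload = mo.message().payload();
            byte[] bytes = new byte[payload.limit()];
            payload.get(bytes);
            System.out.println(mo.offset() + ": " + new String(bytes, "UTF-8"));
        }
        consumer.close();
    }
}
```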