Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
Views | Posted
---|---
5130 | 09-21-2018 09:54 PM
6497 | 03-31-2018 03:59 AM
1969 | 03-31-2018 03:55 AM
2180 | 03-31-2018 03:31 AM
4836 | 03-27-2018 03:46 PM
09-08-2016
11:09 PM
3 Kudos
@Rahul Buragohain
1. Are the Ambari agents running? Please check their status.
2. Check the ambari-agent, ambari-metrics-monitor, and ambari-server logs for recent ERROR entries. These logs are usually found under /var/log/ on the node where the component runs. Every node in the cluster runs the agent and the monitor; the ambari-server log lives on the Ambari server host. Log files rotate, so check the latest one without a date stamp. You can try something like cat ambari.log | grep ERROR or tail -1000 ambari.log | grep ERROR (sketched below). Just look at the timestamps to eliminate false positives.
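A minimal sketch of those checks, assuming the usual default log locations under /var/log/ambari-agent, /var/log/ambari-metrics-monitor, and /var/log/ambari-server; adjust the paths and file names for your installation:

# check the agent status on this node
ambari-agent status

# grep the current (un-rotated) logs for recent errors; check the timestamps
grep ERROR /var/log/ambari-agent/ambari-agent.log | tail -20
grep ERROR /var/log/ambari-metrics-monitor/ambari-metrics-monitor.out | tail -20

# on the Ambari server host only
grep ERROR /var/log/ambari-server/ambari-server.log | tail -20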
09-08-2016
10:57 PM
3 Kudos
@Tim David I have been in software development for a long time and am a huge champion of agile development. The agile approach works for Hadoop just as it does for any other application development; the same best practices apply, e.g. CI, QA, and automation-automation-automation. That part is similar, and you can be as creative as you need to deliver faster and better.

Regarding tools: once upon a time, MapReduce developers needed a framework to test their MapReduce jobs, and MRUnit was considered the framework of choice. It is no longer the obvious choice, however. Less and less MapReduce is written manually, and more and more is generated by tools in the ecosystem (e.g. Hive, Pig) or by third-party tools such as Talend Studio. My recommendation is to choose development tools around the Hadoop ecosystem components you plan to use and their programming languages. For example, if you write Spark in Scala, stick with Scala-specific tools; if you are a Java shop, use the tools specific to Java. I know this is a generic response, but that is the idea. If you have specifics in mind, please submit another question with those specifics, and I am sure that the Community, including myself, will be happy to chip in.

**********

If any of the responses to your question addressed the problem, don't forget to vote and accept the answer. If you fixed the issue on your own, don't forget to post the answer to your own question; a moderator will review it and accept it.
09-07-2016
08:58 PM
5 Kudos
@Bala Vignesh N V They are meant for different use cases, and comparing them for performance is a stretch. The EXISTS keyword is intended as a way to avoid scanning and counting the entire collection or table. Something like:

--this statement needs to check the entire table
select count(*) from [table] where ...

--this statement is true as soon as one match is found
exists ( select * from [table] where ... )

IN is best used where you have a static list to pass:

select * from [table]
where [field] in (1, 2, 3)

Assuming they are used to achieve the exact same functionality, which I find a big stretch, EXISTS is much faster than IN. Some may say that IN is faster for a smaller data set, but that can be a completely random result, for example when the list was already cached, or the field used was an integer and the table had one record 🙂 Let's understand how they work: EXISTS returns a boolean TRUE as soon as the condition is met, and no further table scan is performed, while IN still does a full table scan. I hope that clarifies why, on a large table, EXISTS will always be faster. If the table has one record 🙂 then the chance is 50-50.

****************

If any of the responses to your question addressed the problem, don't forget to vote and accept the answer. If you fixed the issue on your own, don't forget to post the answer to your own question; a moderator will review it and accept it.
09-07-2016
08:11 PM
@Rajib Mandal I would say no. Here are the facts as I know them:
1. Sqoop is an application that depends on MapReduce.
2. The YARN distributed shell is an example of a non-MapReduce application built on top of YARN. Distributed-Shell is a simple mechanism for running shell commands and scripts in containers on multiple nodes in a Hadoop cluster. There are multiple existing implementations of a distributed shell that administrators typically use to manage a cluster of machines, and this application is a way to demonstrate how such a utility can be implemented on top of YARN.

I expect Sqoop to work from the command line, and I don't expect it, by design, to execute from the YARN distributed shell. Sqoop is installed to use YARN by default and will allocate containers for the tasks executed as part of its MapReduce job. The distributed shell does not understand MapReduce and can't dictate which containers to use to complete a MapReduce job (a plain command-line invocation is sketched below). Could you describe what you were using the distributed shell for before attempting to use it for Sqoop?
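For reference, a plain command-line Sqoop import looks roughly like this; the connection string, credentials, table, and HDFS path are hypothetical, and Sqoop itself submits the underlying MapReduce job to YARN:

# hypothetical JDBC URL, table, and HDFS target directory; adjust for your environment
# -P prompts for the database password; --num-mappers sets the parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4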
09-07-2016
01:31 AM
7 Kudos
@Mark Petronic One option is to monitor the Hive log located at /var/log/hive. I usually run a continuous tail and grep for SQL-specific keywords (you can put those keywords in a file, read them from the file, and grep; a few pipes achieve this, see the sketch below), for example tail -f hiveserver2.log | grep -E 'SELECT|FROM', or run, post-mortem, a cat command with grep on the same file, knowing the approximate timestamps. One could build an intelligent script to parse the logs if each executed query started and ended by setting a sort of tag variable. Unfortunately, this is a bit ugly. That's where Hive SQL was before Tez.

If any of the responses to your question addressed the problem, don't forget to vote and accept the answer. If you fixed the issue on your own, don't forget to post the answer to your own question; a moderator will review it and accept it.
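A minimal sketch of the keyword-file approach, assuming the default hiveserver2.log name under /var/log/hive and a hypothetical keywords.txt with one SQL keyword per line (SELECT, INSERT, CREATE, ...); the timestamp filter is just an example:

# follow the live log and keep only the lines matching any keyword in the file
tail -F /var/log/hive/hiveserver2.log | grep --line-buffered -F -f keywords.txt

# post-mortem: search the same log around an approximate time window
grep -F -f keywords.txt /var/log/hive/hiveserver2.log | grep '2016-09-07 01:'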
09-07-2016
01:11 AM
5 Kudos
@ScipioTheYounger @mqureshi's recommendations are correct. If you have good monitoring in place, and you must have it, 3 ZooKeepers should be enough. With 3, you still have a quorum when one fails, but you lose it if a second one goes down; with 5, you keep a quorum even with two down. As you can see, 5 is better than 3 only if 2 fail at the same time, which is unlikely. Otherwise, you must have real-time monitoring and recovery. I would add that while you can share ZooKeepers across multiple services in the data platform, some organizations prefer to allocate ZooKeepers specifically to their Kafka cluster. In that case you would have 3 ZooKeepers for Kafka, and probably Storm since it is quite a common combo, and 3 ZooKeepers for the other services. Anyhow:
- monitor the state of your ZooKeepers (a quick check is sketched below)
- put an automated recovery in place
- use 5 ZooKeepers if that makes you more comfortable than 3

If any of the responses to your question addressed the problem, don't forget to vote and accept the answer. If you fixed the issue on your own, don't forget to post the answer to your own question; a moderator will review it and accept it.
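A minimal monitoring sketch using ZooKeeper's four-letter-word commands over nc, assuming hypothetical host names zk1-zk3 and the default client port 2181:

# a healthy ensemble member answers "imok" to the ruok command
for host in zk1 zk2 zk3; do
  echo "$host: $(echo ruok | nc -w 2 "$host" 2181)"
done

# "stat" also reports whether the node is currently the leader or a follower
echo stat | nc -w 2 zk1 2181 | grep Mode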
09-06-2016
06:06 PM
4 Kudos
@Rajib Mandal You MUST run the script on a node where the Sqoop client is deployed; the Sqoop client is your entry point (a quick check is sketched below). Even if you had the Sqoop client installed on the NameNode, executing it from there would not be good practice anyway. It is also not good practice to have clients installed on data nodes. You should dedicate 1-2 machines as EDGE NODES, install all the needed clients there, and use those to submit jobs. By running on data nodes you take resources away from the data node, and the job-client and data-node processes can impact each other. You need that isolation between data nodes and client nodes.

If any of the responses to your question addressed the problem, don't forget to vote and accept the answer. If you fixed the issue on your own, don't forget to post the answer to your own question; a moderator will review it and accept it.
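A quick way to confirm that the node you are on actually has the Sqoop client before submitting anything; nothing here is specific to your cluster:

# confirm the Sqoop client is installed and on the PATH on this node
which sqoop || echo "no sqoop client here, use an edge node instead"

# print the installed Sqoop version as a sanity check
sqoop version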
09-06-2016
05:45 PM
4 Kudos
@Andrew Watson In the Kafka properties file you specify, let's say, 3 brokers; in practice that means one copy of server.properties per broker, each with its own id, port, and log directory. You find that properties file in the conf folder; in the HDP 2.5 sandbox, server.properties can be found at /usr/hdp/current/kafka-broker/conf. Then you restart the Kafka service. You have to add a few lines like these, adjusting for your preferred log location, or at least create the folders needed. Those ports also need port-forwarding rules if you want to access anything from outside your VM, or at least need to be reachable.

# first broker
broker.id=1
port=9092
log.dir=/tmp/kafka-logs-1

# second broker
broker.id=2
port=9093
log.dir=/tmp/kafka-logs-2

# third broker
broker.id=3
port=9094
log.dir=/tmp/kafka-logs-3
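A hedged sketch of how the three brokers could then be launched on the sandbox, assuming the property sets above were saved as server-1.properties, server-2.properties, and server-3.properties (hypothetical file names) under the conf directory:

cd /usr/hdp/current/kafka-broker

# start each broker in the background with its own properties file
bin/kafka-server-start.sh -daemon conf/server-1.properties
bin/kafka-server-start.sh -daemon conf/server-2.properties
bin/kafka-server-start.sh -daemon conf/server-3.properties

# list the broker ids registered in ZooKeeper to confirm all three came up
bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids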
09-06-2016
04:52 PM
@rama Yes. That's the idea. One of the files is somehow corrupted.
09-06-2016
04:38 PM
1 Kudo
@bpreachuk Thank you for catching my typo. I wrote the query while on the phone. I upvoted your addition. Thanks again.