Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6986 | 09-21-2018 09:54 PM |
| | 8748 | 03-31-2018 03:59 AM |
| | 2626 | 03-31-2018 03:55 AM |
| | 2758 | 03-31-2018 03:31 AM |
| | 6185 | 03-27-2018 03:46 PM |
10-02-2016
03:48 AM
4 Kudos
@Luis Valdeavellano Those warnings have nothing to do with your issue, but it is good to fix them anyway.
1. If you don't want your job to use all the resources in the cluster, define a separate YARN queue for Spark jobs and submit the job to that queue. I assume you already submit Spark jobs via YARN. Obviously, your job will still max out that Spark queue's resources, but the resources not assigned to that queue can still be used by others. You would still have your problem, but others could still execute their jobs.
2. Look at your job and determine why it is using so many resources; redesign it, tune it, break it into smaller pieces, etc. If the job is already well tuned, then your cluster simply does not have enough resources. Check resource usage during the execution of the job to determine the bottleneck (RAM, CPU, etc.).
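A minimal sketch of point 1 from the PySpark side, assuming the cluster already has a Capacity Scheduler queue reserved for Spark jobs; the queue name `spark_jobs` and the executor sizing values are placeholders, not part of the original answer:

```python
# Sketch: pin a Spark application to a dedicated YARN queue so it cannot
# take resources outside that queue. "spark_jobs" and the executor sizing
# below are placeholder values, not recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("queued-job")
    .master("yarn")                            # run on YARN, not locally
    .config("spark.yarn.queue", "spark_jobs")  # dedicated queue for Spark jobs
    .config("spark.executor.memory", "4g")     # cap per-executor memory
    .config("spark.executor.cores", "2")       # cap per-executor cores
    .getOrCreate()
)

# ... job logic here ...
spark.stop()
```

The same effect can be achieved at submit time with the `--queue` option of `spark-submit`.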
10-02-2016
03:27 AM
4 Kudos
@Avijeet Dash I have used all recent sandboxes on Windows. I am not sure what challenge you encountered. As long as your VMware Player works on Windows and you use the required version of VMware Player, the sandbox works. You just have to meet the resource requirements. To download the sandbox for VMware, VirtualBox or Docker, go here: http://hortonworks.com/downloads/#sandbox Read this document first: https://hortonworks.com/wp-content/uploads/2013/03/InstallingHortonworksSandboxonWindowsUsingVMwarePlayerv2.pdf Keep in mind to use the latest VMware Player for the best experience. Good luck! +++++++++++++++++++++++++++++++++++++++++++++++++ Don't forget to vote for and accept the best answer to your question.
10-02-2016
03:18 AM
3 Kudos
@Eric Periard The one you really need (1.1.0) is here: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-examples/1.1.0 Please don't forget to vote for and accept the response that helped.
10-02-2016
03:09 AM
Hello @Avijeet Dash, As Vipin Rathor mentioned, many companies are in production with Kerberos on 2.x versions. Please post your issues as individual questions.
09-30-2016
08:35 PM
@ARUN As @Enis clarified, there will be no impact on your start/stop. If you have custom scripts, then you may want to fix those files so they list all slaves. I have no explanation yet for why the files on the existing slave nodes did not get updated, but you are safe for start/stop.
09-30-2016
08:13 PM
6 Kudos
@Andi Chirita If you are already experienced with Oozie and your work is in Hadoop, then look at Apache Falcon. It is part of the Hortonworks Data Platform as well. I really like a tool that does more than just scheduling: I want the tool to execute various tasks, or delegate execution, not only on a schedule but also on an event or when conditions are met. Both Apache Falcon and Apache NiFi can help with that. They are not just specialized schedulers; they are more than that. Falcon can satisfy your requirements if you live in Hadoop. If you want to do more than that, then look at Apache NiFi.

"Falcon simplifies the development and management of data processing pipelines with a higher layer of abstraction, taking the complex coding out of data processing applications by providing out-of-the-box data management services. This simplifies the configuration and orchestration of data motion, disaster recovery and data retention workflows. The Falcon framework can also leverage other HDP components, such as Pig, HDFS, and Oozie. Falcon enables this simplified management by providing a framework to define, deploy, and manage data pipelines."

Check here for what Falcon does and how it works: http://hortonworks.com/apache/falcon If you want to know more about this project backed by Hortonworks, go to: https://falcon.apache.org/

Apache NiFi is a tool to build a dataflow pipeline (the flow of data from edge devices to the datacenter). NiFi has a lot of built-in connectors (known as processors in the NiFi world), so it can Get/Put data from/to HDFS, Hive, RDBMS, Kafka, etc. out of the box. It also has a really cool and user-friendly interface that can be used to build a dataflow in minutes by dragging and dropping processors. NiFi is an alternative with more support and customer adoption; it has been used heavily at the NSA and it is part of Hortonworks Data Flow.

To learn more about Apache NiFi go here: https://nifi.apache.org/ NiFi tutorials are here: http://hortonworks.com/hadoop-tutorial/learning-ropes-apache-nifi/ Falcon tutorials are here: http://hortonworks.com/apache/falcon/#tutorials

Both do much more than a scheduler; they help build true pipelines, which is the usual use case. I evaluated Airflow and, while it is a promising project, it is still in the incubator phase and not enterprise-ready: there are still many issues, and it is more like a traditional scheduler. It depends on your use case, but with either Falcon or NiFi you can achieve what a scheduler does and more. I just love NiFi because I can use it for Hadoop and non-Hadoop work. Let me know if you want to see a demo of NiFi and I can set you up.
09-30-2016
05:55 PM
@Sami Ahmad The upgrade addressed it, but I guess we still don't know the root cause.
09-30-2016
05:50 PM
5 Kudos
@Gaurav Naik The sandbox is for individual users to evaluate on their desktop or laptop. If you wish to use ESXi, you could install HDP 2.5 as a single-node cluster. Look here: https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.1.0/bk_ambari-installation/content/index.html Start with Ambari, then the single-node cluster. It is pretty straightforward, and you can save this image and use it multiple times. There is no plan to support the sandbox on actual servers; it is just for individual users on a desktop/laptop. If this clarified/helped, please vote/accept as the best answer.
09-30-2016
05:45 PM
5 Kudos
@Ramy Mansour It seems that your job does not use any parallelism. Among the options suggested in this thread, chunking the CSV input file into multiple parts could also help; it would do manually what Phoenix would do for you. The number of chunks should be determined based on the resources that you want to use, but given your cluster resources you could probably split the file into at least 25 parts of about 10 GB each.
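A minimal sketch of the manual chunking idea in Python; the file name, the part count, and the assumption that the CSV has no header to repeat are all illustrative, not from the original answer:

```python
# Sketch: split a large CSV into N roughly equal parts so each part can be
# loaded in parallel. The path and part count are placeholders.
import itertools

SOURCE = "input.csv"   # hypothetical large source file
PARTS = 25             # e.g. ~10 GB per part for a ~250 GB file

# First pass: count lines so each part gets an (almost) equal share.
with open(SOURCE) as f:
    total_lines = sum(1 for _ in f)

lines_per_part = -(-total_lines // PARTS)  # ceiling division

# Second pass: stream the file and write each slice to its own part file.
with open(SOURCE) as f:
    for part in range(PARTS):
        chunk = list(itertools.islice(f, lines_per_part))
        if not chunk:
            break
        with open(f"part_{part:02d}.csv", "w") as out:
            out.writelines(chunk)
```

Each part can then be bulk-loaded independently so several loads run at the same time.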
09-30-2016
05:32 PM
5 Kudos
@hitaay What did you see as a "heavy performance impact"? Could you put some numbers next to it: CPU, RAM, disk, etc.? How much logging could possibly be happening in your cluster such that collecting specific logs impacts the cluster? How big is your cluster and how utilized is it? I haven't seen one case where SmartSense was the culprit. Please help me document a first case and possibly get it to engineering.