About cstanca

cstanca · ‎09-26-2016

@Arkaprova Saha It depends on how you see yourself and your future. If you consider yourself a software engineer with a solid Java background who wants to deliver highly optimized, scalable software products based on Spark, then you may want to focus more on Scala. If you are more focused on data wrangling, discovery and analysis, short-term studies, or resolving business problems as quickly as possible, then Python is awesome. Python has a large community and plenty of code snippets, applications, etc. Don't get me wrong: Python can also be used to deliver enterprise-level applications, but Java and Scala are more commonly used for highly optimized ones. Python has some drawbacks, which we will not debate here. Anyhow, I would say that Python is kind of a MUST HAVE and Scala is a NICE TO HAVE. Obviously, this is my 2c, and I would be amazed if any of the responses in this thread were the ANSWER.

cstanca · ‎09-26-2016

@Bala Vignesh N V If your table is a managed Hive table (not an external table), it is ACID-enabled (which requires the ORC file format), Hive on Tez is enabled globally for parallelism, and you submit those SQL statements as separate jobs, then YES. This assumes that you run a version of Hive capable of ACID, which you most likely do if you use anything released in the last 1.5-2 years.
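
To illustrate what "separate jobs" could look like from a client, here is a minimal sketch that submits each statement on its own worker thread. The table names and the `run_statement` function are hypothetical placeholders, not a real Hive client API; in practice `run_statement` would open its own Hive session and execute the statement there.

```python
from concurrent.futures import ThreadPoolExecutor


def run_statement(sql):
    # Hypothetical placeholder: a real version would open its own Hive
    # session and execute the statement. Here it only echoes the SQL.
    return f"submitted: {sql}"


# Hypothetical statements loading the same ACID table from two sources.
statements = [
    "INSERT INTO TABLE target SELECT * FROM staging_a",
    "INSERT INTO TABLE target SELECT * FROM staging_b",
]


def run_in_parallel(stmts, runner=run_statement, max_workers=4):
    # Each statement is handed to its own worker, so the statements are
    # submitted as independent jobs rather than one sequential script.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, stmts))


if __name__ == "__main__":
    for result in run_in_parallel(statements):
        print(result)
```

Whether the jobs actually run concurrently still depends on the conditions listed above (ACID table, Tez, available cluster capacity).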

cstanca · ‎09-24-2016

@Shankar P Following up on @cduby's answer, you can always create a single-node cluster, like a sandbox.

mkumar2 · ‎09-23-2016

The LinkedIn article is old. The Kafka documentation recommends the G1 collector. http://kafka.apache.org/documentation.html#java

jean_jeancarl48 · ‎10-11-2016

Yes, it was me who created the ticket.

cstanca · ‎09-19-2016

@srinivasa rao I guess you read that when you run "select * from <tablename>", Hive fetches the whole data set from the files with a FetchTask rather than a MapReduce task; it just dumps the data as is without doing any processing on it, similar to "hadoop dfs -text <filename>". However, this does not take advantage of true parallelism. In your case, with 1 GB, it will not make a difference, but imagine a 100 TB table read by a single-threaded task in a cluster with 1000 nodes. FetchTask is not a good use of parallelism. Tez provides some options to split the data set to allow true parallelism: tez.grouping.max-size and tez.grouping.min-size are the split parameters. Ref: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html If any of the responses was helpful, please don't forget to vote/accept the answer.
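
As a rough illustration of how the two split parameters bound parallelism, the sketch below computes the fewest and most tasks possible if every grouped split had to fall between the minimum and maximum size. The 50 MiB and 1 GiB values are assumptions chosen for the example, and `grouped_split_bounds` is a hypothetical helper; Tez's real grouping logic also takes cluster capacity into account, so treat the result as a bound and not a prediction.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

# Assumed example values for tez.grouping.min-size and tez.grouping.max-size.
MIN_SIZE = 50 * MIB
MAX_SIZE = 1 * GIB


def grouped_split_bounds(total_bytes, min_size, max_size):
    """Return (fewest, most) tasks if each group is min_size..max_size bytes."""
    fewest = -(-total_bytes // max_size)  # ceiling division, largest groups
    most = -(-total_bytes // min_size)    # ceiling division, smallest groups
    return fewest, most


if __name__ == "__main__":
    print(grouped_split_bounds(1 * GIB, MIN_SIZE, MAX_SIZE))    # (1, 21)
    print(grouped_split_bounds(100 * TIB, MIN_SIZE, MAX_SIZE))  # (102400, 2097152)
```

With these assumed sizes, the 1 GB table yields at most a couple of dozen tasks, while the 100 TB table can be spread over far more tasks than the cluster has nodes, which is the parallelism a single FetchTask gives up.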

cjxqhhh · ‎01-11-2019

@Constantin Stanca Hi, could you please explain why there could be a split-brain situation when the number of ZooKeeper nodes is even? Thanks~

raravena80 · ‎12-20-2017

Restart your RM

njayakumar · ‎09-12-2016

@Tim David - Scala would be ideal for Hadoop development.

rburagohain · ‎10-12-2016

@Constantin Stanca Hi Constantin, The issue was that a hadoop folder had previously been created under /usr/hdp. There should be only two folders under /usr/hdp, named 2.4.2.0-258 and current, and nothing else. After removing the hadoop folder from /usr/hdp, the issue was resolved. Thanks, Rahul

Last Visited	‎03-22-2019 03:12 AM

Member Since	‎03-16-2016 04:06 PM
Posts	707
Kudos received	1728

Cloudera Community

Re: 5th attempt at getting an answer to this quest...

Re: Trying to reinstall Apache NiFi 1.5 on HDF 3.1

Re: Is it mandatory that we should have exact moun...

Re: Alternate to smartsense

Re: Tracking of Hive tables metadata changes in re...

Re: Should I learn Scala or Python

Re: Is it possible to load hive table parallely?

Re: Do we have the HDP Sandbox as an Amazon commun...

Re: What is the appropriate GC for Kafka?

Re: select count(*) fails with tez over cassandra

Re: Why Map job is launched when I run SELECT * FR...

Re: Zookeeper on even master nodes

Re: Removing YARN job summary

Re: Devlopment cycle with Hadoop

Re: Issue with HDP clients in datanode