Member since 03-16-2016 | 707 Posts | 1753 Kudos Received | 203 Solutions
09-26-2016
06:05 PM
6 Kudos
@Arkaprova Saha It depends on how you see yourself and your future. If you consider yourself a software engineer with a solid Java background who wants to deliver highly optimized and scalable software products based on Spark, then you may want to focus more on Scala. If you are more focused on data wrangling, discovery and analysis, short-term studies, or resolving business problems as quickly as possible, then Python is awesome. Python has a large community and plenty of code snippets, applications, etc. Don't get me wrong, Python can also be used to deliver enterprise-level applications, but Java and Scala are used more often for highly optimized ones. Python has some pitfalls, which we will not debate here. Anyhow, I would say that Python is kind of a MUST HAVE and Scala is NICE TO HAVE. Obviously, this is my 2c, and I would be amazed if any of the responses in this thread were the ANSWER.
09-26-2016
05:52 PM
4 Kudos
@Bala Vignesh N V If your table is an actual Hive table (not an external table), it is ACID-enabled (which requires the ORC file format), Hive/Tez is enabled globally for parallelism, and you write those SQL statements as separate jobs, then YES. The assumption is that you run one of the versions of Hive capable of ACID, which you most likely do if you use anything released in the last 1.5-2 years.
09-24-2016
02:12 AM
@Shankar P Following @cduby's suggestion, you can always create a single-node cluster, like a sandbox.
09-23-2016
06:30 PM
3 Kudos
The LinkedIn article is old. The Kafka documentation recommends the G1 collector: http://kafka.apache.org/documentation.html#java
10-11-2016
08:10 AM
Yes, it was me who created the ticket.
09-19-2016
05:20 PM
2 Kudos
@srinivasa rao I guess you read that when you run a "select * from <tablename>", Hive fetches the whole data from the file as a FetchTask rather than as a MapReduce task. The FetchTask just dumps the data as it is without doing anything to it, similar to "hadoop dfs -text <filename>". However, this does not take advantage of true parallelism. In your case, with 1 GB, it will not make a difference, but imagine a 100 TB table read by a single-threaded task in a cluster with 1000 nodes. FetchTask is not a good use of parallelism. Tez provides some options to split the data set to allow true parallelism: tez.grouping.max-size and tez.grouping.min-size are the split parameters. Ref: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html If any of the responses was helpful, please don't forget to vote/accept the answer.
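To get a rough feel for how the two grouping parameters bound parallelism, here is a minimal Python sketch. It is only back-of-the-envelope arithmetic, not Tez's actual grouping algorithm (which also takes available cluster capacity into account). The function name `estimate_task_range` and the 16 MiB / 1 GiB sizes are assumptions chosen for illustration, not recommended settings.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return (a + b - 1) // b


def estimate_task_range(total_bytes, min_size, max_size):
    """Return (fewest, most) tasks if every grouped split is between
    min_size and max_size bytes.

    Hypothetical helper: a simplification of what tez.grouping.min-size
    and tez.grouping.max-size allow, not what Tez will actually pick.
    """
    if not 0 < min_size <= max_size:
        raise ValueError("expected 0 < min_size <= max_size")
    if total_bytes <= 0:
        return (0, 0)
    fewest = ceil_div(total_bytes, max_size)  # largest allowed groups
    most = ceil_div(total_bytes, min_size)    # smallest allowed groups
    return (fewest, most)


if __name__ == "__main__":
    # Illustrative sizes only (assumed for this example).
    for label, size in (("1 GB table", 1 * GIB), ("100 TB table", 100 * TIB)):
        fewest, most = estimate_task_range(size, 16 * MIB, 1 * GIB)
        print(f"{label}: between {fewest} and {most} tasks")
```

With these illustrative sizes, the 1 GB table can collapse into a single task, while the 100 TB table has room for at least 102400 tasks, which is where a single-threaded FetchTask would leave a 1000-node cluster idle.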
01-11-2019
02:36 PM
@Constantin Stanca Hi, could you please explain why there could be a split-brain situation when the number of ZooKeeper nodes is even? Thanks~
10-12-2016
05:36 AM
1 Kudo
@Constantin Stanca Hi Constantin, The issue was that a hadoop folder had previously been created under /usr/hdp. There should be only two folders under /usr/hdp, named 2.4.2.0-258 and current, and nothing else. After removing the hadoop folder from /usr/hdp, the issue was resolved. Thanks, Rahul
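For anyone who wants to check for the same problem, here is a minimal Python sketch that lists stray entries under /usr/hdp. The function name `unexpected_entries` and the version-folder pattern are assumptions made for illustration: the pattern simply generalizes the 2.4.2.0-258 name mentioned above and may not match every release.

```python
import re
from pathlib import Path

# Assumed pattern for version folders such as "2.4.2.0-258".
VERSION_DIR = re.compile(r"^\d+(\.\d+)+-\d+$")


def unexpected_entries(hdp_root):
    """Return sorted names under hdp_root that are neither 'current'
    nor a version-style folder (hypothetical helper)."""
    root = Path(hdp_root)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.name != "current" and not VERSION_DIR.match(entry.name)
    )


if __name__ == "__main__":
    root = Path("/usr/hdp")
    if root.is_dir():
        print(unexpected_entries(root))
    else:
        print(f"{root} not found on this machine")
```

The sketch only reports names. Whether a reported entry is safe to remove still needs to be checked by hand.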