Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1969 | 07-09-2019 12:53 AM |
|  | 11881 | 06-23-2019 08:37 PM |
|  | 9147 | 06-18-2019 11:28 PM |
|  | 10134 | 05-23-2019 08:46 PM |
|  | 4580 | 05-20-2019 01:14 AM |
09-21-2015
10:36 AM
1 Kudo
Using sudo will not pass your local environment forward. Log in as the target user directly and try again instead.
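A minimal illustration of the difference, assuming a bash login shell (the variable and user names are only examples):

```bash
# sudo resets the environment by default (env_reset in sudoers), so variables
# exported in your current shell do not reach the command it runs.
export HADOOP_CONF_DIR=/etc/hadoop/conf        # example variable only
sudo -u someuser env | grep HADOOP_CONF_DIR    # typically prints nothing
su - someuser                                  # logging in as the user loads
                                               # that user's own environment
```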
09-20-2015
07:20 PM
1 Kudo
Looks like the jar is also required on the front end. In addition to the ADD JAR within the prompt, please also launch the CLI this way:

~> export HADOOP_CLASSPATH=$(hbase classpath)
~> hive
09-20-2015
07:56 AM
1 Kudo
The snapshot read path uses a few more jars than the default table read path does, and the error suggests that at least one such extra jar is not in the default set of aux jars pre-added for Hive-HBase integration in CDH. You will need to run "ADD JAR /opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar;" to get this required class onto the Hive CLI classpath.
09-20-2015
07:50 AM
> Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: GSS initiate failed

The error above likely suggests your HMS is configured for security (Kerberos) but that your login lacks a valid TGT (such as one obtained via kinit). Could you post the output of klist, and confirm whether a 'hadoop fs' test already works?
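A quick way to check, assuming a kerberized cluster (the principal below is only a placeholder):

```bash
klist                        # is there a valid, unexpired TGT in the cache?
kinit someuser@EXAMPLE.COM   # placeholder principal; obtains a fresh TGT
hadoop fs -ls /              # a simple HDFS test that also needs a valid TGT
```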
09-20-2015
07:37 AM
Should be a simple extension in bash, if that is what you're looking for:

./count_row.sh tables.txt | paste -s -d+ - | bc

Ref: http://stackoverflow.com/questions/450799/shell-command-to-sum-integers-one-per-line#comment12469220_451204

P.S. It may be more efficient to generate a list of queries and run them via a single hive command, because each separate invocation spins up a whole new JVM.
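A rough sketch of that single-invocation idea, assuming tables.txt holds one bare table name per line (the file names are only examples):

```bash
# Build one query file and run it in a single Hive CLI session,
# instead of launching a new JVM per table.
awk '{ printf "SELECT COUNT(*) FROM %s;\n", $1 }' tables.txt > count_all.sql
hive -S -f count_all.sql | paste -s -d+ - | bc
```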
09-20-2015
06:30 AM
1 Kudo
Sentry, as a service, has its own service-side config maintained and generated by CM within its special process directory. To view any service-side generated configs, visit the role's instance page and then the Processes tab under it. In your case: CM -> Sentry -> Instances -> Sentry Server -> Processes -> sentry-site.xml. Please also consider reading http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_intro_primer.html to better understand the Cloudera Manager architecture.
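Equivalently, on the Sentry Server host itself you can locate the generated file under the agent's process directory - a sketch, assuming default agent paths and a service named "sentry":

```bash
# The newest matching directory belongs to the currently running role and
# contains sentry-site.xml; adjust the glob if your service is named differently.
ls -dt /var/run/cloudera-scm-agent/process/*sentry* 2>/dev/null | head -1
```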
09-20-2015
06:28 AM
Given you want 'engineering' group members to have access to the role 'developer', your grant should be:

GRANT ROLE developer TO GROUP engineering

not:

GRANT ROLE developer TO GROUP hadoop

Or was this already done? Your reply is unclear on this point.
09-20-2015
06:18 AM
Have you or your AD admins also attempted to profile what specific AD operation(s) are pouring in? Are they group lookups, or actual authentication requests? The latter would normally be unexpected, given that the use of tokens avoids repeated re-authentication.

Group lookups are indeed done for every HDFS operation when permissions are in use. However, the groups are also cached internally by HDFS for 5 minutes by default (configurable), and often also by a NameNode-local NSCD or equivalent service. These caches help reduce the backend load, but the lookups are still needed and the cache timeouts are finite, so it would not be too odd to see a lot of group-related requests fired at whatever user directory backend is in use.

Are you already using NSCD? It may help if you aren't, or you can consider raising HDFS's cache timeout: http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-common/core-default.xml#hadoop.security.groups.cache.secs
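If you do raise that timeout, the property from the linked page belongs in core-site.xml (in CM, typically via a core-site.xml safety valve); a sketch with an example value of 15 minutes:

```xml
<!-- Example only: raise the HDFS group-lookup cache TTL from its 300s default -->
<property>
  <name>hadoop.security.groups.cache.secs</name>
  <value>900</value>
</property>
```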
09-20-2015
06:08 AM
Thank you for expanding on the process - it was unclear from the word "last". What you meant was "largest", given the sorting involved. What you are looking to do is emit only the largest (by value) key, i.e. a MAX(…) behaviour in SQL terms. This is simple to perform:

1. In the Mapper's setup(…) call, initialise a zero-valued string (lowest ASCII value) as the base key, along with a zeroed counter.
2. Across all map(…) calls, keep track of whether the current key is greater than the previously encountered key (beginning with the base key set above). Don't emit anything just yet - just keep reassigning the base key whenever the current key is greater than the existing one (and reset the counter to 1). If it is found equal, increment the counter.
3. In the cleanup(…) method, emit just the base key and its counter.
4. Given a MAX-like operation, configure a single reducer, and perform the very same max-tracking/final-emit within the setup(…), reduce(…) and cleanup(…) of the Reducer implementation, but take care to do the count aggregation before the compare, so you get the real count.
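A rough Java sketch of those steps, purely as an illustration - the input parsing (one key per line) and the Text/LongWritable types here are assumptions, not taken from your job:

```java
// Sketch only: emits the single largest key and its count, following the
// setup()/map()/cleanup() pattern above. In the driver, also set
// job.setNumReduceTasks(1) so one reducer sees every mapper's candidate.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxKeySketch {

  public static class MaxKeyMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private String baseKey;   // null stands in for the "lowest possible" key
    private long count;

    @Override
    protected void setup(Context ctx) {
      baseKey = null;         // step 1: base key plus a zeroed counter
      count = 0;
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
      String key = line.toString().trim();             // placeholder parsing
      if (baseKey == null || key.compareTo(baseKey) > 0) {
        baseKey = key;                                  // step 2: larger key found
        count = 1;
      } else if (key.equals(baseKey)) {
        count++;                                        // equal key: bump counter
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      if (baseKey != null) {
        ctx.write(new Text(baseKey), new LongWritable(count));  // step 3: emit once
      }
    }
  }

  public static class MaxKeyReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    private String baseKey;
    private long count;

    @Override
    protected void setup(Context ctx) {
      baseKey = null;
      count = 0;
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> partialCounts, Context ctx) {
      long sum = 0;                                     // step 4: aggregate counts first...
      for (LongWritable c : partialCounts) {
        sum += c.get();
      }
      String k = key.toString();
      if (baseKey == null || k.compareTo(baseKey) > 0) {
        baseKey = k;                                    // ...then compare
        count = sum;
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      if (baseKey != null) {
        ctx.write(new Text(baseKey), new LongWritable(count));
      }
    }
  }
}
```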
09-18-2015
04:59 AM
The value on the doc page is picked as roughly 20% of the RAM reserved for overhead, but you could set it lower. Our past overcommit testing does show that usage can reach close to an extra 20% for some tested workloads, but that is not always the case - and this may also have changed more recently. We're reworking the docs for these recommendations as developments happen. For now, please rely on the XLSX file for a closer guideline on the recommended calculated values.