Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1984 | 07-09-2019 12:53 AM |
| | 11940 | 06-23-2019 08:37 PM |
| | 9196 | 06-18-2019 11:28 PM |
| | 10189 | 05-23-2019 08:46 PM |
| | 4610 | 05-20-2019 01:14 AM |
02-27-2016
09:27 PM
We do not currently recommend the use of StorageBasedAuthorizationProvider. While Sentry's initial setup (especially with HDFS ACL sync enabled) may seem a little involved, it's much simpler than ending up in the longer-term situation of managing several HDFS paths and keeping them controlled manually. That fix is currently not in scope for a backport, since this plugin is not supported for use in a CDH environment, but it may be added in the future (for instance, if/when a rebase occurs).
02-27-2016
09:23 PM
Are you having issues creating a table backed by the Avro format? What is your CREATE TABLE statement, and how are you loading the file into the table? For more on creating and using Avro tables in Hive, see http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hive.html
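For illustration, a minimal sketch of one way to do this (the table, schema, and file paths are hypothetical; "STORED AS AVRO" requires a Hive version that supports it, while older releases must name the AvroSerDe classes explicitly):

```
# Create a Hive table backed by Avro, with its schema kept on HDFS
# (table and path names are illustrative):
hive -e "CREATE TABLE my_avro_table
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/my_avro_table.avsc');"

# Move an existing Avro data file into the table's directory:
hive -e "LOAD DATA INPATH '/user/me/data.avro' INTO TABLE my_avro_table;"
```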
02-27-2016
09:21 PM
What is your CREATE TABLE statement? Are you specifying the right field delimiter character (\t if your data is tab-separated)? The default delimiter otherwise is ^A (\001), which your data likely does not carry.
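As a minimal sketch (the table name and columns are hypothetical), declaring the delimiter explicitly looks like this:

```
# Declare the tab delimiter at table creation; without this, Hive
# expects the default ^A (\001) separator:
hive -e "CREATE TABLE my_tsv_table (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;"
```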
02-25-2016
05:06 PM
The listener feature exists to make Kafka Brokers listen on multiple ports: http://kafka.apache.org/documentation.html#security_configbroker. Specifying hostnames along with it is just a way of binding to a specific interface (i.e. when it is not to be wild-carded). The validation that each element use a different port when multiple listeners are specified is therefore the right thing for it to do. If you want a single port listened to globally, simply use PLAINTEXT://0.0.0.0:9092 instead of a multi-host list.
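To illustrate both patterns (broker hostname and ports below are hypothetical):

```
# Sketch of the two server.properties forms discussed above:
#
#   # multiple listeners -- each entry must use a distinct port:
#   listeners=PLAINTEXT://broker1.example.com:9092,SSL://broker1.example.com:9093
#
#   # a single port bound on all interfaces:
#   listeners=PLAINTEXT://0.0.0.0:9092

# Quick check that the broker ended up listening where expected:
netstat -lnt | grep -E ':(9092|9093)'
```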
02-25-2016
03:23 PM
1 Kudo
Unless "hdfs groups x" returns both "A" and "B" in its results, HDFS/HBase/etc. would not be aware of such a relationship. If you use SSSD+LDAP to resolve groups for your OS, you can request it to query a certain level of nesting. See ldap_group_nesting_level under http://linux.die.net/man/5/sssd-ldap.
02-25-2016
03:00 PM
Sure - Spark is a pure YARN app for the most part, with few to no server-side components. As long as you submit your application with the right Spark tarball/binary, that Spark version will be used to run that particular application. Multiple Spark History Servers, if needed, can also be run with separate configs and ports. Note that CDH-wise, we ship only one Spark version, bound to its CDH version by build. Formal support of versions other than the CDH-provided one is not covered (if you have a subscription).
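A minimal sketch of submitting from an alternative tarball (all paths and the class name are hypothetical):

```
# Point the unpacked tarball's spark-submit at the cluster's Hadoop
# configs and run the job on YARN:
export HADOOP_CONF_DIR=/etc/hadoop/conf
/opt/spark-alt/bin/spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  /opt/jobs/myapp.jar
```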
02-22-2016
02:20 AM
1 Kudo
> - is the average load in bytes/KB?
> - is the average done per day/week?

It is the average number of regions hosted by each RS. It's computed when you run the status command.

> - why is the average different when I do only status, versus status 'summary' and status 'replication'?

I'm not sure I follow. Could you post the observed difference?

> - what is the meaning of the 'aggregate load' indicator?

The load is aggregated from the number of requests per second (requestsPerSecond).

> - does the compactionProgressPct correspond to major_compact?

Yes, and it measures the progress of a specific major-compacting region at the point in time the status command is run.

> - what is the meaning of totalCompactingKVs / currentCompactedKVs?

Compactions rewrite the KV pairs inside HFiles into new (fewer, or a single) HFiles. These numbers track the total KVs participating in the tracked compaction, and the count of how many have been persisted so far.
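For reference, the status variants mentioned above can be run non-interactively like so:

```
# Each variant prints a different level of cluster detail:
echo "status" | hbase shell
echo "status 'summary'" | hbase shell
echo "status 'replication'" | hbase shell
```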
02-18-2016
01:03 AM
1 Kudo
You can get your live RegionServer IDs with startcodes included via the HBase Shell command:

status 'simple'

An output line from this, such as the below:

host.cloudera.com:60020 1455726247381

can then be converted into the right format:

host.cloudera.com,60020,1455726247381
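If you have many lines to convert, a one-liner along these lines can do it (a sketch; the input is the example line from above):

```
# Turn "host:port startcode" from status 'simple' into "host,port,startcode";
# sub() rewrites the first ":" on the line, after which the fields re-split:
echo "host.cloudera.com:60020 1455726247381" | awk '{sub(":", ","); print $1","$2}'
# -> host.cloudera.com,60020,1455726247381
```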
02-18-2016
12:59 AM
1 Kudo
The "admin" user is something usually used within Hue (it could of course be a valid user in your environment, but this is the only assumption I can draw). HS2 cleans the temporary elements if the session holding the query that created it, has terminated. With Hue, especially on versions prior to CDH 5.2.0, you may have a situation where the admin user's sessions have never been closed/terminated, and the HS2 continues to hold references of the queries that user ran in past, whereas the other usernames are likely ending their Hue backed sessions correctly (depends on how they're working over Hue). If you have CDH 5.2.0 or above, consider setting the various idle server-side timeouts under CM -> Hive -> Configuration (search "idle").
02-17-2016
09:12 PM
CDH 5.4 had Spark 1.3.0 plus patches, which per the blog post seems like it would not work either (it quotes a "strong dependency", which I take to mean ONLY 1.4.1?). CDH 5.5.x onwards carries Spark 1.5.x with patches. There has been no CDH5 release with Spark 1.4.x in it. You could use an Apache Spark 1.4.1 release from upstream, manually rebuilt against your CDH5 version of Apache Hadoop, and use the tarball's paths for all Spark operations; this should work. However, such a Spark deployment would not be officially supported by Cloudera Support (if you have a subscription).
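A sketch of such a rebuild (the -Dhadoop.version string is illustrative; match it to your cluster's CDH release, and note you may need Cloudera's Maven repository configured for the cdh-suffixed Hadoop artifacts):

```
# Fetch the upstream source and build a distribution against CDH's Hadoop:
wget https://archive.apache.org/dist/spark/spark-1.4.1/spark-1.4.1.tgz
tar xzf spark-1.4.1.tgz && cd spark-1.4.1
./make-distribution.sh --tgz \
  -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.1 \
  -Pyarn -Phive -Phive-thriftserver
```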