About Harsh J

seafrog · 08-21-2014

Yes, the user groups on the Impala nodes and the Hive nodes are the same. I finally found the answer to my question: if I set "hive.sentry.restrict.defaultDB" to true in sentry-site.xml, Impala and Hive behave the same way. The difference arises because "hive.sentry.restrict.defaultDB" defaults to false. See line 48 of HiveAuthzConf.java in the Sentry source code.
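To confirm what a given sentry-site.xml actually sets, the file can be parsed directly. This is a minimal sketch, assuming the file uses the usual Hadoop-style configuration/property/name/value layout. The helper name restrict_default_db and the sample XML are made up for illustration, and the fallback of False simply mirrors the default described in the post above.

```python
import xml.etree.ElementTree as ET

PROPERTY = "hive.sentry.restrict.defaultDB"


def restrict_default_db(xml_text):
    """Return the effective boolean value of hive.sentry.restrict.defaultDB.

    Falls back to False when the property is absent, mirroring the default
    described in the post (an assumption taken from that post).
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == PROPERTY:
            value = (prop.findtext("value") or "").strip().lower()
            return value == "true"
    return False


# Hypothetical sentry-site.xml content, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>hive.sentry.restrict.defaultDB</name>
    <value>true</value>
  </property>
</configuration>"""

print(restrict_default_db(SAMPLE))
```

Running the same check against the sentry-site.xml used by Hive and the one used by Impala should show whether both services see the same setting.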

Urantian · 08-18-2014

We are still experiencing periodic problems with applications hanging when a number of jobs are submitted in parallel. We have reduced 'maxRunningApps', increased the virtual core count, and also increased 'oozie.service.callablequeueservice.threads' to 40. In many cases the applications do not hang, but this is not consistent. Regarding YARN issue number 1913 (https://issues.apache.org/jira/browse/YARN-1913), is this patch incorporated in CDH 5.1.0, the version we are using? YARN-1913 indicates that the affected version is 2.3.0 and that it is fixed in 2.5.0. Our Hadoop version in 5.1.0 is 2.3.0. Thank you, Michael Reynolds

tnarayanarao2277 · 08-12-2014

I fixed the problem. I found out that it was not a Flume issue but purely an HDFS issue, so I took the following steps. Step 1: stopped all the services. Step 2: started the NameNode. When I then tried to start the DataNodes on the 3 servers, one of the servers threw the error messages "/var/log/ ----No such file/directory" and "/var/run --No such file/directory". These directories did exist, so I checked their permissions and found that they differed from the second server to the third server. I changed the permissions on those directories so that they were in sync, then started all the services, and Flume worked fine. That's it. Thank you
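One way to spot this kind of mismatch is to print the mode and ownership of the two directories on each DataNode host and compare the output. This is a minimal sketch. The helper name describe_dir is made up for illustration, and the post does not say which permissions were wrong or what they were changed to, so the script only reports what it finds.

```python
import os
import stat


def describe_dir(path):
    """Return (mode string, uid, gid) for a path, or None if it cannot be read."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    return (stat.filemode(st.st_mode), st.st_uid, st.st_gid)


if __name__ == "__main__":
    # Run on every DataNode host and compare the lines between hosts.
    for path in ("/var/log", "/var/run"):
        print(path, describe_dir(path))
```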

webtransactor · 08-06-2014

We upgraded our cluster to CDH 5.1.1 and the problem disappeared, so I currently cannot reproduce it. Thanks for the tip about webhdfs.

Harsh J · 08-05-2014

You are correct, though, that this does not exist as a current feature. Please consider filing an HBase project JIRA upstream requesting this (implementation patches are welcome too!) at https://issues.apache.org/jira/browse/HBASE.

Harsh J · 07-27-2014

(1) The "driver" part of the run/main code that sets up and submits a job executes where you invoke it. It does not execute remotely.
(2) See (1), because it invalidates the supposition. For the actual Map and Reduce code execution, however, the point is true.
(3) This is true as well.
(4) This is incorrect. All data received by the output "collector" is stored to disk (in MR-provided storage termed 'intermediate storage') after it runs through the partitioner (which divides it into individual local files, one per target reducer) and the sorter (which runs quick sorts on each whole partition segment).
(5) Functionally true, but it is actually the Reduce that "pulls" the map outputs stored across the cluster, instead of something sending the data to the reducers (i.e. push). The reducer fetches its specific partition file from every executed map that produced one, and merge sorts all these segments before invoking the user API of the reduce(…) function. The merge sorter does not require that the entire set of segments fit into memory at once; it does the work in phases if it does not have adequate memory. However, if the entire fetched output does not fit into the allotted disk of the reduce task host, the reduce task will fail. We try to approximate this and avoid scheduling reduces on such a host, but if no host can fit the aggregate data, then you will likely want to increase the number of reducers (partitions), which naturally divides up the amount of data received per reduce task.
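The partition, sort, pull, and merge flow from points (4) and (5) can be imitated in a few lines. This is only an in-memory toy sketch, not how MapReduce is implemented: real map outputs live on local disk and are fetched over the network. The function names, the hash-modulo partitioning, and the word count example are all assumptions chosen for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def run_map_task(records, map_fn, num_reducers):
    """Run one map task and return one sorted segment per reducer.

    Each emitted (key, value) pair goes through a hash partitioner, then each
    partition is sorted by key, imitating the per-reducer intermediate files.
    """
    partitions = [[] for _ in range(num_reducers)]
    for record in records:
        for key, value in map_fn(record):
            partitions[hash(key) % num_reducers].append((key, value))
    return [sorted(p, key=itemgetter(0)) for p in partitions]


def run_reduce_task(reducer_id, map_outputs, reduce_fn):
    """Pull this reducer's segment from every map output, merge, then reduce."""
    segments = [output[reducer_id] for output in map_outputs]
    # heapq.merge streams the already-sorted segments without re-sorting them.
    merged = heapq.merge(*segments, key=itemgetter(0))
    results = []
    for key, group in groupby(merged, key=itemgetter(0)):
        results.append(reduce_fn(key, [value for _, value in group]))
    return results


def word_count_map(line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    return (word, sum(counts))


# Two input splits, each handled by its own map task.
splits = [["a b a"], ["b c"]]
NUM_REDUCERS = 2
map_outputs = [run_map_task(s, word_count_map, NUM_REDUCERS) for s in splits]

totals = {}
for reducer_id in range(NUM_REDUCERS):
    totals.update(run_reduce_task(reducer_id, map_outputs, word_count_reduce))
print(sorted(totals.items()))  # [('a', 2), ('b', 2), ('c', 1)]
```

Raising NUM_REDUCERS spreads the keys over more segments, which is the same lever the post suggests when a single reduce task receives more data than its host can hold.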

doubleocherry · 07-21-2014

I initially found this confusing, because the Python library for the Cloudera Manager API lacks helper functions for this API endpoint. Nonetheless, it is easy to implement the API call in Python. I will look into adding a helper class to the open-source Python library.

    import requests
    from urllib import quote  # Python 2; on Python 3 use: from urllib.parse import quote

    HOST = 'myhost'
    CLUSTER_NAME = 'mycluster'
    SERVICE = 'mapreduce1'
    ACTIVITY_ID = 'your_activity_job_id'
    USERNAME = 'your_username'  # placeholder
    PASSWORD = 'your_password'  # placeholder

    # Path of the activity metrics endpoint
    parameters = 'clusters/%s/services/%s/activities/%s/metrics' % (
        CLUSTER_NAME, SERVICE, ACTIVITY_ID)
    # requests needs an explicit scheme in the URL
    url = 'http://%s:7180/api/v1/%s' % (HOST, quote(parameters))
    r = requests.get(url, auth=(USERNAME, PASSWORD))
    print(r.json())

Bommuraj Paramaraj · 07-21-2014

Thank you Harsh for your email! I was hitting the issue below. I increased "dfs.image.transfer.timeout" and that fixed it. https://issues.apache.org/jira/browse/HDFS-4301 Checkpointing was working fine, but the issue started when my fsimage size reached 2.1 GB. Best Regards, Bommuraj

Bommuraj Paramaraj · 07-21-2014

Thank you Harsh. It's working!

Harsh J · 07-20-2014

With CDH4 and CDH5 there is no longer a 'HADOOP_HOME' env-var. It has instead been renamed to 'HADOOP_PREFIX', which for a default parcel environment can be set to /opt/cloudera/parcels/CDH/lib/hadoop.
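A script that has to run on both older and newer installs can look the location up instead of hard-coding it. This is a minimal sketch. The helper name resolve_hadoop_prefix and the lookup order (HADOOP_PREFIX first, then the legacy HADOOP_HOME, then the default parcel path quoted above) are assumptions made for illustration, not CDH behavior.

```python
import os

# Default parcel location quoted in the post above.
DEFAULT_PARCEL_HADOOP = "/opt/cloudera/parcels/CDH/lib/hadoop"


def resolve_hadoop_prefix(env=None):
    """Return the Hadoop install directory.

    Checks HADOOP_PREFIX, then the legacy HADOOP_HOME, and finally falls
    back to the default parcel path.
    """
    env = os.environ if env is None else env
    for name in ("HADOOP_PREFIX", "HADOOP_HOME"):
        value = env.get(name)
        if value:
            return value
    return DEFAULT_PARCEL_HADOOP


# Example: locate the lib directory under the resolved prefix.
print(os.path.join(resolve_hadoop_prefix(), "lib"))
```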

Member Since	07-31-2013 07:21 AM
Posts	1,924
Kudos received	461

Cloudera Community

Re: S3Guard Suggested to help fix Consistency

Re: Failed to start namenode. java.io.FileNotFound...

Re: sqoop import issue

Re: Efficient ways to store many images files

Re: S3 loading into HDFS

Re: Hive cannot hide default database with Sentry

Re: Yarn applications hang foreever if run in para...

Re: hdfs block size reducing

Re: On CDH5 I cannot acces files with hftp

Re: Is there a graceful way to failover hbase mast...

Re: NewBee Question on Map reduce

Re: Programmatically tracking MR Job status using ...

Re: checkpoint is not occuring

Re: /getimage: java.io.IOException: GetImage faile...

Re: Where is $HADOOP_HOME/lib on CDH 5.0.1, Parcel...