Member since: 06-23-2014
Posts: 19
Kudos Received: 0
Solutions: 0
06-18-2015
12:17 PM
I have a cluster with the following parameters:
- Replication Factor (dfs.replication) set to 2
- Minimal Block Replication (dfs.replication.min, dfs.namenode.replication.min) set to 2
- Maximal Block Replication (dfs.replication.max) set to 4

I changed the Maximal Block Replication to 3 and restarted HDFS. Now MapReduce jobs fail with:

"Requested replication 4 exceeds maximum 3 at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.verifyReplication"

Why? Shouldn't this new value be used as the max? Why is the system attempting to use 4 as the replication factor?
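For what it's worth, the check behind that error can be modeled in a few lines. This is a simplified Python sketch of the range check that BlockManager.verifyReplication performs, not Hadoop's actual code; the function name and defaults here are illustrative:

```python
def verify_replication(requested, min_rep=2, max_rep=3):
    """Simplified model of HDFS's replication-range check.

    Any requested replication outside [min_rep, max_rep] is rejected,
    which is why a job still asking for 4 fails once the max is 3.
    """
    if requested > max_rep:
        raise ValueError("Requested replication %d exceeds maximum %d"
                         % (requested, max_rep))
    if requested < min_rep:
        raise ValueError("Requested replication %d is less than minimum %d"
                         % (requested, min_rep))
    return requested

verify_replication(2)  # fine: within [2, 3]
```

The point being: lowering dfs.replication.max does not change what clients request, it only changes what the NameNode accepts.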
06-11-2015
03:06 PM
I think the best answer to this question comes from Allen Wittenauer of LinkedIn: http://qr.ae/7GNMu9 He writes:

At LinkedIn, I tend to tell users that their ideal number of reducers should be the optimal value that gets them closest to:
- A multiple of the block size
- A task time between 5 and 15 minutes
- Creating the fewest files possible
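As a rough illustration of the "multiple of the block size" point, a reducer count could be estimated from the expected output size, so each reducer writes about one block's worth of data. The function name and defaults below are mine, purely a sketch:

```python
def estimate_reducers(output_bytes, block_size=128 * 1024 * 1024):
    """Illustrative heuristic: aim for each reducer to produce roughly
    one block's worth of output, so files align with the block size."""
    return max(1, output_bytes // block_size)

estimate_reducers(10 * 1024**3)  # 10 GB of output at 128 MB blocks -> 80
```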
06-11-2015
02:08 PM
I've read conflicting advice about the correct value for the "Default Number of Reduce Tasks per Job" (mapreduce.job.reduces) parameter in YARN. Cloudera Manager's default is listed as "1", but other documentation claims this value should be set to "99% of reduce capacity", which, in the case of a 100-node cluster, might be 99. What is the recommended value for this parameter on a busy cluster with many jobs running?
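For comparison, the classic Hadoop tutorial guidance derived the reducer count from cluster capacity rather than a fixed default: 0.95 or 1.75 multiplied by (nodes × reduce slots per node). A rough Python rendering (the function name is mine; the factors are from that guidance):

```python
def reducers_from_capacity(nodes, reduce_slots_per_node, aggressive=False):
    """Rough rendering of the classic Hadoop guidance:
    0.95 * capacity launches all reducers in a single wave;
    1.75 * capacity lets faster nodes run a second wave."""
    factor = 1.75 if aggressive else 0.95
    return int(factor * nodes * reduce_slots_per_node)

reducers_from_capacity(100, 1)  # 95 on a 100-node cluster with one slot each
```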
02-17-2015
04:58 PM
So, did you figure this one out?
02-08-2015
03:39 PM
Unfortunately, no, this issue has not been resolved. Widening the time window doesn't change anything.
02-03-2015
09:47 AM
For some reason, job history is no longer being shown in Cloudera Manager (5.2). I cannot see any job history via the YARN Applications tab, nor is it available via the Hadoop YARN History UI. It IS possible for me to see the history of a job via "yarn logs -applicationId <app_id>" from the command line. There is no indication of why this is happening, and the History Server role in Cloudera Manager seems to be operational. Any idea what could be the problem?
12-04-2014
01:20 PM
Thanks for the update, dlo. Looks like this might be a case of my making the decommission call, then not waiting for it to complete before making the delete_host/delete_cluster calls. I would obviously expect the API not to allow us to remove either hosts or clusters before the node has been completely decommissioned (i.e., I would make sure this is not a missing automated test on your end, as it borked a Cloudera Manager instance that was managing a production cluster). The documentation appears to be quite sparse, but in the case of decommissioning some hosts without removing the entire cluster, what is the recommended way to accomplish this? Is it to first make a call to decommission the node(s), then poll the CommissionState of the node(s) before deleting? http://cloudera.github.io/cm_api/apidocs/v8/ns0_apiCommissionState.html Thanks!
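A generic polling helper for that decommission-then-delete flow might look like the sketch below. `get_state` is injected so the example stays independent of the cm_api client; in practice it would re-fetch the host and return its commissionState field. The state name follows the apiCommissionState values, but the helper itself is my own sketch, not part of the API:

```python
import time

def wait_for_state(get_state, target="DECOMMISSIONED", timeout=600, interval=5):
    """Poll get_state() until it returns target or the timeout expires.

    Returns True once the target state is observed, False on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_state() == target:
            return True
        time.sleep(interval)
    return False
```

Only after this returns True for every host would the delete_host/delete_cluster calls be made.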
12-04-2014
11:58 AM
Looks like no one else has hit this problem?
11-26-2014
04:08 PM
Thanks, everyone. I posted the bug report in the Cloudera Manager forum here: http://community.cloudera.com/t5/Cloudera-Manager-Installation/Cloudera-Express-5-2-0-Decomissioning-Nodes-via-API-causes/m-p/22049#U22049 Essentially, decommissioning hosts using the API renders every host on the cluster inoperable...
11-26-2014
04:07 PM
Using the Cloudera API to decommission nodes on a cluster causes both NullPointer exceptions and cache exceptions. This error causes all clusters being managed to become unusable: it is impossible to update configurations, and all host monitor functions break. It seems like this API call might be causing an error in Cloudera Manager's underlying Postgres database? Cloudera version: Cloudera Express 5.2.0 (#60 built by jenkins on 20141012-2239 git: 179000584849e68f98ad2a7fe710723bd6c29c98)

Example code (Python):

from cm_api.api_client import ApiResource
from cm_api.endpoints import clusters
from cm_api.endpoints.cms import ClouderaManager

api_resource = ApiResource('cloudera', 7180, username='XXXXXX', password='XXXXXXX', version=7)
cloudera_manager = ClouderaManager(api_resource)

# Get the cluster and the host ids to remove
cluster = clusters.get_cluster(api_resource, 'mycluster')
host_ids = ['host_id1', 'host_id2']

# Decommission the hosts
cloudera_manager.hosts_decommission(host_ids)

# Delete the services
cluster.delete_service('yarn-service')
cluster.delete_service('hdfs-service')

# Delete the hosts
for host_id in host_ids:
    api_resource.delete_host(host_id)

# Delete the cluster by name
api_resource.delete_cluster('mycluster')
11-24-2014
06:52 PM
Thanks, GautamG! And in the Community edition, is there no place to at least post a bug report? - Michael
11-24-2014
06:39 PM
While using Cloudera Manager 5.x, I've found a potentially serious bug after using the API to delete host nodes. Where should I report this bug?
10-08-2014
10:50 AM
Hi community: This seems like a stupid question, and maybe I am missing an obvious config option, but for some reason I cannot see any finished jobs of any type in CM's YARN application view. In the Cloudera Manager -> YARN -> Applications tab, only currently running YARN applications (or those pending or recently killed) are visible. Otherwise the listing is blank. Any ideas? Thanks!
09-17-2014
02:57 PM
Hi everyone: I am using Cloudera Express 5.1.2, and I would like to be able to set the hostname for the Job and Application History web UIs for YARN (and other services). Sometimes Cloudera Manager launches windows using the hostname, sometimes the IP address, sometimes localhost; it can be difficult to use without extra configuration.
07-21-2014
01:38 PM
I initially found this confusing, because the Python library for the Cloudera Manager API lacks helper functions for this API endpoint. Nonetheless, it is easy to implement the API call in Python. I will look into adding a helper class to the open-source Python library.

import urllib

import requests

HOST = 'myhost'
USERNAME = 'your_username'
PASSWORD = 'your_password'
CLUSTER_NAME = 'mycluster'
SERVICE = 'mapreduce1'
ACTIVITY_ID = 'your_activity_job_id'

# Build the activity-metrics endpoint path and fetch it over HTTP
parameters = 'clusters/%s/services/%s/activities/%s/metrics' % (
    CLUSTER_NAME, SERVICE, ACTIVITY_ID)
url = 'http://%s:7180/api/v1/%s' % (HOST, urllib.quote(parameters))
r = requests.get(url, auth=(USERNAME, PASSWORD))
print r.json()
07-16-2014
10:34 AM
I am running some tests using compressed files and small block sizes. I tried to set dfs.block.size through Cloudera Manager to 8 MB, and received the error: "8388608 less than 16777216." Is 16 MB a hard minimum for the dfs.block.size parameter, or is there potentially another setting that conflicts with dfs.block.size?
Tags: block size, HDFS
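Since the error message is just comparing byte counts, a quick sanity check of the two numbers (pure arithmetic, nothing Cloudera-specific; the function name is mine):

```python
def block_size_ok(proposed_bytes, minimum_bytes=16 * 1024 * 1024):
    """The error "8388608 less than 16777216" in bytes:
    8 MB proposed vs. an apparent 16 MB floor."""
    return proposed_bytes >= minimum_bytes

block_size_ok(8 * 1024 * 1024)  # False: 8 MB is below the enforced floor
```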
06-30-2014
03:14 PM
Can you explain the use case? Why use Hadoop/Cloudera if you are simply using a single machine?
06-30-2014
03:13 PM
Hi Cloudera community: tl;dr: Is there a more straightforward way to query the Cloudera Manager API for information (mapper/reducer completion, bytes processed, etc.) about jobs, perhaps by simply providing a jobId or a job name?

I am using a Python script to check on the status of various MapReduce jobs via the Cloudera Manager API, roughly like this:

from cm_api.api_client import ApiResource

api = ApiResource('zzzz',
                  version=1,
                  username='zzz',
                  password='zzz')
for s in api.get_cluster('my cluster').get_all_services():
    if s.name == 'MR':
        # my activities are in s.get_running_activities()
        activities = s.get_running_activities()

I then retrieve the job IDs from the MR activities and use the Hadoop command 'mapred job -status' to ascertain information about them. I am using CDH 4, and I am not currently using YARN. Thanks!