Created 11-25-2021 08:26 PM
Hello,
The error message below keeps occurring.
It looks like the timeout starts with the API requests that SMM sends to CM.
I would appreciate overall guidance on how to resolve this.
TimePeriod : LAST_ONE_WEEK, Error while fetching cluster metrics : [
MetricDescriptor{metricName=MetricName(name=sum(kafka_bytes_fetched_by_partition_rate), tags=[partition, serviceName, topic], valueType=LONG, singlePointOfValue=true), queryTags={serviceName=kafka, topic=%, partition=%}, aggrFunction=SUM, postProcessFunction=null, valueType=LONG},
MetricDescriptor{metricName=MetricName(name=sum(kafka_messages_received_by_partition_rate), tags=[partition, serviceName, topic], valueType=LONG, singlePointOfValue=true), queryTags={serviceName=kafka, topic=%, partition=%}, aggrFunction=SUM, postProcessFunction=null, valueType=LONG},
MetricDescriptor{metricName=MetricName(name=sum(kafka_bytes_received_by_partition_rate), tags=[partition, serviceName, topic], valueType=LONG, singlePointOfValue=true), queryTags={serviceName=kafka, topic=%, partition=%}, aggrFunction=SUM, postProcessFunction=null, valueType=LONG}]
com.hortonworks.smm.kafka.services.common.errors.InvalidCMApiResponseException: Invalid response returned CM API: http://icahubkafka005.datahub.skhynix.com:7180/api/v32/timeseries, response.status: 500,response.message: { "message" : "java.util.concurrent.TimeoutException" }
    at com.hortonworks.smm.kafka.services.metric.cm.CMMetricsFetcher.cmApiCall(CMMetricsFetcher.java:389)
    at com.hortonworks.smm.kafka.services.metric.cm.CMMetricsFetcher.cmApiPost(CMMetricsFetcher.java:368)
    at com.hortonworks.smm.kafka.services.metric.cm.CMMetricsFetcher.getMetricsFromCmApi(CMMetricsFetcher.java:479)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1699)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
    at java.util.stream.AbstractTask.compute(AbstractTask.java:316)
    at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
    at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
    at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
    at java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:474)
    at com.hortonworks.smm.kafka.services.metric.cm.CMMetricsFetcher.queryMetrics(CMMetricsFetcher.java:464)
    at com.hortonworks.smm.kafka.services.metric.cm.CMMetricsFetcher.getClusterMetrics(CMMetricsFetcher.java:184)
    at com.hortonworks.smm.kafka.services.metric.cache.MetricsCache$RefreshMetricsCacheTask.lambda$null$21(MetricsCache.java:623)
    at com.hortonworks.smm.kafka.services.metric.cache.MetricsCache$RefreshMetricsCacheTask.fetchMetrics(MetricsCache.java:575)
    at com.hortonworks.smm.kafka.services.metric.cache.MetricsCache$RefreshMetricsCacheTask.lambda$refreshClusterMetrics$22(MetricsCache.java:622)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
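When the same exception repeats many times in the SMM log, it can help to pull out just the failing CM endpoint, the HTTP status, and the response message. The following is a minimal sketch that assumes the log text follows the format quoted above; the function name `parse_cm_error` and the pattern are illustrative, not part of SMM or CM.

```python
import re

# Pattern for the InvalidCMApiResponseException text as quoted in this post.
# Assumption: other SMM versions may format the message differently.
CM_ERROR = re.compile(
    r"Invalid response returned CM API: (?P<url>\S+), "
    r"response\.status: (?P<status>\d+),\s*"
    r"response\.message: (?P<message>\{.*?\})"
)


def parse_cm_error(log_text):
    """Return (url, status, message) for the first CM API failure, or None."""
    match = CM_ERROR.search(log_text)
    if match is None:
        return None
    return match.group("url"), int(match.group("status")), match.group("message")


# Sample copied from the error in the question.
sample = (
    "InvalidCMApiResponseException: Invalid response returned CM API: "
    "http://icahubkafka005.datahub.skhynix.com:7180/api/v32/timeseries, "
    "response.status: 500,response.message: "
    '{ "message" : "java.util.concurrent.TimeoutException" }'
)
print(parse_cm_error(sample))
```

Here the parsed result points at the `/api/v32/timeseries` endpoint with status 500 and a `TimeoutException` message, which matches the reading that the timeout originates on the CM side rather than inside SMM.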
Created 12-02-2021 05:11 AM
Hello,
These timeout exceptions relate to the CM Metrics Store (firehose) being overloaded.
Please review the article below: https://community.cloudera.com/t5/Customer/How-to-enable-the-entity-summary-servlet-in-Cloudera-Mana...
In the entity summary, check the KAFKA_PRODUCER and KAFKA_CONSUMER entity counts. If there are too many entities (millions), SMON may request a large amount of memory to process the metrics, which can cause timeout exceptions in the SMM server.
Alternatively, resetting (deleting) the Firehose LevelDB storage could be an option to recover from this.
If the SMM server is getting timeout exceptions, also check the SMM heap size. Increasing it is recommended, depending on the number of resources being monitored; acceptable values for production environments are between 8 and 16 GB for SMM.
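To check whether the Service Monitor itself is slow, independent of SMM, one option is to send a timeseries query straight to the CM API and time it. The sketch below is only an illustration: the host, credentials, and tsquery string in the usage note are placeholders, the helper names are hypothetical, and it uses a GET request with the `query` parameter, whereas the stack trace shows SMM issuing a POST.

```python
import base64
import time
import urllib.error
import urllib.parse
import urllib.request


def build_timeseries_url(base_url, tsquery, api_version="v32"):
    """Build a GET URL for the CM timeseries endpoint seen in the stack trace."""
    params = urllib.parse.urlencode({"query": tsquery})
    return f"{base_url.rstrip('/')}/api/{api_version}/timeseries?{params}"


def time_cm_query(url, user, password, timeout_s=120):
    """Run one query and return (HTTP status or None, elapsed seconds)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # e.g. 500 when CM reports a TimeoutException
    except OSError:
        status = None  # connection failure or client-side timeout
    return status, time.monotonic() - start
```

A possible usage, with placeholder values, would be `time_cm_query(build_timeseries_url("http://cm-host.example.com:7180", "select kafka_bytes_fetched_by_partition_rate where serviceName=kafka"), "admin", "secret")`. If a narrow query like this already takes a long time or returns 500, that is consistent with an overloaded firehose rather than an SMM heap problem.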
Created 12-06-2021 09:50 PM
@jaeseung, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
Regards,
Vidya Sargur,