Member since: 09-25-2018
Posts: 99
Kudos Received: 6
Solutions: 5
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2282 | 11-03-2021 02:55 AM |
|  | 1732 | 09-21-2020 10:04 PM |
|  | 3094 | 08-14-2020 03:20 AM |
|  | 4200 | 08-20-2019 11:07 PM |
|  | 9994 | 01-06-2019 07:32 PM |
08-19-2024
07:28 AM
Hello @wert_1311 Thank you for bringing this to our Community. I see this has already been requested by you here: [0a] https://community.cloudera.com/t5/Support-Questions/Monitor-alert-long-running-Airflow-jobs/m-p/388314/highlight/true#M246627 Did it not help? If not, please try the following.

Airflow exposes the following metrics that can be used: [0b] https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions

For the integration, I'd suggest you explore the feasibility based on your use case. For example, the metrics configuration for StatsD and OpenTelemetry is described here: [0c] https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metrics-configuration

I am also citing third-party projects that may be helpful: [0d] https://github.com/search?q=airflow+prometheus&type=repositories

Let us know how it goes. V
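For reference, a minimal sketch of enabling StatsD emission via environment variables, assuming an Airflow 2.x install with the statsd extra available; the host and port are placeholders for your own StatsD endpoint:

```bash
# Install the StatsD client alongside Airflow (pip-managed install assumed)
pip install 'apache-airflow[statsd]'

# Enable the [metrics] StatsD options via Airflow's AIRFLOW__SECTION__KEY convention
export AIRFLOW__METRICS__STATSD_ON=True
export AIRFLOW__METRICS__STATSD_HOST=localhost   # placeholder: your StatsD endpoint host
export AIRFLOW__METRICS__STATSD_PORT=8125        # placeholder: your StatsD endpoint port
export AIRFLOW__METRICS__STATSD_PREFIX=airflow

# Restart the scheduler and webserver so the new settings take effect
```

Once metrics are flowing, duration timers such as dagrun.duration.success.<dag_id> are the most relevant ones for alerting on long-running jobs.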
07-16-2024
03:25 PM
@GangWar @wert_1311 I have found HDFS files that are persistently under-replicated despite being over a year old. They are rare, but vulnerable to loss with a single disk failure.

To be clear, 'hdfs dfs -ls <filename>' shows the replication target, not the actual replica count; the actual count can be found with 'hdfs fsck <filename> -files -blocks'. In theory this situation should be transient, but I have found some cases where it persists. See the example below, where a file is 3 blocks long and one block has only one live replica.

# hdfs fsck -blocks -files /tmp/part-m-03752
OUTPUT:
/tmp/part-m-03752: Under replicated BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
/tmp/part-m-03752: Replica placement policy is violated for BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792. Block should be additionally replicated on 1 more rack(s).
0. BP-955733439-1.2.3.4-1395362440665:blk_1967769089_1100461809406 len=134217728 Live_repl=3
1. BP-955733439-1.2.3.4-1395362440665:blk_1967769276_1100461809593 len=134217728 Live_repl=3
2. BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792 len=40324081 Live_repl=1
Status: HEALTHY
 Total size: 308759537 B
 Total dirs: 0
 Total files: 1
 Total symlinks: 0
 Total blocks (validated): 3 (avg. block size 102919845 B)
 Minimally replicated blocks: 3 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 1 (33.333332 %)
 Mis-replicated blocks: 1 (33.333332 %)
 Default replication factor: 3
 Average block replication: 2.3333333
 Corrupt blocks: 0
 Missing replicas: 2 (22.222221 %)
 Number of data-nodes: 30
 Number of racks: 3
The filesystem under path '/tmp/part-m-03752' is HEALTHY

# hadoop fs -ls /tmp/part-m-03752
OUTPUT:
-rw-r--r-- 3 wuser hadoop 308759537 2021-12-11 16:58 /tmp/part-m-03752

Presumably the file was incorrectly replicated when it was written because of some failure, and the defaults for the dfs.client.block.write.replace-datanode-on-failure properties were such that new DataNodes were not obtained at write time to replace the ones that failed. The puzzling thing is why it does not get re-replicated after all this time.
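A commonly used workaround for a stuck under-replicated block is to nudge the NameNode by raising the file's replication factor and then setting it back; a rough sketch, using the path from the example above:

```bash
# Each replication change queues the file's blocks for re-replication work;
# -w waits until the requested replication is actually met before returning.
hdfs dfs -setrep -w 4 /tmp/part-m-03752
hdfs dfs -setrep -w 3 /tmp/part-m-03752

# Re-check the block report afterwards
hdfs fsck /tmp/part-m-03752 -files -blocks
```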
06-18-2024
01:00 AM
1 Kudo
You can integrate Airflow with a monitoring tool such as Prometheus or Grafana. These tools provide advanced monitoring and alerting capabilities. You can configure thresholds and receive alerts when the job runtime exceeds the specified limit. Alternatively, you can use a dedicated job monitoring tool like Apache Oozie or Azkaban, which also offer alerting mechanisms to monitor long-running jobs. These tools provide more comprehensive job management features and can be integrated with Airflow.
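As a rough sketch of the Prometheus route (the image name and ports are the statsd_exporter defaults at the time of writing, so verify them for your setup): Airflow emits StatsD metrics, a statsd_exporter translates them, and Prometheus scrapes the exporter.

```bash
# Run the Prometheus statsd exporter (defaults: StatsD input on :9125, metrics on :9102)
docker run -d --name statsd-exporter \
  -p 9125:9125/udp -p 9102:9102 \
  prom/statsd-exporter

# Point Airflow's StatsD output at the exporter via the [metrics] settings
export AIRFLOW__METRICS__STATSD_HOST=localhost   # placeholder: exporter host
export AIRFLOW__METRICS__STATSD_PORT=9125

# Prometheus then scrapes http://<exporter-host>:9102/metrics, and Prometheus or
# Grafana alert rules can fire when a task/DAG duration metric crosses your threshold.
```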
05-01-2024
06:42 AM
1 Kudo
Hello @wert_1311 We hope the above post has answered your question. We shall mark the post as resolved. If your team continues to have any concerns, feel free to update the post and we shall get back to your team accordingly. - Smarak
04-15-2024
12:17 AM
1 Kudo
Hello @wert_1311 This is an old post, but I am answering so that it can be used for future reference. Anytime your team observes the above issue, capture the CDE diagnostics bundle covering the timeframe of the issue (two days in the screenshot example). Next, restart the Airflow scheduler by deleting the Airflow scheduler pod; it will be recreated automatically. Then engage Cloudera Support with the CDE diagnostics bundle captured above. Additionally, CDE Airflow has been significantly scale-tested in recent CDE releases, and your team should consider upgrading CDE to the latest version as soon as possible. - Smarak [1] https://docs.cloudera.com/data-engineering/cloud/troubleshooting/topics/cde-download-diagnostic-bundle.html
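For reference, restarting the scheduler by deleting its pod typically looks like the sketch below; the namespace and pod name are placeholders that will differ per CDE virtual cluster:

```bash
# Find the Airflow scheduler pod in the virtual cluster's namespace (placeholder namespace)
kubectl -n <cde-vc-namespace> get pods | grep -i scheduler

# Delete it; the controlling workload recreates the scheduler pod automatically
kubectl -n <cde-vc-namespace> delete pod <airflow-scheduler-pod-name>
```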
03-02-2024
11:30 AM
1 Kudo
Hi @wert_1311 Can you tail -f the Atlas application logs and capture the events you see when you list the 'Not Classified' entities with the page size expanded? Also, please run your browser's "inspect element" tools and share both outputs for review. Please also validate whether the behaviour is the same from a different browser. Regards, Puneeth
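A quick sketch of the log-capture step; the path shown is the usual Atlas log location on CDP clusters and may differ in your environment:

```bash
# Follow the Atlas application log while reproducing the issue in the UI
tail -f /var/log/atlas/application.log

# Optionally filter for errors while the 'Not Classified' listing is expanded
tail -f /var/log/atlas/application.log | grep -iE "error|exception"
```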
02-28-2024
08:18 AM
Hi @wert_1311 Can you "sync the user for the cluster" once and then check whether you are able to access the Web UI in your CDW cluster?
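User sync can be triggered from the Management Console (Environments > Actions > Synchronize Users) or, roughly, via the CDP CLI; treat the command below as a sketch and verify the exact subcommand against your CLI version:

```bash
# Sketch only: confirm the subcommand and options with `cdp environments help`
cdp environments sync-all-users --environment-names <your-environment-name>
```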
02-21-2024
06:55 AM
1 Kudo
Hi @wert_1311 Thank you for reaching out to the Cloudera Community. Cloudera provides an observability tool called "Cloudera Observability" which helps you monitor jobs within a cluster. You can refer to the documentation below on how this tool can be configured and used: https://docs.cloudera.com/observability/cloud/overview/topics/obs-understanding-observ.html
01-02-2024
07:27 AM
@wert_1311 Thank you for bringing this to our community. Did you get help on this? If not, allow me to help you better: Q. What is your observation in the CM agent log on the respective Solr Server? You may want to look out for the keywords "Monitor-SolrServerMonitor" or "throttling_logger" in the CM agent logs of the Solr Server and see if you can provide more insights on this. Q. What is the exact version of CM and CDP? V
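For the first question, the CM agent log on the affected Solr Server host can be checked with something like the following; the path is the default agent log location and may vary:

```bash
# Search the CM agent log for the monitor/throttling entries mentioned above
grep -E "Monitor-SolrServerMonitor|throttling_logger" /var/log/cloudera-scm-agent/cloudera-scm-agent.log
```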