Member since: 09-25-2018
Posts: 99
Kudos Received: 6
Solutions: 5
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2282 | 11-03-2021 02:55 AM |
|  | 1732 | 09-21-2020 10:04 PM |
|  | 3094 | 08-14-2020 03:20 AM |
|  | 4200 | 08-20-2019 11:07 PM |
|  | 9994 | 01-06-2019 07:32 PM |
08-19-2024
07:28 AM
Hello @wert_1311 Thank you for bringing this to our Community. I see this has already been requested by you here: [0a] https://community.cloudera.com/t5/Support-Questions/Monitor-alert-long-running-Airflow-jobs/m-p/388314/highlight/true#M246627 Did it not help? If not, please try the following.

Airflow exposes the following metrics that can be used: [0b] https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metric-descriptions

For the integration, I'd suggest you explore the feasibility based on your use case. For example, the metrics configuration for StatsD and OpenTelemetry is described here: [0c] https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#metrics-configuration

I am also citing third-party projects that may be helpful: [0d] https://github.com/search?q=airflow+prometheus&type=repositories

Let us know how it goes. V
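For reference, a minimal sketch of enabling StatsD emission via environment variables, assuming an Airflow 2.x install with the statsd extra available; the host and port are placeholders for your own StatsD endpoint:

```bash
# Install the StatsD client alongside Airflow (pip-managed install assumed)
pip install 'apache-airflow[statsd]'

# Enable the [metrics] StatsD options via Airflow's AIRFLOW__SECTION__KEY convention
export AIRFLOW__METRICS__STATSD_ON=True
export AIRFLOW__METRICS__STATSD_HOST=localhost   # placeholder: your StatsD endpoint host
export AIRFLOW__METRICS__STATSD_PORT=8125        # placeholder: your StatsD endpoint port
export AIRFLOW__METRICS__STATSD_PREFIX=airflow

# Restart the scheduler and webserver so the new settings take effect
```

Once metrics are flowing, duration timers such as dagrun.duration.success.<dag_id> are the most relevant ones for alerting on long-running jobs.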
07-16-2024
03:25 PM
@GangWar @wert_1311 I have found HDFS files that are persistently under-replicated despite being over a year old. They are rare, but vulnerable to loss with a single disk failure.

To be clear, 'hdfs dfs -ls <filename>' shows the replication target, not the actual replica count; the actual count can be found with 'hdfs fsck <filename> -files -blocks'. In theory this situation should be transient, but I have found some cases where it persists. See the example below, where a file is 3 blocks long and one block has only one live replica.

# hdfs fsck -blocks -files /tmp/part-m-03752
OUTPUT:
/tmp/part-m-03752: Under replicated BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
/tmp/part-m-03752: Replica placement policy is violated for BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792. Block should be additionally replicated on 1 more rack(s).
0. BP-955733439-1.2.3.4-1395362440665:blk_1967769089_1100461809406 len=134217728 Live_repl=3
1. BP-955733439-1.2.3.4-1395362440665:blk_1967769276_1100461809593 len=134217728 Live_repl=3
2. BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792 len=40324081 Live_repl=1
Status: HEALTHY
 Total size: 308759537 B
 Total dirs: 0
 Total files: 1
 Total symlinks: 0
 Total blocks (validated): 3 (avg. block size 102919845 B)
 Minimally replicated blocks: 3 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 1 (33.333332 %)
 Mis-replicated blocks: 1 (33.333332 %)
 Default replication factor: 3
 Average block replication: 2.3333333
 Corrupt blocks: 0
 Missing replicas: 2 (22.222221 %)
 Number of data-nodes: 30
 Number of racks: 3
The filesystem under path '/tmp/part-m-03752' is HEALTHY

# hadoop fs -ls /tmp/part-m-03752
OUTPUT:
-rw-r--r-- 3 wuser hadoop 308759537 2021-12-11 16:58 /tmp/part-m-03752

Presumably the file was incorrectly replicated when it was written because of some failure, and the defaults for the dfs.client.block.write.replace-datanode-on-failure properties were such that new DataNodes were not obtained at write time to replace the ones that failed. The puzzling thing is why it does not get re-replicated after all this time.
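A commonly used workaround for a stuck under-replicated block is to nudge the NameNode by raising the file's replication factor and then setting it back; a rough sketch, using the path from the example above:

```bash
# Each replication change queues the file's blocks for re-replication work;
# -w waits until the requested replication is actually met before returning.
hdfs dfs -setrep -w 4 /tmp/part-m-03752
hdfs dfs -setrep -w 3 /tmp/part-m-03752

# Re-check the block report afterwards
hdfs fsck /tmp/part-m-03752 -files -blocks
```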
06-18-2024
01:00 AM
1 Kudo
You can integrate Airflow with a monitoring tool such as Prometheus or Grafana. These tools provide advanced monitoring and alerting capabilities. You can configure thresholds and receive alerts when the job runtime exceeds the specified limit. Alternatively, you can use a dedicated job monitoring tool like Apache Oozie or Azkaban, which also offer alerting mechanisms to monitor long-running jobs. These tools provide more comprehensive job management features and can be integrated with Airflow.
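As a rough sketch of the Prometheus route (the image name and ports are the statsd_exporter defaults at the time of writing, so verify them for your setup): Airflow emits StatsD metrics, a statsd_exporter translates them, and Prometheus scrapes the exporter.

```bash
# Run the Prometheus statsd exporter (defaults: StatsD input on :9125, metrics on :9102)
docker run -d --name statsd-exporter \
  -p 9125:9125/udp -p 9102:9102 \
  prom/statsd-exporter

# Point Airflow's StatsD output at the exporter via the [metrics] settings
export AIRFLOW__METRICS__STATSD_HOST=localhost   # placeholder: exporter host
export AIRFLOW__METRICS__STATSD_PORT=9125

# Prometheus then scrapes http://<exporter-host>:9102/metrics, and Prometheus or
# Grafana alert rules can fire when a task/DAG duration metric crosses your threshold.
```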
05-01-2024
06:42 AM
1 Kudo
Hello @wert_1311 We hope the above post has answered your question. We shall mark the post as resolved. If your team continues to have any concerns, feel free to update the post and we shall get back to your team accordingly. - Smarak
04-15-2024
12:17 AM
1 Kudo
Hello @wert_1311 This is an old post, but I am answering so that it can be used for future reference. Anytime your team observes the above issue, capture the CDE diagnostics bundle covering the timeframe of the issue (two days in the screenshot example). Next, restart the Airflow scheduler by deleting the Airflow scheduler pod; it will be recreated automatically. Then engage Cloudera Support with the CDE diagnostics bundle captured above. Additionally, CDE Airflow has been significantly scale-tested in recent CDE releases, and your team should consider upgrading CDE to the latest version as soon as possible. - Smarak [1] https://docs.cloudera.com/data-engineering/cloud/troubleshooting/topics/cde-download-diagnostic-bundle.html
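For reference, restarting the scheduler by deleting its pod typically looks like the sketch below; the namespace and pod name are placeholders that will differ per CDE virtual cluster:

```bash
# Find the Airflow scheduler pod in the virtual cluster's namespace (placeholder namespace)
kubectl -n <cde-vc-namespace> get pods | grep -i scheduler

# Delete it; the controlling workload recreates the scheduler pod automatically
kubectl -n <cde-vc-namespace> delete pod <airflow-scheduler-pod-name>
```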
03-02-2024
11:30 AM
1 Kudo
Hi @wert_1311 Can you tail -f the Atlas application logs and capture the events you see when you list the 'Not Classified' entities with the page size expanded? Also, please run your browser's "inspect element" tools and share both outputs for review. Please also validate whether the behaviour is the same from a different browser. Regards, Puneeth
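A quick sketch of the log-capture step; the path shown is the usual Atlas log location on CDP clusters and may differ in your environment:

```bash
# Follow the Atlas application log while reproducing the issue in the UI
tail -f /var/log/atlas/application.log

# Optionally filter for errors while the 'Not Classified' listing is expanded
tail -f /var/log/atlas/application.log | grep -iE "error|exception"
```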
02-28-2024
08:18 AM
Hi @wert_1311 Can you "sync the user for the cluster" once and then check whether you are able to access the Web UI in your CDW cluster?
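User sync can be triggered from the Management Console (Environments > Actions > Synchronize Users) or, roughly, via the CDP CLI; treat the command below as a sketch and verify the exact subcommand against your CLI version:

```bash
# Sketch only: confirm the subcommand and options with `cdp environments help`
cdp environments sync-all-users --environment-names <your-environment-name>
```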
02-21-2024
06:55 AM
1 Kudo
Hi @wert_1311 Thank you for reaching out to the Cloudera Community. Cloudera provides an observability tool called "Cloudera Observability" which helps you monitor jobs within a cluster. You can refer to the documentation below on how this tool can be configured and used: https://docs.cloudera.com/observability/cloud/overview/topics/obs-understanding-observ.html
01-02-2024
07:27 AM
@wert_1311 Thank you for bringing this to our community. Did you get help on this? If not, allow me to help you better: Q. What is your observation in the CM agent log on the respective Solr Server? You may want to look out for the keywords "Monitor-SolrServerMonitor" or "throttling_logger" in the CM agent logs of the Solr Server and see if you can provide more insights on this. Q. What is the exact version of CM and CDP? V
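For the first question, the CM agent log on the affected Solr Server host can be checked with something like the following; the path is the default agent log location and may vary:

```bash
# Search the CM agent log for the monitor/throttling entries mentioned above
grep -E "Monitor-SolrServerMonitor|throttling_logger" /var/log/cloudera-scm-agent/cloudera-scm-agent.log
```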