About AyazHussain

AyazHussain · ‎06-10-2026

Kubernetes Airflow Troubleshooting Cheat Sheet

Running Apache Airflow on Kubernetes brings considerable scalability, but when DAGs stall, tasks disappear, or the scheduler goes silent, navigating the infrastructure layer can be daunting. This guide groups essential debugging commands into seven areas, explaining why you need them, when to use them, and what to look out for during an incident.

1. Kubernetes Pods, Nodes, and Events

Before digging into Airflow application logs, check the health of your underlying Kubernetes infrastructure. Pod evictions, scheduling failures, or resource constraints are often the root cause of "mysterious" task failures.

List pods and states

    kubectl get pods -n <namespace> -o wide

When to use: Your first line of defense when tasks aren't being picked up or the UI is unresponsive.
What to look for: Pods in CrashLoopBackOff, OOMKilled, or Pending states. The -o wide flag adds the pod IP and node name, which helps you spot whether a specific Kubernetes node is failing.

Identify issues in a specific pod

    kubectl describe pod <pod_name> -n <namespace>

When to use: When a scheduler, worker, or webserver pod is stuck initializing, failing, or refusing to terminate.
What to look for: Scroll to the "Events" section at the bottom. It reveals reasons for container failures, image pull errors, liveness/readiness probe failures, or resource limits being hit.

Check cluster events sorted by the most recent

    kubectl get events -n <namespace> --sort-by='.lastTimestamp'

When to use: When multiple things are breaking at once and you need a chronological timeline of cluster issues.
What to look for: Warnings related to failed scheduling, node pressure, or API server connection drops.

Check resource usage

    kubectl top pods -n <namespace>
    kubectl top nodes

When to use: When tasks are lagging or pods are randomly dying due to suspected Out-Of-Memory (OOM) errors. These commands require the Metrics Server to be installed in the cluster.
What to look for: Memory or CPU spikes. If a node is at 95%+ utilization, Kubernetes may start evicting Airflow workers.

2. Scheduler and Worker Logs

The Airflow scheduler is the brain of your operation. When it gets bogged down or loses its connection to the Kubernetes API, your entire data pipeline stalls.

View recent scheduler logs

    kubectl logs <scheduler_pod> -n <namespace> -c scheduler --tail=500

When to use: When DAGs stop triggering or tasks are stuck in the queued state.
What to look for: Database connection timeouts or heavy parsing loops.

Filter scheduler logs for stuck tasks or churn

    kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -iE "queued|scheduling|state mismatch"

When to use: When tasks are stuck in the queue indefinitely or fluctuating between states without executing. The -E flag is needed so that grep treats | as alternation rather than a literal character.
What to look for: Logs indicating Airflow is repeatedly setting a task to queued but failing to hand it off to the executor.

Check worker pod logs

    kubectl logs <worker_pod> -n <namespace> --tail=200

When to use: When a specific task fails immediately upon starting, or when debugging Celery/Kubernetes workers.
What to look for: Python tracebacks, missing environment variables, or dependency errors specific to your DAG execution.

Check for Kubernetes Executor watcher timeouts

    kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -c "ReadTimeoutError"

When to use: If you notice many tasks failing or dropping off without any specific error inside the task logs themselves.
What to look for: A high count indicates that the scheduler's connection to the Kubernetes API server is timing out, meaning it is losing track of running pods.

Check for critical watcher deaths or chunking errors

    kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -i "InvalidChunkLength\|watcher.*died"

When to use: When the scheduler completely stops processing Kubernetes executor pods.
What to look for: These errors indicate a broken HTTP stream between Airflow and Kubernetes. If the watcher dies, the scheduler won't know whether a task finished or failed.

3. Storage, Volumes, and Disk Usage

Airflow pods rely heavily on disks for logs, DAG parsing, and local XCom processing. A full disk can crash your metadata database or stop logs from being written.

List Persistent Volume Claims (PVCs)

    kubectl get pvc -n <namespace>

When to use: When pods fail to start with a VolumeBinding error.
What to look for: Ensure the status is Bound. If it is Pending, your cloud provider or storage class is failing to provision the requested disk.

Check general disk space inside a pod

    kubectl exec -it <pod> -n <namespace> -- df -h

When to use: When you suspect a pod is stalling because its underlying storage or ephemeral storage is full.
What to look for: Any mounted filesystem at 100% utilization.

Check disk usage specifically for Airflow logs

    kubectl exec -it <pod> -n <namespace> -- du -sh /usr/local/airflow/logs

When to use: When you run a persistent worker or remote log syncing fails, causing local log storage to swell over time.
What to look for: The total size. If logs take up gigabytes, it is time to implement a log retention policy or a log-clearing sidecar.

Check MySQL database disk usage

    kubectl exec -it <mysql_pod> -n <namespace> -- du -sh /var/lib/mysql/

When to use: When the Airflow UI is very slow or the database begins throwing disk write errors.
What to look for: Whether the database size matches expectations. If it is unexpectedly large, your XCom or task instance history tables are probably bloated.

4. Airflow CLI Commands

Sometimes the fastest way to troubleshoot or unblock a stuck state is to talk directly to Airflow through its built-in CLI inside the container.

List all DAGs

    airflow dags list

When to use: To verify that the scheduler has successfully parsed and recognized your DAG files.
What to look for: If your new DAG isn't listed here, there is either a syntax/import error or a DAG synchronization issue.

Delete an orphaned or specific DAG

    airflow dags delete <dag_id>

When to use: When a deleted code file leaves a "ghost" DAG in the UI that keeps throwing errors, or when you need to completely wipe a DAG's history.
What to look for: Confirmation that the deletion removed all associated task instances from the metadata database.

Check the exact state of a task

    airflow tasks state <dag_id> <task_id> <execution_date>

When to use: When the webserver UI shows a conflicting status or lags behind reality, and you need the source-of-truth state.
What to look for: The exact database state, such as success, running, failed, or queued.

5. Database Connectivity and MySQL

Airflow is notorious for hammering its metadata database. Connection pool exhaustion or bloated binary logs can completely paralyze your cluster.

Test DB connectivity from the scheduler pod

    kubectl exec -it <scheduler_pod> -n <namespace> -- python3 -c "import MySQLdb; MySQLdb.connect(host='<host>', user='<user>', passwd='<pass>', db='<db>').close()"

When to use: When the scheduler logs complain about database connection failures. This helps you determine whether it is a network/credential issue or an application config bug.
What to look for: A Connection Refused or Access Denied error points to your K8s network policies or credentials. If the command returns nothing, the connection is healthy.

View active MySQL processes

    mysql -h <host> -u <user> -p -e "SHOW PROCESSLIST;"

When to use: When Airflow locks up completely or runs extremely slowly.
What to look for: Long-running queries or a large number of sleeping connections, which suggests connection pool leaks from the workers or scheduler.

Show MySQL binary logs

    mysql -h <host> -u <user> -p -e "SHOW BINARY LOGS;"

When to use: When the database server runs out of disk space due to high transactional volume (common with heavy XCom use or heavy task scheduling).
What to look for: A long list of log files eating up disk space on your DB instance.

Purge binary logs older than 7 days

    PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;

When to use: Emergency maintenance when your MySQL database disk hits 100% due to un-rotated binary logs. Confirm first that no replica or backup process still needs the logs being purged.
What to look for: Disk space is freed immediately as the older binary logs are deleted.

Grant Airflow user privileges

    GRANT CREATE, ALTER, DROP, INDEX, REFERENCES ON <airflow_db>.* TO '<airflow_user>'@'%';
    FLUSH PRIVILEGES;

When to use: During Airflow upgrades (airflow db upgrade) when the migration scripts fail due to missing schema modification permissions.
What to look for: "Access denied for user" errors during database initialization should no longer appear.

6. Networking, DNS, and SSL

Airflow tasks regularly communicate with external APIs, cloud data warehouses, and internal services. Network misconfigurations are very common in Kubernetes.

Check CoreDNS pods for gateway/DNS issues

    kubectl get pods -n kube-system | grep coredns

When to use: When multiple Airflow tasks suddenly fail with "Host not found" or DNS resolution errors.
What to look for: Ensure the CoreDNS pods are running and have not restarted frequently. Frequent restarts indicate that internal cluster DNS is crashing.

Test internal DNS resolution

    kubectl exec -it <pod> -n <namespace> -- nslookup <fqdn>

When to use: When an Airflow pod can't reach your database, Git repo, or an external API.
What to look for: Whether the name resolves to an IP address. If it fails, your Kubernetes DNS or CoreDNS routing is broken.

Test SSL certificate verification to external systems

    kubectl exec -it <pod> -n <namespace> -- openssl s_client -connect <host>:<port> -CAfile /etc/ssl/certs/ca-certificates.crt

When to use: When external API tasks fail with SSL: CERTIFICATE_VERIFY_FAILED.
What to look for: Whether the handshake is successful (Verification: OK). If it fails, you may need to mount custom corporate CA certificates into your Airflow Docker image.

Identify upstream timeouts from ingress controllers

    kubectl logs <ingress_controller_pod> -n <namespace> | grep "upstream timed out"

When to use: When you try to download huge log files or load massive DAG graphs in the Airflow UI and get a 504 Gateway Timeout.
What to look for: Matching lines confirm that the webserver took too long to reply, meaning you need to increase the timeout limit on your Kubernetes Ingress controller.

7. File System and DAG Verification

A classic point of confusion: your code is updated in Git, but the Airflow UI doesn't show the changes. These commands help verify whether your DAG files actually reached the containers.

Compare filesystem DAGs versus the UI count

    kubectl exec -it <scheduler_pod> -n <namespace> -- ls /usr/local/airflow/dags/ | wc -l

When to use: When there is a mismatch between what you see in your code repository and what is rendering in the Airflow webserver UI.
What to look for: If this count matches your repo but the UI doesn't, the scheduler is failing to parse the files. If this count is lower, your DAG syncing mechanism (such as Git-Sync or a shared PVC) is broken.

List actual file locations and verify resource mounts inside the pod

    kubectl exec -it <scheduler_pod> -n <namespace> -- ls -la /usr/local/airflow/dags/
    kubectl exec -it <scheduler_pod> -n <namespace> -- ls -la /app/mount/<resource_name>/

When to use: When tasks fail with FileNotFoundError or permission issues for mounted data folders or plugins.
What to look for: Verify file ownership and permissions (root vs. airflow user) and ensure symlinks or remote volumes are correctly mounted and visible to the application runtime.
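The separate scheduler-log greps from section 2 can also be combined into a single pass over the log text. The following is a minimal sketch, not part of the original cheat sheet: the `PATTERNS` names and the `summarize` helper are hypothetical, and the sample lines are made-up placeholders rather than real Airflow log output. Only the search strings themselves come from the commands above.

```python
import re
from collections import Counter

# Search strings taken from the grep commands in section 2.
# The dictionary keys are illustrative labels, not Airflow terminology.
PATTERNS = {
    "read_timeout": re.compile(r"ReadTimeoutError"),
    "invalid_chunk": re.compile(r"InvalidChunkLength", re.IGNORECASE),
    "watcher_died": re.compile(r"watcher.*died", re.IGNORECASE),
    "queue_churn": re.compile(r"queued|scheduling|state mismatch", re.IGNORECASE),
}


def summarize(lines):
    """Count how many lines match each pattern (a line may match several)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Placeholder lines for illustration only; real scheduler logs look different.
sample = [
    "placeholder: ReadTimeoutError while watching pods",
    "placeholder: ReadTimeoutError while watching pods",
    "placeholder: InvalidChunkLength in response stream",
    "placeholder: kubernetes watcher died",
    "placeholder: task instance set to queued",
    "placeholder: nothing interesting here",
]

counts = summarize(sample)
for name in PATTERNS:
    print(f"{name}: {counts[name]}")
```

To use it against a live cluster, one option is to save the output of `kubectl logs <scheduler_pod> -n <namespace> -c scheduler` to a file and pass the open file object to `summarize`, since it accepts any iterable of lines.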

AyazHussain · ‎06-01-2026

Spark Python Supportability Matrix

The Spark Python Supportability Matrix shows which Python versions are compatible with each Spark release.

| Spark Version | Min Python | Max Python | 2.7 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | 3.11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.5.5 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.4 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.3 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.2 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.1 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.0 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.4.2 | 3.7 | 3.11 | No | No | No | No | Yes | Yes | Yes | Yes | Yes |
| 3.4.1 | 3.7 | 3.11 | No | No | No | No | Yes | Yes | Yes | Yes | Yes |
| 3.4.0 | 3.7 | 3.11 | No | No | No | No | Yes | Yes | Yes | Yes | Yes |
| 3.3.3 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.3.2 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.3.1 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.3.0 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.2.4 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.3 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.2 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.1 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.0 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.1.3 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.1.2 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.1.1 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.0.3 | 2.7/3.4 | 3.9 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| 3.0.2 | 2.7/3.4 | 3.9 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| 3.0.1 | 2.7/3.4 | 3.8 | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No |
| 3.0.0 | 2.7/3.4 | 3.8 | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No |
| 2.4.8 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.7 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.6 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.5 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.4 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.3 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.2 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.1 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.0 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.4 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.3 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.2 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.1 | 2.7/3.4 | 3.6 | Yes | Yes | Yes | Yes | No | No | No | No | No |
| 2.3.0 | 2.7/3.4 | 3.6 | Yes | Yes | Yes | Yes | No | No | No | No | No |
| 2.2.3 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.2.2 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.2.1 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.2.0 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.1.3 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.1.2 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.1.1 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |

Note: The data above was generated from https://pypi.org/project/pyspark/. If you face any problems with a supported Python environment, please share them in the comments so that notes can be added.
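A matrix like this can be turned into a quick pre-flight check. The sketch below is illustrative only: the `SUPPORTED_RANGE` dictionary and the `is_supported` helper are hypothetical names, and the dictionary copies only the Spark 3.1 to 3.5 rows, where every patch release in a minor line shares the same range in the matrix.

```python
# Min/max Python (major, minor) per Spark minor line, copied from the
# matrix above for Spark 3.1 through 3.5 only. Older lines are omitted
# because their ranges differ between patch releases.
SUPPORTED_RANGE = {
    "3.5": ((3, 8), (3, 11)),
    "3.4": ((3, 7), (3, 11)),
    "3.3": ((3, 7), (3, 10)),
    "3.2": ((3, 6), (3, 9)),
    "3.1": ((3, 6), (3, 9)),
}


def _major_minor(version):
    """Turn '3.10' or '3.10.4' into the tuple (3, 10)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def is_supported(spark_version, python_version):
    """Return True if the matrix lists this Python version for this Spark line."""
    spark_minor = ".".join(spark_version.split(".")[:2])
    if spark_minor not in SUPPORTED_RANGE:
        raise KeyError(f"Spark {spark_minor} is not in this partial table")
    lowest, highest = SUPPORTED_RANGE[spark_minor]
    return lowest <= _major_minor(python_version) <= highest


print(is_supported("3.5.1", "3.11"))
print(is_supported("3.3.2", "3.11"))
```

Comparing versions as integer tuples avoids the string-comparison trap where "3.10" sorts before "3.9".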

AyazHussain · ‎10-22-2025

Minimum User Limit Percentage (MULP) and User Limit Factor (ULF) control how resources are assigned to users within the queues they use.

MULP is a soft limit on the smallest share of a queue's resources that a single user should get when requesting it. Choose it based on how many concurrent users you expect to run jobs on the queue. A value of 10% is a common choice: it lets about 10 concurrent users each receive at least 10% of the queue's configured minimum capacity.

MULP is applied with respect to active users. Active users are those requesting more resources; non-active users are running jobs but not requesting more resources. The limit for active users is calculated as:

    active-user-limit = max(resource-used-by-active-users / active-users, queue-capacity * MULP)

Example: 5 users, 5 apps, MULP = 20%, queue configured resource = 100.

    App:  a1, a2, a3, a4, a5
    User: u1, u2, u3, u4, u5

At time T, resource usage is a1=25, a2=20, a3=30, a4=20, a5=5, and u1 and u2 (running a1 and a2) are the active users. The limit is max((25 + 20) / 2, 100 * 20%) = max(22.5, 20) = 22.5. User u2 (at 20) is below the limit and can receive more resources, while u1 (at 25) has already crossed it.

User Limit Factor (ULF) controls the maximum amount of resources that a single user can consume in a queue. It is set as a multiple of the queue's minimum capacity, so a ULF of 1 means a user can consume the entire minimum capacity of the queue. A common design point that may initially be non-intuitive is to create queues by workload rather than by application, and then set user-limit-factor to a value below 1.0 to prevent a single user from taking over a queue.
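The active-user-limit formula and the five-user example above can be checked with a few lines of Python. This is a minimal sketch of the formula as stated in the post, not the Capacity Scheduler's actual implementation; the function and parameter names are hypothetical.

```python
def active_user_limit(usage_by_user, active_users, queue_capacity, mulp_percent):
    """Apply: max(resource-used-by-active-users / active-users,
                  queue-capacity * MULP)."""
    used_by_active = sum(usage_by_user[user] for user in active_users)
    share_of_active_usage = used_by_active / len(active_users)
    mulp_floor = queue_capacity * mulp_percent / 100
    return max(share_of_active_usage, mulp_floor)


# Values from the example: queue resource 100, MULP 20%, u1 and u2 active.
usage = {"u1": 25, "u2": 20, "u3": 30, "u4": 20, "u5": 5}
limit = active_user_limit(usage, ["u1", "u2"], queue_capacity=100, mulp_percent=20)
print(limit)  # 22.5

# Active users still below the limit are eligible for more resources.
eligible = [user for user in ("u1", "u2") if usage[user] < limit]
print(eligible)  # ['u2']
```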

AyazHussain · ‎05-08-2025

Hi @anonymous_123, Generally, the RM heap requirement depends on the yarn.resourcemanager.max-completed-applications value and the number of applications run daily. The default value for yarn.resourcemanager.max-completed-applications is 10000, but if you don't run that many applications you can lower it to 6000. As for the 4 GB heap, that is a production-level RM heap size, and it is fine as long as you are not seeing any heap-related errors.

AyazHussain · ‎04-15-2025

Hi @Jaguar, Can you please collect the RM logs, grep them for Ranger-related entries, and check what they show? Also, do you have the cm_yarn service plugin set up in Ranger?

AyazHussain · ‎04-02-2025

Hi @anonymous_123, Yes, you can use Iceberg tables with Spark and authorize access with Ranger. You need to set two policies: one for the Iceberg metadata files and one global policy that grants Iceberg permissions on all tables. Please follow this document: https://docs.cloudera.com/runtime/7.3.1/iceberg-how-to/topics/iceberg-setup-ranger.html

AyazHussain · ‎04-02-2025

Hi @satvaddi, Please follow the steps below to set up the policies in RAZ for Spark. Spark doesn't have a plugin of its own, so what gets logged is its access to data on S3; table metadata access is logged from HMS.

Running the create external table [***table definition***] location 's3a://bucket/data/logs/tabledata' command in Hive requires the following Ranger policies:

- An S3 policy in the cm_s3 repo on s3a://bucket/data/logs/tabledata for the hive user to perform recursive read/write.
- An S3 policy in the cm_s3 repo on s3a://bucket/data/logs/tabledata for the end user.
- A Hive URL authorization policy in the Hadoop SQL repo on s3a://bucket/data/logs/tabledata for the end user.

Access to the same external table location using Spark shell requires an S3 policy (Ranger policy) in the cm_s3 repo on s3a://bucket/data/logs/tabledata for the end user.

AyazHussain · ‎03-24-2025

In YARN, resource allocation discrepancies can occur due to the way resource calculation is handled. By default, resource availability is determined based on available memory. However, when CPU scheduling is enabled, resource calculation considers both available memory and vCores. As a result, in some scenarios, nodes may appear to allocate more vCores than the configured limit while simultaneously displaying lower available resources. This happens due to the way YARN dynamically assigns vCores based on workload demands rather than strictly adhering to preconfigured limits. Additionally, in cases where CPU scheduling is disabled, YARN relies solely on memory-based resource calculation. This may lead to negative values appearing in the YARN UI, which can be safely ignored, as they do not impact actual resource utilization.
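The negative values can be illustrated with a toy model. The sketch below is a deliberate simplification and not YARN code: it assumes a scheduler that admits containers on memory alone while still counting the vCores each container requested, and every name and number in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    memory_mb: int
    vcores: int


def allocate_memory_only(node, request, requested_count):
    """Admit containers while memory fits; vCores are counted but never checked.

    Returns the node's remaining (available) resources after allocation.
    """
    used_memory = 0
    used_vcores = 0
    for _ in range(requested_count):
        if used_memory + request.memory_mb > node.memory_mb:
            break  # memory is the only admission criterion in this model
        used_memory += request.memory_mb
        used_vcores += request.vcores
    return Resources(node.memory_mb - used_memory, node.vcores - used_vcores)


# Hypothetical node: 64 GB and 8 vCores; each container asks for 4 GB and 1 vCore.
node = Resources(memory_mb=65536, vcores=8)
request = Resources(memory_mb=4096, vcores=1)

available = allocate_memory_only(node, request, requested_count=16)
print(available)  # Resources(memory_mb=0, vcores=-8)
```

In this model all 16 containers fit by memory, so 16 vCores are recorded against a node that advertises 8, and the available vCore figure goes negative even though nothing is actually failing.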

AyazHussain · ‎03-23-2025

No, the job won't fail, because work-preserving recovery is enabled by default on the YARN ResourceManager and NodeManager.

AyazHussain · ‎03-06-2025

Hi @sdbags, You can recover the corrupted block as long as the replication factor is set to the default of 3, because healthy replicas of the block are then available on other DataNodes.

Member Since	‎12-20-2022 08:28 AM
Last Visited	‎07-27-2026 03:45 AM
Posts	90
Kudos received	21

Cloudera Community
