06-10-2026
01:45 AM
Running Apache Airflow on Kubernetes brings incredible scalability, but when DAGs stall, tasks disappear, or the scheduler goes silent, navigating the infrastructure layer can be daunting.
This guide breaks down essential debugging commands into seven critical pillars, explaining why you need them, when to use them, and what to look out for during an incident.
1. Kubernetes Pods, Nodes, and Events
Before digging into Airflow application logs, check the health of your underlying Kubernetes infrastructure. Pod evictions, scheduling failures, or resource constraints are often the root cause of "mysterious" task failures.
List pods and states
kubectl get pods -n <namespace> -o wide
When to use: Your first line of defense when tasks aren't picking up or the UI is unresponsive.
What to look for: Look for pods in CrashLoopBackOff, OOMKilled, or Pending states. The -o wide flag gives you the IP and node name, helping you identify if a specific Kubernetes node is failing.
Identify issues in a specific pod
kubectl describe pod <pod_name> -n <namespace>
When to use: When a scheduler, worker, or webserver pod is stuck initializing, failing, or refusing to terminate.
What to look for: Scroll to the "Events" section at the bottom. It will reveal reasons for container failures, image pull errors, liveness/readiness probe failures, or resource limits being hit.
Check cluster events sorted by the most recent
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
When to use: When multiple things are breaking at once and you need a chronological timeline of cluster issues.
What to look for: Look for warnings related to failed scheduling, node pressure, or API server connection drops.
Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes
When to use: When tasks are lagging or pods are randomly dying due to suspected Out-Of-Memory (OOM) errors.
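What to look for: Identify memory or CPU spikes. When a node runs short of memory or disk, the kubelet can start evicting pods, including Airflow workers, so treat any node near 95% utilization as at risk. Note that kubectl top only works if the Kubernetes Metrics Server is installed in the cluster.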
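When many pods are listed, the status check described above can be scripted. The following is a minimal sketch, assuming the default column layout of kubectl get pods; the function name, the set of statuses, and the sample pod names are all hypothetical.

```python
# Statuses this sketch treats as unhealthy (an assumed, non-exhaustive set).
UNHEALTHY_STATUSES = {"CrashLoopBackOff", "OOMKilled", "Pending", "Error", "Evicted"}


def unhealthy_pods(get_pods_output: str) -> list[tuple[str, str]]:
    """Return (pod_name, status) pairs for rows whose STATUS looks unhealthy."""
    lines = get_pods_output.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    # NAME and STATUS come before any multi-word columns, so a plain split is enough.
    name_idx, status_idx = header.index("NAME"), header.index("STATUS")
    flagged = []
    for line in lines[1:]:
        cols = line.split()
        if len(cols) > status_idx and cols[status_idx] in UNHEALTHY_STATUSES:
            flagged.append((cols[name_idx], cols[status_idx]))
    return flagged


# Hypothetical sample in the shape printed by `kubectl get pods`.
SAMPLE = """\
NAME                  READY   STATUS             RESTARTS   AGE
airflow-scheduler-0   2/2     Running            0          3d
airflow-worker-abc    0/1     CrashLoopBackOff   12         1h
airflow-task-xyz      0/1     Pending            0          5m
"""

print(unhealthy_pods(SAMPLE))
```

In practice the text could come from subprocess.run(["kubectl", "get", "pods", "-n", namespace], capture_output=True, text=True).stdout rather than a hard-coded sample.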
2. Scheduler and Worker Logs
The Airflow Scheduler is the brain of your operation. When it gets bogged down or loses connection to the Kubernetes API, your entire data pipeline stalls.
View recent scheduler logs
kubectl logs <scheduler_pod> -n <namespace> -c scheduler --tail=500
When to use: When DAGs stop triggering or tasks are stuck in the queued state.
What to look for: Look for database connection timeouts or heavy parsing loops.
Filter scheduler logs for stuck tasks or churn
kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -iE "queued|scheduling|state mismatch"
When to use: When tasks are infinitely stuck in the queue or fluctuating between states without executing.
What to look for: Look for logs indicating Airflow is repeatedly setting a task to queued but failing to hand it off to the executor.
Check worker pod logs
kubectl logs <worker_pod> -n <namespace> --tail=200
When to use: When a specific task fails immediately upon starting, or when debugging Celery/Kubernetes workers.
What to look for: Python tracebacks, missing environment variables, or dependency errors specific to your DAG execution.
Check for Kubernetes Executor watcher timeouts
kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -c "ReadTimeoutError"
When to use: If you notice a high count of tasks failing or dropping off without any specific error inside the task logs themselves.
What to look for: A high count indicates that the scheduler’s connection to the Kubernetes API server is timing out, meaning it is losing track of running pods.
Check for critical watcher deaths or chunking errors
kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -i "InvalidChunkLength\|watcher.*died"
When to use: When the scheduler completely stops processing Kubernetes executor pods.
What to look for: These errors indicate a broken HTTP stream between Airflow and Kubernetes. If the watcher dies, the scheduler won't know if a task finished or failed.
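The grep checks in this section can be combined into a single pass over saved scheduler logs. This is a rough sketch: the pattern names and the sample lines are made up for illustration and are not real Airflow log output; only the three search strings come from the commands above.

```python
import re
from collections import Counter

# Search strings taken from the grep commands above; the dictionary keys are hypothetical labels.
PATTERNS = {
    "read_timeout": re.compile(r"ReadTimeoutError"),
    "invalid_chunk": re.compile(r"InvalidChunkLength", re.IGNORECASE),
    "watcher_died": re.compile(r"watcher.*died", re.IGNORECASE),
}


def count_watcher_errors(log_text: str) -> Counter:
    """Count how many log lines match each pattern."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Synthetic placeholder lines, not copied from a real scheduler.
SAMPLE_LOG = """\
synthetic line: ReadTimeoutError while reading from the API server
synthetic line: nothing unusual here
synthetic line: InvalidChunkLength in the response stream
synthetic line: ReadTimeoutError again
synthetic line: the watcher process died
"""

print(count_watcher_errors(SAMPLE_LOG))
```

Comparing these counts across two time windows is one way to tell a transient blip from a sustained API server connectivity problem.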
3. Storage, Volumes, and Disk Usage
Airflow pods rely heavily on disks for logs, DAG parsing, and local XCom processing. A full disk will crash your metadata database or stop logs from writing.
List Persistent Volume Claims (PVCs)
kubectl get pvc -n <namespace>
When to use: When pods fail to start with a VolumeBinding error.
What to look for: Ensure the status is Bound. If it's Pending, your cloud provider or storage class is failing to provision the requested disk.
Check general disk space inside a pod
kubectl exec -it <pod> -n <namespace> -- df -h
When to use: When you suspect a pod is stalling because its underlying storage or ephemeral storage is full.
What to look for: Check if any mounted filesystem is at 100% utilization.
Check disk usage specifically for Airflow logs
kubectl exec -it <pod> -n <namespace> -- du -sh /usr/local/airflow/logs
When to use: When you run a persistent worker or remote log syncing fails, causing local log storage to swell over time.
What to look for: The total size. If it's taking up gigabytes of data, it’s time to implement a log retention policy or a log-clearing sidecar.
Check MySQL database disk usage
kubectl exec -it <mysql_pod> -n <namespace> -- du -sh /var/lib/mysql/
When to use: When the Airflow UI is incredibly slow or the database begins throwing disk write errors.
What to look for: Check if the database size matches expectations. If it is unexpectedly massive, your XComs or task instance history tables are bloating.
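The df -h check above can be automated by parsing its output for filesystems above a usage threshold. A minimal sketch follows, assuming the usual six-column df layout; the function name, the 90% default, and the sample rows are assumptions for illustration.

```python
def full_filesystems(df_output: str, threshold: int = 90) -> list[tuple[str, int]]:
    """Return (mount_point, use_percent) for filesystems at or above threshold."""
    results = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        # Expected columns: Filesystem Size Used Avail Use% Mounted-on
        if len(cols) < 6 or not cols[4].endswith("%"):
            continue
        try:
            percent = int(cols[4].rstrip("%"))
        except ValueError:
            continue
        if percent >= threshold:
            # Re-join in case the mount point itself contains spaces.
            results.append((" ".join(cols[5:]), percent))
    return results


# Hypothetical sample in the shape printed by `df -h`.
SAMPLE_DF = """\
Filesystem      Size  Used Avail Use% Mounted on
overlay          50G   48G  2.0G  96% /
tmpfs            64M     0   64M   0% /dev
/dev/sdb1       100G  100G     0 100% /usr/local/airflow/logs
"""

print(full_filesystems(SAMPLE_DF))
```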
4. Airflow CLI Commands
Sometimes the fastest way to troubleshoot or unblock a stuck state is by talking directly to Airflow via its built-in CLI inside the container.
List all DAGs
airflow dags list
When to use: To verify if the scheduler has successfully compiled and recognized your DAG files.
What to look for: If your new DAG isn't listed here, it means there is either a syntax/import error or a DAG synchronization issue.
Delete an orphaned or specific DAG
airflow dags delete <dag_id>
When to use: When a deleted code file leaves a "ghost" DAG in the UI that keeps throwing errors, or when you need to completely wipe a DAG's history.
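What to look for: After you confirm the prompt, the command removes the DAG's runs and task instances from the metadata database. If the DAG file still exists in the DAGs folder, the scheduler will register the DAG again on its next parse.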
Check the exact state of a task
airflow tasks state <dag_id> <task_id> <execution_date>
When to use: When the webserver UI shows a conflicting status or lags behind reality, and you need the source-of-truth state.
What to look for: Returns exact database states like success, running, failed, or queued.
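When a DAG is missing from airflow dags list, a quick first check is whether the file even compiles. The sketch below uses Python's built-in compile() to catch syntax errors only; it does not import the file, so missing packages or failing imports would still need to be checked separately. The function name and the sample sources are hypothetical.

```python
def syntax_error_in(source: str, filename: str = "<dag>"):
    """Return (line_number, message) for the first syntax error, or None if the source compiles."""
    try:
        # compile() parses the code without executing or importing anything.
        compile(source, filename, "exec")
    except SyntaxError as err:
        return (err.lineno, err.msg)
    return None


GOOD_SOURCE = "from airflow import DAG\nx = 1\n"
BAD_SOURCE = "def broken(:\n    pass\n"

print(syntax_error_in(GOOD_SOURCE))
print(syntax_error_in(BAD_SOURCE))
```

The source string would typically be read from a file under the DAGs folder, for example with pathlib.Path(path).read_text().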
5. Database Connectivity and MySQL
Airflow is notorious for hammering its metadata database. Connection pool exhaustion or bloated binary logs can completely paralyze your cluster.
Test DB connectivity from the scheduler pod
kubectl exec -it <scheduler_pod> -n <namespace> -- python3 -c "import MySQLdb; MySQLdb.connect(host='<host>', user='<user>', passwd='<pass>', db='<db>').close()"
When to use: When the scheduler logs complain about database connection failures, helping you determine if it's a network/credential issue or an application config bug.
What to look for: If it throws a Connection Refused or Access Denied error, the issue lies in your K8s network policies or credentials. If it returns nothing, the connection is healthy.
View active MySQL processes
mysql -h <host> -u <user> -p -e "SHOW PROCESSLIST;"
When to use: When Airflow locks up completely or runs excruciatingly slow.
What to look for: Look for long-running queries or a massive amount of sleeping connections, which implies connection pool leaks from the workers/scheduler.
Show MySQL binary logs
mysql -h <host> -u <user> -p -e "SHOW BINARY LOGS;"
When to use: Useful if the database server runs out of disk space due to high transactional volume (common when XComs or heavy task scheduling are running).
What to look for: A massive list of log files eating up disk space on your DB instance.
Purge binary logs older than 7 days
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;
When to use: Emergency maintenance when your MySQL database disk hits 100% due to un-rotated binary logs.
What to look for: This immediately frees disk space by deleting binary log files older than the cutoff. Before running it, confirm that no replica or point-in-time recovery plan still needs those logs.
Grant Airflow user privileges
GRANT CREATE, ALTER, DROP, INDEX, REFERENCES ON <airflow_db>.* TO '<airflow_user>'@'%'; FLUSH PRIVILEGES;
When to use: During Airflow upgrades (airflow db upgrade, or airflow db migrate on Airflow 2.7 and later) when the migration scripts fail due to missing schema modification permissions.
What to look for: Resolves Access denied for user errors during database initialization.
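The SHOW PROCESSLIST review described above can be summarized programmatically once the rows are fetched. This sketch assumes each row is a dict keyed by the standard column names (Id, Command, Time), as a dictionary-style cursor would return; the function name, the 60-second threshold, and the sample rows are illustrative assumptions.

```python
def summarize_processlist(rows: list[dict], slow_seconds: int = 60) -> dict:
    """Count sleeping connections and collect the ids of long-running queries."""
    sleeping = sum(1 for row in rows if row["Command"] == "Sleep")
    # Only rows actively running a statement are considered for the slow list.
    slow_ids = sorted(
        row["Id"] for row in rows
        if row["Command"] == "Query" and row["Time"] >= slow_seconds
    )
    return {"total": len(rows), "sleeping": sleeping, "slow_ids": slow_ids}


# Hypothetical rows shaped like SHOW PROCESSLIST output.
SAMPLE_ROWS = [
    {"Id": 11, "Command": "Sleep", "Time": 300},
    {"Id": 12, "Command": "Sleep", "Time": 5},
    {"Id": 13, "Command": "Query", "Time": 120},
    {"Id": 14, "Command": "Query", "Time": 0},
]

print(summarize_processlist(SAMPLE_ROWS))
```

A sleeping count that keeps growing between samples is the pattern the section above associates with connection pool leaks.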
6. Networking, DNS, and SSL
Airflow tasks regularly communicate with external APIs, cloud data warehouses, and internal services. Network misconfigurations are incredibly common in Kubernetes.
Check CoreDNS pods for gateway/DNS issues
kubectl get pods -n kube-system | grep coredns
When to use: When multiple Airflow tasks suddenly fail with "Host not found" or DNS resolution errors.
What to look for: Ensure the CoreDNS pods are running and haven't restarted frequently, which indicates internal cluster DNS crashing.
Test internal DNS resolution
kubectl exec -it <pod> -n <namespace> -- nslookup <fqdn>
When to use: When an Airflow pod can't reach your database, Git repo, or an external API.
What to look for: See if it successfully resolves the IP address. If it fails, your Kubernetes DNS or CoreDNS routing is broken.
Test SSL certificate verification to external systems
kubectl exec -it <pod> -n <namespace> -- openssl s_client -connect <host>:<port> -CAfile /etc/ssl/certs/ca-certificates.crt
When to use: When external API tasks fail with SSL: CERTIFICATE_VERIFY_FAILED.
What to look for: Check if the handshake is successful (Verification: OK). If it fails, you may need to mount custom corporate CA certificates into your Airflow Docker image.
Identify upstream timeouts from ingress controllers
kubectl logs <ingress_controller_pod> -n <namespace> | grep "upstream timed out"
When to use: When you try to download huge log files or load massive DAG graphs in the Airflow UI, and you get a 504 Gateway Timeout.
What to look for: Confirms if the webserver took too long to reply, meaning you need to increase the timeout limit on your Kubernetes Ingress controller.
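If nslookup is not installed in the Airflow image, the same DNS check can be done with Python's standard library, which is always present in an Airflow container. A minimal sketch, with a hypothetical function name:

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the sorted, de-duplicated IP addresses a hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Resolution failed: the name is unknown or DNS is unreachable.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


print(resolve("127.0.0.1"))
```

An empty list for a name that should exist (for example, the database service FQDN) points at the same DNS problems the nslookup command would reveal.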
7. File System and DAG Verification
A classic point of confusion: your code is updated in Git, but the Airflow UI doesn't show the changes. These commands help verify if your DAG files actually reached the containers.
Compare filesystem DAGs versus the UI count
kubectl exec -it <scheduler_pod> -n <namespace> -- ls /usr/local/airflow/dags/ | wc -l
When to use: When there is a mismatch between what you see in your code repository and what is rendering on the Airflow Webserver UI.
What to look for: If this count matches your repo but the UI doesn't, the scheduler is failing to parse the files. If this count is lower, your DAG syncing mechanism (like Git-Sync or a Shared PVC) is broken.
List actual file locations and verify resource mounts inside the pod
kubectl exec -it <scheduler_pod> -n <namespace> -- ls -la /usr/local/airflow/dags/
kubectl exec -it <scheduler_pod> -n <namespace> -- ls -la /app/mount/<resource_name>/
When to use: When tasks fail with FileNotFoundError or permissions issues for mounted data folders/plugins.
What to look for: Verify file ownership permissions (root vs airflow user) and ensure symlinks or remote volumes are correctly mounted and visible to the application runtime.
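Because ls | wc -l counts every entry rather than only DAG files, a slightly more precise comparison lists the Python files on each side and reports what is missing. The sketch below assumes DAGs are .py files; both function names are hypothetical.

```python
from pathlib import Path


def list_dag_files(dags_dir: str) -> list[str]:
    """Return relative paths of .py files under dags_dir, skipping __init__.py."""
    root = Path(dags_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*.py")
        if path.name != "__init__.py"
    )


def missing_from_pod(repo_files: list[str], pod_files: list[str]) -> list[str]:
    """Return files present in the repository listing but absent from the pod listing."""
    return sorted(set(repo_files) - set(pod_files))


# Hypothetical listings: one taken from a Git checkout, one from inside the scheduler pod.
print(missing_from_pod(["etl.py", "reports/daily.py"], ["etl.py"]))
```

If the missing list is non-empty, the sync mechanism is the likely culprit; if it is empty but the UI still lacks a DAG, parsing is the more likely problem.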
06-01-2026
04:00 AM
Spark Python Supportability Matrix
The Spark Python Supportability Matrix serves as an essential tool for determining which Python versions are compatible with specific Spark versions. This matrix provides a detailed overview of the compatibility levels for various Python versions across different Spark releases.
| Spark Version | Python Min Supported Version | Python Max Supported Version | Python 2.7 | Python 3.4 | Python 3.5 | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | Python 3.11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.5.5 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.4 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.3 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.2 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.1 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.5.0 | 3.8 | 3.11 | No | No | No | No | No | Yes | Yes | Yes | Yes |
| 3.4.2 | 3.7 | 3.11 | No | No | No | No | Yes | Yes | Yes | Yes | Yes |
| 3.4.1 | 3.7 | 3.11 | No | No | No | No | Yes | Yes | Yes | Yes | Yes |
| 3.4.0 | 3.7 | 3.11 | No | No | No | No | Yes | Yes | Yes | Yes | Yes |
| 3.3.3 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.3.2 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.3.1 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.3.0 | 3.7 | 3.10 | No | No | No | No | Yes | Yes | Yes | Yes | No |
| 3.2.4 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.3 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.2 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.1 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.2.0 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.1.3 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.1.2 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.1.1 | 3.6 | 3.9 | No | No | No | Yes | Yes | Yes | Yes | No | No |
| 3.0.3 | 2.7/3.4 | 3.9 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| 3.0.2 | 2.7/3.4 | 3.9 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No |
| 3.0.1 | 2.7/3.4 | 3.8 | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No |
| 3.0.0 | 2.7/3.4 | 3.8 | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No |
| 2.4.8 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.7 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.6 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.5 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.4 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.3 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.2 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.1 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.4.0 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.4 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.3 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.2 | 2.7/3.4 | 3.7 | Yes | Yes | Yes | Yes | Yes | No | No | No | No |
| 2.3.1 | 2.7/3.4 | 3.6 | Yes | Yes | Yes | Yes | No | No | No | No | No |
| 2.3.0 | 2.7/3.4 | 3.6 | Yes | Yes | Yes | Yes | No | No | No | No | No |
| 2.2.3 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.2.2 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.2.1 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.2.0 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.1.3 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.1.2 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
| 2.1.1 | 2.7/3.4 | 3.5 | Yes | Yes | Yes | No | No | No | No | No | No |
Note: The above data was compiled from https://pypi.org/project/pyspark/. If you run into problems with a supported Python environment, please share them in the comments so that we can add notes.
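The matrix can be turned into a small lookup for use in deployment checks. The sketch below covers only the Spark 3.1 to 3.5 release lines, where the matrix shows the same Python range for every patch release; the dictionary and function names are hypothetical. Versions are compared as integer tuples because a plain string comparison would wrongly rank "3.10" below "3.9".

```python
# Min and max supported Python per Spark release line, copied from the matrix above.
# Only lines with a uniform range across patch releases are included (an intentional simplification).
SUPPORTED_PYTHON = {
    "3.5": ((3, 8), (3, 11)),
    "3.4": ((3, 7), (3, 11)),
    "3.3": ((3, 7), (3, 10)),
    "3.2": ((3, 6), (3, 9)),
    "3.1": ((3, 6), (3, 9)),
}


def is_supported(spark_version: str, python_version: str) -> bool:
    """Return True if python_version falls inside the range listed for spark_version."""
    spark_line = ".".join(spark_version.split(".")[:2])
    if spark_line not in SUPPORTED_PYTHON:
        raise KeyError(f"Spark line {spark_line} is not covered by this sketch")
    low, high = SUPPORTED_PYTHON[spark_line]
    # Compare (major, minor) as integers, ignoring any patch component.
    python_tuple = tuple(int(part) for part in python_version.split(".")[:2])
    return low <= python_tuple <= high


print(is_supported("3.5.1", "3.11"))
print(is_supported("3.3.2", "3.11"))
```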
10-22-2025
01:09 PM
Minimum User Percentage and User Limit Factor are ways to control how resources are assigned to users within the queues they use.
For a given queue, the Minimum User Limit Percentage (MULP) is a soft limit on the smallest share of resources that each user should receive when requesting them. Choose the value based on how many concurrent users you expect to run jobs on the queue. For example, setting it to 10% lets around 10 concurrent users each receive at least 10% of the queue's configured minimum capacity.
The calculation also distinguishes active from non-active users. Active users are those requesting more resources; non-active users are running jobs but not requesting more. The limit is calculated for the active users:
active-user-limit = max(resource-used-by-active-users / active-users, queue-capacity * MULP)
For example: 5 users, 5 apps, MULP=20, queue-configured-resource=100
App: a1, a2, a3, a4, a5
Usr: u1, u2, u3, u4, u5
At time T, resource usage is a1=25, a2=20, a3=30, a4=20, a5=5, and u1 and u2 (running a1 and a2) are the active users.
This gives max((25 + 20) / 2, 100 * 20%) = max(22.5, 20) = 22.5. User u2 (using 20) is below the limit and can receive more resources, while u1 (using 25) has already crossed it.
The User Limit Factor (ULF) controls the maximum amount of resources a single user can consume in a queue. It is set as a multiple of the queue's minimum capacity, so a user limit factor of 1 means a user can consume the entire minimum capacity of the queue.
A common design point that may initially be non-intuitive is to create queues by workload rather than by application, and then set the user-limit-factor to a value below 1.0 to prevent a single user from taking over a queue.
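The active-user-limit formula above can be checked with a few lines of code. This is a sketch of the simplified formula as written in this post, not of the actual CapacityScheduler implementation; the function and parameter names are hypothetical.

```python
def active_user_limit(usage_by_active_user: dict[str, float],
                      queue_capacity: float,
                      mulp_percent: float) -> float:
    """Compute max(used-by-active-users / active-users, queue-capacity * MULP)."""
    active_users = len(usage_by_active_user)
    if active_users == 0:
        raise ValueError("at least one active user is required")
    average_usage = sum(usage_by_active_user.values()) / active_users
    guaranteed_share = queue_capacity * mulp_percent / 100
    return max(average_usage, guaranteed_share)


# Values from the worked example: u1 and u2 are active, using 25 and 20.
usage = {"u1": 25, "u2": 20}
limit = active_user_limit(usage, queue_capacity=100, mulp_percent=20)
print(limit)

# Users still below the limit are eligible for more resources.
print([user for user, used in usage.items() if used < limit])
```

With the example's numbers the limit is 22.5, so only u2 remains below it.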
05-08-2025
06:27 AM
Hi @anonymous_123 , Generally, the RM heap calculation depends on the yarn.resourcemanager.max-completed-applications value and the number of applications running daily. The default value for yarn.resourcemanager.max-completed-applications is 10000, but if you don't run that many applications you can set it to 6000. As for the 4 GB heap, that is a production-level RM heap, and it is fine as long as you are not seeing any heap-related errors.
04-15-2025
01:25 AM
Hi @Jaguar , Could you please collect the RM logs, grep them for Ranger, and check the matching entries? Also, do you have the cm_yarn service plugin set up in Ranger?
04-02-2025
11:35 PM
1 Kudo
Hi @anonymous_123 , Yes, you can use Iceberg tables with Spark and authorize access with Ranger. You need to set two permissions: one for the Iceberg metadata files, and one global policy that grants Iceberg permissions on all tables. Please follow this document: https://docs.cloudera.com/runtime/7.3.1/iceberg-how-to/topics/iceberg-setup-ranger.html
04-02-2025
10:07 PM
Hi @satvaddi , Please follow the steps below to set up the policies in RAZ for Spark. Spark doesn't have a plugin of its own, so access is logged for the data read on S3, and table metadata access is logged from HMS.
Running the create external table [***table definition***] location 's3a://bucket/data/logs/tabledata' command in Hive requires the following Ranger policies:
An S3 policy in the cm_s3 repo on s3a://bucket/data/logs/tabledata for the hive user to perform recursive read/write.
An S3 policy in the cm_s3 repo on s3a://bucket/data/logs/tabledata for the end user.
A Hive URL authorization policy in the Hadoop SQL repo on s3a://bucket/data/logs/tabledata for the end user.
Access to the same external table location using Spark shell requires an S3 policy (Ranger policy) in the cm_s3 repo on s3a://bucket/data/logs/tabledata for the end user.
03-24-2025
01:56 AM
In YARN, resource allocation discrepancies can occur due to the way resource calculation is handled. By default, resource availability is determined based on available memory. However, when CPU scheduling is enabled, resource calculation considers both available memory and vCores. As a result, in some scenarios, nodes may appear to allocate more vCores than the configured limit while simultaneously displaying lower available resources. This happens due to the way YARN dynamically assigns vCores based on workload demands rather than strictly adhering to preconfigured limits. Additionally, in cases where CPU scheduling is disabled, YARN relies solely on memory-based resource calculation. This may lead to negative values appearing in the YARN UI, which can be safely ignored, as they do not impact actual resource utilization.
03-23-2025
11:30 PM
1 Kudo
No, the job won't fail, because work-preserving restart is enabled by default on the YARN ResourceManager and NodeManager.
03-06-2025
10:11 PM
Hi @sdbags , You can recover the corrupted block if the replication factor is set to the default of 3.