Created on 06-10-2026 01:45 AM
Running Apache Airflow on Kubernetes brings incredible scalability, but when DAGs stall, tasks disappear, or the scheduler goes silent, navigating the infrastructure layer can be daunting.
This guide breaks down essential debugging commands into seven critical pillars, explaining why you need them, when to use them, and what to look out for during an incident.
Before digging into Airflow application logs, check the health of your underlying Kubernetes infrastructure. Pod evictions, scheduling failures, or resource constraints are often the root cause of "mysterious" task failures.
kubectl get pods -n <namespace> -o wide
When to use: Your first line of defense when tasks aren't picking up or the UI is unresponsive.
What to look for: Look for pods in CrashLoopBackOff, OOMKilled, or Pending states. The -o wide flag gives you the IP and node name, helping you identify if a specific Kubernetes node is failing.
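During an incident the full pod list can be long. A minimal sketch of a filter that keeps only the pods needing attention, assuming the default kubectl get pods column layout in which STATUS is the third column. The function name unhealthy_pods is illustrative, not part of kubectl.

```shell
# Keep the header row plus any pod whose STATUS (3rd column) is neither
# Running nor Completed. Reads `kubectl get pods` output on stdin.
unhealthy_pods() {
  awk 'NR == 1 || ($3 != "Running" && $3 != "Completed")'
}

# Usage (not run here):
#   kubectl get pods -n <namespace> -o wide | unhealthy_pods
```

A pod can report Running while its READY column shows 0/1, so an empty result from this filter does not rule out probe failures.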
kubectl describe pod <pod_name> -n <namespace>
What to look for: Scroll to the "Events" section at the bottom. It will reveal reasons for container failures, image pull errors, liveness/readiness probe failures, or resource limits being hit.
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
What to look for: Look for warnings related to failed scheduling, node pressure, or API server connection drops.
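When the event stream is noisy, grouping warnings by reason shows which failure dominates. A rough sketch, assuming the default kubectl get events columns (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); warning_reasons is a hypothetical helper name.

```shell
# Count Warning events per REASON, most frequent first.
# Reads default `kubectl get events` output on stdin; in data rows the
# TYPE is field 2 and the REASON is field 3.
warning_reasons() {
  awk 'NR > 1 && $2 == "Warning" { count[$3]++ }
       END { for (r in count) print count[r], r }' | sort -rn
}

# Usage (not run here):
#   kubectl get events -n <namespace> --sort-by='.lastTimestamp' | warning_reasons
```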
kubectl top pods -n <namespace>
kubectl top nodes
What to look for: Identify memory or CPU spikes. When a node runs low on memory, the kubelet can start evicting pods, including Airflow workers; CPU saturation throttles tasks instead of evicting them.
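To spot nodes under memory pressure without reading every row, the output of kubectl top nodes can be filtered. This sketch assumes the usual column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); the helper name hot_nodes and the default threshold of 90 are illustrative choices.

```shell
# Print nodes whose MEMORY% (5th column of `kubectl top nodes`) is at or
# above the given threshold. The threshold defaults to 90.
hot_nodes() {
  awk -v limit="${1:-90}" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)            # strip the trailing percent sign
    if (pct + 0 >= limit + 0) print $1, $5
  }'
}

# Usage (not run here):
#   kubectl top nodes | hot_nodes 85
```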
The Airflow Scheduler is the brain of your operation. When it gets bogged down or loses connection to the Kubernetes API, your entire data pipeline stalls.
kubectl logs <scheduler_pod> -n <namespace> -c scheduler --tail=500
What to look for: Look for database connection timeouts or heavy parsing loops.
kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -iE "queued|scheduling|state mismatch"
What to look for: Look for logs indicating Airflow is repeatedly setting a task to queued but failing to hand it off to the executor.
kubectl logs <worker_pod> -n <namespace> --tail=200
What to look for: Python tracebacks, missing environment variables, or dependency errors specific to your DAG execution.
kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -c "ReadTimeoutError"
What to look for: A high count indicates that the scheduler’s connection to the Kubernetes API server is timing out, meaning it is losing track of running pods.
kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -i "InvalidChunkLength\|watcher.*died"
What to look for: These errors indicate a broken HTTP stream between Airflow and Kubernetes. If the watcher dies, the scheduler won't know if a task finished or failed.
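The two checks above can be combined into a single pass over the scheduler log. A minimal sketch that counts the same signatures the grep commands look for; api_error_summary is a hypothetical helper name and the output format is arbitrary.

```shell
# Count lines matching each Kubernetes API error signature.
# Reads scheduler log text on stdin and prints one summary line.
api_error_summary() {
  awk '
    /ReadTimeoutError/            { timeouts++ }
    /InvalidChunkLength/          { chunks++ }
    tolower($0) ~ /watcher.*died/ { watchers++ }
    END {
      printf "ReadTimeoutError=%d InvalidChunkLength=%d watcher_died=%d\n",
             timeouts, chunks, watchers
    }'
}

# Usage (not run here):
#   kubectl logs <scheduler_pod> -n <namespace> -c scheduler | api_error_summary
```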
Airflow pods rely heavily on disks for logs, DAG parsing, and local XCom processing. A full disk will crash your metadata database or stop logs from writing.
kubectl get pvc -n <namespace>
What to look for: Ensure the status is Bound. If it's Pending, your cloud provider or storage class is failing to provision the requested disk.
kubectl exec -it <pod> -n <namespace> -- df -h
What to look for: Check if any mounted filesystem is at 100% utilization.
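Catching a filesystem before it reaches 100% is easier with a threshold. The sketch below assumes the POSIX df -P layout, where column 5 is the capacity and column 6 is the mount point, and that the container image ships a df supporting -P. The name full_filesystems and the default of 90 are illustrative.

```shell
# Print mount points whose usage is at or above the threshold (default 90).
# Expects `df -P` output on stdin. Mount points containing spaces are
# not handled by this simple field split.
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)            # strip the trailing percent sign
    if (pct + 0 >= limit + 0) print $6, $5
  }'
}

# Usage (not run here):
#   kubectl exec <pod> -n <namespace> -- df -P | full_filesystems 90
```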
kubectl exec -it <pod> -n <namespace> -- du -sh /usr/local/airflow/logs
When to use: When you run a persistent worker or remote log syncing fails, causing local log storage to swell over time.
What to look for: The total size. If it's taking up gigabytes of data, it’s time to implement a log retention policy or a log-clearing sidecar.
kubectl exec -it <mysql_pod> -n <namespace> -- du -sh /var/lib/mysql/
When to use: When the Airflow UI is incredibly slow or the database begins throwing disk write errors.
What to look for: Check if the database size matches expectations. If it is unexpectedly massive, your XComs or task instance history tables are bloating.
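If the data directory is larger than expected, MySQL's information_schema can show which tables account for the growth. The helper below only prints the query so it can be passed to the mysql client. The function name largest_tables_sql and the default schema name airflow are assumptions, so substitute your own database name.

```shell
# Print a query that lists the ten largest tables (data plus indexes, in MB)
# for the given schema. The schema defaults to "airflow" (an assumption).
largest_tables_sql() {
  printf "SELECT table_name, ROUND((data_length + index_length) / 1024 / 1024) AS size_mb FROM information_schema.tables WHERE table_schema = '%s' ORDER BY size_mb DESC LIMIT 10;\n" "${1:-airflow}"
}

# Usage (not run here):
#   mysql -h <host> -u <user> -p -e "$(largest_tables_sql <db>)"
```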
Sometimes the fastest way to troubleshoot or unblock a stuck state is by talking directly to Airflow via its built-in CLI inside the container.
airflow dags list
When to use: To verify if the scheduler has successfully compiled and recognized your DAG files.
What to look for: If your new DAG isn't listed here, there is either a syntax/import error or a DAG synchronization issue. On Airflow 2.x, airflow dags list-import-errors reports the files that fail to import.
airflow dags delete <dag_id>
When to use: When a deleted code file leaves a "ghost" DAG in the UI that keeps throwing errors, or when you need to completely wipe a DAG's history.
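What to look for: Confirm that the deletion removed all associated task instances from the metadata database. If the DAG file is still present in the DAGs folder, the DAG will reappear the next time it is parsed.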
airflow tasks state <dag_id> <task_id> <execution_date>
When to use: When the webserver UI shows a conflicting status or lags behind reality, and you need the source-of-truth state.
What to look for: Returns exact database states like success, running, failed, or queued.
Airflow is notorious for hammering its metadata database. Connection pool exhaustion or bloated binary logs can completely paralyze your cluster.
kubectl exec -it <scheduler_pod> -n <namespace> -- python3 -c "import MySQLdb; MySQLdb.connect(host='<host>', user='<user>', passwd='<pass>', db='<db>').close()"
When to use: When the scheduler logs complain about database connection failures, helping you determine if it's a network/credential issue or an application config bug.
What to look for: If it throws a Connection Refused or Access Denied error, the issue lies in your K8s network policies or credentials. If it returns nothing, the connection is healthy.
mysql -h <host> -u <user> -p -e "SHOW PROCESSLIST;"
When to use: When Airflow locks up completely or runs excruciatingly slow.
What to look for: Look for long-running queries or a massive amount of sleeping connections, which implies connection pool leaks from the workers/scheduler.
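Counting connections per state makes a pool leak easier to see than scanning the raw list. This sketch assumes the tab-separated output the mysql client produces when its output is piped, with Command as the fifth column of SHOW PROCESSLIST; connections_by_command is a hypothetical helper name.

```shell
# Count connections per Command value (Sleep, Query, ...), largest first.
# Reads tab-separated `SHOW PROCESSLIST` output on stdin.
connections_by_command() {
  awk -F '\t' 'NR > 1 { count[$5]++ }
               END { for (c in count) print count[c], c }' | sort -rn
}

# Usage (not run here):
#   mysql -h <host> -u <user> -p -e "SHOW PROCESSLIST;" | connections_by_command
```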
mysql -h <host> -u <user> -p -e "SHOW BINARY LOGS;"
When to use: Useful if the database server runs out of disk space due to high transactional volume (common when XComs or heavy task scheduling are running).
What to look for: A massive list of log files eating up disk space on your DB instance.
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;
When to use: Emergency maintenance when your MySQL database disk hits 100% due to un-rotated binary logs.
What to look for: Disk space is freed immediately as binary logs older than seven days are deleted. If the instance has replicas, confirm they have already read those logs before purging.
GRANT CREATE, ALTER, DROP, INDEX, REFERENCES ON <airflow_db>.* TO '<airflow_user>'@'%'; FLUSH PRIVILEGES;
When to use: During Airflow upgrades (airflow db upgrade, or airflow db migrate on Airflow 2.7 and later) when the migration scripts fail due to missing schema modification permissions.
What to look for: Resolves Access denied for user errors during database initialization.
Airflow tasks regularly communicate with external APIs, cloud data warehouses, and internal services. Network misconfigurations are incredibly common in Kubernetes.
kubectl get pods -n kube-system | grep coredns
When to use: When multiple Airflow tasks suddenly fail with "Host not found" or DNS resolution errors.
What to look for: Ensure the CoreDNS pods are Running and have a low restart count. Frequent restarts indicate that the cluster's internal DNS is crashing.
kubectl exec -it <pod> -n <namespace> -- nslookup <fqdn>
When to use: When an Airflow pod can't reach your database, Git repo, or an external API.
What to look for: See if it successfully resolves the IP address. If it fails, your Kubernetes DNS or CoreDNS routing is broken.
kubectl exec -it <pod> -n <namespace> -- openssl s_client -connect <host>:<port> -CAfile /etc/ssl/certs/ca-certificates.crt
When to use: When external API tasks fail with SSL: CERTIFICATE_VERIFY_FAILED.
What to look for: Check if the handshake is successful (Verification: OK). If it fails, you may need to mount custom corporate CA certificates into your Airflow Docker image.
kubectl logs <ingress_controller_pod> -n <namespace> | grep "upstream timed out"
When to use: When you try to download huge log files or load massive DAG graphs in the Airflow UI, and you get a 504 Gateway Timeout.
What to look for: Confirms if the webserver took too long to reply, meaning you need to increase the timeout limit on your Kubernetes Ingress controller.
A classic point of confusion: your code is updated in Git, but the Airflow UI doesn't show the changes. These commands help verify if your DAG files actually reached the containers.
kubectl exec <scheduler_pod> -n <namespace> -- ls -1 /usr/local/airflow/dags/ | wc -l
When to use: When there is a mismatch between what you see in your code repository and what is rendering on the Airflow Webserver UI.
What to look for: If this count matches your repo but the UI doesn't, the scheduler is failing to parse the files. If this count is lower, your DAG syncing mechanism (like Git-Sync or a Shared PVC) is broken.
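For the comparison against the repository, the same counting rule should be applied on both sides. The ls count above includes every top-level entry (subfolders, __pycache__, non-Python files), so the sketch below swaps in a recursive find that counts only .py files. The helper name count_dag_files is illustrative, and the in-pod variant assumes the image provides find.

```shell
# Count Python files under a directory, recursively.
# Run it against a local checkout of the DAG repository.
count_dag_files() {
  find "$1" -type f -name '*.py' | wc -l | tr -d ' '
}

# Usage (not run here): compare the local count with the count in the pod.
#   count_dag_files ./dags/
#   kubectl exec <scheduler_pod> -n <namespace> -- \
#     find /usr/local/airflow/dags/ -type f -name '*.py' | wc -l
```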
kubectl exec -it <scheduler_pod> -n <namespace> -- ls -la /usr/local/airflow/dags/
kubectl exec -it <scheduler_pod> -n <namespace> -- ls -la /app/mount/<resource_name>/
When to use: When tasks fail with FileNotFoundError or permissions issues for mounted data folders/plugins.
What to look for: Verify file ownership permissions (root vs airflow user) and ensure symlinks or remote volumes are correctly mounted and visible to the application runtime.