Running Apache Airflow on Kubernetes brings incredible scalability, but when DAGs stall, tasks disappear, or the scheduler goes silent, navigating the infrastructure layer can be daunting.

This guide breaks down essential debugging commands into seven critical pillars, explaining why you need them, when to use them, and what to look out for during an incident.

1. Kubernetes Pods, Nodes, and Events

Before digging into Airflow application logs, check the health of your underlying Kubernetes infrastructure. Pod evictions, scheduling failures, or resource constraints are often the root cause of "mysterious" task failures.

List pods and states

kubectl get pods -n <namespace> -o wide
  • When to use: Your first line of defense when tasks aren't picking up or the UI is unresponsive.

  • What to look for: Look for pods in CrashLoopBackOff, OOMKilled, or Pending states. The -o wide flag gives you the IP and node name, helping you identify if a specific Kubernetes node is failing.
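
On a busy namespace the healthy pods can drown out the broken ones. As a minimal sketch, the helper below keeps the header row plus any pod whose STATUS is not Running or Completed. The function name unhealthy_pods is hypothetical (it is not part of kubectl), and the filter assumes the default column layout, where STATUS is the third column.

```shell
# Hypothetical helper (not a kubectl feature): reads `kubectl get pods`
# output on stdin and keeps the header plus every pod whose STATUS
# (assumed to be the 3rd column) is neither Running nor Completed.
unhealthy_pods() {
  awk 'NR == 1 || ($3 != "Running" && $3 != "Completed")'
}

# Usage (requires cluster access):
#   kubectl get pods -n <namespace> -o wide | unhealthy_pods
```

A pod that is Running but not ready (for example READY 0/1) passes this filter unnoticed, so it complements a quick look at the READY and RESTARTS columns rather than replacing it.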

Identify issues in a specific pod

kubectl describe pod <pod_name> -n <namespace>
  • When to use: When a scheduler, worker, or webserver pod is stuck initializing, failing, or refusing to terminate.
  • What to look for: Scroll to the "Events" section at the bottom. It will reveal reasons for container failures, image pull errors, liveness/readiness probe failures, or resource limits being hit.

Check cluster events sorted by the most recent

kubectl get events -n <namespace> --sort-by='.lastTimestamp'
  • When to use: When multiple things are breaking at once and you need a chronological timeline of cluster issues.
  • What to look for: Look for warnings related to failed scheduling, node pressure, or API server connection drops.
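
When the event list is long, a tally of warning reasons can show at a glance what is failing most often. The sketch below assumes the default kubectl get events columns (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); warning_reasons is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get events` output on stdin and
# prints "<count> <reason>" for Warning events, most frequent first.
# Assumes TYPE is the 2nd field and REASON the 3rd in each data row.
warning_reasons() {
  awk 'NR > 1 && $2 == "Warning" { count[$3]++ }
       END { for (reason in count) print count[reason], reason }' | sort -rn
}

# Usage (requires cluster access):
#   kubectl get events -n <namespace> --sort-by='.lastTimestamp' | warning_reasons
```

If you only want the warnings themselves, kubectl can also filter server-side with --field-selector type=Warning.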

Check resource usage

kubectl top pods -n <namespace>
kubectl top nodes
  • When to use: When tasks are lagging or pods are randomly dying due to suspected Out-Of-Memory (OOM) errors.
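  • What to look for: Identify memory or CPU spikes. Sustained memory pressure on a node can lead the kubelet to evict pods, including Airflow workers, while CPU saturation throttles containers rather than evicting them. Note that kubectl top only works when metrics-server (or another Metrics API provider) is installed in the cluster.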
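
To spot nodes that are close to memory exhaustion without reading the whole table, the output of kubectl top nodes can be filtered by its MEMORY% column. This is a sketch under the assumption that the columns are NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%; hot_nodes and its default threshold of 90 are illustrative choices.

```shell
# Hypothetical helper: reads `kubectl top nodes` output on stdin and prints
# the name and MEMORY% of every node at or above the given threshold.
# The threshold argument is optional and defaults to 90 (an arbitrary pick).
hot_nodes() {
  threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {
      mem = $5
      sub(/%/, "", mem)          # "96%" -> "96"
      if (mem + 0 >= limit + 0) print $1, $5
    }'
}

# Usage (requires cluster access and a Metrics API provider):
#   kubectl top nodes | hot_nodes 90
```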

2. Scheduler and Worker Logs

The Airflow Scheduler is the brain of your operation. When it gets bogged down or loses connection to the Kubernetes API, your entire data pipeline stalls.

View recent scheduler logs

kubectl logs <scheduler_pod> -n <namespace> -c scheduler --tail=500
  • When to use: When DAGs stop triggering or tasks are stuck in the queued state.
  • What to look for: Look for database connection timeouts, or DAG file parsing that takes so long the scheduling loop falls behind.

Filter scheduler logs for stuck tasks or churn

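kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -iE "queued|scheduling|state mismatch"
  • When to use: When tasks are stuck in the queue indefinitely or fluctuating between states without executing. (The -E flag is required; without it grep treats | as a literal character and matches nothing useful.)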
  • What to look for: Look for logs indicating Airflow is repeatedly setting a task to queued but failing to hand it off to the executor.

Check worker pod logs

kubectl logs <worker_pod> -n <namespace> --tail=200
  • When to use: When a specific task fails immediately upon starting, or when debugging Celery/Kubernetes workers.
  • What to look for: Python tracebacks, missing environment variables, or dependency errors specific to your DAG execution.

Check for Kubernetes Executor watcher timeouts

kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -c "ReadTimeoutError"
  • When to use: If you notice a high count of tasks failing or dropping off without any specific error inside the task logs themselves.
  • What to look for: A high count indicates that the scheduler’s connection to the Kubernetes API server is timing out, meaning it is losing track of running pods.

Check for critical watcher deaths or chunking errors

kubectl logs <scheduler_pod> -n <namespace> -c scheduler | grep -iE "InvalidChunkLength|watcher.*died"
  • When to use: When the scheduler completely stops processing Kubernetes executor pods.
  • What to look for: These errors indicate a broken HTTP stream between Airflow and Kubernetes. If the watcher dies, the scheduler won't know if a task finished or failed.
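
The two watcher checks above can be combined into a single pass over the scheduler log. The sketch below simply counts lines containing the same three patterns used in those grep commands; watcher_error_summary is a hypothetical helper name and the output format is arbitrary.

```shell
# Hypothetical helper: reads scheduler log lines on stdin and prints one
# summary line with the number of lines matching each watcher-related pattern.
watcher_error_summary() {
  awk '
    /ReadTimeoutError/            { timeouts++ }
    /InvalidChunkLength/          { chunks++ }
    tolower($0) ~ /watcher.*died/ { died++ }
    END {
      printf "ReadTimeoutError=%d InvalidChunkLength=%d watcher_died=%d\n",
             timeouts, chunks, died
    }'
}

# Usage (requires cluster access):
#   kubectl logs <scheduler_pod> -n <namespace> -c scheduler | watcher_error_summary
```

Running it a few minutes apart gives a rough sense of whether the error counts are still climbing.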

3. Storage, Volumes, and Disk Usage

Airflow pods rely heavily on disks for logs, DAG parsing, and local XCom processing. A full disk will crash your metadata database or stop logs from writing.

List Persistent Volume Claims (PVCs)

kubectl get pvc -n <namespace>
  • When to use: When pods fail to start with a VolumeBinding error.
  • What to look for: Ensure the status is Bound. If it's Pending, your cloud provider or storage class is failing to provision the requested disk.

Check general disk space inside a pod

kubectl exec -it <pod> -n <namespace> -- df -h
  • When to use: When you suspect a pod is stalling because its underlying storage or ephemeral storage is full.
  • What to look for: Check if any mounted filesystem is at 100% utilization.
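
For pods with many mounts, it can help to print only the filesystems that are nearly full. The sketch below parses the POSIX-format output of df -P (one line per filesystem, Capacity in the fifth column); full_filesystems and the default threshold of 90 are illustrative, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints the mount
# point and usage of every filesystem at or above the given threshold.
# The threshold argument is optional and defaults to 90 (an arbitrary pick).
full_filesystems() {
  threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # "99%" -> "99"
      if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage (requires cluster access):
#   kubectl exec <pod> -n <namespace> -- df -P | full_filesystems 90
```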

Check disk usage specifically for Airflow logs

 
kubectl exec -it <pod> -n <namespace> -- du -sh /usr/local/airflow/logs
  • When to use: When you run a persistent worker or remote log syncing fails, causing local log storage to swell over time.

  • What to look for: The total size. If it's taking up gigabytes of data, it’s time to implement a log retention policy or a log-clearing sidecar.

Check MySQL database disk usage

 
kubectl exec -it <mysql_pod> -n <namespace> -- du -sh /var/lib/mysql/
  • When to use: When the Airflow UI is incredibly slow or the database begins throwing disk write errors.

  • What to look for: Check if the database size matches expectations. If it is unexpectedly massive, your XComs or task instance history tables are bloating.

4. Airflow CLI Commands

Sometimes the fastest way to troubleshoot or unblock a stuck state is by talking directly to Airflow via its built-in CLI inside the container.

List all DAGs

airflow dags list
  • When to use: To verify if the scheduler has successfully compiled and recognized your DAG files.

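  • What to look for: If your new DAG isn't listed here, there is either a syntax/import error or a DAG synchronization issue. On Airflow 2.x, airflow dags list-import-errors prints the files that failed to import along with their tracebacks.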

Delete an orphaned or specific DAG

 
airflow dags delete <dag_id>
  • When to use: When a deleted code file leaves a "ghost" DAG in the UI that keeps throwing errors, or when you need to completely wipe a DAG's history.

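  • What to look for: Confirm that the deletion removed the DAG's runs and task instances from the metadata database. If the DAG file still exists in the DAGs folder, the scheduler will register the DAG again on its next parse.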

Check the exact state of a task

 
airflow tasks state <dag_id> <task_id> <execution_date>
  • When to use: When the webserver UI shows a conflicting status or lags behind reality, and you need the source-of-truth state.

  • What to look for: Returns exact database states like success, running, failed, or queued.

5. Database Connectivity and MySQL

Airflow is notorious for hammering its metadata database. Connection pool exhaustion or bloated binary logs can completely paralyze your cluster.

Test DB connectivity from the scheduler pod

 
kubectl exec -it <scheduler_pod> -n <namespace> -- python3 -c "import MySQLdb; MySQLdb.connect(host='<host>', user='<user>', passwd='<pass>', db='<db>').close()"
  • When to use: When the scheduler logs complain about database connection failures, helping you determine if it's a network/credential issue or an application config bug.

  • What to look for: If it throws a Connection Refused or Access Denied error, the issue lies in your K8s network policies or credentials. If it returns nothing, the connection is healthy.

View active MySQL processes

 
mysql -h <host> -u <user> -p -e "SHOW PROCESSLIST;"
  • When to use: When Airflow locks up completely or runs excruciatingly slow.

  • What to look for: Look for long-running queries or a large number of sleeping connections, which can point to connection pool leaks from the workers/scheduler.
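
Counting connections per command type makes a pile-up of sleeping connections easier to see than scanning the raw list. This sketch relies on the mysql client printing tab-separated rows when its output is piped, with Command as the fifth column of SHOW PROCESSLIST; processlist_by_command is a hypothetical helper name.

```shell
# Hypothetical helper: reads tab-separated `SHOW PROCESSLIST` output on stdin
# and prints "<count> <Command>" per command type, most frequent first.
processlist_by_command() {
  awk -F'\t' 'NR > 1 { count[$5]++ }
              END { for (cmd in count) print count[cmd], cmd }' | sort -rn
}

# Usage (requires database access):
#   mysql -h <host> -u <user> -p -e "SHOW PROCESSLIST;" | processlist_by_command
```

A large Sleep count is only suspicious when it approaches the server's max_connections or keeps growing, since connection pools hold some idle connections by design.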

Show MySQL binary logs

 
mysql -h <host> -u <user> -p -e "SHOW BINARY LOGS;"
  • When to use: When the database server runs out of disk space due to high transactional volume (common with heavy XCom use or heavy task scheduling).

  • What to look for: A massive list of log files eating up disk space on your DB instance.
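
To turn that list into a single number, the File_size column can be summed. The sketch assumes the piped, tab-separated output of the mysql client, with Log_name in the first column and File_size (in bytes) in the second; binlog_total_mib is a hypothetical helper name.

```shell
# Hypothetical helper: reads tab-separated `SHOW BINARY LOGS` output on stdin
# and prints the combined size of all listed binary logs in MiB.
binlog_total_mib() {
  awk -F'\t' 'NR > 1 { bytes += $2 }
              END { printf "%.1f MiB\n", bytes / 1048576 }'
}

# Usage (requires database access):
#   mysql -h <host> -u <user> -p -e "SHOW BINARY LOGS;" | binlog_total_mib
```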

Purge binary logs older than 7 days

 
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;
  • When to use: Emergency maintenance when your MySQL database disk hits 100% due to un-rotated binary logs.

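  • What to look for: This immediately frees up disk space by deleting binary log files older than the cutoff. Before running it, confirm that no replica still needs those files and that they are not required for point-in-time recovery.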

Grant Airflow user privileges

 
GRANT CREATE, ALTER, DROP, INDEX, REFERENCES ON <airflow_db>.* TO '<airflow_user>'@'%'; FLUSH PRIVILEGES;
  • When to use: During Airflow upgrades (airflow db upgrade, renamed to airflow db migrate in Airflow 2.7) when the migration scripts fail due to missing schema modification permissions.

  • What to look for: Resolves Access denied for user errors during database initialization.

6. Networking, DNS, and SSL

Airflow tasks regularly communicate with external APIs, cloud data warehouses, and internal services. Network misconfigurations are incredibly common in Kubernetes.

Check CoreDNS pods for gateway/DNS issues

 
kubectl get pods -n kube-system | grep coredns
  • When to use: When multiple Airflow tasks suddenly fail with "Host not found" or DNS resolution errors.

  • What to look for: Ensure the CoreDNS pods are Running and check the RESTARTS column. Frequent restarts indicate that internal cluster DNS is crashing.

Test internal DNS resolution

 
kubectl exec -it <pod> -n <namespace> -- nslookup <fqdn>
  • When to use: When an Airflow pod can't reach your database, Git repo, or an external API.

  • What to look for: See if it successfully resolves the IP address. If it fails, your Kubernetes DNS or CoreDNS routing is broken.

Test SSL certificate verification to external systems

 
kubectl exec -it <pod> -n <namespace> -- openssl s_client -connect <host>:<port> -CAfile /etc/ssl/certs/ca-certificates.crt
  • When to use: When external API tasks fail with SSL: CERTIFICATE_VERIFY_FAILED.

  • What to look for: Check whether certificate verification succeeds (Verify return code: 0 (ok); newer OpenSSL versions also print Verification: OK). If it fails, you may need to mount custom corporate CA certificates into your Airflow Docker image.
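
The s_client output is long, and the verdict sits on a single line. As a rough sketch, the helper below pulls out the first "Verify return code" line; tls_verify_result is a hypothetical name, and the usage line assumes the container image ships sh and openssl.

```shell
# Hypothetical helper: reads `openssl s_client` output on stdin and prints
# the first "Verify return code" line with leading whitespace removed.
tls_verify_result() {
  awk '!seen && /Verify return code:/ { sub(/^[ \t]+/, ""); print; seen = 1 }'
}

# Usage (requires cluster access; stdin is redirected from /dev/null so that
# s_client exits after the handshake instead of waiting for input):
#   kubectl exec <pod> -n <namespace> -- sh -c \
#     'openssl s_client -connect <host>:<port> -CAfile /etc/ssl/certs/ca-certificates.crt </dev/null 2>/dev/null' \
#     | tls_verify_result
```

A result other than "0 (ok)" names the failure, which helps decide whether a CA certificate is missing or the server certificate itself is the problem.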

Identify upstream timeouts from ingress controllers

kubectl logs <ingress_controller_pod> -n <namespace> | grep "upstream timed out"
  • When to use: When you try to download huge log files or load massive DAG graphs in the Airflow UI, and you get a 504 Gateway Timeout.

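  • What to look for: Matching lines confirm that the webserver took too long to reply, in which case you need to increase the timeout limit on your Kubernetes Ingress controller. The "upstream timed out" wording is what NGINX-based ingress controllers log; other controllers phrase it differently.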

7. File System and DAG Verification

A classic point of confusion: your code is updated in Git, but the Airflow UI doesn't show the changes. These commands help verify if your DAG files actually reached the containers.

Compare filesystem DAGs versus the UI count

 
kubectl exec -it <scheduler_pod> -n <namespace> -- ls /usr/local/airflow/dags/ | wc -l
  • When to use: When there is a mismatch between what you see in your code repository and what is rendering on the Airflow Webserver UI.

  • What to look for: If this count matches your repo but the UI doesn't, the scheduler is failing to parse the files. If this count is lower, your DAG syncing mechanism (like Git-Sync or a Shared PVC) is broken.
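
A plain ls | wc -l counts every entry in the folder, including __pycache__ and non-Python files, which can make the numbers look healthier than they are. The sketch below counts only entries ending in .py; count_py_files is a hypothetical helper name, and it only sees the top level of the folder.

```shell
# Hypothetical helper: reads one file name per line on stdin and prints how
# many end in ".py". Trailing carriage returns (which a TTY session can add
# to each line) are stripped before matching.
count_py_files() {
  awk '{ sub(/\r$/, "") } /\.py$/ { n++ } END { print n + 0 }'
}

# Usage (requires cluster access; -it is not needed for a piped command):
#   kubectl exec <scheduler_pod> -n <namespace> -- ls /usr/local/airflow/dags/ | count_py_files
```

Keep in mind that one Python file can define several DAGs or none, so this number is a sanity check against the repository rather than an exact match for the UI count.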

List actual file locations and verify resource mounts inside the pod

kubectl exec -it <scheduler_pod> -n <namespace> -- ls -la /usr/local/airflow/dags/
kubectl exec -it <scheduler_pod> -n <namespace> -- ls -la /app/mount/<resource_name>/
  • When to use: When tasks fail with FileNotFoundError or permissions issues for mounted data folders/plugins.

  • What to look for: Verify file ownership and permissions (root vs. airflow user) and ensure symlinks or remote volumes are correctly mounted and visible to the application runtime.

Last update: 06-10-2026 01:45 AM