728x90
Airflow에서는 Statsd라는 컴포넌트를 통해 Airflow의 메트릭을 Prometheus로 보내고, Grafana에서 시각적으로 확인해 볼 수 있습니다.
즉 Airflow에서 일어나는 일을 모니터링 할 수 있게 됩니다.
Airflow의 Metric에는 어떤 것들이 있는지 공식 홈페이지를 통해 확인 해보겠습니다.
그 중에서 유의깊게 봐야 할 metric에 대해서 빨간색으로 진하게(bold)처리 해두었으니, 필터링해서 보시면 될 것 같습니다.
1. Counters
Counters
- 카운터는 단순히 증가하는 값을 나타내며, 일반적으로 주어진 간격 동안의 이벤트 횟수를 추적합니다.
- 예를 들어, 요청이 서버로 들어오는 횟수나 오류가 발생한 횟수 등을 계산할 수 있습니다.
- 카운터는 보통 리셋되지 않고 지속적으로 증가합니다. 따라서 모니터링 시 카운터 값의 증가율을 관찰하여 시스템의 활동을 추적할 수 있습니다.
| 이름 | 설명 |
| <job_name>_start | Number of started <job_name> job, ex. SchedulerJob, LocalTaskJob |
| <job_name>_end | Number of ended <job_name> job, ex. SchedulerJob, LocalTaskJob |
| <job_name>_heartbeat_failure | Number of failed Heartbeats for a <job_name> job, ex. SchedulerJob, LocalTaskJob |
| local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code> | Number of LocalTaskJob terminations with a <return_code> while running a task <task_id> of a DAG <dag_id>. |
| local_task_job.task_exit | Number of LocalTaskJob terminations with a <return_code> while running a task <task_id> of a DAG <dag_id>. Metric with job_id, dag_id, task_id and return_code tagging. |
| operator_failures_<operator_name> | Operator <operator_name> failures |
| operator_failures | Operator <operator_name> failures. Metric with operator_name tagging. |
| operator_successes_<operator_name> | Operator <operator_name> successes |
| operator_successes | Operator <operator_name> successes. Metric with operator_name tagging. |
| ti_failures | Overall task instances failures. Metric with dag_id and task_id tagging. |
| ti_successes | Overall task instances successes. Metric with dag_id and task_id tagging. |
| previously_succeeded | Number of previously succeeded task instances. Metric with dag_id and task_id tagging. |
| zombies_killed | Zombie tasks killed. Metric with dag_id and task_id tagging. |
| scheduler_heartbeat | Scheduler heartbeats |
| dag_processing.processes | Relative number of currently running DAG parsing processes (ie this delta is negative when, since the last metric was sent, processes have completed). Metric with file_path and action tagging. |
| dag_processing.processor_timeouts | Number of file processors that have been killed due to taking too long. Metric with file_path tagging. |
| dag_processing.sla_callback_count | Number of SLA callbacks received |
| dag_processing.other_callback_count | Number of non-SLA callbacks received |
| dag_processing.file_path_queue_update_count | Number of times we’ve scanned the filesystem and queued all existing dags |
| dag_file_processor_timeouts | (DEPRECATED) same behavior as dag_processing.processor_timeouts |
| dag_processing.manager_stalls | Number of stalled DagFileProcessorManager |
| dag_file_refresh_error | Number of failures loading any DAG files |
| scheduler.tasks.killed_externally | Number of tasks killed externally. Metric with dag_id and task_id tagging. |
| scheduler.orphaned_tasks.cleared | Number of Orphaned tasks cleared by the Scheduler |
| scheduler.orphaned_tasks.adopted | Number of Orphaned tasks adopted by the Scheduler |
| scheduler.critical_section_busy | Count of times a scheduler process tried to get a lock on the critical section (needed to send tasks to the executor) and found it locked by another process. |
| sla_missed | Number of SLA misses. Metric with dag_id and task_id tagging. |
| sla_callback_notification_failure | Number of failed SLA miss callback notification attempts. Metric with dag_id and func_name tagging. |
| sla_email_notification_failure | Number of failed SLA miss email notification attempts. Metric with dag_id tagging. |
| ti.start.<dag_id>.<task_id> | Number of started task in a given dag. Similar to <job_name>_start but for task |
| ti.start | Number of started task in a given dag. Similar to <job_name>_start but for task. Metric with dag_id and task_id tagging. |
| ti.finish.<dag_id>.<task_id>.<state> | Number of completed task in a given dag. Similar to <job_name>_end but for task |
| ti.finish | Number of completed task in a given dag. Similar to <job_name>_end but for task Metric with dag_id and task_id tagging. |
| dag.callback_exceptions | Number of exceptions raised from DAG callbacks. When this happens, it means DAG callback is not working. Metric with dag_id tagging |
| celery.task_timeout_error | Number of AirflowTaskTimeout errors raised when publishing Task to Celery Broker. |
| celery.execute_command.failure | Number of non-zero exit code from Celery task. |
| task_removed_from_dag.<dag_id> | Number of tasks removed for a given dag (i.e. task no longer exists in DAG). |
| task_removed_from_dag | Number of tasks removed for a given dag (i.e. task no longer exists in DAG). Metric with dag_id and run_type tagging. |
| task_restored_to_dag.<dag_id> | Number of tasks restored for a given dag (i.e. task instance which was previously in REMOVED state in the DB is added to DAG file) |
| task_restored_to_dag.<dag_id> | Number of tasks restored for a given dag (i.e. task instance which was previously in REMOVED state in the DB is added to DAG file). Metric with dag_id and run_type tagging. |
| task_instance_created_<operator_name> | Number of tasks instances created for a given Operator |
| task_instance_created | Number of tasks instances created for a given Operator. Metric with dag_id and run_type tagging. |
| triggerer_heartbeat | Triggerer heartbeats |
| triggers.blocked_main_thread | Number of triggers that blocked the main thread (likely due to not being fully asynchronous) |
| triggers.failed | Number of triggers that errored before they could fire an event |
| triggers.succeeded | Number of triggers that have fired at least one event |
| dataset.updates | Number of updated datasets |
| dataset.orphaned | Number of datasets marked as orphans because they are no longer referenced in DAG schedule parameters or task outlets |
| dataset.triggered_dagruns | Number of DAG runs triggered by a dataset update |
2. Gauges
Guages
- 게이지는 특정 시점에서의 값을 나타내며, 주로 변화하지 않는 값을 추적합니다.
- 예를 들어, 현재 사용 중인 메모리 양이나 디스크 공간의 남은 양 등을 게이지로 표현할 수 있습니다.
- 게이지는 시간에 따라 변하지 않는 값이므로 현재 상태를 나타내는 데 사용됩니다.
| 이름 | 설명 |
| dagbag_size | Number of DAGs found when the scheduler ran a scan based on its configuration |
| dag_processing.import_errors | Number of errors from trying to parse DAG files |
| dag_processing.total_parse_time | Seconds taken to scan and import dag_processing.file_path_queue_size DAG files |
| dag_processing.file_path_queue_size | Number of DAG files to be considered for the next scan |
| dag_processing.last_run.seconds_ago.<dag_file> | Seconds since <dag_file> was last processed |
| scheduler.tasks.starving | Number of tasks that cannot be scheduled because of no open slot in pool |
| scheduler.tasks.executable | Number of tasks that are ready for execution (set to queued) with respect to pool limits, DAG concurrency, executor state, and priority. |
| executor.open_slots | Number of open slots on executor |
| executor.queued_tasks | Number of queued tasks on executor |
| executor.running_tasks | Number of running tasks on executor |
| pool.open_slots.<pool_name> | Number of open slots in the pool |
| pool.open_slots | Number of open slots in the pool. Metric with pool_name tagging. |
| pool.queued_slots.<pool_name> | Number of queued slots in the pool |
| pool.queued_slots | Number of queued slots in the pool. Metric with pool_name tagging. |
| pool.running_slots.<pool_name> | Number of running slots in the pool |
| pool.running_slots | Number of running slots in the pool. Metric with pool_name tagging. |
| pool.deferred_slots.<pool_name> | Number of deferred slots in the pool |
| pool.deferred_slots | Number of deferred slots in the pool. Metric with pool_name tagging. |
| pool.starving_tasks.<pool_name> | Number of starving tasks in the pool |
| pool.starving_tasks | Number of starving tasks in the pool. Metric with pool_name tagging. |
| triggers.running.<hostname> | Number of triggers currently running for a triggerer (described by hostname) |
| triggers.running | Number of triggers currently running for a triggerer (described by hostname). Metric with hostname tagging. |
3. Timers
| 이름 | 설명 |
| dagrun.dependency-check.<dag_id> | Milliseconds taken to check DAG dependencies |
| dagrun.dependency-check | Milliseconds taken to check DAG dependencies. Metric with dag_id tagging. |
| dag.<dag_id>.<task_id>.duration | Seconds taken to run a task |
| task.duration | Seconds taken to run a task. Metric with dag_id and task-id tagging. |
| dag.<dag_id>.<task_id>.scheduled_duration | Seconds a task spends in the Scheduled state, before being Queued |
| task.scheduled_duration | Seconds a task spends in the Scheduled state, before being Queued. Metric with dag_id and task_id tagging. |
| dag.<dag_id>.<task_id>.queued_duration | Seconds a task spends in the Queued state, before being Running |
| task.queued_duration | Seconds a task spends in the Queued state, before being Running. Metric with dag_id and task_id tagging. |
| dag_processing.last_duration.<dag_file> | Seconds taken to load the given DAG file |
| dag_processing.last_duration | Seconds taken to load the given DAG file. Metric with file_name tagging. |
| dagrun.duration.success.<dag_id> | Seconds taken for a DagRun to reach success state |
| dagrun.duration.success | Seconds taken for a DagRun to reach success state. Metric with dag_id and run_type tagging. |
| dagrun.duration.failed.<dag_id> | Seconds taken for a DagRun to reach failed state |
| dagrun.duration.failed | Seconds taken for a DagRun to reach failed state. Metric with dag_id and run_type tagging. |
| dagrun.schedule_delay.<dag_id> | Milliseconds of delay between the scheduled DagRun start date and the actual DagRun start date |
| dagrun.schedule_delay | Milliseconds of delay between the scheduled DagRun start date and the actual DagRun start date. Metric with dag_id tagging. |
| scheduler.critical_section_duration | Milliseconds spent in the critical section of scheduler loop – only a single scheduler can enter this loop at a time |
| scheduler.critical_section_query_duration | Milliseconds spent running the critical section task instance query |
| scheduler.scheduler_loop_duration | Milliseconds spent running one scheduler loop |
| dagrun.<dag_id>.first_task_scheduling_delay | Seconds elapsed between first task start_date and dagrun expected start |
| dagrun.first_task_scheduling_delay | Seconds elapsed between first task start_date and dagrun expected start. Metric with dag_id and run_type tagging. |
| collect_db_dags | Milliseconds taken for fetching all Serialized Dags from DB |
| kubernetes_executor.clear_not_launched_queued_tasks.duration | Milliseconds taken for clearing not launched queued tasks in Kubernetes Executor |
| kubernetes_executor.adopt_task_instances.duration | Milliseconds taken to adopt the task instances in Kubernetes Executor |

참조:
728x90
댓글