Apache Mesos Monitoring
Apache Mesos is an open-source project to manage computer clusters. It abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built and run effectively.
Why use the Apache Mesos Telegraf Plugin?
The Apache Mesos Telegraf Plugin allows you to collect observability metrics provided by the Mesos master and agent nodes and insert them into your InfluxDB instance. The plugin can collect a set of metrics that enable cluster operators to monitor resource usage and detect issues before they become a problem.
How to monitor Apache Mesos using the Telegraf plugin
The Apache Mesos Telegraf Plugin will collect metrics from Apache Mesos and insert them into InfluxDB. Because a cluster can be deployed in numerous ways, the plugin is not configured to gather metrics from Mesos by default. You need to specify the master and slave nodes that the plugin should gather metrics from.
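To illustrate what specifying nodes looks like, the Python sketch below renders a Telegraf `[[inputs.mesos]]` stanza from lists of master and slave URLs. The option names (`timeout`, `masters`, `slaves`) are taken from the plugin's sample configuration as best recalled, so verify them against the README for your Telegraf version. The helper `render_mesos_input` and the example addresses are hypothetical.

```python
import json


def render_mesos_input(masters, slaves=(), timeout_ms=100):
    """Render a Telegraf [[inputs.mesos]] stanza as TOML text.

    Hypothetical helper. The option names are assumed from the plugin's
    sample configuration; verify them for your Telegraf version.
    """
    def toml_list(items):
        # json.dumps produces double-quoted strings, which are valid
        # TOML basic strings for plain URLs like these.
        return "[" + ", ".join(json.dumps(item) for item in items) + "]"

    lines = [
        "[[inputs.mesos]]",
        "  ## Timeout, in ms (assumed unit).",
        f"  timeout = {int(timeout_ms)}",
        "  ## Mesos master nodes to scrape.",
        f"  masters = {toml_list(masters)}",
    ]
    if slaves:
        lines += [
            "  ## Mesos slave (agent) nodes to scrape.",
            f"  slaves = {toml_list(slaves)}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 5050 and 5051 are the usual default ports for master and agent.
    print(render_mesos_input(
        masters=["http://localhost:5050"],
        slaves=["http://localhost:5051"],
    ))
```

Generating the stanza from lists keeps the node inventory in one place, which may help when masters and slaves are added or removed often. For a small, static cluster, writing the stanza by hand is just as reasonable.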
Key Apache Mesos metrics to use for monitoring
Some of the important Apache Mesos metrics that you should proactively monitor include:
Resources:
master/cpus_percent: Percentage of allocated CPUs
master/cpus_used: Number of allocated CPUs
master/cpus_total: Number of CPUs
master/cpus_revocable_percent: Percentage of allocated revocable CPUs
master/cpus_revocable_total: Number of revocable CPUs
master/cpus_revocable_used: Number of allocated revocable CPUs
master/disk_percent: Percentage of allocated disk space
master/disk_used: Allocated disk space in MB
master/disk_total: Disk space in MB
master/disk_revocable_percent: Percentage of allocated revocable disk space
master/disk_revocable_total: Revocable disk space in MB
master/disk_revocable_used: Allocated revocable disk space in MB
master/gpus_percent: Percentage of allocated GPUs
master/gpus_used: Number of allocated GPUs
master/gpus_total: Number of GPUs
master/gpus_revocable_percent: Percentage of allocated revocable GPUs
master/gpus_revocable_total: Number of revocable GPUs
master/gpus_revocable_used: Number of allocated revocable GPUs
master/mem_percent: Percentage of allocated memory
master/mem_used: Allocated memory in MB
master/mem_total: Memory in MB
master/mem_revocable_percent: Percentage of allocated revocable memory
master/mem_revocable_total: Revocable memory in MB
master/mem_revocable_used: Allocated revocable memory in MB
Master
master/elected: Whether this is the elected master
master/uptime_secs: Uptime in seconds
System
system/cpus_total: Number of CPUs available in this master node
system/load_15min: Load average for the past 15 minutes
system/load_5min: Load average for the past 5 minutes
system/load_1min: Load average for the past minute
system/mem_free_bytes: Free memory in bytes
system/mem_total_bytes: Total memory in bytes
Slaves
master/slave_registrations
master/slave_removals
master/slave_reregistrations
master/slave_shutdowns_scheduled
master/slave_shutdowns_canceled
master/slave_shutdowns_completed
master/slaves_active
master/slaves_connected
master/slaves_disconnected
master/slaves_inactive
master/slave_unreachable_canceled
master/slave_unreachable_completed
master/slave_unreachable_scheduled
master/slaves_unreachable
frameworks
master/frameworks_active
master/frameworks_connected
master/frameworks_disconnected
master/frameworks_inactive
master/outstanding_offers
framework offers
master/frameworks/subscribed
master/frameworks/calls_total
master/frameworks/calls
master/frameworks/events_total
master/frameworks/events
master/frameworks/operations_total
master/frameworks/operations
master/frameworks/tasks/active
master/frameworks/tasks/terminal
master/frameworks/offers/sent
master/frameworks/offers/accepted
master/frameworks/offers/declined
master/frameworks/offers/rescinded
master/frameworks/roles/suppressed
tasks
master/tasks_error
master/tasks_failed
master/tasks_finished
master/tasks_killed
master/tasks_lost
master/tasks_running
master/tasks_staging
master/tasks_starting
master/tasks_dropped
master/tasks_gone
master/tasks_gone_by_operator
master/tasks_killing
master/tasks_unreachable
messages
master/invalid_executor_to_framework_messages
master/invalid_framework_to_executor_messages
master/invalid_status_update_acknowledgements
master/invalid_status_updates
master/dropped_messages
master/messages_authenticate
master/messages_deactivate_framework
master/messages_decline_offers
master/messages_executor_to_framework
master/messages_exited_executor
master/messages_framework_to_executor
master/messages_kill_task
master/messages_launch_tasks
master/messages_reconcile_tasks
master/messages_register_framework
master/messages_register_slave
master/messages_reregister_framework
master/messages_reregister_slave
master/messages_resource_request
master/messages_revive_offers
master/messages_status_update
master/messages_status_update_acknowledgement
master/messages_unregister_framework
master/messages_unregister_slave
master/messages_update_slave
master/recovery_slave_removals
master/slave_removals/reason_registered
master/slave_removals/reason_unhealthy
master/slave_removals/reason_unregistered
master/valid_framework_to_executor_messages
master/valid_status_update_acknowledgements
master/valid_status_updates
master/task_lost/source_master/reason_invalid_offers
master/task_lost/source_master/reason_slave_removed
master/task_lost/source_slave/reason_executor_terminated
master/valid_executor_to_framework_messages
master/invalid_operation_status_update_acknowledgements
master/messages_operation_status_update_acknowledgement
master/messages_reconcile_operations
master/messages_suppress_offers
master/valid_operation_status_update_acknowledgements
evqueue
master/event_queue_dispatches
master/event_queue_http_requests
master/event_queue_messages
master/operator_event_stream_subscribers
registrar
registrar/state_fetch_ms
registrar/state_store_ms
registrar/state_store_ms/max
registrar/state_store_ms/min
registrar/state_store_ms/p50
registrar/state_store_ms/p90
registrar/state_store_ms/p95
registrar/state_store_ms/p99
registrar/state_store_ms/p999
registrar/state_store_ms/p9999
registrar/state_store_ms/count
registrar/log/ensemble_size
registrar/log/recovered
registrar/queued_operations
registrar/registry_size_bytes
allocator
allocator/allocation_run_ms
allocator/allocation_run_ms/count
allocator/allocation_run_ms/max
allocator/allocation_run_ms/min
allocator/allocation_run_ms/p50
allocator/allocation_run_ms/p90
allocator/allocation_run_ms/p95
allocator/allocation_run_ms/p99
allocator/allocation_run_ms/p999
allocator/allocation_run_ms/p9999
allocator/allocation_runs
allocator/allocation_run_latency_ms
allocator/allocation_run_latency_ms/count
allocator/allocation_run_latency_ms/max
allocator/allocation_run_latency_ms/min
allocator/allocation_run_latency_ms/p50
allocator/allocation_run_latency_ms/p90
allocator/allocation_run_latency_ms/p95
allocator/allocation_run_latency_ms/p99
allocator/allocation_run_latency_ms/p999
allocator/allocation_run_latency_ms/p9999
allocator/roles/shares/dominant
allocator/event_queue_dispatches
allocator/offer_filters/roles/active
allocator/quota/roles/resources/offered_or_allocated
allocator/quota/roles/resources/guarantee
allocator/resources/cpus/offered_or_allocated
allocator/resources/cpus/total
allocator/resources/disk/offered_or_allocated
allocator/resources/disk/total
allocator/resources/mem/offered_or_allocated
allocator/resources/mem/total
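The master metric names are slash-separated keys, and Mesos serves them as a flat JSON object from its `/metrics/snapshot` endpoint. The following is a minimal Python sketch, assuming such a snapshot has already been loaded into a dict. It groups the keys by their first path segment and reports a few gauges that are non-zero. The function names, the choice of watched gauges, and the sample values are all hypothetical; which conditions matter depends on your cluster.

```python
from collections import defaultdict

# Illustrative selection only: gauges from the lists above that are
# often worth a look when they are non-zero.
WATCHED_GAUGES = {
    "master/slaves_disconnected": "slaves disconnected",
    "master/slaves_unreachable": "slaves unreachable",
    "master/frameworks_disconnected": "frameworks disconnected",
}


def group_by_prefix(snapshot):
    """Group slash-separated metric names by their first path segment."""
    groups = defaultdict(dict)
    for name, value in snapshot.items():
        prefix, _, rest = name.partition("/")
        groups[prefix][rest] = value
    return dict(groups)


def nonzero_gauges(snapshot):
    """Return {label: value} for watched gauges whose value is above zero."""
    return {
        label: snapshot[name]
        for name, label in WATCHED_GAUGES.items()
        if snapshot.get(name, 0) > 0
    }


if __name__ == "__main__":
    # Made-up sample values, not real cluster output.
    sample = {
        "master/elected": 1.0,
        "master/slaves_active": 3.0,
        "master/slaves_disconnected": 1.0,
        "system/load_1min": 0.42,
        "registrar/state_store_ms/p99": 12.5,
    }
    print(sorted(group_by_prefix(sample)))
    print(nonzero_gauges(sample))
```

Grouping by prefix mirrors how the lists above are organized, which can make it easier to compare a raw snapshot against the metric groups the plugin collects.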
Mesos slave metric groups
- resources
slave/cpus_percent
slave/cpus_used
slave/cpus_total
slave/cpus_revocable_percent
slave/cpus_revocable_total
slave/cpus_revocable_used
slave/disk_percent
slave/disk_used
slave/disk_total
slave/disk_revocable_percent
slave/disk_revocable_total
slave/disk_revocable_used
slave/gpus_percent
slave/gpus_used
slave/gpus_total
slave/gpus_revocable_percent
slave/gpus_revocable_total
slave/gpus_revocable_used
slave/mem_percent
slave/mem_used
slave/mem_total
slave/mem_revocable_percent
slave/mem_revocable_total
slave/mem_revocable_used
- agent
slave/registered
slave/uptime_secs
- system
system/cpus_total
system/load_15min
system/load_5min
system/load_1min
system/mem_free_bytes
system/mem_total_bytes
- executors
containerizer/mesos/container_destroy_errors
slave/container_launch_errors
slave/executors_preempted
slave/frameworks_active
slave/executor_directory_max_allowed_age_secs
slave/executors_registering
slave/executors_running
slave/executors_terminated
slave/executors_terminating
slave/recovery_errors
- tasks
slave/tasks_failed
slave/tasks_finished
slave/tasks_killed
slave/tasks_lost
slave/tasks_running
slave/tasks_staging
slave/tasks_starting
- messages
slave/invalid_framework_messages
slave/invalid_status_updates
slave/valid_framework_messages
slave/valid_status_updates
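As one example of putting the slave task metrics to use, the Python sketch below computes the share of ended tasks that failed on a single slave, from a metrics snapshot held in a dict. It assumes the four task metrics used here are cumulative counts of tasks that reached each terminal state. The function name and the sample numbers are hypothetical, and any alerting threshold is left to the operator.

```python
TERMINAL_KEYS = (
    "slave/tasks_failed",
    "slave/tasks_finished",
    "slave/tasks_killed",
    "slave/tasks_lost",
)


def failed_task_share(snapshot):
    """Fraction of ended tasks that failed, or None if no task has ended."""
    total = sum(snapshot.get(key, 0) for key in TERMINAL_KEYS)
    if total == 0:
        # Avoid dividing by zero on a slave that has not ended any task.
        return None
    return snapshot.get("slave/tasks_failed", 0) / total


if __name__ == "__main__":
    # Made-up sample values, not real cluster output.
    sample = {
        "slave/tasks_failed": 2,
        "slave/tasks_finished": 6,
        "slave/tasks_killed": 1,
        "slave/tasks_lost": 1,
    }
    print(failed_task_share(sample))
```

Because these counts accumulate over the lifetime of the slave process, a ratio over the difference between two snapshots is usually more informative than a ratio over the raw totals.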
You can learn more about Apache Mesos metrics on the project's documentation page.