This wiki is a work in progress (WIP). Please feel free to modify this page with relevant information.
The intention of this wiki is to list the metrics and events that should be monitored or collected within the NFVI. In addition to the metrics/events collected about the NFVI, some information about the monitoring process (the process which collects the information and metrics) itself is also required.
This list should be developed in conjunction with the Doctor (Faults) and VES Projects in OPNFV.
Metrics/Events Format
It's important to define a common format that can be used for the list of identified metrics and events that should be monitored/collected in the NFVI.
- Name
- Where the Metric/Event is collected (Host/Guest/Both)
- Scope of coverage
- Unit(s) of measure or associated severities
- Definition
- Method of Measurement
- Sources of Error
- Comments
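As an illustration only, the fields above could be captured in a small record type so that every metric or event is described the same way. The class and attribute names below (`Descriptor`, `CollectionLocation`, and so on) are assumptions made for this sketch; they are not defined by this wiki or by any OPNFV project.

```python
from dataclasses import dataclass
from enum import Enum


class CollectionLocation(Enum):
    """Where the metric/event is collected."""
    HOST = "Host"
    GUEST = "Guest"
    BOTH = "Both"


@dataclass(frozen=True)
class Descriptor:
    """One entry in the metrics/events list (hypothetical field names)."""
    name: str
    collection_location: CollectionLocation
    scope_of_coverage: str
    units_or_severities: str
    definition: str
    method_of_measurement: str = ""   # left empty until agreed
    sources_of_error: str = ""
    comments: str = ""


# Example entry, using values taken from the compute metrics table on this page.
cpu_idle = Descriptor(
    name="cpu_idle",
    collection_location=CollectionLocation.HOST,
    scope_of_coverage="The host CPUs, individually or summed across all CPUs",
    units_or_severities="nanoseconds or percentage",
    definition="Time the host CPU spends idle.",
)
```

Fields that are still undecided default to empty strings, matching the blank cells in the tables on this page.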
Distinction between metrics and events
Information to be collected in conjunction with NFVI Metrics/Events
It's essential to collect some information about the environment that is being monitored, as well as about the monitoring process(es) themselves, in order to associate the metrics/events with the relevant host.
Host information:
Each host in a deployment should have a unique identifier that distinguishes it from all other hosts. A UUID can be used for this purpose.
Monitoring Process information:
Each monitoring process in a deployment should have a unique process identifier.
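A minimal sketch of how such identifiers might be produced, assuming a random (version 4) UUID per host and a process identifier built from the host identifier, a process name, and the OS process ID. The function names and the `host/name/pid` layout are hypothetical choices for illustration, not a format prescribed by this page.

```python
import uuid


def new_host_id() -> str:
    """Generate a random UUID for a host.

    The value would need to be stored persistently so that the host
    keeps the same identifier across reboots.
    """
    return str(uuid.uuid4())


def monitoring_process_id(host_id: str, process_name: str, pid: int) -> str:
    """Build a deployment-wide process identifier (hypothetical layout)."""
    return f"{host_id}/{process_name}/{pid}"
```

For example, `monitoring_process_id(new_host_id(), "collector", 4242)` yields an identifier that stays unique across hosts even when two hosts happen to use the same PID.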
Each monitoring process in a deployment should support the following events:
Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|
Heartbeat/ping | Host/Guest (where the monitoring process is running) | liveliness check | N/A | Heartbeat/ping to check liveliness of monitoring process | external ping |
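
The heartbeat event above could be checked on the receiving side roughly as follows. This is a sketch under stated assumptions: the `HeartbeatTracker` class, its timeout parameter, and the idea of keeping the last-seen timestamp per process are illustrative and not part of any existing tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatTracker:
    """Tracks the last heartbeat seen from each monitoring process."""
    timeout_s: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, process_id: str, timestamp: float) -> None:
        """Store the time (in seconds) of the latest heartbeat."""
        self.last_seen[process_id] = timestamp

    def missing(self, now: float) -> List[str]:
        """Return processes whose last heartbeat is older than the timeout."""
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


tracker = HeartbeatTracker(timeout_s=10.0)
tracker.record("process-a", 100.0)
tracker.record("process-b", 105.0)
late = tracker.missing(now=112.0)  # only process-a exceeded the 10 s timeout
```

Passing timestamps in explicitly keeps the check independent of the clock source, which makes it easier to test.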
Each monitoring process in a deployment should support the following Metrics:
Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|
| Host/Guest (where the monitoring process is running) | The monitoring application being used | | The number of metrics currently in the write queue. | |||
| Host/Guest (where the monitoring process is running) | The monitoring application being used | | The number of metrics dropped due to a queue length limitation. | |||
| Host/Guest (where the monitoring process is running) | The monitoring application being used | | The number of elements in the metric cache. | |||
CPU utilization | Host/Guest (where the monitoring process is running) | The CPUs that are being used by the monitoring application | Nanoseconds or percentage of total CPU utilization | The CPU utilization of the monitoring process | |||
Memory utilization | Host/Guest (where the monitoring process is running) | The memory that is being used by the monitoring application | kibibytes | The amount of physical RAM used by the monitoring application | |||
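
On Linux, the CPU and memory utilization of a monitoring process can be derived from `/proc/<pid>/stat` and `/proc/<pid>/status`. The sketch below parses the text of those files, assuming the standard Linux layout; the function names are made up for this example, and the sample input is invented.

```python
def parse_cpu_ticks(stat_text: str) -> int:
    """Return utime + stime (clock ticks) from /proc/<pid>/stat content.

    The command name (field 2) may contain spaces or parentheses, so the
    remaining fields are taken after the last ')'.
    """
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 14 (utime) is rest[11]
    # and field 15 (stime) is rest[12].
    return int(rest[11]) + int(rest[12])


def ticks_to_ns(ticks: int, ticks_per_second: int) -> int:
    """Convert clock ticks to nanoseconds.

    ticks_per_second is typically obtained from os.sysconf("SC_CLK_TCK").
    """
    return ticks * 1_000_000_000 // ticks_per_second


def parse_rss_kib(status_text: str) -> int:
    """Return resident memory in KiB from /proc/<pid>/status content.

    The kernel labels the value 'kB', but it is counted in units of 1024 bytes.
    """
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("VmRSS not found")


# Invented sample data for illustration.
sample_stat = "1234 (my (odd) proc) S 1 1234 1234 0 -1 4194304 100 0 0 0 250 50 0 0 20 0 1 0 100 1000000 200"
sample_status = "Name:\tproc\nVmRSS:\t    2048 kB\nThreads:\t1\n"
cpu_ns = ticks_to_ns(parse_cpu_ticks(sample_stat), 100)
rss_kib = parse_rss_kib(sample_status)
```

Sampling the CPU time twice and dividing the difference by the elapsed wall-clock time gives the percentage form of the metric.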
Timing Information
NFVI Other/Additional Information
BIOS information
NFVI Events
What about entire node and switch failures? In terms of service affecting priority, host and switch failures are at the top as they can affect the most VMs / Containers / VNFs...
While the status of switches and hosts might be the domain of services that have a system-wide view, a host-resident component might be part of the monitoring functionality.
Compute
At a minimum the following events should be monitored:
- Machine check exceptions (System, Processor, Memory...) [TODO: Break this down further]
- DIMM corrected and uncorrected Errors
Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|
Networking
At a minimum the following events should be monitored for a Networking interface:
- Link Status
- Dropped Receive Packets – An increasing count could indicate the failure or service interruption of an upstream process.
vSwitch liveliness
Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|
Storage
Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|
NFVI Metrics
Compute
At a minimum the following metrics should be collected:
- CPU utilization [TODO: Break this down further]
- vCPU utilization [TODO: Break this down further]
- Memory utilization [TODO: Break this down further]
- vMemory utilization [TODO: Break this down further]
- Cache utilization
  - Hits
  - Misses
- Instructions per clock (IPC)
- Last level cache utilization
- Memory Bandwidth utilization
- Platform Metrics (thermals, fan-speed) [TODO: Break this down further]
Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|
cpu_idle | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spends idle. | |||
cpu_nice | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spent running user space processes that have been niced. The priority level of a user space process can be changed by adjusting its niceness. | |||
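cpu_interrupt | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spent servicing hardware interrupts. | |||
cpu_softirq | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spent servicing software interrupts (softirqs). | |||
cpu_steal | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the CPU was ready to run but was waiting while a hypervisor serviced another virtual CPU. | |||
cpu_system | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spent running kernel (system) code. | |||
cpu_user | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spent running user space processes that have not been niced. | |||
cpu_wait | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spent idle while waiting for I/O to complete. | |||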
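
On a Linux host, the CPU states in the table above correspond to the leading columns of the `cpu` lines in `/proc/stat`. The following sketch converts two samples of such a line into percentages. The mapping of column order to the metric names used on this page is an assumption of this example, as are the function names and the sample values.

```python
from typing import Dict

# Order of the first eight value columns of a "cpu" line in /proc/stat
# (user, nice, system, idle, iowait, irq, softirq, steal), mapped to the
# metric names used in the table above.
FIELDS = (
    "cpu_user", "cpu_nice", "cpu_system", "cpu_idle",
    "cpu_wait", "cpu_interrupt", "cpu_softirq", "cpu_steal",
)


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse one 'cpu' or 'cpuN' line into a name -> ticks mapping.

    The trailing guest columns are skipped: guest time is already
    included in the user and nice columns.
    """
    parts = line.split()
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def cpu_percentages(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Share of each state between two samples, as a percentage."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Invented sample lines for illustration.
first = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
second = parse_cpu_line("cpu  150 0 70 1500 60 10 10 0 0 0")
usage = cpu_percentages(first, second)
```

The same functions work for a per-CPU line (`cpu0`, `cpu1`, ...) or for the aggregate `cpu` line, which matches the "individually or summed across all CPUs" scope.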
Networking
[TODO] Add a note on the vSwitch and add vSwitch specific metrics
At a minimum the following metrics should be collected for a Networking interface:
- Total Packets received and transmitted
- Total Octets (TX and RX)
- Dropped packets (TX and RX)
- Errored frames (TX and RX) [TODO: Break this down further – just tried to do that...]
  - Frame Check Sequence Errors or CRC Errors
  - Runts (frames <64 octets in length)
  - Giants (frames >6000 octets in length)
- Broadcast Packets (TX and RX)
- Multicast Packets (TX and RX)
Other Metrics that should be collected for a Networking interface (if possible):
- Average bitrate
- Average latency
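Average bitrate is usually not read directly from an interface but derived from two samples of the octet counter. A minimal sketch, assuming monotonically increasing counters that wrap at a fixed width; the function name and the default 64-bit counter width are assumptions of this example.

```python
def average_bitrate_bps(
    octets_before: int,
    octets_after: int,
    interval_s: float,
    counter_bits: int = 64,
) -> float:
    """Average bitrate in bits per second between two octet counter samples.

    The modulo handles a single counter wrap between the two samples.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_octets = (octets_after - octets_before) % (1 << counter_bits)
    return delta_octets * 8 / interval_s


rate = average_bitrate_bps(1_000, 126_000, 10.0)
wrapped = average_bitrate_bps(2**32 - 100, 400, 1.0, counter_bits=32)
```

The same delta calculation applies to the packet, drop, and error counters listed above; only the multiplication by 8 is specific to octets.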
Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|
Storage
Disk Utilization
Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|