...
Each monitoring process in a deployment should support the following events:
Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|---|
Heartbeat/ping | Host/Guest (where the monitoring process is running) | ping frequency and size of packet | liveliness check | N/A | Heartbeat/ping to check liveliness of monitoring process | external ping | false alarm for host due to network interruption | |
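The heartbeat event above lists network interruption as a source of false alarms. A minimal sketch of one way to soften that, assuming the checker tolerates a few missed pings before declaring the monitoring process dead, is shown below. The class name `HeartbeatMonitor` and the `missed_allowed` tolerance are hypothetical and are not part of the requirements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from a monitoring process."""

    interval_s: float                    # expected ping frequency, in seconds between pings
    missed_allowed: int = 3              # hypothetical tolerance, not defined by the requirements
    last_seen_s: Optional[float] = None  # timestamp of the most recent heartbeat

    def record(self, now_s: float) -> None:
        """Store the arrival time of a heartbeat."""
        self.last_seen_s = now_s

    def is_alive(self, now_s: float) -> bool:
        """Return True while the last heartbeat is within the tolerated window."""
        if self.last_seen_s is None:
            return False  # no heartbeat has ever been received
        return (now_s - self.last_seen_s) <= self.interval_s * self.missed_allowed
```

A larger `missed_allowed` value reduces false alarms at the cost of slower failure detection.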
Each monitoring process in a deployment should support the following Metrics:
Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|---|
 | Host/Guest (where the monitoring process is running) | measurement frequency | The monitoring application being used | | The number of metrics currently in the write queue. | | | |
 | Host/Guest (where the monitoring process is running) | measurement frequency | The monitoring application being used | | The number of metrics dropped due to a queue length limitation. | | | |
 | Host/Guest (where the monitoring process is running) | measurement frequency | The monitoring application being used | | The number of elements in the metric cache. | | | |
CPU utilization | Host/Guest (where the monitoring process is running) | measurement frequency, interrupt frequency, set of execution contexts, time of measurement | The CPUs that are being used by the monitoring application | Nanoseconds or percentage of total CPU utilization | The CPU utilization of the monitoring process | kernel interrupt to read current execution context | short-lived contexts may come and go between interrupts | see section 6 of TST008 |
Memory Utilization | Host/Guest (where the monitoring process is running) | Time of measurement, total memory available, swap space configured | The memory that is being used by the monitoring application | Kibibytes | The amount of physical RAM, in kibibytes, used by the monitoring application | memory management reports current values at time of measurement | | see section 8 of TST008 |
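As an illustration of the Memory Utilization metric, the sketch below extracts the resident set size from the text of a Linux `/proc/<pid>/status` file. It assumes a Linux host or guest; the `VmRSS` value is labelled `kB` but is reported in units of 1024 bytes, which matches the kibibyte unit in the table. The function name `rss_kib` is hypothetical.

```python
import re
from typing import Optional


def rss_kib(status_text: str) -> Optional[int]:
    """Return the VmRSS value (KiB) from /proc/<pid>/status text, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


# Illustrative sample only; real files contain many more fields.
SAMPLE_STATUS = "Name:\tmonitor\nVmPeak:\t  200000 kB\nVmRSS:\t   15360 kB\nThreads:\t12\n"
print(rss_kib(SAMPLE_STATUS))
```

On a live system the input would typically be read from the status file of the monitoring process itself. Kernel threads have no `VmRSS` line, which is why the function can return `None`.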
NFVI Other/Additional Information
...
- Machine check exceptions (System, Processor, Memory...) [TODO: Break this down further]
- DIMM corrected and uncorrected Errors
Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|---|
MCEs | Host | | Memory, CPU, IO | | Machine Check Exception | using mcelog | | |
PCIe Errors | Host |
Networking
At a minimum the following events should be monitored for a Networking interface:
- Link Status
- Dropped Receive Packets – An increasing count could indicate the failure or service interruption of an upstream process.
- vSwitch liveliness
Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|---|
Link Status | ||||||||
vSwitch Status (liveliness) | ||||||||
Packet Processing Core Status |
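The Dropped Receive Packets event is described as an increasing count. A minimal sketch of turning successive counter readings into events follows; the function name `drop_events` and the `threshold` parameter are hypothetical, and counter resets or wraps are not handled here.

```python
from typing import List, Sequence, Tuple


def drop_events(samples: Sequence[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (sample index, increase) for each reading that grew by more than threshold."""
    events = []
    for i in range(1, len(samples)):
        increase = samples[i] - samples[i - 1]
        if increase > threshold:
            events.append((i, increase))
    return events
```

For example, the readings `[0, 0, 3, 3, 10]` yield events at index 2 (increase of 3) and index 4 (increase of 7).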
Storage
Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|---|
NFVI Metrics
Compute
At a minimum the following metrics should be collected:
- CPU utilization [TODO: Break this down further]
- vCPU utilization [TODO: Break this down further]
- Memory utilization [TODO: Break this down further]
- vMemory utilization [TODO: Break this down further]
- Cache utilization
  - Hits
  - Misses
- Instructions per clock (IPC)
- Last level cache utilization
- Memory Bandwidth utilization
- Platform Metrics (thermals, fan-speed) [TODO: Break this down further]
Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|---|
cpu_idle | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spends idle. | | | see CPU Utilization above, and section 6 of TST008 |
cpu_nice | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spends running user space processes that have been niced. The priority level of a user space process can be tweaked by adjusting its niceness. | | | see CPU Utilization above, and section 6 of TST008 |
cpu_interrupt | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the CPU spends servicing (hardware) interrupts. | | | see CPU Utilization above, and section 6 of TST008 |
cpu_softirq | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time spent handling software-generated interrupts (softirqs), which are almost as important as the hardware interrupts above. "In current kernels there are ten softirq vectors defined; two for tasklet processing, two for networking, two for the block layer, two for timers, and one each for the scheduler and read-copy-update processing. The kernel maintains a per-CPU bitmask indicating which softirqs need processing at any given time." [Ref] | | | see CPU Utilization above, and section 6 of TST008 |
cpu_steal | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | CPU steal is a measure of the fraction of time that a machine is in a state of “involuntary wait.” It is time for which the kernel cannot otherwise account in one of the traditional classifications like user, system, or idle. It is time that went missing, from the perspective of the kernel. | | | see CPU Utilization above, and section 6 of TST008 |
cpu_system | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the CPU spends running the kernel. | | | see CPU Utilization above, and section 6 of TST008 |
cpu_user | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the CPU spends running un-niced user space processes. | | | see CPU Utilization above, and section 6 of TST008 |
cpu_wait | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the CPU spends idle while waiting for an I/O operation to complete. | | | see CPU Utilization above, and section 6 of TST008 |
total_vcpu_utilization | Host | | The host CPUs used by a guest, total usage summed across all CPUs | nanoseconds or percentage | The total utilization summed across all execution contexts (except Idle) and all CPUs in Scope. | | | see CPU Utilization above, and section 6 of TST008 |
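The percentage form of the CPU metrics above can be derived from two samples of cumulative per-state counters. The sketch below assumes a Linux host where those counters come from the `cpu` lines of `/proc/stat` (reported in clock ticks rather than nanoseconds); the mapping of `/proc/stat` columns onto the metric names in the table is an assumption of this example, and the helper names are hypothetical.

```python
from typing import Dict

# Assumed mapping of the first eight /proc/stat columns onto the table's metric names.
FIELDS = ("cpu_user", "cpu_nice", "cpu_system", "cpu_idle",
          "cpu_wait", "cpu_interrupt", "cpu_softirq", "cpu_steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse one 'cpu ...' line; the first token is the CPU label ('cpu', 'cpu0', ...)."""
    parts = line.split()
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def utilization_percent(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Share of elapsed ticks spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}  # no time elapsed between samples
    return {k: 100.0 * d / total for k, d in deltas.items()}


def total_utilization_percent(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Sum over all execution contexts except idle, as in the total utilization row."""
    pct = utilization_percent(before, after)
    return sum(v for k, v in pct.items() if k != "cpu_idle")
```

Passing the aggregate `cpu` line gives usage summed across all CPUs, while a `cpuN` line gives an individual CPU. Restricting the input to the CPUs used by a guest would be needed for a per-guest figure.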
Networking
At a minimum the following metrics should be collected for a Networking interface:
Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|---|
Total Packets received | | | | | | | | see section 7 of TST008 |
Total Packets transmitted | | | | | | | | see section 7 of TST008 |
Total Octets received | | | | | | | | see section 7 of TST008 |
Total Octets transmitted | | | | | | | | see section 7 of TST008 |
Total Error frames received | | | | | | | | see section 7 of TST008 |
Total Errors when attempting to transmit a frame | | | | | | | | see section 7 of TST008 |
Broadcast Packets | | | | | | | | |
Multicast Packets | | | | | | | | |
Average bitrate | | | | | | | | |
Average latency | | | | | | | | |
RX Packets dropped | | | | | | | | |
TX Packets dropped | | | | | | | | |
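Average bitrate can be derived from the octet counters listed above. The following sketch computes it from two readings taken a known interval apart and tolerates a single counter wrap; the function names and the choice of counter width are assumptions of this example rather than part of the requirements.

```python
def counter_delta(before: int, after: int, bits: int = 64) -> int:
    """Difference between two readings of a wrapping counter of the given width."""
    return (after - before) % (1 << bits)


def average_bitrate_bps(octets_before: int, octets_after: int,
                        interval_s: float, bits: int = 64) -> float:
    """Average bits per second between two octet-counter readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return 8 * counter_delta(octets_before, octets_after, bits) / interval_s
```

With 32-bit counters a busy interface can wrap more than once between samples, so the measurement interval should be short enough to make a single wrap the worst case.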
Networking MIBs
Where possible the metrics, events and information should be supported for the following Networking MIBs:
MIB Name | RFC | Description |
---|---|---|
IF-MIB | RFC2863 | Network interface sub-layers |
EtherLike-MIB | RFC3635 | Ethernet like network interfaces |
IP-MIB | RFC4293 | IP and ICMP without routing info |
IP-FORWARD-MIB | RFC4292 | CIDR multipath IP routes |
TCP-MIB | RFC4022 | TCP stack counters and info |
UDP-MIB | RFC4113 | UDP counters and info |
IPv6 MIBs | RFC2465 RFC2466 RFC2452 RFC2454 | IPv6 equivalents |
SCTP-MIB | RFC3873 | SCTP protocol |
UCD-IPFWACC-MIB | | IP firewall accounting rules |
Virtual Switch Reporting
...
Per-interface statistics and information (as listed in the tables above) from Open vSwitch, Open vSwitch on DPDK, and VPP should be collected and exposed.
sFlow and NetFlow/IPFIX flow telemetry should be supported, collected and exposed.
Storage
Disk Utilization
Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
---|---|---|---|---|---|---|---|---|
The host CPUs, individually or total usage summed across all CPUs