...

An Event is defined as an important state change in a monitored function. The monitoring system is notified that an event has occurred via a message with a standard format. The event notification describes the significant aspects of the event, such as the name and ID of the monitored function, the type of event, and the time the event occurred. For example, in an NFV deployment an event notification would be generated if the link status of a networking device on a compute node hosting VNFs suddenly changes from up to down.

 


Collector requirements:

Polling vs Event capture for the monitoring agent

...


In the context of the monitoring agent polling the subsystem it is querying: both polling and event-driven updates should be supported, with event-driven updates being the preferred model. The choice depends on the subsystem being monitored; the default should be to leverage event-based mechanisms where they exist, but polling should be supported as a configuration option selectable by the end user.

...

  • Fault events should always use a push model, and the mechanism over which events are sent needs to be reliable.
  • Telemetry can be polled or pushed (polling can be used to spread the load on the collection side).
  • Network (over)load should be taken into consideration when choosing a model (push vs. pull); monitoring traffic must not destabilize the network. Push is more scalable overall and is preferred for fault management.
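The two models above can be sketched as follows. This is an illustrative sketch only: the `Collector` class, the mode constants, and the `read_fn` callback are hypothetical names, not part of any real monitoring agent.

```python
from typing import Callable, Dict, List

# Hypothetical mode names for the user-selectable configuration option.
MODE_POLL = "poll"
MODE_EVENT = "event"

class Collector:
    """Collects samples either by polling a subsystem (pull) or by
    accepting its event notifications (push, the preferred model)."""

    def __init__(self, read_fn: Callable[[], Dict], mode: str = MODE_EVENT):
        self.read_fn = read_fn      # callback that queries the subsystem
        self.mode = mode            # user-selectable configuration option
        self.samples: List[Dict] = []

    def poll_once(self) -> None:
        # Pull model: the agent queries the subsystem on a timer.
        if self.mode == MODE_POLL:
            self.samples.append(self.read_fn())

    def on_event(self, event: Dict) -> None:
        # Push model: the subsystem notifies the agent of a state change.
        if self.mode == MODE_EVENT:
            self.samples.append(event)
```

A fault-management deployment would wire `on_event` to a reliable transport, while telemetry could use `poll_once` on a schedule to spread collection load.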

 


Collector configuration

Should be able to dynamically:

  • Enable, disable, or restart resource monitoring
  • Get values/notifications
  • Get capabilities
  • Get the list of metrics being collected
  • Flush the list of metrics
  • Set thresholds for resources
  • Blacklist resources
  • Support some sort of buffering mechanism, which should itself be configurable
  • Get the timing information for the agent and do a timing sync if required
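A minimal sketch of such a dynamic control surface is shown below. The `CollectorConfig` class and its method names are hypothetical, chosen only to mirror the bullet list above.

```python
from typing import Dict, Set

# Hypothetical control interface mirroring the dynamic-configuration
# requirements above; not a real monitoring-agent API.
class CollectorConfig:
    def __init__(self) -> None:
        self.enabled = True
        self.metrics: Set[str] = set()       # metrics being collected
        self.thresholds: Dict[str, float] = {}
        self.blacklist: Set[str] = set()

    def enable(self) -> None:
        self.enabled = True

    def disable(self) -> None:
        self.enabled = False

    def add_metric(self, name: str) -> None:
        # Blacklisted resources are never (re)added to collection.
        if name not in self.blacklist:
            self.metrics.add(name)

    def flush_metrics(self) -> None:
        self.metrics.clear()

    def set_threshold(self, resource: str, value: float) -> None:
        self.thresholds[resource] = value

    def blacklist_resource(self, resource: str) -> None:
        self.blacklist.add(resource)
        self.metrics.discard(resource)
```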

...


Collector Time stamping support

...

Currently there are two scenarios regarding time stamps for samples:


1. Where the subsystem we are reading from CAN provide us with the “incident” time (the time at which an event occurred) and the collector can provide us with the collection time (the time at which a sample was collected): in this case we have the “incident” time for the sample/event and the time when the collector retrieves the sample...

 


2. Where the subsystem we are reading from CANNOT provide us with the “incident” time, only the collection time: in this case we only have the time at which the collector retrieves the sample.

 


The recommendation for collectors, where possible, is to collect both the incident time and the collection time and send them with a sample.

 


For collectd there is only one time stamp field. The recommendation is to send the collection time in the collectd time stamp field for values and notifications, BUT where detection time is available, to send it in the metadata.
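The two scenarios and the recommendation above can be sketched with a hypothetical `Sample` container (the class and field names are illustrative, not the collectd API): the primary time stamp always carries the collection time, and the incident/detection time rides along in metadata when the subsystem can supply it.

```python
import time
from typing import Dict, Optional

class Sample:
    """Illustrative sample container: collection time in the primary
    time stamp field, detection time in metadata when available."""

    def __init__(self, value: float, incident_time: Optional[float] = None):
        self.value = value
        self.timestamp = time.time()          # collection time (always set)
        self.meta: Dict[str, float] = {}
        if incident_time is not None:
            # Scenario 1: subsystem supplied the incident time.
            self.meta["detection_time"] = incident_time
        # Scenario 2: no incident time available; meta stays empty and
        # only the collection time is reported.
```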

...

In addition to the measurement result, items marked "+" should either be available for collection, or reported with the measurement result. 


Information to be collected in conjunction with NFVI Metrics/Events

...

Name: Heartbeat/ping
Collection location: Host/Guest (where the monitoring process is running)
Parameters: ping frequency and size of packet
Scope of coverage: liveliness check
Unit(s) of measure: N/A
Definition: Heartbeat/ping to check liveliness of the monitoring process
Method of Measurement: external ping
Sources of Error: false alarm for the host due to network interruption
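A minimal staleness check for such a liveliness probe might look as follows (the function name and the `max_interval` parameter are illustrative; `max_interval` corresponds to the ping-frequency parameter above):

```python
# Hypothetical liveliness check: the monitoring process is considered
# alive if its last heartbeat arrived within the configured interval.
def is_alive(last_heartbeat: float, now: float, max_interval: float) -> bool:
    return (now - last_heartbeat) <= max_interval
```

In a real deployment the heartbeat travels over the network, so a missed heartbeat can be a false alarm caused by a network interruption rather than by the monitoring process dying, as noted in the sources-of-error column above.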

 



Each monitoring process in a deployment should support the following Metrics:

Name: write_queue/queue_length
Collection location: Host/Guest (where the monitoring process is running)
Parameters: measurement frequency
Scope of coverage: the monitoring application being used
Definition: the number of metrics currently in the write queue

Name: write_dropped
Collection location: Host/Guest (where the monitoring process is running)
Parameters: measurement frequency
Scope of coverage: the monitoring application being used
Definition: the number of metrics dropped due to a queue length limitation

Name: cache_size
Collection location: Host/Guest (where the monitoring process is running)
Parameters: measurement frequency
Scope of coverage: the monitoring application being used
Definition: the number of elements in the metric cache

Name: CPU utilization
Collection location: Host/Guest (where the monitoring process is running)
Parameters: measurement frequency, interrupt frequency, set of execution contexts, time of measurement
Scope of coverage: the CPUs that are being used by the monitoring application
Unit(s) of measure: nanoseconds or percentage of total CPU utilization
Definition: the CPU utilization of the monitoring process
Method of Measurement: kernel interrupt to read current execution context
Sources of Error: short-lived contexts may come and go between interrupts
Comments: see section 6 of TST008

Name: Memory Utilization
Collection location: Host/Guest (where the monitoring process is running)
Parameters: time of measurement, total memory available, swap space configured
Scope of coverage: the memory that is being used by the monitoring application
Unit(s) of measure: kibibytes
Definition: the amount of physical RAM, in kibibytes, used by the monitoring application
Method of Measurement: memory management reports current values at time of measurement
Comments: see section 8 of TST008
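As a rough sketch, a monitoring process could self-report these two metrics from the standard library alone (Linux semantics assumed; the function name is illustrative):

```python
import os
import resource

# Sketch of process self-measurement for the metrics above.
# Note: ru_maxrss is reported in kibibytes on Linux (but bytes on
# macOS), and os.times() returns CPU time in seconds, not nanoseconds,
# so a real collector would normalize units to match the table.
def self_metrics() -> dict:
    usage = resource.getrusage(resource.RUSAGE_SELF)
    cpu = os.times()
    return {
        "memory_kib": usage.ru_maxrss,   # peak resident set size
        "cpu_user_s": cpu.user,          # user-space CPU time, seconds
        "cpu_system_s": cpu.system,      # kernel CPU time, seconds
    }
```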

...


NFVI Other/Additional Information

...

 

Name: MCEs
Collection location: Host
Scope of coverage: Memory, CPU, IO
Definition: Machine Check Exception
Method of Measurement: using mcelog

Name: PCIe Errors
Collection location: Host
(remaining columns not yet populated)







Networking

At a minimum the following events should be monitored for a Networking interface:

...

The table columns (Collection location, Parameters, Scope of coverage, Unit(s) of measure, Definition, Method of Measurement, Sources of Error, Comments) are not yet populated for the following events:

  • Link Status
  • vSwitch Status (liveliness)
  • Packet Processing Core Status



Storage

 

Name

Collection location

Parameters

Scope of coverage

Unit(s) of measure

Definition

Method of Measurement

Sources of Error

Comments









NFVI Metrics

Compute

At a minimum the following metrics should be collected:

...

All of the following metrics are collected on the Host; their scope of coverage is the host CPUs, individually or as total usage summed across all CPUs; the unit of measure is nanoseconds or a percentage; and each references CPU Utilization above and section 6 of TST008.

  • cpu_idle: time the host CPU spends idle.
  • cpu_nice: time the host CPU spent running user-space processes that have been niced. The priority level of a user-space process can be tweaked by adjusting its niceness.
  • cpu_interrupt: time the CPU has spent servicing (hardware) interrupts.
  • cpu_softirq: time spent handling interrupts that are synthesized, and almost as important as hardware interrupts (above). "In current kernels there are ten softirq vectors defined; two for tasklet processing, two for networking, two for the block layer, two for timers, and one each for the scheduler and read-copy-update processing. The kernel maintains a per-CPU bitmask indicating which softirqs need processing at any given time." [Ref]
  • cpu_steal: a measure of the fraction of time that a machine is in a state of “involuntary wait”. It is time for which the kernel cannot otherwise account in one of the traditional classifications like user, system, or idle; time that went missing, from the perspective of the kernel.
  • cpu_system: time that the CPU spent running the kernel.
  • cpu_user: time the CPU spends running un-niced user-space processes.
  • cpu_wait: time the CPU spends idle while waiting for an I/O operation to complete.
  • total_vcpu_utilization: the total utilization summed across all execution contexts (except idle) and all CPUs in scope; its scope of coverage is the host CPUs used by a guest, with total usage summed across all CPUs.
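On Linux these per-CPU counters are exposed as cumulative tick counts in /proc/stat; a minimal parsing sketch (field order taken from the proc(5) layout, function name illustrative):

```python
# Parse one "cpu..." line from Linux /proc/stat into the per-category
# counters listed above. Values are cumulative ticks (USER_HZ), so a
# collector would sample twice and take differences to get utilization.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal")

def parse_proc_stat_line(line: str) -> dict:
    parts = line.split()
    # parts[0] is the CPU label ("cpu" for the total, "cpu0", "cpu1", ...)
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))
```

For example, `parse_proc_stat_line(open("/proc/stat").readline())` yields the summed counters across all CPUs.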

...

The remaining table columns (Collection location, Parameters, Scope of coverage, Unit(s) of measure, Definition, Method of Measurement, Sources of Error) are not yet populated for the following metrics:

  • Total Packets received (see section 7 of TST008)
  • Total Packets transmitted (see section 7 of TST008)
  • Total Octets received (see section 7 of TST008)
  • Total Octets transmitted (see section 7 of TST008)
  • Total Error frames received (see section 7 of TST008)
  • Total Errors when attempting to transmit a frame (see section 7 of TST008)
  • Broadcast Packets
  • Multicast Packets
  • Average bitrate
  • Average latency
  • RX Packets dropped
  • TX Packets dropped
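On Linux hosts, most of these interface counters are available under /sys/class/net/&lt;ifname&gt;/statistics; a hedged sketch of reading them (the `root` parameter is introduced here so the sketch can be pointed at a test directory):

```python
from pathlib import Path

# Standard counter file names under /sys/class/net/<ifname>/statistics.
COUNTERS = ("rx_packets", "tx_packets", "rx_bytes", "tx_bytes",
            "rx_errors", "tx_errors", "rx_dropped", "tx_dropped",
            "multicast")

def read_if_counters(ifname: str, root: str = "/sys/class/net") -> dict:
    # Each file holds a single cumulative integer counter.
    stats_dir = Path(root) / ifname / "statistics"
    return {name: int((stats_dir / name).read_text())
            for name in COUNTERS
            if (stats_dir / name).exists()}
```

Note that these are cumulative counters; derived metrics such as average bitrate require sampling twice and dividing the byte-count difference by the interval.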

Networking MIBs

...


Where possible, the metrics, events, and information should be supported for the following networking MIBs:


MIB Name (RFC): Description

  • IF-MIB (RFC2863): network interface sub-layers
  • EtherLike-MIB (RFC3635): Ethernet-like network interfaces
  • IP-MIB (RFC4293): IP and ICMP, without routing info
  • IP-FORWARD-MIB (RFC4292): CIDR multipath IP routes
  • TCP-MIB (RFC4022): TCP stack counters and info
  • UDP-MIB (RFC4113): UDP counters and info
  • IPv6 MIBs (RFC2465, RFC2466, RFC2452, RFC2454): IPv6 equivalents
  • SCTP-MIB (RFC3873): SCTP protocol
  • UCD-IPFWACC-MIB: IP firewall accounting rules
 


Virtual Switch Reporting

 


  • Per-interface stats and info (as listed in the tables above) from Open vSwitch / Open vSwitch with DPDK / VPP should be collected and exposed.

  • sFlow and NetFlow/IPFIX flow telemetry should be supported, collected, and exposed.

Storage

Note: the collectd plugins df and disk can help here.
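For example, a collectd configuration enabling both plugins might look like the following (the device selection and options shown are illustrative, not prescribed):

```
LoadPlugin df
LoadPlugin disk

<Plugin df>
  # Report usage as a percentage in addition to absolute values
  ValuesPercentage true
</Plugin>

<Plugin disk>
  # Illustrative: collect statistics only for the named device
  Disk "sda"
  IgnoreSelected false
</Plugin>
```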

Disk Utilization

Name

Collection location

Parameters

Scope of coverage

Unit(s) of measure

Definition

Method of Measurement

Sources of Error

Comments

...









