...

Each monitoring process in a deployment should support the following events:

| Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
|---|---|---|---|---|---|---|---|---|
| Heartbeat/ping | Host/Guest (where the monitoring process is running) | ping frequency and size of packet | liveliness check | N/A | Heartbeat/ping to check liveliness of monitoring process | external ping | false alarm for host due to network interruption | |
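The "false alarm for host due to network interruption" source of error can be reduced by requiring several consecutive missed heartbeats before an event is raised. The following is a minimal sketch of that idea, not a prescribed implementation: `HeartbeatTracker`, `missed_threshold` and the default of three missed intervals are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatTracker:
    """Tracks heartbeat replies from one monitoring process (hypothetical helper)."""
    interval_s: float          # ping frequency, from the "Parameters" column
    missed_threshold: int = 3  # assumed tolerance before an event is raised
    last_seen_s: float = 0.0   # timestamp of the most recent reply

    def record_reply(self, now_s: float) -> None:
        """Call whenever the external ping gets an answer."""
        self.last_seen_s = now_s

    def missed_intervals(self, now_s: float) -> int:
        """Number of whole ping intervals that elapsed without a reply."""
        return max(0, int((now_s - self.last_seen_s) // self.interval_s))

    def is_alive(self, now_s: float) -> bool:
        """Report liveliness while tolerating short network interruptions."""
        return self.missed_intervals(now_s) < self.missed_threshold
```

Timestamps are passed in explicitly so the logic can be exercised without a real clock or network.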

 

Each monitoring process in a deployment should support the following Metrics:

| Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
|---|---|---|---|---|---|---|---|---|
| write_queue/queue_length | Host/Guest (where the monitoring process is running) | measurement frequency | The monitoring application being used | | The number of metrics currently in the write queue. | | | |
| write_dropped | Host/Guest (where the monitoring process is running) | measurement frequency | The monitoring application being used | | The number of metrics dropped due to a queue length limitation. | | | |
| cache_size | Host/Guest (where the monitoring process is running) | measurement frequency | The monitoring application being used | | The number of elements in the metric cache. | | | |
| CPU utilization | Host/Guest (where the monitoring process is running) | measurement frequency, interrupt frequency, set of execution contexts, time of measurement | The CPUs that are being used by the monitoring application | Nanoseconds or percentage of total CPU utilization | The CPU utilization of the monitoring process | kernel interrupt to read current execution context | short-lived contexts may come and go between interrupts | see section 6 of TST008 |
| Memory Utilization | Host/Guest (where the monitoring process is running) | Time of measurement, total memory available, swap space configured | The memory that is being used by the monitoring application | Kibibytes | The amount of physical RAM, in kibibytes, used by the monitoring application | memory management reports current values at time of measurement | | see section 8 of TST008 |
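To illustrate how queue_length and write_dropped relate to each other, here is a small sketch of a bounded write queue. The class name `WriteQueue`, the `limit` parameter and the policy of dropping the newest metric when the queue is full are assumptions; a real monitoring application may behave differently.

```python
from collections import deque


class WriteQueue:
    """Bounded queue of pending metric writes (hypothetical, for illustration)."""

    def __init__(self, limit: int) -> None:
        self._items = deque()
        self._limit = limit
        self.write_dropped = 0  # metrics dropped due to the queue length limit

    @property
    def queue_length(self) -> int:
        """Number of metrics currently waiting in the write queue."""
        return len(self._items)

    def enqueue(self, metric) -> bool:
        """Queue a metric; drop it and count the drop when the queue is full."""
        if len(self._items) >= self._limit:
            self.write_dropped += 1
            return False
        self._items.append(metric)
        return True

    def flush(self, max_items: int) -> list:
        """Hand up to max_items queued metrics to the writer, oldest first."""
        count = min(max_items, len(self._items))
        return [self._items.popleft() for _ in range(count)]
```

A steadily growing queue_length followed by a rising write_dropped count suggests that the writer cannot keep up with the measurement frequency.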

 

NFVI Other/Additional Information

...

  • Machine check exceptions (System, Processor, Memory...) [TODO: Break this down further]
    • DIMM corrected and uncorrected Errors

| Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
|---|---|---|---|---|---|---|---|---|
| MCEs | Host | | Memory, CPU, IO | | Machine Check Exception | using mcelog | | |
| PCIe Errors | Host | | | | | | | |


Networking

At a minimum the following events should be monitored for a Networking interface:

  • Link Status
  • Dropped Receive Packets: an increasing count could indicate the failure or service interruption of an upstream process.
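One way to turn the dropped-receive-packets counter into events is to compare successive readings and report only the intervals in which the counter grew. The sketch below assumes a cumulative counter sampled at a fixed frequency; the function name `drop_events` and the `min_increase` threshold are hypothetical.

```python
def drop_events(samples, min_increase=1):
    """Yield (index, increase) for each sample where the drop counter grew.

    samples: successive readings of a cumulative dropped-receive-packet counter.
    min_increase: assumed threshold; smaller increases are ignored.
    A reading lower than the previous one (counter reset) raises no event.
    """
    previous = None
    for index, value in enumerate(samples):
        if previous is not None and value - previous >= min_increase:
            yield index, value - previous
        previous = value
```

For example, readings of 10, 10, 12, 12 and 20 contain two increases, at the third and fifth sample.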

vSwitch liveliness

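| Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
|---|---|---|---|---|---|---|---|---|
| Link Status | | | | | | | | |
| vSwitch Status (liveliness) | | | | | | | | |
| Packet Processing Core Status | | | | | | | | |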

 

Storage

| Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
|---|---|---|---|---|---|---|---|---|

NFVI Metrics

Compute

At a minimum the following metrics should be collected:

  • CPU utilization [TODO: Break this down further]
  • vCPU utilization [TODO: Break this down further]
  • Memory utilization [TODO: Break this down further]
  • vMemory utilization [TODO: Break this down further]
  • Cache utilization
    • Hits
    • Misses
    • Instructions per clock (IPC)
    • Last level cache utilization
    • Memory Bandwidth utilization
  • Platform Metrics (thermals, fan-speed) [TODO: Break this down further]

| Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
|---|---|---|---|---|---|---|---|---|
| cpu_idle | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spends idle. | | | see CPU Utilization above, and section 6 of TST008 |
| cpu_nice | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spent running user space processes that have been niced. The priority level of a user space process can be tweaked by adjusting its niceness. | | | see CPU Utilization above, and section 6 of TST008 |
| cpu_interrupt | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the CPU has spent servicing (hardware) interrupts. | | | see CPU Utilization above, and section 6 of TST008 |
| cpu_softirq | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time spent handling software interrupts (softirqs), which are synthesized and almost as important as hardware interrupts (above). "In current kernels there are ten softirq vectors defined; two for tasklet processing, two for networking, two for the block layer, two for timers, and one each for the scheduler and read-copy-update processing. The kernel maintains a per-CPU bitmask indicating which softirqs need processing at any given time." [Ref] | | | see CPU Utilization above, and section 6 of TST008 |
| cpu_steal | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | CPU steal is a measure of the fraction of time that a machine is in a state of “involuntary wait.” It is time for which the kernel cannot otherwise account in one of the traditional classifications like user, system, or idle. It is time that went missing, from the perspective of the kernel. | | | see CPU Utilization above, and section 6 of TST008 |
| cpu_system | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time that the CPU spent running the kernel. | | | see CPU Utilization above, and section 6 of TST008 |
| cpu_user | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the CPU spends running un-niced user space processes. | | | see CPU Utilization above, and section 6 of TST008 |
| cpu_wait | Host | | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | The time the CPU spends idle while waiting for an I/O operation to complete. | | | see CPU Utilization above, and section 6 of TST008 |
| total_vcpu_utilization | Host | | The host CPUs used by a guest, total usage summed across all CPUs | nanoseconds or percentage | The total utilization summed across all execution contexts (except Idle) and all CPUs in Scope. | | | see CPU Utilization above, and section 6 of TST008 |
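As a rough illustration of the "percentage" unit, the per-context values can be derived from two samples of cumulative counters. The sketch below assumes a Linux host and the field order of the aggregate "cpu" line in /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal); the mapping of those fields to the metric names in this table, and the helper names, are assumptions. Section 6 of TST008 remains the authoritative definition.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat, mapped to the
# metric names used in this table. Values are cumulative clock ticks.
FIELDS = ("cpu_user", "cpu_nice", "cpu_system", "cpu_idle",
          "cpu_wait", "cpu_interrupt", "cpu_softirq", "cpu_steal")


def parse_cpu_line(line: str) -> dict:
    """Parse one 'cpu ...' line into {metric_name: ticks}."""
    parts = line.split()
    if not parts or not parts[0].startswith("cpu"):
        raise ValueError("not a cpu line: %r" % line)
    values = [int(p) for p in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before: dict, after: dict) -> dict:
    """Percentage of elapsed ticks spent in each context between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * d / total for name, d in deltas.items()}


def total_utilization(percentages: dict) -> float:
    """Utilization summed across all execution contexts except idle."""
    return sum(v for name, v in percentages.items() if name != "cpu_idle")
```

Because the counters are cumulative, only the difference between two samples is meaningful, which is why the measurement frequency appears as a parameter for these metrics.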

Networking

At a minimum the following metrics should be collected for a Networking interface:

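| Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
|---|---|---|---|---|---|---|---|---|
| Total Packets received | | | | | | | | see section 7 of TST008 |
| Total Packets transmitted | | | | | | | | see section 7 of TST008 |
| Total Octets received | | | | | | | | see section 7 of TST008 |
| Total Octets transmitted | | | | | | | | see section 7 of TST008 |
| Total Error frames received | | | | | | | | see section 7 of TST008 |
| Total Errors when attempting to transmit a frame | | | | | | | | see section 7 of TST008 |
| Broadcast Packets | | | | | | | | |
| Multicast Packets | | | | | | | | |
| Average bitrate | | | | | | | | |
| Average latency | | | | | | | | |
| RX Packets dropped | | | | | | | | |
| TX Packets dropped | | | | | | | | |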
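The table leaves the definition of Average bitrate open. One plausible reading, sketched below as an assumption rather than a definition, derives it from two readings of a cumulative octet counter over a known interval. The function name and parameters are hypothetical; the `counter_bits` argument is there because octet counters can wrap (for example, the IF-MIB defines both 32-bit and 64-bit octet counters).

```python
def average_bitrate(octets_start: int, octets_end: int,
                    interval_s: float, counter_bits: int = 64) -> float:
    """Average bits per second between two octet-counter readings.

    Assumes the counter wrapped at most once within the interval.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modular subtraction recovers the true delta across a single wrap.
    delta = (octets_end - octets_start) % (1 << counter_bits)
    return delta * 8 / interval_s
```

With 32-bit counters on fast links, the sampling interval needs to be short enough that a second wrap cannot occur between readings.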

 

Networking MIBs

 

Where possible the metrics, events and information should be supported for the following Networking MIBs:

 

| MIB Name | RFC | Description |
|---|---|---|
| IF-MIB | RFC2863 | Network interface sub-layers |
| EtherLike-MIB | RFC3635 | Ethernet-like network interfaces |
| IP-MIB | RFC4293 | IP and ICMP without routing info |
| IP-FORWARD-MIB | RFC4292 | CIDR multipath IP routes |
| TCP-MIB | RFC4022 | TCP stack counters and info |
| UDP-MIB | RFC4113 | UDP counters and info |
| IPv6 MIBs | RFC2465 RFC2466 RFC2452 RFC2454 | IPv6 equivalents |
| SCTP-MIB | RFC3873 | SCTP protocol |
| UCD-IPFWACC-MIB | | IP firewall accounting rules |

 

Virtual Switch Reporting

...

  • Per-interface stats and info (as listed in the tables above) from Open vSwitch, Open vSwitch on DPDK, and VPP should be collected and exposed.

  • sFlow and NetFlow/IPFIX flow telemetry should be supported, collected and exposed.

Storage

Disk Utilization

| Name | Collection location | Parameters | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments |
|---|---|---|---|---|---|---|---|---|

 
