Anuket Project

This wiki is a work in progress (WIP). Please feel free to modify this page with relevant information.

The intention of this wiki is to list the metrics and events that should be monitored or collected within the NFVI. In addition to the metrics/events collected about the NFVI, some information about the monitoring process (the process which collects the information and metrics) itself is also required.

This list should be developed in conjunction with the Doctor (Faults) and VES projects in OPNFV.

This wiki heavily references the ETSI NFV draft titled “Network Functions Virtualisation (NFV); Testing; NFVI Compute and Network Metrics Specification”, which can be found at https://docbox.etsi.org/ISG/NFV/Open/Drafts/TST008

Metrics/Events Format

It's important to define a common format that can be used for the list of identified metrics and events that should be monitored/collected in the NFVI.

  • Name
  • Where the Metric/Event is collected (Host/Guest/Both)
  • Scope of coverage
  • Unit(s) of measure or associated severities
  • Definition
  • Method of Measurement
  • Sources of Error
  • Comments
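As an illustration only, the common format above could be captured as a small record type. The class and field names below are hypothetical (they simply mirror the list) and are not taken from any existing project schema.

```python
from dataclasses import dataclass
from enum import Enum


class CollectionLocation(Enum):
    """Where the metric/event is collected."""
    HOST = "Host"
    GUEST = "Guest"
    BOTH = "Both"


@dataclass(frozen=True)
class MetricEventDefinition:
    """One entry in the common format; fields follow the list above."""
    name: str
    collection_location: CollectionLocation
    scope_of_coverage: str
    units: str  # unit(s) of measure, or associated severities for an event
    definition: str
    method_of_measurement: str = ""
    sources_of_error: str = ""
    comments: str = ""


# Example entry, using values from the compute metrics table on this page.
cpu_idle = MetricEventDefinition(
    name="cpu_idle",
    collection_location=CollectionLocation.HOST,
    scope_of_coverage="The host CPUs, individually or summed across all CPUs",
    units="nanoseconds or percentage",
    definition="Time the host CPU spends idle",
)
```

Leaving the last three fields optional reflects the current state of the tables on this page, where those columns are often still empty.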

Distinction between metrics and events

For the purposes of Platform Service Assurance, it's important to distinguish between metrics and events as well as how they are measured (from a timing perspective).

A Metric is a (standard) definition of a quantity describing the performance and/or reliability of a monitored function, which has an intended utility and is carefully specified to convey the exact meaning of the measured value. A measured value of a metric is produced in an assessment of a monitored function according to a method of measurement. For example, the number of dropped packets for a networking interface is a metric.

 

An Event is defined as an important state change in a monitored function. The monitoring system is notified that an event has occurred using a message with a standard format. The Event notification describes the significant aspects of the event, such as the name and ID of the monitored function, the type of event, and the time the event occurred. For example, an event notification would be sent if the link status of a networking device on a compute node hosting VNFs in an NFV deployment suddenly changed from up to down.
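A minimal sketch of the link status example, assuming a simple dictionary-based notification. The function name and field names are hypothetical and do not represent the VES or Doctor message formats; the point is only that a notification is produced on a state change, not on every sample.

```python
import time
from typing import Optional


def link_status_event(host_id: str, interface: str,
                      previous: str, current: str,
                      now: Optional[float] = None) -> Optional[dict]:
    """Return an event notification if the link state changed, else None."""
    if previous == current:
        return None  # no state change, so no event
    return {
        "event_type": "link_status_change",   # the type of event
        "host_id": host_id,                   # ID of the host being monitored
        "monitored_function": interface,      # name of the monitored function
        "previous_state": previous,
        "current_state": current,
        "timestamp": time.time() if now is None else now,  # time of the event
    }


# A link going from up to down yields a notification; a steady link does not.
event = link_status_event("host-1", "eth0", "up", "down", now=1.0)
no_event = link_status_event("host-1", "eth0", "up", "up", now=2.0)
```

In contrast, a metric such as the dropped packet count would be sampled at every collection interval regardless of whether it changed.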

 

Information to be collected in conjunction with NFVI Metrics/Events

It's essential to collect some information about the environment that is being monitored, as well as about the monitoring processes themselves, in order to associate the metrics/events with the relevant host.

Host information:

Each host in a deployment should have a unique identifier that distinguishes it from all other hosts. A UUID can be used in this case.

Monitoring Process information:

Each monitoring process in a deployment should have a unique process identifier.
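One possible way to derive these identifiers is sketched below. It assumes that hostnames are unique within the deployment, which may not hold everywhere; the helper names and the identifier layout are hypothetical.

```python
import uuid


def host_uuid(hostname: str) -> uuid.UUID:
    """Derive a stable UUID from the hostname (assumed unique per deployment)."""
    # uuid5 is deterministic: the same hostname always maps to the same UUID.
    return uuid.uuid5(uuid.NAMESPACE_DNS, hostname)


def monitor_process_id(hostname: str, process_name: str, pid: int) -> str:
    """Combine host UUID, process name and PID into one identifier string."""
    return f"{host_uuid(hostname)}/{process_name}/{pid}"


# Hypothetical host and monitoring process names, for illustration only.
example_id = monitor_process_id("compute-0.example.org", "collector", 4242)
```

A deterministic UUID survives restarts of the monitoring process, whereas a random one (uuid.uuid4) would need to be persisted somewhere on the host.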

Each monitoring process in a deployment should support the following events:

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
Heartbeat/ping | Host/Guest (where the monitoring process is running) | liveliness check | N/A | Heartbeat/ping to check liveliness of monitoring process | external ping | |
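The heartbeat event could be consumed roughly as follows. This is a sketch under assumptions: the tracker class, its method names and the timeout policy are hypothetical, and timestamps are passed in explicitly so the logic does not depend on a particular clock.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat seen from each monitoring process."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self._last_seen = {}  # process identifier -> time of last heartbeat

    def beat(self, process_id: str, now: float) -> None:
        """Record a heartbeat received from a monitoring process."""
        self._last_seen[process_id] = now

    def is_alive(self, process_id: str, now: float) -> bool:
        """A process is alive if it sent a heartbeat within the timeout."""
        last = self._last_seen.get(process_id)
        return last is not None and (now - last) <= self.timeout_s


tracker = HeartbeatTracker(timeout_s=10.0)
tracker.beat("monitor-a", now=100.0)
alive_soon_after = tracker.is_alive("monitor-a", now=105.0)
alive_much_later = tracker.is_alive("monitor-a", now=120.0)
```

A process that has never sent a heartbeat is reported as not alive, which seems the safer default for a liveliness check.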

 

Each monitoring process in a deployment should support the following Metrics:

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
write_queue/queue_length | Host/Guest (where the monitoring process is running) | The monitoring application being used | | The number of metrics currently in the write queue. | | |
write_dropped | Host/Guest (where the monitoring process is running) | The monitoring application being used | | The number of metrics dropped due to a queue length limitation. | | |
cache_size | Host/Guest (where the monitoring process is running) | The monitoring application being used | | The number of elements in the metric cache | | |
CPU utilization | Host/Guest (where the monitoring process is running) | The CPUs that are being used by the monitoring application | Nanoseconds or percentage of total CPU utilization | The CPU utilization of the monitoring process | | |
Memory Utilization | Host/Guest (where the monitoring process is running) | The Memory that is being used by the monitoring application | | The amount of physical RAM, in kibibytes, used by the monitoring application | | |
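To make the relationship between write_queue/queue_length and write_dropped concrete, here is a minimal sketch of a bounded write queue. The class is hypothetical, and dropping the newest metric when the queue is full is an assumption; a real monitoring application may use a different drop policy.

```python
from collections import deque


class WriteQueue:
    """Bounded queue of metrics waiting to be written out."""

    def __init__(self, limit: int):
        self.limit = limit
        self._queue = deque()
        self.write_dropped = 0  # metrics dropped due to the length limit

    @property
    def queue_length(self) -> int:
        """Number of metrics currently in the write queue."""
        return len(self._queue)

    def put(self, metric) -> bool:
        """Enqueue a metric, or drop it and count the drop if the queue is full."""
        if len(self._queue) >= self.limit:
            self.write_dropped += 1
            return False
        self._queue.append(metric)
        return True

    def get(self):
        """Remove and return the oldest queued metric."""
        return self._queue.popleft()


queue = WriteQueue(limit=2)
results = [queue.put(("cpu_idle", 97.5)),
           queue.put(("cpu_user", 1.5)),
           queue.put(("cpu_system", 1.0))]  # third put exceeds the limit
```

A steadily growing write_dropped value would suggest that the write plugins cannot keep up with the collection rate.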

 

Timing Information

NFVI Other/Additional Information

BIOS information

NFVI Events

What about entire node and switch failures? In terms of service-affecting priority, host and switch failures rank highest, as they can affect the most VMs, containers and VNFs.

While the status of switches and hosts might be the domain of services that have a system-wide view, a host-resident component might be part of the monitoring functionality.

Compute

At a minimum the following events should be monitored:

  • Machine check exceptions (System, Processor, Memory...) [TODO: Break this down further]
    • DIMM corrected and uncorrected Errors

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
MCEs | Host | Memory, CPU, IO | | Machine Check Exception | using mcelog | |
PCIe Errors | Host | | | | | |


Networking

At a minimum the following events should be monitored for a Networking interface:

  • Link Status
  • Dropped Receive Packets: an increasing count could indicate the failure or service interruption of an upstream process.

vSwitch liveliness

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
Link Status | | | | | | |
vSwitch Status (liveliness) | | | | | | |
Packet Processing Core Status | | | | | | |

 

Storage

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments

NFVI Metrics

Compute

At a minimum the following metrics should be collected:

  • CPU utilization [TODO: Break this down further]
  • vCPU utilization [TODO: Break this down further]
  • Memory utilization [TODO: Break this down further]
  • vMemory utilization [TODO: Break this down further]
  • Cache utilization
    • Hits
    • Misses
    • Instructions per clock (IPC)
    • Last level cache utilization
    • Memory Bandwidth utilization
  • Platform Metrics (thermals, fan-speed) [TODO: Break this down further]

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
cpu_idle | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spends idle | | |
cpu_nice | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | Time the host CPU spent running user space processes that have been niced. The priority level of a user space process can be tweaked by adjusting its niceness. | | |
cpu_interrupt | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | | | |
cpu_softirq | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | | | |
cpu_steal | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | | | |
cpu_system | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | | | |
cpu_user | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | | | |
cpu_wait | Host | The host CPUs, individually or total usage summed across all CPUs | nanoseconds or percentage | | | |
total_vcpu_utilization | Host | The host CPUs used by a guest, total usage summed across all CPUs | nanoseconds or percentage | | | |
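On a Linux host, one way the per-state CPU metrics above might be obtained is from the "cpu" lines of /proc/stat, which report cumulative ticks in the order user, nice, system, idle, iowait, irq, softirq, steal. The sketch below assumes that source and maps those fields onto the metric names used on this page; the function names are hypothetical, and percentages are computed from the difference between two samples.

```python
# /proc/stat field order mapped onto the metric names used in the table above.
FIELDS = ("cpu_user", "cpu_nice", "cpu_system", "cpu_idle",
          "cpu_wait", "cpu_interrupt", "cpu_softirq", "cpu_steal")


def parse_cpu_line(line: str) -> dict:
    """Parse one 'cpu' or 'cpuN' line from /proc/stat into tick counters."""
    parts = line.split()
    if not parts or not parts[0].startswith("cpu"):
        raise ValueError("not a cpu line: %r" % line)
    # Any trailing guest/guest_nice fields are ignored in this sketch.
    ticks = [int(value) for value in parts[1:1 + len(FIELDS)]]
    if len(ticks) != len(FIELDS):
        raise ValueError("too few fields in: %r" % line)
    return dict(zip(FIELDS, ticks))


def cpu_percentages(before: dict, after: dict) -> dict:
    """Share of time spent in each state between two samples, in percent."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Two made-up samples of the aggregate "cpu" line, taken some time apart.
sample_1 = parse_cpu_line("cpu 100 0 50 800 50 0 0 0 0 0")
sample_2 = parse_cpu_line("cpu 150 0 75 1500 75 0 0 0 0 0")
usage = cpu_percentages(sample_1, sample_2)
```

Passing a "cpuN" line instead of the aggregate "cpu" line gives the per-CPU view, which matches the "individually or total usage summed across all CPUs" scope in the table.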

Networking

[TODO] Add a note on the vSwitch and add vSwitch specific metrics

At a minimum the following metrics should be collected for a Networking interface:

  • Total Packets received and transmitted
  • Total Octets (TX and RX)
  • Dropped packets (TX and RX)
  • Error frames (TX and RX) [TODO: Break this down further – just tried to do that...]
    • Frame Check Sequence Errors or CRC Errors
    • Runts (frames <64 octets in length)
    • Giants (frames >6000 octets in length)
  • Broadcast Packets (TX and RX)
  • Multicast Packets (TX and RX)
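As a small illustration of the runt and giant breakdown, the following sketch classifies frames by length using the thresholds given in the list above (64 and 6000 octets). The function name and category labels are hypothetical, and in practice these counters are normally maintained by the NIC or driver rather than computed per frame in software.

```python
from collections import Counter

RUNT_LIMIT_OCTETS = 64     # frames shorter than this are runts (from the list above)
GIANT_LIMIT_OCTETS = 6000  # frames longer than this are giants (from the list above)


def classify_frame_length(length_octets: int) -> str:
    """Classify a frame as 'runt', 'giant' or 'ok' based on its length."""
    if length_octets < RUNT_LIMIT_OCTETS:
        return "runt"
    if length_octets > GIANT_LIMIT_OCTETS:
        return "giant"
    return "ok"


# Tally a handful of made-up frame lengths into error counters.
frame_lengths = [60, 64, 1500, 6000, 6001, 9000]
counters = Counter(classify_frame_length(n) for n in frame_lengths)
```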

Other Metrics that should be collected for a Networking interface (if possible):

  • Average bitrate
  • Average latency
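Average bitrate can be derived from the total octet counters listed earlier by sampling them twice. The sketch below assumes a free-running counter of known width that wraps at most once between samples; the function name and the 64-bit default are assumptions.

```python
def average_bitrate_bps(octets_start: int, octets_end: int,
                        interval_s: float, counter_bits: int = 64) -> float:
    """Average bitrate in bits per second between two octet counter samples."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    # Modulo arithmetic handles a single counter wrap between the samples.
    delta_octets = (octets_end - octets_start) % (1 << counter_bits)
    return delta_octets * 8 / interval_s


# 125000 octets received in one second is 1 Mbit/s.
rate = average_bitrate_bps(1000, 126000, interval_s=1.0)

# A 32-bit counter that wrapped between two samples taken 2 s apart.
wrapped_rate = average_bitrate_bps(2**32 - 100, 100, interval_s=2.0,
                                   counter_bits=32)
```

The sampling interval determines what "average" means here: a longer interval smooths out bursts that a shorter one would reveal.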

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
Total Packets received | | | | | | |
Total Packets transmitted | | | | | | |
Total Octets received | | | | | | |
Total Octets transmitted | | | | | | |
Total Error frames received | | | | | | |
Total Error frames transmitted | | | | | | |
Broadcast Packets | | | | | | |
Multicast Packets | | | | | | |
Average bitrate | | | | | | |
Average latency | | | | | | |

Storage

Disk Utilization

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments

 
