...

The intention of this wiki is to list the metrics and events that should be monitored or collected within the NFVI. In addition to the metrics/events collected about the NFVI, some information about the monitoring process itself (the process that collects the information and metrics) is also required.

 

This list should be developed in conjunction with the Doctor (Faults) and VES Projects in OPNFV.

Metrics/Events Format

It's important to define a common format that can be used for the list of identified metrics and events that should be monitored/collected in the NFVI.

  • Name
  • Where the Metric/Event is collected (Host/Guest/Both)
  • Scope of coverage
  • Unit(s) of measure (if applicable in the case of an event)
  • Definition
  • Method of Measurement
  • Sources of Error
  • Comments
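The common format above could be captured as a small record type. The following is a minimal sketch, not an agreed schema: the class and field names (`MetricDefinition`, `CollectionLocation`, and so on) are assumptions chosen to mirror the list, and the example entry reuses the heartbeat event described later on this page.

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class CollectionLocation(Enum):
    """Where the metric/event is collected (hypothetical enum for this sketch)."""
    HOST = "Host"
    GUEST = "Guest"
    BOTH = "Both"


@dataclass
class MetricDefinition:
    """One entry in the metrics/events list, mirroring the fields above."""
    name: str
    collection_location: CollectionLocation
    scope_of_coverage: str
    definition: str
    method_of_measurement: str
    # None when a unit does not apply, which can be the case for an event.
    units_of_measure: Optional[str] = None
    sources_of_error: str = ""
    comments: str = ""


# Example entry: the heartbeat event of the monitoring process.
heartbeat = MetricDefinition(
    name="Heartbeat/ping",
    collection_location=CollectionLocation.BOTH,
    scope_of_coverage="Liveliness check",
    definition="Heartbeat/ping to check liveliness of monitoring process",
    method_of_measurement="External ping",
)

record = asdict(heartbeat)  # plain dict, convenient for serialization
```

Keeping the unit optional reflects the note in the list that a unit of measure may not apply to an event.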

Distinction between metrics and events

Information to be collected in conjunction with NFVI Metrics/Events

It's essential to collect some information about the environment being monitored, as well as about the monitoring process(es) themselves, in order to associate the metrics/events with the relevant host.

Host information:

Each host in a deployment should have a unique identifier that distinguishes it from all other hosts. A UUID can be used in this case.
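As an illustration of the UUID approach, the sketch below generates a random (version 4) UUID on first use and stores it in a file so the identifier stays stable across restarts. The function name and the idea of persisting the value to a file are assumptions made for this example; the page itself only states that a UUID can be used.

```python
import uuid
from pathlib import Path


def get_or_create_host_id(path: Path) -> str:
    """Return the host UUID stored at `path`, creating it on first use."""
    if path.exists():
        stored = path.read_text().strip()
        # uuid.UUID() raises ValueError if the stored value is not a valid UUID.
        return str(uuid.UUID(stored))
    host_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(host_id + "\n")
    return host_id
```

Calling the function twice with the same path returns the same identifier, which is what allows metrics/events from different collection runs to be associated with one host.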

Monitoring Process information:

...

Each monitoring process in a deployment should have a unique process identifier.

...

Each monitoring process in a deployment should support the following events:

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
--- | --- | --- | --- | --- | --- | --- | ---
Heartbeat/ping | Host/Guest (where the monitoring process is running) | Liveliness check | N/A | Heartbeat/ping to check liveliness of monitoring process | External ping | |
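The heartbeat event could be evaluated by an external checker along the following lines. This is a minimal sketch under stated assumptions: the class name `HeartbeatMonitor`, the timeout parameter, and the injectable clock are all hypothetical and only illustrate the liveliness check, not a defined interface.

```python
import time
from typing import Callable, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from one monitoring process."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen: Optional[float] = None

    def record_heartbeat(self) -> None:
        """Call whenever a heartbeat/ping reply is received."""
        self._last_seen = self._clock()

    def is_alive(self) -> bool:
        """True if a heartbeat was seen within the last `timeout_s` seconds."""
        if self._last_seen is None:
            return False
        return (self._clock() - self._last_seen) <= self.timeout_s
```

A monotonic clock is used by default so that wall-clock adjustments on the host do not produce false liveliness failures.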

 

Each monitoring process in a deployment should support the following Metrics:

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
--- | --- | --- | --- | --- | --- | --- | ---

 

Timing Information

NFVI Other/Additional Information

BIOS information

NFVI Events

What about entire node and switch failures? In terms of service-affecting priority, host and switch failures are at the top as they can affect the most VMs / Containers / VNFs...

...

  • Machine check exceptions (System, Processor, Memory...) [TODO: Break this down further]
    • DIMM corrected and uncorrected errors
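One possible way to obtain the DIMM corrected and uncorrected error counts on a Linux host is the kernel's EDAC sysfs interface, which exposes `ce_count` and `ue_count` files per memory controller. The sketch below assumes a Linux host with an EDAC driver loaded; the function name and the returned dictionary layout are illustrative only.

```python
from pathlib import Path
from typing import Dict

# Standard EDAC location on Linux; only present when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_dimm_error_counts(root: Path = EDAC_ROOT) -> Dict[str, Dict[str, int]]:
    """Return corrected/uncorrected error counts per memory controller."""
    counts: Dict[str, Dict[str, int]] = {}
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        ce_file = mc_dir / "ce_count"
        ue_file = mc_dir / "ue_count"
        if ce_file.is_file() and ue_file.is_file():
            counts[mc_dir.name] = {
                "corrected": int(ce_file.read_text().strip()),
                "uncorrected": int(ue_file.read_text().strip()),
            }
    return counts
```

An increase in the uncorrected count between two reads would be the natural trigger for an event, while corrected errors are more likely to be reported as a trend.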

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
--- | --- | --- | --- | --- | --- | --- | ---


Networking

At a minimum, the following events should be monitored for a networking interface:

  • Link Status
  • Dropped Receive Packets: an increasing count could indicate the failure or service interruption of an upstream process.
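The two interface events above could be derived by comparing successive samples. The sketch below assumes a Linux host, where `/sys/class/net/<iface>/operstate` and `/sys/class/net/<iface>/statistics/rx_dropped` provide the link state and the dropped receive counter; the type and function names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List

SYS_NET = Path("/sys/class/net")  # Linux sysfs network interface directory


@dataclass(frozen=True)
class InterfaceSample:
    operstate: str   # e.g. "up" or "down"
    rx_dropped: int  # cumulative dropped receive packets


def read_interface(iface: str, root: Path = SYS_NET) -> InterfaceSample:
    """Take one sample of link state and dropped receive packets."""
    base = root / iface
    return InterfaceSample(
        operstate=(base / "operstate").read_text().strip(),
        rx_dropped=int((base / "statistics" / "rx_dropped").read_text().strip()),
    )


def detect_events(prev: InterfaceSample, curr: InterfaceSample) -> List[str]:
    """Compare two samples and describe any events between them."""
    events: List[str] = []
    if prev.operstate != curr.operstate:
        events.append(f"link status changed: {prev.operstate} -> {curr.operstate}")
    if curr.rx_dropped > prev.rx_dropped:
        delta = curr.rx_dropped - prev.rx_dropped
        events.append(f"dropped receive packets increased by {delta}")
    return events
```

Keeping the comparison separate from the sysfs read makes the event logic usable with samples from any other source, such as a guest agent.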

vSwitch liveliness

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
--- | --- | --- | --- | --- | --- | --- | ---

 

Storage

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
--- | --- | --- | --- | --- | --- | --- | ---

NFVI Metrics

Compute

At a minimum, the following metrics should be collected:

  • CPU utilization [TODO: Break this down further]
  • vCPU utilization [TODO: Break this down further]
  • Memory utilization [TODO: Break this down further]
  • vMemory utilization [TODO: Break this down further]
  • Cache utilization
    • Hits
    • Misses
    • Instructions per clock (IPC)
    • Last level cache utilization
    • Memory Bandwidth utilization
  • Platform Metrics (thermals, fan-speed) [TODO: Break this down further]
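As one example of how a metric from this list might be measured, host CPU utilization on Linux can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. This is a sketch only: it assumes a kernel that reports at least the first eight counters (user, nice, system, idle, iowait, irq, softirq, steal), and it treats idle plus iowait as idle time, which is one common convention rather than a definition agreed on this page.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def cpu_utilization(prev: List[int], curr: List[int]) -> float:
    """Percentage of non-idle time between two samples."""
    # Use only the first eight counters: guest and guest_nice are already
    # accounted for in user and nice, so adding them would double count.
    prev8, curr8 = prev[:8], curr[:8]
    total_delta = sum(curr8) - sum(prev8)
    idle_delta = (curr8[3] + curr8[4]) - (prev8[3] + prev8[4])
    if total_delta <= 0:
        return 0.0  # no time elapsed between samples
    return 100.0 * (total_delta - idle_delta) / total_delta
```

In use, the two samples would be read from `/proc/stat` a fixed interval apart; the length of that interval is itself worth recording as part of the Method of Measurement.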

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
--- | --- | --- | --- | --- | --- | --- | ---

Networking

[TODO: Add a note on the vSwitch and add vSwitch-specific metrics]

...

  • Average bitrate
  • Average latency
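The two averages above can be illustrated with a short calculation. The sketch assumes the bitrate is derived from a cumulative byte counter sampled at the start and end of an interval, and that latency is averaged over individual samples in milliseconds; both function names and these measurement choices are assumptions for the example.

```python
from statistics import mean
from typing import Sequence


def average_bitrate_bps(bytes_start: int, bytes_end: int, interval_s: float) -> float:
    """Average bitrate in bits per second over one sampling interval."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    if bytes_end < bytes_start:
        # A cumulative counter should never decrease; treat as wrap or reset.
        raise ValueError("byte counter went backwards (wrap or reset)")
    return (bytes_end - bytes_start) * 8 / interval_s


def average_latency_ms(samples_ms: Sequence[float]) -> float:
    """Arithmetic mean of latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    return mean(samples_ms)
```

Counter wrap and reset are raised as errors here rather than silently corrected, since they are a likely entry for the Sources of Error column.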

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
--- | --- | --- | --- | --- | --- | --- | ---

Storage

Disk Utilization
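If disk utilization is read as the share of filesystem capacity in use (an assumption; it could equally mean I/O busy time, which this sketch does not cover), it could be measured with the Python standard library as follows. The function name is hypothetical.

```python
import shutil


def disk_utilization_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0  # some pseudo filesystems report zero capacity
    return 100.0 * usage.used / usage.total
```

Note that this divides by the total size, so the result can differ slightly from tools that exclude blocks reserved for the superuser.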

NFVI Other/Additional Information

Compute

BIOS information

Networking

Name | Collection location | Scope of coverage | Unit(s) of measure | Definition | Method of Measurement | Sources of Error | Comments
--- | --- | --- | --- | --- | --- | --- | ---

...