Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Failure TypeFailure parameter

  Failure Event

 Infrastructure Metrics

Comments
Links

Link Down.

Link removed

Virtual Switch link failure

Reason: Hardware Failure

Interface Down

dhcp-agent.logneutron-dhcp-agent
l3-agent.logneutron-l3-agent
linuxbridge-agent.logneutron-linuxbridge-agent
openvswitch-agent.logneutron-openvswitch-agent

(Ref: https://docs.openstack.org/ocata/config-reference/networking/logs.html)

Network interface status, 

High packet drop,

low throughput,

excessive latency or jitter


crc-statistics, fabric-link-failure, link-flap, transceiver-power-low


VM

Deployment/Start Failures:

  1. Failed to start*
  2. Failed to boot*

Post-Deployment/Start failures:

  1. Shutdown
  2. Crash
  3. Hang*
  4. Panic

nova-compute.log

nova-api.log

nova-scheduler.log

libvirt.log

qemu/$vm.log

neutron-server.log

glance/cinder - 

flavor

Node and Core-mapping


cpu: per-core utilization

memory

Interfaces statistics - sent, recv, drops

Disk Read/Write


If possible, Infrastructure metrics and syslogs from within the VM should be collected.

Deployment/Start failures can be the first step.


Container

Deployment/Start Failures:

  1. Failed to start*
  2. Failed to boot*

Post-Deployment/Start failures:

  1. Shutdown
  2. Crash
  3. Hang
  4. Panic
  • OS layer – syslog, boot.log, kern.log etc.
  • Kubernetes Layer – container Logs (/var/log/containers)
  • OpenStack Layer – OpenStack service Logs


cpu: per-core utilization

memory

Interfaces statistics - sent, recv, drops

Disk Read/Write


Node

A node failure (hardware failure, OS crash, etc)

A) node network connectivity failure

B) nova service failure

C) Failure of other OpenStack services

/var/log/nova/nova-compute.log
(To ensure that it has successfully connected to the AMQP server
Ref: https://docs.openstack.org/operations-guide/ops-maintenance-compute.html)


Cloud controller

nova-*

/var/log/nova

Cloud controller

glance-*

/var/log/glance

Cloud controller

cinder-*

/var/log/cinder

Cloud controller

keystone-*

/var/log/keystone

Cloud controller

neutron-*

/var/log/neutron

Cloud controller

horizon

/var/log/apache2/

All nodes

misc (swift, dnsmasq)

/var/log/syslog

Compute nodes

libvirt

/var/log/libvirt/libvirtd.log

Compute nodes

Console (boot up messages) for VM instances:

/var/lib/nova/instances/instance-<instance id>/console.log

Block Storage nodes

cinder-volume

/var/log/cinder/cinder-volume.log

(Ref: https://docs.openstack.org/operations-guide/ops-logging.html)

A) node network connectivity failure

  1. management network
  2. VMs communication network
  3. storage network

B) nova service failure (e.g., process crashed) -- detected and restarted by a local watchdog process

  1. compute
  2. volume
  3. network
  4. scheduler
  5. api.

C) Failure of other OpenStack services -- N/A, assuming redundant/highly available configuration

  1. Glance
  2. Keystone


Interfaces statistics - sent, recv, drops


Hypervisor Metrics, Nova Server Metrics, Tenant Metrics, Message Queue Metrics




Keystone  and Glance Metrics





ApplicationCrash/Connectivity/Non-Functional

Application Log i.e. If it is Apache then logs of Apache

(/var/log/apache2)

Packet Drops, Latency, Throughput, Saturation, Resource UsageDeploy Collectd within the application and collect both application logs and infrastructure metrics
Middleware Services



...