Data



Failure Type

Failure parameter

Failure Event

Infrastructure Metrics

Comments
Links

Link Down.

Link removed

Virtual Switch link failure

Reason: Hardware Failure

Interface Down

dhcp-agent.log: neutron-dhcp-agent
l3-agent.log: neutron-l3-agent
linuxbridge-agent.log: neutron-linuxbridge-agent
openvswitch-agent.log: neutron-openvswitch-agent

(Ref: https://docs.openstack.org/ocata/config-reference/networking/logs.html)

Network interface status, high packet drop, low throughput, excessive latency or jitter


crc-statistics, fabric-link-failure, link-flap, transceiver-power-low
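The link metrics above (packet drops, throughput) can be derived from interface counters. The following is a minimal sketch, assuming a Linux host where the counters are read from /proc/net/dev; the names `IfaceCounters`, `parse_proc_net_dev`, and `link_metrics` are hypothetical, and the drop ratio is computed as drops / (packets + drops), which assumes the driver does not count dropped packets as received.

```python
from dataclasses import dataclass


@dataclass
class IfaceCounters:
    rx_bytes: int
    rx_packets: int
    rx_drop: int
    tx_bytes: int
    tx_packets: int
    tx_drop: int


def parse_proc_net_dev(text):
    """Parse the contents of /proc/net/dev into {interface: IfaceCounters}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        f = [int(v) for v in values.split()]
        # Receive columns occupy f[0:8], transmit columns f[8:16].
        counters[name.strip()] = IfaceCounters(
            rx_bytes=f[0], rx_packets=f[1], rx_drop=f[3],
            tx_bytes=f[8], tx_packets=f[9], tx_drop=f[11],
        )
    return counters


def link_metrics(prev, curr, interval_s):
    """Derive throughput (Mbit/s) and receive drop ratio between two snapshots."""
    rx_pkts = curr.rx_packets - prev.rx_packets
    rx_drops = curr.rx_drop - prev.rx_drop
    seen = rx_pkts + rx_drops  # assumption: drops are not included in rx_packets
    return {
        "rx_mbps": (curr.rx_bytes - prev.rx_bytes) * 8 / interval_s / 1e6,
        "tx_mbps": (curr.tx_bytes - prev.tx_bytes) * 8 / interval_s / 1e6,
        "rx_drop_ratio": rx_drops / seen if seen else 0.0,
    }
```

Two snapshots taken a fixed interval apart are enough; latency and jitter are not visible in these counters and would need active probing.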


VM

Deployment/Start Failures:

  1. Failed to start*
  2. Failed to boot*

Post-Deployment/Start failures:

  1. Shutdown
  2. Crash
  3. Hang*
  4. Panic

nova-compute.log

nova-api.log

nova-scheduler.log

libvirt.log

qemu/$vm.log

neutron-server.log

glance/cinder - 

flavor

Node and Core-mapping


cpu: per-core utilization

memory

Interfaces statistics - sent, recv, drops

Disk Read/Write


If possible, infrastructure metrics and syslogs should also be collected from within the VM.

Deployment/Start failures can be addressed as the first step.
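Per-core CPU utilization, listed among the VM metrics above, could be computed as follows. This is only a sketch, assuming a Linux system (host or guest) that exposes /proc/stat; the function names are hypothetical, and a collector such as collectd would normally do this work.

```python
def parse_proc_stat(text):
    """Return {core_name: (busy_ticks, total_ticks)} from /proc/stat contents."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-core rows are named cpu0, cpu1, ...; the bare "cpu" row is the aggregate.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        ticks = [int(v) for v in parts[1:]]
        # Columns: user nice system idle iowait irq softirq steal guest guest_nice
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        total = sum(ticks[:8])  # guest time is already included in user/nice
        cores[parts[0]] = (total - idle, total)
    return cores


def per_core_utilization(prev_text, curr_text):
    """Percent busy per core between two /proc/stat snapshots."""
    prev, curr = parse_proc_stat(prev_text), parse_proc_stat(curr_text)
    util = {}
    for core, (busy, total) in curr.items():
        if core not in prev:
            continue
        d_busy = busy - prev[core][0]
        d_total = total - prev[core][1]
        util[core] = 100.0 * d_busy / d_total if d_total > 0 else 0.0
    return util
```

Pairing these values with the node and core mapping of the VM would show whether a pinned core is saturated.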


Container

Deployment/Start Failures:

  1. Failed to start*
  2. Failed to boot*

Post-Deployment/Start failures:

  1. Shutdown
  2. Crash
  3. Hang
  4. Panic

  • OS layer – syslog, boot.log, kern.log, etc.
  • Kubernetes layer – container logs (/var/log/containers)
  • OpenStack layer – OpenStack service logs


cpu: per-core utilization

memory

Interfaces statistics - sent, recv, drops

Disk Read/Write
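Disk read/write rates could be obtained in a similar way. The sketch below assumes a Linux node exposing /proc/diskstats, which reports sectors in 512-byte units; the function names and the returned keys are hypothetical.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors regardless of device


def parse_diskstats(text):
    """Return {device: {"read_bytes": int, "write_bytes": int}} from /proc/diskstats."""
    stats = {}
    for line in text.splitlines():
        p = line.split()
        if len(p) < 10:
            continue
        # p[2] = device name, p[5] = sectors read, p[9] = sectors written
        stats[p[2]] = {
            "read_bytes": int(p[5]) * SECTOR_BYTES,
            "write_bytes": int(p[9]) * SECTOR_BYTES,
        }
    return stats


def disk_rates(prev_text, curr_text, interval_s):
    """Bytes per second read and written per device between two snapshots."""
    prev, curr = parse_diskstats(prev_text), parse_diskstats(curr_text)
    rates = {}
    for dev, now in curr.items():
        if dev not in prev:
            continue
        rates[dev] = {
            "read_bytes_per_s": (now["read_bytes"] - prev[dev]["read_bytes"]) / interval_s,
            "write_bytes_per_s": (now["write_bytes"] - prev[dev]["write_bytes"]) / interval_s,
        }
    return rates
```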


Node

A node failure (hardware failure, OS crash, etc.)

A) node network connectivity failure

B) nova service failure

C) Failure of other OpenStack services

/var/log/nova/nova-compute.log
(to verify that nova-compute has successfully connected to the AMQP server;
Ref: https://docs.openstack.org/operations-guide/ops-maintenance-compute.html)


Node type            Service                                        Log location
Cloud controller     nova-*                                         /var/log/nova
Cloud controller     glance-*                                       /var/log/glance
Cloud controller     cinder-*                                       /var/log/cinder
Cloud controller     keystone-*                                     /var/log/keystone
Cloud controller     neutron-*                                      /var/log/neutron
Cloud controller     horizon                                        /var/log/apache2/
All nodes            misc (swift, dnsmasq)                          /var/log/syslog
Compute nodes        libvirt                                        /var/log/libvirt/libvirtd.log
Compute nodes        Console (boot up messages) for VM instances    /var/lib/nova/instances/instance-<instance id>/console.log
Block Storage nodes  cinder-volume                                  /var/log/cinder/cinder-volume.log

(Ref: https://docs.openstack.org/operations-guide/ops-logging.html)
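For a collector, the log locations above could be encoded as a small lookup table. This is an illustrative sketch only: the entries are copied from the list above, while `LOG_LOCATIONS` and `logs_for` are hypothetical names, and treating "All nodes" entries as applying to every node type is an assumption.

```python
# (node type, service, log location), as listed above.
LOG_LOCATIONS = [
    ("Cloud controller", "nova-*", "/var/log/nova"),
    ("Cloud controller", "glance-*", "/var/log/glance"),
    ("Cloud controller", "cinder-*", "/var/log/cinder"),
    ("Cloud controller", "keystone-*", "/var/log/keystone"),
    ("Cloud controller", "neutron-*", "/var/log/neutron"),
    ("Cloud controller", "horizon", "/var/log/apache2/"),
    ("All nodes", "misc (swift, dnsmasq)", "/var/log/syslog"),
    ("Compute nodes", "libvirt", "/var/log/libvirt/libvirtd.log"),
    ("Compute nodes", "Console (boot up messages) for VM instances",
     "/var/lib/nova/instances/instance-<instance id>/console.log"),
    ("Block Storage nodes", "cinder-volume", "/var/log/cinder/cinder-volume.log"),
]


def logs_for(node_type):
    """Return (service, log location) pairs to collect on a given node type."""
    return [
        (service, path)
        for ntype, service, path in LOG_LOCATIONS
        # Assumption: "All nodes" entries apply to every node type.
        if ntype in (node_type, "All nodes")
    ]
```

For example, `logs_for("Compute nodes")` returns the syslog, libvirt, and instance console entries.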

A) node network connectivity failure

  1. management network
  2. VMs communication network
  3. storage network

B) nova service failure (e.g., process crashed) -- detected and restarted by a local watchdog process

  1. compute
  2. volume
  3. network
  4. scheduler
  5. api

C) Failure of other OpenStack services -- N/A, assuming redundant/highly available configuration

  1. Glance
  2. Keystone


Interfaces statistics - sent, recv, drops


Hypervisor Metrics, Nova Server Metrics, Tenant Metrics, Message Queue Metrics




Keystone and Glance Metrics





Application

Crash/Connectivity/Non-Functional

Application logs (e.g., the Apache logs if the application is Apache)

Packet drops, latency, throughput, saturation, resource usage

Deploy collectd within the application and collect both application logs and infrastructure metrics.

Middleware Services



Models

We consider three types of models. Within these, we have addressed the failure prediction problem; the remaining types are given as:
 

...