Collectd Metrics and Events

Statistics

Statistics in collectd consist of a value list. A value list includes:

Value list	Description	Example	Comment
Values		99.8999	percentage
Value length	The number of values in the data set.
Time	Timestamp at which the value was collected.	1475837857	epoch
Interval	Interval at which to expect a new value.	10	interval
Host	Used to identify the host.	localhost	Can be a UUID for a VM or host… or the host can be given a name.
Plugin	Used to identify the plugin.	cpu
Plugin instance (optional)	Used to group a set of values together, for example the values belonging to a DPDK interface.	0
Type	Unit used to measure a value; in other words, it refers to a data set.	percent
Type instance (optional)	Used to distinguish between values that have an identical type.	user
Meta data	An opaque data structure that enables the passing of additional information about a value list. “Meta data in the global cache can be used to store arbitrary information about an identifier”
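
The fields above can be pictured as a small record. The following is a minimal sketch in Python, not collectd's own data structure: the class name ValueList and the identifier() helper are illustrative assumptions. The helper joins the identifying fields in the host/plugin-instance/type-instance layout that collectd conventionally uses to name a value list, using the example values from the table.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ValueList:
    """Illustrative stand-in for a collectd value list (hypothetical class)."""
    values: list
    time: float
    interval: float
    host: str
    plugin: str
    type: str
    plugin_instance: Optional[str] = None  # optional grouping key
    type_instance: Optional[str] = None    # optional discriminator
    meta: dict = field(default_factory=dict)

    @property
    def value_length(self) -> int:
        # "Value length" is simply the number of values carried.
        return len(self.values)

    def identifier(self) -> str:
        # Optional instances are appended with a dash when present.
        plugin = self.plugin
        if self.plugin_instance:
            plugin += "-" + self.plugin_instance
        type_ = self.type
        if self.type_instance:
            type_ += "-" + self.type_instance
        return f"{self.host}/{plugin}/{type_}"


sample = ValueList(
    values=[99.8999], time=1475837857, interval=10,
    host="localhost", plugin="cpu", type="percent",
    plugin_instance="0", type_instance="user",
)
print(sample.identifier())  # localhost/cpu-0/percent-user
```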

Notifications

Notifications in collectd are generic messages containing:

An associated severity, which can be one of OKAY, WARNING, or FAILURE.
A time.
A message.
A host.
A plugin.
A plugin instance (optional).
A type.
A type instance (optional).
Meta-data.

Example notification:

Severity:FAILURE

Time:1472552207.385

Host:pod3-node1

Plugin:dpdkevents

PluginInstance:dpdk0

Type:gauge

TypeInstance:link_status

DataSource:value

CurrentValue:1.000000e+00

WarningMin:nan

WarningMax:nan

FailureMin:2.000000e+00

FailureMax:nan

Host pod3-node1, plugin dpdkevents (instance dpdk0) type gauge (instance link_status): Data source "value" is currently 1.000000. That is below the failure threshold of 2.000000.
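
The example notification can be read as the result of comparing the current value against the four bounds, with nan meaning "not configured". The following is a minimal sketch of that comparison in Python. It is an assumption about the logic, written for illustration, and is not collectd's threshold implementation; the function name evaluate and the order in which the bounds are checked are hypothetical.

```python
import math


def evaluate(value, warning_min=math.nan, warning_max=math.nan,
             failure_min=math.nan, failure_max=math.nan):
    """Return (severity, detail). NaN bounds are treated as unset."""
    # Failure bounds are checked first so the most severe state wins.
    if not math.isnan(failure_min) and value < failure_min:
        return "FAILURE", f"below the failure threshold of {failure_min:f}"
    if not math.isnan(failure_max) and value > failure_max:
        return "FAILURE", f"above the failure threshold of {failure_max:f}"
    if not math.isnan(warning_min) and value < warning_min:
        return "WARNING", f"below the warning threshold of {warning_min:f}"
    if not math.isnan(warning_max) and value > warning_max:
        return "WARNING", f"above the warning threshold of {warning_max:f}"
    return "OKAY", "within the configured range"


# Values taken from the example: CurrentValue 1.0, FailureMin 2.0, rest nan.
severity, detail = evaluate(1.0, failure_min=2.0)
print(severity, "-", detail)  # FAILURE - below the failure threshold of 2.000000
```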

Supported Metrics and Events

Dynamic Metrics

Reference starting point: https://github.com/collectd/collectd/blob/master/src/types.db

The tables below map the "base" plugins that would run on the host or the guest.

Events

Thresholding can be applied to any collectd field in the Metrics tab to produce notifications.
Severities can be one of OKAY, WARNING, or FAILURE.
Metrics under implementation or in the process of upstreaming are in pink

Where collectd is running	Plugin	Type	Type Instance	Severity	Description	comment
host/guest	ovs_events	gauge	link_status	WARNING on link status down	Link status of the OvS interface: UP or DOWN. Severity will be configurable by the end user.
				OKAY on link status up
host/guest	dpdk_events		link_status	WARNING on link status down, OKAY on link status up	Link status of the OvS interface: UP or DOWN. Severity will be configurable by the end user.
			keep_alive	OKAY: if core status is ALIVE, UNUSED, DOZING, SLEEP; WARNING: if core status is MISSING; FAILURE: if core status is DEAD or GONE	Reflects the state of DPDK packet processing cores	Protects against packet processing core failures for DPDK --> no silent packet drops
host	pcie	correctable	non_fatal	Notification (WARNING) in case of PCIe correctable error occurrence. Message contains short error description.	Correctable Errors include: Receiver Error Status, Bad TLP Status, Bad DLLP Status, REPLAY_NUM Rollover, Replay Timer Timeout, Advisory Non-Fatal, Corrected Internal, Header Log Overflow
		uncorrectable	fatal	Notification (FAILURE) in case of PCIe uncorrectable fatal error occurrence. Message contains short error description.	Uncorrectable Errors include: Data Link Protocol, Surprise Down, Poisoned TLP, Flow Control Protocol, Completion Timeout, Completer Abort, Unexpected Completion, Receiver Overflow, Malformed TLP, ECRC Error Status, Unsupported Request, ACS Violation, Internal, MC blocked TLP, Atomic egress blocked, TLP prefix blocked
			non_fatal	Notification (WARNING) in case of PCIe uncorrectable non-fatal error occurrence. Message contains short error description.
host	mcelog (RAS memory)	errors		FAILURE on failure to connect to the mcelog socket or if the connection is lost	Reports Corrected and Uncorrected DIMM Failures
				OKAY on connection to mcelog socket
				WARNING for Corrected Memory Errors
				FAILURE for Uncorrected Memory Errors
host	IPMI			OKAY - upper non-critical	Each IPMI sensor may have six different thresholds: upper non-recoverable, upper critical, upper non-critical, lower non-critical, lower critical, lower non-recoverable.	You may have events on a threshold sensor by specifying values (called thresholds) where you want the sensor to report an event. Then you can enable the events for the specific thresholds. Not all sensors support all thresholds; some cannot have their events enabled and others cannot have them disabled. The capabilities of a sensor may all be queried by the user to determine what it can do. When the value of the sensor goes outside the threshold an event may be generated. An event may also be generated when the value goes back into the threshold.
				OKAY - lower non-critical
				WARNING - lower critical
				WARNING - upper critical
				FAILURE - upper non-recoverable
				FAILURE - lower non-recoverable
				Discrete sensor status changes are also reported via OKAY, WARNING and FAILURE notifications. Examples of discrete sensors can be found under the "IPMI Sensors for S2600WT2R" tab.
host	mcelog RAS System, CPU, QPI, OI (specific to a platform, so these will change depending on what's supported by the platform)			WARNING - Correctable errors; FAILURE - Uncorrectable errors	Servers based on Intel® Architecture are generally designed for use in mission-critical environments. Reliability, Availability and Serviceability (RAS) features are integrated into the servers to address the error handling and memory mirroring and sparing required by these environments. The goal of this feature is to expose the RAS features provided by the Broadwell or newer platform to higher-level fault management applications. The features to be exposed fall under the following categories: Reliability features: system attributes to ensure data integrity; capability to prevent, detect, correct and contain faults over a given time interval. Availability features: system attributes to help stay operational in the presence of faults in the system; capability to map out failed units and to operate in a degraded mode. Serviceability features: system attributes to help system service and repair; capability to identify failed units and facilitate repair. Generic Error Handling: The silicon supports corrected, uncorrected (recoverable, unrecoverable), fatal and catastrophic error types. Corrected Errors: Errors that are corrected by either hardware or software. Corrected error information is used in predictive failure analysis by the OS. MCA bank corrected errors, except selected memory corrected errors, are handled directly by the OS. The Haswell-EP processor triggers CMCI for the corrected errors; on CMCI the OS can read the MCA banks and collect error status. All the other platform-related corrected errors can either be ignored or logged into the BMC SEL based on platform policy. Memory Corrected Errors: Memory corrected errors such as mirror failover and memory read errors can be configured to trigger SMI using BIOS setup options.
On memory mirror failover, the BIOS logs the error for the OS as per the UEFI error record format. On memory read errors, the BIOS performs the following memory RAS operations, in order, to correct the error: Rank Sparing, SDDC/Device tagging. Uncorrected Non-Fatal Errors: Errors that are not corrected by hardware. In general these errors trigger a machine check exception, which in turn triggers SMI. The BIOS SMI handler logs the error information, clears the error status and passes the error log to the OS. The OS can recover from the error; in cases where recovery is not an option, it can trigger a system reset. Uncorrected Fatal Errors: Errors that are neither corrected by hardware nor recovered by software. The system is not in a reliable state and needs a reset to bring it back to normal operation. In most fatal error conditions, the BIOS cannot log errors before the system reset happens. All the error status registers are sticky across the reset; the BIOS collects this information on the next boot, creates an error record and passes it on to the OS.
Error Logging: Example errors are provided in the comments column.	/* See IA32 SDM Vol3B Chapter 16*/ Integrated Memory Controller Machine Check Errors "Address parity error", "HA write data parity error", "HA write byte enable parity error", "Corrected patrol scrub error", "Uncorrected patrol scrub error", "Corrected spare error", "Uncorrected spare error", "Any HA read error", "WDB read parity error", "DDR4 command address parity error", "Uncorrected address parity error", "Unrecognized request type", "Read response to an invalid scoreboard entry", "Unexpected read response", "DDR4 completion to an invalid scoreboard entry", "Completion to an invalid scoreboard entry", "Completion FIFO overflow", "Correctable parity error", "Uncorrectable error", "Interrupt received while outstanding interrupt was not ACKed", "ERID FIFO overflow", "Error on Write credits", "Error on Read credits", "Scheduler error", "Error event", "MscodDataRdErr", "Reserved", "MscodPtlWrErr", "MscodFullWrErr", "MscodBgfErr", "MscodTimeout", "MscodParErr", "MscodBucket1Err"
						Interconnect(QPI) Machine Check Errors "UC Phy Initialization Failure", "UC Phy detected drift buffer alarm", "UC Phy detected latency buffer rollover", "UC LL Rx detected CRC error: unsuccessful LLR: entered abort state", "UC LL Rx unsupported or undefined packet", "UC LL or Phy control error", "UC LL Rx parameter exchange exception", "UC LL detected control error from the link-mesh interface", "COR Phy initialization abort", "COR Phy reset", "COR Phy lane failure, recovery in x8 width", "COR Phy L0c error corrected without Phy reset", "COR Phy L0c error triggering Phy Reset", "COR Phy L0p exit error corrected with Phy reset", "COR LL Rx detected CRC error - successful LLR without Phy Reinit", "COR LL Rx detected CRC error - successful LLR with Phy Reinit" "Phy Control Error", "Unexpected Retry.Ack flit", "Unexpected Retry.Req flit", "RF parity error", "Routeback Table error", "unexpected Tx Protocol flit (EOP, Header or Data)", "Rx Header-or-Credit BGF credit overflow/underflow", "Link Layer Reset still in progress when Phy enters L0", "Link Layer reset initiated while protocol traffic not idle", "Link Layer Tx Parity Error"
						Internal Machine Check Errors "No Error", "MCA_DMI_TRAINING_TIMEOUT", "MCA_DMI_CPU_RESET_ACK_TIMEOUT", "MCA_MORE_THAN_ONE_LT_AGENT", "MCA_BIOS_RST_CPL_INVALID_SEQ", "MCA_BIOS_INVALID_PKG_STATE_CONFIG", "MCA_MESSAGE_CHANNEL_TIMEOUT", "MCA_MSGCH_PMREQ_CMP_TIMEOUT", "MCA_PKGC_DIRECT_WAKE_RING_TIMEOUT", "MCA_PKGC_INVALID_RSP_PCH", "MCA_PKGC_WATCHDOG_HANG_CBZ_DOWN", "MCA_PKGC_WATCHDOG_HANG_CBZ_UP", "MCA_PKGC_WATCHDOG_HANG_C3_UP_SF", "MCA_SVID_VCCIN_VR_ICC_MAX_FAILURE", "MCA_SVID_COMMAND_TIMEOUT", "MCA_SVID_VCCIN_VR_VOUT_FAILURE", "MCA_SVID_CPU_VR_CAPABILITY_ERROR", "MCA_SVID_CRITICAL_VR_FAILED", "MCA_SVID_SA_ITD_ERROR", "MCA_SVID_READ_REG_FAILED", "MCA_SVID_WRITE_REG_FAILED", "MCA_SVID_PKGC_INIT_FAILED", "MCA_SVID_PKGC_CONFIG_FAILED", "MCA_SVID_PKGC_REQUEST_FAILED", "MCA_SVID_IMON_REQUEST_FAILED", "MCA_SVID_ALERT_REQUEST_FAILED", "MCA_SVID_MCP_VR_ABSENT_OR_RAMP_ERROR", "MCA_SVID_UNEXPECTED_MCP_VR_DETECTED", "MCA_FIVR_CATAS_OVERVOL_FAULT", "MCA_FIVR_CATAS_OVERCUR_FAULT", "MCA_WATCHDOG_TIMEOUT_PKGC_SLAVE", "MCA_WATCHDOG_TIMEOUT_PKGC_MASTER", "MCA_WATCHDOG_TIMEOUT_PKGS_MASTER", "MCA_PKGS_CPD_UNCPD_TIMEOUT", "MCA_PKGS_INVALID_REQ_PCH", "MCA_PKGS_INVALID_REQ_INTERNAL", "MCA_PKGS_INVALID_RSP_INTERNAL", "MCA_PKGS_SMBUS_VPP_PAUSE_TIMEOUT", "MCA_RECOVERABLE_DIE_THERMAL_TOO_HOT"
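
The severity rules in the Events table for the keep_alive and IPMI threshold rows amount to simple lookups. The following is a minimal sketch in Python that restates those two rows as dictionaries. The names KEEP_ALIVE_SEVERITY, IPMI_THRESHOLD_SEVERITY and keep_alive_severity are hypothetical and are not taken from any collectd plugin.

```python
# DPDK keep_alive core status -> notification severity, as listed in the table.
KEEP_ALIVE_SEVERITY = {
    "ALIVE": "OKAY",
    "UNUSED": "OKAY",
    "DOZING": "OKAY",
    "SLEEP": "OKAY",
    "MISSING": "WARNING",
    "DEAD": "FAILURE",
    "GONE": "FAILURE",
}

# IPMI threshold crossed -> notification severity, as listed in the table.
IPMI_THRESHOLD_SEVERITY = {
    "upper non-critical": "OKAY",
    "lower non-critical": "OKAY",
    "upper critical": "WARNING",
    "lower critical": "WARNING",
    "upper non-recoverable": "FAILURE",
    "lower non-recoverable": "FAILURE",
}


def keep_alive_severity(core_status: str) -> str:
    """Look up the severity for a core status; unknown states raise KeyError."""
    return KEEP_ALIVE_SEVERITY[core_status.upper()]


print(keep_alive_severity("missing"))              # WARNING
print(IPMI_THRESHOLD_SEVERITY["upper critical"])   # WARNING
```

Raising KeyError on an unknown status is a deliberate choice in this sketch: a state that is not in the table should be noticed rather than silently mapped to OKAY.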

Host	IPMI (specific per BMC, so these will change depending on what's supported by the BMC). This is an example for the S2600WT2R platform.	percent	MTT CPU2	IPMI defines many types of sensors, but groups them into two main categories: threshold and discrete. Threshold sensors are “analog”; they have continuous (or mostly continuous) readings, such as fan speed, voltage, or temperature. Discrete sensors have a set of binary readings that may each be independently zero or one. In some sensors, these may be independent. For instance, a power supply may have both an external power failure and a predictive failure at the same time. In other cases they may be mutually exclusive. For instance, each bit may represent the initialization state of a piece of software.	The IPMI plugin supports analog sensors of type voltage, temperature, fan and current, plus analog sensors that have VALUE type WATTS, CFM and percentage (%). http://openipmi.sourceforge.net/IPMI.pdf
			MTT CPU1
			P2 Therm Ctrl %
			P1 Therm Ctrl %
			PS1 Curr Out %
		voltage	BB +3.3V Vbat
		voltage	BB +12.0V
		temperature	Agg Therm Mgn 1
			DIMM Thrm Mrgn 4
			DIMM Thrm Mrgn 3
			DIMM Thrm Mrgn 2
			DIMM Thrm Mrgn 1
			P2 DTS Therm Mgn
			P1 DTS Therm Mgn
			P2 Therm Ctrl %
			P1 Therm Ctrl %
			P2 Therm Margin
			P1 Therm Margin
			PS1 Temperature
			LAN NIC Temp
			Exit Air Temp
			HSBP 1 Temp
			I/O Mod Temp
			BB Lft Rear Temp
			BB Rt Rear Temp
			BB BMC Temp
			SSB Temp
			Front Panel Temp
			BB P2 VR Temp
			BB P1 VR Temp
		fan	System Fan 6B
			System Fan 6A
			System Fan 5B
			System Fan 5A
			System Fan 4B
			System Fan 4A
			System Fan 3B
			System Fan 3A
			System Fan 2B
			System Fan 2A
			System Fan 1B
			System Fan 1A
		CFM	System Airflow
		watts	PS1 Input Power
Host	intel_pmu	counter	cpu-cycles	[Hardware event]	The types of events are: Hardware Events: These instrument low-level processor activity based on CPU performance counters. For example, CPU cycles, instructions retired, memory stall cycles, level 2 cache misses, etc. Some will be listed as Hardware Cache Events. Software Events: These are low level events based on kernel counters. For example, CPU migrations, minor faults, major faults, etc. http://www.brendangregg.com/perf.html#Events
			instructions
			cache-references
			cache-misses
			branch-instructions OR branches
			branch-misses
			bus-cycles
			cpu-clock	[Software event]
			task-clock
			page-faults OR faults
			minor-faults
			major-faults
			context-switches OR cs
			cpu-migrations OR migrations
			alignment-faults
			emulation-faults
			L1-dcache-loads	[Hardware cache event]
			L1-dcache-load-misses
			L1-dcache-stores
			L1-dcache-store-misses
			L1-dcache-prefetches
			L1-dcache-prefetch-misses
			L1-icache-loads
			L1-icache-load-misses
			L1-icache-prefetches
			L1-icache-prefetch-misses
			LLC-loads
			LLC-load-misses
			LLC-stores
			LLC-store-misses
			LLC-prefetch-misses
			dTLB-loads
			dTLB-load-misses
			dTLB-stores
			dTLB-store-misses
			dTLB-prefetches
			dTLB-prefetch-misses
			iTLB-loads
			iTLB-load-misses
			branch-loads
			branch-load-misses
