Collectd Metrics and Events

Statistics

Statistics in collectd consist of a value list. A value list includes:

Value list	Description	Example	Comment
Values		99.8999	percentage
Value length	the number of values in the data set.
Time	timestamp at which the value was collected.	1475837857	epoch
Interval	interval at which to expect a new value.	10	interval
Host	used to identify the host.	localhost	can be uuid for vm or host… or can give host a name
Plugin	used to identify the plugin.	cpu
Plugin instance (optional)	used to group a set of values together, for example values belonging to a DPDK interface.	0
Type	unit used to measure a value. In other words used to refer to a data set.	percent
Type instance (optional)	used to distinguish between values that have an identical type.	user
meta data	an opaque data structure that enables the passing of additional information about a value list. “Meta data in the global cache can be used to store arbitrary information about an identifier”
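
The host, plugin, plugin instance, type and type instance fields together identify a value list. collectd commonly renders them as a single string of the form host/plugin-plugin_instance/type-type_instance. The following is a minimal Python sketch of that rendering using the example values from the table above; the function name format_identifier is illustrative only and is not part of collectd.

```python
from typing import Optional


def format_identifier(host: str, plugin: str, type_: str,
                      plugin_instance: Optional[str] = None,
                      type_instance: Optional[str] = None) -> str:
    """Join value-list fields as host/plugin[-instance]/type[-instance].

    Hypothetical helper for illustration; optional instances are
    simply omitted when they are empty or None.
    """
    plugin_part = f"{plugin}-{plugin_instance}" if plugin_instance else plugin
    type_part = f"{type_}-{type_instance}" if type_instance else type_
    return f"{host}/{plugin_part}/{type_part}"


# Example values taken from the value list table above.
print(format_identifier("localhost", "cpu", "percent",
                        plugin_instance="0", type_instance="user"))
```

Both instance fields are optional, so a value list without them (for example host "localhost", plugin "load", type "load") would simply render as localhost/load/load.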

Notifications

Notifications in collectd are generic messages containing:

An associated severity, which can be one of OKAY, WARNING, and FAILURE.
A time.
A message.
A host.
A plugin.
A plugin instance (optional).
A type.
A type instance (optional).
Meta-data.

Example notification:

Severity:FAILURE
Time:1472552207.385
Host:pod3-node1
Plugin:dpdkevents
PluginInstance:dpdk0
Type:gauge
TypeInstance:link_status
DataSource:value
CurrentValue:1.000000e+00
WarningMin:nan
WarningMax:nan
FailureMin:2.000000e+00
FailureMax:nan

Host pod3-node1, plugin dpdkevents (instance dpdk0) type gauge (instance link_status): Data source "value" is currently 1.000000. That is below the failure threshold of 2.000000.
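
A consumer of such a notification usually needs to split the "Key:Value" header fields from the free-text message and may want to re-check the value against the thresholds. The following Python sketch does both for the example above. It assumes the header and message are separated by one blank line, and the classify function is a simplified illustration, not collectd's own threshold logic (it ignores options such as hysteresis or inverted ranges). The names parse_notification and classify are hypothetical.

```python
import math

SAMPLE = """\
Severity:FAILURE
Time:1472552207.385
Host:pod3-node1
Plugin:dpdkevents
PluginInstance:dpdk0
Type:gauge
TypeInstance:link_status
DataSource:value
CurrentValue:1.000000e+00
WarningMin:nan
WarningMax:nan
FailureMin:2.000000e+00
FailureMax:nan

Host pod3-node1, plugin dpdkevents (instance dpdk0) type gauge (instance link_status): Data source "value" is currently 1.000000. That is below the failure threshold of 2.000000.
"""


def parse_notification(text):
    """Return (fields, message): header 'Key:Value' pairs and the message body."""
    header, _, message = text.partition("\n\n")
    fields = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")  # split on the first colon only
        fields[key.strip()] = value.strip()
    return fields, message.strip()


def classify(value, warn_min, warn_max, fail_min, fail_max):
    """Return OKAY, WARNING or FAILURE; a NaN bound is treated as 'not set'."""
    def outside(low, high):
        below = not math.isnan(low) and value < low
        above = not math.isnan(high) and value > high
        return below or above

    if outside(fail_min, fail_max):
        return "FAILURE"
    if outside(warn_min, warn_max):
        return "WARNING"
    return "OKAY"


fields, message = parse_notification(SAMPLE)
keys = ("CurrentValue", "WarningMin", "WarningMax", "FailureMin", "FailureMax")
severity = classify(*(float(fields[k]) for k in keys))
print(fields["Plugin"], fields["TypeInstance"], severity)
```

With the example values, the current value 1.0 is below FailureMin 2.0, so the sketch arrives at the same FAILURE severity that the notification reports.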

Supported Metrics and Events

Metrics

Reference starting point: https://github.com/collectd/collectd/blob/master/src/types.db

The table below maps the "base" plugins that would run on the host or the guest.

Where collectd is running	Plugin	Type	Type Instance	Description	comment
Host/guest	CPU	percent/nanoseconds	idle	Time CPU spends idle.	Can be per CPU or aggregated across all CPUs. For more info, please see: http://man7.org/linux/man-pages/man1/top.1.html http://blog.scoutapp.com/articles/2015/02/24/understanding-linuxs-cpu-stats Note that jiffies operate on a variable time base, HZ. The default value of HZ (100) should be used, yielding a jiffy value of 0.01 seconds [time(7)]. Also, the actual number of jiffies in each second is subject to system factors, such as use of virtualization. Thus, the percent calculation based on jiffies will nominally sum to 100% plus or minus error.

		percent/nanoseconds	nice	Time the CPU spent running user space processes that have been niced. The priority level of a user space process can be tweaked by adjusting its niceness.
		percent/nanoseconds	interrupt	Time the CPU has spent servicing interrupts.
		percent/nanoseconds	softirq	(apparently) Time spent handling interrupts that are synthesized, and almost as important as Hardware interrupts (above). "In current kernels there are ten softirq vectors defined; two for tasklet processing, two for networking, two for the block layer, two for timers, and one each for the scheduler and read-copy-update processing. The kernel maintains a per-CPU bitmask indicating which softirqs need processing at any given time." [Ref]
		percent/nanoseconds	steal	CPU steal is a measure of the fraction of time that a machine is in a state of “involuntary wait.” It is time for which the kernel cannot otherwise account in one of the traditional classifications like user, system, or idle. It is time that went missing, from the perspective of the kernel. http://www.stackdriver.com/understanding-cpu-steal-experiment/
		percent/nanoseconds
		percent/nanoseconds	system	Time that the CPU spent running the kernel.
		percent/nanoseconds	user	Time CPU spends running un-niced user space processes.
		percent/nanoseconds	wait	The time the CPU spends idle while waiting for an I/O operation to complete
	Interface	if_dropped	in	The total number of received dropped packets.
		if_errors	in	The total number of received error packets.	http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html
		if_octets	in	The total number of received bytes.
		if_packets	in	The total number of received packets.
		if_dropped	out	The total number of transmit packets dropped
		if_errors	out	The total number of transmit error packets. (This is the total of error conditions encountered when attempting to transmit a packet. The code here explains the possibilities, but this code is no longer present in /net/core/dev.c master at present - it appears to have moved to /net/core/net-procfs.c.)
		if_octets	out	The total number of bytes transmitted
		if_packets	out	The total number of transmitted packets
	Memory	memory	buffered	The amount, in kibibytes, of temporary storage for raw disk blocks.	https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-proc-meminfo.html
		memory	cached	The amount of physical RAM, in kibibytes, used as cache memory.
		memory	free	The amount of physical RAM, in kibibytes, left unused by the system.
		memory	slab_recl	The part of Slab that can be reclaimed, such as caches.	Slab — The total amount of memory, in kibibytes, used by the kernel to cache data structures for its own use
		memory	slab_unrecl	The part of Slab that cannot be reclaimed even when lacking memory	https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-proc-meminfo.html
		memory	used	mem_used = mem_total - (mem_free + mem_buffered + mem_cached + mem_slab_total);	https://github.com/collectd/collectd/blob/master/src/memory.c#L349
	disk	disk_io_time	io_time	time spent doing I/Os (ms). You can treat this metric as a device load percentage (Value of 1 sec time spent matches 100% of load).
		disk_io_time	weighted_io_time	measure of both I/O completion time and the backlog that may be accumulating.
		disk_merged	read	the number of read operations that could be merged into other, already queued operations, i.e. one physical disk access served two or more logical operations. The higher that number, the better.
		disk_merged	write	the number of write operations that could be merged into other, already queued operations, i.e. one physical disk access served two or more logical operations. The higher that number, the better.
		disk_octets	read	the number of octets read from a disk or partition
		disk_octets	write	the number of octets written to a disk or partition
		disk_ops	read	the number of read operations issued to the disk
		disk_ops	write	the number of write operations issued to the disk
		disk_time	read	the average time a read I/O operation took to complete. Note from collectd: since this is a little messy to calculate, take the actual values with a grain of salt.
		disk_time	write	the average time a write I/O operation took to complete. Note from collectd: since this is a little messy to calculate, take the actual values with a grain of salt.	https://collectd.org/wiki/index.php/Plugin:Disk
		pending_operations		shows queue size of pending I/O operations.	http://lxr.free-electrons.com/source/include/uapi/linux/if_link.h#L43
	Ping	ping		Network latency is measured as a round-trip time in milliseconds. An ICMP “echo request” is sent to a host and the time needed for its echo-reply to arrive is measured.	Latency
		ping_droprate		droprate = ((double) (pkg_sent - pkg_recv)) / ((double) pkg_sent);	https://github.com/collectd/collectd/blob/master/src/ping.c#L703
		ping_stddev		if pkg_recv > 1: latency_stddev = sqrt (((((double) pkg_recv) * latency_squared) - (latency_total * latency_total)) / ((double) (pkg_recv * (pkg_recv - 1))));	https://github.com/collectd/collectd/blob/master/src/ping.c#L698
					pkg_recv = number of echo-reply messages received; latency_squared = latency * latency (for a received echo-reply message); latency_total = the total latency for received echo-reply messages


	load	load	shortterm	load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1 Minute	http://man7.org/linux/man-pages/man5/proc.5.html
			shortterm	measured CPU and IO utilization for 1 min using /proc/loadavg	https://github.com/collectd/collectd/blob/master/src/load.c
			midterm	load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 5 Minutes
			midterm	measured CPU and IO utilization for 5 mins using /proc/loadavg
			longterm	load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 15 Minutes
			longterm	measured CPU and IO utilization for 15 mins using /proc/loadavg
	OVS events	gauge	link_status	Link status of the OvS interface: UP or DOWN
	OVS Stats	if_collisions		Number of collisions.	per interface
		if_rx_octets		Number of received bytes.	http://openvswitch.org/ovs-vswitchd.conf.db.5.pdf
		if_rx_errors	crc	Number of CRC errors.
		if_dropped rx:		Number of packets dropped by RX.
		if_errors rx:		Total number of receive errors, greater than or equal to the sum of the RX errors above.
		if_rx_errors	frame	Number of frame alignment errors.
		if_rx_errors	over	Number of packets with RX overrun.
		if_packets rx:		Number of received packets
		if_tx_octets		Number of transmitted bytes
		if_dropped tx:		Number of packets dropped by TX
		if_errors tx:		Total number of transmit errors, greater than or equal to the sum of the TX errors above.
		if_packets tx:		Number of transmitted packets
		if_packets rx:	1_to_64_packets	The total number of packets (including bad packets) received that were 64 octets in length (excluding framing bits but including FCS octets).	supported in OvS v2.6+ and dpdk ports only
		if_packets rx:	65_to_127_packets	The total number of packets (including bad packets) received that were between 65 and 127 octets in length inclusive (excluding framing bits but including FCS octets).	supported in OvS v2.6+ and dpdk ports only
		if_packets rx:	128_to_255_packets	The total number of packets (including bad packets) received that were between 128 and 255 octets in length inclusive (excluding framing bits but including FCS octets).	supported in OvS v2.6+ and dpdk ports only
		if_packets rx:	256_to_511_packets	The total number of packets (including bad packets) received that were between 256 and 511 octets in length inclusive (excluding framing bits but including FCS octets).	supported in OvS v2.6+ and dpdk ports only
		if_packets rx:	512_to_1023_packets	The total number of packets (including bad packets) received that were between 512 and 1023 octets in length inclusive (excluding framing bits but including FCS octets).	supported in OvS v2.6+ and dpdk ports only
		if_packets rx:	1024_to_1522_packets	The total number of packets (including bad packets) received that were between 1024 and 1522 octets in length inclusive (excluding framing bits but including FCS octets).	supported in OvS v2.6+ and dpdk ports only
		if_packets rx:	1523_to_max_packets	The total number of packets (including bad packets) received that were between 1523 and max octets in length inclusive (excluding framing bits but including FCS octets).	supported in OvS v2.6+ and dpdk ports only
		if_packets tx:	1_to_64_packets	The total number of packets transmitted that were 64 octets in length.	supported in OvS v2.6+ and dpdk ports only
		if_packets tx:	65_to_127_packets	The total number of packets transmitted that were between 65 and 127 octets in length inclusive	supported in OvS v2.6+ and dpdk ports only
		if_packets tx:	128_to_255_packets	The total number of packets transmitted that were between 128 and 255 octets in length inclusive	supported in OvS v2.6+ and dpdk ports only
		if_packets tx:	256_to_511_packets	The total number of packets transmitted that were between 256 and 511 octets in length inclusive	supported in OvS v2.6+ and dpdk ports only
		if_packets tx:	512_to_1023_packets	The total number of packets transmitted that were between 512 and 1023 octets in length inclusive	supported in OvS v2.6+ and dpdk ports only
		if_packets tx:	1024_to_1522_packets	The total number of packets transmitted that were between 1024 and 1522 octets in length inclusive	supported in OvS v2.6+ and dpdk ports only
		if_packets tx:	1523_to_max_packets	The total number of packets transmitted that were between 1523 and max octets in length inclusive	supported in OvS v2.6+ and dpdk ports only
		if_multicast	tx_multicast_packets	The number of good packets transmitted that were directed to a multicast address. Note that this number does not include packets directed to the broadcast address.	supported in OvS v2.6+ and dpdk ports only
		if_packets rx:	broadcast_packets	The total number of good packets received that were directed to the broadcast address.	supported in OvS v2.6+ and dpdk ports only
		if_packets tx:	broadcast_packets	The number of good packets transmitted that were directed to the broadcast address.	supported in OvS v2.6+ and dpdk ports only
		if_rx_errors	rx_undersized_errors	The total number of packets received that were less than 64 octets long (excluding framing bits, but including FCS octets) and were otherwise well formed.	supported in OvS v2.6+ and dpdk ports only
		if_rx_errors	rx_oversize_errors	The total number of packets received that were longer than max octets (excluding framing bits, but including FCS octets) and were otherwise well formed.	supported in OvS v2.6+ and dpdk ports only
		if_rx_errors	rx_fragmented_errors	The total number of packets received that were less than 64 octets in length (excluding framing bits but including FCS octets) and had either a bad Frame Check Sequence (FCS) with an integral number of octets (FCS Error) or a bad FCS with a non-integral number of octets (Alignment Error). Note that it is entirely normal for rx_fragmented_errors to increment, because it counts both runts (which are normal occurrences due to collisions) and noise hits.	supported in OvS v2.6+ and dpdk ports only
		if_rx_errors	rx_jabber_errors	The total number of jabber packets received that had either a bad Frame Check Sequence (FCS) with an integral number of octets (FCS Error) or a bad FCS with a non-integral number of octets (Alignment Error).	supported in OvS v2.6+ and dpdk ports only
	Hugepages	bytes	used	Number of used hugepages in bytes	total/pernode/both
		bytes	free	Number of free hugepages in bytes
		vmpage_number	used	Number of used hugepages in numbers
		vmpage_number	free	Number of free hugepages in numbers
		percent	used	Number of used hugepages in percent
		percent	free	Number of free hugepages in percent
	processes	fork_rate		the number of threads created since the last reboot	The information comes mainly from /proc/PID/status, /proc/PID/psinfo and /proc/PID/usage.
		ps_state	blocked	the number of processes in a blocked state	https://collectd.org/wiki/index.php/Plugin:Processes
		ps_state	paging	the number of processes in a paging state	http://man7.org/linux/man-pages/man5/proc.5.html
		ps_state	running	the number of processes in a running state
		ps_state	sleeping	the number of processes in a sleeping state
		ps_state	stopped	the number of processes in a stopped state
		ps_state	zombies	the number of processes in a zombie state
Host only	Libvirt	disk_octets	DISK	number of read/write bytes as unsigned long long.
		disk_ops	DISK	number of read/write requests
		disk_time	flush-DISK	total time spent on cache reads/writes in nanoseconds
		if_dropped	INTERFACE	packets dropped on rx/tx as unsigned long long
		if_errors	INTERFACE	rx/tx errors as unsigned long long
		if_octets	INTERFACE	bytes received/transmitted as unsigned long long
		if_packets	INTERFACE	packets received/transmitted as unsigned long long
		memory	actual_balloon	Current balloon value (in kB).	https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainMemoryStatStruct
		memory	rss	Resident Set Size of the process running the domain. This value is in kB.
		memory	swap_in	The total amount of data read from swap space (in kB).
		memory	total	the memory in KBytes used by the domain
		virt_cpu_total		the CPU time used in nanoseconds
		virt_vcpu	VCPU_NR	the CPU time used in nanoseconds per cpu
		cpu_affinity	vcpu_NR-cpu_NR	pinning of domain VCPUs to host physical CPUs.	Value stored is a boolean.
		job_stats	*	Information about progress of a background/completed job on a domain.	Number of metrics depend on job type. Check API documentation for more information: virDomainGetJobStats
		disk_error	DISK_NAME	Disk error code	Metric isn’t dispatched for disk with no errors
		percent	virt_cpu_total	CPU utilization in percentage per domain
		perf	*	Performance monitoring events	Number of metrics depends on libvirt API version. The following perf metrics are available in libvirt API version 2.4. To collect perf metrics, they must be enabled in the domain and supported by the platform.
		perf	perf_cmt	usage of L3 cache in bytes by applications running on the platform
		perf	perf_mbmt	total system bandwidth from one level of cache
		perf	perf_mbml	bandwidth of memory traffic for a memory controller
		perf	perf_cpu_cycles	the count of cpu cycles (total/elapsed)
		perf	perf_instructions	the count of instructions by applications running on the platform
		perf	perf_cache_references	the count of cache hits by applications running on the platform
		perf	perf_cache_misses	the count of cache misses by applications running on the platform
		ps_cputime		physical user/system cpu time consumed by the hypervisor
		total_requests	flush-DISK	total flush requests of the block device
		total_time_in_ms	flush-DISK	total time spent on cache flushing in milliseconds
	RDT	ipc		Number of instructions per clock per core group	per core group
		memory_bandwidth	local	Local Memory Bandwidth utilization
		memory_bandwidth	remote	Remote Memory Bandwidth utilization
		bytes	llc	Last Level Cache occupancy
Host/guest	dpdkstats	derive	rx_l3_l4_xsum_error	Number of receive IPv4, TCP, UDP or SCTP XSUM errors.
		errors	flow_director_filter_add_errors	Number of failed added filters	compatible with DPDK 16.04, 16.07 (based on ixgbe, vhost support will be enabled in DPDK 16.11)
			flow_director_filter_remove_errors	Number of failed removed filters
			mac_local_errors	Number of faults in the local MAC.
			mac_remote_errors	Number of faults in the remote MAC.
		if_rx_dropped	rx_fcoe_dropped	Number of Rx packets dropped due to lack of descriptors.
			rx_mac_short_packet_dropped	Number of MAC short packet discard packets received.
			rx_management_dropped	Number of management packets dropped. This register counts the total number of packets received that pass the management filters and then are dropped because the management receive FIFO is full. Management packets include any packet directed to the manageability console (such as RMCP and ARP packets).
			rx_priorityX_dropped	Number of dropped packets received per UP	where X is 0 to 7
		if_rx_errors	rx_crc_errors	Counts the number of receive packets with CRC errors. In order for a packet to be counted in this register, it must be 64 bytes or greater (from <Destination Address> through <CRC>, inclusively) in length.
			rx_errors	Number of errors received
			rx_fcoe_crc_errors	FC CRC Count.
				Count the number of packets with good Ethernet CRC and bad FC CRC
			rx_fcoe_mbuf_allocation_errors	Number of fcoe Rx packets dropped due to lack of descriptors.
			rx_fcoe_no_direct_data_placement
			rx_fcoe_no_direct_data_placement_ext_buff
			rx_fragment_errors	Number of receive fragment errors (frames shorter than 64 bytes from <Destination Address> through <CRC>, inclusively) that have bad CRC (this is slightly different from the Receive Undersize Count register).
			rx_illegal_byte_errors	Counts the number of receive packets with illegal bytes errors (such as there is an illegal symbol in the packet).
			rx_jabber_errors	Number of receive jabber errors. This register counts the number of received packets that are greater than maximum size and have bad CRC (this is slightly different from the Receive Oversize Count register). The packets length is counted from <Destination Address> through <CRC>, inclusively.
			rx_length_errors	Number of packets with receive length errors. A length error occurs if an incoming packet length field in the MAC header doesn't match the packet length.
			rx_mbuf_allocation_errors	Number of Rx packets dropped due to lack of descriptors.
			rx_oversize_errors	Receive Oversize Error. This register counts the number of received frames that are longer than maximum size as defined by MAXFRS.MFS (from <Destination Address> through <CRC>, inclusively) and have valid CRC.
			rx_priorityX_mbuf_allocation_errors	Number of received packets per UP dropped due to lack of descriptors.	where X is 0 to 7
			rx_q0_errors	Number of errors received for the queue.	if more queues are allocated then you get the errors per Queue
			rx_undersize_errors	Receive Undersize Error. This register counts the number of received frames that are shorter than minimum size (64 bytes from <Destination Address> through <CRC>, inclusively), and had a valid CRC.
		if_rx_octets	rx_error_bytes	Counts the number of receive packets with error bytes (such as there is an error symbol in the packet). This register counts all packets received, regardless of L2 filtering and receive enablement.	bug - will move this to errors
			rx_fcoe_bytes	number of received fcoe bytes
			rx_good_bytes	Good octets/bytes received count. This register includes bytes received in a packet from the <Destination Address> field through the <CRC> field, inclusively.
			rx_q0_bytes	Number of bytes received for the queue.	per queue
			rx_total_bytes	Total received octets. This register includes bytes received in a packet from the <Destination Address> field through the <CRC> field, inclusively.
		if_rx_packets	rx_broadcast_packets	Number of good (non-erred) broadcast packets received.
			rx_fcoe_packets	Number of FCoE packets posted to the host. In normal operation (no save bad frames) it equals to the number of good packets.
			rx_flow_control_xoff_packets	Number of XOFF packets received. This register counts any XOFF packet whether it is a legacy XOFF or a priority XOFF. Each XOFF packet is counted once even if it is designated to a few priorities.
			rx_flow_control_xon_packets	Number of XON packets received. This register counts any XON packet whether it is a legacy XON or a priority XON. Each XON packet is counted once even if it is designated to a few priorities.
			rx_good_packets	Number of good (non-erred) Rx packets (from the network).
			rx_management_packets	Number of management packets received. This register counts the total number of packets received that pass the management filters. Management packets include RMCP and ARP packets. Any packets with errors are not counted, except for the packets that are dropped because the management receive FIFO is full are counted.
			rx_multicast_packets	Number of good (non-erred) multicast packets received (excluding broadcast packets). This register does not count received flow control packets.
			rx_priorityX_xoff_packets	Number of XOFF packets received per UP	where X is 0 to 7
			rx_priorityX_xon_packets	Number of XON packets received per UP	where X is 0 to 7
			rx_q0_packets	Number of packets received for the queue.	per queue
			rx_size_1024_to_max_packets	Number of packets received that are 1024-max bytes in length (from <Destination Address> through <CRC>, inclusively). This register does not include received flow control packets. The maximum is dependent on the current receiver configuration and the type of packet being received. If a packet is counted in receive oversized count, it is not counted in this register. Due to changes in the standard for maximum frame size for VLAN tagged frames in 802.3, packets can have a maximum length of 1522 bytes.
			rx_size_128_to_255_packets	Number of packets received that are 128-255 bytes in length (from <Destination Address> through <CRC>, inclusively).
			rx_size_256_to_511_packets	Number of packets received that are 256-511 bytes in length (from <Destination Address> through <CRC>, inclusively).
			rx_size_512_to_1023_packets	Number of packets received that are 512-1023 bytes in length (from <Destination Address> through <CRC>, inclusively).
			rx_size_64_packets	Number of good packets received that are 64 bytes in length (from <Destination Address> through <CRC>, inclusively).
			rx_size_65_to_127_packets	Number of packets received that are 65-127 bytes in length (from <Destination Address> through <CRC>, inclusively)
			rx_total_missed_packets	The total number of Rx missed packets, that is, packets that were correctly received by the NIC but had to be dropped by the NIC itself because it was out of descriptors and internal memory.
			rx_total_packets	Number of all packets received. This register counts the total number of all packets received. All packets received are counted in this register, regardless of their length, whether they are erred, but excluding flow control packets.
			rx_xoff_packets	Number of XOFF packets received. Sticks to 0xFFFF. XOFF packets can use the global address or the station address. This register counts any XOFF packet whether it is a legacy XOFF or a priority XOFF. Each XOFF packet is counted once even if it is designated to a few priorities. If a priority FC packet contains both XOFF and XON, only this counter is incremented.
			rx_xon_packets	Number of XON packets received. XON packets can use the global address, or the station address. This register counts any XON packet whether it is a legacy XON or a priority XON. Each XON packet is counted once even if it is designated to a few priorities. If a priority FC packet contains both XOFF and XON, only the LXOFFRXCNT counter is incremented.
		if_tx_errors	tx_errors	Total number of TX error packets
		if_tx_octets	tx_fcoe_bytes	Number of fcoe bytes transmitted
			tx_good_bytes	counter of successfully transmitted octets. This register includes transmitted bytes in a packet from the <Destination Address> field through the <CRC> field, inclusively.
			tx_q0_bytes	Number of bytes transmitted by the queue.	per queue
		if_tx_packets	tx_broadcast_packets	Number of broadcast packets transmitted. This register counts all packets, including standard packets, secure packets, FC packets and manageability packets.
			tx_fcoe_packets	Number of fcoe packets transmitted
			tx_flow_control_xoff_packets	Link XOFF Transmitted Count
			tx_flow_control_xon_packets	Link XON Transmitted Count
			tx_good_packets	Number of good packets transmitted
			tx_management_packets	Number of management packets transmitted.
			tx_multicast_packets	Number of multicast packets transmitted. This register counts the number of multicast packets transmitted. This register counts all packets, including standard packets, secure packets, FC packets and manageability packets.
			tx_priorityX_xoff_packets	Number of XOFF packets transmitted per UP	where X is 0 to 7
			tx_priorityX_xon_packets	Number of XON packets transmitted per UP	where X is 0 to 7
			tx_q0_packets	Number of packets transmitted for the queue. A packet is considered as transmitted if it was forwarded to the MAC unit for transmission to the network and/or is accepted by the internal Tx to Rx switch enablement logic. Packets dropped due to anti-spoofing filtering or VLAN tag validation (as described in Section 7.10.3.9.2) are not counted.	per queue
			tx_size_1024_to_max_packets	Number of packets transmitted that are 1024 or more bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, and manageability packets.
			tx_size_128_to_255_packets	Number of packets transmitted that are 128-255 bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, and manageability packets.
			tx_size_256_to_511_packets	Number of packets transmitted that are 256-511 bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, and manageability packets.
			tx_size_512_to_1023_packets	Number of packets transmitted that are 512-1023 bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, and manageability packets.
			tx_size_64_packets	Number of packets transmitted that are 64 bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, FC packets, and manageability packets.
			tx_size_65_to_127_packets	Number of packets transmitted that are 65-127 bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, and manageability packets.
			tx_total_packets	Number of all packets transmitted. This register counts the total number of all packets transmitted. This register counts all packets, including standard packets, secure packets, FC packets, and manageability packets.
			tx_xoff_packets	Number of XOFF packets transmitted
			tx_xon_packets	Number of XON packets transmitted
		operations	flow_director_added_filters	This field counts the number of added filters to the flow director filters logic.
			flow_director_matched_filters	This field counts the number of matched filters to the flow director filters logic.
			flow_director_missed_filters	This field counts the number of missed filters to the flow director filters logic.
			flow_director_removed_filters	This field counts the number of removed filters from the flow director filters logic.
	mcelog (RAS memory)	errors	corrected_memory_errors	The total number of hardware errors that were corrected by the hardware (e.g. a single-bit data corruption that was correctable using ECC). These errors do not require immediate software actions, but are still reported for accounting and predictive failure analysis.	Memory (RAM) errors are among the most common errors in typical server systems. They also scale with the amount of memory: the more memory, the more errors. In addition, large clusters of computers with tens or hundreds (or sometimes thousands) of active machines increase the total error rate of the system.
			uncorrected_memory_errors	The total number of uncorrected hardware errors detected by the hardware. Data corruption has occurred. These errors require software reaction.	http://www.mcelog.org/memory.html
			corrected_memory_errors_in_%s	The total number of hardware errors that were corrected by the hardware in a certain period of time	where %s is a timed period like 24 hours
					http://www.mcelog.org/memory.html
			uncorrected_memory_errors_in_%s	The total number of uncorrected hardware errors detected by the hardware in a certain period of time	where %s is a timed period like 24 hours
					http://www.mcelog.org/memory.html
Host	IPMI (specific per BMC, so these will change depending on what's supported by the BMC; this is an example for the S2600WT2R platform)	percent	MTT CPU2	IPMI defines many types of sensors, but groups them into two main categories: threshold and discrete. Threshold sensors are “analog”: they have continuous (or mostly continuous) readings, such as fan speed, voltage, or temperature. Discrete sensors have a set of binary readings that may each be independently zero or one. In some sensors, these may be independent. For instance, a power supply may have both an external power failure and a predictive failure at the same time. In other cases they may be mutually exclusive. For instance, each bit may represent the initialization state of a piece of software.	The IPMI plugin supports analog sensors of type voltage, temperature, fan and current, plus analog sensors that have VALUE type WATTS, CFM and percentage (%). http://openipmi.sourceforge.net/IPMI.pdf
			MTT CPU1
			P2 Therm Ctrl %
			P1 Therm Ctrl %
			PS1 Curr Out %
		voltage	BB +3.3V Vbat
		voltage	BB +12.0V
		temperature	Agg Therm Mgn 1
			DIMM Thrm Mrgn 4
			DIMM Thrm Mrgn 3
			DIMM Thrm Mrgn 2
			DIMM Thrm Mrgn 1
			P2 DTS Therm Mgn
			P1 DTS Therm Mgn
			P2 Therm Ctrl %
			P1 Therm Ctrl %
			P2 Therm Margin
			P1 Therm Margin
			PS1 Temperature
			LAN NIC Temp
			Exit Air Temp
			HSBP 1 Temp
			I/O Mod Temp
			BB Lft Rear Temp
			BB Rt Rear Temp
			BB BMC Temp
			SSB Temp
			Front Panel Temp
			BB P2 VR Temp
			BB P1 VR Temp
		fan	System Fan 6B
			System Fan 6A
			System Fan 5B
			System Fan 5A
			System Fan 4B
			System Fan 4A
			System Fan 3B
			System Fan 3A
			System Fan 2B
			System Fan 2A
			System Fan 1B
			System Fan 1A
		CFM	System Airflow
		watts	PS1 Input Power
Host	intel_pmu	counter	cpu-cycles	[Hardware event]	The types of events are: Hardware Events: These instrument low-level processor activity based on CPU performance counters. For example, CPU cycles, instructions retired, memory stall cycles, level 2 cache misses, etc. Some will be listed as Hardware Cache Events. Software Events: These are low level events based on kernel counters. For example, CPU migrations, minor faults, major faults, etc. http://www.brendangregg.com/perf.html#Events
			instructions
			cache-references
			cache-misses
			branch-instructions OR branches
			branch-misses
			bus-cycles
			cpu-clock	[Software event]
			task-clock
			page-faults OR faults
			minor-faults
			major-faults
			context-switches OR cs
			cpu-migrations OR migrations
			alignment-faults
			emulation-faults
			L1-dcache-loads	[Hardware cache event]
			L1-dcache-load-misses
			L1-dcache-stores
			L1-dcache-store-misses
			L1-dcache-prefetches
			L1-dcache-prefetch-misses
			L1-icache-loads
			L1-icache-load-misses
			L1-icache-prefetches
			L1-icache-prefetch-misses
			LLC-loads
			LLC-load-misses
			LLC-stores
			LLC-store-misses
			LLC-prefetch-misses
			dTLB-loads
			dTLB-load-misses
			dTLB-stores
			dTLB-store-misses
			dTLB-prefetches
			dTLB-prefetch-misses
			iTLB-loads
			iTLB-load-misses
			branch-loads
			branch-load-misses
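
Each row of the table above corresponds to one or more value-list identifiers. As a minimal sketch, assuming the conventional collectd identifier layout `host/plugin[-plugin_instance]/type[-type_instance]`, the Python helper below shows how the Plugin, Type and Type Instance columns combine. The function name `format_identifier`, the host name and the plugin instance `0` are illustrative assumptions, not part of collectd.

```python
def format_identifier(host, plugin, type_, plugin_instance="", type_instance=""):
    """Build a collectd-style identifier string from value-list fields.

    The optional instance fields are appended with a hyphen only when set.
    """
    plugin_part = f"{plugin}-{plugin_instance}" if plugin_instance else plugin
    type_part = f"{type_}-{type_instance}" if type_instance else type_
    return f"{host}/{plugin_part}/{type_part}"


# Hypothetical host name; plugin, type and type instance come from the table.
print(format_identifier("localhost", "ipmi", "temperature",
                        type_instance="Exit Air Temp"))
# Hypothetical plugin instance "0", used only to show the optional field.
print(format_identifier("localhost", "intel_pmu", "counter",
                        plugin_instance="0", type_instance="cache-misses"))
```

Keeping the instance fields optional mirrors the value list definition, where plugin instance and type instance may be absent.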

Events

NOTE: Collectd can generate events based on thresholds for any of the metrics reported in the table above. For more info please see: https://collectd.org/documentation/manpages/collectd.conf.5.shtml#threshold_configuration
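
To illustrate how a threshold turns a metric into an event, here is a simplified Python sketch of the comparison. It assumes only the four bounds that appear in the example notification earlier in this document (WarningMin, WarningMax, FailureMin, FailureMax), treats `nan` as "not configured", and checks failure bounds before warning bounds. The function name `check_thresholds` is hypothetical, and options such as hysteresis or inverted ranges are deliberately left out.

```python
import math


def check_thresholds(value, warning_min=math.nan, warning_max=math.nan,
                     failure_min=math.nan, failure_max=math.nan):
    """Return "FAILURE", "WARNING" or "OKAY" for a single value.

    A NaN bound means the bound is not configured. Failure bounds are
    checked first, so they take precedence over warning bounds.
    """
    def outside(low, high):
        below = not math.isnan(low) and value < low
        above = not math.isnan(high) and value > high
        return below or above

    if outside(failure_min, failure_max):
        return "FAILURE"
    if outside(warning_min, warning_max):
        return "WARNING"
    return "OKAY"


# Mirrors the example notification: link_status is 1.0 and FailureMin is 2.0.
print(check_thresholds(1.0, failure_min=2.0))
print(check_thresholds(2.0, failure_min=2.0))
```

With the values from the example notification, the first call yields FAILURE because 1.0 is below the failure minimum of 2.0, and the second yields OKAY.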

Where collectd is running	Plugin	Type	Type Instance	Severity	Description	comment
host/guest	ovs_events	gauge	link_status	Warning on Link Status Down	Link status of the OvS interface: UP or DOWN. Severity will be configurable by the end user.
				OKAY on link Status Up
host/guest	dpdk_events		link_status	Warning on Link Status Down, OKAY on link status up	Link status of the OvS interface: UP or DOWN. Severity will be configurable by the end user.	Depending on plugin configuration, can be dispatched as a metric or event.
			keep_alive	OKAY: if core status is ALIVE, UNUSED, DOZING, SLEEP. Warning: if core status is MISSING. Failure: if core status is DEAD or GONE.	Reflects the state of DPDK packet processing cores	Protects against packet processing core failures for DPDK, so there are no silent packet drops. Depending on plugin configuration, can be dispatched as a metric or event.
host	pcie	pcie_error	correctable	Notification (Warning) in case of PCIe correctable error occurrence. Message contains short error description.	Correctable Errors include: Receiver Error Status, Bad TLP Status, Bad DLLP Status, REPLAY_NUM Rollover, Replay Timer Timeout, Advisory Non-Fatal, Corrected Internal, Header Log Overflow. Uncorrectable Errors include: Data Link Protocol, Surprise Down, Poisoned TLP, Flow Control Protocol, Completion Timeout, Completer Abort, Unexpected Completion, Receiver Overflow, Malformed TLP, ECRC Error Status, Unsupported Request, ACS Violation, Internal, MC blocked TLP, Atomic egress blocked, TLP prefix blocked.
			fatal	Notification (Failure) in case of PCIe uncorrectable fatal error occurrence. Message contains short error description.
			non_fatal	Notification (Warning) in case of PCIe uncorrectable non-fatal error occurrence. Message contains short error description.
host	mcelog (RAS memory)	errors		Warning for Corrected Memory Errors; Failure for Uncorrected Memory Errors	Failure on failure to connect to the mcelog socket or if the connection is lost. OKAY on connection to the mcelog socket.	Reports Corrected and Uncorrected DIMM Failures
host	IPMI			OKAY - upper non-critical	Each IPMI sensor may have six different thresholds: upper non-recoverable, upper critical, upper non-critical, lower non-critical, lower critical, lower non-recoverable.	You may have events on a threshold sensor by specifying values (called thresholds) where you want the sensor to report an event. Then you can enable the events for the specific thresholds. Not all sensors support all thresholds; some cannot have their events enabled and others cannot have them disabled. The capabilities of a sensor may all be queried by the user to determine what it can do. When the value of the sensor goes outside the threshold, an event may be generated. An event may also be generated when the value goes back within the threshold.
				OKAY - lower non-critical
				WARNING - lower critical
				WARNING - upper critical
				FAILURE - upper non-recoverable
				FAILURE - lower non-recoverable
				Discrete sensor status changes are also reported via OKAY, WARNING and FAILURE notifications. Examples of discrete sensors can be found under the "IPMI Sensors for S2600WT2R" tab
host	mcelog RAS System, CPU, QPI, OI (specific to a Platform) so these will change depending on what's supported by the Platform.			WARNING - Correctable errors FAILURE - Uncorrectable Errors	Servers based on Intel® Architecture are generally designed for use in mission-critical environments. Reliability, Availability and Serviceability (RAS) features are integrated into the servers to address the error handling and the memory mirroring and sparing required by these environments. The goal of this feature is to expose the RAS features provided by the Broadwell or newer platform to higher-level fault management applications. The features to be exposed fall under the following categories: Reliability Features: system attributes to ensure data integrity; capability to prevent, detect, correct and contain faults over a given time interval. Availability Features: system attributes to help stay operational in the presence of faults in the system; capability to map out failed units and to operate in a degraded mode. Serviceability Features: system attributes to help system service and repair; capability to identify failed units and facilitate repair. Generic Error Handling: The silicon supports corrected, uncorrected (recoverable, unrecoverable), fatal and catastrophic error types. Corrected Errors: Errors that are corrected by either hardware or software. Corrected error information is used in predictive failure analysis by the OS. MCA bank corrected errors, except selected memory corrected errors, are handled directly by the OS. The Haswell-EP processor triggers CMCI for corrected errors; on CMCI, the OS can read the MCA banks and collect error status. All other platform-related corrected errors can either be ignored or be logged into the BMC SEL, based on platform policy. Memory Corrected Errors: Memory corrected errors, such as mirror failover and memory read errors, can be configured to trigger SMI using BIOS setup options. 
On memory mirror failover, BIOS logs the error for the OS as per the UEFI error record format. On memory read errors, BIOS performs the following memory RAS operations, in order, to correct the error: Rank Sparing, SDDC/Device tagging. Uncorrected Non-Fatal Errors: Errors that are not corrected by hardware. In general these errors trigger a machine check exception, which in turn triggers SMI. The BIOS SMI handler logs the error information, clears the error status and passes the error log to the OS. The OS can recover from the error or, where recovery is not an option, trigger a system reset. Uncorrected Fatal Errors: Errors that are neither corrected by hardware nor recovered by software. The system is not in a reliable state and needs a reset to bring it back to normal operation. In most fatal error conditions, BIOS cannot log errors before the system reset happens. All the error status registers are sticky across the reset, so BIOS collects this information on the next boot, creates an error record and passes it on to the OS. 
Error Logging: The example errors provided in the comments tab are for the Purley platform.	/* See IA32 SDM Vol3B Chapter 16*/ Integrated Memory Controller Machine Check Errors "Address parity error", "HA write data parity error", "HA write byte enable parity error", "Corrected patrol scrub error", "Uncorrected patrol scrub error", "Corrected spare error", "Uncorrected spare error", "Any HA read error", "WDB read parity error", "DDR4 command address parity error", "Uncorrected address parity error" "Unrecognized request type", "Read response to an invalid scoreboard entry", "Unexpected read response", "DDR4 completion to an invalid scoreboard entry", "Completion to an invalid scoreboard entry", "Completion FIFO overflow", "Correctable parity error", "Uncorrectable error", "Interrupt received while outstanding interrupt was not ACKed", "ERID FIFO overflow", "Error on Write credits", "Error on Read credits", "Scheduler error", "Error event", "MscodDataRdErr", "Reserved", "MscodPtlWrErr", "MscodFullWrErr", "MscodBgfErr", "MscodTimeout", "MscodParErr", "MscodBucket1Err"
						Interconnect(QPI) Machine Check Errors "UC Phy Initialization Failure", "UC Phy detected drift buffer alarm", "UC Phy detected latency buffer rollover", "UC LL Rx detected CRC error: unsuccessful LLR: entered abort state", "UC LL Rx unsupported or undefined packet", "UC LL or Phy control error", "UC LL Rx parameter exchange exception", "UC LL detected control error from the link-mesh interface", "COR Phy initialization abort", "COR Phy reset", "COR Phy lane failure, recovery in x8 width", "COR Phy L0c error corrected without Phy reset", "COR Phy L0c error triggering Phy Reset", "COR Phy L0p exit error corrected with Phy reset", "COR LL Rx detected CRC error - successful LLR without Phy Reinit", "COR LL Rx detected CRC error - successful LLR with Phy Reinit" "Phy Control Error", "Unexpected Retry.Ack flit", "Unexpected Retry.Req flit", "RF parity error", "Routeback Table error", "unexpected Tx Protocol flit (EOP, Header or Data)", "Rx Header-or-Credit BGF credit overflow/underflow", "Link Layer Reset still in progress when Phy enters L0", "Link Layer reset initiated while protocol traffic not idle", "Link Layer Tx Parity Error"
						Internal Machine Check Errors "No Error", "MCA_DMI_TRAINING_TIMEOUT", "MCA_DMI_CPU_RESET_ACK_TIMEOUT", "MCA_MORE_THAN_ONE_LT_AGENT", "MCA_BIOS_RST_CPL_INVALID_SEQ", "MCA_BIOS_INVALID_PKG_STATE_CONFIG", "MCA_MESSAGE_CHANNEL_TIMEOUT", "MCA_MSGCH_PMREQ_CMP_TIMEOUT", "MCA_PKGC_DIRECT_WAKE_RING_TIMEOUT", "MCA_PKGC_INVALID_RSP_PCH", "MCA_PKGC_WATCHDOG_HANG_CBZ_DOWN", "MCA_PKGC_WATCHDOG_HANG_CBZ_UP", "MCA_PKGC_WATCHDOG_HANG_C3_UP_SF", "MCA_SVID_VCCIN_VR_ICC_MAX_FAILURE", "MCA_SVID_COMMAND_TIMEOUT", "MCA_SVID_VCCIN_VR_VOUT_FAILURE", "MCA_SVID_CPU_VR_CAPABILITY_ERROR", "MCA_SVID_CRITICAL_VR_FAILED", "MCA_SVID_SA_ITD_ERROR", "MCA_SVID_READ_REG_FAILED", "MCA_SVID_WRITE_REG_FAILED", "MCA_SVID_PKGC_INIT_FAILED", "MCA_SVID_PKGC_CONFIG_FAILED", "MCA_SVID_PKGC_REQUEST_FAILED", "MCA_SVID_IMON_REQUEST_FAILED", "MCA_SVID_ALERT_REQUEST_FAILED", "MCA_SVID_MCP_VR_ABSENT_OR_RAMP_ERROR", "MCA_SVID_UNEXPECTED_MCP_VR_DETECTED", "MCA_FIVR_CATAS_OVERVOL_FAULT", "MCA_FIVR_CATAS_OVERCUR_FAULT", "MCA_WATCHDOG_TIMEOUT_PKGC_SLAVE", "MCA_WATCHDOG_TIMEOUT_PKGC_MASTER", "MCA_WATCHDOG_TIMEOUT_PKGS_MASTER", "MCA_PKGS_CPD_UNCPD_TIMEOUT", "MCA_PKGS_INVALID_REQ_PCH", "MCA_PKGS_INVALID_REQ_INTERNAL", "MCA_PKGS_INVALID_RSP_INTERNAL", "MCA_PKGS_SMBUS_VPP_PAUSE_TIMEOUT", "MCA_RECOVERABLE_DIE_THERMAL_TOO_HOT"
host	virt	domain_state		OKAY: VIR_DOMAIN_NOSTATE VIR_DOMAIN_RUNNING VIR_DOMAIN_SHUTDOWN VIR_DOMAIN_SHUTOFF	Domain state and reason in a human-readable format.
				WARNING: VIR_DOMAIN_BLOCKED VIR_DOMAIN_PAUSED VIR_DOMAIN_PMSUSPENDED
				FAILURE: VIR_DOMAIN_CRASHED
host	virt	file_system		OKAY	File system information (mountpoint, device name, filesystem type, number of aliases, disk aliases)	Information stored in metadata. Requires Guest Agent to be installed and configured in VM.
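
The state-to-severity rules in the table above can be read as simple lookup tables. The Python sketch below transcribes the `keep_alive` and `domain_state` rows. The dictionary layout and the helper name `severity_for` are illustrative assumptions, not how the plugins are implemented.

```python
# Severity per state, transcribed from the events table above.
SEVERITY_BY_STATE = {
    "keep_alive": {
        "ALIVE": "OKAY", "UNUSED": "OKAY", "DOZING": "OKAY", "SLEEP": "OKAY",
        "MISSING": "WARNING",
        "DEAD": "FAILURE", "GONE": "FAILURE",
    },
    "domain_state": {
        "VIR_DOMAIN_NOSTATE": "OKAY", "VIR_DOMAIN_RUNNING": "OKAY",
        "VIR_DOMAIN_SHUTDOWN": "OKAY", "VIR_DOMAIN_SHUTOFF": "OKAY",
        "VIR_DOMAIN_BLOCKED": "WARNING", "VIR_DOMAIN_PAUSED": "WARNING",
        "VIR_DOMAIN_PMSUSPENDED": "WARNING",
        "VIR_DOMAIN_CRASHED": "FAILURE",
    },
}


def severity_for(event_kind, state):
    """Return the notification severity for a state, or None if unlisted."""
    return SEVERITY_BY_STATE.get(event_kind, {}).get(state)


print(severity_for("keep_alive", "MISSING"))
print(severity_for("domain_state", "VIR_DOMAIN_CRASHED"))
```

Returning `None` for an unlisted state keeps the sketch from guessing a severity the table does not define.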

SNMP interface

The SNMP interface in collectd provides access to collected metrics through the SNMP Agent plugin. This plugin is an AgentX subagent that receives and handles queries from the SNMP master agent and returns the metrics collected by "read" (collector) plugins. The plugin handles requests only for the OIDs specified in the configuration file. To handle SNMP queries, the plugin gets data from collectd and translates the requested values from collectd's internal format to SNMP format. It is a generic plugin and cannot work without configuration. For more details on the configuration file, see <https://github.com/collectd/collectd/pull/2105/files#diff-9fc6980794a396e7288e1bd17c59a358>
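
The "configured OIDs only" behaviour can be pictured as a lookup from OID to a collectd value. The Python sketch below is a conceptual illustration only: the OID, the mapping, the sample value and the function name `handle_get` are all hypothetical, and it does not speak AgentX or reflect the plugin's actual code.

```python
# Hypothetical configuration: OID -> (plugin, type, type instance).
CONFIGURED_OIDS = {
    "1.3.6.1.4.1.2021.4.6.0": ("memory", "memory", "free"),
}


def handle_get(oid, latest_values):
    """Answer a query for one OID.

    Returns the most recent value for the mapped metric, or None when the
    OID is not configured or no value has been collected yet.
    """
    key = CONFIGURED_OIDS.get(oid)
    if key is None:
        return None  # OIDs outside the configuration are not handled.
    return latest_values.get(key)


# Hypothetical snapshot of values gathered by a "read" plugin.
latest = {("memory", "memory", "free"): 123456.0}
print(handle_get("1.3.6.1.4.1.2021.4.6.0", latest))
print(handle_get("1.3.6.1.2.1.1.1.0", latest))
```

The second query returns `None` because that OID does not appear in the hypothetical configuration.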
