Anuket Project



Statistics

Statistics in collectd consist of a value list. A value list includes:

 

| Value list field | Description | Example | Comment |
| Values | | 99.8999 | percentage |
| Value length | The number of values in the data set. | | |
| Time | Timestamp at which the value was collected. | 1475837857 | epoch |
| Interval | Interval at which to expect a new value. | 10 | interval |
| Host | Used to identify the host. | localhost | Can be a UUID for a VM or host, or the host can be given a name. |
| Plugin | Used to identify the plugin. | cpu | |
| Plugin instance (optional) | Used to group a set of values together, e.g. values belonging to a DPDK interface. | 0 | |
| Type | Unit used to measure a value. In other words, used to refer to a data set. | percent | |
| Type instance (optional) | Used to distinguish between values that have an identical type. | user | |
| Meta data | An opaque data structure that enables the passing of additional information about a value list. "Meta data in the global cache can be used to store arbitrary information about an identifier." | | |
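
As a rough illustration of the fields above, the sketch below models a value list as a Python dataclass and joins the identifying fields into the host/plugin-instance/type-instance identifier string that collectd uses to name a value list. The class name, attribute names and method names are hypothetical and chosen for illustration only; they are not part of any collectd API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ValueList:
    """Illustrative model of a collectd value list (not a collectd API)."""
    values: List[float]
    time: float                      # epoch timestamp of collection
    interval: float                  # seconds until the next value is expected
    host: str
    plugin: str
    type: str
    plugin_instance: Optional[str] = None   # optional grouping field
    type_instance: Optional[str] = None     # optional distinguishing field
    meta: dict = field(default_factory=dict)

    @property
    def value_length(self) -> int:
        # "Value length" is simply the number of values carried.
        return len(self.values)

    def identifier(self) -> str:
        # Optional instances are appended with a hyphen when present.
        plugin = self.plugin
        if self.plugin_instance:
            plugin += "-" + self.plugin_instance
        type_ = self.type
        if self.type_instance:
            type_ += "-" + self.type_instance
        return f"{self.host}/{plugin}/{type_}"


# The example column of the table, expressed as one value list.
example = ValueList(
    values=[99.8999], time=1475837857, interval=10,
    host="localhost", plugin="cpu", plugin_instance="0",
    type="percent", type_instance="user",
)
print(example.identifier())
```

With the example values from the table, the identifier comes out as localhost/cpu-0/percent-user.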

Notifications


Notifications in collectd are generic messages containing:

An associated severity, which can be one of OKAY, WARNING, and FAILURE.
A time.
A message.
A host.
A plugin.
A plugin instance (optional).
A type.
A type instance (optional).
Meta-data.


Example notification:

 

Severity:FAILURE
Time:1472552207.385
Host:pod3-node1
Plugin:dpdkevents
PluginInstance:dpdk0
Type:gauge
TypeInstance:link_status
DataSource:value
CurrentValue:1.000000e+00
WarningMin:nan
WarningMax:nan
FailureMin:2.000000e+00
FailureMax:nan
Host pod3-node1, plugin dpdkevents (instance dpdk0) type gauge (instance link_status): Data source "value" is currently 1.000000. That is below the failure threshold of 2.000000.
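
The example notification reports FAILURE because the current value (1.0) is below FailureMin (2.0), while the other limits are nan, i.e. unset. The sketch below shows that comparison in a simplified form. The function name and parameter names are hypothetical, and this is not collectd's threshold implementation: it ignores options such as hysteresis or inverted ranges and only assumes that a nan limit means "not configured".

```python
import math


def evaluate_severity(value, warning_min=math.nan, warning_max=math.nan,
                      failure_min=math.nan, failure_max=math.nan):
    """Return OKAY, WARNING or FAILURE for a value (simplified sketch)."""

    def below(limit):
        # A nan limit is treated as unset and never triggers.
        return not math.isnan(limit) and value < limit

    def above(limit):
        return not math.isnan(limit) and value > limit

    # Failure limits take precedence over warning limits.
    if below(failure_min) or above(failure_max):
        return "FAILURE"
    if below(warning_min) or above(warning_max):
        return "WARNING"
    return "OKAY"


# Values taken from the example notification above.
severity = evaluate_severity(1.0, failure_min=2.0)
print(severity)
```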

 

Supported Metrics and Events

Dynamic Metrics

Reference starting point: https://github.com/collectd/collectd/blob/master/src/types.db  

But below is a mapping of the "base" plugins that would run on the host/the guest.

| Where collectd is running | Plugin | Type | Type Instance | Description | Comment |
| Host/guest | CPU | percent/nanoseconds | idle | Time the CPU spends idle. | Can be per CPU or aggregated across all the CPUs. For more info, please see http://man7.org/linux/man-pages/man1/top.1.html and http://blog.scoutapp.com/articles/2015/02/24/understanding-linuxs-cpu-stats. Note that jiffies operate on a variable time base, HZ. The default value of HZ (100) should be used, yielding a jiffy value of 0.01 seconds [time(7)]. Also, the actual number of jiffies in each second is subject to system factors, such as use of virtualization. Thus, the percent calculation based on jiffies will nominally sum to 100% plus or minus error. |
| | | percent/nanoseconds | nice | Time the CPU spent running user space processes that have been niced. The priority level of a user space process can be tweaked by adjusting its niceness. | |
| | | percent/nanoseconds | interrupt | Time the CPU has spent servicing interrupts. | |
| | | percent/nanoseconds | softirq | (apparently) Time spent handling interrupts that are synthesized, and almost as important as hardware interrupts (above). "In current kernels there are ten softirq vectors defined; two for tasklet processing, two for networking, two for the block layer, two for timers, and one each for the scheduler and read-copy-update processing. The kernel maintains a per-CPU bitmask indicating which softirqs need processing at any given time." [Ref] | |
| | | percent/nanoseconds | steal | CPU steal is a measure of the fraction of time that a machine is in a state of "involuntary wait." It is time for which the kernel cannot otherwise account in one of the traditional classifications like user, system, or idle. It is time that went missing, from the perspective of the kernel. http://www.stackdriver.com/understanding-cpu-steal-experiment/ | |
| | | percent/nanoseconds | system | Time that the CPU spent running the kernel. | |
| | | percent/nanoseconds | user | Time the CPU spends running un-niced user space processes. | |
| | | percent/nanoseconds | wait | The time the CPU spends idle while waiting for an I/O operation to complete. | |
| | Interface | if_dropped | in | The total number of received dropped packets. | http://www.onlamp.com/pub/a/linux/2000/11/16/LinuxAdmin.html |
| | | if_errors | in | The total number of received error packets. | |
| | | if_octets | in | The total number of received bytes. | |
| | | if_packets | in | The total number of received packets. | |
| | | if_dropped | out | The total number of transmit packets dropped. | |
| | | if_errors | out | The total number of transmit error packets. (This is the total of error conditions encountered when attempting to transmit a packet. The code here explains the possibilities, but this code is no longer present in /net/core/dev.c master at present; it appears to have moved to /net/core/net-procfs.c.) | |
| | | if_octets | out | The total number of bytes transmitted. | |
| | | if_packets | out | The total number of transmitted packets. | |
| | Memory | memory | buffered | The amount, in kibibytes, of temporary storage for raw disk blocks. | https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-proc-meminfo.html |
| | | memory | cached | The amount of physical RAM, in kibibytes, used as cache memory. | |
| | | memory | free | The amount of physical RAM, in kibibytes, left unused by the system. | |
| | | memory | slab_recl | The part of Slab that can be reclaimed, such as caches. | Slab: the total amount of memory, in kibibytes, used by the kernel to cache data structures for its own use. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-proc-meminfo.html |
| | | memory | slab_unrecl | The part of Slab that cannot be reclaimed even when lacking memory. | |
| | | memory | used | mem_used = mem_total - (mem_free + mem_buffered + mem_cached + mem_slab_total); | https://github.com/collectd/collectd/blob/master/src/memory.c#L349 |
| | disk | disk_io_time | io_time | Time spent doing I/Os (ms). You can treat this metric as a device load percentage (a value of 1 sec time spent matches 100% of load). | https://collectd.org/wiki/index.php/Plugin:Disk http://lxr.free-electrons.com/source/include/uapi/linux/if_link.h#L43 |
| | | disk_io_time | weighted_io_time | Measure of both I/O completion time and the backlog that may be accumulating. | |
| | | disk_merged | read | The number of operations that could be merged into other, already queued operations, i.e. one physical disk access served two or more logical operations. Of course, the higher that number, the better. | |
| | | disk_merged | write | The number of operations that could be merged into other, already queued operations, i.e. one physical disk access served two or more logical operations. Of course, the higher that number, the better. | |
| | | disk_octets | read | The number of octets read from a disk or partition. | |
| | | disk_octets | write | The number of octets written to a disk or partition. | |
| | | disk_ops | read | The number of read operations issued to the disk. | |
| | | disk_ops | write | The number of write operations issued to the disk. | |
| | | disk_time | read | The average time an I/O operation took to complete. Note from collectd: since this is a little messy to calculate, take the actual values with a grain of salt. | |
| | | disk_time | write | The average time an I/O operation took to complete. Note from collectd: since this is a little messy to calculate, take the actual values with a grain of salt. | |
| | | pending_operations | | Shows queue size of pending I/O operations. | |
| | Ping | ping | | Network latency is measured as a round-trip time in milliseconds. An ICMP "echo request" is sent to a host and the time needed for its echo-reply to arrive is measured. | Latency |
| | | ping_droprate | | droprate = ((double) (pkg_sent - pkg_recv)) / ((double) pkg_sent); | https://github.com/collectd/collectd/blob/master/src/ping.c#L703 |
| | | ping_stddev | | if pkg_recv > 1: latency_stddev = sqrt (((((double) pkg_recv) * latency_squared) - (latency_total * latency_total)) / ((double) (pkg_recv * (pkg_recv - 1)))); where pkg_recv = # of echo-reply messages received, latency_squared = latency * latency (for a received echo-reply message), latency_total = the total latency for received echo-reply messages. | https://github.com/collectd/collectd/blob/master/src/ping.c#L698 |
| | load | load | shortterm | Load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 1 minute. Measured CPU and I/O utilization for 1 min using /proc/loadavg. | http://man7.org/linux/man-pages/man5/proc.5.html https://github.com/collectd/collectd/blob/master/src/load.c |
| | | load | midterm | Load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 5 minutes. Measured CPU and I/O utilization for 5 mins using /proc/loadavg. | |
| | | load | longterm | Load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D), averaged over 15 minutes. Measured CPU and I/O utilization for 15 mins using /proc/loadavg. | |
| | OVS events | gauge | link_status | Link status of the OvS interface: UP or DOWN. | |
| | OVS Stats | | collisions | Number of collisions. | per interface |
| | | | rx_bytes | Number of received bytes. | http://openvswitch.org/ovs-vswitchd.conf.db.5.pdf |
| | | | rx_crc_err | Number of CRC errors. | |
| | | | rx_dropped | Number of packets dropped by RX. | |
| | | | rx_errors | Total number of receive errors, greater than or equal to the sum of the RX errors above. | |
| | | | rx_frame_err | Number of frame alignment errors. | |
| | | | rx_over_err | Number of packets with RX overrun. | |
| | | | rx_packets | Number of received packets. | |
| | | | tx_bytes | Number of transmitted bytes. | |
| | | | tx_dropped | Number of packets dropped by TX. | |
| | | | tx_errors | Total number of transmit errors, greater than or equal to the sum of the TX errors above. | |
| | | | tx_packets | Number of transmitted packets. | |
| | Hugepages | bytes | used | Number of used hugepages in bytes. | total/pernode/both |
| | | bytes | free | Number of free hugepages in bytes. | |
| | | vmpage_number | used | Number of used hugepages in numbers. | |
| | | vmpage_number | free | Number of free hugepages in numbers. | |
| | | percent | used | Number of used hugepages in percent. | |
| | | percent | free | Number of free hugepages in percent. | |
| | processes | fork_rate | | The number of threads created since the last reboot. | The information comes mainly from /proc/PID/status, /proc/PID/psinfo and /proc/PID/usage. https://collectd.org/wiki/index.php/Plugin:Processes http://man7.org/linux/man-pages/man5/proc.5.html |
| | | ps_state | blocked | The number of processes in a blocked state. | |
| | | ps_state | paging | The number of processes in a paging state. | |
| | | ps_state | running | The number of processes in a running state. | |
| | | ps_state | sleeping | The number of processes in a sleeping state. | |
| | | ps_state | stopped | The number of processes in a stopped state. | |
| | | ps_state | zombies | The number of processes in a zombie state. | |
| Host only | Libvirt | disk_octets | read | Number of read bytes as unsigned long long. | |
| | | disk_octets | write | Number of written bytes as unsigned long long. | |
| | | disk_ops | read | Number of read requests. | |
| | | disk_ops | write | Number of write requests. | |
| | | if_dropped | in | Receive packets dropped as unsigned long long. | https://libvirt.org/html/libvirt-libvirt-domain.html |
| | | if_dropped | out | Transmit packets dropped as unsigned long long. | |
| | | if_errors | in | Receive errors as unsigned long long. | |
| | | if_errors | out | Transmission errors as unsigned long long. | |
| | | if_octets | in | Bytes received as unsigned long long. | |
| | | if_octets | out | Bytes transmitted as unsigned long long. | |
| | | if_packets | in | Packets received as unsigned long long. | |
| | | if_packets | out | Packets transmitted as unsigned long long. | |
| | | memory | actual_balloon | Current balloon value (in kB). | https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainMemoryStatStruct |
| | | memory | rss | Resident Set Size of the process running the domain. This value is in kB. | |
| | | memory | swap_in | The total amount of data read from swap space (in kB). | |
| | | memory | total | The memory in KBytes used by the domain. | |
| | | virt_cpu_total | | The CPU time used in nanoseconds. | This is in nanoseconds! |
| | | virt_vcpu | vcpu_nr | The CPU time used in nanoseconds per CPU. | This is in nanoseconds! |
| | | cpu_affinity | vcpu_NR-cpu_NR | Pinning of domain VCPUs to host physical CPUs (value stored is a boolean). | |
| | | domain_state | | Domain state and reason. | |
| | | file_system | | File system information (mountpoint, device name, filesystem type, number of aliases, disk aliases). | Dispatched as notification. Requires guest agent to be installed and configured. |
| | | job_stats | | Information about progress of a background/completed job on a domain. Check API documentation for more information (https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainGetJobStats). | |
| | | disk_error | DISK_NAME | Disk error code (metric isn't dispatched for a disk with no errors). | |
| | | perf | perf_cmt | Usage of L3 cache in bytes by applications running on the platform. | |
| | | perf | perf_mbmt | Total system bandwidth from one level of cache. | |
| | | perf | perf_mbml | Bandwidth of memory traffic for a memory controller. | |
| | | perf | perf_cpu_cycles | The count of CPU cycles (total/elapsed). | |
| | | perf | perf_instructions | The count of instructions by applications running on the platform. | |
| | | perf | perf_cache_references | The count of cache hits by applications running on the platform. | |
| | | perf | perf_cache_misses | The count of cache misses by applications running on the platform. | |
| | RDT | ipc | | Number of instructions per clock per core group. | per core group |
| | | memory_bandwidth | local | Local Memory Bandwidth utilization. | |
| | | memory_bandwidth | remote | Remote Memory Bandwidth utilization. | |
| | | bytes | llc | Last Level Cache occupancy. | |

| | dpdkstats, compatible with DPDK 16.04 (based on ixgbe, vhost support will be enabled in DPDK 16.11, patch support being upgraded to DPDK 16.07 in progress) | derive | rx_l3_l4_xsum_error | Number of receive IPv4, TCP, UDP or SCTP XSUM errors. | |
| | | errors | flow_director_filter_add_errors | Number of failed added filters. | |
| | | | flow_director_filter_remove_errors | Number of failed removed filters. | |
| | | | mac_local_errors | Number of faults in the local MAC. | |
| | | | mac_remote_errors | Number of faults in the remote MAC. | |
| | | if_rx_dropped | rx_fcoe_dropped | Number of Rx packets dropped due to lack of descriptors. | |
| | | | rx_mac_short_packet_dropped | Number of MAC short packet discard packets received. | |
| | | | rx_management_dropped | Number of management packets dropped. This register counts the total number of packets received that pass the management filters and then are dropped because the management receive FIFO is full. Management packets include any packet directed to the manageability console (such as RMCP and ARP packets). | |
| | | | rx_priorityX_dropped | Number of dropped packets received per UP. | where X is 0 to 7 |
| | | if_rx_errors | rx_crc_errors | Counts the number of receive packets with CRC errors. In order for a packet to be counted in this register, it must be 64 bytes or greater (from <Destination Address> through <CRC>, inclusively) in length. | |
| | | | rx_errors | Number of errors received. | |
| | | | rx_fcoe_crc_errors | FC CRC Count. Counts the number of packets with good Ethernet CRC and bad FC CRC. | |
| | | | rx_fcoe_mbuf_allocation_errors | Number of FCoE Rx packets dropped due to lack of descriptors. | |
| | | | rx_fcoe_no_direct_data_placement | | |
| | | | rx_fcoe_no_direct_data_placement_ext_buff | | |
| | | | rx_fragment_errors | Number of receive fragment errors (frames shorter than 64 bytes from <Destination Address> through <CRC>, inclusively) that have bad CRC (this is slightly different from the Receive Undersize Count register). | |
| | | | rx_illegal_byte_errors | Counts the number of receive packets with illegal byte errors (such as an illegal symbol in the packet). | |
| | | | rx_jabber_errors | Number of receive jabber errors. This register counts the number of received packets that are greater than maximum size and have bad CRC (this is slightly different from the Receive Oversize Count register). The packet length is counted from <Destination Address> through <CRC>, inclusively. | |
| | | | rx_length_errors | Number of packets with receive length errors. A length error occurs if an incoming packet length field in the MAC header doesn't match the packet length. | |
| | | | rx_mbuf_allocation_errors | Number of Rx packets dropped due to lack of descriptors. | |
| | | | rx_oversize_errors | Receive Oversize Error. This register counts the number of received frames that are longer than maximum size as defined by MAXFRS.MFS (from <Destination Address> through <CRC>, inclusively) and have valid CRC. | |
| | | | rx_priorityX_mbuf_allocation_errors | Number of received packets per UP dropped due to lack of descriptors. | where X is 0 to 7 |
| | | | rx_q0_errors | Number of errors received for the queue. | If more queues are allocated then you get the errors per queue. |
| | | | rx_undersize_errors | Receive Undersize Error. This register counts the number of received frames that are shorter than minimum size (64 bytes from <Destination Address> through <CRC>, inclusively), and had a valid CRC. | |
| | | if_rx_octets | rx_error_bytes | Counts the number of receive packets with error bytes (such as an error symbol in the packet). This register counts all packets received, regardless of L2 filtering and receive enablement. | bug - will move this to errors |
| | | | rx_fcoe_bytes | Number of received FCoE bytes. | |
| | | | rx_good_bytes | Good octets/bytes received count. This register includes bytes received in a packet from the <Destination Address> field through the <CRC> field, inclusively. | |
| | | | rx_q0_bytes | Number of bytes received for the queue. | per queue |
| | | | rx_total_bytes | Total received octets. This register includes bytes received in a packet from the <Destination Address> field through the <CRC> field, inclusively. | |
| | | if_rx_packets | rx_broadcast_packets | Number of good (non-erred) broadcast packets received. | |
| | | | rx_fcoe_packets | Number of FCoE packets posted to the host. In normal operation (no save bad frames) it equals the number of good packets. | |
| | | | rx_flow_control_xoff_packets | Number of XOFF packets received. This register counts any XOFF packet whether it is a legacy XOFF or a priority XOFF. Each XOFF packet is counted once even if it is designated to a few priorities. | |
| | | | rx_flow_control_xon_packets | Number of XON packets received. This register counts any XON packet whether it is a legacy XON or a priority XON. Each XON packet is counted once even if it is designated to a few priorities. | |
| | | | rx_good_packets | Number of good (non-erred) Rx packets (from the network). | |
| | | | rx_management_packets | Number of management packets received. This register counts the total number of packets received that pass the management filters. Management packets include RMCP and ARP packets. Packets with errors are not counted, except that packets dropped because the management receive FIFO is full are counted. | |
| | | | rx_multicast_packets | Number of good (non-erred) multicast packets received (excluding broadcast packets). This register does not count received flow control packets. | |
| | | | rx_priorityX_xoff_packets | Number of XOFF packets received per UP. | where X is 0 to 7 |
| | | | rx_priorityX_xon_packets | Number of XON packets received per UP. | where X is 0 to 7 |
| | | | rx_q0_packets | Number of packets received for the queue. | per queue |
| | | | rx_size_1024_to_max_packets | Number of packets received that are 1024-max bytes in length (from <Destination Address> through <CRC>, inclusively). This register does not include received flow control packets. The maximum is dependent on the current receiver configuration and the type of packet being received. If a packet is counted in receive oversized count, it is not counted in this register. Due to changes in the standard for maximum frame size for VLAN tagged frames in 802.3, packets can have a maximum length of 1522 bytes. | |
| | | | rx_size_128_to_255_packets | Number of packets received that are 128-255 bytes in length (from <Destination Address> through <CRC>, inclusively). | |
| | | | rx_size_256_to_511_packets | Number of packets received that are 256-511 bytes in length (from <Destination Address> through <CRC>, inclusively). | |
| | | | rx_size_512_to_1023_packets | Number of packets received that are 512-1023 bytes in length (from <Destination Address> through <CRC>, inclusively). | |
| | | | rx_size_64_packets | Number of good packets received that are 64 bytes in length (from <Destination Address> through <CRC>, inclusively). | |
| | | | rx_size_65_to_127_packets | Number of packets received that are 65-127 bytes in length (from <Destination Address> through <CRC>, inclusively). | |
| | | | rx_total_missed_packets | The total number of Rx missed packets, that is, packets that were correctly received by the NIC but had to be dropped by the NIC itself because it was out of descriptors and internal memory. | |
| | | | rx_total_packets | Number of all packets received. All packets received are counted in this register, regardless of their length and whether they are erred, but excluding flow control packets. | |
| | | | rx_xoff_packets | Number of XOFF packets received. Sticks to 0xFFFF. XOFF packets can use the global address or the station address. This register counts any XOFF packet whether it is a legacy XOFF or a priority XOFF. Each XOFF packet is counted once even if it is designated to a few priorities. If a priority FC packet contains both XOFF and XON, only this counter is incremented. | |
| | | | rx_xon_packets | Number of XON packets received. XON packets can use the global address or the station address. This register counts any XON packet whether it is a legacy XON or a priority XON. Each XON packet is counted once even if it is designated to a few priorities. If a priority FC packet contains both XOFF and XON, only the LXOFFRXCNT counter is incremented. | |
| | | if_tx_errors | tx_errors | Total number of TX error packets. | |
| | | if_tx_octets | tx_fcoe_bytes | Number of FCoE bytes transmitted. | |
| | | | tx_good_bytes | Counter of successfully transmitted octets. This register includes transmitted bytes in a packet from the <Destination Address> field through the <CRC> field, inclusively. | |
| | | | tx_q0_bytes | Number of bytes transmitted by the queue. | per queue |
| | | if_tx_packets | tx_broadcast_packets | Number of broadcast packets transmitted. This register counts all packets, including standard packets, secure packets, FC packets and manageability packets. | |
| | | | tx_fcoe_packets | Number of FCoE packets transmitted. | |
| | | | tx_flow_control_xoff_packets | Link XOFF Transmitted Count. | |
| | | | tx_flow_control_xon_packets | Link XON Transmitted Count. | |
| | | | tx_good_packets | Number of good packets transmitted. | |
| | | | tx_management_packets | Number of management packets transmitted. | |
| | | | tx_multicast_packets | Number of multicast packets transmitted. This register counts all packets, including standard packets, secure packets, FC packets and manageability packets. | |
| | | | tx_priorityX_xoff_packets | Number of XOFF packets transmitted per UP. | where X is 0 to 7 |
| | | | tx_priorityX_xon_packets | Number of XON packets transmitted per UP. | where X is 0 to 7 |
| | | | tx_q0_packets | Number of packets transmitted for the queue. A packet is considered as transmitted if it was forwarded to the MAC unit for transmission to the network and/or is accepted by the internal Tx to Rx switch enablement logic. Packets dropped due to anti-spoofing filtering or VLAN tag validation (as described in Section 7.10.3.9.2) are not counted. | per queue |
| | | | tx_size_1024_to_max_packets | Number of packets transmitted that are 1024 or more bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, and manageability packets. | |
| | | | tx_size_128_to_255_packets | Number of packets transmitted that are 128-255 bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, and manageability packets. | |
| | | | tx_size_256_to_511_packets | Number of packets transmitted that are 256-511 bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, and manageability packets. | |
| | | | tx_size_512_to_1023_packets | Number of packets transmitted that are 512-1023 bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, and manageability packets. | |
| | | | tx_size_64_packets | Number of packets transmitted that are 64 bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, FC packets, and manageability packets. | |
| | | | tx_size_65_to_127_packets | Number of packets transmitted that are 65-127 bytes in length (from <Destination Address> through <CRC>, inclusively). This register counts all packets, including standard packets, secure packets, and manageability packets. | |
| | | | tx_total_packets | Number of all packets transmitted. This register counts all packets, including standard packets, secure packets, FC packets, and manageability packets. | |
| | | | tx_xoff_packets | Number of XOFF packets transmitted. | |
| | | | tx_xon_packets | Number of XON packets transmitted. | |
| | | operations | flow_director_added_filters | This field counts the number of added filters to the flow director filters logic. | |
| | | | flow_director_matched_filters | This field counts the number of matched filters to the flow director filters logic. | |
| | | | flow_director_missed_filters | This field counts the number of missed filters to the flow director filters logic. | |
| | | | flow_director_removed_filters | This field counts the number of removed filters from the flow director filters logic. | |
| | pcie | correctable | non_fatal | Notification (Warning) in case of PCIe correctable error occurrence. Message contains short error description. | |
| | | uncorrectable | fatal | Notification (Failure) in case of PCIe uncorrectable fatal error occurrence. Message contains short error description. | |
| | | | non_fatal | Notification (Warning) in case of PCIe uncorrectable non-fatal error occurrence. Message contains short error description. | |
| | mcelog | errors | corrected_memory_errors | The total number of hardware errors that were corrected by the hardware (e.g. a single-bit data corruption that was correctable using ECC). These errors do not require immediate software actions, but are still reported for accounting and predictive failure analysis. | Memory (RAM) errors are among the most common errors in typical server systems. They also scale with the amount of memory: the more memory, the more errors. In addition, large clusters of computers with tens or hundreds (or sometimes thousands) of active machines increase the total error rate of the system. http://www.mcelog.org/memory.html |
| | | | uncorrected_memory_errors | The total number of uncorrected hardware errors detected by the hardware. Data corruption has occurred. These errors require software reaction. | |
| | | | corrected_memory_errors_in_%s | The total number of hardware errors that were corrected by the hardware in a certain period of time. | where %s is a timed period like 24 hours. http://www.mcelog.org/memory.html |
| | | | uncorrected_memory_errors_in_%s | The total number of uncorrected hardware errors detected by the hardware in a certain period of time. | where %s is a timed period like 24 hours. http://www.mcelog.org/memory.html |
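
The ping_droprate and ping_stddev formulas quoted in the Ping rows of the table can be checked with a few lines of Python. This is a minimal sketch, not collectd's ping.c: the function name is hypothetical, latency_squared is assumed to be the running sum of squared latencies over all received echo-replies, and returning nan when fewer than two replies arrive is an assumption made here for illustration.

```python
import math


def ping_stats(latencies_ms, pkg_sent):
    """Compute (droprate, latency_stddev) from one interval's echo-replies."""
    if pkg_sent <= 0:
        raise ValueError("pkg_sent must be positive")

    pkg_recv = len(latencies_ms)                     # echo-replies received
    latency_total = sum(latencies_ms)                # sum of latencies
    latency_squared = sum(l * l for l in latencies_ms)  # sum of squares

    # droprate = (pkg_sent - pkg_recv) / pkg_sent
    droprate = (pkg_sent - pkg_recv) / pkg_sent

    if pkg_recv > 1:
        # Sample standard deviation, as in the formula from the table.
        variance = ((pkg_recv * latency_squared) - (latency_total * latency_total)) \
            / (pkg_recv * (pkg_recv - 1))
        latency_stddev = math.sqrt(variance)
    else:
        latency_stddev = math.nan  # assumption: undefined with < 2 samples

    return droprate, latency_stddev


# Four requests sent, three replies received with 1, 2 and 3 ms latency.
droprate, stddev = ping_stats([1.0, 2.0, 3.0], pkg_sent=4)
print(droprate, stddev)
```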

 
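
Two of the derived values in the Dynamic Metrics table lend themselves to a short worked example: the memory "used" formula and the CPU percentages computed from jiffy counters. The sketch below is illustrative only. The function names and the dictionary layout are hypothetical, and the CPU calculation normalizes by the jiffies actually observed in the interval, which is one possible way to sidestep the variable number of jiffies per second mentioned in the CPU comment; it is not presented as collectd's own calculation.

```python
def mem_used(mem_total, mem_free, mem_buffered, mem_cached, mem_slab_total):
    # Same arithmetic as the formula quoted in the Memory rows.
    return mem_total - (mem_free + mem_buffered + mem_cached + mem_slab_total)


def cpu_percent(prev, curr):
    """Convert two snapshots of per-state jiffy counters into percentages."""
    deltas = {state: curr[state] - prev[state] for state in curr}
    total = sum(deltas.values())
    if total <= 0:
        # No time elapsed between snapshots: report zero for every state.
        return {state: 0.0 for state in deltas}
    # Dividing by the observed total keeps the shares summing to about 100%.
    return {state: 100.0 * delta / total for state, delta in deltas.items()}


used = mem_used(mem_total=16000, mem_free=4000, mem_buffered=500,
                mem_cached=3000, mem_slab_total=500)

# Hypothetical counter snapshots taken one interval apart.
before = {"user": 100, "system": 50, "idle": 850}
after = {"user": 150, "system": 60, "idle": 890}
shares = cpu_percent(before, after)
print(used, shares)
```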

Events

| Where collectd is running | Plugin | Type | Type Instance | Severity | Description | Comment |
| host/guest | ovs_events | gauge | link_status | Warning on Link Status Down. Info on Link Status Up. | Link status of the OvS interface: UP or DOWN. | |
| host | mcelog | errors | | Failure on failure to connect to the mcelog socket or if the connection is lost. OK on connection to the mcelog socket. Warning for Corrected Memory Errors. Failure for Uncorrected Memory Errors. | Reports Corrected and Uncorrected DIMM Failures. | |
| host/guest | dpdk_events | | link_status | | | |
| | | | keep_alive | | | |