...

Where collectd is running | Plugin | Type | Type Instance | Severity | Description | Comment
host/guest | ovs_events | gauge | link_status | WARNING on link status down; OKAY on link status up | Link status of the OvS interface: UP or DOWN. Severity will be configurable by the end user. |
host/guest | dpdk_events | | link_status | WARNING on link status down; OKAY on link status up | Link status of the OvS interface: UP or DOWN. Severity will be configurable by the end user. | Depending on plugin configuration, can be dispatched as a metric or event.
host/guest | dpdk_events | | keep_alive | OKAY if core status is ALIVE, UNUSED, DOZING or SLEEP; WARNING if core status is MISSING; FAILURE if core status is DEAD or GONE | Reflects the state of DPDK packet processing cores | Protects against packet processing core failures for DPDK, so there are no silent packet drops. Depending on plugin configuration, can be dispatched as a metric or event.
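The keep_alive severity rules above can be sketched as a lookup table. This is only an illustration of the mapping in the table, not the plugin's implementation; the names `KEEP_ALIVE_SEVERITY` and `keep_alive_severity` are hypothetical.

```python
# Illustrative mapping of DPDK core status to notification severity,
# taken from the keep_alive row. All names here are hypothetical.
KEEP_ALIVE_SEVERITY = {
    "ALIVE": "OKAY",
    "UNUSED": "OKAY",
    "DOZING": "OKAY",
    "SLEEP": "OKAY",
    "MISSING": "WARNING",
    "DEAD": "FAILURE",
    "GONE": "FAILURE",
}


def keep_alive_severity(core_status: str) -> str:
    """Return the severity for a core status, ignoring case and padding."""
    key = core_status.strip().upper()
    if key not in KEEP_ALIVE_SEVERITY:
        raise ValueError(f"unknown core status: {core_status!r}")
    return KEEP_ALIVE_SEVERITY[key]
```

A fault manager consuming these notifications could use such a table to decide which core states need operator attention.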
host | pcie | pcie_error | correctable | Notification (WARNING) when a PCIe correctable error occurs. The message contains a short error description.

Correctable Errors include:
Receiver Error Status
Bad TLP Status
Bad DLLP Status
REPLAY_NUM Rollover
Replay Timer Timeout
Advisory Non-Fatal
Corrected Internal
Header Log Overflow

 

Uncorrectable Errors include:
Data Link Protocol
Surprise Down
Poisoned TLP
Flow Control Protocol
Completion Timeout
Completer Abort
Unexpected Completion
Receiver Overflow
Malformed TLP
ECRC Error Status
Unsupported Request
ACS Violation
Internal
MC blocked TLP
Atomic egress blocked
TLP prefix blocked

 
host | pcie | pcie_error | fatal | Notification (FAILURE) when a PCIe uncorrectable fatal error occurs. The message contains a short error description.
host | pcie | pcie_error | non_fatal | Notification (WARNING) when a PCIe uncorrectable non-fatal error occurs. The message contains a short error description.
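As a rough sketch, the three pcie_error type instances and their severities can be derived from the error name. The error names are copied from the lists above (abbreviated here). How an uncorrectable error is classified as fatal is not described on this page, so the example takes it as a caller-supplied flag; `pcie_notification` and the set names are hypothetical.

```python
# Subsets of the error names listed above; extend as needed.
CORRECTABLE = {
    "Receiver Error Status", "Bad TLP Status", "Bad DLLP Status",
    "REPLAY_NUM Rollover", "Replay Timer Timeout", "Advisory Non-Fatal",
    "Corrected Internal", "Header Log Overflow",
}
UNCORRECTABLE = {
    "Data Link Protocol", "Surprise Down", "Poisoned TLP",
    "Flow Control Protocol", "Completion Timeout", "Completer Abort",
    "Unexpected Completion", "Receiver Overflow", "Malformed TLP",
    "ECRC Error Status", "Unsupported Request", "ACS Violation",
}


def pcie_notification(error_name: str, fatal: bool = False):
    """Return (type_instance, severity) for a PCIe error name.

    `fatal` is an assumption of this sketch: the caller says whether an
    uncorrectable error is fatal. It is ignored for correctable errors.
    """
    if error_name in CORRECTABLE:
        return ("correctable", "WARNING")
    if error_name in UNCORRECTABLE:
        return ("fatal", "FAILURE") if fatal else ("non_fatal", "WARNING")
    raise ValueError(f"unknown PCIe error: {error_name!r}")
```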
host | mcelog (RAS memory) | errors | | WARNING for corrected memory errors; FAILURE for uncorrected memory errors; FAILURE if connecting to the mcelog socket fails or the connection is lost; OKAY on connection to the mcelog socket | Reports corrected and uncorrected DIMM failures |
host | IPMI | | | OKAY: upper non-critical, lower non-critical; WARNING: upper critical, lower critical; FAILURE: upper non-recoverable, lower non-recoverable. Discrete sensor status changes are also reported via OKAY, WARNING and FAILURE notifications. | Each IPMI sensor may have six different thresholds: upper non-recoverable, upper critical, upper non-critical, lower non-critical, lower critical, lower non-recoverable. You may have events on a threshold sensor by specifying values (called thresholds) where you want the sensor to report an event. Then you can enable the events for the specific thresholds. Not all sensors support all thresholds; some cannot have their events enabled and others cannot have them disabled. The capabilities of a sensor may all be queried by the user to determine what it can do. When the value of the sensor goes outside the threshold, an event may be generated. An event may also be generated when the value goes back inside the threshold.

Examples of discrete sensors can be found under the "IPMI Sensors for S2600WT2R" tab
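The threshold-to-severity mapping in the IPMI row can be illustrated with a small helper that finds the most severe threshold a reading has crossed. This is a sketch under stated assumptions: the comparison is inclusive (`>=` for upper, `<=` for lower thresholds), unsupported thresholds are simply absent from the input, and the names `THRESHOLD_SEVERITY` and `crossed_threshold` are hypothetical.

```python
from typing import Mapping, Optional, Tuple

# Severity per threshold, as listed in the IPMI row.
THRESHOLD_SEVERITY = {
    "upper non-recoverable": "FAILURE",
    "lower non-recoverable": "FAILURE",
    "upper critical": "WARNING",
    "lower critical": "WARNING",
    "upper non-critical": "OKAY",
    "lower non-critical": "OKAY",
}

# The dict above is written from most to least severe, and dicts keep
# insertion order, so iterating it checks the worst thresholds first.


def crossed_threshold(
    value: float, thresholds: Mapping[str, float]
) -> Optional[Tuple[str, str]]:
    """Return (threshold name, severity) for the most severe crossing, or None."""
    for name, severity in THRESHOLD_SEVERITY.items():
        limit = thresholds.get(name)
        if limit is None:
            continue  # this sensor does not support the threshold
        if name.startswith("upper"):
            crossed = value >= limit
        else:
            crossed = value <= limit
        if crossed:
            return name, severity
    return None
```

For example, a temperature sensor that only defines upper thresholds would pass just those three keys, and a reading inside all limits yields `None`.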
host | mcelog RAS System, CPU, QPI, IO (specific to a platform, so these will change depending on what the platform supports) | | | WARNING: correctable errors; FAILURE: uncorrectable errors |
Servers based on Intel® Architecture are generally designed for use in mission-critical environments. Reliability, Availability and Serviceability (RAS) features are integrated into the servers to address the error handling, memory mirroring and memory sparing required by these environments.

The goal of this feature is to expose the RAS features provided by Broadwell or newer platforms to higher-level fault management applications.

The features to be exposed fall under the following categories:
Reliability Features:
-System attributes that ensure data integrity.
-Capability to prevent, detect, correct and contain faults over a given time interval.
Availability Features:
-System attributes that help the system stay operational in the presence of faults.
-Capability to map out failed units and to operate in a degraded mode.
Serviceability Features:
-System attributes that help with system service and repair.
-Capability to identify failed units and facilitate repair.


Generic Error Handling
The silicon supports corrected, uncorrected (recoverable, unrecoverable), fatal and catastrophic error types.
Corrected Errors
Errors that are corrected by either hardware or software. Corrected error information is used by the OS in predictive failure analysis.
MCA bank corrected errors, except selected memory corrected errors, are handled directly by the OS. The Haswell-EP processor triggers a CMCI for corrected errors; on CMCI, the OS can read the MCA banks and collect the error status. All other platform-related corrected errors can either be ignored or logged to the BMC SEL, depending on platform policy.
Memory Corrected Errors
Memory corrected errors, such as mirror failover and memory read errors, can be configured to trigger an SMI using BIOS setup options. On memory mirror failover, the BIOS logs the error for the OS in the UEFI error record format. On memory read errors, the BIOS performs the following memory RAS operations, in order, to correct the error:
Rank Sparing
SDDC/Device tagging
Uncorrected Non-Fatal Errors
Errors that are not corrected by hardware. In general, these errors trigger a machine check exception, which in turn triggers an SMI. The BIOS SMI handler logs the error information, clears the error status and passes the error log to the OS. The OS can recover from the error or, where recovery is not an option, trigger a system reset.
Uncorrected Fatal Errors
Errors that are neither corrected by hardware nor recovered by software. The system is not in a reliable state and needs a reset to return to normal operation. In most fatal error conditions, the BIOS cannot log errors before the system reset happens. All error status registers are sticky across the reset, so the BIOS collects this information on the next boot, creates an error record and passes it on to the OS.
Error Logging

Example errors are provided in the comments tab; they are for the Purley platform.
/* See IA32 SDM Vol3B Chapter 16*/
Integrated Memory Controller Machine Check Errors
"Address parity error",
"HA write data parity error",
"HA write byte enable parity error",
"Corrected patrol scrub error",
"Uncorrected patrol scrub error",
"Corrected spare error",
"Uncorrected spare error",
"Any HA read error",
"WDB read parity error",
"DDR4 command address parity error",
"Uncorrected address parity error"
"Unrecognized request type",
"Read response to an invalid scoreboard entry",
"Unexpected read response",
"DDR4 completion to an invalid scoreboard entry",
"Completion to an invalid scoreboard entry",
"Completion FIFO overflow",
"Correctable parity error",
"Uncorrectable error",
"Interrupt received while outstanding interrupt was not ACKed",
"ERID FIFO overflow",
"Error on Write credits",
"Error on Read credits",
"Scheduler error",
"Error event",
"MscodDataRdErr",
"Reserved",
"MscodPtlWrErr",
"MscodFullWrErr",
"MscodBgfErr",
"MscodTimeout",
"MscodParErr",
"MscodBucket1Err"

Interconnect (QPI) Machine Check Errors
"UC Phy Initialization Failure",
"UC Phy detected drift buffer alarm",
"UC Phy detected latency buffer rollover",
"UC LL Rx detected CRC error: unsuccessful LLR: entered abort state",
"UC LL Rx unsupported or undefined packet",
"UC LL or Phy control error",
"UC LL Rx parameter exchange exception",
"UC LL detected control error from the link-mesh interface",
"COR Phy initialization abort",
"COR Phy reset",
"COR Phy lane failure, recovery in x8 width",
"COR Phy L0c error corrected without Phy reset",
"COR Phy L0c error triggering Phy Reset",
"COR Phy L0p exit error corrected with Phy reset",
"COR LL Rx detected CRC error - successful LLR without Phy Reinit",
"COR LL Rx detected CRC error - successful LLR with Phy Reinit"
"Phy Control Error",
"Unexpected Retry.Ack flit",
"Unexpected Retry.Req flit",
"RF parity error",
"Routeback Table error",
"unexpected Tx Protocol flit (EOP, Header or Data)",
"Rx Header-or-Credit BGF credit overflow/underflow",
"Link Layer Reset still in progress when Phy enters L0",
"Link Layer reset initiated while protocol traffic not idle",
"Link Layer Tx Parity Error"
Internal Machine Check Errors
"No Error",
"MCA_DMI_TRAINING_TIMEOUT",
"MCA_DMI_CPU_RESET_ACK_TIMEOUT",
"MCA_MORE_THAN_ONE_LT_AGENT",
"MCA_BIOS_RST_CPL_INVALID_SEQ",
"MCA_BIOS_INVALID_PKG_STATE_CONFIG",
"MCA_MESSAGE_CHANNEL_TIMEOUT",
"MCA_MSGCH_PMREQ_CMP_TIMEOUT",
"MCA_PKGC_DIRECT_WAKE_RING_TIMEOUT",
"MCA_PKGC_INVALID_RSP_PCH",
"MCA_PKGC_WATCHDOG_HANG_CBZ_DOWN",
"MCA_PKGC_WATCHDOG_HANG_CBZ_UP",
"MCA_PKGC_WATCHDOG_HANG_C3_UP_SF",
"MCA_SVID_VCCIN_VR_ICC_MAX_FAILURE",
"MCA_SVID_COMMAND_TIMEOUT",
"MCA_SVID_VCCIN_VR_VOUT_FAILURE",
"MCA_SVID_CPU_VR_CAPABILITY_ERROR",
"MCA_SVID_CRITICAL_VR_FAILED",
"MCA_SVID_SA_ITD_ERROR",
"MCA_SVID_READ_REG_FAILED",
"MCA_SVID_WRITE_REG_FAILED",
"MCA_SVID_PKGC_INIT_FAILED",
"MCA_SVID_PKGC_CONFIG_FAILED",
"MCA_SVID_PKGC_REQUEST_FAILED",
"MCA_SVID_IMON_REQUEST_FAILED",
"MCA_SVID_ALERT_REQUEST_FAILED",
"MCA_SVID_MCP_VR_ABSENT_OR_RAMP_ERROR",
"MCA_SVID_UNEXPECTED_MCP_VR_DETECTED",
"MCA_FIVR_CATAS_OVERVOL_FAULT",
"MCA_FIVR_CATAS_OVERCUR_FAULT",
"MCA_WATCHDOG_TIMEOUT_PKGC_SLAVE",
"MCA_WATCHDOG_TIMEOUT_PKGC_MASTER",
"MCA_WATCHDOG_TIMEOUT_PKGS_MASTER",
"MCA_PKGS_CPD_UNCPD_TIMEOUT",
"MCA_PKGS_INVALID_REQ_PCH",
"MCA_PKGS_INVALID_REQ_INTERNAL",
"MCA_PKGS_INVALID_RSP_INTERNAL",
"MCA_PKGS_SMBUS_VPP_PAUSE_TIMEOUT",
"MCA_RECOVERABLE_DIE_THERMAL_TOO_HOT"
host | virt | domain_state | | see the severity list below | Domain state and reason in a human-readable format. |

OKAY:

  • VIR_DOMAIN_NOSTATE
  • VIR_DOMAIN_RUNNING
  • VIR_DOMAIN_SHUTDOWN
  • VIR_DOMAIN_SHUTOFF

WARNING:

  • VIR_DOMAIN_BLOCKED
  • VIR_DOMAIN_PAUSED
  • VIR_DOMAIN_PMSUSPENDED

FAILURE:

  • VIR_DOMAIN_CRASHED
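
The domain_state severity grouping above can be expressed as a flat lookup. The sketch below uses the libvirt state names as plain strings so that it does not depend on the libvirt bindings; `STATES_BY_SEVERITY` and `domain_state_severity` are hypothetical names, not part of the virt plugin.

```python
# Domain states grouped by notification severity, as listed above.
STATES_BY_SEVERITY = {
    "OKAY": (
        "VIR_DOMAIN_NOSTATE",
        "VIR_DOMAIN_RUNNING",
        "VIR_DOMAIN_SHUTDOWN",
        "VIR_DOMAIN_SHUTOFF",
    ),
    "WARNING": (
        "VIR_DOMAIN_BLOCKED",
        "VIR_DOMAIN_PAUSED",
        "VIR_DOMAIN_PMSUSPENDED",
    ),
    "FAILURE": ("VIR_DOMAIN_CRASHED",),
}

# Invert the grouping into a state -> severity table for direct lookup.
SEVERITY_BY_STATE = {
    state: severity
    for severity, states in STATES_BY_SEVERITY.items()
    for state in states
}


def domain_state_severity(state: str) -> str:
    """Return the notification severity for a libvirt domain state name."""
    if state not in SEVERITY_BY_STATE:
        raise ValueError(f"unknown domain state: {state!r}")
    return SEVERITY_BY_STATE[state]
```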
host | virt | file_system | | OKAY | File system information (mountpoint, device name, filesystem type, number of aliases, disk aliases) | Information stored in metadata. Requires the Guest Agent to be installed and configured in the VM.

...