Anuket Project

Test Environment details:

  • Bare Metal,  Ubuntu 16.04.2 LTS

Repo/branch used:

Tests precondition:

  • Mcelog installed.
  • mce-inject tool installed.
  • Collectd installed.
  • Exec/python collectd plugin configured.

RAS Other

Collectd configuration (default):

LoadPlugin mcelog

#<Plugin mcelog>
# McelogClientSocket "/var/run/mcelog-client"
# McelogClientSocketEnabled true
# <McelogLogfile "/var/log/mcelog">
#   <Match>
#     Name "DISCLAIMER"
#     Regex "(Hardware event.*)"
#     Excluderegex "kernel"
#     IsMandatory true
#   </Match>
#   <Match>
#     Name "MCE details"
#     Regex "(.*)"
#     SubmatchIdx 0
#     Excluderegex "kernel|Hardware event|TIME|CPUID"
#     IsMandatory false
#   </Match>
#   <Match>
#     Name "ORIGIN"
#     Regex "MCA: (.*)[ _][Ee][Rr]{2}"
#     SubmatchIdx 1
#     Excluderegex "kernel|Hardware event|TIME|CPUID|No Error"
#     IsMandatory false
#   </Match>
#   <Match>
#     Name "TIME"
#     Regex "TIME ([0-9]*)"
#     Excluderegex "kernel"
#     IsMandatory false
#   </Match>
#   <Match>
#     Name "CPUID"
#     Regex "CPUID (Vendor.*)"
#     Excluderegex "kernel"
#     IsMandatory true
#   </Match>
# </McelogLogfile>
# McelogLogfileEnabled true
#</Plugin> 
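The Match rules above can be tried offline against the log lines quoted in this report. The following is a minimal sketch, assuming Python's `re` module behaves closely enough to collectd's own regex engine for these patterns, and assuming a rule without `SubmatchIdx` reports capture group 1; the `apply_rules` helper and the `RULES` table are illustrative names, not part of collectd.

```python
import re

# Rules copied from the default <McelogLogfile> block: (Regex, Excluderegex, submatch index).
# Assumption: a missing SubmatchIdx is treated as 1 here.
RULES = {
    "DISCLAIMER": (r"(Hardware event.*)", r"kernel", 1),
    "ORIGIN": (r"MCA: (.*)[ _][Ee][Rr]{2}",
               r"kernel|Hardware event|TIME|CPUID|No Error", 1),
    "TIME": (r"TIME ([0-9]*)", r"kernel", 1),
    "CPUID": (r"CPUID (Vendor.*)", r"kernel", 1),
}


def apply_rules(lines, rules=RULES):
    """Return {rule name: first captured value}, skipping lines hit by the exclude regex."""
    found = {}
    for line in lines:
        for name, (regex, exclude, idx) in rules.items():
            if name in found or re.search(exclude, line):
                continue
            match = re.search(regex, line)
            if match:
                found[name] = match.group(idx)
    return found


# Log lines taken from the /var/log/mcelog excerpts in this report.
SAMPLE = [
    "Hardware event. This is not a software error.",
    "TIME 1492529725 Tue Apr 18 16:35:25 2017",
    "MCA: BUS error: 0 0 Level-3 Generic Generic IO Request-did-not-timeout",
    "CPUID Vendor Intel Family 6 Model 79",
]

fields = apply_rules(SAMPLE)
for name, value in fields.items():
    print(f"{name}: {value}")
```

With these inputs the ORIGIN rule captures the text before " error", which is the value the notifications in this report show as PluginInstance.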

Table#1: RAS IO test cases

Columns: #, Test Summary, Steps, Expected, Observed, Status, Comments
Test#1: RAS plugin notifications upon collectd start with "McelogLogfileEnabled false"
  1. Collectd initial configuration.
  2. Set "McelogLogfileEnabled false". Start collectd.
  3. Verify notifications dispatched by PCIe plugin.
  4. Inject IO error: echo "CPU 0 BANK 1 STATUS 0x8800000000000E0B" | ./mce-inject

Expected:
  2. Collectd started.
  3. Notification that mcelog is connected to server dispatched.
  4. Notification is not dispatched.
Status: Pass
Test#2: RAS plugin notifications upon collectd start with "McelogLogfileEnabled true"
  1. Collectd initial configuration.
  2. Verify notifications dispatched by PCIe plugin.
Expected:
  1. Collectd started.
  2. Notification that mcelog is connected to server dispatched.
  3. Other old notifications read from mcelog are dispatched.
Status: Fail (Internal JIRA Filed)


Test#3: RAS plugin dispatches notifications after every collectd restart
  1. Collectd initial configuration. Start collectd.
  2. Inject IO error.
  3. Restart collectd.
  4. Inject IO error (corrected):
    ./mce-inject io_err
    # cat io_err
    CPU 0 BANK 1 STATUS 0x8800000000000E0B
Expected:
  1. Collectd started.
  2. Notification about IO error is dispatched.
  3. Collectd started.
  4. Notification about IO error is dispatched.
Status: Pass
Test#4: RAS plugin upon mcelog LoadPlugin commented
  1. Comment out mcelog part. Restart collectd.
    #LoadPlugin mcelog
    #<Plugin mcelog>
    # ...
    #</Plugin>
  2. Inject IO error.
Expected:
  2. No notification dispatched.
Observed:
  2. No notification dispatched.
Status: Pass
Test#5: RAS plugin upon mcelog Plugin commented (default)
  1. Comment out mcelog part. Restart collectd.
    LoadPlugin mcelog
    #<Plugin mcelog>
    # ...
    #</Plugin>
  2. Inject IO error.
Expected:
  2. Notification is dispatched with correct values for all fields.
Observed:
Severity:WARNING
Time:0.000
Host:silpixa00398942
Plugin:mcelog
PluginInstance:BUS
Type:gauge
TypeInstance:Corrected error
DISCLAIMER:Hardware event. This is not a software error.
MCEdetails: MCE 0
MCEdetails: CPU 0 BANK 1
MCEdetails: MISC 0
MCEdetails: MCG status:
MCEdetails: MCi status:
MCEdetails: Corrected error
MCEdetails: MCi_MISC register valid
MCEdetails: MCA: BUS error: 0 0 Level-3 Generic Generic IO Request-did-not-timeout
MCEdetails: Running trigger `bus-error-trigger'
MCEdetails: IO MCA reported by root port 0:00:00.0
MCEdetails: Running trigger `iomca-error-trigger'
MCEdetails: STATUS 8800000000000e0b MCGSTATUS 0
MCEdetails: MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID:CPUID Vendor Intel Family 6 Model 79
GotMachine Check Exception

Status: Fail (Internal JIRA Filed)

Comments: SA: Time is incorrect in notification but valid in "/var/log/mcelog":

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 1
MISC 0
TIME 1492529725 Tue Apr 18 16:35:25 2017
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: BUS error: 0 0 Level-3 Generic Generic IO Request-did-not-timeout
Running trigger `bus-error-trigger'
IO MCA reported by root port 0:00:00.0
Running trigger `iomca-error-trigger'
STATUS 8800000000000e0b MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79

 
Test#6: RAS plugin upon mcelog Plugin "McelogLogfile ..." part commented
  1. Comment out mcelog part. Restart collectd.
    LoadPlugin mcelog
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     #<McelogLogfile "/var/log/mcelog">
     # ...
     #</McelogLogfile>
    McelogLogfileEnabled true
    </Plugin>
  2. Inject IO error.
Expected:
  2. Notification is dispatched with correct values for all fields.
Observed:
  Same as above.
Status: Fail (Internal JIRA Filed)
Comments: SA: Time is incorrect in notification but valid in "/var/log/mcelog".
Test#7: RAS plugin upon mcelog Plugin Match part commented
  1. Comment out mcelog part. Restart collectd.
    LoadPlugin mcelog
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     <McelogLogfile "/var/log/mcelog">
     # ...
    </McelogLogfile>
    McelogLogfileEnabled true
    </Plugin>
  2. Inject IO error.
Expected:
  2. Notification is dispatched with correct values for all fields.
Observed:
  Same as above.
Status: Fail (Internal JIRA Filed)
Comments: SA: Time is incorrect in notification but valid in "/var/log/mcelog".
Test#8: RAS plugin upon mcelog Plugin all fields uncommented (same as default configuration)

  1. Uncomment default mcelog part. Restart collectd.
    LoadPlugin mcelog
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     <McelogLogfile "/var/log/mcelog">
      ...
    </McelogLogfile>
    McelogLogfileEnabled true
    </Plugin>
  2. Inject IO error.
Expected:
  2. Notification is dispatched with correct values for all fields.
Observed:
Severity:WARNING
Time:1492529930.000
Host:silpixa00398942
Plugin:mcelog
PluginInstance:BUS
Type:gauge
TypeInstance:Corrected error
DISCLAIMER:Hardware event. This is not a software error.
MCEdetails: MCE 0
MCEdetails: CPU 0 BANK 1
MCEdetails: MISC 0
MCEdetails: MCG status:
MCEdetails: MCi status:
MCEdetails: Corrected error
MCEdetails: MCi_MISC register valid
MCEdetails: MCA: BUS error: 0 0 Level-3 Generic Generic IO Request-did-not-timeout
MCEdetails: Running trigger `bus-error-trigger'
MCEdetails: IO MCA reported by root port 0:00:00.0
MCEdetails: Running trigger `iomca-error-trigger'
MCEdetails: STATUS 8800000000000e0b MCGSTATUS 0
MCEdetails: MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID:Vendor Intel Family 6 Model 79
GotMachine Check Exception. 
Status: Pass
Test#9: RAS plugin upon mcelog Plugin commented/removed Match part with "IsMandatory false"
  1. Comment out mcelog part. Restart collectd.
    LoadPlugin mcelog
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     <McelogLogfile "/var/log/mcelog">
       <Match>
         Name "DISCLAIMER"
         Regex "(Hardware event.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
     # ...
       <Match>
         Name "CPUID"
         Regex "CPUID (Vendor.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
    </McelogLogfile>
    McelogLogfileEnabled true
    </Plugin>
  2. Inject IO error.
Expected:
  2. Notification is dispatched with correct values for all fields.

Observed:

Notification:

Severity:FAILURE
Time:1492530303.353
Host:silpixa00398942
Plugin:mcelog
PluginInstance:other
Type:gauge
TypeInstance:Uncorrected error
DISCLAIMER:Hardware event. This is not a software error.
CPUID:Vendor Intel Family 6 Model 79
GotMachine Check Exception.

mcelog:

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 1
MISC 0
TIME 1492606180 Wed Apr 19 13:49:40 2017
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCA: BUS error: 0 0 Level-3 Generic Generic IO Request-did-not-timeout
Running trigger `bus-error-trigger'
IO MCA reported by root port 0:00:00.0
Running trigger `iomca-error-trigger'
STATUS 8800000000000e0b MCGSTATUS 0
MCGCAP 7000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79

Status: Fail (Internal JIRA Filed)

Comments:

 

SA: Looks like fields in notifications are filtered: "MCEdetails" part is missing.

Error type is different: Corrected vs Uncorrected.

"PluginInstance" is changed to "other".

Time is different: mcelog: "TIME 1492530298 Tue Apr 18 16:44:58 2017"; notification: "1492530303.353 Tue Apr 18 16:45:03 IST 2017"; (attempt#2: 16:50:49 vs Tue Apr 18 16:50:53 IST 2017)
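The time comparison in the comments above can be made repeatable with a small helper. This is a minimal sketch under the assumption that the mcelog record contains a line of the form `TIME <epoch> ...` as in the excerpts in this report; `mcelog_epoch` and `time_skew` are hypothetical helper names, and the two input values are the ones quoted in the comment above.

```python
import re


def mcelog_epoch(record_lines):
    """Return the epoch seconds from the first 'TIME <epoch> ...' line, or None."""
    for line in record_lines:
        match = re.match(r"TIME ([0-9]+)", line)
        if match:
            return int(match.group(1))
    return None


def time_skew(record_lines, notification_time):
    """Seconds by which the notification time is later than the mcelog TIME."""
    epoch = mcelog_epoch(record_lines)
    if epoch is None:
        raise ValueError("no TIME line in mcelog record")
    return notification_time - epoch


# Values quoted in the comment above.
record = ["TIME 1492530298 Tue Apr 18 16:44:58 2017"]
skew = time_skew(record, 1492530303.353)
print(f"notification is {skew:.3f} s later than the mcelog TIME")
```

A skew of a few seconds, as here, points at the notification carrying a dispatch time rather than the event time; a notification time of 0.000, as seen with the commented-out Plugin block, shows up as a very large negative skew.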

Test#10: RAS plugin correctly reads severity of injected IO errors
  1. Collectd initial configuration. Start collectd.
  2. Inject corrected IO error.
    # ./mce-inject io_err
    # cat io_err
    CPU 0 BANK 1 STATUS 0x8800000000000E0B 
  3. Inject uncorrected non fatal IO error.
    # ./mce-inject io_uncor_err
    # cat io_uncor_err
    ?

Expected:
  2. Notification is dispatched with severity WARNING for corrected error.
  3. Notification is dispatched with severity FAILURE for uncorrected error.
Observed:
  2. Notification is dispatched with severity WARNING for corrected error.
  3. Notification is dispatched with severity FAILURE for uncorrected error???
Status: 2-Pass
Comments: SA: How to inject uncorrected non fatal/fatal?
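One possible starting point for the open question above, not verified on this setup: in the architectural MCi_STATUS layout, bit 61 (UC) marks an uncorrected error and bit 57 (PCC) marks processor context corrupt, so candidate values can be derived from the corrected STATUS already used in this report. The sketch below only builds the text of an injection record; the helper name `injection_line` is hypothetical, and how the kernel and mcelog classify the resulting event is an assumption that would need to be checked on a disposable test machine, since an uncorrected machine check may be treated as fatal.

```python
UC_BIT = 1 << 61   # MCi_STATUS.UC: error was not corrected
PCC_BIT = 1 << 57  # MCi_STATUS.PCC: processor context corrupt


def injection_line(cpu, bank, status):
    """Format one record in the same layout as the io_err file in this report."""
    return f"CPU {cpu} BANK {bank} STATUS 0x{status:016X}"


CORRECTED_IO = 0x8800000000000E0B           # value used throughout this report
uncorrected = CORRECTED_IO | UC_BIT          # hypothetical: UC set, PCC clear
fatal = CORRECTED_IO | UC_BIT | PCC_BIT      # hypothetical: UC and PCC set

print(injection_line(0, 1, CORRECTED_IO))
print(injection_line(0, 1, uncorrected))
print(injection_line(0, 1, fatal))
```

The generated line could be written to a file such as io_uncor_err and passed to mce-inject in the same way as io_err.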

Test#11: RAS plugin upon memory and IO error injection
  1. Collectd initial configuration. Start collectd.
  2. Inject corrected IO error.
    # ./mce-inject io_err
    # cat io_err
    CPU 0 BANK 1 STATUS 0x8800000000000E0B
  3. Inject corrected memory error.
Expected:
  2. Notification is dispatched about IO error once.
  3. Notification is dispatched about memory corrected error once.
Observed:
  2. Notification is dispatched about IO error once.
  3. Notification is dispatched about memory corrected error every time interval.
Status: Fail (Internal JIRA Filed)

 
Test#12: RAS plugin events received from different mcelog location
  1. Change mcelog file location in mcelog.conf and collectd.conf. Restart mcelog and collectd services.
  2. Inject IO error.
Expected:
  1. Mcelog, collectd are running. Collectd plugins are loaded.
  2. Notification is dispatched about IO error.
Status: Pass
Test#13: RAS plugin events received from mcelog-client socket when "McelogClientSocketEnabled false/true" is changed
  1. Change collectd.conf. Restart collectd.
    LoadPlugin mcelog
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     <McelogLogfile "/var/log/mcelog">
       <Match>
         Name "Host:silpixa00398942"
         Regex "(Host.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
       <Match>
         Name "MCE details"
         Regex "(.*)"
         SubmatchIdx 0
         Excluderegex "kernel|Hardware event|TIME|CPUID"
         IsMandatory false
       </Match>
       <Match>
         Name "Gotmemory"
         Regex "(Gotmemory.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
     </McelogLogfile>
     McelogLogfileEnabled false
    </Plugin>
  2. Inject corrected memory error (memory errors are sent over the socket).
  3. Change "McelogClientSocketEnabled false". Restart collectd.
  4. Inject corrected memory error (memory errors are sent over the socket).
Expected:
  1. Collectd started.
  2. Notification about an error is dispatched with parsing.
  3. Collectd started.
  4. Notification about an error is dispatched without parsing.
Observed:
  2. Notification about an IO error is dispatched.
  4. Notification about an error is not dispatched.
Status: Fail (Internal JIRA Filed)

 

 
Test#14: RAS plugin events received from mcelog file upon "McelogClientSocketEnabled false" and "McelogLogfileEnabled true"
  1. Change collectd.conf. Restart collectd.
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled false
     <McelogLogfile "/var/log/mcelog">
       <Match>
         Name "DISCLAIMER"
         Regex "(Hardware event.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
       <Match>
         Name "MCE details"
         Regex "(.*)"
         SubmatchIdx 0
         Excluderegex "kernel|Hardware event|TIME|CPUID"
         IsMandatory false
       </Match>
       <Match>
         Name "CPUID"
         Regex "CPUID (Vendor.*)"
        Excluderegex "kernel"
         IsMandatory true
       </Match>
     </McelogLogfile>
     McelogLogfileEnabled true
    </Plugin>
  2. Inject IO error.
Expected:
  2. Notification about an error is dispatched (read from mcelog file).
Status: Pass
Test#15: RAS plugin events time detection for error received from mcelog-client socket
  1. Change collectd.conf. Restart collectd.
    <LoadPlugin mcelog>
     Interval 0.005
    </LoadPlugin>
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     <McelogLogfile "/var/log/mcelog">
       <Match>
         Name "DISCLAIMER"
         Regex "(Hardware event.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
       <Match>
         Name "CPUID"
         Regex "CPUID (Vendor.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
     </McelogLogfile>
     McelogLogfileEnabled false
    </Plugin>
  2. Inject corrected memory error (over socket).

Expected:
  2. Notification is dispatched within 50 ms.
Observed:
  2. Notification is dispatched within 33 ms.
Status: Pass

 

 
Test#16: RAS plugin events time detection for error received from mcelog file
  1. Change collectd.conf. Restart collectd.
    <LoadPlugin mcelog>
     Interval 0.005
    </LoadPlugin>
    <Plugin mcelog>
     McelogClientSocket "/var/run/mcelog-client"
     McelogClientSocketEnabled true
     <McelogLogfile "/var/log/mcelog">
       <Match>
         Name "DISCLAIMER"
         Regex "(Hardware event.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
       <Match>
         Name "CPUID"
         Regex "CPUID (Vendor.*)"
         Excluderegex "kernel"
         IsMandatory true
       </Match>
     </McelogLogfile>
     McelogLogfileEnabled false
    </Plugin>
  2. Inject IO error using mce-inject tool.
  3. Repeat step#2 several times.

Expected:
  2. Notification is dispatched within 50 ms.
Observed:
  2. Notification is dispatched within 30 ms.
Status: Pass
Test#17: RAS plugin upon tags configuration failures
  1. Remove one "<Match>" from initial collectd.conf. Restart collectd.
  2. Remove one "</Match>" from initial collectd.conf. Restart collectd.
  3. Duplicate "<Match>" in initial collectd.conf.
  4. Duplicate "</Match>" in initial collectd.conf.
  5. Remove "<McelogLogfile "/var/log/mcelog">" from initial collectd.conf. Restart collectd.
  6. Remove "</McelogLogfile>" from initial collectd.conf. Restart collectd.
Expected:
  Collectd not started.
  In all cases an error is recorded in syslog with messages like "Parse error in file ..."
Status: Pass
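The tag failures exercised in this test can also be caught before restarting collectd. The following is a rough pre-check sketch, assuming block tags sit on their own lines as in the configurations in this report; it is not collectd's own parser, and `check_blocks` is a hypothetical helper name.

```python
import re

# Matches a line that holds only an opening or closing block tag.
TAG = re.compile(r"^\s*<(/?)(\w+)[^>]*>\s*$")


def check_blocks(conf_text):
    """Return a list of problems found with unbalanced <Block> ... </Block> tags."""
    problems, stack = [], []
    for lineno, line in enumerate(conf_text.splitlines(), 1):
        if line.lstrip().startswith("#"):
            continue  # commented-out configuration is ignored
        match = TAG.match(line)
        if not match:
            continue
        closing, name = match.group(1) == "/", match.group(2)
        if not closing:
            stack.append((name, lineno))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            problems.append(f"line {lineno}: unexpected </{name}>")
    problems += [f"line {n}: <{name}> never closed" for name, n in stack]
    return problems


GOOD = """<Plugin mcelog>
 <McelogLogfile "/var/log/mcelog">
  <Match>
   Name "DISCLAIMER"
  </Match>
 </McelogLogfile>
</Plugin>"""

# Same fragment with the opening <Match> removed (step 1 of this test).
BAD = GOOD.replace("  <Match>\n", "")

print(check_blocks(GOOD))
print(check_blocks(BAD))
```

An empty list means the tags are balanced; any entry corresponds to a configuration for which a parse error in syslog would be expected.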
Test#18: RAS plugin upon invalid path for mcelog file and socket
  1. Edit mcelog logfile path to invalid. Restart collectd.
  2. Edit mcelog socket path to invalid. Restart collectd.
Expected:
  Collectd started.
  1. Error is recorded to syslog "mcelog: Cannot connect to client socket"
  2. Error is recorded to syslog "utils_tail: stat (/var/log/mcelog/mcelog) failed: Not a directory"
Status: Pass

 

Table#2: RAS QPI test cases

CPU 0 BANK 2 STATUS 0x8800000000000E0F 
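The STATUS values used for injection in this report can be read bit by bit. Below is a minimal decoding sketch based on the architectural MCi_STATUS flag positions (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57) with the MCA error code in bits 15:0; `decode_status` is a hypothetical helper, and the "corrected" label is a simplification that only looks at VAL and UC.

```python
FLAG_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}


def decode_status(status):
    """Split an MCi_STATUS value into its flag bits and the 16-bit MCA error code."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    return {
        "flags": flags,
        "mca_code": status & 0xFFFF,
        # Simplification: valid and not marked uncorrected.
        "corrected": flags["VAL"] and not flags["UC"],
    }


for value in (0x8800000000000E0B, 0x8800000000000E0F):
    info = decode_status(value)
    set_flags = [name for name, on in info["flags"].items() if on]
    print(f"0x{value:016X}: flags={set_flags} mca_code=0x{info['mca_code']:04X} "
          f"corrected={info['corrected']}")
```

Both values have VAL and MISCV set and UC clear, which agrees with the "Corrected error" and "MCi_MISC register valid" lines in the mcelog excerpts; they differ only in the MCA error code (0x0E0B for the IO case, 0x0E0F for the QPI case).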

Table#3: RAS CPU test cases

CPU 0 BANK 1 STATUS CORRECTED PCC

 

Table#4: RAS System test cases

Open question: how to inject any System error.
