Anuket Project

 

Manual RAS memory plugin testing:

Prerequisites:

Test preconditions:

    1. Collectd with the RAS memory plugin (mcelog) is installed. The collectd plugins csv and mcelog are enabled. The mcelog service is started.
    2. mcelog and the error injection tools are installed: mce-inject, mce-test, einj (corrected memory errors were injected using the einj.ko module).
    3. The DUT's BIOS is supported by mcelog (BIOS vendor Intel Corp. is known to be supported).

 

Installation details:

      • Collectd, RAS memory plugin, mcelog, error injection tools installation and configuration details can be found here.
      • snmp-agent (collectd plugin) details are here.


Environment details:

E1 - Bare Metal, Ubuntu 16.04.

 

Repo/branch used:

 

Error injection details.

  1. Memory errors injected by mce-test (einj).
    To inject corrected memory errors:
    1. Remove the sb_edac and edac_core kernel modules:
      $ rmmod sb_edac
      $ rmmod edac_core
    2. Insert the einj module:
      $ modprobe einj param_extension=1
    3. Inject an error by specifying its details (repeat the last command at least two times):
      $ APEI_IF=/sys/kernel/debug/apei/einj
      $ echo 0x8 > $APEI_IF/error_type
      $ echo 0x01f5591000 > $APEI_IF/param1
      $ echo 0xfffffffffffff000 > $APEI_IF/param2
      $ echo 1 > $APEI_IF/notrigger
      $ echo 1 > $APEI_IF/error_inject
    Check the MCE statistics with "mcelog --client". Check the mcelog log for the injected error details with "less /var/log/mcelog".

    To inject uncorrected memory errors, change only error_type.
    Uncorrected non-fatal:
      $ echo 0x00000010 > $APEI_IF/error_type
    Uncorrected fatal:
      $ echo 0x00000020 > $APEI_IF/error_type
  2. Corrected memory errors injected by mce-inject.
    Install mce-inject as described here.
    Load the mce_inject module:
        $ modprobe mce_inject
    Edit the file:
        $ vi test/corrected
        CPU 0 BANK 0
        STATUS 0xcc00008000010090
        ADDR 0x0010FFFFFFF
    Inject an error:
        $ mce-inject test/corrected

  3. Uncorrected (non-fatal, without reboot) memory error injected using mce-inject and mce-test.
        $ mce-inject mce-test/cases/coverage/soft-inj/recoverable_ucr/data/srao_mem_scrub
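
The einj steps in item 1 above can be wrapped in a small helper so that repeated injections use identical parameters. This is only a sketch: the function name einj_inject is hypothetical, and the directory is passed as an argument so the helper can be tried against a scratch directory before it is pointed at the real debugfs path.

```shell
# Hypothetical helper (not part of mce-test): write one EINJ request.
# Arguments: <apei-dir> <error_type> <param1 address> <param2 mask>
# error_type values used on this page: 0x8 corrected,
# 0x10 uncorrected non-fatal, 0x20 uncorrected fatal.
einj_inject() {
    apei_dir=$1
    error_type=$2
    addr=$3
    mask=$4

    echo "$error_type" > "$apei_dir/error_type" || return 1
    echo "$addr"       > "$apei_dir/param1"     || return 1
    echo "$mask"       > "$apei_dir/param2"     || return 1
    echo 1             > "$apei_dir/notrigger"  || return 1
    # Writing to error_inject performs the injection.
    echo 1             > "$apei_dir/error_inject"
}

# Usage (as root, einj module loaded); call it at least twice:
#   einj_inject /sys/kernel/debug/apei/einj 0x8 0x01f5591000 0xfffffffffffff000
```

Only the values already listed above are used; nothing else about the platform is assumed.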

Mcelog collectd section:

LoadPlugin mcelog

<Plugin mcelog>
  McelogClientSocket "/var/run/mcelog-client"
  McelogLogfile "/var/log/mcelog"
</Plugin>

This section will change after the branch "feat_mcelog_mem_notification_level" is merged. With everything commented out, as below, the plugin currently defaults to the socket:

#<Plugin mcelog>
#  <Memory>
#    McelogClientSocket "/var/run/mcelog-client"
#    PersistentNotification false
#  </Memory>
#  McelogLogfile "/var/log/mcelog"
#</Plugin>

 

  1. RAS memory general test cases and result details.

Fields recorded for each test case: #, Test case title, Priority, Steps, Expected result, Actual result, Status, Environment, Automation result.

Test case 1: RAS memory plugin configuration
Priority: High
Steps:
  1. Enable the RAS memory plugin by uncommenting the 'mcelog' related lines in collectd.conf.
  2. Start collectd: "./collectd -C collectd.conf -f" or start as a service "service collectd start".
  3. Open collectd csv path, like: "collectd/csv/<DUT>/…".
  4. Stop collectd: "pkill collectd" or "service collectd stop". Comment out the RAS memory collectd plugin (mcelog) in the "collectd.conf" file. Delete the existing collectd csv files under the "collectd/csv" path. Start collectd.
  5. Stop collectd: "pkill collectd" or "service collectd stop". Uncomment the RAS memory collectd plugin (mcelog) in the "collectd.conf" file. Start collectd.
Expected result:
  1. File is changed.
  2. Verify collectd is running: "pidof collectd" returns a process ID or "service collectd status" shows the service is running.
  3. RAS memory collectd files exist: total and 24 hour for corrected and uncorrected errors. Collectd RAS related files are updated with the interval set in "collectd.conf".
  4. After collectd starts, collectd RAS related files are not created/updated.
  5. RAS memory collectd files exist: total and 24 hour for corrected and uncorrected errors. Collectd RAS related files are updated with the interval set in "collectd.conf".
Actual result:
  1. File is changed.
  2. collectd is running.
  3. mcelog-SOCKET_0_CHANNEL_0_DIMM_0_DIMM_A1
    mcelog-SOCKET_0_CHANNEL_2_DIMM_0_DIMM_C1
    mcelog-SOCKET_1_CHANNEL_0_DIMM_0_DIMM_E1
    mcelog-SOCKET_1_CHANNEL_2_DIMM_0_DIMM_G1
    mcelog-SOCKET_0_CHANNEL_0_DIMM_any
    mcelog-SOCKET_0_CHANNEL_3_DIMM_0_DIMM_D1
    mcelog-SOCKET_1_CHANNEL_0_DIMM_any
    mcelog-SOCKET_1_CHANNEL_3_DIMM_0_DIMM_H1
    mcelog-SOCKET_0_CHANNEL_1_DIMM_0_DIMM_B1
    mcelog-SOCKET_0_CHANNEL_any_DIMM_any
    mcelog-SOCKET_1_CHANNEL_1_DIMM_0_DIMM_F1
    mcelog-SOCKET_1_CHANNEL_any_DIMM_any
  4. Files are not updated.
  5. Files are updated with new values (timestamp and errors).
Status: Pass
Environment: E1
Automation result: Pass

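
Steps 4 and 5 of test case 1 toggle the plugin by commenting and uncommenting lines in collectd.conf. A minimal sketch of doing that from the shell follows. It assumes GNU sed and an unquoted "LoadPlugin <name>" line, handles only the LoadPlugin line (not the <Plugin mcelog> block), and the function names are hypothetical.

```shell
# plugin_loaded <conf> <plugin>: succeeds if an uncommented LoadPlugin line exists.
plugin_loaded() {
    grep -Eq "^[[:space:]]*LoadPlugin[[:space:]]+$2[[:space:]]*\$" "$1"
}

# set_plugin_loaded <conf> <plugin> <on|off>: uncomment or comment the line in place.
set_plugin_loaded() {
    conf=$1
    plugin=$2
    state=$3
    case $state in
        on)
            # Strip any leading '#' and whitespace in front of the LoadPlugin line.
            sed -i "s/^[#[:space:]]*\(LoadPlugin[[:space:]]\{1,\}$plugin\)[[:space:]]*\$/\1/" "$conf"
            ;;
        off)
            # Prefix an active LoadPlugin line with '#'; commented lines are left alone.
            sed -i "s/^[[:space:]]*\(LoadPlugin[[:space:]]\{1,\}$plugin\)[[:space:]]*\$/#\1/" "$conf"
            ;;
        *)
            return 2
            ;;
    esac
}

# Usage:
#   set_plugin_loaded collectd.conf mcelog off && service collectd restart
```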
Test case 2: RAS memory plugin interval configuration
Priority: High
Steps:
  1. Open the "collectd.conf" file to check the collectd update interval.
  2. Open collectd csv path, like: "collectd/csv/<DUT>/mce_log…".
  3. Change the interval in "collectd.conf" to 60 (seconds). Inject a few memory errors.
  4. Change the interval within the range 1-300 seconds. Inject a few memory errors.
Expected result:
  1. Find the line "Interval     <number>".
  2. RAS memory collectd files exist: total and 24 hour for corrected and uncorrected errors. RAS memory collectd files are updated with the interval set in "collectd.conf".
  3. RAS memory collectd files are updated every 60 seconds.
  4. RAS memory collectd files are updated at every configured interval.
Actual result:
  1. The default is 10 seconds.
  2. Timestamps are updated every 10 seconds.
  3. RAS memory collectd files are updated every 60 seconds.
  4. Works correctly for 30, 60 and 300 seconds.
Status: Pass
Environment: E1
Automation result: Pass

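
The interval check in test case 2 can be done by comparing consecutive timestamps in a csv file. The sketch below assumes the csv plugin layout of a header line beginning with "epoch" followed by rows whose first column is the epoch time in seconds; the function name and the default tolerance of one second are choices made for this example.

```shell
# csv_interval_ok <csv-file> <interval-seconds> [tolerance-seconds]
# Succeeds when every gap between consecutive rows is within the tolerance
# of the expected interval and at least two rows are present.
csv_interval_ok() {
    awk -F, -v want="$2" -v tol="${3:-1}" '
        NR == 1 && $1 == "epoch" { next }   # skip the header line
        seen {
            d = $1 - prev
            if (d < want - tol || d > want + tol) bad = 1
        }
        { prev = $1; seen = 1; n++ }
        END { exit (bad || n < 2) }
    ' "$1"
}

# Usage: csv_interval_ok <path to a RAS memory csv file> 60
```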
Test case 3: RAS memory plugin mcelog liveness detection
Priority: High
Steps:
  1. Verify collectd and mcelog are running.
  2. Stop the mcelog service.
  3. Start the mcelog service. Restart collectd if needed.
  4. Terminate mcelog (pkill mcelog).
  5. Restart the mcelog service.
  6. Repeat the test three times.
Expected result:
  1. Collectd and mcelog are running.
  2. Service mcelog is stopped. Appropriate messages are printed to syslog with correct severity by the collectd RAS memory plugin.
  3. Collectd and mcelog are running. RAS memory collectd files are updated with the interval set in "collectd.conf".
  4. Service mcelog has exited. Appropriate messages are printed to syslog (TBD) with correct severity by the RAS memory collectd plugin.
  5. Collectd and mcelog are running. RAS memory collectd related files are updated with the interval set in "collectd.conf".
  6. RAS memory collectd plugin is stopped/started, and messages about this are printed.
Actual result:
  1. pidof mcelog, collectd: 207803, 207791
  2. syslog messages:
    collectd[207791]: mcelog: Connection to socket is broken
    collectd[207791]: plugin_dispatch_notification: severity = 1; message = Connection to mcelog socket is broken.; time = 1477301194.912; host = silpixa00378251;
    collectd[207791]: plugin_read_thread: Handling `mcelog'. mcelog: mcelog_read
    collectd[207791]: mcelog: MACHINE CHECK INFO NOT AVAILABLE
    collectd[207791]: plugin_read_thread: read-function of the `mcelog' plugin took 0.000027 seconds.
    collectd[207791]: plugin_read_thread: Effective interval of the `mcelog' plugin is 30.000 seconds.
    collectd[207791]: plugin_read_thread: Next read of the `mcelog' plugin at 1477301754.617.
  3. RAS memory collectd files are updated with new timestamps. After an error is injected to DIMM any, new values are recorded.
  4. systemd[1]: Stopped LSB: Machine Check Exceptions (MCE) collector & decoder.
  5. pidof collectd, mcelog: 209386, 209318
Status: Pass
Environment: E1
Automation result: PASS (HAA-1195, Fixed)

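
When test case 3 is repeated three times, it helps to count how often the plugin reported a lost connection. The sketch below searches a log file for the "Connection to socket is broken" message quoted in the actual result above. The function name is hypothetical and the syslog path in the usage line is an assumption about the DUT.

```shell
# count_socket_breaks <logfile>
# Prints the number of lines in which the mcelog plugin reports a broken socket.
count_socket_breaks() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the status at 0.
    grep -c 'mcelog: Connection to socket is broken' "$1" || true
}

# Usage (log path is an assumption): count_socket_breaks /var/log/syslog
```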

Test case 4: RAS memory plugin upon collectd restart
Priority: High
Steps:
  1. Enable the RAS memory plugin by uncommenting the 'mcelog' related lines in collectd.conf.
  2. Start collectd: "./collectd -C collectd.conf -f" or start as a service "service collectd start".
  3. Open collectd csv path, like: "collectd/csv/<DUT>/…".
  4. Stop collectd: "pkill collectd" or "service collectd stop".
  5. Repeat the test three times.
Expected result:
  1. File is changed.
  2. Verify collectd is running: "pidof collectd" returns a process ID or "service collectd status" shows the service is running.
  3. Collectd RAS memory files exist: total and 24 hour for corrected and uncorrected errors. Collectd RAS related files are updated with the interval set in "collectd.conf".
  4. Verify collectd is not running: "pidof collectd" returns nothing or "service collectd status" shows the service is stopped. Collectd RAS related files are not updated.
  5. Collectd is functioning correctly. Collectd RAS memory related data is updated in time.
Actual result:
  1. Success.
  2. Collectd service started; mcelog plugin init and read callback calls are present in syslog.
  3. Mcelog appends data to the log files at the defined interval.
  4. Collectd service is stopped. The mcelog logs are no longer updated.
  5. Repeating the previous steps reproduces the same behavior.
Status: Pass
Environment: E1
Automation result: Pass
Test case 5: RAS memory plugin upon corrected errors injection
Priority: High
Steps:
  1. Inject a correctable memory error.

    $ cat mytest/corrected
    CPU 0 BANK 0
    STATUS 0xcc00008000010090
    ADDR 0x0010FFFFFFF

    $ ./mce-inject mytest/corrected

  2. Make sure no errors are injected. Wait for a while.
  3. Repeat the test for other correctable memory errors.
Expected result:
  1. The memory error is recorded to the total and 24 hour files for corrected errors with correct timestamp and location (node#/channel#/DIMM#) by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd logfile.
  2. No new memory errors are recorded, either to mcelog or to the csv collectd plugin files related to RAS memory.
  3. Same as in step#1.
Actual result:
  1. The error is logged to /var/log/mcelog and to collectd, and the values are the same. The same number of memory errors is recorded to mcelog and to the collectd logfile. No server reboot observed.
  2. During idle time there are no new records in the mcelog plugin log files, nor any logged by mcelog itself. The unchanged value is updated to collectd at the correct time interval.
  3. Once new errors are injected, counting and logging occur as expected.
Status: Pass
Environment: E1
Automation result: Pass
Test case 6: RAS memory plugin upon uncorrected non-fatal errors injection
Priority: Medium
Steps:
  1. Inject an uncorrectable non-fatal memory error.
    $ cat mytest/uncorrected_nonfatal
    CPU 0 BANK 2
    STATUS UNCORRECTED SRAO 0xc0
    MCGSTATUS RIPV MCIP
    ADDR 0x1234
    MISC 0x8c
    RIP 0x73:0x1eadbabe
    $ ./mce-inject mytest/uncorrected_nonfatal
  2. Make sure no errors are injected. Wait for a while.
  3. Repeat the error injection three times.
Expected result:
  1. The memory error is recorded to the total and 24 hour files for uncorrected errors with correct timestamp and location (node#/channel#/DIMM#) by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd logfile.
    Note: error injection may cause a system reboot.
  2. No new memory errors are recorded, either to mcelog or to the csv collectd plugin files related to RAS memory.
  3. Same as in step#1.
Actual result:
  1. The error is logged to /var/log/mcelog and to collectd, and the values are the same. The same number of memory errors is recorded to mcelog and to the collectd logfile. No server reboot observed.
  2. During idle time there are no new records in the mcelog plugin log files, nor any logged by mcelog itself. The unchanged value is updated to collectd at the correct time interval.
  3. Once new errors are injected, counting and logging occur as expected.
Status: Pass
Environment: E1
Automation result: Pass
Test case 7: RAS memory plugin upon uncorrected fatal errors injection
Priority: Medium
Steps:
  1. Inject an uncorrectable fatal memory error.
    $ cat mytest/uncorrected_fatal
    CPU 0 BANK 2
    STATUS UNCORRECTED SRAO 0xc0
    MCGSTATUS MCIP
    ADDR 0x1234
    MISC 0x8c
    $ ./mce-inject mytest/uncorrected_fatal
  2. Check the server behavior.
  3. Repeat step#1 again (same or different memory error).
Expected result:
  1. The memory error is recorded to the total and 24 hour files for uncorrected errors with correct timestamp and location (node#/channel#/DIMM#) by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd logfile.
    Note: error injection may cause a system reboot.
  2. No new memory errors are recorded, either to mcelog or to the csv collectd plugin files related to RAS memory.
  3. Same as in step#1.
Actual result:
  1. The server is rebooted. The uncorrected error is detected by mcelog and, after the server is up, logged by collectd against the correct DIMM location. Collectd files don't preserve statistics after the error is injected and the server reboots!
    Collectd files don't preserve statistics once mcelog is restarted!
  2. During idle time there are no new records in the mcelog plugin log files, nor any logged by mcelog itself. The unchanged value is updated to collectd at the correct time interval.
  3. Once new errors are injected, counting and logging occur as expected.
Status: Pass
Environment: E1
Automation result: NA
Test case 8: RAS memory plugin MCE detection on faulty DIMM
Priority: Low
Steps:
  1. Prepare a server with a faulty DIMM installed in a specific slot.
  2. Wait for the expected memory errors. Check for RAS memory errors in mcelog and in the collectd csv files.
  3. Repeat the observation for a while, for example overnight.
Expected result:
  1. Start the server.
  2. Errors are registered in the mcelog log file and in the "collectd/csv/" files with the correct address: node#/channel#/DIMM#.
  3. Errors are detected and the MCE statistics are updated.
Actual result:
  Removed because it's difficult to check as the host is continuously rebooting.
Status: Invalid
Environment: E1
Automation result: NA
Test case 9: RAS memory plugin upon different Unix socket location
Priority: Medium
Steps:
  1. Change the socket location in mcelog.conf (socket-path = /var/run/mcelog-client) and in collectd.conf for the mcelog plugin to another location (default: McelogClientSocket "/var/run/mcelog-client"). Restart mcelog/collectd.
  2. Inject an error and check the statistics.
Expected result:
  1. Configuration changed. Socket is created, mcelog/collectd are running.
  2. The memory error is recorded to the total and 24 hour files for errors with correct timestamp and location (node#/channel#/DIMM#) by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd logfile.
Actual result:
  1. Configuration changed. Socket is created, mcelog/collectd are running.
  2. The memory error is recorded to the total and 24 hour files for errors with correct timestamp and location (node#/channel#/DIMM#) by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd logfile.
Status: Pass
Environment: E1
Automation result: NA
Test case 10: RAS memory plugin upon different log file location
Priority: Medium
Steps:
  1. Change the log file location in mcelog.conf (logfile = /var/log/newmcelog). Make sure data is not updated through the socket-path defined in mcelog.conf.
  2. Change the log file location in collectd.conf for the mcelog plugin to the other location (McelogLogfile "/var/log/newmcelog"). Restart collectd.
  3. Inject an error and check the statistics.
Expected result:
  1. Configuration changed. Log file is created, mcelog/collectd are running.
  2. Memory error notifications are dispatched by the exec or python plugin. Stats are not updated by the CSV plugin!!!
Actual result:
  1. Log file is not created under the new location.
  2. Memory error notifications are dispatched by the exec or python plugin. Stats are not updated by the CSV plugin!!!
Status: TBD (PR's awaiting)
Environment: E1
Automation result: NA
Test case 11: RAS memory plugin started with "Plugin mcelog" section commented
Priority: High
Steps:
  1. Comment out the "<Plugin mcelog>" section in collectd.conf.
  2. Start collectd.
  3. Inject a memory error.
Expected result:
  2. Collectd started.
    Default path for socket, "McelogClientSocket" - "/var/run/mcelog-client".
    Default path for log file, "McelogLogfile" - "/var/log/mcelog".
  3. Mcelog reports the memory error to the "/var/log/mcelog" log file correctly.
Actual result:
  2. Collectd started. Socket is created under "/var/run/mcelog-client".
  3. Mcelog reports the memory error to the "/var/log/mcelog" log file; the values are the same as reported by the collectd plugin.
Status: Pass
Environment: E1
Automation result: NA
Test case 12: RAS memory plugin started with commented fields
Priority: High
Steps:
  1. Comment out the "McelogClientSocket" field in collectd.conf.
  2. Start collectd. Inject a memory error.
  3. Comment out the "McelogLogfile" field in collectd.conf.
  4. Restart collectd. Inject a memory error.
Expected result:
  2. Collectd started. Socket file is created under the "/var/run/mcelog-client" location. Mcelog reports the memory error to the "/var/log/mcelog" log file correctly.
  4. Collectd started. Mcelog reports the memory error to the "/var/log/mcelog" log file correctly.
Actual result:
  2. Collectd started. Socket file is created under the "/var/run/mcelog-client" location. Mcelog reports the memory error to the "/var/log/mcelog" log file correctly.
  4. Collectd started. Mcelog reports the memory error to the "/var/log/mcelog" log file correctly.
Status: Pass
Environment: E1
Automation result: NA
Test case 13: RAS memory plugin data updated for new period (day)
Priority: Medium
Steps:
  1. Start mcelog and collectd.
  2. Inject corrected, uncorrected non-fatal and fatal errors.
  3. Wait for a new day to start.
  4. Inject corrected, uncorrected non-fatal and fatal errors.
Expected result:
  2. The memory error is recorded to the total and 24 hour files for corrected errors by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd log file.
  3. All corrected/uncorrected memory errors are copied from the previous day to the new day (YYYY-MM-DD) for the total timespan.
    All corrected/uncorrected memory errors for the 24h timespan preserve their values for the previous day, but are set to zero for the new day.
  4. The memory error is recorded for the new day only to the total and 24 hour files for corrected/uncorrected errors by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd log file.
Actual result:
  2. The memory error is recorded to the total and 24 hour files for corrected errors by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd log file.
  3. All corrected/uncorrected memory errors are copied from the previous day to the new day (YYYY-MM-DD) for the total timespan.
    All corrected/uncorrected memory errors for the 24h timespan preserve their values for the previous day, but are set to zero for the new day.
  4. The memory error is recorded for the new day only to the total and 24 hour files for corrected/uncorrected errors by the collectd RAS memory plugin. The same number of memory errors is recorded to mcelog and to the collectd log file.
Status: Pass
Environment: E1
Automation result: NA
Test case 14: RAS memory plugin data updated from emulated socket (non mcelog)
Priority: Medium
Steps:
  1. Configure the mcelog plugin to retrieve data from another socket (collectd.conf).
  2. Open a socket (using the mcelog emulator).
  3. Start collectd (the mcelog service must be stopped).
  4. Generate corrected/uncorrected errors through the created socket (using the mcelog emulator).
Expected result:
  3. Collectd started.
  4. Generated corrected/uncorrected memory errors are recorded correctly against the specified DIMM's location. The number of corrected/uncorrected errors retrieved from collectd is the same as the number generated.
Actual result:
  3. Collectd started.
  4. Generated corrected/uncorrected memory errors are recorded correctly against the DIMM's location. The number of corrected/uncorrected errors retrieved from collectd is the same as the number generated.
Status: TBD
Environment: E1
Automation result: NA
Test case 15: RAS memory plugin events are received
Priority: High
Steps:
  1. Enable and configure the exec plugin:
    LoadPlugin exec
    <Plugin exec>
      NotificationExec "test" "/home/test/notify.sh"
    </Plugin>
  2. Type in the script as below (cat "/home/test/notify.sh"):
    #!/bin/bash
    while read x y
    do
        echo "$x$y" >> "/home/test/notifications"
    done
  3. Start collectd. Wait for a few intervals.
  4. Inject corrected/uncorrected memory errors.
  5. Repeat the test with a time interval in the range of 1 to 60 seconds.
Expected result:
  3. Mcelog running. Collectd started without errors in syslog.
    Notification(s) recorded every time interval for corrected/uncorrected memory errors.
  4. The number of notification(s) increased and is the same as retrieved by mcelog, and is updated every time interval for corrected/uncorrected memory errors.
  5. The number of notification(s) increased and is the same as retrieved by mcelog, and is updated every time interval for corrected/uncorrected memory errors.
Actual result:
  3. Mcelog running. Collectd started without errors in syslog.
    Notification(s) recorded every time interval for corrected/uncorrected memory errors.
  4. The number of notification(s) increased and is the same as retrieved by mcelog, and is updated every time interval for corrected/uncorrected memory errors.
  5. The number of notification(s) increased and is the same as retrieved by mcelog, and is updated every time interval for corrected/uncorrected memory errors.
Status: Pass
Environment: E1
Automation result: NA
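
To compare the number of notifications with the mcelog counters in test case 15, the file written by notify.sh can be counted. This sketch relies on two things: the exec plugin sends each notification as header lines such as "Severity: FAILURE" followed by a blank line and the message, and notify.sh joins the first word and the rest of each line without a space, so every notification leaves exactly one line starting with "Severity:". The function name is hypothetical.

```shell
# count_notifications <file> [severity]
# Prints how many notifications were recorded, optionally only those with the
# given severity (FAILURE, WARNING or OKAY).
count_notifications() {
    file=$1
    severity=${2:-}
    # One "Severity:<LEVEL>" line is written per notification by notify.sh.
    grep -c "^Severity:$severity" "$file" || true
}

# Usage:
#   count_notifications /home/test/notifications
#   count_notifications /home/test/notifications WARNING
```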
Test case 16: RAS memory plugin events are received every 5-10 ms
Priority: High
Steps:
  1. Enable and configure the exec plugin to update data every 5 ms.
  2. Start collectd. Wait for a few intervals.
  3. Inject corrected/uncorrected memory errors.
Expected result:
  2. Mcelog running. Collectd started without errors in syslog.
    Notification(s) recorded every time interval for corrected/uncorrected memory errors.
  3. The number of notification(s) increased and is the same as retrieved by mcelog, and is updated every time interval for corrected/uncorrected memory errors.
Actual result:
  2. Mcelog running. Collectd started without errors in syslog.
    Notification(s) recorded every time interval for corrected/uncorrected memory errors.
  3. The number of notification(s) increased and is the same as retrieved by mcelog, and is updated every time interval for corrected/uncorrected memory errors.
    Time needed for a notification to be processed:
    corrected: 79-116 = 37 ms / 108-148 = 40 ms
    uncorrected: 134-160 = 26 ms / 162-188 = 26 ms
Status: TBD
Environment: E1
Automation result: NA
Test case 17: RAS memory plugin events are received according to the interval with persist notification option enabled
Priority: High
Steps:
  1. Set PersistentNotification to True (collectd.conf).
  2. Enable and configure the exec plugin:
    LoadPlugin exec
    <Plugin exec>
      NotificationExec "test" "/home/test/notify.sh"
    </Plugin>
  3. Type in the script as below (cat "/home/test/notify.sh"):
    #!/bin/bash
    while read x y
    do
        echo "$x$y" >> "/home/test/notifications"
    done
  4. Start collectd. Wait for a few intervals.
  5. Inject corrected and uncorrected memory errors.
Expected result:
  4. Mcelog running. Collectd started without errors in syslog.
  5. Notifications recorded every time interval for corrected and uncorrected memory errors.
Actual result:
  4. Mcelog running. Collectd started without errors in syslog.
  5. Notifications recorded every time interval for corrected and uncorrected memory errors.
Status: Pass (dev testing on branch feat_mcelog_mem_notification_level)
Environment: E1
Automation result: NA
Test case 18: RAS memory plugin events are received according to the interval with persist notification option disabled
Priority: High
Steps:
  1. Set PersistentNotification to False (collectd.conf).
  2. Enable and configure the exec plugin:
    LoadPlugin exec
    <Plugin exec>
      NotificationExec "test" "/home/test/notify.sh"
    </Plugin>
  3. Type in the script as below (cat "/home/test/notify.sh"):
    #!/bin/bash
    while read x y
    do
        echo "$x$y" >> "/home/test/notifications"
    done
  4. Start collectd. Wait for a few intervals.
  5. Inject corrected and uncorrected memory errors.
Expected result:
  4. Mcelog running. Collectd started without errors in syslog.
  5. Notifications recorded only once per error injection for corrected and uncorrected memory errors.
Actual result:
  4. Mcelog running. Collectd started without errors in syslog.
  5. Notifications recorded only once per error injection for corrected and uncorrected memory errors.
Status: Pass (dev testing on branch feat_mcelog_mem_notification_level)
Environment: E1
Automation result: NA
Test case 19: RAS memory plugin configuration: memory socket and log file (McelogLogfile) are exclusive
Priority: High
Steps:
  1. Enable both the McelogClientSocket and McelogLogfile options.
    <Plugin mcelog>
      <Memory>
        McelogClientSocket "/var/run/mcelog-client"
      </Memory>
      McelogLogfile "/var/log/mcelog"
    </Plugin>
  2. Start collectd.
Expected result:
  2. Enabling both the memory socket and the log file is prohibited. An error should be reported and the plugin should exit.
Actual result:
  2. Enabling both the memory socket and the log file is prohibited. An error is reported and the plugin exits. Collectd failed at the config stage; no other plugins were loaded.
    ERROR: mcelog: Invalid configuration option: "McelogLogfile", Memory option is already configured.
Status: Pass (dev testing on branch feat_mcelog_mem_notification_level)
Environment: E1
Automation result: NA
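
The conflicting configuration from test case 19 can be spotted before collectd is started. The following is a rough sketch, not the plugin's own validation: it scans a collectd.conf for an uncommented <Memory> block and an uncommented McelogLogfile option inside the same <Plugin mcelog> section. The function name is hypothetical.

```shell
# mcelog_conf_conflict <conf>
# Succeeds (exit 0) when <Memory> and McelogLogfile are both active
# inside the <Plugin mcelog> section, i.e. the prohibited combination.
mcelog_conf_conflict() {
    awk '
        /^[ \t]*#/                           { next }          # skip commented lines
        /^[ \t]*<Plugin[ \t]+"?mcelog"?>/    { in_plugin = 1; next }
        in_plugin && /^[ \t]*<\/Plugin>/     { in_plugin = 0 }
        in_plugin && /^[ \t]*<Memory>/       { memory = 1 }
        in_plugin && /^[ \t]*McelogLogfile/  { logfile = 1 }
        END { exit !(memory && logfile) }
    ' "$1"
}

# Usage:
#   mcelog_conf_conflict collectd.conf && echo "socket and log file both set"
```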
Test case 20: RAS memory plugin notifications read from log file
Priority: High
Steps:
  1. Define mcelog as:
    <Plugin mcelog>
      McelogLogfile "/var/log/mcelog"
    </Plugin>
  2. Start collectd.
  3. Inject corrected and uncorrected memory errors.
Expected result:
  2. Mcelog running. Collectd started without errors in syslog.
  3. Notifications about corrected and uncorrected errors are sent once per injection.
Status: TBD
Environment: E1
Automation result: NA
Test case 21: RAS memory plugin notifications read from log file and dispatched once regardless of "PersistentNotification"
Priority: Medium
Steps:
  1. Define mcelog as:
    <Plugin mcelog>
      <Memory>
        PersistentNotification false
      </Memory>
      McelogLogfile "/var/log/mcelog"
    </Plugin>
  2. Start collectd.
  3. Send corrected/uncorrected memory errors to the log file (the error format needs to be defined).
  4. Set PersistentNotification to True in collectd.conf. Restart collectd.
Expected result:
  2. Mcelog running. Collectd started without errors in syslog.
  3. Notifications about errors are sent once per injection of a corrected/uncorrected error.
  4. Notifications about errors are sent once per injection of a corrected/uncorrected error.
Status: TBD
Environment: E1
Automation result: NA
Test case 21: RAS memory plugin notifications severity sent from socket
Priority: High
Steps:
  1. Define the collectd mcelog plugin:
    <Plugin mcelog>
      <Memory>
        McelogClientSocket "/var/run/mcelog-client"
        PersistentNotification false
      </Memory>
    </Plugin>
  2. Start collectd.
  3. Inject corrected memory errors.
  4. Inject uncorrected non-fatal memory errors.
Expected result:
  2. Mcelog running. Collectd started without errors in syslog.
  3. The notification about a corrected error is sent with Warning severity.
  4. The notification about an uncorrected error is sent with Failure severity.
Status: TBD
Environment: E1
Automation result: NA
Test case 22: RAS memory plugin notifications severity sent from logfile
Priority: High
Steps:
  1. Define the collectd mcelog plugin:
    <Plugin mcelog>
      McelogLogfile "/var/log/mcelog"
    </Plugin>
  2. Start collectd.
  3. Inject corrected memory errors.
  4. Inject uncorrected non-fatal memory errors.
Expected result:
  2. Mcelog running. Collectd started without errors in syslog.
  3. The notification about a corrected error is sent with Warning severity.
  4. The notification about an uncorrected error is sent with Failure severity.
Status: TBD
Environment: E1
Automation result: NA

 

 

 


  1. SNMP RAS memory test cases for manual execution

Q & A:

    1. Is it expected to have no previous mcelog logs after reboot?
      1. N/A
    2. Does snmp-agent plugin depend on snmpd?
      1. N/A

Manual test results:

Fields recorded for each scenario: #, High level scenario description, Steps to be executed, Expected result, Test result, Comments, Automated.

Scenario 1: Positive scenario snmp-agent plugin configuration.
Steps to be executed:
  1. Enable the mcelog and snmp-agent plugins in collectd.conf.
  2. Configure snmp-agent to run in various snmp versions (v1, v2c, v3).
Expected result:
  Collectd runs as expected with the config settings for the snmp-agent plugin correctly applied.
  Collectd service exits normally on service stop.
Test result: PASS
Automated: Yes (under review)
Scenario 2: Negative scenario snmp-agent plugin configuration.
Steps to be executed:
  1. Enable the mcelog and snmp-agent plugins in collectd.conf.
  2. Configure snmp-agent incorrectly (list of options TBD when the plugin is available).
Expected result:
  Collectd logs an error message against the snmp-agent plugin.
  Collectd service starts, runs and exits normally only if no service-affecting misconfiguration occurred.
  Otherwise collectd fails to start, with rc=1.
Scenario 3: Verify snmp-agent plugin reports corrected errors collected by enabled mcelog plugin
Steps to be executed and expected results:
  1. Run collectd with the mcelog and snmp-agent plugins enabled.
     Expected: Collectd service starts and runs normally.
  2. Get the memory errors summary using the mcelog utility.
     Expected: Get the initial number of corrected errors.
  3. Get the number of corrected memory errors using the snmpget utility, within an interval time window.
     Expected: Verify that the initial values taken from the two sources are the same.
  4. Monitor for 5 intervals whether the data changes.
     Expected: Verify that the data does not change without error injection.
  5. Inject one or a few corrected errors.
  6. Get the memory errors summary using the mcelog utility.
     Expected: Get the current number of corrected errors. Verify the counter difference corresponds to the number of injected corrected errors.
  7. Get the corrected memory errors value using the snmpget utility, within an interval time window.
     Expected: Verify the value is the same as the one retrieved from mcelog.
Test result: PASS
Automated: Yes (under review)
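
The "monitor for 5 intervals" step in scenario 3 can be scripted by polling a command and checking that its output never changes. The helper below is a generic sketch with a hypothetical name; it makes no assumption about which OID the snmp-agent plugin exposes, so the snmpget arguments in the usage line are placeholders to be filled in for the DUT.

```shell
# value_stable <polls> <delay-seconds> <command...>
# Runs the command <polls> times, <delay-seconds> apart.
# Returns 0 if the output is identical every time, 1 if it changed,
# 2 if the command itself failed.
value_stable() {
    polls=$1
    delay=$2
    shift 2
    first=$("$@") || return 2
    i=1
    while [ "$i" -lt "$polls" ]; do
        sleep "$delay"
        current=$("$@") || return 2
        [ "$current" = "$first" ] || return 1
        i=$((i + 1))
    done
}

# Usage with a 10 second collectd interval (community and OID are placeholders):
#   value_stable 5 10 snmpget -v2c -c <community> localhost <OID>
```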
Scenario 4: Verify snmp-agent plugin reports timed out corrected errors collected by enabled mcelog plugin
Steps to be executed and expected results:
  1. Run collectd with the mcelog and snmp-agent plugins enabled.
     Expected: Collectd service starts and runs normally.
  2. Get the memory errors summary using the mcelog utility.
     Expected: Get the initial number of timed out corrected errors, note the date.
  3. Get the timed out corrected memory errors value using the snmpget utility, within an interval time window.
     Expected: Verify that the initial values taken from the two sources are the same and belong to the same date.
  4. Monitor for 5 intervals whether the data changes.
     Expected: Verify that the data does not change without error injection.
  5. Inject one or a few corrected errors.
  6. Get the memory errors summary using the mcelog utility, and the corresponding date.
     Expected: Verify the counter difference corresponds to the number of timed out corrected errors for this specific date.
  7. Get the timed out corrected memory errors value using the snmpget utility, within an interval time window.
     Expected: Verify the value is the same as the one retrieved from mcelog.
Test result: PASS
Automated: Yes (under review)
Scenario 5: Verify snmp-agent plugin reports uncorrected errors collected by enabled mcelog plugin
Steps to be executed and expected results:
  1. Run collectd with the mcelog and snmp-agent plugins enabled.
     Expected: Collectd service starts and runs normally.
  2. Get the memory errors summary using the mcelog utility.
     Expected: Get the initial number of uncorrected errors.
  3. Get the number of uncorrected memory errors using the snmpget utility, within an interval time window.
     Expected: Verify that the initial values taken from the two sources are the same.
  4. Monitor for 5 intervals whether the data changes.
     Expected: Verify that the data does not change without error injection.
  5. Inject an uncorrected error.
     Expected: Verify that it causes a system reset, but the system is available again after OS restart.
  6. Get the memory errors summary using the mcelog utility.
     Expected: Verify the injected uncorrected error was logged.
  7. Get the uncorrected memory errors value using the snmpget utility, within an interval time window.
     Expected: Verify the value is the same as the one retrieved from mcelog.
Test result: PASS
Automated: Yes (under review)
Scenario 6: Verify snmp-agent plugin reports timed out uncorrected errors collected by enabled mcelog plugin
Steps to be executed and expected results:
  1. Run collectd with the mcelog and snmp-agent plugins enabled.
     Expected: Collectd service starts and runs normally.
  2. Get the memory errors summary using the mcelog utility.
     Expected: Get the initial number of timed out uncorrected errors.
  3. Get the number of timed out uncorrected memory errors using the snmpget utility, within an interval time window.
     Expected: Verify that the initial values taken from the two sources are the same.
  4. Monitor for 5 intervals whether the data changes.
     Expected: Verify that the data does not change without error injection.
  5. Inject an uncorrected error.
     Expected: Verify that it causes a system reset, but the system is available again after OS restart.
  6. Get the memory errors summary using the mcelog utility.
     Expected: Verify the injected uncorrected error was logged.
  7. Get the timed out uncorrected memory errors value using the snmpget utility, within an interval time window.
     Expected: Verify the value is the same as the one retrieved from mcelog.
Test result: PASS
Automated: Yes (under review)
Scenario 7: Verify snmp-agent plugin behavior when snmpd service is stopped
Steps to be executed:
  1. Run collectd with the mcelog and snmp-agent plugins enabled.
  2. Stop snmpd: service snmpd stop
Expected result:
  The user cannot send snmpwalk requests.
  The following message appears in log/syslog: "Warning: Failed to connect to the agentx master agent ([NIL])"
Test result: PASS
Scenario 8: Verify that snmp-agent plugin does not report any data when mcelog plugin is disabled
Steps to be executed and expected results:
  1. Run collectd with the snmp-agent plugin enabled and the mcelog plugin disabled.
     Expected: Collectd service starts and runs normally.
  2. Get the memory errors summary using the mcelog utility.
     Expected: Get the initial number of timed out uncorrected errors.
  3. Get the number of memory errors using the snmpwalk utility, within an interval time window.
     Expected: Verify that no data is returned, only an error message.
Test result: PASS
Automated: Yes (under review)
Scenario 9: Verify correct behavior of snmp-agent collectd plugin when mcelog plugin is enabled but mcelog service is stopped
Steps to be executed:
  1. Stop mcelog: service mcelog stop
  2. Run collectd with the snmp-agent and mcelog plugins enabled.
Expected result:
  An error is raised: "Failed to connect to mcelog server. Connection refused"
Test result: PASS
Automated: Yes (under review)
Scenario 10: Verify correct behavior of snmp-agent collectd plugin when mcelog plugin is enabled but mcelog service is restarted
Steps to be executed:
  1. Run collectd with the snmp-agent and mcelog plugins enabled.
  2. Restart the mcelog service.
  3. Trigger another batch of errors.
  4. Verify that the mcelog snmp values are equal to the triggered error count.
Expected result:
  Mcelog snmp values are equal to the triggered error count.
Test result: PASS
Automated: Yes (under review)
Scenario 11: Verify snmp-agent plugin behavior when snmpd service is restarted.
Steps to be executed:
  1. Run collectd with the snmp-agent and mcelog plugins enabled.
  2. Trigger more errors using the mce-inject tool.
  3. Restart the snmpd service.
  4. Verify that snmp_agent has resumed.
Expected result: TBD
Test result: TBD