  • Publishing mode: metrics should be written somewhere off the system under test, ideally to a time series DB, to minimize the impact of measurement noise on the system.
  • Isolate and pin cores appropriately.
  • Footprint measurement process:
    • Measure idle system resource usage.
    • Run the plugin or plugin combination and measure system resource usage.
    • Repeat the tests on a busy system, i.e. one running a workload.
    • Report results.
    • Metrics to collect:
      • Sysstat metrics
        • CPU: %user, %nice, %system, %iowait, %steal
        • Memory usage: kbmemfree, kbmemused, %memused, kbbuffers, kbcached, kbswpfree
        • Cache thrashing if any
        • IO
          • tps – Total transfers (I/O requests) per second, reads and writes combined
          • rtps – Read requests per second
          • wtps – Write requests per second
          • bread/s – Blocks read per second (sysstat blocks are 512-byte sectors)
          • bwrtn/s – Blocks written per second (sysstat blocks are 512-byte sectors)
  • Process-specific stats for collectd (or any other collector), if possible.
  • Application stats for the workload you are running, to determine the impact of collectd or other collectors on it.
    • You might pick a use case with some network traffic, to see the impact on it, if any.
    • Intervals: try 1 second, 10 seconds and 60 seconds; if possible, also try sub-second intervals.
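
The idle-versus-loaded comparison described above can be sketched as a per-metric difference of means. This is a minimal illustration, not the tooling used for these tests: the function name footprint, the dictionary layout, and the sample values are all hypothetical, and it assumes the sysstat samples have already been parsed into lists of numbers.

```python
from statistics import mean


def footprint(idle, loaded):
    """Return the per-metric mean difference (loaded minus idle).

    Both arguments map a metric name (e.g. "%user") to a list of samples.
    Only metrics present in both runs are compared.
    """
    deltas = {}
    for metric in idle.keys() & loaded.keys():
        deltas[metric] = mean(loaded[metric]) - mean(idle[metric])
    return deltas


# Hypothetical samples, not measured data.
idle_run = {"%user": [1.0, 1.5, 0.5], "%system": [0.5, 0.5, 0.5]}
collector_run = {"%user": [2.0, 2.5, 1.5], "%system": [1.0, 1.0, 1.0]}

result = footprint(idle_run, collector_run)
for name in sorted(result):
    print(f"{name}: +{result[name]:.2f}")
```

The same comparison would be repeated for the busy-system runs, using the workload-only run as the baseline instead of the idle run.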

Process to be followed:

  1. Isolate the CPUs on the monitoring node. [Added the isolcpus option to the GRUB kernel command line]
  2. Run collectd on an isolated CPU. [Used the taskset command to run collectd with the appropriate CPU mask]
  3. Plugins: configure collectd to monitor the following metrics: [CPU, Memory, Disk, Interface, IPMI, processes, libvirt, Caches, OVS, hugepages]
  4. Output: configure collectd to send metrics to InfluxDB running on a separate node.
  5. Workload: stress-ng + iperf.
  6. Monitoring duration: 5 minutes.
  7. Frequency (collection interval): 1 second, 10 seconds, 60 seconds.
  8. Collect metrics to analyze collectd's runtime performance. [Used Snap to collect 'collectd-process' metrics plus CPU and memory data]
  9. Note the iperf performance. [To study any effect of collectd on it]
  10. Currently checking whether more information can be obtained from LTTng.
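
As a rough sketch of steps 2, 6 and 7, the snippet below builds one pinned collectd command line per collection interval and the number of samples each 5-minute run should yield. It only prints the plan and does not launch anything. The CPU number, the configuration directory and the per-interval file names are hypothetical; it uses taskset's CPU-list form (-c) rather than a mask, and assumes collectd's -f (stay in foreground) and -C (config file) options.

```python
import shlex

DURATION_S = 300            # step 6: 5 minutes of monitoring
INTERVALS_S = (1, 10, 60)   # step 7: collection intervals


def collectd_command(cpu, conf_path):
    """Command line that pins collectd to one isolated CPU (step 2)."""
    return ["taskset", "-c", str(cpu), "collectd", "-f", "-C", conf_path]


def run_plan(cpu, conf_dir):
    """One entry per interval, with the sample count expected per run."""
    plan = []
    for interval in INTERVALS_S:
        # Hypothetical naming scheme: one config file per interval.
        conf = f"{conf_dir}/collectd-{interval}s.conf"
        plan.append({
            "interval_s": interval,
            "expected_samples": DURATION_S // interval,
            "command": shlex.join(collectd_command(cpu, conf)),
        })
    return plan


plan = run_plan(cpu=3, conf_dir="/tmp/footprint")
for run in plan:
    print(run)
```

Comparing the expected sample count with what actually arrives at the remote database is one simple check that the collector kept up at each interval.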

 

*** Repeat the above process for other monitoring agents ***