Intel PMU Performance considerations

Based on tests and feedback from several use cases, the Intel PMU plugin is known to cause collectd performance problems. This page describes the likely sources of these problems and best practices for avoiding them.

Enabling all counters for all cores can be overkill: collectd may not be able to read every counter within each interval. Whether this happens also depends on system load and on the configuration of other plugins. The collectd process has only a limited number of read threads, and it does not start the next read callback until the previous one has finished, which can lead to delayed or skipped reads.

The other factor is the collectd configuration itself. Processing a high volume of metrics/events, setting read intervals to low values (<5 s), or enabling more write plugins can overload collectd, which can result in dropped events or missed metrics. This has been observed in some cases. It is hard to name the exact configuration at which overload starts, because that is specific to the system and environment. The following recommendations help to prevent it in collectd:

Enable only the required plugins.
Do not set the plugin or global interval below 5 s unless specifically needed.
Tweak the collectd configuration if a high volume of events/metrics is needed and/or low intervals are required.
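
As a hedged illustration of the first two recommendations, a collectd.conf fragment along these lines loads only the plugins that are actually needed and keeps the interval at its default. The plugin selection (intel_pmu as the only read plugin, csv as the only write plugin) is an assumption made for the example, not a recommended set.

```conf
# Global interval kept at the default of 10 s.
Interval 10

# Load only what is required; every additional read plugin competes
# for the limited pool of read threads.
# (csv is a hypothetical choice of write plugin for this sketch.)
LoadPlugin csv

# A per-plugin interval can be set inside the LoadPlugin block.
<LoadPlugin intel_pmu>
  Interval 10
</LoadPlugin>
```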

Some of the collectd options to tweak:

Interval – default is 10 s. It can be set globally and per plugin. Set it to a reasonable value and do not lower it without a specific reason. In testing, one read operation of the plugin took up to 3-4 seconds, and write plugins can consume additional seconds, so the default of 10 s is a reasonable value with an appropriate margin.
Timeout – default is 2. It should be increased when short intervals are used. Timeout multiplied by Interval defines the time window for a plugin: if no new metrics arrive within this window, the metrics are considered missing and write plugins perform additional reconfiguration. A Timeout value that is too low (a window that is too short) is one of the causes of the errors described in the bug report (metrics are added and removed in an endless loop).
ReadThreads/WriteThreads – when short intervals and/or more read/write plugins are used (and errors such as "read function of plugin took more time than read interval" appear), the number of collectd read/write threads needs to be increased. Set ReadThreads to the number of enabled read plugins. The right value for WriteThreads depends on the configuration (number of write plugins, number of read plugins).
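
To make the Timeout arithmetic concrete, the sketch below works through the window (Timeout multiplied by Interval) for the default values and for a 1 s interval. The 3-4 s read duration is the figure quoted above; the comparisons in the comments are an illustration, not a measurement.

```conf
# Defaults: window = 2 * 10 s = 20 s, well above a 3-4 s read.
Interval 10
Timeout 2

# Short interval with the default Timeout: window = 2 * 1 s = 2 s.
# A read that takes 3-4 s cannot deliver within this window, so its
# metrics may be treated as missing.
#Interval 1
#Timeout 2

# Short interval with a raised Timeout: window = 10 * 1 s = 10 s.
#Interval 1
#Timeout 10
```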

The following settings were tested with a 1 s interval without any long-term drops:

Interval "1"
Timeout "10"
ReadThreads "25"
WriteThreads "25"

Summary of things to check when performance problems with the Intel PMU plugin are observed:

Metrics and cores in the intel_pmu configuration: limit them to those of actual interest, and consider the impact of other plugins.
Number of ReadThreads/WriteThreads.
Size of the write queue (WriteQueueLimitHigh/WriteQueueLimitLow).
Interval and Timeout values.
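
A sketch of what the first and third checklist items might look like in collectd.conf follows. The option names are the ones documented for the intel_pmu plugin and for the global write queue settings; the core range, the disabled event groups, and the queue limits are illustrative assumptions that need to be adapted to the system. The fragment also assumes that the intel_pmu plugin has already been loaded.

```conf
# Bound the write queue so that a slow write plugin cannot grow it
# without limit (values are illustrative only).
WriteQueueLimitHigh 1000000
WriteQueueLimitLow 800000

<Plugin intel_pmu>
  # Collect only the event groups that are actually needed.
  ReportHardwareCacheEvents true
  ReportKernelPMUEvents false
  ReportSoftwareEvents false
  # Restrict monitoring to the cores of interest (assumed: cores 0-3).
  Cores "0-3"
</Plugin>
```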
