collectd processes plugin too slow - CentOS

We have quite a lot of CentOS 7 and a few CentOS 6 servers in our simulation farm, and we installed collectd to monitor them. There is a problem with the processes plugin that only manifests on the CentOS 7 machines. When I look in the log file I see the following:
# grep "read-function of the" /var/log/collectd.log | grep 2022-07-07
[2022-07-07 10:07:31] plugin_read_thread: read-function of the `processes' plugin took 3.130 seconds, which is above its read interval (3.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
[2022-07-07 10:07:49] plugin_read_thread: read-function of the `processes' plugin took 6.216 seconds, which is above its read interval (3.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
[2022-07-07 11:36:32] plugin_read_thread: read-function of the `processes' plugin took 45.422 seconds, which is above its read interval (3.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
[2022-07-07 11:38:36] plugin_read_thread: read-function of the `processes' plugin took 46.150 seconds, which is above its read interval (3.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
[2022-07-07 11:42:47] plugin_read_thread: read-function of the `processes' plugin took 49.933 seconds, which is above its read interval (3.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
[2022-07-07 11:44:28] plugin_read_thread: read-function of the `processes' plugin took 46.702 seconds, which is above its read interval (3.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
[2022-07-07 11:48:42] plugin_read_thread: read-function of the `processes' plugin took 50.039 seconds, which is above its read interval (3.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
[2022-07-07 11:49:28] plugin_read_thread: read-function of the `processes' plugin took 40.787 seconds, which is above its read interval (3.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
[2022-07-07 13:22:15] plugin_read_thread: read-function of the `processes' plugin took 19.180 seconds, which is above its read interval (3.000 seconds). You might want to adjust the `Interval' or `ReadThreads' settings.
This is a sample from a typical day. The strange thing is that, although I have a 3-second interval, this happens in no more than about 20 out of the ~28,800 measurements per day. In rare cases the delay reported in the messages can reach 150 seconds or more. Since I am monitoring the servers I can see that CPU load is not particularly high: utilisation never exceeds 50% in any category (user, system, wait, interrupt) when these issues occur, and free memory is never a problem. This particular machine is a reasonably powerful 2-socket (24 cores each) Xeon Gold 6248R with 762 GB of RAM.
The main configuration
# Global settings
Interval 3
CheckThresholds true
CollectInternalStats true
Timeout 10
ReadThreads 10
WriteThreads 10
WriteQueueLimitHigh 4000
WriteQueueLimitLow 4000
FQDNLookup False
# Plugin loading
LoadPlugin logfile
LoadPlugin threshold
# Plugin Configuration
<Plugin "logfile">
LogLevel "info"
File "/var/log/collectd.log"
Timestamp true
</Plugin>
Include "/etc/collectd.d/*.conf"
The processes configuration is the following
LoadPlugin processes
<Plugin "processes">
ProcessMatch "lsf" "lsf.*lim"
ProcessMatch "collectd" "collectd"
</Plugin>
<Plugin "threshold">
<Plugin "processes">
Instance "lsf"
<Type "ps_count">
DataSource "processes"
FailureMin 1
</Type>
Instance "collectd"
<Type "ps_count">
DataSource "processes"
FailureMin 1
</Type>
</Plugin>
</Plugin>
In total I use between 11 and 13 read plugins (cpu, df, interface, load, memory, processes, swap, uptime, users, logfile, threshold, and python or perl on some systems) and one write plugin (write_riemann). The messages appear only for the processes plugin.
My hope was that if I could speed up the execution of the plugin the issue would go away, but I am starting to realise this is not enough. Initially I used 5 read threads, but have now increased that to 10. I also started with the EPEL-provided collectd 5.8 packages, but later recompiled 5.12 from source with -O3 and a newer compiler (gcc 11). Finally, I start the collectd service with nice -n -20 to give it maximum priority. Those changes brought an improvement, but not enough to make the messages disappear.
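For completeness, on the CentOS 7 machines the priority can be applied with a systemd drop-in instead of wrapping the binary in nice; a minimal sketch, assuming collectd runs under the stock systemd unit (the drop-in file name is arbitrary):

# /etc/systemd/system/collectd.service.d/priority.conf  (created with `systemctl edit collectd`)
[Service]
Nice=-20

# reload and restart so the override takes effect
systemctl daemon-reload
systemctl restart collectd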
The reason I need the delay kept to a minimum is that if I do not receive an update for the processes I monitor within 60 seconds, I generate an alert. I thought of increasing the allowed downtime before an alert to 2 minutes, but I still get at least one false positive per day across the 300 servers. Raising it further would improve the situation, but I am wondering whether this is the right approach.
Any suggestions on how to debug this issue are more than welcome.

Related

How to count cache-misses in mmap-ed memory (using eBPF)?

I would like to get timeseries
t0, misses
...
tN, misses
where tN is a timestamp (second resolution) and misses is the number of times the kernel performed disk I/O for my PID in order to load a missing page of the mmap()-ed memory region when the process accessed that memory. OK, maybe the connection between disk I/O and memory access is harder to track, so let's assume my program cannot do any disk I/O for any reason other than accessing missing mmapped memory. I THINK I need to track something called node-load-misses in the perf world.
Any ideas how eBPF can be used to collect such data? What probes should I use?
I tried to use perf record for a similar purpose, but I dislike how much data perf records. As I recall the attempt looked like the following (I also don't remember how I parsed that output.data file):
perf record -p $PID -a -F 10 -e node-loads -e node-load-misses -o output.data
I thought eBPF could provide some facility to implement such a thing with less overhead.
Loading mmapped pages that are not present in memory is not a hardware event like perf's cache-misses, node-loads or node-load-misses. When your program accesses a memory address that is not present, a GP-fault/page-fault exception is generated by the hardware and handled in software by the Linux kernel. On the first access to anonymous memory, a physical page is allocated and mapped for that virtual address; on access to an mmapped file, disk I/O is initiated. There are two kinds of page faults in Linux, minor and major, and the kind that involves disk I/O is the major page fault.
You should try trace-cmd, ftrace, or perf trace. Support for fault tracing was planned for the perf tool in 2012, and patches were proposed in https://lwn.net/Articles/602658/
There is a tracepoint for page faults from userspace code, and this command prints some events with the memory address of the page fault:
echo 2^123456%2 | perf trace -e 'exceptions:page_fault_user' bc
With a recent perf tool (https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/) there is perf trace record, which can record both mmap syscalls and page_fault_user events into perf.data; perf script will then print all the events, and they can be counted by an awk or Python script.
Some useful links on perf and tracing: http://www.brendangregg.com/perf.html http://www.brendangregg.com/ebpf.html https://github.com/iovisor/bpftrace/blob/master/INSTALL.md
And some bcc tools may be used to trace disk I/O, like https://github.com/iovisor/bcc/blob/master/examples/tracing/disksnoop.py or https://github.com/brendangregg/perf-tools/blob/master/examples/iosnoop_example.txt
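As a rough sketch of the eBPF route (assuming bpftrace is available; the PID and map name below are placeholders, not something from the question), major page faults can be counted per second with a software-event probe:

# print a per-second count of major page faults for PID 12345 (placeholder PID)
sudo bpftrace -e '
software:major-faults:1 /pid == 12345/ { @misses = count(); }
interval:s:1 { time("%H:%M:%S "); print(@misses); clear(@misses); }'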
And for a simple time-series stat you can use the perf stat -I 1000 command with the appropriate software events:
perf stat -e cpu-clock,page-faults,minor-faults,major-faults -I 1000 ./program
...
#           time             counts unit   events
     1.000112251             413.59 msec   cpu-clock        #    0.414 CPUs utilized
     1.000112251              5,361        page-faults      #    0.013 M/sec
     1.000112251              5,301        minor-faults     #    0.013 M/sec
     1.000112251                 60        major-faults     #    0.145 K/sec
     2.000490561              16.32 msec   cpu-clock        #    0.016 CPUs utilized
     2.000490561                  1        page-faults      #    0.005 K/sec
     2.000490561                  1        minor-faults     #    0.005 K/sec
     2.000490561                  0        major-faults     #    0.000 K/sec
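If the program is already running, the same counters can be attached to an existing process instead of launching it under perf (assuming $PID holds the target PID):

perf stat -e page-faults,minor-faults,major-faults -p $PID -I 1000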

Fail2Ban not working on Ubuntu 16.04 (Date issues)

I have a problem with Fail2Ban
2018-02-23 18:23:48,727 fail2ban.datedetector [4859]: DEBUG Matched time template (?:DAY )?MON Day 24hour:Minute:Second(?:\.Microseconds)?(?: Year)?
2018-02-23 18:23:48,727 fail2ban.datedetector [4859]: DEBUG Got time 1519352628.000000 for "'Feb 23 10:23:48'" using template (?:DAY )?MON Day 24hour:Minute:Second(?:\.Microseconds)?(?: Year)?
2018-02-23 18:23:48,727 fail2ban.filter [4859]: DEBUG Processing line with time:1519352628.0 and ip:158.140.140.217
2018-02-23 18:23:48,727 fail2ban.filter [4859]: DEBUG Ignore line since time 1519352628.0 < 1519381428.727771 - 600
It says "ignoring Line" because the time skew is greater than the inspection period. However, this is not the case.
If indeed 1519352628.0 is derived from Feb 23, 10:23:48, then the other date: 1519381428.727771 must be wrong.
I have run tests for 'invalid user' hitting this repeatedly. But Fail2ban is always ignoring the line.
I am positive I am getting Filter Matches within 1 second.
This is Ubuntu 16.04 and Fail2ban 0.9.3
Thanks for any help you might have!
Looks like there is a time zone issue on your machine that might cause the confusion. Try to set the correct time zone and restart both rsyslogd and fail2ban.
Regarding your debug log:
1519352628.0 = Feb 23 02:23:48
-> timestamp parsed from line in log file with time Feb 23 10:23:48 - 08:00 time zone offset!
1519381428.727771 = Feb 23 10:23:48
-> timestamp of current time when fail2ban processed the log.
Coincidentally, this is the same time as the time in the log file. That's what makes it so confusing in this case.
1519381428.727771 - 600 = Feb 23 10:13:48
-> limit for how long to look backwards in time in the log file since you've set findtime = 10m in jail.conf.
Fail2ban 'correctly' ignores the log entry because, once the -08:00 time zone offset is applied, it appears to be older than 10 minutes.
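A quick way to verify this on the machine itself (a sketch; the time zone name is only an example, use your real one) is to render both epoch values locally, check the configured zone, and restart the services:

date -d @1519352628      # the time fail2ban parsed from the log line
date -d @1519381428      # the time fail2ban considered "now"
timedatectl              # shows the currently configured time zone
sudo timedatectl set-timezone Europe/Berlin   # example zone only
sudo systemctl restart rsyslog fail2ban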
btw:
If you need IPv6 support for banning, consider upgrading fail2ban to v0.10.x.
And there is also a brand new fail2ban version v0.11 (not yet marked stable, but running without issue for 1+ month on my machines) that has this wonderful new auto-increment bantime feature.

Is it possible to get rid of unwanted console messages in Eclipse

Messages in my Eclipse (TestNG & Selenium) projects have become so numerous that the outputs I want from sysout are getting lost among them. This started recently; they were never this numerous before. I heard there is a way to get rid of the unwanted warning messages, or at least reduce them. How can this be achieved?
I get duplicates of messages like:
[BaseMessageSender] Connection established, starting reader thread
[BaseMessageSender] ReaderThread waiting for an admin message
[JsonMessageSender] Sending message [GenericMessage ==> suiteCount:1,
testCount:1]
[TestNG] Time taken by org.testng.reporters.jq.Main#11531931: 80 ms
[TestNG] Time taken by org.testng.reporters.EmailableReporter2#35bbe5e8: 19 ms
[TestNG] Time taken by org.testng.reporters.XMLReporter#3f0ee7cb: 7 ms
[TestNG] Time taken by [FailedReporter passed=0 failed=0 skipped=0]: 1 ms
[Utils] Attempting to create C:\Filepath\....
Right-click on the file you want to run, then open Run Configurations and make sure the verbose and Debug checkboxes are unchecked.
image link
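If part of the noise comes from TestNG itself rather than from Eclipse, lowering the suite's verbose level may also help (a sketch; suite, test and class names are placeholders and this is not verified against your setup):

<!-- testng.xml: verbose="0" keeps TestNG's own console logging to a minimum -->
<suite name="MySuite" verbose="0">
  <test name="MyTest">
    <classes>
      <class name="com.example.MyTests"/>  <!-- placeholder class -->
    </classes>
  </test>
</suite>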

Kitura slow or low request per second?

I've downloaded Kitura 0.20 and created a new project for a benchmark, built with swift build -c release:
import Kitura
let router = Router()
router.get("/") {
request, response, next in
response.send("Hello, World!")
next()
}
Kitura.addHTTPServer(onPort: 8090, with: router)
Kitura.run()
and the score appears to be low compared to Zewo and Vapor, which can hit 400k+ requests/s?
MacBook-Pro:hello2 yanli$ wrk -t1 -c100 -d30 --latency http://localhost:8090
Running 30s test @ http://localhost:8090
  1 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   415.36us  137.54us   3.09ms   91.11%
    Req/Sec     5.80k     2.47k    7.19k    85.71%
  Latency Distribution
     50%  391.00us
     75%  443.00us
     90%  513.00us
     99%    0.93ms
  16229 requests in 30.01s, 1.67MB read
  Socket errors: connect 0, read 342, write 55, timeout 0
Requests/sec:    540.84
Transfer/sec:     57.04KB
I suspect you are running out of ephemeral ports. Your issue is probably the same as this one: 'ab' program freezes after lots of requests, why?
Kitura currently does not support HTTP keepalive, and so every request requires a new connection. One symptom of this is that regardless of how many seconds you attempt to drive load, you'll see a similar number of completed requests (16229 in your example).
On OS X, there are 16,384 ephemeral ports available by default, and these will be rapidly exhausted unless you tune the network settings.
[1] http://danielmendel.github.io/blog/2013/04/07/benchmarkers-beware-the-ephemeral-port-limit/
[2] https://rolande.wordpress.com/2010/12/30/performance-tuning-the-network-stack-on-mac-osx-10-6/
My approach has been to reduce the Maximum Segment Lifetime tunable (which defaults to 15000, or 15 seconds) and increase the range of available ports temporarily while benchmarking, for example:
sudo sysctl -w net.inet.tcp.msl=1000
sudo sysctl -w net.inet.ip.portrange.first=32768
<run benchmark>
sudo sysctl -w net.inet.tcp.msl=15000
sudo sysctl -w net.inet.ip.portrange.first=49152
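To confirm that ports really are being exhausted while wrk runs, something like the following can be used on OS X (a sketch; exact sysctl key names can vary between releases):

# current ephemeral port range
sysctl net.inet.ip.portrange.first net.inet.ip.portrange.last
# number of connections stuck in TIME_WAIT during/right after the run
netstat -an | grep -c TIME_WAIT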

Windows Task Scheduler incorrectly spawns multiple instances per trigger (but need to be able to run multiple instances in parallel)

I have a task set to run an executable every 30 minutes via Windows Task Scheduler (OS: 64bit Windows Server Standard with SP2). This task needs to be able to run multiple instances of itself simultaneously, so this setting is selected: "If the task is already running, then the following rule applies: Run a new instance in parallel". (Reason: the task processes records in a queue table which may be empty or contain hundreds of thousands of records. Each task instance reserves a chunk of records to work on, so the instances won't collide)
The problem is, the task spawns multiple NEW instances at each trigger interval. It should only fire ONE NEW instance every 30 minutes. Often it spawns 2, 3, 4 or more new instances. At this point, the executable can handle the duplicate new instances without significant errors, but the server is doing more work than it needs to, and it just bugs me to no end that the task scheduler is misbehaving in this way. Here is what I have tried so far to fix:
Deleted and recreated the task (many times)
Rebooted the server
Installed this hotfix: http://support.microsoft.com/en-us/kb/2461249
Set to run every 30 minutes indefinitely
Set to run every 30 minutes daily, for duration of one day
Set "Synchronize across time zones" = true
Set "Run with highest privileges" = true
Set "Delay task for up to random delay of [X] seconds" = false (multiple new - instances are spawned all within the same second)
Set "Delay task for up to random delay of [30] seconds" = true (instead of firing during the same second, multiple new instances fire within a 30 second span)
Set "If the task fails, restart every 1 minute" = true
Set "If the task fails, restart every 1 minute" = false
Set "Run task as soon as possible after a scheduled start is missed" = false (if set to true, the problem is worse)
Even more puzzling: some other tasks on this server have the same or similar settings and do not have this problem. They had the problem before the hotfix, but after the hotfix it has been rare. Except for with this one task. What on earth could be the problem?
The exported task settings are below (with XXXX replacing sensitive info). I compared this point for point with another similar task that is not having the issue. The only differences: the working task has a different author, a different exe file, and runs every 5 minutes instead of every 30 minutes.
I'm about to chalk this up as a bug that Microsoft needs to fix some day, but thought I'd offer it up for review here before giving up.
<?xml version="1.0" encoding="UTF-16"?>
<Task version="1.2" xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
<RegistrationInfo>
<Date>2014-08-27T10:09:33.7980839</Date>
<Author>XXXX\XXXX</Author>
<Description>Links newly downloaded images to products. Resizes and uploads different sizes to XXXX. Updates relevant tables. Logs errors.</Description>
</RegistrationInfo>
<Triggers>
<CalendarTrigger>
<Repetition>
<Interval>PT30M</Interval>
<Duration>P1D</Duration>
<StopAtDurationEnd>false</StopAtDurationEnd>
</Repetition>
<StartBoundary>2015-02-11T19:06:00Z</StartBoundary>
<Enabled>true</Enabled>
<RandomDelay>PT30S</RandomDelay>
<ScheduleByDay>
<DaysInterval>1</DaysInterval>
</ScheduleByDay>
</CalendarTrigger>
</Triggers>
<Principals>
<Principal id="Author">
<UserId>XXXX\XXXX</UserId>
<LogonType>Password</LogonType>
<RunLevel>HighestAvailable</RunLevel>
</Principal>
</Principals>
<Settings>
<IdleSettings>
<Duration>PT10M</Duration>
<WaitTimeout>PT1H</WaitTimeout>
<StopOnIdleEnd>true</StopOnIdleEnd>
<RestartOnIdle>false</RestartOnIdle>
</IdleSettings>
<MultipleInstancesPolicy>Parallel</MultipleInstancesPolicy>
<DisallowStartIfOnBatteries>false</DisallowStartIfOnBatteries>
<StopIfGoingOnBatteries>true</StopIfGoingOnBatteries>
<AllowHardTerminate>true</AllowHardTerminate>
<StartWhenAvailable>false</StartWhenAvailable>
<RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>
<AllowStartOnDemand>true</AllowStartOnDemand>
<Enabled>true</Enabled>
<Hidden>false</Hidden>
<RunOnlyIfIdle>false</RunOnlyIfIdle>
<WakeToRun>true</WakeToRun>
<ExecutionTimeLimit>P1D</ExecutionTimeLimit>
<Priority>7</Priority>
</Settings>
<Actions Context="Author">
<Exec>
<Command>C:\XXXX\XXXX\XXXX.exe</Command>
</Exec>
</Actions>
</Task>
I was in the same boat: one task was firing off 3 times and another 9 times, but a bunch of others fired only once as expected. The problem persisted after installing the hotfix as well.
After all my research turned up no good leads, my next step was going to be opening a support case with Microsoft. Before doing so I figured I would try deleting and recreating the task now that I've installed the patch. I started by only deleting and recreating the trigger (which was set to run once a day) and setting it for a different time. Bingo, my problem was fixed!
So I don't know what the key was, whether it was deleting and recreating the trigger after the patch was installed or if it was changing the time, but doing both worked.
Hope this helps!
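For anyone who prefers to script the delete-and-recreate step instead of clicking through the GUI, something along these lines should work (a sketch; task name, executable path and account are placeholders, and it creates a plain every-30-minutes trigger rather than a daily trigger with repetition):

schtasks /Delete /TN "ImageLinker" /F
schtasks /Create /TN "ImageLinker" /TR "C:\Apps\ImageLinker.exe" /SC MINUTE /MO 30 /RU DOMAIN\svcaccount /RP * /RL HIGHEST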