I am using the 4.9 Linux kernel on an embedded platform and have noticed that the watchdog timer is being refreshed automatically, even though I am not running any user-space daemon like the one the watchdog documentation mentions.
Taken almost verbatim from shodanex's answer to a related question.
If you enabled the watchdog driver in your kernel, the watchdog driver sets up a kernel timer in charge of resetting the watchdog.
If no application opens the /dev/watchdog file, then the kernel takes care of resetting the watchdog. Since it is a timer, it won't appear as a dedicated kernel thread; it is handled in soft IRQ context instead. Now, if an application opens this file, it becomes responsible for the watchdog and can reset it by writing to the file, as described in the watchdog documentation.
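For reference, a minimal sketch of what such a user-space feeder looks like once it takes over. It only relies on the standard Linux watchdog chardev API from linux/watchdog.h; the loop count is arbitrary, and the magic-close write is only honored if the driver was built without CONFIG_WATCHDOG_NOWAYOUT:
/* Minimal watchdog feeder sketch: once /dev/watchdog is open, the kernel
 * stops feeding the dog and this process has to do it. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/watchdog.h>

int main(void)
{
	int fd = open("/dev/watchdog", O_WRONLY);
	if (fd < 0)
		return 1;

	for (int i = 0; i < 60; i++) {          /* feed for about a minute */
		ioctl(fd, WDIOC_KEEPALIVE, 0);  /* any write would also do  */
		sleep(1);
	}

	write(fd, "V", 1);  /* "magic close": tell the driver we stopped on purpose */
	close(fd);
	return 0;
}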
In July 2016 a commit to watchdog_dev.c in the 4.7 kernel enabled this behavior for all watchdog timer drivers. Before that, it appears to have been spotty and driver-specific. The behavior doesn't seem to be documented anywhere other than this thread and the source code.
/*
* A worker to generate heartbeat requests is needed if all of the
* following conditions are true.
* - Userspace activated the watchdog.
* - The driver provided a value for the maximum hardware timeout, and
* thus is aware that the framework supports generating heartbeat
* requests.
* - Userspace requests a longer timeout than the hardware can handle.
*
* Alternatively, if userspace has not opened the watchdog
* device, we take care of feeding the watchdog if it is
* running.
*/
return (hm && watchdog_active(wdd) && t > hm) ||
(t && !watchdog_active(wdd) && watchdog_hw_running(wdd));
I have a Python app running behind supervisor and uWSGI.
At a certain point, my app stopped answering queries, with this message in the logs:
Tue Sep 6 11:06:53 2022 - *** uWSGI listen queue of socket "127.0.0.1:8200" (fd: 3) full !!! (101/100) ***
In my use case:
- if a query is not answered within 1 s, the answer does not matter anymore; the client app will automatically redo the request
- restarting the whole uWSGI stack takes around half an hour
Thus I prefer to lose a few requests rather than restart everything.
QUESTIONS:
Is it possible to detect a full queue from inside the Python app?
Is it possible to clear the queue from inside?
To clarify: this question is not about fixing the underlying issue; I'm working on that separately. It is only about whether this particular workaround is possible and how to implement it.
I'm using uWSGI 2.0.20. Looking at the queue framework does not help, since the uwsgi module has no attribute such as queue_slot. Is the documentation outdated?
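For question 1, one direction I'm looking at is reading the listen queue length from the uWSGI stats server (uwsgi started with --stats 127.0.0.1:9191, which dumps a JSON document to whoever connects). This is only a rough sketch; the "queue"/"max_queue" field names inside the "sockets" array are assumptions to be checked against the actual stats output of 2.0.20:
import json
import socket

def listen_queue_usage(host="127.0.0.1", port=9191):
    """Read the uWSGI stats JSON and return (name, queue, max_queue) per listen socket."""
    data = b""
    # the stats server dumps one JSON document and then closes the connection
    with socket.create_connection((host, port), timeout=1) as conn:
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    stats = json.loads(data)
    # "queue"/"max_queue" field names are assumptions -- verify against your stats dump
    return [(sock.get("name"), sock.get("queue"), sock.get("max_queue"))
            for sock in stats.get("sockets", [])]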
EDIT
I can reproduce the error with this simple bash script:
#!/bin/bash
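# about 5 backgrounded POSTs per second; unanswered requests pile up until the 100-slot listen queue overflows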
for i in {0..200}
do
echo "Number: $i"
sleep 0.2
curl -X POST "http://localhost:1103/my_app" &
done
(my app accepts POST, not GET)
When I launch the MATLAB tool on a Windows machine, it starts but stays in an "Initializing" phase for 4-5 minutes, during which it does not respond to any user commands.
The issue could be due to the use of a remote license. Changing the remote license usage or modifying anything on the remote server is not possible right now.
After the 4-5 minutes of initialization the MATLAB tool works fine, and I am OK with this behavior.
The real problem is launching the MATLAB tool through the engine.c APIs such as OpenSingleUseFunction() or OpenEngineFunction(): the tool launches and then goes to an idle state.
Since MATLAB stays in the "Initializing" phase for 4-5 minutes, the engine terminates the MATLAB session after a 2-minute timeout and returns a MatlabEngine nullptr.
This 2-minute timeout is not something the user passes to the engine APIs.
So is there any way to change this timeout value in the engine APIs?
I'm switching from Phytec's dev kit with the phyCORE-i.MX6 SoM to a board of my own. The user manuals for both the SoM and the dev kit can be found on Phytec's page. Now I want to configure the RTC to keep the time during reboots and power-offs.
The battery (in my case a supercap) is connected to the VDD_BAT pin of the phyCORE-i.MX6 SoM (page 10). The internal PMIC is the DA9062, connected via the I2C bus, which is configured in the SoM dtsi file as rtc1.
imx6qdl-phytec-phycore-som.dtsi:
...
aliases {
	rtc1 = &da9062_rtc;
};
...
&i2c3 {
	pmic@58 {
		da9062_rtc: rtc {
			compatible = "dlg,da9062-rtc";
		};
	};
};
I didn't touch this file at all.
Next, I told the kernel to take its hwclock and system time from rtc1 instead of rtc0:
CONFIG_RTC_HCTOSYS_DEVICE="rtc1"
CONFIG_RTC_SYSTOHC_DEVICE="rtc1"
The driver is being loaded correctly as far as I can tell:
dmesg | grep rtc
[ 2.489836] da9063-rtc da9062-rtc: rtc core: registered da9063-rtc as rtc1
[ 2.499713] snvs_rtc 20cc000.snvs:snvs-rtc-lp: rtc core: registered 20cc000.snvs:snvs-rtc-lp as rtc2
[ 3.260348] da9063-rtc da9062-rtc: setting system clock to 2000-01-01 02:37:55 UTC (946694275)
and
cat /sys/class/rtc/rtc1/name
da9063-rtc da9062-rtc
Now I can set the time via date and transfer it to the hwclock via
hwclock --systohc
After rebooting, the system time and hwclock are set to the previously set date, which is fine. After cutting the power, however, the clock gets reset.
I've measured the voltage of the supercap, which is around 220 mV. The datasheet of the DA9062 tells me the chip does have a charging regulator for the backup battery which needs to be configured (Table 127: BBAT_CONT (0x0C5)).
As far as I understand the kernel RTC subsystem, the driver for the RTC should either take care of charging the battery or provide a userspace interface so I can do it myself, but I can't find anything on this topic.
I am using yocto to build the kernel/image for my board.
Is there something I'm missing or do I need to patch the driver myself in order to charge the supercap? Maybe there's an option in the devicetree to set the charging voltage and current for the cap?
I appreciate any ideas and suggestions, thanks.
Apparently the driver does not support charging a battery/supercap out of the box, and it has exclusive access to the I2C device address, which prevents userspace applications from accessing the device.
My solution to this problem is to set those values before the driver takes over:
Since this i2c bus is already configured in my barebox devicetree, I can access it before I boot the kernel (provided barebox is compiled with the i2c subsystem enabled in menuconfig). Here I can run a script which sets the BBAT and PD registers to enable charging the supercap.
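Roughly, the barebox script boils down to a single register write along these lines. This is only a sketch: the bus number, the 0x1f value for BBAT_CONT and the exact i2c_write option syntax are assumptions; check help i2c_write in your barebox build and derive the real charge voltage/current bits from the DA9062 datasheet:
# enable supercap charging by writing BBAT_CONT (0x0C5) on the DA9062 (I2C address 0x58)
# bus number, register value and option names are placeholders -- verify them first
i2c_write -b 2 -a 0x58 -r 0xc5 0x1f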
The cleaner solution, though, would be to extend the driver and provide a userspace interface for this functionality.
Another possible solution I did not investigate would be to check if the driver can be compiled as a module, so I could unload the module, set the registers and load it again.
This question is an extension of that question.
Once again: I'm working under CentOS 6.0 and I have a remote Windows 7 folder, mounted with:
mount -t cifs //PC128/mnt /media/net -o "username=WORKGROUP\user,password=pwd,rw,noexec,soft,uid=user,gid=user"
When the remote folder is not available (e.g. the network cable is pulled out), an attempt to access it locks up the application I'm working on. At first I found that QDir::exists() blocked for 20-90 seconds (I still can't figure out why the duration varies so much); later I found that any call to the stat() function leads to the application locking up.
I followed the advice provided in the topic above and moved the QDir::exists() call (and later the stat() call) to another thread, but this didn't solve the problem. The application still hangs when the connection is suddenly lost. The Qt backtrace shows that the lock is somewhere in the kernel:
0 __kernel_vsyscall
1 __xstat64@GLIBC_2.1 /lib/libc.so.6
2 QFSFileEnginePrivate::doStat stat.h
I also tried to check whether the remote share is still mounted before accessing the folder itself, but it didn't help. Approaches such as:
mount | grep /media/net
show that the shared folder is still mounted even if there is no active connection to the network.
Checking folder status differences such as:
stat -fc%t:%T /media/net/ != stat -fc%t:%T /media/net/..
also hangs for ~20 seconds.
So I have several questions:
Is there any way to change the CIFS timeouts? I tried to find out, but it seems there are no appropriate parameters and no CIFS config.
How can I check whether the remote folder is still mounted without getting locked?
How can I check whether a folder exists, also without getting locked?
Your problem, an unreachable network filesystem, is a very well-known trigger of Linux hung tasks, which are not the same thing as zombie processes at all (killing the parent PID won't do anything).
A hung task is a task that made a system call which hit a problem in the kernel, so that the system call never returns.
Its main particularity is that the scheduler puts the task in the "D" state, which means the program is in an uninterruptible sleep. This means you can do nothing to stop your program: you can send any signal to the task and it will not respond. Launching hundreds of SIGTERM/SIGKILL does nothing!
This was the case with my old kernel: when my NFS server crashed, I had to reboot the client to kill the tasks using the filesystem. I compiled that kernel a long time ago (I still have the build tree on my HDD), and during configuration I saw this in lib/Kconfig.debug:
config DETECT_HUNG_TASK
	bool "Detect Hung Tasks"
	depends on DEBUG_KERNEL
	default LOCKUP_DETECTOR
	help
	  Say Y here to enable the kernel to detect "hung tasks",
	  which are bugs that cause the task to be stuck in
	  uninterruptible "D" state indefinitiley.

	  When a hung task is detected, the kernel will print the
	  current stack trace (which you should report), but the
	  task will stay in uninterruptible state. If lockdep is
	  enabled then all held locks will also be reported. This
	  feature has negligible overhead.
It only proposes to detect such tasks (or panic on detection). I haven't checked whether recent kernels can actually solve the problem (judging from your question it seems to be the case), but at the time I didn't think it was worth enabling.
There is a second issue: normally, detection occurs after 120 seconds, but there is also a Kconfig option for this:
config DEFAULT_HUNG_TASK_TIMEOUT
	int "Default timeout for hung task detection (in seconds)"
	depends on DETECT_HUNG_TASK
	default 120
	help
	  This option controls the default timeout (in seconds) used
	  to determine when a task has become non-responsive and should
	  be considered hung.

	  It can be adjusted at runtime via the kernel.hung_task_timeout_secs
	  sysctl or by writing a value to
	  /proc/sys/kernel/hung_task_timeout_secs.

	  A timeout of 0 disables the check. The default is two minutes.
	  Keeping the default should be fine in most cases.
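As that help text says, on a running system the detection timeout can be inspected or changed without rebuilding the kernel, for example:
# read the current hung-task detection timeout (in seconds)
sysctl kernel.hung_task_timeout_secs
# lower it to 30 seconds; a value of 0 disables the check entirely
sysctl -w kernel.hung_task_timeout_secs=30
# equivalently, via procfs
echo 30 > /proc/sys/kernel/hung_task_timeout_secs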
This also works with kernel threads. Example: set up a loop device backed by a file on a FUSE filesystem, then crash the userspace program controlling the FUSE filesystem!
You should get a kernel thread whose name is of the form loopX (X normally corresponds to your loop device number) hanging!
weblinks:
https://unix.stackexchange.com/questions/5642/what-if-kill-9-does-not-work (look at the answer written by ultrasawblade)
http://www.linuxquestions.org/questions/linux-general-1/kill-a-hung-task-when-kill-9-doesn't-help-697305/
http://forums-web2.gentoo.org/viewtopic-t-811557-start-0.html
http://comments.gmane.org/gmane.linux.kernel/1189978
http://comments.gmane.org/gmane.linux.kernel.cifs/7674 (This is a case similar to yours)
As for your three questions, you now have the answer: this is most likely due to what is probably a well-known issue in the Linux VFS kernel layer (there are no CIFS timeouts to tune).
After much trial & error I found a solution that persists.
# vim /etc/fstab
//192.168.1.122/myshare /mnt/share cifs username=user,password=password,_netdev 0 0
The _netdev option is important since we are mounting a network device; it makes the mount wait until the network is up. Without it, clients may hang during the boot process if the system encounters any difficulties with the network.
https://www.redhat.com/sysadmin/samba-windows-linux
I've developed polling logic in a BPEL process on WSO2 BPS 3.0.0, connected to a PostgreSQL 9 DB.
It looks like this:
<bpel:repeatUntil name="RepeatUntilIncidentCompleted">
    <bpel:sequence name="CheckIncidentStatus">
        <bpel:wait name="Wait">
            <bpel:for expressionLanguage="urn:oasis:names:tc:wsbpel:2.0:sublang:xpath1.0"><![CDATA['PT1M']]></bpel:for>
        </bpel:wait>
        <!-- invoke a service, copy status to a vStatus variable -->
    </bpel:sequence>
    <bpel:condition expressionLanguage="urn:oasis:names:tc:wsbpel:2.0:sublang:xpath1.0"><![CDATA[$vStatus=36]]></bpel:condition>
</bpel:repeatUntil>
I created a process instance and this loop worked fine.
Later I restarted the WSO2 BPS server. At the moment of the restart the process instance was inside the loop, but after the restart the loop wasn't running anymore. The process is still marked as active in the Carbon console.
I've added the in-memory=false property in the deploy.xml but it didn't help.
I may have missed some configuration, but there could also be a persistence problem with such a loop (probably in Apache ODE).
Does anyone know a solution to this problem? Thx in advance.
I've discovered that:
1. All sleep operations that you put in a WSO2 BPEL process are represented in the ode_job table. The ts attribute contains the wake-up time.
2. After a restart of the BPS server, delayed sleep operations are not continued (a sleep operation is delayed when its wake-up time < current time - offset).
3. After a restart of the BPS server, all non-delayed sleep operations are continued properly.
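For instance, the pending wake-ups can be inspected directly in the database; only the ts column is relied on here, and the rest of the ode_job schema may differ between ODE versions:
-- list scheduled jobs in wake-up order (ts holds the wake-up time)
SELECT * FROM ode_job ORDER BY ts;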
Now let's say that:
- you have a BPEL process instance that is waiting in a wait operation, and the wake-up time is X
- you stop the BPS server and start it again after X
Because of 2., the process instance won't continue after the restart. This includes the loop I've described earlier.
My workaround for the problem:
Every time the WSO2 BPS server is restarted, I execute an SQL script on the database that updates the wake-up attribute of the sleep operations (the ts column in the ode_job table). The wake-up times are set to some point in the near future.
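As a hedged illustration of that script, assuming (as in stock Apache ODE) that ode_job.ts holds the wake-up time in milliseconds since the epoch; verify the unit and column name against your schema before running it:
-- push every already-due job one minute into the future
UPDATE ode_job
   SET ts = (EXTRACT(EPOCH FROM now()) * 1000)::bigint + 60000
 WHERE ts < (EXTRACT(EPOCH FROM now()) * 1000)::bigint;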
I don't know whether you can change the behaviour described in 2./3. through configuration; I couldn't find any documentation about it. Some code analysis is needed here. To make things worse, WSO2 uses its own Apache ODE branch, so you can't just update the Apache ODE library.
I suspect there can be two reasons for the behaviour described in 2.:
- delayed sleep operations are dropped
- delayed sleep operations are executed right after the restart, but the process definitions aren't loaded yet.