What to do when a Linux one-wire master determines one of its slaves is no longer present? - linux-device-driver

I'm writing a Linux driver for a DS2484 I2C one-wire master, and an accompanying one-wire slave driver for a DS28E84 "DeepCover Radiation-Resistant, High-Capacity, 1-Wire Authenticator". The slaves in our system are hot-pluggable, but only one slave may be attached to a one-wire master at any time. There are multiple masters in our system, so there can be more than one active slave present at a time.
I have written a "search" function in the master driver that successfully detects when a slave has been attached to the system, and that information is properly passed to the "wire" driver, so the correct slave driver is associated with a slave device when the search function detects that a new slave is present.
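For context, here is a minimal sketch of what such a search hook looks like against the mainline w1 API (the .search member of struct w1_bus_master in include/linux/w1.h). The DS2484-specific names here (struct ds2484_data, ds2484_next_rom) are hypothetical placeholders, not the real driver:

```c
/*
 * Hedged sketch of a w1 bus-master search hook, assuming the mainline
 * w1 API.  ds2484_search(), struct ds2484_data and ds2484_next_rom()
 * are illustrative names only.
 */
#include <linux/w1.h>

struct ds2484_data;                                  /* hypothetical driver state */
bool ds2484_next_rom(struct ds2484_data *, u64 *);   /* hypothetical helper */

static void ds2484_search(void *data, struct w1_master *master,
			  u8 search_type, w1_slave_found_callback cb)
{
	struct ds2484_data *pdev = data;
	u64 rn;

	/*
	 * Run the 1-Wire search algorithm on the DS2484 and hand every
	 * 64-bit ROM id found to the wire core.  For an already-known
	 * slave the core sets W1_SLAVE_ACTIVE; for an unknown one it
	 * attaches a new slave device and matches it to a slave driver.
	 */
	while (ds2484_next_rom(pdev, &rn))
		cb(master, rn);
}

static struct w1_bus_master ds2484_bus_master = {
	.search = ds2484_search,
	/* .reset_bus, .touch_bit, etc. elided */
};
```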
What I'm unclear on is how to indicate back to the "wire" driver that an unplugged slave is no longer present. It isn't something the slave device can signal by itself, because slaves can be unplugged without warning at any time. The master can determine when a slave has been unplugged, but I'm not sure how the master driver signals to the "wire" driver that the slave should be removed.
I've tried adding a check to the "search" function that tests whether a previously present device has gone missing, and if so clears the W1_SLAVE_ACTIVE bit in that slave's flags. I was hoping that would trigger the w1_slave_detach() function in the "wire" driver, but it didn't work.

Reading through the code for the "wire" driver, I discovered that it does automatically remove a slave once that slave stops reporting present in the master's search function. However, the removal doesn't necessarily happen the first time the slave is absent from a search. Instead, a counter tracks how many consecutive searches the slave has missed, and the slave has to be missing from more than a certain number of searches before the "wire" driver removes it.
In my case I was able to change my master driver's parameters so that this "time to live" (ttl) value is set to 1 instead of the default, which forced the slave's removal the first time it failed to report present during a search.
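For reference, here is the post-search cleanup in the wire core (paraphrased from drivers/w1/w1.c, locking elided — a sketch, not verbatim kernel code). It shows why clearing W1_SLAVE_ACTIVE alone wasn't enough: the slave's ttl countdown also has to reach zero before w1_slave_detach() is called. In the mainline module the countdown is refilled from the wire module's slave_ttl parameter (default 10 in the kernels I've checked), so loading the core with wire.slave_ttl=1 should achieve the same immediate removal:

```c
/*
 * Paraphrased sketch of drivers/w1/w1.c -- how the wire core decides
 * when to detach an absent slave (mutexes elided for clarity).
 */
static void w1_search_process_cb(struct w1_master *dev, u8 search_type,
				 w1_slave_found_callback cb)
{
	struct w1_slave *sl, *sln;

	/* Assume every known slave is gone... */
	list_for_each_entry(sl, &dev->slist, w1_slave_entry)
		clear_bit(W1_SLAVE_ACTIVE, &sl->flags);

	/* ...then let the master's search hook re-mark the ones that answer. */
	w1_search_devices(dev, search_type, cb);

	list_for_each_entry_safe(sl, sln, &dev->slist, w1_slave_entry) {
		if (!test_bit(W1_SLAVE_ACTIVE, &sl->flags) && !--sl->ttl)
			w1_slave_detach(sl);        /* missed ttl searches in a row */
		else if (test_bit(W1_SLAVE_ACTIVE, &sl->flags))
			sl->ttl = dev->slave_ttl;   /* seen again: refill the countdown */
	}
}
```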

Related

pg_create_logical_replication_slot hanging indefinitely due to old walsender process

I am testing logical replication between two PostgreSQL 11 databases for use in production (I was able to set it up thanks to this answer - PostgreSQL logical replication - create subscription hangs), and it worked well.
Now I am testing the scripts and procedure that would set it up automatically on the production databases, but I am facing a strange problem with logical replication slots.
I had to restart the logical replica due to a settings change that required a restart - which of course could also happen on replicas in the future. But the logical replication slot on the master did not disconnect, and it is still active for a certain PID.
I dropped the subscription on the master (I am still only testing) and tried to repeat the whole process with a new logical replication slot, but I am facing a strange situation.
I cannot create a new logical replication slot with the new name. The process running on the old logical replication slot is still active and shows wait_event_type=Lock and wait_event=transaction.
When I try to use pg_create_logical_replication_slot to create a new logical replication slot I get a similar situation. The new slot is created - I can see it in pg_catalog - but it is marked as active for the PID of the session that issued the command, and the command hangs indefinitely. When I check the processes I can see this command active with the same wait values, Lock/transaction.
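For anyone reproducing this, the slot and walsender state described above can be inspected with the standard catalog views (these columns exist as-is in PostgreSQL 11):

```sql
-- Which backend currently holds each replication slot.
SELECT slot_name, slot_type, active, active_pid
FROM pg_replication_slots;

-- What the walsender processes are waiting on (wait_event_type/wait_event).
SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'walsender';
```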
I tried activating the lock_timeout parameter in postgresql.conf and reloading the configuration, but it did not help.
Killing that old hanging process would most likely bring down the whole postgres instance, because it is a walsender process. It is still visible in the process list with the IP of the replica and the status "idle waiting".
I tried to find some parameter(s) that could force postgres to stop this walsender, but setting wal_keep_segments or wal_sender_timeout did not change anything. I even tried stopping the replica for a longer time - no effect.
Is there some way to handle this situation without restarting the whole postgres instance? Like forcing a timeout for the walsender, or for the transaction lock, etc.?
Because if something like this happened in production, I would not be able to use a restart or any other "brute force". Thanks...
UPDATE:
The walsender process "died out" after some time, but the log does not show anything about it, so I do not know exactly when it happened. I can only guess that it depends on the tcp_keepalives_* parameters. The default on Debian 9 is to keep an idle connection for 2 hours, so I have set these parameters in postgresql.conf and will see how the following tests go.
Strangely enough, today everything works without any problems, and no matter how I try to reproduce yesterday's problems, I cannot. Maybe some network communication problems in the cloud datacenter were involved - we experienced occasional timeouts in connections to other databases too.
So I really do not know the answer, except for "wait until the walsender process on the master dies" - which can most likely be influenced by the tcp_keepalives_* settings. I therefore recommend setting them to reasonable values in postgresql.conf, because the OS defaults are usually far too big.
We actually use this on our big analytical databases (set both in PostgreSQL and in the OS) because of similar problems. Golang and Node.js programs calculating statistics would occasionally fail to recognize that a database session had ended or died, and would hang until the OS closed the connection after 2 hours (the Debian default). All of it always seemed to be connected with network communication problems. With proper tcp_keepalives_* settings, the reaction to such problems is much quicker.
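As a sketch, the settings in question look like this in postgresql.conf. The values are illustrative, not recommendations for every network; a value of 0 means "use the OS default", which is what lets the 2-hour Debian default apply:

```
tcp_keepalives_idle = 60        # seconds of idle time before the first keepalive probe
tcp_keepalives_interval = 10    # seconds between unanswered probes
tcp_keepalives_count = 6        # unanswered probes before the connection is considered dead
```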
After the old walsender process dies on the master, you can repeat all the steps and it should work. So it looks like I just had bad luck yesterday...

Can I configure an EtherCAT slave in software (S/W)?

I am looking at EtherCAT.
I am using embedded Linux.
I have compiled EtherLab and SOEM and confirmed that EtherCAT master functionality is possible.
But I could not find anything about a software (S/W) EtherCAT slave.
First of all, EtherLab only has master functionality.
SOES also requires specific hardware (LAN9252, TWR-K60). (https://github.com/OpenEtherCATsociety/SOES/tree/master/soes/hal)
I would think an EtherCAT slave should also be possible with just an Ethernet port, since that is all an EtherCAT master needs.
Is physical hardware (a dedicated device) unconditionally required for an EtherCAT slave, unlike for the EtherCAT master?
An EtherCAT slave requires a physical ESC (EtherCAT Slave Controller). Unlike the master, which only sends and receives standard Ethernet frames, a slave has to process and forward EtherCAT frames on the fly with deterministic timing; that is done in ESC hardware and cannot be done by a general-purpose Ethernet MAC in software.

master slave interaction in redis

I have a master and a slave configured on different servers. When the master is down, my slave becomes the master and everything seems to work as it should. BUT when the master is recovered, I can't get any keys from the current master (which was the slave at first).
Any help?
Thanks
What probably happens is that the master recovers without reloading its data properly, and the slave then syncs with its master, resetting all of its data.
A better practice would be to either:
If the master is down, treat the slave as a read-only node, not adding any data to it, and make sure the master recovers all of its data properly. This will mean no inconsistencies caused by the downtime. This is of course only an option if you can afford read-only operation.
Or - when you fail over to the slave, treat it as the new master, and when the old master comes back up, it MUST become a slave and not assume its former role. Redis Sentinel does this for you automatically (see the sketch below).
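A minimal Sentinel configuration sketch, assuming a master named "mymaster" at an illustrative address and a quorum of 2. With this in place, Sentinel promotes the slave when the master fails and reconfigures the old master as a slave of the new one when it comes back:

```
sentinel monitor mymaster 192.168.1.10 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
```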

MongoDB share-nothing slaves

I'd like to use mongodb to distribute a cached database to some distributed worker nodes I'll be firing up in EC2 on demand. When a node goes up, a local copy of mongo should connect to a master copy of the database (say, mongomaster.mycompany.com) and pull down a fresh copy of the database. It should continue to replicate changes from the master until the node is shut down and released from the pool.
The requirements are that the master need not know about each individual slave being fired up, nor should the slave have any knowledge of other nodes outside the master (mongomaster.mycompany.com).
The slave should be read only, the master will be the only node accepting writes (and never from one of these ec2 nodes).
I've looked into replica sets, and this doesn't seem to be possible. I've done something similar to this before with a master/slave setup, but it was unreliable. The master/slave replication was prone to sudden catastrophic failure.
Regarding replica sets: while I don't imagine you could have a set member that is invisible to the primary (and the other nodes), given the need for replication, you can tailor a particular node to come pretty close to what you want:
Set the newly-launched node to priority 0 (meaning it cannot become primary)
Set the newly-launched node to 'hidden'
See the MongoDB documentation for more info on priority 0 and hidden nodes.
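A sketch of that reconfiguration from the mongo shell; the hostname and member _id are illustrative:

```
// Run against the primary; rs.conf()/rs.reconfig() are the standard helpers.
cfg = rs.conf()
cfg.members.push({
    _id: 3,                                   // next unused member id
    host: "ec2-worker.mycompany.com:27017",   // the newly fired-up EC2 node
    priority: 0,                              // can never be elected primary
    hidden: true                              // invisible to clients, still replicates
})
rs.reconfig(cfg)
```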

Replica set never finishes cloning primary node

We're working with an average-sized (50 GB) data set in MongoDB and are attempting to add a third node to our replica set (making it primary-secondary-secondary). Unfortunately, when we bring the nodes up (with the appropriate command-line arguments associating them with our replica set), the nodes never exit the RECOVERING state.
Looking at the logs, it seems as though the nodes ditch all of their data as soon as the recovery completes and start syncing again.
We're using version 2.0.3 on all of the nodes and have tried adding the third node both from a "clean" (empty db) state and from a bootstrapped state (using mongodump to take a snapshot of the primary database and mongorestore to load that snapshot into the new node); both attempts failed.
We've observed this recurring phenomenon over the past 24 hours and any input/guidance would be appreciated!
It's hard to be certain without looking at the logs, but it sounds like you're hitting a known issue in MongoDB 2.0.3. Check out http://jira.mongodb.org/browse/SERVER-5177. The problem is fixed in 2.0.4, which has a release candidate available.
I don't know if it helps, but when I had that problem, I erased the replica's DB and re-initiated it. It started from scratch and replicated OK. Worth a try, I guess.