How to handle WebSocket-dependent data on server crash?


I'm using a Postgres database to maintain a list of rooms and the users connected to each room.
Users can enter a room whenever they want, but they should leave the room when they close the browser.
A good flow of events would be:
User enters room (user's room var is set) -> ... -> User disconnects
and server notices (user's room var is unset)
But what if this happens?
User enters room (user's room var is set) -> ... -> Server crashes or
shuts down for updates -> User disconnects and server doesn't notice
(user's room var is still set) -> Server is back on
In this last case, the database state is already broken. What's the best way to deal with something like this? Thanks

Let's divide the answer into 2 aspects:
User Aspect:
Regardless of the language at hand, you should be made aware of disconnections through socket event or exception handling.
If the server crashes, your user will experience an abrupt socket disconnection, connection close, or session termination, depending on which framework you are using. TCP sockets also have keepalive (SO_KEEPALIVE) exactly for that; you can usually control these (or similar) settings from the high-level protocol.
So all you need to do in that case is run maintenance code on the user's end (unset a variable, in the case you describe).
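As a minimal sketch of the keepalive setting mentioned above, this is roughly how it looks on a raw TCP socket in Python. The helper name `enable_keepalive` and the timing values are assumptions for illustration; a WebSocket framework will usually expose its own ping/timeout settings instead.

```python
import socket

def enable_keepalive(sock, idle_s=30, interval_s=10, probes=3):
    """Turn on TCP keepalive so a dead peer is eventually reported as an error."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs are platform-specific, so set them only where defined.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```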
Server Aspect:
It's a bit trickier here. What you are basically looking for is ephemeral state management, meaning the ability to react to abrupt service/server termination (server crashes that leave a corrupted/unclean state) and clean up after it.
For that, technologies like ZooKeeper or Consul exist. I personally recommend ZooKeeper, as I have built similar solutions on top of it several times in the past.
With ZooKeeper, when your server starts up, it can, for instance, create an EPHEMERAL node. That node is created once the server goes up and remains there for as long as the server is alive and connected to the ZooKeeper cluster. If the server crashes unexpectedly, the node is removed once its ZooKeeper session expires.
You can then have a separate application/script that listens to events on that ZooKeeper node/path. If the node is suddenly removed, you can run a cleanup routine on the database.
This approach supports multiple app instances, of course: you can listen for events under a path and have all server instances register using different nodes under it. The removed node can contain instance-specific identifiers, and you can use those to clean up that specific instance's state from the database.
It can also be a wise choice to move clean-up/maintenance duty to a separate component.
(Note that ZooKeeper requires careful attention when dealing with connection/state events.)
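The cleanup routine itself can be quite small. Below is a sketch only, under some loud assumptions: the `users` table, its `room` and `server_id` columns, and the `cleanup_instance` helper are all hypothetical, and an in-memory SQLite database stands in for Postgres so the example runs on its own. How the dead instance is detected (for example a ZooKeeper watch) is deliberately left out.

```python
import sqlite3

def cleanup_instance(conn, server_id):
    """Clear the room of every user whose connection was owned by a dead instance."""
    cur = conn.execute(
        "UPDATE users SET room = NULL, server_id = NULL WHERE server_id = ?",
        (server_id,),
    )
    conn.commit()
    return cur.rowcount  # number of stale rows repaired

# Assumed schema: each user row records which server instance holds the socket.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY, room TEXT, server_id TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [("alice", "lobby", "srv-1"), ("bob", "lobby", "srv-2"), ("carol", "chess", "srv-1")],
)

# Pretend the watcher just reported that instance "srv-1" disappeared.
repaired = cleanup_instance(conn, "srv-1")
still_in_rooms = [row[0] for row in conn.execute(
    "SELECT name FROM users WHERE room IS NOT NULL ORDER BY name")]
```

With a single server instance, the same function could probably be run once at startup for the instance's own identifier, before any new connection is accepted.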
Final Thoughts:
Of course the answer can be fine-tuned based on specific needs that were not presented in the question.
When building complex, stateful solutions, I personally aim to deal with crashes on all ends of the solution, playing 'safe' where possible.

Related

how to design a realtime database update system?

I am designing a whatsapp like messenger application for the desktop using WPF and .Net. Now, when a user creates a group I want other members of the group to receive a notification that they were added to a group. My frontend is built in C#.Net, which is connected to a RESTful Webservice (Ruby on Rails). I am using Postgres for the database. I also have a Redis layer to cache my rails models.
I am considering the following options.
1) Use Postgres's inbuilt NOTIFY/LISTEN mechanism which the clients can subscribe to directly. I foresee two issues here
i) Postgres might not be able to handle 10000's of clients subscribed directly.
ii) There is no guarantee of delivery if the client is disconnected
2) Use Redis' Pub/Sub mechanism to which the clients can subscribe. I am still concerned with no guarantee of delivery here.
3) Use a message queue like RabbitMQ. The producer for this queue will be Postgres, which will push messages through triggers. The consumers will of course be the .NET clients.
So far, I am inclined to use the 3rd option.
Does anyone have any suggestions how to design this?

In an application like WhatsApp itself, the client running in your phone is an integral part of a large and complex event-based, distributed system.
Without more context, it would be impossible to point in the right direction. That said:
For option 1: You seem to imply that each client, as in a WhatsApp client, would directly (or through some web service) communicate with Postgres as an event bus, which is not sound and would not scale because you can only have ONE Postgres instance.
For option 2: You have the same problem as in option 1, with worse failure modes.
For option 3: RabbitMQ seems like a reasonable ally here. It is distributed in nature and scales well. As a matter of fact, it runs on Erlang, just as most of WhatsApp does. Using triggers inside Postgres to publish messages, however, does not make a lot of sense.
You need a message bus because you would have lots of updates to do in the background, not to directly connect your users to each other. As you said, clients can be offline.
Architecture is more about deferring decisions than taking them.
I suggest that you start simple. Build a small, monolithic, synchronous system first, pushing updates as persisted data to all the involved users. For example, in a group of n users, just write n records to a table. It is already complicated to reliably keep track of who has received and read what.
These heavy "group" updates can then be moved to long-running processes using RabbitMQ or the like, but a system with several thousand users can very well work without such a thing, especially because a simple message from user A to user B would not need many writes.
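To make the "n records for n users" idea concrete, here is a rough sketch. The `notifications` table and the `notify_group` and `fetch_pending` helpers are hypothetical names, and in-memory SQLite stands in for Postgres purely so the snippet is self-contained.

```python
import sqlite3

def notify_group(conn, group_id, member_ids, text):
    """Persist one notification row per member (n members -> n rows)."""
    conn.executemany(
        "INSERT INTO notifications (user_id, group_id, body, delivered) "
        "VALUES (?, ?, ?, 0)",
        [(uid, group_id, text) for uid in member_ids],
    )
    conn.commit()

def fetch_pending(conn, user_id):
    """Return undelivered notifications for a user and mark them delivered."""
    rows = conn.execute(
        "SELECT id, body FROM notifications "
        "WHERE user_id = ? AND delivered = 0 ORDER BY id",
        (user_id,),
    ).fetchall()
    conn.executemany("UPDATE notifications SET delivered = 1 WHERE id = ?",
                     [(row[0],) for row in rows])
    conn.commit()
    return [body for _, body in rows]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE notifications (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT, group_id TEXT, body TEXT, delivered INTEGER)""")

notify_group(conn, "g1", ["ann", "ben", "cy"], "You were added to group g1")
total_rows = conn.execute("SELECT COUNT(*) FROM notifications").fetchone()[0]
first = fetch_pending(conn, "ben")   # ben comes online and polls
second = fetch_pending(conn, "ben")  # nothing new the second time
```

Because the rows are persisted, a client that was offline when the group was created still finds its notification on the next poll.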

Reference counted Pub/Sub system

I am searching for a way to design my system that consists of multiple publishers, multiple channels and multiple subscribers, all of which can be uniquely identified easily.
I need to send messages in both directions, with as low a latency as possible. However, if a subscriber dies, the messages it subscribed to should not be dropped; when it comes back online, it should receive all pending messages. Since I deal with very high numbers of messages (up to 1000 per second on a regular basis) on a low-spec server, keeping lists of all messages at all times is not an option.
I was considering if a reference count/list for messages is a viable option. When a message is published, it is initialized with a list of subscribers to that specific channel, when a subscriber receives the message, the subscriber is removed from the list. The message is removed if the list is empty.
Now, if a subscriber dies without unsubscribing, the messages will not be removed because the list of missing subscribers is not empty. When it comes back online, it will be able to receive the list of all pending messages, since it identifies with the same ID as the dead instance.
Perhaps it would be required to have messages/subscribers time out, for example if a subscriber has been inactive for 10 minutes, all list entries containing it are cleared.
Is that a good idea? Have I overlooked problems that could arise with this system in particular? Is there any system that already does this? RabbitMQ and similar Pub/Sub systems don't seem to have this. If not, I guess Redis is the way to go?
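For reference, the scheme described in the question could look roughly like the in-memory sketch below. The `PendingStore` class and its method names are invented for illustration, and persistence, ordering guarantees and concurrency are all ignored.

```python
import time

class PendingStore:
    """Keeps each message until every subscriber of its channel has acknowledged it."""

    def __init__(self, timeout_s=600.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.subscribers = {}  # channel -> set of subscriber ids
        self.last_seen = {}    # subscriber id -> time of last activity
        self.messages = {}     # message id -> (channel, payload, waiting set)
        self.next_id = 0

    def subscribe(self, channel, sub_id):
        self.subscribers.setdefault(channel, set()).add(sub_id)
        self.last_seen[sub_id] = self.clock()

    def publish(self, channel, payload):
        waiting = set(self.subscribers.get(channel, ()))
        if not waiting:
            return None  # nobody is interested, so nothing is stored
        self.next_id += 1
        self.messages[self.next_id] = (channel, payload, waiting)
        return self.next_id

    def pending_for(self, sub_id):
        """What a reconnecting subscriber with the same id still has to receive."""
        self.last_seen[sub_id] = self.clock()
        return [(mid, payload)
                for mid, (_, payload, waiting) in sorted(self.messages.items())
                if sub_id in waiting]

    def ack(self, sub_id, msg_id):
        self.last_seen[sub_id] = self.clock()
        entry = self.messages.get(msg_id)
        if entry is None:
            return
        entry[2].discard(sub_id)
        if not entry[2]:  # the last reference is gone: drop the message
            del self.messages[msg_id]

    def expire_inactive(self):
        """Forget subscribers that were silent for longer than the timeout."""
        now = self.clock()
        dead = {s for s, seen in self.last_seen.items() if now - seen > self.timeout_s}
        for subs in self.subscribers.values():
            subs -= dead
        for msg_id in list(self.messages):
            waiting = self.messages[msg_id][2]
            waiting -= dead
            if not waiting:
                del self.messages[msg_id]
        for sub_id in dead:
            del self.last_seen[sub_id]
        return dead

now = [0.0]  # fake clock so the timeout can be exercised without sleeping
store = PendingStore(timeout_s=600.0, clock=lambda: now[0])
store.subscribe("news", "a")
store.subscribe("news", "b")
m1 = store.publish("news", "hello")
store.ack("a", m1)                   # "b" is offline, so the message is kept
kept_for_b = store.pending_for("b")  # "b" returns under the same id
store.ack("b", m1)
left_after_acks = len(store.messages)

m2 = store.publish("news", "again")
now[0] = 700.0
store.ack("a", m2)                   # "a" is active, "b" has been silent for 700 s
expired = store.expire_inactive()
left_after_expiry = len(store.messages)
```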

I can imagine managing a reference count for the purposes of message lifecycles. This sounds reasonable in terms of message and memory management during normal service operation. Of course, timeouts provide a patch for references held by dead services.
However, in terms of health monitoring and service recovery, this is quite another story.
The danger that I currently see here is state management. Imagine a service that is a stateful subscriber (i.e. has a state machine) that is driven from its initial state (I) to a certain state (S). Each message is processed differently in different states. Now imagine that your service dies and gets restarted. Meanwhile some messages are stored, and after the service is back online they are dispatched to it. However, the service receives them in the wrong state (I instead of S) and acts unexpectedly.
Can you restore the service to the exact state it was in when it crashed? In practice, this is extremely difficult, since even in the state machine approach the service has side effects, communicates with global state(s), etc.
Bottom line: reference counting seems reasonable in terms of managing messages, but mixing it with health monitoring results in a lot of complexity.

Should I use separate connections for Pub and Sub with Redis?

I have noticed that Socket.io uses two separate connections to the Redis server, one for Pub and one for Sub. Is it something that could improve performance? Or is it purely a move towards more organized event handlers and code? What are the benefits and drawbacks of two separate connections versus a single connection for publishing and subscribing?
P.S. The system is pushing about an equal number of messages as it is receiving. It pushes updates to the servers, which are on the same level in the hierarchy, so there is no master pushing all of the updates and no slave consuming the messages. One server would have about 4-8 subscriptions and it will send the messages back to these servers.
P.P.S. Is this more of a job for a purpose-built job queue? The reason I am looking at Redis is that I am already keeping some shared objects in it, which are used by all servers. Is a message queue worth adding yet another network connection?

You are required to use two connections for pub and sub. A subscriber connection cannot issue any commands other than subscribe, psubscribe, unsubscribe, punsubscribe (although @antirez has hinted at a subscriber-safe ping in the future). If you try to do anything else, Redis tells you:
-ERR only (P)SUBSCRIBE / (P)UNSUBSCRIBE / QUIT allowed in this context
(note that you can't test this with redis-cli, since that understands the protocol well enough to prevent you from issuing commands once you have subscribed - but any other basic socket tool should work fine)
This is because subscriber connections work very differently - rather than working on a request/response basis, incoming messages can now come in at any time, unsolicited.
publish is a regular request/response command, so must be sent on a regular connection, not a subscriber connection.
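The rule above can be illustrated with a toy model. To be clear, `ToyConnection` is not a Redis client and talks to no server; it only mimics the restriction described in this answer, with simplified replies, to show why the publisher needs a connection of its own.

```python
SUBSCRIBER_COMMANDS = {"SUBSCRIBE", "PSUBSCRIBE", "UNSUBSCRIBE", "PUNSUBSCRIBE", "QUIT"}

class ToyConnection:
    """Toy model of one client connection; replies are simplified."""

    def __init__(self):
        self.subscriptions = set()  # channels and patterns, lumped together

    def send(self, command, *args):
        name = command.upper()
        # Once subscribed, the connection only accepts subscription commands.
        if self.subscriptions and name not in SUBSCRIBER_COMMANDS:
            return ("-ERR only (P)SUBSCRIBE / (P)UNSUBSCRIBE / QUIT "
                    "allowed in this context")
        if name in ("SUBSCRIBE", "PSUBSCRIBE"):
            self.subscriptions.update(args)
        elif name in ("UNSUBSCRIBE", "PUNSUBSCRIBE"):
            # No arguments means "drop everything" in this toy model.
            self.subscriptions.difference_update(args or set(self.subscriptions))
        return "+OK"

sub = ToyConnection()  # dedicated to receiving
pub = ToyConnection()  # dedicated to regular request/response commands

sub.send("SUBSCRIBE", "updates")
rejected = sub.send("PUBLISH", "updates", "hi")  # wrong connection
accepted = pub.send("PUBLISH", "updates", "hi")  # right connection
```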

Asterisk HA and SIP registration

I set up an Active/Passive cluster with Pacemaker/Corosync/DRBD. I wanted to make an Asterisk server HA. The solution works perfectly, but when the service fails on one server and starts on the other, all SIP clients registered with the active server are lost, and the passive server shows nothing in the output of:
sip show peers
until clients make a call or register again. One solution is to set the registration interval on clients to 1 minute or so. Are there other options? For example, would integrating Asterisk with a DBMS help to save this kind of state in a DB?

First of all, building clusters without expertise is a bad idea.
You can use the realtime SIP architecture, which saves state in a database. Complexity: average. Note that "sip show peers" also shows nothing for realtime peers.
You can use a memory-duplicating cluster (some solutions exist for Xen), which copies the memory state from one server to the other. Complexity: very high.

Scala + Akka: How to develop a Multi-Machine Highly Available Cluster

We're developing a server system in Scala + Akka for a game that will serve clients in Android, iPhone, and Second Life. There are parts of this server that need to be highly available, running on multiple machines. If one of those servers dies (of, say, hardware failure), the system needs to keep running. I think I want the clients to have a list of machines they will try to connect with, similar to how Cassandra works.
The multi-node examples I've seen so far with Akka seem to me to be centered around the idea of scalability, rather than high availability (at least with regard to hardware). The multi-node examples seem to always have a single point of failure. For example there are load balancers, but if I need to reboot one of the machines that have load balancers, my system will suffer some downtime.
Are there any examples that show this type of hardware fault tolerance for Akka? Or, do you have any thoughts on good ways to make this happen?
So far, the best answer I've been able to come up with is to study the Erlang OTP docs, meditate on them, and try to figure out how to put my system together using the building blocks available in Akka.
But if there are resources, examples, or ideas on how to share state between multiple machines in a way that if one of them goes down things keep running, I'd sure appreciate them, because I'm concerned I might be re-inventing the wheel here. Maybe there is a multi-node STM container that automatically keeps the shared state in sync across multiple nodes? Or maybe this is so easy to make that the documentation doesn't bother showing examples of how to do it, or perhaps I haven't been thorough enough in my research and experimentation yet. Any thoughts or ideas will be appreciated.

HA and load management is a very important aspect of scalability and is available as a part of the AkkaSource commercial offering.

If you're already listing multiple potential hosts in your clients, then the clients can effectively become the load balancers.
You could offer a host suggestion service that recommends to the client which machine it should connect to (based on current load, or whatever); the client can then pin to that host until the connection fails.
If the host suggestion service is not there, the client can simply pick a random host from its internal list, trying hosts until it connects.
Ideally, on first start-up the client will connect to the host suggestion service and not only get directed to an appropriate host, but also receive a list of other potential hosts. This list can routinely be updated every time the client connects.
If the host suggestion service is down on the client's first attempt (unlikely, but...), you can pre-deploy a list of hosts in the client install so it can start randomly selecting hosts from the very beginning if it has to.
Make sure that your list of hosts contains actual host names and not IPs; that gives you more flexibility long term (i.e. you'll "always have" host1.example.com, host2.example.com, etc., even if you move infrastructure and change IPs).
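A sketch of the client-side logic described above might look like this. Everything named here (`connect_with_fallback`, the `suggest` and `try_connect` callables, the host names) is a placeholder; real code would do actual network I/O and persist the refreshed list.

```python
import random

def connect_with_fallback(known_hosts, try_connect, suggest=None, rng=random):
    """Return (connected_host_or_None, host_list_to_remember)."""
    hosts = list(known_hosts)
    preferred = None
    if suggest is not None:
        try:
            preferred, fresh_hosts = suggest()
            hosts = list(fresh_hosts)  # refresh the stored list on every contact
        except OSError:
            preferred = None           # service down: fall back to the stored list
    others = [h for h in hosts if h != preferred]
    rng.shuffle(others)                # random order spreads clients across hosts
    ordered = ([preferred] if preferred else []) + others
    for host in ordered:
        if try_connect(host):
            return host, hosts
    return None, hosts

# List shipped with the client install (host names, not IPs).
bundled = ["host1.example.com", "host2.example.com", "host3.example.com"]
alive = {"host2.example.com", "host3.example.com"}

def try_connect(host):
    return host in alive  # stand-in for a real connection attempt

def service_down():
    raise ConnectionError("suggestion service unreachable")

def service_up():
    return "host3.example.com", bundled

fallback_host, _ = connect_with_fallback(bundled, try_connect, service_down)
suggested_host, remembered = connect_with_fallback(bundled, try_connect, service_up)
```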

You could take a look at how RedDwarf and its fork DimDwarf are built. They are both horizontally scalable, crash-only game app servers, and DimDwarf is partly written in Scala (new messaging functionality). Their approach and architecture should match your needs quite well :)

2 cents..
"how to share state between multiple machines in a way that if one of them goes down things keep running"
Don't share state between machines; instead, partition state across machines. I don't know your domain, so I don't know if this will work. But essentially, if you assign certain aggregates (in DDD terms) to certain nodes, you can keep those aggregates in memory (actor, agent, etc.) while they are being used. In order to do this you will need to use something like ZooKeeper to coordinate which nodes handle which aggregates. In the event of failure you can bring the aggregate up on a different node.
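One possible way to decide which node handles which aggregate is sketched below. Note that this swaps in rendezvous (highest-random-weight) hashing, which is my own choice for the illustration and not something this answer prescribes; the list of live nodes is assumed to come from whatever coordination service is used, and the `owner` function and node names are made up.

```python
import hashlib

def owner(aggregate_id, live_nodes):
    """Pick the node for an aggregate by rendezvous (highest-random-weight) hashing."""
    def weight(node):
        digest = hashlib.sha256(f"{node}|{aggregate_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(sorted(live_nodes), key=weight)

nodes = ["node-a", "node-b", "node-c"]
aggregates = [f"player-{i}" for i in range(20)]

before = {a: owner(a, nodes) for a in aggregates}

# "node-b" fails; only the aggregates it owned need to be brought up elsewhere.
survivors = [n for n in nodes if n != "node-b"]
after = {a: owner(a, survivors) for a in aggregates}
moved = [a for a in aggregates if before[a] != after[a]]
```

The useful property here is that aggregates living on healthy nodes stay where they are when another node disappears.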
Furthermore, if you use an event sourcing model to build your aggregates, it becomes almost trivial to have real-time copies (slaves) of your aggregate on other nodes, with those nodes listening for events and maintaining their own copies.
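As a minimal, single-process sketch of that idea: the `Inventory` aggregate, the event tuples and the `record` function below are hypothetical, and the "replica" simply receives the same events in the same process rather than over the network.

```python
class Inventory:
    """Hypothetical aggregate whose state is derived only from applied events."""

    def __init__(self):
        self.items = {}
        self.version = 0

    def apply(self, event):
        kind, item, qty = event
        if kind == "added":
            self.items[item] = self.items.get(item, 0) + qty
        elif kind == "removed":
            self.items[item] = self.items.get(item, 0) - qty
        self.version += 1

log = []               # append-only event log
primary = Inventory()
replica = Inventory()  # copy maintained on another node

def record(event):
    log.append(event)
    primary.apply(event)
    replica.apply(event)  # in a real system this would arrive as a message

record(("added", "sword", 2))
record(("added", "potion", 5))
record(("removed", "potion", 1))

# After a crash, a fresh node can rebuild the aggregate by replaying the log.
recovered = Inventory()
for event in log:
    recovered.apply(event)
```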
By using Akka, we get remoting between nodes almost for free. This means that whichever node handles a request that might need to interact with an Aggregate/Entity on another node can do so with RemoteActors.
What I have outlined here is very general but gives an approach to distributed fault-tolerance with Akka and ZooKeeper. It may or may not help. I hope it does.
All the best,
Andy
