How to balance MUC processes in an ejabberd cluster - xmpp

We have an ejabberd cluster with about 700,000 persistent chat rooms. The problem we are facing is that starting the first node of the cluster takes about 1 hour as the Erlang processes for all the rooms are started on that node.
Even after the cluster is initialized and we have, say, 4 nodes running, the MUC processes aren't balanced across the nodes. There can be a node using 90% of its memory and another node using 5%.
Is there a way to start multiple nodes of a cluster at once so that the MUC load is spread evenly from the beginning and the startup is faster?
Can anyone suggest a solution for balancing the MUC processes between the cluster nodes?
The way it works now is obviously not scalable, because as the number of rooms grows, we need more and more RAM on the first node that is started in the cluster and also, the startup time increases.
Many thanks,
Alex

In current ejabberd Community Edition there is no such feature. You would need a customized MUC module to match your specific requirement for a large number of rooms. This involves more than load balancing MUC across the cluster: at that scale you need to aggressively optimize for RAM, at the expense of storage and CPU.

You can load MUC rooms dynamically: start a room only when it is first needed, using its stored options (for a persistent room, this means starting it with its saved configuration).
Using this mechanism, both problems are resolved:
1) ejabberd startup time (no MUC rooms are started at boot, since they are started on demand).
2) Load balancing (if the first request for a room goes to another node, the room is created there).
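
The on-demand idea can be sketched in a few lines. This is only a toy model in Python, not ejabberd code: `LazyRoomRegistry`, `stored_config`, and the room JIDs are hypothetical names made up for illustration, and a real cluster would also need cluster-wide routing so that a room is never started on two nodes at once.

```python
from typing import Callable, Dict


class LazyRoomRegistry:
    """Starts a room only when it is first requested (hypothetical model, not ejabberd code)."""

    def __init__(self, node: str, load_config: Callable[[str], dict]) -> None:
        self.node = node
        self._load_config = load_config      # reads the persisted room options from storage
        self._running: Dict[str, dict] = {}  # rooms currently started on this node

    def get_room(self, room_jid: str) -> dict:
        room = self._running.get(room_jid)
        if room is None:
            # First request for this room on this node: start it with its stored options.
            room = {"jid": room_jid, "node": self.node, "config": self._load_config(room_jid)}
            self._running[room_jid] = room
        return room

    def running_count(self) -> int:
        return len(self._running)


def stored_config(room_jid: str) -> dict:
    # Stand-in for a database lookup of the persisted room options.
    return {"persistent": True, "title": room_jid.split("@")[0]}


node_a = LazyRoomRegistry("node_a", stored_config)
node_b = LazyRoomRegistry("node_b", stored_config)

# Nothing is started at boot, so startup cost does not depend on the number of rooms.
assert node_a.running_count() == 0

node_a.get_room("dev@conference.example.com")
node_b.get_room("ops@conference.example.com")
node_a.get_room("dev@conference.example.com")  # already running, not started again
print(node_a.running_count(), node_b.running_count())
```

Each node ends up holding only the rooms whose first request it received, which is where the balancing effect comes from.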

Related

ActiveMQ Artemis cluster failover questions

I have a question in regards to Apache Artemis clustering with message grouping. This is also done in Kubernetes.
The current setup I have is 4 master nodes and 1 slave node. Node 0 is dedicated as LOCAL to handle message grouping and node 1 is the dedicated backup to node 0. Nodes 2-4 are REMOTE master nodes without backup nodes.
I've noticed that clients connected to nodes 2-4 are not failing over to the 3 other master nodes available when the connected Artemis node goes down, essentially not discovering the other nodes. Even after the original node comes back up, the client continues to fail to establish a connection. I've seen from a separate Stack Overflow post that master-to-master failover is not supported. Does this mean that for every master node I need to create a slave node as well to handle the failover? Would this cause a two-instance point of failure instead of however many nodes are within the cluster?
On a separate basic test using a cluster of two nodes with one master and one slave, I've observed that when I bring down the master node that clients are connected to, the clients don't fail over to the slave node. Any ideas why?
As you note in your question, failover is only supported between a live and a backup. Therefore, if you wanted failover for clients which were connected to nodes 2-4 then those nodes would need backups. This is described in more detail in the ActiveMQ Artemis documentation.
It's worth noting that clustering and message grouping, while technically possible, is a somewhat odd pairing. Clustering is a way to improve overall message throughput using horizontal scaling. However, message grouping naturally serializes message consumption for each group (to maintain message order) which then decreases overall message throughput (perhaps severely depending on the use-case). A single ActiveMQ Artemis node can potentially handle millions of messages per second. It may be that you don't need the increased message throughput of a cluster since you're grouping messages.
I've often seen users simply assume they need a cluster to deal with their expected load without actually conducting any performance benchmarking. This can potentially lead to higher costs for development, testing, administration, and (especially) hardware, and in some use-cases it can actually yield worse performance. Please ensure you've thoroughly benchmarked your application and broker architecture to confirm the proposed design.

Distribute children elements or the top-level parent elements among monitoring hosts

I've a monitoring system that periodically heartbeats compute resources assigned to my company's customers.
By default, each customer is assigned one physical server, but some of the bigger customers are assigned multiple physical servers.
The monitoring fleet has a bunch of servers running monitoring software. As the number of customers with more than one physical server is increasing, I am worried about uneven distribution of load, noisy neighbor problems, etc.
However, I am unsure which of the following approaches I should adopt:
1) Distribute individual servers among the monitoring hosts. As a result, servers assigned to a single customer may be monitored by different hosts. Note that we don't save any server state in the RAM of the monitoring host, so this distribution will not affect the correctness of my monitoring logic.
2) Distribute individual clusters among the monitoring hosts and use a load balancing algorithm with weight = size of clusters.
Any help is appreciated.
[Note: I am sorry if this is not the right place for this question]
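
For reference, the second option (whole clusters, weighted by size) can be sketched with a simple greedy rule: sort clusters from largest to smallest and always give the next one to the least-loaded monitoring host. This is one possible heuristic rather than anything the question prescribes, and the function name, customer names, and host names below are made up for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def assign_clusters(cluster_sizes: Dict[str, int], hosts: List[str]) -> Dict[str, List[str]]:
    """Greedy placement: the largest remaining cluster goes to the least-loaded host."""
    heap: List[Tuple[int, str]] = [(0, host) for host in hosts]  # (current load, host)
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {host: [] for host in hosts}
    # Largest clusters first; ties broken by name so the result is deterministic.
    for customer, size in sorted(cluster_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, host = heapq.heappop(heap)
        assignment[host].append(customer)
        heapq.heappush(heap, (load + size, host))
    return assignment


sizes = {"a": 5, "b": 3, "c": 3, "d": 1, "e": 1, "f": 1}  # servers per customer (hypothetical)
plan = assign_clusters(sizes, ["m1", "m2"])
loads = {host: sum(sizes[c] for c in customers) for host, customers in plan.items()}
print(plan)
print(loads)  # m1 and m2 each end up with a total weight of 7
```

With the first option (individual servers), every item has weight 1, so the same rule reduces to plain round-robin-style balancing.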

Apache Geode scaling

I'm trying to measure the performance of Geode
I have 3 identical hosts to test it.
I created a partitioned region.
I started a geode cluster with one server.
I do "get" and "put" operations in the loop.
I get about 50000 op/sec.
Then I started a cluster with three Geode nodes.
I do get and put operations in the loop.
I get the same 50000 op/sec.
I would expect to see increased performance, but it is surprisingly the same for the 1-node cluster and the 3-node cluster.
Could you please help? What settings can I change in order to get horizontal scalability?
Thank you.
Well, you just got horizontal scalability for data storage at no loss of throughput :)
As for horizontally scaling your throughput, I think your workload was not enough to max out the server. You need to start multiple clients (or threads in a single client) against a single server until adding new clients no longer increases throughput. At that point, start a new server. The new server should allow you to add more clients and horizontally scale your throughput.
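
A rough way to find that saturation point is to repeat the same loop with an increasing number of client threads and watch where the operations per second stop growing. The sketch below is plain Python and does not use the Geode client API: `fake_put_get` is a stand-in that just sleeps to imitate a network round trip, and you would replace it with your real put/get calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(op: Callable[[int], None], threads: int, ops_per_thread: int) -> float:
    """Run `op` from several threads at once and return total operations per second."""

    def worker(worker_id: int) -> None:
        for i in range(ops_per_thread):
            op(worker_id * ops_per_thread + i)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(worker, range(threads)))  # wait for all workers, surface exceptions
    elapsed = time.perf_counter() - start
    return threads * ops_per_thread / elapsed


def fake_put_get(key: int) -> None:
    # Stand-in for one put/get round trip to the server (assumption, not a Geode call).
    time.sleep(0.001)


for n in (1, 2, 4, 8):
    print(n, "threads:", round(measure_throughput(fake_put_get, n, 50)), "op/sec")
```

A single loop in one thread mostly measures round-trip latency, which is why one server and three servers can report the same number.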
You may find the ycsb benchmark useful, which allows you to start multiple threads in a client to perform operations.
You should set up an environment where you see a performance decrease with a single node, and then run the same test against the partitioned cluster.

Apache Zookeeper: distribution of nodes across data centers

I am working on a brand new SolrCloud - ZooKeeper infrastructure.
Some background information:
all other services (mostly web site infrastructure) are distributed across two data centers, with active-active configurations.
at the network level, the servers are setup on extended LANs, with dark fibre across the data centers. So latency is at a minimum.
the SolrCloud - ZooKeeper infrastructure will be used by most of these applications.
I got a SolrCloud, and a ZooKeeper ensemble running. Implementation at this level is fine.
But I wonder how to distribute my ZooKeeper servers. I must have an odd number of servers, but I only have two data centers. If one fails, I have a 50-50 chance that I will lose majority.
What should I do? So far I have thought of:
requesting a third data center (not likely to happen, $$$!)
host two per data center and two on an external cloud provider (Amazon or ...?). Again $$$
set up an odd number at data center 1 and use an observer on site 2. What then happens if site 1 fails? Can SolrCloud work with only one observer?
If your requirement is to serve all search requests from the local data center (the one where the request originated), then you don't need to go for a cross data center ZooKeeper deployment.
A cross data center ZooKeeper deployment is only needed to survive a DC crash (which is unlikely to happen, and that's why you pay $$$$), so in that case there isn't any need to spawn a ZooKeeper cluster in multiple data centers.
I got a third site to host the other ZooKeeper instance. This site is another office of my company, not a "full data center". So each site has one ZooKeeper instance.
What allowed me to have one cluster spread over three data centers was that they are close enough together to get a dark fiber between them. The latency is very low and does not impact ZooKeeper performance.
Then for Solr, I have full replicas in the two main data centers. The third office only hosts a ZooKeeper instance for quorum. Using full replicas, I have all the data in each data center. If my Solr needs grow later, I will shard, but for now our index is small.
It has proven solid for four years now, with one failure. And it was at the third office, not in a data center.
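
The reason the third site matters comes down to majority arithmetic: the ensemble keeps working only while more than half of its voting servers can still reach each other. A small sketch, assuming every listed server is a voting member (observers do not count) and using made-up site names:

```python
from typing import Dict


def survives_site_loss(layout: Dict[str, int]) -> Dict[str, bool]:
    """For each site, report whether the remaining voters still form a majority."""
    total = sum(layout.values())
    quorum = total // 2 + 1  # strict majority of all voting servers
    return {site: (total - count) >= quorum for site, count in layout.items()}


# Two data centers: losing the larger one always loses quorum.
print(survives_site_loss({"dc1": 3, "dc2": 2}))
# -> {'dc1': False, 'dc2': True}

# One voter per site across three sites: any single site can fail.
print(survives_site_loss({"dc1": 1, "dc2": 1, "office": 1}))
# -> {'dc1': True, 'dc2': True, 'office': True}
```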

Maximum servers in a ZooKeeper ensemble cluster?

Use case: 100 Servers in a pool; I want to start a ZooKeeper service on each Server and Server applications (ZooKeeper client) will use the ZooKeeper cluster (read/write). Then there is no single point of failure.
Is this solution possible for this use case? What about the performance?
What if there are 1000 Servers in the pool?
If you are simply trying to avoid a single point of failure, then you only need 3 servers. In a 3 node ensemble, a single failure can be tolerated with the remaining 2 nodes forming the quorum. The more servers you have the worse write performance will be. And 100 servers is the extreme of this, if ZK can even handle it.
However, having that many clients is no problem at all. ZooKeeper has active deployments with many more than 1000 clients. If you find that you need more servers to handle your read load, you can always add Observers. I highly recommend you join the mailing list. It is an excellent way to quickly have your questions answered, and likely in much more detail than anyone will give you on SO.
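
To put numbers on the ensemble sizes: a write has to be acknowledged by a majority of the voting servers, so the number of acknowledgements per write grows with the ensemble. The helper names below are only for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5, 7, 100):
    print(f"{n} servers: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failures")
```

Note that 4 servers tolerate no more failures than 3, which is why odd ensemble sizes are the usual choice.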
Maybe ZooKeeper is not the right tool?
Hazelcast does what you want, I think. You can have hundreds of peers, and if the master is lost a new one is elected from all the peers.
You don't need to use all of Hazelcast. You can just use the maps, or just the worker pools, or just the synchronisation primitives, or just the messaging, etc.