Kuzzle shuts down on its own - kuzzle

I am trying to use Kuzzle 2.13.3 with Docker on an 8 GB RAM, 4 vCPU self-managed cloud instance. Everything seems fine, except that the Kuzzle node shuts down on its own after a random period of time. Here are the logs towards the end:
{"level":"error","message":"2021-07-22T06:25:13+00:00 [LOG:ERROR] [knode-icky-vampire-54285] [CLUSTER] Node too slow: ID card expired"}
{"level":"info","message":"2021-07-22T06:25:13+00:00 [LOG:INFO] [knode-icky-vampire-54285] Initiating shutdown..."}
{"level":"info","message":"2021-07-22T06:25:13+00:00 [LOG:INFO] [knode-icky-vampire-54285] Halted."}
[nodemon] clean exit - waiting for changes before restart
What do I need to do to get past this issue?

When the node cannot refresh its ID card inside Redis in time, it shuts itself down to ensure the cluster always stays healthy.
You can try increasing the heartbeat delay in the configuration with the cluster.heartbeat key. The default value is 2000 ms.
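For instance, a minimal .kuzzlerc sketch that raises the heartbeat to 5000 ms (the value is only an example; adjust it to your environment):
{
  "cluster": {
    "heartbeat": 5000
  }
}
Keep in mind that a longer heartbeat will likely also make the cluster slower to evict a node that is genuinely dead.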

Related

Keep kubernetes logs of quickly terminating pods

I have an app in my cluster that starts automatically based on a metric and shuts down as soon as its job is done. Apparently the shutdown is too fast, because my logging agents (DataDog agent and FluentBit) are not able to pick up the logs before the log file is deleted.
Is there a Kubernetes deployment config that ensures the logs stay around for a longer time period, like an extra minute or so?
I'm using the CRI logging driver
Thanks in advance!
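One approach that might help, as a rough sketch only (container and image names below are placeholders, and the preStop sleep applies only when the pod is terminated by Kubernetes, e.g. on a scale-down, not when the process exits on its own):
# deployment excerpt - placeholder names, example durations
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120      # give the agents extra time before the pod is removed
      containers:
      - name: my-job                          # placeholder container name
        image: my-job-image:latest            # placeholder image
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 60"]   # keep the container, and its log file, around a bit longer
This only buys the agents time during pod deletion; it does not by itself change how long the CRI keeps the log files.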

Cassandra pod is taking more bootstrap time than expected

I am running Cassandra as a Kubernetes pod. Each pod has one Cassandra container. We are running Cassandra version 3.11.4 with auto_bootstrap set to true. I have 5 nodes in production holding 20 GB of data.
If I restart any Cassandra pod because of some maintenance activity, it takes 30 minutes to bootstrap before it comes back to the UP and Normal state. In production, 30 minutes is a huge amount of time.
How can I reduce the startup time for the Cassandra pod?
Thank you !!
If you're restarting the existing node and the data is still there, then it's not a bootstrap of the node - it's just a restart.
One potential problem is that you're not draining the node before the restart, so all commit logs need to be replayed on start, and this can take a lot of time if you have a lot of data in the commit log (you can check system.log to see what Cassandra is doing at that time). So the solution could be to execute nodetool drain before stopping the node.
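A rough sketch of that sequence for a pod (the pod name cassandra-0 is a placeholder):
# flush memtables and stop accepting writes, so no commit log replay is needed on start
kubectl exec cassandra-0 -- nodetool drain
# then restart the pod as part of the maintenance
kubectl delete pod cassandra-0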
If the node is restarted because of a crash or something like that, you can think in the direction of regularly flushing the data from memtables, for example via nodetool flush, or by configuring tables with periodic flushes via the memtable_flush_period_in_ms option on the busiest tables. But be careful with that approach, as it may create a lot of small SSTables, and this will add more load to the compaction process.
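For the periodic flush option, that could look roughly like this in CQL (the keyspace/table names and the one-minute period are just examples):
-- flush this table's memtable every 60 seconds (example value)
ALTER TABLE my_keyspace.busy_table WITH memtable_flush_period_in_ms = 60000;
-- or flush manually before a planned restart: nodetool flush my_keyspace busy_table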

Compute Engine unhealthy instance down 50% of the time

I started using Google Cloud 3 days ago or so, so I am completely new to it.
I have 4 pods deployed to Google Kubernetes Engine:
Frontend: react app,
Redis,
Backend: made up of 2 containers, a nodejs server and a cloudsql-proxy,
Nginx-ingress-controller
I also have an SQL instance running for my PostgreSQL database, hence the cloudsql-proxy container
This setup works well 50% of the time, but every now and then all the pods crash and/or the containers are recreated.
I tried to check all the relevant logs, but I really don't know which ones are actually relevant. There is one thing I found that correlates with my issue: I have 2 VM instances running, and one of them might be the faulty one:
When I hover over the loading spinner, it says Instance is being verified, and it seems to be in this state 80% of the time; when it is not, there is a yellow warning beside the name of the instance saying The resource is not ready.
Here is the CPU usage of the instance (the trend is the same for all the hardware metrics). I checked the logs of my frontend and backend containers; here are the last logs that correspond to a CPU drop:
2019-03-13 01:45:23.533 CET - 🚀 Server ready
2019-03-13 01:45:33.477 CET - 2019/03/13 00:45:33 Client closed local connection on 127.0.0.1:5432
2019-03-13 01:54:07.270 CET - yarn run v1.10.1
As you can see here, all the pods are being recreated...
I think that it might come from the fact that the faulty instance is unhealthy:
Instance gke-*****-production-default-pool-0de6d459-qlxk is unhealthy for ...
...and so the health check keeps recreating/restarting the instance again and again. Tell me if I am wrong.
So, how can I discover what is making this instance unhealthy?
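A few places one might start digging, as a sketch (the node name is the one from above; kubectl top requires cluster metrics to be available):
# node-level conditions and recent events often show why a node is flagged unhealthy
kubectl describe node gke-*****-production-default-pool-0de6d459-qlxk
kubectl get events --all-namespaces --sort-by='.metadata.creationTimestamp'
# resource pressure on the node is a common cause of pod evictions/restarts
kubectl top nodes
kubectl top pods --all-namespaces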

Fabric Network - what happens when a downed peer connects back to the network?

I recently deployed the fabric network using Docker Compose and was trying to simulate a downed peer. Essentially this is what happens:
4 peers are brought online using docker-compose, running a fabric network
1 peer, i.e. the 4th peer, goes down (done via the docker stop command)
Invoke transactions are sent to the root peer, which is verified by querying the peers after some time (excluding the downed peer).
The downed peer is brought back up with docker start. Query transactions run fine on the always-on peers but fail on the newly woken-up peer.
Why isn't the 4th peer synchronizing the blockchain once it's up? Is there a step to be taken to ensure it does, or is it discarded as a rogue peer?
This might be due to the expected behavior of PBFT (assuming you are using it). As explained in issue 933:
I think what you're seeing is normal PBFT behavior: 2f+1 replicas are
making progress, and f replicas are lagging slightly behind, and catch
up occasionally.
If you shut down another peer, you should observe
that the one you originally shut off and restarted will now
participate fully, and the network will continue to make progress. As
long as the network is making progress, and the participating nodes
share a correct prefix, you're all good. The reason for f replicas
lagging behind is that those f may be acting byzantine and progress
deliberately slowly. You cannot tell a difference between a slower
correct replica, and a deliberately slower byzantine replica.
Therefore we cannot wait for the last f stragglers. They will be left
behind and sync up occasionally. If it turns out that some other
replica is crashed, the network will stop making progress until one
correct straggler catches up, and then the network will progress
normally.
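In concrete numbers for this setup, assuming the usual PBFT sizing of N = 3f + 1: with N = 4 peers, f = 1, so 2f + 1 = 3 peers keep making progress while 1 peer may lag behind and only catch up occasionally, which matches what you are seeing.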
Hyperledger Fabric v0.6 does not support adding peers dynamically. I am not sure about HF v1.0.

Zookeeper Failover Strategies

We are a young team building an application using Storm and Kafka.
We have a common Zookeeper ensemble of 3 nodes which is used by both Storm and Kafka.
I wrote a test case to test Zookeeper failover (a rough CLI sketch of the first steps follows the list):
1) Check that all three nodes are running and confirm one is elected as the leader.
2) Using the Zookeeper unix client, create a znode and set a value. Verify the value is reflected on the other nodes.
3) Modify the znode: set a value on one node and verify the other nodes have the change reflected.
4) Kill one of the worker nodes and make sure the master/leader is notified about the crash.
5) Kill the leader node. Verify that, out of the other two nodes, one is elected as the leader.
Do I need to add any more test cases? Any additional ideas/suggestions/pointers to add?
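A rough sketch of the first three steps from the ZooKeeper command line (hostnames and ports are placeholders):
# check each node's role; exactly one should report "Mode: leader"
bin/zkServer.sh status
# create a znode on one node, then read and update it through the others
bin/zkCli.sh -server zk1:2181 create /failover-test "v1"
bin/zkCli.sh -server zk2:2181 get /failover-test
bin/zkCli.sh -server zk3:2181 set /failover-test "v2"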
From the documentation:
Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
more on setting up automatic failover
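If you need to tune that detection window, the relevant property is typically set in core-site.xml (the 10000 ms value below is only an example):
<property>
  <name>ha.zookeeper.session-timeout.ms</name>
  <!-- example value: allow up to ~10 s before the active node's ZooKeeper session is declared dead -->
  <value>10000</value>
</property>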