Vespa.ai storage 0 down - Kubernetes

I recently started using Vespa. I deployed a cluster on Kubernetes and indexed some data, but today one of the storage nodes shows as down in "vespa-get-cluster-state":
[vespa#vespa-0 /]$ vespa-get-cluster-state
Cluster feature:
feature/storage/0: down
feature/storage/1: up
feature/storage/2: up
feature/distributor/0: up
feature/distributor/1: up
feature/distributor/2: up
I don't know what this "storage" is... this cluster has 2 content nodes, 2 container nodes and 1 master.
How do I see the logs and diagnose why this node is down?

Just a tip: this question would work better on the Vespa Slack, or as a GitHub issue.
According to the output you shared you have 3 content nodes. Each content node runs a "storage" service, responsible for storing the data, and a "distributor" service, responsible for managing a subset of the data space. The reason the node is down is not included in this output, but you can find it by running vespa-logfmt -l warning,error on your admin node.
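If the admin node also runs as a pod in your Kubernetes cluster, a minimal sketch of how to run that from outside (assuming, hypothetically, that the vespa-0 pod from your prompt is or can reach the admin node; adjust the pod name and namespace to your deployment):

kubectl exec -it vespa-0 -- vespa-logfmt -l warning,error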

Related

How to prevent data inconsistency when one node loses network connectivity in Kubernetes

I have a cluster with a service (we named it A1) whose data lives on remote storage, cephFS in my case. The number of replicas for my service is 1. Assume I have 5 nodes in my cluster and service A1 resides on node 1. Something happens to node 1's network and it loses connectivity with the cephFS cluster and with my Kubernetes cluster (or docker-swarm) as well. The cluster marks it as unreachable and starts a new service (we named it A2) on node 2 to keep the replica count at 1. After, say, 15 minutes node 1's network is fixed and node 1 rejoins the cluster with service A1 still running (assume it didn't crash while it had lost connectivity with the remote storage).
I worked with docker-swarm and recently switched to Kubernetes. I see Kubernetes has a feature called StatefulSet, but when I read about it, it didn't answer my question (or I may have missed something).
Question A: What does the cluster do? Does it keep A2 and shut down A1, or let A1 keep working and shut down A2? (Logically it should shut down A1.)
Question B (and my primary question as well!): Assume the cluster wants to shut down one of these services (for example A1). This service saves some state to storage when it shuts down. In this case the stale state from A1 gets written to disk after A2 has already saved newer state, before A1's network was fixed.
There must be some lock when we mount the volume to a container, so that while it is attached to one container no other container can write to it (so A1 would fail when it tries to save its old state to disk).
The way it works, using Docker Swarm terminology:
You have a service. A service is a description of some image you'd like to run, how many replicas you want, and so on. Assuming the service specifies that at least 1 replica should be running, it will create a task that schedules a container on a swarm node.
So the service is associated with 0 to many tasks, and each task has 0 containers (if it's still starting) or 1 container (if the task is running or stopped), which lives on a node.
When swarm (the orchestrator) detects a node going offline, it essentially sees that a number of tasks associated with a service have lost their containers, so the replication (in terms of running tasks) is no longer correct for the service, and it creates new tasks which in turn schedule new containers on the available nodes.
On the disconnected node, the swarm worker notices that it has lost connection to the swarm managers, so it cleans up all the tasks it is holding onto, since it no longer has current information about them. In the process of cleaning up the tasks, the associated containers get stopped.
This is good because when the node finally reconnects there is no race condition with two tasks running: only "A2" is running, and "A1" has been shut down.
This is bad if nodes can lose connectivity to the managers frequently but you need the services to keep running on those nodes regardless, as they will be shut down each time the workers detach.
The process on Kubernetes is pretty much the same, just with different terminology.
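On the volume-locking part of the question, a hedged sketch of the Kubernetes side (the names a1-data and rbd below are hypothetical): a StatefulSet with replicas: 1 will not start a replacement pod until the old pod is confirmed gone, and a ReadWriteOnce volume claim can only be attached to one node at a time. That attachment-level "lock" applies to block-style volumes such as Ceph RBD; CephFS is normally exposed as ReadWriteMany, so it does not prevent two writers by itself.

# Illustrative claim only; pair it with a StatefulSet (or a Deployment with strategy: Recreate)
# so that at most one pod mounts it at a time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: a1-data                    # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]   # attachable to a single node at a time
  storageClassName: rbd            # hypothetical block-storage class, not CephFS
  resources:
    requests:
      storage: 10Gi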

InnoDB Cluster upgradeMetadata on broken cluster

We have a cluster of 3 nodes; 2 of them are offline (missing) and I cannot get them to rejoin the cluster automatically. Only the master is online.
Usually, you can get a handle on the cluster with the AdminAPI:
var cluster = dba.getCluster();
but I cannot use the cluster instance because the metadata is not up to date. And I cannot upgrade the metadata because the missing members are required to be online for dba.upgradeMetadata(). (Catch-22.)
I tried to dissolve the cluster by using:
var cluster = dba.rebootClusterFromCompleteOutage();
cluster.dissolve({force:true});
but this requires the metadata to be updated as well.
The question is: how do I dissolve the cluster completely, or upgrade the metadata, so that I can use the cluster object's methods?
This "chicken-and-egg" issue was fixed in MySQL Shell 8.0.20. dba.rebootClusterFromCompleteOutage() is now allowed in this situation:
BUG#30661129 – DBA.UPGRADEMETADATA() AND DBA.REBOOTCLUSTERFROMCOMPLETEOUTAGE() BLOCK EACH OTHER
More info at: https://mysqlserverteam.com/mysql-shell-adminapi-whats-new-in-8-0-20/
If you have a cluster where each node has been upgraded to the latest version of MySQL, the cluster isn't fully operational, and you need to update your metadata for mysqlsh, you'll need to use an older version of mysqlsh (for example from https://downloads.mysql.com/archives/shell/) to get the cluster back up and running. Once it is up and running, run dba.upgradeMetadata() on the R/W node, and make sure you update all of your routers too, or they will lose their connection.
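A hedged sketch of the recovery sequence with MySQL Shell 8.0.20 or later (host names and the clusterAdmin account are placeholders; the exact order of rejoining members versus upgrading the metadata may vary with your Shell version):

// Connect to the surviving (R/W) member.
shell.connect('clusterAdmin@node1:3306');
// Rebuild the cluster handle from the surviving member; since 8.0.20 this is
// no longer blocked by outdated metadata.
var cluster = dba.rebootClusterFromCompleteOutage();
// Upgrade the metadata schema on the R/W node.
dba.upgradeMetadata();
// Bring a missing member back and verify the topology.
cluster.rejoinInstance('clusterAdmin@node2:3306');
cluster.status();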

Elastic Cloud APM Server - Queue is full

I have many Java microservices running in a Kubernetes cluster. All of them run APM agents sending data to an APM server in our Elastic Cloud cluster.
Everything was working fine, but suddenly every microservice started reporting the "Queue is full" error in its logs.
I tried restarting the cluster, increasing the hardware, and following the hints, but with no success.
Note: the disk is almost empty and memory usage is fine.
Everything is on version 7.5.2.
I deleted all the indexes related to APM and everything worked again after a few minutes.
For better performance you can fine-tune these fields in the apm-server.yml file:
Increase the internal queue size: queue.mem.events = output.elasticsearch.worker * output.elasticsearch.bulk_max_size.
The default is 4096.
Increase output.elasticsearch.worker; the default is 1.
Increase output.elasticsearch.bulk_max_size; the default of 50 is very low.
Example: for my use case I used the following settings for 2 apm-server nodes and 3 ES nodes (1 master, 2 data nodes):
queue.mem.events=40000
output.elasticsearch.worker=4
output.elasticsearch.bulk_max_size=10000
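As a hedged sketch, those settings would sit in apm-server.yml roughly like this (the Elasticsearch host is a placeholder; tune the numbers for your own throughput and memory budget):

queue.mem:
  events: 40000                # internal event queue, sized around worker * bulk_max_size
output.elasticsearch:
  hosts: ["https://my-deployment.es.example.com:9243"]   # placeholder endpoint
  worker: 4                    # parallel bulk indexing workers
  bulk_max_size: 10000         # events per bulk request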

MongoDB data replication in Kubernetes

I've been configuring pods in Kubernetes to hold a mongodb and a golang image each, with a service to load-balance. The major issue I am facing is data replication between databases. Replication controllers/replica sets do not seem to do what the name implies; they spin up blank-slate copies rather than replicas of existing/currently running pods. I cannot seem to find any examples or clear answers on how Kubernetes addresses this, or whether it even does.
For example, data insertions sent by the Go program will automatically be load-balanced by the service to one of X replicated instances of mongodb. This poses problems since the instances will all maintain separate documents with no relation to one another once Kubernetes begins to balance connections among the pods. Is there a way to address this in Kubernetes, or does it require a complete rewrite of the Go code to expect data replication among numerous available databases?
Sorry, I'm relatively new to Kubernetes and couldn't find much information about this.
You're right, a replica set is not a replica of another container; it's just a container with the same configuration spun up within the same logical unit.
A replica set (or Deployment, which is the resource you should be using now) will have multiple pods, and it's up to you, the operator, to configure the MongoDB replication part.
I would recommend reading this example of how to set up a replica set with multiple mongodb containers:
https://medium.com/google-cloud/mongodb-replica-sets-with-kubernetes-d96606bd9474#.e8y706grr
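For context, a minimal sketch of what such a setup usually looks like (all names, image tags and sizes below are hypothetical): a headless Service plus a StatefulSet gives each mongod a stable DNS name such as mongo-0.mongo, and you then join those members into a MongoDB replica set yourself with rs.initiate(); Kubernetes never replicates the data for you.

apiVersion: v1
kind: Service
metadata:
  name: mongo
spec:
  clusterIP: None              # headless: stable per-pod DNS names
  selector:
    app: mongo
  ports:
    - port: 27017
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongo
spec:
  serviceName: mongo
  replicas: 3
  selector:
    matchLabels:
      app: mongo
  template:
    metadata:
      labels:
        app: mongo
    spec:
      containers:
        - name: mongod
          image: mongo:4.4     # hypothetical version
          command: ["mongod", "--replSet", "rs0", "--bind_ip_all"]
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: data
              mountPath: /data/db
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

After the pods are up, you would connect to mongo-0.mongo, run rs.initiate() with the three members, and point the Go service at the replica set connection string rather than at the load-balancing Service.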

Resource Management of Bluemix Auto-Scaling

What are the impacts of Bluemix auto-scaling in terms of resource management? For example, if a runtime is specified with 1 GB of memory and auto-scaling is set to 2 instances, does the application consume 2 GB?
Same question for the disk allocated for the runtime?
Are logs from the various instances combined automatically?
If an instance is currently serving a REST request (short), how does Auto-Scaling make sure that the request is not interrupted while being served?
When you say, "a runtime is specified with 1 GB of memory and auto-scaling is set to 2 instances" I assume that you set your group/application up such that each instance is given 1 GB of memory and you are asking what will happen if the Auto-Scaling service scales up your group/application to 2 instances.
Memory/Disk
For example if a runtime is specified with 1 GB of memory and auto-scaling is set to 2 instances, does the application consume 2 GB? Same question for the disk allocated for the runtime?
Yes, your application will now consume 2 GB of your total memory quota. The same applies for disk allocation.
The Auto-Scaling service will deploy a new instance with the same configuration as your existing instances. If you've set up your group/application such that each instance gets 1 GB of memory, then when Auto-Scaling increases your group's instance count from 1 to 2 your application will now consume 2 GB of memory, assuming that adding another GB doesn't go beyond your memory quota. The same applies with disk allocation and quota.
Logs
Are logs from the various instances combined automatically?
Yes, the logs are combined automatically.
Cloud Foundry applications combine logs as well. For more information about viewing these logs check out the documentation.
The IBM Containers service sends logs to IBM's Logmet service. For more information check out the documentation.
Handling REST requests without interruption
If an instance is currently serving a REST request (short), how does Auto-Scaling make sure that the request is not interrupted while being served?
Adding an instance to the group/application: no interruption
If an instance is being added to the group then there will be no interruption to existing requests because any previously existing instances are not touched or altered by the Auto-Scaling service.
Removing an instance from the group/application: possible interruption
At this time, the Auto-Scaling service does not support protecting ongoing requests from being dropped during a scale down operation. If the request is being processed by the instance that is being removed, then that request will be dropped. It is up to the application to handle such cases. One option is your application could store session data in external storage to allow the user to retry the request.
Additional Information
There are currently two different Auto-Scaling services in Bluemix:
Auto-Scaling for Cloud Foundry applications exists in all Bluemix regions and is available as a service you bind to your existing Cloud Foundry application.
Auto-Scaling for Container Groups is currently available as a beta service in the London region in the new Bluemix console.
The answers to your questions above are applicable to both services.
I hope this helps! Happy scaling!