Unique identifiers for distributed systems: Kubernetes StatefulSet - kubernetes

We are using Spring Boot microservices and Kubernetes for our application development and deployments, with StatefulSets for the deployments. We need to generate unique identifiers that can be used in the services across pods deployed on nodes across clusters.
I am using the Twitter Snowflake format for generating the unique IDs with K8s StatefulSets.
Format:
64-bit Java long: 1 reserved bit + 41-bit timestamp + 10-bit nodeId + 12-bit counter
It's a simple problem to solve if we are talking about, say, 5 replicas deployed in the same cluster (across different nodes): I can easily use each pod's unique ordinal as the nodeId in the format above. But the moment I deploy the pods on multiple clusters, the nodeIds are duplicated, since the same ordinals are generated in every cluster.
Example: 5 replicas in a single cluster create pods like:
service-0,
service-1,
service-2,
service-3,
service-4
I can get the unique pod numbers (0, 1, 2, 3, ...) to serve as the nodeId in my uniqueId generator, and hence derive a unique identifier. But if I deploy the application (via StatefulSet) to 2 clusters, it will generate the same pod names in the second cluster too:
service-0,
service-1,
service-2,
service-3,
service-4
Now if my GLB routes my calls randomly across the 10 pods in the 2 clusters, the probability of generating duplicate unique IDs (for calls made during the same millisecond) is pretty high, and hence my solution might not work.
I need inputs on how to solve this problem. Any help will be highly appreciated.

Hi WhizKid, I can give a workaround for this: try to include your cluster name in the naming convention (e.g. cluster1-service-0) instead of just going with the nodeId. Since we have now generated unique IDs for each cluster, even if they lie on the same nodes or the same servers it won't become a problem. Hope this helps you.

Thanks for all the suggestions. We figured out a way to retrieve the cluster Id from the code. We set the cluster Id in the Jules file (since we use Jules for deployment to Kubernetes) and then use the cluster variable in the deployment. That way we are able to retrieve the cluster name in the code. Something like this:
Deployment.yml:
......
env:
  - name: INFO_KUBE_ENV
    value: ${environment}
  - name: INFO_KUBE_CLUSTER
    value: ${cluster}
and then using the env variable INFO_KUBE_CLUSTER in the code like this:
System.getenv("INFO_KUBE_CLUSTER")
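For anyone putting the pieces together, here is a minimal sketch (not the poster's actual implementation) of a Snowflake-style generator that derives the 10-bit nodeId from INFO_KUBE_CLUSTER plus the StatefulSet pod ordinal. The cluster-name-to-number mapping, the 3/7 bit split and the custom epoch are made-up example values; the ordinal is parsed from the pod hostname, which for a StatefulSet pod equals the pod name (e.g. service-3):

import java.util.Map;

// Snowflake-style layout: 1 reserved bit + 41-bit timestamp + 10-bit nodeId + 12-bit counter.
public class SnowflakeIdGenerator {

    // Hypothetical mapping from cluster name to a small numeric index (3 bits here).
    private static final Map<String, Long> CLUSTER_INDEX = Map.of(
            "cluster1", 0L,
            "cluster2", 1L);

    private static final long CUSTOM_EPOCH = 1_577_836_800_000L; // 2020-01-01 UTC, arbitrary

    private final long nodeId;        // 10 bits: 3-bit cluster index + 7-bit pod ordinal
    private long lastTimestamp = -1L;
    private long counter = 0L;        // 12 bits, reset every millisecond

    public SnowflakeIdGenerator() {
        long clusterIdx = CLUSTER_INDEX.getOrDefault(System.getenv("INFO_KUBE_CLUSTER"), 0L);
        // A StatefulSet pod's hostname is its pod name, e.g. "service-3".
        String hostname = System.getenv("HOSTNAME");
        long ordinal = Long.parseLong(hostname.substring(hostname.lastIndexOf('-') + 1));
        this.nodeId = ((clusterIdx & 0x7L) << 7) | (ordinal & 0x7FL);
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            counter = (counter + 1) & 0xFFFL;   // wrap within 12 bits
            if (counter == 0) {
                // Counter exhausted for this millisecond: wait for the next one.
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            counter = 0;
        }
        lastTimestamp = now;
        return ((now - CUSTOM_EPOCH) << 22) | (nodeId << 12) | counter;
    }
}

With 3 bits for the cluster and 7 bits for the ordinal this supports up to 8 clusters and 128 replicas per StatefulSet; the split can be adjusted as long as the total stays within 10 bits.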

Related

How Databases synchronize data between persistent volumes in Kubernetes

I've just read the Deploying Cassandra with Stateful Sets topic in the Kubernetes documentation.
The deployment process:
1. Creation of a StorageClass
2. Creation of PersistentVolumes (in my case 4), each with the storageClassName created in step 1
3. Creation of the Cassandra Headless Service
4. Using a StatefulSet to Create a Cassandra Ring, with the storageClassName from step 1 set in the StatefulSet YAML definition
As a result, there are 4 pods: cassandra-0, cassandra-1, cassandra-2, cassandra-3, which are bound to the volumes created in step 2 (pv-0, pv-1, pv-2, pv-3).
I wonder how / if these persistent volumes synchronize data with each other.
E.g. if I add some record that is written by pod cassandra-0 to persistent volume pv-0, will someone who retrieves data from the database a moment later, via the cassandra-1 pod/pv, see the data that was added to pv-0? Can anyone tell me how exactly it works?
This is not related to Kubernetes.
The replication is done by the database and is configurable.
See the CAP theorem and Eventual Consistency for Cassandra.
You can control the level of consistency in Cassandra; whether a record is updated across replicas immediately or later depends on the configuration you do in Cassandra.
See also: Synchronous Replication, Asynchronous Replication
Cassandra Consistency:
how to set cassandra read and write consistency
How is the consistency level configured?
The mechanism used to spread data across the cluster is independent of whether it was deployed on Kubernetes or on bare-metal instances. Cassandra will try to spread the data across the nodes based on a hash value (known as the token), and will use the same algorithm to retrieve the information.
There are other factors to take into consideration: the replication factor (the number of copies) and the consistency level used.
You may want to take a look at DS201: DataStax Enterprise Foundations of Apache Cassandra™ in DataStax Academy, where they cover the basics of Cassandra.
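To make the consistency-level point concrete, here is a small sketch using the DataStax Java driver 4.x; the keyspace and table names are invented for the example, and the keyspace is assumed to have been created with a replication factor of 3:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ConsistencyExample {
    public static void main(String[] args) {
        // Connects to 127.0.0.1:9042 by default; configure contact points as needed.
        try (CqlSession session = CqlSession.builder().build()) {
            // Write acknowledged by a majority of the replicas.
            session.execute(SimpleStatement
                    .newInstance("INSERT INTO demo_ks.users (id, name) VALUES (1, 'alice')")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM));

            // Read from a majority of the replicas: combined with the QUORUM write,
            // the read sees the data no matter which pod/PV serves it.
            session.execute(SimpleStatement
                    .newInstance("SELECT name FROM demo_ks.users WHERE id = 1")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM));
        }
    }
}

With a replication factor of 3, QUORUM reads and writes continue to succeed while one replica (one pod/PV) is down.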
Just to slightly extend Carlos' answer: Kubernetes is not involved and the volumes are completely isolated. The replication and distribution work is entirely up to the database software to handle. As far as K8s is concerned, they are just separate processes and separate volumes.
Thanks for the comments, guys!
So, when I have my DB with 3 PVs:
cassandra-pod0   cassandra-pod1   cassandra-pod2
      |                |                |
cassandra-pv0    cassandra-pv1    cassandra-pv2
the data is divided across the 3 PVs. When I kill cassandra-pod1, it is possible that I will (temporarily) lose part of the data. Am I right?

Kubernetes different container args depending on number of pods in replica set

I want to scale an application with workers.
There could be 1 worker or 100, and I want to scale them seamlessly.
The idea is to use a replica set. However, due to domain-specific reasons, the appropriate way to scale them is for each worker to know its ID and the total number of workers.
For example, in case I have 3 workers, I'd have this:
id:0, num_workers:3
id:1, num_workers:3
id:2, num_workers:3
Is there a way of using kubernetes to do so?
I pass this information in command line arguments to the app, and I assume it would be fine having it in environment variables too.
It's ok on size changes for all workers to be killed and new ones spawned.
Before giving the kubernetes-specific answer, I wanted to point out that it seems like the problem is trying to push cluster-coordination down into the app, which is almost by definition harder than using a distributed system primitive designed for that task. For example, if every new worker identifies themselves in etcd, then they can watch keys to detect changes, meaning no one needs to destroy a running application just to update its list of peers, their contact information, their capacity, current workload, whatever interesting information you would enjoy having while building a distributed worker system.
But, on with the show:
If you want stable identifiers, then StatefulSets is the modern answer to that. Whether that is an exact fit for your situation depends on whether (for your problem domain) id:0 being "rebooted" still counts as id:0 or the fact that it has stopped and started now disqualifies it from being id:0.
The running list of cluster size is tricky. If you are willing to be flexible in the launch mechanism, then you can have a pre-launch binary populate the environment right before spawning the actual worker (that example is for reading from etcd directly, but the same principle holds for interacting with the kubernetes API, then launching).
You could do that same trick in a more static manner by having an initContainer write the current state of affairs to a file, which the app would then read in. Or, due to all Pod containers sharing networking, the app could contact a "sidecar" container on localhost to obtain that information via an API.
So far so good, except for the
on size changes for all workers to be killed and new one spawned
The best answer I have for that requirement is that if the app must know its peers at launch time, then I am pretty sure you have left the realm of "scale $foo --replicas=5" and entered into the "destroy the peers and start all afresh" realm, with kubectl delete pods -l some-label=of-my-pods; which is, thankfully, what updateStrategy: type: OnDelete does, when combined with the delete pods command.
In the end, I tried something different. I used the Kubernetes API to get the number of running pods with the same label. This is Python code utilizing the Kubernetes Python client.
import socket

from kubernetes import client
from kubernetes import config

# Configure the client from the service-account credentials mounted into the pod.
config.load_incluster_config()
v1 = client.CoreV1Api()

# The namespace the pod runs in is also mounted into the container.
with open(
    '/var/run/secrets/kubernetes.io/serviceaccount/namespace',
    'r'
) as f:
    namespace = f.readline()

# List all pods in the namespace that carry the worker label.
workers = []
for pod in v1.list_namespaced_pod(
    namespace,
    watch=False,
    label_selector="app=worker"
).items:
    workers.append(pod.metadata.name)
workers.sort()

# The pod's hostname equals its pod name, so its position in the sorted
# list gives its id, and the list length gives the worker count.
num_workers = len(workers)
worker_id = workers.index(socket.gethostname())

Service instance count in Azure Service Fabric

Is there a way to find out, through code, the number of instances of a service type that are running in a Service Fabric cluster at any given time? One way is to look at the ApplicationManifest file and get the number of instances set there, but it can sometimes be overridden by a parameter file. Any ideas here?
If you want to examine your services programmatically then look at FabricClient, which exposes a number of operations that can show you the status of deployed services. For your specific question, getting the number of running instances, have a look at FabricClient.QueryClient.GetReplicaList...(...); it will give you a list of replicas (in the case of stateless services, that is the same as instances).

MongoDB data replication in Kubernetes

I've been configuring pods in Kubernetes to hold a mongodb and a golang image, each with a service to load-balance. The major issue I am facing is data replication between the databases. Replication controllers/replica sets do not seem to do what the name implies; they create blank-slate copies rather than replicas of existing/currently running pods. I cannot seem to find any examples or clear answers on how Kubernetes addresses this, or whether it even does.
For example, data insertions sent by the Go program are automatically load-balanced by the service to one of X replicated mongodb instances. This poses problems, since they will all maintain separate documents without any relation to one another once Kubernetes begins to balance the connections among the other pods. Is there a way to address this in Kubernetes, or does it require a complete rewrite of the Go code to expect data replication among numerous available databases?
Sorry, I'm relatively new to Kubernetes and couldn't seem to find much information regarding this.
You're right, a replica set is not a replica of another container, it's just a container with the same configuration spun up within the same logical unit.
A replica set (or deployment, which is the resource you should be using now) will have multiple pods, and it's up to you, the operator, to configure the mongodb part.
I would recommend reading this example of how to set up a replica set with multiple mongodb containers:
https://medium.com/google-cloud/mongodb-replica-sets-with-kubernetes-d96606bd9474#.e8y706grr

MongoDB cluster with AWS CloudFormation and auto scaling

I've been investigating creating my own MongoDB cluster in AWS. The AWS MongoDB template provides some good starting points. However, it doesn't cover auto scaling or what happens when a node goes down. For example, say I have 1 primary and 2 secondary nodes, the primary goes down, and auto scaling kicks in. How would I add the newly launched MongoDB instance to the replica set?
If you look at the template, it uses an init.sh script to check whether the node being launched is a primary node, waits for all the other nodes to exist, and creates a replica set with their IP addresses on the primary. When the replica set is configured initially, all the nodes already exist.
Not only that, but my node app uses mongoose. Part of the database connection allows you to specify multiple nodes. How would I keep track of what's currently up and running? (I guess I could use DynamoDB, but I'm not sure.)
What's the usual flow if an instance goes down? Do people generally manually re-configure clusters if this happens?
Any thoughts? Thanks.
This is a very good question and I went through this very painful journey myself recently. I am writing a fairly extensive answer here in the hope that some of these thoughts on running a MongoDB cluster via CloudFormation are useful to others.
I'm assuming that you're creating a MongoDB production cluster as follows: -
3 config servers (micro/small instances can work here)
At least 1 shard consisting of e.g. 2 (primary & secondary) shard instances (large instances at a minimum) with large disks configured for data / log / journal.
An arbiter machine for voting (a micro is probably OK).
i.e. https://docs.mongodb.org/manual/core/sharded-cluster-architectures-production/
Like yourself, I initially tried the AWS MongoDB CloudFormation template that you posted in the link (https://s3.amazonaws.com/quickstart-reference/mongodb/latest/templates/MongoDB-VPC.template) but to be honest it was far, far too complex, i.e. it's 9,300 lines long and sets up multiple servers (replica shards, configs, arbiters, etc.). Running the CloudFormation template took ages and it kept failing (e.g. after 15 minutes), which meant the servers all terminated and I had to try again, which was really frustrating / time consuming.
The solution I went for in the end (which I'm super happy with) was to create separate templates for each type of MongoDB server in the cluster e.g.
MongoDbConfigServer.template (template to create config servers - run this 3 times)
MongoDbShardedReplicaServer.template (template to create replica - run 2 times for each shard)
MongoDbArbiterServer.template (template to create arbiter - run once for each shard)
NOTE: templates available at https://github.com/adoreboard/aws-cloudformation-templates
The idea then is to bring up each server in the cluster individually, i.e. 3 config servers, 2 sharded replica servers (for 1 shard) and an arbiter. You can then add custom parameters into each of the templates, e.g. the parameters for the replica server could include: -
InstanceType e.g. t2.micro
ReplicaSetName e.g. s1r (shard 1 replica)
ReplicaSetNumber e.g. 2 (used with ReplicaSetName to create name e.g. name becomes s1r2)
VpcId e.g. vpc-e4ad2b25 (not a real VPC obviously!)
SubnetId e.g. subnet-2d39a157 (not a real subnet obviously!)
GroupId (name of existing MongoDB group Id)
Route53 (boolean to add a record to an internal DNS - best practices)
Route53HostedZone (if boolean is true then ID of internal DNS using Route53)
The really cool thing about CloudFormation is that these custom parameters can have (a) a useful description for people running it, (b) special types (e.g. running the template shows a prefiltered combobox, so mistakes are harder to make) and (c) default values. Here's an example: -
"Route53HostedZone": {
"Description": "Route 53 hosted zone for updating internal DNS (Only applicable if the parameter [ UpdateRoute53 ] = \"true\"",
"Type": "AWS::Route53::HostedZone::Id",
"Default": "YA3VWJWIX3FDC"
},
This makes running the CloudFormation template an absolute breeze as a lot of the time we can rely on the default values and only tweak a couple of things depending on the server instance we're creating (or replacing).
As well as parameters, each of the 3 templates mentioned earlier has a "Resources" section which creates the instance. We can also do cool things via the "AWS::CloudFormation::Init" section, e.g.
"Resources": {
"MongoDbConfigServer": {
"Type": "AWS::EC2::Instance",
"Metadata": {
"AWS::CloudFormation::Init": {
"configSets" : {
"Install" : [ "Metric-Uploading-Config", "Install-MongoDB", "Update-Route53" ]
},
The "configSets" in the previous example shows that creating a MongoDB server isn't simply a matter of creating an AWS instance and installing MongoDB on it but also we can (a) install CloudWatch disk / memory metrics (b) Update Route53 DNS etc. The idea is you want to automate things like DNS / Monitoring etc as much as possible.
IMO, creating a template (and therefore a stack) for each server has the very nice advantage of being able to replace a server extremely quickly via the CloudFormation web console. Also, because we have a server per template, it's easy to build the MongoDB cluster up bit by bit.
My final bit of advice on creating the templates would be to copy what works for you from other GitHub MongoDB CloudFormation templates, e.g. I used the following to make the replica servers use RAID10 (instead of the massively more expensive AWS provisioned IOPS disks).
https://github.com/CaptainCodeman/mongo-aws-vpc/blob/master/src/templates/mongo-master.template
In your question you mentioned auto-scaling - my preference would be to add a shard / replace a broken instance manually (auto-scaling makes sense with web containers, e.g. Tomcat / Apache, but a MongoDB cluster should really grow slowly over time). However, monitoring is very important, especially the disk sizes on the shard servers, to alert you when disks are filling up (so you can either add a new shard or delete data). Monitoring can be achieved fairly easily using AWS CloudWatch metrics / alarms or using the MongoDB MMS service.
If a node goes down, e.g. one of the replicas in a shard, then you can simply kill the server, recreate it using your CloudFormation template, and the data will sync across automatically. This is my normal flow if an instance goes down, and generally no reconfiguration is necessary. I've wasted far too many hours in the past trying to fix servers - sometimes lucky / sometimes not. My backup strategy now is to run a mongodump of the important collections of the database once a day via a crontab, zip it up and upload it to AWS S3. This means that if the nuclear option happens (complete database corruption) we can recreate the entire database with mongorestore in an hour or 2.
However, if you create a new shard (because you're running out of space), configuration is necessary. For example, if you are adding a new Shard 3 you would create 2 replica nodes (e.g. primary with name => mongo-s3r1 / secondary with name => mongo-s3r2) and 1 arbiter (e.g. with name mongo-s3r-arb), then you'd connect via a MongoDB shell to a mongos (MongoDB router) and run this command: -
sh.addShard("s3r/mongo-s3r1.internal.mycompany.com:27017,mongo-s3r2.internal.mycompany.com:27017")
NOTE: - This command assumes you are using private DNS via Route53 (best practice). You can simply use the private IPs of the 2 replicas in the addShard command, but I have been very badly burned by this in the past (e.g. several months back all the AWS instances were restarted and new private IPs were generated for all of them; fixing the MongoDB cluster took me 2 days as I had to reconfigure everything manually, whereas changing the IPs in Route53 takes a few seconds ... ;-)
You could argue we should also add the addShard command to another CloudFormation template, but IMO this adds unnecessary complexity, because it would have to know about a server which has a MongoDB router (mongos) and connect to it to run the addShard command. Therefore I simply run it after the instances in a new MongoDB shard have been created.
Anyway, those are my rather rambling thoughts on the matter. The main thing is that once you have the templates in place your life becomes much easier, and it's definitely worth the effort! Best of luck! :-)