Query to find different combinations of CPU and Memory in the cluster - kubernetes

I was wondering if it's possible to write a query to show the number of nodes in the cluster with a given cpu and memory configuration.
I have a metric kube_node_status_allocatable available with different tags.
Metric 1 (cpu count on each node):
kube_node_status_allocatable{instance="ip1",node="host1",resource="cpu",unit="core"} 21
kube_node_status_allocatable{instance="ip2",node="host2",resource="cpu",unit="core"} 21
kube_node_status_allocatable{instance="ip3",node="host3",resource="cpu",unit="core"} 61
kube_node_status_allocatable{instance="ip4",node="host4",resource="cpu",unit="core"} 61
kube_node_status_allocatable{instance="ip5",node="host5",resource="cpu",unit="core"} 61
Metric 2 (memory on each node):
kube_node_status_allocatable{instance="ip1",node="host1",resource="memory",unit="gb"} 64
kube_node_status_allocatable{instance="ip2",node="host2",resource="memory",unit="gb"} 64
kube_node_status_allocatable{instance="ip3",node="host3",resource="memory",unit="gb"} 128
kube_node_status_allocatable{instance="ip4",node="host4",resource="memory",unit="gb"} 128
kube_node_status_allocatable{instance="ip5",node="host5",resource="memory",unit="gb"} 128
I want to output a metric that looks something like this:
{cpu=21, memory=64} 2
{cpu=61, memory=128} 3
So far I have been able to get number of nodes with a given configuration for one resource at a time.
i.e., number of nodes with different cpu configuration
count_values("node", kube_node_status_allocatable{resource="cpu"})
Above outputs:
{node=21} 2
{node=61} 3
Which roughly maps to configuration (cpu == 21 or 61) and the number of nodes with that configuration (2 or 3).
I can get a similar result for memory, but I am not sure how to join these two.
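One idea I have been sketching (untested): keep the node label while turning each value into a label with count_values, multiply the two vectors on node to merge the labels, and then count the combinations:

count by (cpu, memory) (
    count_values by (node) ("cpu", kube_node_status_allocatable{resource="cpu"})
  * on (node) group_left (memory)
    count_values by (node) ("memory", kube_node_status_allocatable{resource="memory"})
)

Each inner count_values should produce one series per node (value 1) with the allocatable amount encoded in the cpu or memory label; the multiplication copies the memory label onto the cpu series, and the outer count drops node, which should give {cpu="21", memory="64"} 2 and {cpu="61", memory="128"} 3.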

Related

Is there a cloudwatch equivalent for prometheus Counter?

I need to be able to atomically increment (or decrement) a metric value in cloudwatch (and also be able to reset it to zero). Prometheus provides a Counter type that allows one to do this; is there an equivalent in cloudwatch? All I'm able to find is a way to add a new sample value to a metric, but not increment or decrement it.
CloudWatch is like a TSDB: it stores point-in-time values, and you can't mutate a metric value once it is ingested (see Publishing Metrics). Also, I don't think storing a counter in CloudWatch will be very useful, since there is no rate(...) function in CloudWatch like in Prometheus. The best you can do is store the deltas and use the Sum statistic with a period. Here is an example, assuming metrics are ingested at 1-minute granularity:
Time | Counter | rate(5m) | CW metric | sum with period 5m
-----|---------|----------|-----------|-------------------
1m   | 0       | 0        | 0         | 0
2m   | 10      | 10       | 10        | 10
3m   | 20      | 20       | 10        | 20
4m   | 40      | 40       | 20        | 40
5m   | 50      | 50       | 10        | 50
6m   | 60      | 60       | 10        | 60
7m   | 100     | 90       | 40        | 90
Note that metrics can be ingested at a finer granularity, but it comes at a cost. Also, the statistics (Sum, Average, Maximum, Minimum, etc.) can be retrieved only at 1-minute granularity. There is an option to retrieve the raw data when retrieving a statistic, but I'm not sure what the use of doing so would be.
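For illustration, here's a rough sketch of publishing such deltas with boto3 (the namespace and metric name are placeholders, not an established convention):

import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_delta(delta):
    # Each call adds a new sample; CloudWatch never mutates an existing value.
    # The Sum statistic over a chosen period then plays the role that
    # rate()/increase() would in Prometheus, as in the table above.
    cloudwatch.put_metric_data(
        Namespace="MyApp",                  # placeholder namespace
        MetricData=[{
            "MetricName": "RequestsDelta",  # placeholder metric name
            "Value": delta,
            "Unit": "Count",
        }],
    )

publish_delta(40)  # e.g. the 40-unit increase between 6m and 7m in the table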

Erasure Coded Pool suggested PG count

I'm messing around with the PG calculator to figure out the best PG count for my cluster. I have an erasure coded FS pool which will most likely use half the space of the cluster in the foreseeable future, but the PG calculator only has options for replicated pools. Should I just fill in the replica # according to the erasure-code ratio, or is there another way around this?
From the Ceph Nautilus version onwards there's a pg-autoscaler that does the scaling for you; you just need to create a pool with an initial (maybe low) value. As for the calculation itself, your assumption is correct: you take the number of chunks into account when planning the PG count.
From the Red Hat docs:
3.3.4. Calculating PG Count
If you have more than 50 OSDs, we recommend approximately 50-100 placement groups per OSD to balance out resource usage, data durability and distribution. If you have less than 50 OSDs, choosing among the PG Count for Small Clusters is ideal. For a single pool of objects, you can use the following formula to get a baseline:
            (OSDs * 100)
Total PGs = ------------
              pool size
Where pool size is either the number of replicas for replicated pools or the K+M sum for erasure coded pools (as returned by ceph osd erasure-code-profile get).
You should then check if the result makes sense with the way you designed your Ceph cluster to maximize data durability, data distribution and minimize resource usage.
The result should be rounded up to the nearest power of two. Rounding up is optional, but recommended for CRUSH to evenly balance the number of objects among placement groups.
For a cluster with 200 OSDs and a pool size of 3 replicas, you would estimate your number of PGs as follows:
(200 * 100)
----------- = 6667. Nearest power of 2: 8192
     3
With 8192 placement groups distributed across 200 OSDs, that evaluates to approximately 41 placement groups per OSD. You also need to consider the number of pools you are likely to use in your cluster, since each pool will create placement groups too. Ensure that you have a reasonable maximum PG count.
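To make the arithmetic concrete, here is a small Python sketch of that baseline, applied to an erasure coded pool where the pool size is K+M (the k=4, m=2 profile below is just a placeholder; check ceph osd erasure-code-profile get for your own):

import math

def baseline_pg_count(num_osds, pool_size):
    # (OSDs * 100) / pool size, rounded up to the nearest power of two.
    raw = num_osds * 100 / pool_size
    return 2 ** math.ceil(math.log2(raw))

print(baseline_pg_count(200, 3))      # replicated, 3 replicas -> 8192
print(baseline_pg_count(200, 4 + 2))  # erasure coded, k=4 m=2 -> 4096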

aerospike bad latencies with aws

We have Aerospike running in SoftLayer on bare-metal machines in a 2-node cluster. Our profile's average size is 1.5 KB, and at peak, operations are around 6000 ops/sec on each node. The latencies are all fine: at peak, the fraction of operations >1 ms is around 5%.
Now we planned to migrate to AWS, so we booted 2 i3.xlarge machines. We ran the benchmark with the 1.5 KB object size at 3x load, and the results were satisfactory, around 4-5% (>1 ms). Then we started actual processing: the latencies at peak jumped to 25-30% (>1 ms), and the maximum it can accommodate is some 5K ops/sec. So we added one more node and ran the benchmark again (4.5 KB object size and 3x load); the results were 2-4% (>1 ms). After adding it to the cluster, the peak came down to 16-22%. We added one more node and the peak is now at 10-15%.
The version in AWS is aerospike-server-community-3.15.0.2; the version in SoftLayer is Aerospike Enterprise Edition 3.6.3.
Our config is as follows:
# Aerospike database configuration file.
service {
    user xxxxx
    group xxxxx
    run-as-daemon
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 8
    transaction-queues 8
    transaction-threads-per-queue 8
    proto-fd-max 15000
}

logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}

network {
    service {
        port 13000
        address h1 reuse-address
    }
    heartbeat {
        mode mesh
        port 13001
        address h1
        mesh-seed-address-port h1 13001
        mesh-seed-address-port h2 13001
        mesh-seed-address-port h3 13001
        mesh-seed-address-port h4 13001
        interval 150
        timeout 10
    }
    fabric {
        port 13002
        address h1
    }
    info {
        port 13003
        address h1
    }
}

namespace XXXX {
    replication-factor 2
    memory-size 27G
    default-ttl 10d
    high-water-memory-pct 70
    high-water-disk-pct 60
    stop-writes-pct 90
    storage-engine device {
        device /dev/nvme0n1
        scheduler-mode noop
        write-block-size 128K
    }
}
What should be done to bring down the latencies in AWS?
This comes down to the difference in the performance characteristics of the SSDs of the i3 nodes, compared to what you had on Softlayer. If you ran Aerospike on a floppy disk you'd get 0.5TPS.
Piyush's comment mentions ACT, the open source tool Aerospike has created to benchmark SSDs with real database workloads. The point of ACT is to find the sustained rate in which the SSD can be relied on to deliver the latency you want. Burst rates don't matter much for databases.
The performance engineering team at Aerospike has used ACT to find what the i3 1900G SSD can do, and published the results in a post. Its ACT rating is 4x, meaning that the full 1900G SSD can do 8Ktps reads, 4Ktps writes with the standard 1.5K object size, 128K block size, and stay at 95% < 1ms, 99% < 8ms, 99.9% < 64ms. This is not particularly good for an SSD. By comparison, a Micron 9200 PRO rates at 94.5x, nearly 24 times higher TPS load. What's more, with the i3.xlarge you're sharing half that drive with a neighbor. There's no way to cap the IOPS so that you each get half; there's only a partition of the storage. This means that you can expect latency spikes originating from the neighbor. The i3.2xlarge is the smallest instance that gives you the entire SSD.
So, you take the ACT information and you use it to do capacity planning. The main factors you need to know are the average object size (you can find that using objsz histogram), number of objects (again, available via asadm), peak read TPS and peak write TPS (how does the 60Ktps you mentioned split between reads and writes?).
Check your logs for your cache-read-pct values. If they're in the range of 10% or higher you should be raising your post-write-queue value to get better read latencies (and also reduce IOPS pressure from the drive).
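For reference, post-write-queue lives in the namespace's storage-engine block; a hedged example (the value below is only an illustration, not a recommendation, and the setting is sized in write blocks, default 256):

storage-engine device {
    device /dev/nvme0n1
    scheduler-mode noop
    write-block-size 128K
    post-write-queue 1024   # example value; raise it if cache-read-pct stays high
}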

Hierarchical quorums in Zookeeper

I am trying to understand hierarchical quorums in Zookeeper. The documentation here gives an example, but I am still not quite sure I understand it. My question is: if I have a two-node Zookeeper cluster (I know it is not recommended, but let's consider it for the sake of this example) with
server.1 and
server.2,
can I have hierarchical quorums as follows:
group.1=1:2
weight.1=2
weight.2=2
With the above configuration:
Even if one node goes down, do I still have enough votes to maintain a quorum? Is that a correct statement?
What is the Zookeeper quorum value here (2, for two nodes, or 3, for 4 votes)?
In a second example, say I have:
group.1=1:2
weight.1=2
weight.2=1
In this case, if server.2 goes down, should I still have sufficient votes (2) to maintain a quorum?
As far as I understand from the documentation, when we give weights to nodes, the majority is no longer simply the number of nodes. For example, if there are 10 nodes and 3 of them have been given 70 percent of the weight, then it is enough to have those three nodes active in the network. Hence:
You don't have a majority, since both nodes have an equal weight of 2. So if one node goes down, only 50 percent of the network's weight is active, and quorum is not achieved.
Since the total weight is 4, we require 70 percent of 4, which is 2.8, so roughly 3. Since we have only two nodes, both need to be active to meet the quorum.
In the second example, it is clear from the given weights that 2/3 of the network would be enough (it depends on the configuration we set; I would assume 70 percent always). If 65 percent is enough to say that the network is alive, then quorum is reached with the one node that has a weight of 2.

FAT12 - reading first cluster number of file from root directory

In the root directory of FAT12, bytes 26-27 represent the number of the first cluster of the file. However, cluster numbers in FAT12 are 12 bits long. So what part of that 2-byte entry in the root directory contains the actual 12-bit cluster number? Is there any conversion that needs to be performed on those 2 bytes to get the cluster number? I have looked around on the Internet but can't find a proper explanation for this.
The lowest 12 bits, i.e. you do an & 0x0FFF in your code. But on the other hand, the full 16 bits are used – the other 4 bits are just filled with 0, so the number is a valid word (16-bit integer).
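For illustration, a small Python sketch of reading that field from a 32-byte directory entry (entry here is assumed to be the raw bytes of one root-directory entry):

def first_cluster(entry: bytes) -> int:
    # Bytes 26-27 hold the starting cluster as a little-endian 16-bit word;
    # on FAT12 only the low 12 bits are meaningful, the top 4 are zero.
    word = int.from_bytes(entry[26:28], "little")
    return word & 0x0FFF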