Calculate Max over time on sum function in Prometheus - kubernetes

I am running Prometheus in my Kubernetes cluster.
I have the following setup in Kubernetes:
I have 4 nodes and I want to calculate the free memory. I want the summation over those four nodes, and then the maximum of that sum over 1 day. So, for example,
at time=t1
node1: 500 MB
node2: 600 MB
node3: 200 MB
node4: 300 MB
Total = 1700 MB
at time=t2
node1: 400 MB
node2: 700 MB
node3: 100 MB
node4: 200 MB
Total = 1300 MB
at time=t3
node1: 600 MB
node2: 800 MB
node3: 1200 MB
node4: 1300 MB
Total = 3900 MB
at time=t4
node1: 100 MB
node2: 200 MB
node3: 300 MB
node4: 400 MB
Total = 1000 MB
So the answer to my query should be 3900 MB. I am not able to apply max_over_time to the sum.
I have tried this (which is not working at all):
max_over_time(sum(node_memory_MemFree)[2m])

Since version 2.7 (Jan 2019), Prometheus supports sub-queries:
max_over_time( sum(node_memory_MemFree_bytes{instance=~"foobar.*"})[1d:1h] )
(the metric over the past day, with a resolution of 1 hour.)
Read the documentation for more information on using subqueries: https://prometheus.io/docs/prometheus/latest/querying/examples/#subquery
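Applied to the question above (the maximum over one day of the free memory summed across all nodes), a minimal sketch could look like this; the 5-minute resolution step is only an assumption and can be tuned:
max_over_time(sum(node_memory_MemFree_bytes)[1d:5m])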
However, do note the blog recommendation:
Epilogue
Though subqueries are very convenient to use in place of recording rules, using them unnecessarily has performance implications. Heavy subqueries should eventually be converted to recording rules for efficiency.
It is also not recommended to have subqueries inside a recording rule. Rather create more recording rules if you do need to use subqueries in a recording rule.
The use of recording rules is explained in Brian Brazil's article:
https://www.robustperception.io/composing-range-vector-functions-in-promql/

This isn't possible in one expression; you need to use a recording rule for the intermediate expression. See
https://www.robustperception.io/composing-range-vector-functions-in-promql/
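A minimal sketch of the recording-rule approach (the rule group and series names are mine, chosen for illustration; it assumes node_exporter's node_memory_MemFree_bytes metric):
groups:
  - name: memory
    rules:
      - record: job:node_memory_MemFree_bytes:sum
        expr: sum(node_memory_MemFree_bytes)
The maximum can then be taken over the recorded series with a plain range vector:
max_over_time(job:node_memory_MemFree_bytes:sum[1d])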

Related

Handling of Cassandra blocking writes when exceeding the memtable_cleanup_threshold

I was reading through the Cassandra flushing strategies and came across the following statement:
If the data to be flushed exceeds the memtable_cleanup_threshold, Cassandra blocks writes until the next flush succeeds.
Now my question is: say we have very heavy writes to Cassandra, about 10K records per second, and the application is running 24*7. What settings should we use for the following parameters to avoid blocking?
memtable_heap_space_in_mb
memtable_offheap_space_in_mb
memtable_cleanup_threshold
Also, since it is time-series data, do I need to make any changes to the compaction strategy as well? If yes, what would be best for my case?
My Spark application, which takes data from Kafka and continuously inserts it into Cassandra, hangs after a particular time, and I have analysed that at that moment there are a lot of pending tasks in nodetool compactionstats.
nodetool tablehistograms
Percentile  SSTables  Write Latency (ms)  Read Latency (ms)  Partition Size (bytes)  Cell Count
50% 642.00 88.15 25109.16 310 24
75% 770.00 263.21 668489.53 535 50
95% 770.00 4055.27 668489.53 3311 310
98% 770.00 8409.01 668489.53 73457 6866
99% 770.00 12108.97 668489.53 219342 20501
Min 4.00 11.87 20924.30 150 9
Max 770.00 1996099.05 668489.53 4866323 454826
Keyspace : trackfleet_db
Read Count: 7183347
Read Latency: 15.153115504235004 ms
Write Count: 2402229293
Write Latency: 0.7495135263492935 ms
Pending Flushes: 1
Table: locationinfo
SSTable count: 3307
Space used (live): 62736956804
Space used (total): 62736956804
Space used by snapshots (total): 10469827269
Off heap memory used (total): 56708763
SSTable Compression Ratio: 0.38214618375483633
Number of partitions (estimate): 493571
Memtable cell count: 2089
Memtable data size: 1168808
Memtable off heap memory used: 0
Memtable switch count: 88033
Local read count: 765497
Local read latency: 162.880 ms
Local write count: 782044138
Local write latency: 1.859 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 368
Bloom filter false ratio: 0.00000
Bloom filter space used: 29158176
Bloom filter off heap memory used: 29104216
Index summary off heap memory used: 7883835
Compression metadata off heap memory used: 19720712
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 4866323
Compacted partition mean bytes: 7626
Average live cells per slice (last five minutes): 3.5
Maximum live cells per slice (last five minutes): 6
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 359
After changing the compaction strategy:
Keyspace : trackfleet_db
Read Count: 8568544
Read Latency: 15.943608060365916 ms
Write Count: 2568676920
Write Latency: 0.8019530641630868 ms
Pending Flushes: 1
Table: locationinfo
SSTable count: 5843
SSTables in each level: [5842/4, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 71317936302
Space used (total): 71317936302
Space used by snapshots (total): 10469827269
Off heap memory used (total): 105205165
SSTable Compression Ratio: 0.3889946058934169
Number of partitions (estimate): 542002
Memtable cell count: 235
Memtable data size: 131501
Memtable off heap memory used: 0
Memtable switch count: 93947
Local read count: 768148
Local read latency: NaN ms
Local write count: 839003671
Local write latency: 1.127 ms
Pending flushes: 1
Percent repaired: 0.0
Bloom filter false positives: 1345
Bloom filter false ratio: 0.00000
Bloom filter space used: 54904960
Bloom filter off heap memory used: 55402400
Index summary off heap memory used: 14884149
Compression metadata off heap memory used: 34918616
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 4866323
Compacted partition mean bytes: 4478
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 660
Thanks,
I would not touch the memtable settings unless it's a problem. They will only really block if you're writing at a rate that exceeds your disks' ability to keep up, or if GCs are messing up timings. "10K records per second and application is running 24*7" isn't actually that much, given the records are not very large, and will not overrun writes (a decent system can do 100k-200k/s at constant load). nodetool tablestats, tablehistograms, and the schema can help identify whether your records are too big or your partitions too wide, and give a better indication of what your compaction strategy should be (probably TWCS, but maybe LCS if you have any reads at all and partitions span a day or so).
Pending tasks in nodetool compactionstats has nothing to do with memtable settings either; it means your compactions are not keeping up. This can be just something like spikes as bulk jobs run, small partitions flush, or repairs stream sstables over, but if it grows instead of going down you need to tune your compaction strategy. Really, a lot depends on the data model and the stats (tablestats/tablehistograms).
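If the data really is time-series with a TTL, switching the table to TWCS would look roughly like this in cqlsh; the one-day window is only an assumption and should be matched to the TTL and query pattern:
ALTER TABLE trackfleet_db.locationinfo
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1};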
You may refer to this link to tune the above parameters: http://abiasforaction.net/apache-cassandra-memtable-flush/
memtable_cleanup_threshold – a percentage of your total available memtable space that will trigger a memtable cleanup.
memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1). By default this is essentially 33% of your memtable_heap_space_in_mb. A scheduled cleanup results in flushing of the table/column family that occupies the largest portion of memtable space. This keeps happening until your available memtable memory drops below the cleanup threshold.
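For reference, the related settings live in cassandra.yaml; the values below are only an illustration of the relationship described above, not a recommendation:
# cassandra.yaml (illustrative values)
memtable_heap_space_in_mb: 2048      # on-heap memtable space
memtable_offheap_space_in_mb: 2048   # off-heap memtable space
memtable_flush_writers: 2
# memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1),
# i.e. roughly 0.33 with 2 flush writers; crossing that fraction flushes
# the largest memtable.
# memtable_cleanup_threshold: 0.33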

Aerospike bad latencies with AWS

We have Aerospike running in SoftLayer on bare-metal machines in a 2-node cluster. Our average record size is 1.5 KB and at peak we see around 6000 ops/sec on each node. The latencies are all fine; at peak, the fraction of operations taking > 1 ms is around 5%.
Now we have planned to migrate to AWS, so we booted 2 i3.xlarge machines. We ran the benchmark with the 1.5 KB object size at 3x load and the results were satisfactory, around 4-5% (> 1 ms). When we started actual processing, the latencies at peak jumped to 25-30% (> 1 ms) and the maximum it could accommodate was some 5K ops/sec. So we added one more node and ran the benchmark again (4.5 KB object size and 3x load); the results were 2-4% (> 1 ms). After adding it to the cluster, the peak came down to 16-22%. We added one more node and the peak is now at 10-15%.
The version in AWS is aerospike-server-community-3.15.0.2; the version in SoftLayer is Aerospike Enterprise Edition 3.6.3.
Our config is as follows:
#Aerospike database configuration file.
service {
user xxxxx
group xxxxx
run-as-daemon
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
pidfile /var/run/aerospike/asd.pid
service-threads 8
transaction-queues 8
transaction-threads-per-queue 8
proto-fd-max 15000
}
logging {
#Log file must be an absolute path.
file /var/log/aerospike/aerospike.log {
context any info
}
}
network {
service {
port 13000
address h1 reuse-address
}
heartbeat {
mode mesh
port 13001
address h1
mesh-seed-address-port h1 13001
mesh-seed-address-port h2 13001
mesh-seed-address-port h3 13001
mesh-seed-address-port h4 13001
interval 150
timeout 10
}
fabric {
port 13002
address h1
}
info {
port 13003
address h1
}
}
namespace XXXX {
replication-factor 2
memory-size 27G
default-ttl 10d
high-water-memory-pct 70
high-water-disk-pct 60
stop-writes-pct 90
storage-engine device {
device /dev/nvme0n1
scheduler-mode noop
write-block-size 128K
}
}
What should be done to bring down the latencies in AWS?
This comes down to the difference in the performance characteristics of the SSDs of the i3 nodes, compared to what you had on Softlayer. If you ran Aerospike on a floppy disk you'd get 0.5TPS.
Piyush's comment mentions ACT, the open source tool Aerospike has created to benchmark SSDs with real database workloads. The point of ACT is to find the sustained rate in which the SSD can be relied on to deliver the latency you want. Burst rates don't matter much for databases.
The performance engineering team at Aerospike has used ACT to find what the i3 1900G SSD can do, and published the results in a post. Its ACT rating is 4x, meaning that the full 1900G SSD can do 8Ktps reads, 4Ktps writes with the standard 1.5K object size, 128K block size, and stay at 95% < 1ms, 99% < 8ms, 99.9% < 64ms. This is not particularly good for an SSD. By comparison, a Micron 9200 PRO rates at 94.5x, nearly 24 times higher TPS load. What's more, with the i3.xlarge you're sharing half that drive with a neighbor. There's no way to cap the IOPS so that you each get half; there's only a partitioning of the storage. This means that you can expect latency spikes originating in the neighbor. The i3.2xlarge is the smallest instance that gives you the entire SSD.
So, you take the ACT information and you use it to do capacity planning. The main factors you need to know are the average object size (you can find that using objsz histogram), number of objects (again, available via asadm), peak read TPS and peak write TPS (how does the 60Ktps you mentioned split between reads and writes?).
Check your logs for your cache-read-pct values. If they're in the range of 10% or higher you should be raising your post-write-queue value to get better read latencies (and also reduce IOPS pressure from the drive).
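As a sketch (the value 512 is an example only, to be tuned against cache-read-pct and available memory): post-write-queue is set per device in the namespace's storage-engine block, and, as far as I know, it can also be changed at runtime with asinfo:
storage-engine device {
    device /dev/nvme0n1
    scheduler-mode noop
    write-block-size 128K
    post-write-queue 512    # number of write blocks cached per device (example value)
}
# dynamic change without a restart (example value)
asinfo -v "set-config:context=namespace;id=XXXX;post-write-queue=512"
# check the cache hit rate in the logs
grep cache-read-pct /var/log/aerospike/aerospike.log | tail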

GCE SSD persistent disk slower than Standard persistent disk

We are using GCE for a MongoDB replica set with three members. As our data is quite large, the initial sync for a new member takes quite a long time. In our case the initial sync takes 7 hours for copying records and then 30 hours to create the indexes.
The database is stored on a separate disk with these properties (copy-paste from the GCE console):
Type: Standard persistent disk
Size: 2000 GB
Zone: us-central1-c
Sustained random IOPS limit - estimated (R/W): 1,500 / 3,000
Sustained throughput limit (MB/s) - estimated (R/W): 180 / 120
To speed up we tried to add an SSD disk:
Type: SSD persistent disk
Size: 1000 GB
Zone: us-central1-c
Sustained random IOPS limit - estimated (R/W): 15,000 / 15,000
Sustained throughput limit (MB/s) - estimated: 240 / 240
One would expect the SSD disk to be quite a bit faster than the Standard disk, but our results are different. During the initial MongoDB sync the Standard disk was several times faster than the SSD. While it took 7 hours for the Standard disk to copy all the data, the SSD disk had copied just half of the data after 12 hours. We used the Linux tool iostat and measured that the Standard disk achieves around 80,000 kB_wrtn/s while the SSD disk is around 8,000 kB_wrtn/s. How is it possible that the SSD disk is 10 times slower than the Standard disk?
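For what it's worth, the raw disk throughput could also be cross-checked independently of MongoDB with a direct sequential-write test using fio (the mount point /mnt/ssd is hypothetical; --direct=1 bypasses the page cache):
fio --name=seqwrite --filename=/mnt/ssd/fio.test --size=4G \
    --bs=1M --rw=write --direct=1 --ioengine=libaio --iodepth=16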

Kafka 0.10 Replication Performance Decrease

I am trying to benchmark a Kafka cluster. I am a newbie. I built a 3-node cluster, each node with one partition. I did not change the default broker settings. I just copied the producer and consumer code directly from the official website.
When I create a topic with replication 1 and 3 partitions, I get about 170 MB per second of throughput. When I create a topic with replication 3 and 3 partitions, I hardly see 30 MB per second of throughput.
Then I applied the production config from this link: https://kafka.apache.org/documentation#prodconfig. The results got worse.
Can you share your experience with me?
disk type  replication  insert count  message length (bytes)  elapsed time (s)  req per sec  concurrency  throughput (MB/s)
hdd 1 10,000,000 250 25 400,000 1 95.36743164
hdd 1 10,000,000 250 28 357,000 2 170.2308655
hdd 1 10,000,000 250 55 175,000 4 166.8930054
hdd 1 1,000,000 250 22 45,400 8 86.59362793
hdd 1 10,000,000 250 22 85,000 8 162.1246338
hdd 3 1,000,000 250 10 100,000 1 23.84185791
hdd 3 1,000,000 250 19 55,000 2 26.2260437
hdd 3 1,000,000 250 30 32,000 4 30.51757813
hdd 3 1,000,000 250 45 20,000 8 38.14697266
hdd 3 10,000,000 250 559 18,000 8 34.33227539
You should expect performance to decrease when increasing replication. Your initial run had such high throughput because Kafka didn't need to copy the message data to replicas on other brokers. When you increase the replication factor, you're basically trading speed for durability.
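For reproducible numbers, the producer perf tool that ships with Kafka can be used to repeat the comparison; the topic name and broker address below are placeholders, and the producer's acks setting controls how much of the replication cost sits on the write path (acks=1 waits only for the leader, acks=all waits for all in-sync replicas):
# topic created with replication factor 3; wait only for the leader
bin/kafka-producer-perf-test.sh --topic test-r3 \
  --num-records 10000000 --record-size 250 --throughput -1 \
  --producer-props bootstrap.servers=broker1:9092 acks=1
# same topic, wait for all in-sync replicas
bin/kafka-producer-perf-test.sh --topic test-r3 \
  --num-records 10000000 --record-size 250 --throughput -1 \
  --producer-props bootstrap.servers=broker1:9092 acks=all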

Units of perf stat statistics

I'm using perf stat for some purposes, and to better understand the working of the tool, I wrote a program that copies one file's contents into another. I ran the program on a 750 MB file and the stats are below.
31691336329 L1-dcache-loads
44227451 L1-dcache-load-misses
15596746809 L1-dcache-stores
20575093 L1-dcache-store-misses
26542169 cache-references
13410669 cache-misses
36859313200 cycles
75952288765 instructions
26542163 cache-references
What are the units of each number? What I mean is: are they bits, bytes, or something else? Thanks in advance.
The unit is "single cache access" for loads, stores, references and misses. Loads correspond to the count of load instructions executed by the processors; the same goes for stores. Misses is the count of how many loads and stores were unable to get their data from the cache at that level: the L1 data cache for the L1-dcache- events, and the Last Level Cache (usually L2 or L3, depending on your platform) for the cache- events.
31 691 336 329 L1-dcache-loads
44 227 451 L1-dcache-load-misses
15 596 746 809 L1-dcache-stores
20 575 093 L1-dcache-store-misses
26 542 169 cache-references
13 410 669 cache-misses
Cycles is the total count of CPU ticks for which the CPU executed your program. If you have a 3 GHz CPU, there will be around 3 000 000 000 cycles per second at most. If the machine was busy, there will be fewer cycles available for your program.
36 859 313 200 cycles
This is the total count of instructions executed by your program:
75 952 288 765 instructions
(I will use G suffix as abbreviation for billion)
From the numbers we can conclude: 76G instructions were executed in 37G cycles (around 2 instructions per CPU tick, a rather high level of IPC). You gave no information about your CPU and its frequency, but assuming a 3 GHz CPU, the running time was close to 12 seconds.
Of the 76G instructions, you have 31G load instructions (42%) and 15G store instructions (21%), so only 37% of instructions were not memory instructions. I don't know what the size of the memory references was (single-byte loads and stores, 2-byte, or wide SSE movs), but 31G load instructions looks too high for a 750 MB file (the mean would be 0.02 bytes per load, while the shortest possible load or store is a single byte). So I think that your program did several copies of the data, or the file was bigger. 750 MB in 12 seconds looks rather slow (60 MB/s), but this can be true if the first file was read and the second file was written to disk without caching by the Linux kernel (do you have an fsync() call in your program? Are you profiling your CPU or your HDD?). With cached files and/or a RAM drive (tmpfs, the filesystem stored in RAM), this speed should be much higher.
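For reference, a command of this shape collects exactly the counters listed above for such a copy program (./copyfile and the file names are placeholders):
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,cache-references,cache-misses,cycles,instructions ./copyfile in.bin out.bin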
Modern versions of perf do some simple calculations in perf stat and may also print units, as shown here: http://www.bnikolic.co.uk/blog/hpc-prof-events.html
perf stat -d md5sum *
578.920753 task-clock # 0.995 CPUs utilized
211 context-switches # 0.000 M/sec
4 CPU-migrations # 0.000 M/sec
212 page-faults # 0.000 M/sec
1,744,441,333 cycles # 3.013 GHz [20.22%]
1,064,408,505 stalled-cycles-frontend # 61.02% frontend cycles idle [30.68%]
104,014,063 stalled-cycles-backend # 5.96% backend cycles idle [41.00%]
2,401,954,846 instructions # 1.38 insns per cycle
# 0.44 stalled cycles per insn [51.18%]
14,519,547 branches # 25.080 M/sec [61.21%]
109,768 branch-misses # 0.76% of all branches [61.48%]
266,601,318 L1-dcache-loads # 460.514 M/sec [50.90%]
13,539,746 L1-dcache-load-misses # 5.08% of all L1-dcache hits [50.21%]
0 LLC-loads # 0.000 M/sec [39.19%]
(wrongevent?)0 LLC-load-misses # 0.00% of all LL-cache hits [ 9.63%]
0.581869522 seconds time elapsed
UPDATE Apr 18, 2014
please explain why cache-references are not correlating with L1-dcache numbers
cache-references DOES correlate with the L1-dcache numbers: cache-references is comparable to L1-dcache-store-misses and L1-dcache-load-misses. Why are the numbers not equal? Because in your CPU (Core i5-2320) there are 3 levels of cache: L1, L2, L3; and the LLC (last level cache) is L3. So a load or store instruction at first tries to get/save its data in/from the L1 cache (L1-dcache-loads, L1-dcache-stores). If the address was not cached in L1, the request goes to L2 (L1-dcache-load-misses, L1-dcache-store-misses). In this run we have no exact data on how many requests were served by L2 (those counters were not included in the default set of perf stat), but we can assume that some loads/stores were served and some were not. The requests not served by L2 go to L3 (the LLC), and we see that there were 26M references to L3 (cache-references) and half of them (13M) were L3 misses (cache-misses; served by main RAM). The other half were L3 hits.
44M + 20M = 64M misses from L1 were passed to L2. 26M requests were passed from L2 to L3; those are L2 misses. So 64M - 26M = 38 million requests were served by L2 (L2 hits).
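To check that arithmetic directly, the L1 miss and LLC counters can be requested in a single run; whether each generic event name maps to a hardware counter depends on the CPU and kernel:
perf stat -e L1-dcache-load-misses,L1-dcache-store-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses ./copyfile in.bin out.bin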