Understanding "Number of keys" in nodetool cfstats - database-performance

I am new to Cassandra. In this example I am using a cluster with 1 DC and 5 nodes, and NetworkTopologyStrategy with a replication factor of 3.
Keyspace: activityfeed
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Table: feed_shubham
SSTable count: 1
Space used (live), bytes: 52620684
Space used (total), bytes: 52620684
SSTable Compression Ratio: 0.3727660543119897
Number of keys (estimate): 137984
Memtable cell count: 0
Memtable data size, bytes: 0
Memtable switch count: 0
Local read count: 0
Local read latency: 0.000 ms
Local write count: 0
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used, bytes: 174416
Compacted partition minimum bytes: 771
Compacted partition maximum bytes: 924
Compacted partition mean bytes: 924
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
What does "Number of keys" mean here?
I have 5 different nodes in my cluster, and after running the command below on each node separately I get different statistics for the same table.
nodetool cfstats -h 192.168.1.12 activityfeed.feed_shubham
From the output above I gather that cfstats gives me stats about the physical storage of data on each node.
I also went through the doc below:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCFstats.html
But I did not find an explanation of the number of keys in there.
I am using RandomPartitioner.
Does this key have anything to do with the partition key?
I have around 200000 records in my table.

The number of keys represents the number of partition keys on that node for the table. It's just an estimate, though, and how accurate it is depends on your version of C*. Before 2.1.6 it summed the number of partitions listed in the index file of each SSTable. From 2.1.6 onwards it merges a sketch of the data (HyperLogLog) that's stored per SSTable.
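To make the difference between the two estimation approaches concrete, here is a minimal sketch. It is not Cassandra's code: exact Python sets stand in for the per-SSTable HyperLogLog sketches, the SSTable contents are made up, and the per-node calculation assumes every record is its own partition and that data is spread evenly across nodes.

```python
# Hypothetical SSTables: each one is modelled as the set of partition keys it holds.
# Real Cassandra keeps a HyperLogLog sketch per SSTable; exact sets stand in for it.

def summed_estimate(sstables):
    """Older approach: add up the per-SSTable partition counts."""
    return sum(len(keys) for keys in sstables)

def merged_estimate(sstables):
    """Newer approach: merge the per-SSTable summaries so repeated keys count once."""
    merged = set()
    for keys in sstables:
        merged |= keys
    return len(merged)

def expected_keys_per_node(total_partitions, replication_factor, node_count):
    """Rough number of partitions one node holds if data is spread evenly."""
    return total_partitions * replication_factor // node_count

sstables = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "e"}]
print(summed_estimate(sstables))              # 8 (keys b and c are counted repeatedly)
print(merged_estimate(sstables))              # 5
print(expected_keys_per_node(200_000, 3, 5))  # 120000
```

With a single SSTable, as in the output in the question, both approaches give the same number. Under the even-spread assumption, 200000 records with replication factor 3 on 5 nodes come to about 120000 partitions per node, which is the same order of magnitude as the 137984 reported.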

This value seems to indicate the total number of columns/cells in all local sstables. I guess it should be rather named "SSTable cell count" just as the corresponding memtable value. However, as sstables store redundant data before compaction, this value will not necessarily correspond to the actual number of columns returned as part of a result set.

Related

Slow queries only on slaves (postgresql)

I am using the largest instance size of an Aurora PostgreSQL cluster (engine version 13.3).
When only the writer cluster endpoint was used, no slow queries occurred at all.
However, after the read queries were changed to use the reader instance, slow queries started to occur.
(Sometimes queries take more than 10 seconds. A query that finishes in milliseconds on the writer can take more than 1 second on the reader.)
When the exact same query is executed on the master, it is fast. The query plans on the reader and the writer are the same.
The cache hit metrics in AWS monitoring did not go down.
What should be checked to determine the cause and fix it?
Highest Time Node Info
First Query on Master
Node Type Index Scan
Parent Relationship Outer
Parallel Aware true
Scan Direction Forward
Index Name trace_idx09
Schema public
Alias trace_
Startup Cost 0.69
Total Cost 15609.54
Plan Rows 1610
Plan Width 24
Actual Startup Time 5.936
Actual Total Time 50.491
Actual Rows 2902
Actual Loops 3
Rows Removed by Index Recheck 0
Shared Hit Blocks 3917
Shared Read Blocks 396
Shared Dirtied Blocks 0
Shared Written Blocks 0
Local Hit Blocks 0
Local Read Blocks 0
Local Dirtied Blocks 0
Local Written Blocks 0
Temp Read Blocks 0
Temp Written Blocks 0
I/O Read Time 870.529
I/O Write Time 0
Workers [object Object],[object Object]
*Planner Row Estimate Factor 1.8024844720496895
*Planner Row Estimate Direction 1
*Actual Duration 151.473
*Actual Cost 15609.54
*Slowest Node (by duration) false
*Largest Node (by rows) true
*Costiest Node (by cost) true
First Query on Slave
Node Type Index Scan
Parent Relationship Outer
Parallel Aware true
Scan Direction Forward
Index Name trace_idx09
Schema public
Alias trace_
Startup Cost 0.69
Total Cost 15609.54
Plan Rows 1610
Plan Width 24
Actual Startup Time 3046.454
Actual Total Time 3387.273
Actual Rows 2902
Actual Loops 3
Rows Removed by Index Recheck 0
Shared Hit Blocks 1096
Shared Read Blocks 4191
Shared Dirtied Blocks 0
Shared Written Blocks 0
Local Hit Blocks 0
Local Read Blocks 0
Local Dirtied Blocks 0
Local Written Blocks 0
Temp Read Blocks 0
Temp Written Blocks 0
I/O Read Time 8763.617
I/O Write Time 0
Workers [object Object],[object Object]
*Planner Row Estimate Factor 1.8024844720496895
*Planner Row Estimate Direction 1
*Actual Duration 10161.819
*Actual Cost 15609.54
*Slowest Node (by duration) true
*Largest Node (by rows) true
*Costiest Node (by cost) true
Second Query on Slave
Node Type Index Scan
Parent Relationship Outer
Parallel Aware true
Scan Direction Forward
Index Name trace_idx09
Schema public
Alias trace_
Startup Cost 0.69
Total Cost 15609.54
Plan Rows 1610
Plan Width 24
Actual Startup Time 3.734
Actual Total Time 8.375
Actual Rows 2902
Actual Loops 3
Rows Removed by Index Recheck 0
Shared Hit Blocks 4039
Shared Read Blocks 2
Shared Dirtied Blocks 0
Shared Written Blocks 0
Local Hit Blocks 0
Local Read Blocks 0
Local Dirtied Blocks 0
Local Written Blocks 0
Temp Read Blocks 0
Temp Written Blocks 0
I/O Read Time 3.538
I/O Write Time 0
Workers [object Object],[object Object]
*Planner Row Estimate Factor 1.8024844720496895
*Planner Row Estimate Direction 1
*Actual Duration 25.125
*Actual Cost 15609.54
*Slowest Node (by duration) false
*Largest Node (by rows) true
*Costiest Node (by cost) true

Handling of cassandra blocking writes when exceeds the memtable_cleanup_threshold

I was reading about Cassandra's flushing strategies and came across the following statement:
If the data to be flushed exceeds the memtable_cleanup_threshold, Cassandra blocks writes until the next flush succeeds.
My question is: let's say we have very heavy writes to Cassandra, about 10K records per second, and the application runs 24*7. How should we set the following parameters to avoid blocking?
memtable_heap_space_in_mb
memtable_offheap_space_in_mb
memtable_cleanup_threshold
Also, since this is time series data, do I need to change the compaction strategy as well? If yes, which would be best for my case?
My Spark application, which takes data from Kafka and continuously inserts it into Cassandra, hangs after a certain time. When I analysed it at that moment, there were a lot of pending tasks in nodetool compactionstats.
nodetool tablehistograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (ms)           (ms)          (bytes)
50%         642.00    88.15          25109.16      310             24
75%         770.00    263.21         668489.53     535             50
95%         770.00    4055.27        668489.53     3311            310
98%         770.00    8409.01        668489.53     73457           6866
99%         770.00    12108.97       668489.53     219342          20501
Min         4.00      11.87          20924.30      150             9
Max         770.00    1996099.05     668489.53     4866323         454826
Keyspace : trackfleet_db
Read Count: 7183347
Read Latency: 15.153115504235004 ms
Write Count: 2402229293
Write Latency: 0.7495135263492935 ms
Pending Flushes: 1
Table: locationinfo
SSTable count: 3307
Space used (live): 62736956804
Space used (total): 62736956804
Space used by snapshots (total): 10469827269
Off heap memory used (total): 56708763
SSTable Compression Ratio: 0.38214618375483633
Number of partitions (estimate): 493571
Memtable cell count: 2089
Memtable data size: 1168808
Memtable off heap memory used: 0
Memtable switch count: 88033
Local read count: 765497
Local read latency: 162.880 ms
Local write count: 782044138
Local write latency: 1.859 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 368
Bloom filter false ratio: 0.00000
Bloom filter space used: 29158176
Bloom filter off heap memory used: 29104216
Index summary off heap memory used: 7883835
Compression metadata off heap memory used: 19720712
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 4866323
Compacted partition mean bytes: 7626
Average live cells per slice (last five minutes): 3.5
Maximum live cells per slice (last five minutes): 6
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 359
After changing the compaction strategy:
Keyspace : trackfleet_db
Read Count: 8568544
Read Latency: 15.943608060365916 ms
Write Count: 2568676920
Write Latency: 0.8019530641630868 ms
Pending Flushes: 1
Table: locationinfo
SSTable count: 5843
SSTables in each level: [5842/4, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 71317936302
Space used (total): 71317936302
Space used by snapshots (total): 10469827269
Off heap memory used (total): 105205165
SSTable Compression Ratio: 0.3889946058934169
Number of partitions (estimate): 542002
Memtable cell count: 235
Memtable data size: 131501
Memtable off heap memory used: 0
Memtable switch count: 93947
Local read count: 768148
Local read latency: NaN ms
Local write count: 839003671
Local write latency: 1.127 ms
Pending flushes: 1
Percent repaired: 0.0
Bloom filter false positives: 1345
Bloom filter false ratio: 0.00000
Bloom filter space used: 54904960
Bloom filter off heap memory used: 55402400
Index summary off heap memory used: 14884149
Compression metadata off heap memory used: 34918616
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 4866323
Compacted partition mean bytes: 4478
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 660
Thanks,
I would not touch the memtable settings unless they are a problem. Writes will only really block if you are writing at a rate that exceeds your disks' ability to write, or if GCs are messing up timings. "10K records per second and application is running 24*7" isn't actually that much, provided the records are not very large, and will not overrun writes (a decent system can do 100k-200k/s constant load). nodetool tablestats, tablehistograms, and the schema can help identify whether your records are too big or your partitions too wide, and give a better indication of what your compaction strategy should be (probably TWCS, but maybe LCS if you have any reads at all and partitions span a day or so).
Pending tasks in nodetool compactionstats don't really have anything to do with the memtable settings either; they mean your compactions are not keeping up. This can just be spikes as bulk jobs run, small partitions flush, or repairs stream SSTables over, but if the number grows instead of going down you need to tune your compaction strategy. A lot really depends on the data model and stats (tablestats/tablehistograms).
You may refer to this link to tune the above parameters: http://abiasforaction.net/apache-cassandra-memtable-flush/
memtable_cleanup_threshold: A percentage of your total available memtable space that will trigger a memtable cleanup.
memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1). By default this is essentially 33% of your memtable_heap_space_in_mb. A scheduled cleanup results in flushing of the table/column family that occupies the largest portion of memtable space. This keeps happening till your available memtable memory drops below the cleanup threshold.
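As a quick illustration of the default formula quoted above, the sketch below computes the threshold for a given number of flush writers and the amount of memtable space at which a cleanup would be triggered. The function names and the 2048 MB figure are made up for the example; only the formula 1 / (memtable_flush_writers + 1) comes from the quoted text.

```python
def default_cleanup_threshold(memtable_flush_writers):
    """Default threshold as quoted: 1 / (memtable_flush_writers + 1)."""
    return 1 / (memtable_flush_writers + 1)

def cleanup_trigger_mb(memtable_space_mb, memtable_flush_writers):
    """Memtable usage (in MB) at which a cleanup flush would start."""
    return memtable_space_mb * default_cleanup_threshold(memtable_flush_writers)

# Two flush writers give the "essentially 33%" mentioned in the quote.
print(round(default_cleanup_threshold(2), 3))  # 0.333

# Hypothetical 2048 MB of memtable space with 3 flush writers.
print(cleanup_trigger_mb(2048, 3))             # 512.0
```

Note that adding flush writers lowers the default threshold, so cleanups start at a smaller share of the memtable space.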

How large does the FAT structure need to be, and how large can a file be?

Consider the following parameters of a FAT-based filesystem:
Blocks are 8KB (2^13 bytes) large
FAT entries are 32 bits wide, of which 24 bits are used to store a block address
A. How large does the FAT structure need to be to accommodate a 1GB (2^30 bytes) disk?
B. What is the largest theoretical file size supported by the FAT structure from part (A)?
A. How large does the FAT structure need to be to accommodate a 1GB (2^30 bytes) disk?
The FAT file system splits the space into clusters, then has a table (the "cluster allocation table" or FAT) with an entry for each cluster (to say if it's free, faulty or which cluster is the next cluster in a chain of clusters). To work out size of the "cluster allocation table" divide the total size of the volume by the size of a cluster (to determine how many clusters and how many entries in the "cluster allocation table"), then multiply by the size of one entry, then maybe round up to a multiple of the cluster size or not (depending on which answer you want - actual size or space consumed).
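The procedure described above can be sketched with the numbers from the question. This assumes, as the question does, that the whole 2^30-byte disk is one volume and that each 32-bit entry occupies 4 bytes; the function names are only for illustration.

```python
def fat_size_bytes(volume_bytes, cluster_bytes, entry_bytes):
    """Size of a table holding one entry per cluster of the volume."""
    clusters = -(-volume_bytes // cluster_bytes)  # ceiling division
    return clusters * entry_bytes

def round_up_to_cluster(size_bytes, cluster_bytes):
    """Space actually consumed if the table must occupy whole clusters."""
    return -(-size_bytes // cluster_bytes) * cluster_bytes

volume = 2**30   # 1 GB disk, treated as a single volume (assumption)
cluster = 2**13  # 8 KB blocks
entry = 4        # 32-bit FAT entries

table = fat_size_bytes(volume, cluster, entry)
print(volume // cluster)                    # 131072 clusters (2^17)
print(table)                                # 524288 bytes (512 KiB)
print(round_up_to_cluster(table, cluster))  # 524288, already a whole number of clusters
```

So under these assumptions the table has 2^17 entries and takes 2^19 bytes, which is 64 clusters.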
B. What is the largest theoretical file size supported by the FAT structure from part (A)?
The largest file size supported is determined by either (whichever is smaller):
the size of "file size" field in the file's directory entry (which is 32-bit for FAT32 and would therefore be 4 GiB); or
the total size of the space minus the space consumed by the hidden/reserved/system area, cluster allocation table, directories and faulty clusters.
For a 1 GiB volume formatted with FAT32, the max. size of a file would be determined by the latter ("total space - sum of areas not usable by the file").
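A rough sketch of the "whichever is smaller" rule follows. It makes a strong simplifying assumption: the only space not usable by the file is a 512 KiB allocation table (2^17 entries of 4 bytes for 8 KB clusters). The hidden/reserved area, directories and faulty clusters are ignored, so a real volume would give a smaller number.

```python
def max_file_size_bytes(volume_bytes, overhead_bytes, size_field_bits=32):
    """Smaller of the file-size-field limit and the usable space on the volume."""
    size_field_limit = 2**size_field_bits - 1  # largest value the size field can hold
    usable = volume_bytes - overhead_bytes
    return min(size_field_limit, usable)

volume = 2**30          # 1 GiB volume
fat_table = 2**17 * 4   # assumed overhead: the allocation table only (512 KiB)

print(max_file_size_bytes(volume, fat_table))  # 1073217536
```

Here the usable space is far below the limit of the 32-bit size field, so the volume size is the deciding factor, as the answer states.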
Note that if you have a 1 GiB disk, this might (e.g.) be split into 4 partitions and a FAT file system might be given a partition with a fraction of 1 GiB of space. Even if there is only one partition for the "whole" disk, typically (assuming "MBR partitions" and not the newer "GPT partitions" which takes more space for partition tables, etc) the partition begins on the second track (the first track is "reserved" for MBR, partition table and maybe "boot manager") or a later track (e.g. to align the start of the partition to a "4 KiB physical sector size" and avoid performance problems caused by "512 logical sector size").
In other words, the size of the disk has very little to do with the size of the volume used for FAT; and when questions only tell you the size of the disk and don't tell you the size of the partition/volume you can't provide accurate answers.
What you could do is state your assumptions clearly in your answer, for example:
"I assume that a "1 GB" disk is 1000000 KiB (1024000000 bytes, and not 1 GiB or 1073741824 bytes, and not 1 GB or 1000000000 bytes); and I assume that 1 MiB (1024 KiB) of disk space is consumed by the partition table and MBR and all remaining space is used for a single FAT partition; and therefore the FAT volume itself is 998976 KiB."

Why does a monotonically increasing shard key cause inserts to be routed to the chunk with maxKey as upper bound?

I am reading the MongoDB documentation about shard keys, and it says the following:
If the shard key value is always increasing, all new inserts are routed to the chunk with maxKey as the upper bound. If the shard key value is always decreasing, all new inserts are routed to the chunk with minKey as the lower bound.
https://docs.mongodb.com/manual/core/sharding-shard-key/#sharding-shard-key-creation
I don't understand why. Let's say that the shard key ranges from 0 to 75, and is partitioned into 3 chunks. First chunk from 0 to 24, second from 25 to 50, and third from 51 to 75. Here the minKey is 0, and maxKey is 75.
If consecutive insert operations arrive in monotonically increasing order, for example 1, 2, 3, 4, 5, why would these inserts, which address shard key values 1, 2, 3, 4, 5, be routed to the last shard, the one that goes from 51 to 75 (the shard that includes the chunk with maxKey as the upper bound)?
Thank you
Let's say that the shard key ranges from 0 to 75 as you said with 3 chunks. First chunk from 0 to 24, second from 25 to 50, and third from 51 to 75.
If you insert 1, 2, 3, 4, 5 they'll all go to the first chunk, not the last one, since they are in the range 0 to 24.
What the documentation talks about is when you have keys that are always increasing, e.g. from 0 to Infinity.
In this case, from 0 to Infinity there may be 3 chunks: the first from 0 to 24, the second from 25 to 50, and the third from 51 to 75. But since the value of the key keeps increasing toward Infinity, all values > 50 will go to the third chunk and values <= 50 will go to the first 2 chunks.
In sharded clusters there is a balancer that tries to balance your chunks so that they hold the same amount of data. With keys that are always increasing it can be difficult for the balancer to decide how to split the chunks. So at first all inserts with values > 50 are routed to the chunk with maxKey as the upper bound, i.e. the third chunk.
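The routing in this example can be sketched in a few lines. This is only a toy model of range-based chunk lookup, not MongoDB's implementation; the split points 25 and 51 are taken from the three chunks in the question, with the first chunk treated as starting at minKey and the last as ending at maxKey.

```python
from bisect import bisect_right

# Boundaries between chunks: chunk 0 covers keys below 25, chunk 1 covers 25..50,
# chunk 2 covers 51 and everything above it (its upper bound is maxKey).
SPLIT_POINTS = [25, 51]

def chunk_for(key, split_points=SPLIT_POINTS):
    """Return the index of the chunk whose range contains the key."""
    return bisect_right(split_points, key)

print([chunk_for(k) for k in (1, 2, 3, 4, 5)])  # [0, 0, 0, 0, 0]
print([chunk_for(k) for k in (50, 51)])         # [1, 2]
print([chunk_for(k) for k in (76, 77, 1000)])   # [2, 2, 2]
```

Once an ever-increasing key has passed the highest split point, every new value lands in the last chunk, which is the situation the documentation describes.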

Can not get rid of tombstones in cassandra 2.1.8 using (STCS) SizeTieredCompactionStrategy

I have a 3-node Cassandra (2.1.8) cluster on which I am running an application that uses Titan DB (v0.5.4). The amount of data is very small (<20 MB), but as my use case requires deletes from time to time I already have problems with tombstones.
I cannot get rid of the tombstones that have already been created.
The solutions I tried are:
lowering gc_grace for the specified graphindex table to 60s
run nodetool flush
run nodetool repair
for titan.graphindex table set compaction options as {'class': 'SizeTieredCompactionStrategy', 'unchecked_tombstone_compaction': 'true', 'tombstone_compaction_interval': '0', 'tombstone_threshold': '0.1'};
running forceUserDefinedCompaction from jmx.
As a result the statistics dropped a bit, but Average tombstones per slice and Maximum tombstones per slice are still not satisfactory:
Table: graphindex
**SSTable count: 1**
Space used (live): 661873
Space used (total): 661873
Space used by snapshots (total): 0
Off heap memory used (total): 6544
SSTable Compression Ratio: 0.6139286819777781
Number of keys (estimate): 4082
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 15
Local read count: 25983
Local read latency: 0.931 ms
Local write count: 23610
Local write latency: 0.057 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 5208
Bloom filter off heap memory used: 5200
Index summary off heap memory used: 1248
Compression metadata off heap memory used: 96
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 152321
Compacted partition mean bytes: 203
Average live cells per slice (last five minutes): 728.4188892737559
Maximum live cells per slice (last five minutes): 4025.0
**Average tombstones per slice (last five minutes): 317.34938228841935**
**Maximum tombstones per slice (last five minutes): 8031.0**
Is there any option to remove all tombstones? Thanks in advance for any suggestions.
The problem is solved.
It turned out that the statistics are very misleading: 'Average tombstones per slice (last five minutes)' and 'Maximum tombstones per slice (last five minutes)', and probably the live cells statistics too, are not computed over the last 5 minutes as nodetool cfstats claims. They are calculated since node startup. My nodes had been running for a few months, so even though the tombstones were cleared I could not see a big difference, because the period with already-high values was so long. After I restarted the nodes the statistics were reset and I could see that the compaction had taken effect.
It's a shame that the information about this bug in the statistic's description was so hard for me to find (https://issues.apache.org/jira/browse/CASSANDRA-7731).
I hope this helps someone get to this information sooner.
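To illustrate why an average taken since node startup barely moves after the tombstones are gone, here is a small sketch with made-up numbers. It assumes a long history of tombstone-heavy reads followed by a short run of clean reads, and compares a cumulative average with one over a recent window. It does not reproduce how Cassandra computes its metrics.

```python
def cumulative_average(samples):
    """Average over everything seen since startup."""
    return sum(samples) / len(samples)

def windowed_average(samples, window):
    """Average over only the most recent `window` samples."""
    recent = samples[-window:]
    return sum(recent) / len(recent)

# Hypothetical tombstones-per-read values: months of tombstone-heavy reads,
# then a short period of clean reads after compaction.
history = [300] * 100_000 + [0] * 1_000

print(round(cumulative_average(history), 1))  # 297.0
print(windowed_average(history, 1_000))       # 0.0
```

The cumulative figure stays close to its old value even though every recent read saw no tombstones, which matches what was observed before the restart.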