Why does a monotonically increasing shard key cause inserts to be routed to the chunk with maxKey as the upper bound? - mongodb

I was reading the following in the MongoDB documentation about shard keys:
If the shard key value is always increasing, all new inserts are routed to the chunk with maxKey as the upper bound. If the shard key value is always decreasing, all new inserts are routed to the chunk with minKey as the lower bound.
https://docs.mongodb.com/manual/core/sharding-shard-key/#sharding-shard-key-creation
I don't understand why. Let's say that the shard key ranges from 0 to 75, and is partitioned into 3 chunks. First chunk from 0 to 24, second from 25 to 50, and third from 51 to 75. Here the minKey is 0, and maxKey is 75.
If consecutive insert operations occur in a monotonically increasing way, say 1, 2, 3, 4, 5, why would these inserts, which address shard key values of 1, 2, 3, 4, 5 (monotonically increasing), be routed to the last chunk, the one from 51 to 75 (the chunk with maxKey as the upper bound)?
Thank you

Let's say that the shard key ranges from 0 to 75 as you said with 3 chunks. First chunk from 0 to 24, second from 25 to 50, and third from 51 to 75.
If you insert 1, 2, 3, 4, 5 they'll all go to the first chunk, not the last one, since they are in the range of 0 to 24.
What the documentation talks about is when you have keys that are always increasing e.g. from 0 to Infinity.
In this case, from 0 to Infinity there may be 3 chunks: first chunk from 0 to 24, second from 25 to 50, and third from 51 to 75. But since the value of the key is always increasing toward Infinity, all values > 50 will go to the third chunk, and values <= 50 will go to the first 2 chunks.
In sharded clusters there is a balancer that tries to balance your chunks so they hold the same amount of data. With keys that are always increasing, it can be difficult for the balancer to decide how to split the chunks, so at first all inserts with values > 50 are routed to the chunk with maxKey as the upper bound, i.e. the third chunk.
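The routing described above can be sketched as a simple range lookup (this is an illustration, not MongoDB's actual router code; the chunk boundaries are the ones from the example):

```python
# Sketch of range-based chunk routing: each chunk covers [lower_bound, next_bound).
# The last chunk's upper bound is effectively maxKey (here: infinity), so any
# ever-increasing shard key keeps landing in that last chunk.
import bisect

lower_bounds = [0, 25, 51]  # chunks: [0,25), [25,51), [51, maxKey)

def route(shard_key):
    """Return the index of the chunk whose range contains shard_key."""
    return bisect.bisect_right(lower_bounds, shard_key) - 1

# Keys 1..5 stay in chunk 0, but monotonically increasing keys beyond 50
# all hit the final chunk, the one bounded above by maxKey.
print([route(k) for k in [1, 2, 3, 4, 5]])   # [0, 0, 0, 0, 0]
print([route(k) for k in [76, 77, 1000]])    # [2, 2, 2]
```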


number of misses in spatial locality

I am confused about how the number of misses is calculated in the example below (from Computer Architecture: A Quantitative Approach).
Example:
For the code below, determine which accesses are likely to cause data cache
misses. Next, insert prefetch instructions to reduce misses. Finally, calculate the
number of prefetch instructions executed and the misses avoided by prefetching.
Let’s assume we have an 8 KB direct-mapped data cache with 16-byte blocks,
and it is a write-back cache that does write allocate. The elements of a and b are 8
bytes long since they are double-precision floating-point arrays. There are 3 rows
and 100 columns for a and 101 rows and 3 columns for b. Let’s also assume they
are not in the cache at the start of the program.
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];
Answer:
The compiler will first determine which accesses are likely to cause cache
misses; otherwise, we will waste time on issuing prefetch instructions for data
that would be hits. Elements of a are written in the order that they are stored in
memory, so a will benefit from spatial locality: The even values of j will miss
and the odd values will hit. Since a has 3 rows and 100 columns, its accesses will
lead to 3 × (100/2), or 150 misses.
How are the 150 misses calculated, and why is 100 divided by 2?
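One way to sanity-check the book's 3 × (100/2) = 150 figure is to simulate the cache behaviour for the writes to a alone (a sketch; it ignores the loads of b, as the quoted count does). Each 16-byte block holds two 8-byte doubles, so every even j brings in a new block (miss) and the following odd j hits it:

```python
# Count misses for the row-major writes a[i][j], i in 0..2, j in 0..99,
# with 16-byte blocks and 8-byte double elements.
BLOCK_BYTES = 16
ELEM_BYTES = 8

seen_blocks = set()
misses = 0
for i in range(3):
    for j in range(100):
        addr = (i * 100 + j) * ELEM_BYTES   # row-major address of a[i][j]
        block = addr // BLOCK_BYTES
        if block not in seen_blocks:        # even j: first touch of a block -> miss
            misses += 1
            seen_blocks.add(block)          # odd j: same block as j-1 -> hit
print(misses)   # 150
```

300 elements at 2 per block give 150 distinct blocks, each missed exactly once; hence the 100/2 per row.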

What is the maximum size of symbol data type in KDB+?

I cannot find the maximum size of the symbol data type in KDB+.
Does anyone know what it is?
If you are talking about the physical length of a symbol: symbols exist as interned strings in kdb, so the maximum string length limit would apply. As strings are just a list of characters in kdb, the maximum size of a string would be the maximum length of a list. In 3.x this would be 2^64 - 1; in previous versions of kdb this limit was 2,000,000,000.
However, there is a 2TB maximum serialized size limit that would likely kick in first. You can roughly work out the size of a sym by serializing it:
q)count -8!`
10
q)count -8!`a
11
q)count -8!`abc
13
So each character adds a single byte, which would give roughly a 10^12 character length limit.
If you mean the maximum number of symbols that can exist in memory, then the limit is 1.4B.

Handling of Cassandra blocking writes when exceeding memtable_cleanup_threshold

I was reading through the Cassandra flushing strategies and came across the following statement:
If the data to be flushed exceeds the memtable_cleanup_threshold, Cassandra blocks writes until the next flush succeeds.
Now my query is: let's say we have insane writes to Cassandra, about 10K records per second, and the application is running 24*7. What settings should we make for the following parameters to avoid blocking?
memtable_heap_space_in_mb
memtable_offheap_space_in_mb
memtable_cleanup_threshold
Also, since it is time-series data, do I need to make any changes to the compaction strategy as well? If yes, what would be best for my case?
My Spark application, which takes data from Kafka and continuously inserts into Cassandra, hangs after a particular time, and I have analysed that at that moment there are a lot of pending tasks in nodetool compactionstats.
nodetool tablehistograms
Percentile  SSTables  Write Latency (ms)  Read Latency (ms)  Partition Size (bytes)  Cell Count
50%         642.00    88.15               25109.16           310                     24
75%         770.00    263.21              668489.53          535                     50
95%         770.00    4055.27             668489.53          3311                    310
98%         770.00    8409.01             668489.53          73457                   6866
99%         770.00    12108.97            668489.53          219342                  20501
Min         4.00      11.87               20924.30           150                     9
Max         770.00    1996099.05          668489.53          4866323                 454826
Keyspace : trackfleet_db
Read Count: 7183347
Read Latency: 15.153115504235004 ms
Write Count: 2402229293
Write Latency: 0.7495135263492935 ms
Pending Flushes: 1
Table: locationinfo
SSTable count: 3307
Space used (live): 62736956804
Space used (total): 62736956804
Space used by snapshots (total): 10469827269
Off heap memory used (total): 56708763
SSTable Compression Ratio: 0.38214618375483633
Number of partitions (estimate): 493571
Memtable cell count: 2089
Memtable data size: 1168808
Memtable off heap memory used: 0
Memtable switch count: 88033
Local read count: 765497
Local read latency: 162.880 ms
Local write count: 782044138
Local write latency: 1.859 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 368
Bloom filter false ratio: 0.00000
Bloom filter space used: 29158176
Bloom filter off heap memory used: 29104216
Index summary off heap memory used: 7883835
Compression metadata off heap memory used: 19720712
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 4866323
Compacted partition mean bytes: 7626
Average live cells per slice (last five minutes): 3.5
Maximum live cells per slice (last five minutes): 6
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 359
After changing the compaction strategy:
Keyspace : trackfleet_db
Read Count: 8568544
Read Latency: 15.943608060365916 ms
Write Count: 2568676920
Write Latency: 0.8019530641630868 ms
Pending Flushes: 1
Table: locationinfo
SSTable count: 5843
SSTables in each level: [5842/4, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 71317936302
Space used (total): 71317936302
Space used by snapshots (total): 10469827269
Off heap memory used (total): 105205165
SSTable Compression Ratio: 0.3889946058934169
Number of partitions (estimate): 542002
Memtable cell count: 235
Memtable data size: 131501
Memtable off heap memory used: 0
Memtable switch count: 93947
Local read count: 768148
Local read latency: NaN ms
Local write count: 839003671
Local write latency: 1.127 ms
Pending flushes: 1
Percent repaired: 0.0
Bloom filter false positives: 1345
Bloom filter false ratio: 0.00000
Bloom filter space used: 54904960
Bloom filter off heap memory used: 55402400
Index summary off heap memory used: 14884149
Compression metadata off heap memory used: 34918616
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 4866323
Compacted partition mean bytes: 4478
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 660
Thanks,
I would not touch the memtable settings unless it's a problem. They will only really block if you're writing at a rate that exceeds your disks' ability to write, or if GCs are messing up timings. "10K records per second and application is running 24*7" isn't actually that much, given the records are not very large and will not overrun writes (a decent system can sustain a constant load of 100k-200k/s). nodetool tablestats, tablehistograms, and the schema can help identify whether your records are too big or your partitions too wide, and give a better indication of what your compaction strategy should be (probably TWCS, but maybe LCS if you have any reads at all and partitions span a day or so).
Pending tasks in nodetool compactionstats have nothing to do with memtable settings either; it's more that your compactions are not keeping up. This can be just spikes as bulk jobs run, small partitions flush, or repairs stream sstables over, but if the number grows instead of going down you need to tune your compaction strategy. Really, a lot depends on the data model and stats (tablestats/tablehistograms).
You may refer to this link to tune the above parameters: http://abiasforaction.net/apache-cassandra-memtable-flush/
memtable_cleanup_threshold – a percentage of your total available memtable space that will trigger a memtable cleanup. memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1). By default this is essentially 33% of your memtable_heap_space_in_mb. A scheduled cleanup results in flushing of the table/column family that occupies the largest portion of memtable space. This keeps happening until your available memtable memory drops below the cleanup threshold.
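The default formula quoted above can be sketched numerically (the setting values here are assumptions for illustration, not recommendations):

```python
# memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1),
# a fraction of the total memtable space that triggers a flush.
memtable_heap_space_in_mb = 2048      # assumed setting
memtable_offheap_space_in_mb = 0      # assumed setting
memtable_flush_writers = 2            # assumed; gives the ~33% default

cleanup_threshold = 1 / (memtable_flush_writers + 1)   # ~0.33
flush_trigger_mb = (memtable_heap_space_in_mb
                    + memtable_offheap_space_in_mb) * cleanup_threshold
print(cleanup_threshold, flush_trigger_mb)   # ~0.33, ~683 MB
```

Raising the space settings or lowering the threshold moves the flush trigger, but whether writes block still depends on whether flushes and compactions keep up with the incoming rate.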

Can not get rid of tombstones in cassandra 2.1.8 using (STCS) SizeTieredCompactionStrategy

I have a 3-node Cassandra (2.1.8) cluster on which I am running an application using Titan DB (v0.5.4). The amount of data is very small (<20 MB), but as my use case requires deletes from time to time, I already have problems with tombstones.
I cannot get rid of already-created tombstones.
The solutions I tried are:
lowering gc_grace for the specified graphindex table to 60s
running nodetool flush
running nodetool repair
setting compaction options for the titan.graphindex table to {'class': 'SizeTieredCompactionStrategy', 'unchecked_tombstone_compaction': 'true', 'tombstone_compaction_interval': '0', 'tombstone_threshold': '0.1'}
running forceUserDefinedCompaction via JMX
As a result the statistics lowered a bit, but Average tombstones per slice and Maximum tombstones per slice are still not satisfactory:
Table: graphindex
**SSTable count: 1**
Space used (live): 661873
Space used (total): 661873
Space used by snapshots (total): 0
Off heap memory used (total): 6544
SSTable Compression Ratio: 0.6139286819777781
Number of keys (estimate): 4082
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 15
Local read count: 25983
Local read latency: 0.931 ms
Local write count: 23610
Local write latency: 0.057 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 5208
Bloom filter off heap memory used: 5200
Index summary off heap memory used: 1248
Compression metadata off heap memory used: 96
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 152321
Compacted partition mean bytes: 203
Average live cells per slice (last five minutes): 728.4188892737559
Maximum live cells per slice (last five minutes): 4025.0
**Average tombstones per slice (last five minutes): 317.34938228841935**
**Maximum tombstones per slice (last five minutes): 8031.0**
Is there any option to remove all tombstones? Thanks in advance for any suggestion.
The problem is solved.
It turned out that the statistics are very misleading: 'Average tombstones per slice (last five minutes)' and 'Maximum tombstones per slice (last five minutes)' (and probably the live-cell statistics) are not actually computed over the last five minutes, as nodetool cfstats claims, but accumulated since node startup. My nodes had been running for a few months, so even though the tombstones were cleared I could not see a big difference, since the already-high values accumulated over many days dominated the scale. After I restarted the nodes, the statistics cleared up and I could see that the compaction had taken effect.
It's a shame that the information about this bug in the statistics' description was so hard for me to find (https://issues.apache.org/jira/browse/CASSANDRA-7731).
Hope this helps someone get to this information sooner.

Understanding "Number of keys" in nodetool cfstats

I am new to Cassandra. In this example I am using a cluster with 1 DC and 5 nodes, and NetworkTopologyStrategy with a replication factor of 3.
Keyspace: activityfeed
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Table: feed_shubham
SSTable count: 1
Space used (live), bytes: 52620684
Space used (total), bytes: 52620684
SSTable Compression Ratio: 0.3727660543119897
Number of keys (estimate): 137984
Memtable cell count: 0
Memtable data size, bytes: 0
Memtable switch count: 0
Local read count: 0
Local read latency: 0.000 ms
Local write count: 0
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used, bytes: 174416
Compacted partition minimum bytes: 771
Compacted partition maximum bytes: 924
Compacted partition mean bytes: 924
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
What does "Number of keys" mean here?
I have 5 different nodes in my cluster, and after firing the below command on each node separately I get different statistics for the same table.
nodetool cfstats -h 192.168.1.12 activityfeed.feed_shubham
As per the output above, I can interpret that cfstats gives me stats regarding the physical storage of data on each node.
I went through the doc below
http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCFstats.html
but I did not find an explanation of "number of keys" there.
I am using a RandomPartitioner.
Does this key have anything to do with the partition key?
I have around 200000 records in my table.
The number of keys represents the number of partition keys on that node for the table. It's just an estimate though, and its accuracy depends on your version of C*: before 2.1.6 it summed the number of partitions listed in the index file of each sstable; afterwards it merges a sketch of the data (HyperLogLog) that's stored per sstable.
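A toy illustration (with made-up keys) of why the pre-2.1.6 summing approach overestimates: the same partition key can appear in several sstables before compaction merges them, so per-sstable counts double-count it.

```python
# Two hypothetical sstables holding overlapping partition keys.
sstable_a = {"user:1", "user:2", "user:3"}
sstable_b = {"user:2", "user:3", "user:4"}

summed_estimate = len(sstable_a) + len(sstable_b)   # old approach: 6
distinct_keys = len(sstable_a | sstable_b)          # actual distinct keys: 4
print(summed_estimate, distinct_keys)
```

Merging per-sstable cardinality sketches (as the HyperLogLog-based approach does) deduplicates the shared keys without reading every sstable.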
This value seems to indicate the total number of columns/cells in all local sstables. I guess it should rather be named "SSTable cell count", just like the corresponding memtable value. However, as sstables store redundant data before compaction, this value will not necessarily correspond to the actual number of columns returned as part of a result set.