Slow queries only on slaves (PostgreSQL)

I am using the largest instance size of an Aurora PostgreSQL cluster (engine version 13.3).
When only the writer cluster endpoint was used, no slow queries occurred at all.
However, once read queries were changed to use the reader instance, slow queries started to appear.
(Some queries take more than 10 seconds; a query that finishes in milliseconds on the writer takes more than 1 second on the reader.)
When the exact same query is executed on the master, it is fast, and the query plans on the reader and the writer are identical.
The cache hit metrics in AWS monitoring did not drop.
What should be checked to determine the cause and fix this?
Highest Time Node Info
First Query on Master
Node Type: Index Scan
Parent Relationship: Outer
Parallel Aware: true
Scan Direction: Forward
Index Name: trace_idx09
Schema: public
Alias: trace_
Startup Cost: 0.69
Total Cost: 15609.54
Plan Rows: 1610
Plan Width: 24
Actual Startup Time: 5.936
Actual Total Time: 50.491
Actual Rows: 2902
Actual Loops: 3
Rows Removed by Index Recheck: 0
Shared Hit Blocks: 3917
Shared Read Blocks: 396
Shared Dirtied Blocks: 0
Shared Written Blocks: 0
Local Hit Blocks: 0
Local Read Blocks: 0
Local Dirtied Blocks: 0
Local Written Blocks: 0
Temp Read Blocks: 0
Temp Written Blocks: 0
I/O Read Time: 870.529
I/O Write Time: 0
Workers: [object Object],[object Object]
*Planner Row Estimate Factor: 1.8024844720496895
*Planner Row Estimate Direction: 1
*Actual Duration: 151.473
*Actual Cost: 15609.54
*Slowest Node (by duration): false
*Largest Node (by rows): true
*Costliest Node (by cost): true
First Query on Slave
Node Type: Index Scan
Parent Relationship: Outer
Parallel Aware: true
Scan Direction: Forward
Index Name: trace_idx09
Schema: public
Alias: trace_
Startup Cost: 0.69
Total Cost: 15609.54
Plan Rows: 1610
Plan Width: 24
Actual Startup Time: 3046.454
Actual Total Time: 3387.273
Actual Rows: 2902
Actual Loops: 3
Rows Removed by Index Recheck: 0
Shared Hit Blocks: 1096
Shared Read Blocks: 4191
Shared Dirtied Blocks: 0
Shared Written Blocks: 0
Local Hit Blocks: 0
Local Read Blocks: 0
Local Dirtied Blocks: 0
Local Written Blocks: 0
Temp Read Blocks: 0
Temp Written Blocks: 0
I/O Read Time: 8763.617
I/O Write Time: 0
Workers: [object Object],[object Object]
*Planner Row Estimate Factor: 1.8024844720496895
*Planner Row Estimate Direction: 1
*Actual Duration: 10161.819
*Actual Cost: 15609.54
*Slowest Node (by duration): true
*Largest Node (by rows): true
*Costliest Node (by cost): true
Second Query on Slave
Node Type: Index Scan
Parent Relationship: Outer
Parallel Aware: true
Scan Direction: Forward
Index Name: trace_idx09
Schema: public
Alias: trace_
Startup Cost: 0.69
Total Cost: 15609.54
Plan Rows: 1610
Plan Width: 24
Actual Startup Time: 3.734
Actual Total Time: 8.375
Actual Rows: 2902
Actual Loops: 3
Rows Removed by Index Recheck: 0
Shared Hit Blocks: 4039
Shared Read Blocks: 2
Shared Dirtied Blocks: 0
Shared Written Blocks: 0
Local Hit Blocks: 0
Local Read Blocks: 0
Local Dirtied Blocks: 0
Local Written Blocks: 0
Temp Read Blocks: 0
Temp Written Blocks: 0
I/O Read Time: 3.538
I/O Write Time: 0
Workers: [object Object],[object Object]
*Planner Row Estimate Factor: 1.8024844720496895
*Planner Row Estimate Direction: 1
*Actual Duration: 25.125
*Actual Cost: 15609.54
*Slowest Node (by duration): false
*Largest Node (by rows): true
*Costliest Node (by cost): true
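For reference, the node details above look like EXPLAIN (ANALYZE, BUFFERS) output rendered by a plan visualizer. A minimal sketch of how to capture the same buffer and I/O figures on both endpoints, so the writer and reader runs can be compared node by node (the table and columns below are placeholders, not the real query, and the I/O Read Time fields only appear when track_io_timing is enabled):

-- Run the identical statement against the writer endpoint and the reader endpoint,
-- then compare "Shared Read Blocks" and "I/O Read Time" for each plan node.
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT id, created_at                                   -- placeholder columns
FROM   trace_events                                     -- placeholder table
WHERE  created_at >= now() - interval '1 hour'          -- placeholder predicate
ORDER  BY created_at;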

Related

Handling of Cassandra blocking writes when the memtable_cleanup_threshold is exceeded

I was reading through the Cassandra flushing strategies and came across the following statement:
If the data to be flushed exceeds the memtable_cleanup_threshold, Cassandra blocks writes until the next flush succeeds.
Now my question is: let's say we have very heavy writes to Cassandra, about 10K records per second, and the application runs 24*7. What settings should we use for the following parameters to avoid blocking?
memtable_heap_space_in_mb
memtable_offheap_space_in_mb
memtable_cleanup_threshold
Also, since it is time-series data, do I need to make any changes to the compaction strategy as well? If yes, what would be best for my case?
My Spark application, which takes data from Kafka and continuously inserts into Cassandra, hangs after a particular time, and I have analysed that at that moment there are a lot of pending tasks in nodetool compactionstats.
nodetool tablehistograms
Percentile  SSTables    Write Latency (ms)  Read Latency (ms)  Partition Size (bytes)  Cell Count
50%         642.00      88.15               25109.16           310                     24
75%         770.00      263.21              668489.53          535                     50
95%         770.00      4055.27             668489.53          3311                    310
98%         770.00      8409.01             668489.53          73457                   6866
99%         770.00      12108.97            668489.53          219342                  20501
Min         4.00        11.87               20924.30           150                     9
Max         770.00      1996099.05          668489.53          4866323                 454826
Keyspace : trackfleet_db
Read Count: 7183347
Read Latency: 15.153115504235004 ms
Write Count: 2402229293
Write Latency: 0.7495135263492935 ms
Pending Flushes: 1
Table: locationinfo
SSTable count: 3307
Space used (live): 62736956804
Space used (total): 62736956804
Space used by snapshots (total): 10469827269
Off heap memory used (total): 56708763
SSTable Compression Ratio: 0.38214618375483633
Number of partitions (estimate): 493571
Memtable cell count: 2089
Memtable data size: 1168808
Memtable off heap memory used: 0
Memtable switch count: 88033
Local read count: 765497
Local read latency: 162.880 ms
Local write count: 782044138
Local write latency: 1.859 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 368
Bloom filter false ratio: 0.00000
Bloom filter space used: 29158176
Bloom filter off heap memory used: 29104216
Index summary off heap memory used: 7883835
Compression metadata off heap memory used: 19720712
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 4866323
Compacted partition mean bytes: 7626
Average live cells per slice (last five minutes): 3.5
Maximum live cells per slice (last five minutes): 6
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 359
After changing the compaction strategy:
Keyspace : trackfleet_db
Read Count: 8568544
Read Latency: 15.943608060365916 ms
Write Count: 2568676920
Write Latency: 0.8019530641630868 ms
Pending Flushes: 1
Table: locationinfo
SSTable count: 5843
SSTables in each level: [5842/4, 0, 0, 0, 0, 0, 0, 0, 0]
Space used (live): 71317936302
Space used (total): 71317936302
Space used by snapshots (total): 10469827269
Off heap memory used (total): 105205165
SSTable Compression Ratio: 0.3889946058934169
Number of partitions (estimate): 542002
Memtable cell count: 235
Memtable data size: 131501
Memtable off heap memory used: 0
Memtable switch count: 93947
Local read count: 768148
Local read latency: NaN ms
Local write count: 839003671
Local write latency: 1.127 ms
Pending flushes: 1
Percent repaired: 0.0
Bloom filter false positives: 1345
Bloom filter false ratio: 0.00000
Bloom filter space used: 54904960
Bloom filter off heap memory used: 55402400
Index summary off heap memory used: 14884149
Compression metadata off heap memory used: 34918616
Compacted partition minimum bytes: 150
Compacted partition maximum bytes: 4866323
Compacted partition mean bytes: 4478
Average live cells per slice (last five minutes): NaN
Maximum live cells per slice (last five minutes): 0
Average tombstones per slice (last five minutes): NaN
Maximum tombstones per slice (last five minutes): 0
Dropped Mutations: 660
Thanks,
I would not touch the memtable settings unless they are actually a problem. They will only really block if you are writing at a rate that exceeds your disks' ability to keep up, or if GCs are messing up timings. "10K records per second and application is running 24*7" isn't actually that much, given the records are not very large, and it will not overrun writes (a decent system can do 100k-200k/s under constant load). nodetool tablestats, tablehistograms, and the schema can help identify whether your records are too big or your partitions too wide, and give a better indication of what your compaction strategy should be (probably TWCS, but maybe LCS if you have any reads at all and partitions span a day or so).
Pending tasks in nodetool compactionstats has nothing to do with the memtable settings either; it means your compactions are not keeping up. That can be just spikes as bulk jobs run, small partitions flush, or repairs stream sstables over, but if the number grows instead of going down you need to tune your compaction strategy. Really, a lot depends on the data model and the stats (tablestats/tablehistograms).
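If TWCS does turn out to fit the time-series pattern described in the question, switching the table's compaction strategy is a single schema change. A hedged sketch against the table from the stats above (the window unit and size are illustrative assumptions, not values taken from the question or the answer):

ALTER TABLE trackfleet_db.locationinfo
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',   -- assumed window; choose one that matches how the data is written and expired
    'compaction_window_size': '1'
};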
You may refer to this link to tune the above parameters: http://abiasforaction.net/apache-cassandra-memtable-flush/
memtable_cleanup_threshold – A percentage of your total available memtable space that will trigger a memtable cleanup.
memtable_cleanup_threshold defaults to 1 / (memtable_flush_writers + 1). By default this is essentially 33% of your memtable_heap_space_in_mb. A scheduled cleanup results in flushing of the table/column family that occupies the largest portion of memtable space. This keeps happening until your available memtable memory drops below the cleanup threshold.

Cannot get rid of tombstones in Cassandra 2.1.8 using SizeTieredCompactionStrategy (STCS)

I have a 3-node Cassandra (2.1.8) cluster on which I am running an application that uses Titan DB (v0.5.4). The amount of data is very small (<20 MB), but as my use case requires deletes from time to time, I already have problems with tombstones.
I cannot get rid of the tombstones that have already been created.
The solutions I tried are:
lowering gc_grace_seconds for the specified graphindex table to 60 s (see the CQL sketch after this list)
run nodetool flush
run nodetool repair
for the titan.graphindex table, setting compaction options to {'class': 'SizeTieredCompactionStrategy', 'unchecked_tombstone_compaction': 'true', 'tombstone_compaction_interval': '0', 'tombstone_threshold': '0.1'};
running forceUserDefinedCompaction from JMX.
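Steps 1 and 4 from the list above, expressed as CQL (the values simply mirror the list; this is a sketch of what was attempted, not a recommendation):

ALTER TABLE titan.graphindex WITH gc_grace_seconds = 60;

ALTER TABLE titan.graphindex WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'unchecked_tombstone_compaction': 'true',
    'tombstone_compaction_interval': '0',
    'tombstone_threshold': '0.1'
};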
As a result, the statistics improved a bit, but Average tombstones per slice and Maximum tombstones per slice are still not satisfactory:
Table: graphindex
**SSTable count: 1**
Space used (live): 661873
Space used (total): 661873
Space used by snapshots (total): 0
Off heap memory used (total): 6544
SSTable Compression Ratio: 0.6139286819777781
Number of keys (estimate): 4082
Memtable cell count: 0
Memtable data size: 0
Memtable off heap memory used: 0
Memtable switch count: 15
Local read count: 25983
Local read latency: 0.931 ms
Local write count: 23610
Local write latency: 0.057 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 5208
Bloom filter off heap memory used: 5200
Index summary off heap memory used: 1248
Compression metadata off heap memory used: 96
Compacted partition minimum bytes: 43
Compacted partition maximum bytes: 152321
Compacted partition mean bytes: 203
Average live cells per slice (last five minutes): 728.4188892737559
Maximum live cells per slice (last five minutes): 4025.0
**Average tombstones per slice (last five minutes): 317.34938228841935**
**Maximum tombstones per slice (last five minutes): 8031.0**
Is there any option to remove all tombstones? Thanks in advance for any suggestions.
The problem is solved.
It turned out that the statistics are quite misleading: 'Average tombstones per slice (last five minutes)', 'Maximum tombstones per slice (last five minutes)', and probably the live-cell statistics as well are not computed over the last 5 minutes, as nodetool cfstats claims, but since node startup. My nodes had been running for a few months, so even though the tombstones were cleared I could not see a big difference, because months of already-high values dominated the statistic. After I restarted the nodes the statistics cleared up and I could see that the compaction had taken effect.
It's a shame that information about this bug in the statistic's description was so hard for me to find (https://issues.apache.org/jira/browse/CASSANDRA-7731).
I hope this helps someone find this information sooner.

Understanding "Number of keys" in nodetool cfstats

I am new to Cassandra. In this example I am using a cluster with 1 DC and 5 nodes, with NetworkTopologyStrategy and a replication factor of 3.
Keyspace: activityfeed
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Table: feed_shubham
SSTable count: 1
Space used (live), bytes: 52620684
Space used (total), bytes: 52620684
SSTable Compression Ratio: 0.3727660543119897
Number of keys (estimate): 137984
Memtable cell count: 0
Memtable data size, bytes: 0
Memtable switch count: 0
Local read count: 0
Local read latency: 0.000 ms
Local write count: 0
Local write latency: 0.000 ms
Pending tasks: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used, bytes: 174416
Compacted partition minimum bytes: 771
Compacted partition maximum bytes: 924
Compacted partition mean bytes: 924
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
What does Number of keys mean here?
I have 5 different nodes in my cluster, and after running the command below on each node separately I get different statistics for the same table.
nodetool cfstats -h 192.168.1.12 activityfeed.feed_shubham
From the output above I can tell that cfstats reports statistics about the physical storage of data on each node.
I went through the doc below,
http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsCFstats.html
but I did not find an explanation of Number of keys there.
I am using a RandomPartitioner.
Does this key have anything to do with the partition key?
I have around 200,000 records in my table.
The number of keys represents the number of partition keys for that table on that node. It is just an estimate, though, and how accurate it is depends on your version of C*. Before 2.1.6 it summed the number of partitions listed in the index file of each sstable; from then on it merges a sketch of the data (HyperLogLog) that is stored per sstable.
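To make that concrete, here is a hypothetical schema (the table and columns are made up, not taken from the question): only distinct values of the partition key column count toward Number of keys, so 200,000 rows can still report far fewer keys.

-- Hypothetical example: device_id is the partition key, event_time a clustering column.
CREATE TABLE activityfeed.feed_example (
    device_id  text,
    event_time timestamp,
    payload    text,
    PRIMARY KEY ((device_id), event_time)
);
-- 200,000 rows spread across 1,000 devices give on the order of 1,000 distinct partition
-- keys in total; each node's cfstats reports only the partitions stored on that node.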
This value seems to indicate the total number of columns/cells in all local sstables. I guess it should rather be named "SSTable cell count", just like the corresponding memtable value. However, as sstables store redundant data before compaction, this value will not necessarily correspond to the actual number of columns returned as part of a result set.

Understanding simple PostgreSQL EXPLAIN

I can't understand the EXPLAIN output of quite a simple query:
select * from store order by id desc limit 1;
QUERY PLAN
Limit (cost=0.00..0.03 rows=1 width=31)
  -> Index Scan Backward using store_pkey on store (cost=0.00..8593.28 rows=250000 width=31)
Why does the top-level node (Limit) have a lower cost than the nested one (Index Scan)? As I read in the documentation, it should be a cumulative cost, i.e. 8593.28 + 0.03.
The docs (emphasis added) say:
Actually two numbers are shown: the start-up cost before the first row can be returned, and the total cost to return all the rows. For most queries the total cost is what matters, but in contexts such as a subquery in EXISTS, the planner will choose the smallest start-up cost instead of the smallest total cost (since the executor will stop after getting one row, anyway).
In other words, 8593.28 would be the cost to return all the rows, but due to the LIMIT you're only returning one, so the actual cost is much lower (more or less equal to the start-up cost). Roughly, the Limit node's total cost is the scan's start-up cost plus 1/250000 of its remaining cost: 0.00 + 8593.28 / 250000 ≈ 0.03.
The numbers you see in the top node (0.00..0.03) are (per documentation)
0.00 .. the estimated start-up cost of the node
0.03 .. the estimated total cost of the node
If you want actual total times, run EXPLAIN ANALYZE, which appends actual times for every node. Like:
Limit (cost=0.29..0.31 rows=1 width=30) (actual time=xxx.xxx..xxx.xxx rows=1 loops=1)
Bold emphasis mine.
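A minimal way to see both sets of numbers on the question's own table (assuming the same store table with id as its primary key):

-- estimated costs only
EXPLAIN
SELECT * FROM store ORDER BY id DESC LIMIT 1;

-- estimated costs plus actual times, row counts, and loop counts for every node
EXPLAIN ANALYZE
SELECT * FROM store ORDER BY id DESC LIMIT 1;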

Interrupt time in DMA operation

I'm having difficulty with the following question:
Consider a disk drive with the following specifications.
16 surfaces, 512 tracks/surface, 512 sectors/track, 1 KB/sector, rotation speed 3000 rpm. The disk is operated in cycle stealing mode whereby whenever 1 byte word is ready it is sent to memory; similarly for writing, the disk interface reads a 4 byte word from the memory in each DMA cycle. Memory Cycle time is 40 ns. The maximum percentage of time that the CPU gets blocked during DMA operation is?
The solution to this question, provided on the only site I could find, is:
Revolutions Per Min = 3000 RPM
or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50
= 6553600 ............. (1)
Interrupt = 6553600 takes 0.2621 sec
Percentage Gain = (0.2621/1)*100
= 26 %
I have understood it up to (1).
Can anybody explain to me where 0.2621 comes from? How is the interrupt time calculated? Please help.
Reversing from the numbers you've given: it's 6553600 * 40 ns, which gives 0.2621 sec.
One quite obvious problem is that the comments in the calculations are somewhat wrong. It's not
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50 <- WRONG
The numbers are 512K / 4 * 50, so it's in bytes. How could that be called a 'number of tracks'? Reading the full track takes 1 full rotation, so the number of tracks readable in 1 second is 50, as there are 50 RPS.
However, the total bytes readable in 1s is then just 512K * 50 since 512K is the amount of data on the track.
But then it is further divided by 4..
So, I guess, the actual comments should be:
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
Interrupts per second = (2^19/2^2) * 50 = 6553600 (*)
Interrupt triggers one memory op, so then:
total wasted: 6553600 * 40ns = 0.2621 sec.
However, I don't really like how the 'number of interrupts per second' is calculated. I currently don't see/feel/guess how or why it's just Bytes/4.
The only VAGUE explanation of that "divide it by 4" I can think of is:
At each byte written to the controller's memory, an event is triggered. However, the DMA controller can read only PACKETS of 4 bytes. So the hardware DMA controller must WAIT until there are at least 4 bytes ready to be read. Only then does the DMA kick in and halt the bus (or part of it) for the duration of the one memory cycle needed to copy the data. As the bus is frozen, the processor MAY have to wait. It doesn't NEED to; it can be doing its own ops and working from cache, but if it tries touching memory, it will need to wait until the DMA finishes.
However, I don't like a few things in this "explanation". I cannot guarantee you that it is valid. It really depends on what architecture you are analyzing and how the DMA/CPU/BUS are organized.
The only mistake is that it's not
no. of tracks read
It's actually the no. of interrupts that occurred (the number of times the DMA came up with its data; that is how many times the CPU will be blocked).
But again, I don't know why it has been multiplied by 50 (probably because of the 1 second), but I wish to solve this without multiplying by 50.
My solution:
Here, in 1 rotation the interface can read 512 KB of data. 1 rotation time = 0.02 sec. So the preparation time for one byte of data = 39.1 nsec, and for 4 B it takes 156.4 nsec. Memory cycle time = 40 ns. So the % of time the CPU gets blocked = 40/(40+156.4) = 0.2036 ≈ 20%. But in the answer booklet the options are given as A) 10 B) 25 C) 40 D) 50. Tell me if I'm doing something wrong?