PerfView: Opening GC Heap Net Mem Stacks taking forever

I've got relatively small ETL files (100 MB taken together), but when I click on GC Heap Net Mem Stacks and wait a long time (10+ minutes), it never comes back from the 'gray state'.
Any ideas if this is 'normal' or not?

OK, the answer is simply yes, it takes a VERY long time:
Started: Opening GC Heap Net Mem (Coarse Sampling) Stacks
Produced 0.396K events
[Computing the processes involved in the trace.]
Completed: Opening GC Heap Net Mem (Coarse Sampling) Stacks (Elapsed Time: 0.103 sec)
Started: Opening GC Heap Net Mem Stacks
Produced 3.141K events
[Computing the processes involved in the trace.]
Completed: Opening GC Heap Net Mem Stacks (Elapsed Time: 1033.355 sec)

Related

PostgreSQL autovacuum causing significant performance degradation

Our Postgres DB (hosted on Google Cloud SQL with 1 CPU and 3.7 GB of RAM) consists mostly of one big ~90 GB table with about ~60 million rows. The usage pattern consists almost exclusively of appends and a few indexed reads near the end of the table. From time to time a few users get deleted, deleting a small percentage of rows scattered across the table.
This all works fine, but every few months an autovacuum gets triggered on that table, which significantly impacts our service's performance for ~8 hours:
Storage usage increases by ~1GB for the duration of the autovacuum (several hours), then slowly returns to the previous value (might eventually drop below it, due to the autovacuum freeing pages)
Database CPU utilization jumps from <10% to ~20%
Disk Read/Write Ops increases from near zero to ~50/second
Database Memory increases slightly, but stays below 2GB
Transaction/sec and ingress/egress bytes are also fairly unaffected, as would be expected
This has the effect of increasing our service's 95th-percentile latency from ~100 ms to ~0.5-1 s during the autovacuum, which in turn triggers our monitoring. The service serves around ten requests per second, with each request consisting of a few simple DB reads/writes that normally have a latency of 2-3 ms each.
Monitoring screenshots illustrating the issue are not reproduced here; the DB configuration is fairly vanilla.
The log entry documenting this autovacuum process reads as follows:
automatic vacuum of table "XXX": index scans: 1
pages: 0 removed, 6482261 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 5959839 removed, 57732135 remain, 4574 are dead but not yet removable
buffer usage: 8480213 hits, 12117505 misses, 10930449 dirtied
avg read rate: 2.491 MB/s, avg write rate: 2.247 MB/s
system usage: CPU 470.10s/358.74u sec elapsed 38004.58 sec
Any suggestions on what we could tune to reduce the impact of future autovacuum runs on our service? Or are we doing something wrong?
If you increase autovacuum_vacuum_cost_delay, autovacuum will run slower and be less intrusive.
However, it is usually better to do the opposite and make autovacuum faster by setting autovacuum_vacuum_cost_limit to 2000 or so, so that it finishes sooner.
You could also try to schedule VACUUMs of the table yourself at times when it hurts least.
But frankly, if a single innocuous autovacuum is enough to disturb your operation, you need more I/O bandwidth.
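If you prefer not to change these settings globally, both can also be set per table as storage parameters; a minimal sketch, with big_table standing in for your large table:
ALTER TABLE big_table SET (autovacuum_vacuum_cost_limit = 2000);
-- autovacuum_vacuum_cost_delay can be raised the same way if you want to throttle it instead
A manually scheduled VACUUM big_table; run at a quiet time is the simplest way to implement the scheduling suggestion above.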

Why would PostgreSQL connection establishment be CPU-bound?

I have a C# backend running on AWS Lambda. It connects to a PostgreSQL DB in the same region.
I observed extremely slow cold-start execution times after adding the DB connection code. After allocating more memory to my Lambda function, the execution time dropped significantly.
The execution time as measured by Lambda console:
128 MB (~15.5 sec)
256 MB (~9 sec)
384 MB (~6 sec)
512 MB (~4.5 sec)
640 MB (~3.5 sec)
768 MB (~3 sec)
In contrast, after commenting out the DB connection code:
128 MB (~5 sec)
256 MB (~2.5 sec)
So opening a DB connection has contributed a lot to the execution time.
According to AWS documentation:
In the AWS Lambda resource model, you choose the amount of memory you want for your function, and are allocated proportional CPU power and other resources.
Since the peak memory usage has consistently stayed at ~45 MB, this phenomenon seems to suggest that database connection establishment is a computationally intensive activity. If so, why?
Code in question (I'm using Npgsql):
var conn = new NpgsqlConnection(Db.connStr);
conn.Open();
using (conn)
{
    // print something
}
Update 1
I set up a MariaDB instance with the same configuration and did some testing.
Using MySql.Data for synchronous operation:
128 MB (~12.5 sec)
256 MB (~6.5 sec)
Using MySqlConnector for asynchronous operation:
128 MB (~11 sec)
256 MB (~5 sec)
Interestingly, the execution time on my laptop increased from ~4 sec (for Npgsql) to 12-15 sec (for MySql.Data and MySqlConnector).
At this point, I'll just allocate more memory to the Lambda function to solve this issue. But if anyone knows why the connection establishment took so long, I'd appreciate an answer. I might post some profiling results later.
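For reference, this is the kind of timing harness I have in mind for the profiling (a rough sketch; the host and credentials are placeholders, and it assumes the server also accepts non-SSL connections, so that the TLS cost can be separated from DNS, JIT and authentication):
using System;
using System.Diagnostics;
using Npgsql;

class ConnectionTiming
{
    static void TimeOpen(string label, string connStr)
    {
        var sw = Stopwatch.StartNew();
        using (var conn = new NpgsqlConnection(connStr))
        {
            conn.Open();
        }
        sw.Stop();
        Console.WriteLine($"{label}: {sw.ElapsedMilliseconds} ms");
    }

    static void Main()
    {
        // Placeholder connection details.
        const string baseStr = "Host=db.example.internal;Username=app;Password=secret;Database=appdb";

        TimeOpen("first open, SSL disabled", baseStr + ";SSL Mode=Disable");
        TimeOpen("second open, pooled", baseStr + ";SSL Mode=Disable");
        TimeOpen("first open, SSL required", baseStr + ";SSL Mode=Require;Trust Server Certificate=true");
    }
}
If the SSL-required case accounts for most of the first-open time, that would point at the TLS handshake (and authentication) as the CPU-heavy part, which would fit the observation that more Lambda memory, and therefore more CPU, helps.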

Tuning autovacuum for update-heavy tables

I have a very update-heavy table.
The problem is that the table grows a lot, because autovacuum cannot keep up.
Autovacuum kicks in every 2 minutes, so it is running fine.
I have an application making around 50k updates (and a few inserts) every 60-70 seconds.
I am forced to run CLUSTER every week, which is not the best solution IMO.
My autovacuum settings are as follows (plus related):
maintenance_work_mem = 2GB
autovacuum_vacuum_cost_limit = 10000
autovacuum_naptime = 10
autovacuum_max_workers = 5
autovacuum_vacuum_cost_delay = 10ms
vacuum_cost_limit = 1000
autovacuum_vacuum_scale_factor = 0.1 # this is set only for this table
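(That per-table override is applied as a storage parameter, i.e. ALTER TABLE sla SET (autovacuum_vacuum_scale_factor = 0.1); where sla is the table from the log below.)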
Here is the autovacuum log:
2018-01-02 12:47:40.212 UTC [2289] LOG: automatic vacuum of table "mydb.public.sla": index scans: 1
    pages: 0 removed, 21853 remain, 1 skipped due to pins, 0 skipped frozen
    tuples: 28592 removed, 884848 remain, 38501 are dead but not yet removable
    buffer usage: 240395 hits, 15086 misses, 22314 dirtied
    avg read rate: 29.918 MB/s, avg write rate: 44.252 MB/s
    system usage: CPU 0.47s/2.60u sec elapsed 3.93 sec
It seems to me that 29 MB/s is way too low, considering my hardware is pretty OK.
I have Intel NVMe devices, and my I/O doesn't seem to be saturated.
PostgreSQL version: 9.6
CPU: dual Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Any ideas how to improve this?

Why is a kdb process showing high memory usage on the system?

I am running into serious memory issues with my kdb process. Here is the architecture in brief.
The process runs in slave mode (4 slaves). It initially loads a ton of data from the database into memory (the total size of all variables loaded in memory, calculated from -22!, is approximately 11G). Initially this matches .Q.w[] and is close to the Unix process memory usage. The data set grows very little during incremental operations. However, after a long operation, although the kdb internal memory stats (.Q.w[]) show the expected memory usage (both used and heap) of ~13G, the process is consuming close to 25G on the system (Unix /proc, top), eventually running out of physical memory.
Now, when I run garbage collection manually (.Q.gc[]), it frees up memory and brings the Unix process usage close to the heap number displayed by .Q.w[].
I am running q version 2.7 with the -g 1 option, to run garbage collection in immediate mode.
Why is the Unix process usage so significantly different from the kdb internal statistics, and where is the difference coming from? Why is the -g 1 option not working? When I run a simple example, it works fine, but in this case it seems to leak a lot of memory.
I tried version 2.6, which is supposed to have automated garbage collection. Surprisingly, there is still a huge difference between the used and heap numbers from .Q.w[] when running version 2.6, both in single-threaded (each) and multi-threaded (peach) modes. Any ideas?
I am not sure of the concrete answer, but this is my deduction based on the following information (and some practical experiments), which is documented on the wiki:
http://code.kx.com/q/ref/control/#peach
It says:
Memory Usage
Each slave thread has its own heap, a minimum of 64MB.
Since kdb 2.7 2011.09.21, .Q.gc[] in the main thread executes gc in the slave threads too.
Automatic garbage collection within each thread (triggered by a wsfull, or by hitting the artificial heap limit as specified with -w on the command line) is only executed for that particular thread, not across all threads.
Symbols are internalized from a single memory area common to all threads.
My observations:
Thread-specific memory:
.Q.w[] only shows the stats of the main thread, not the sum over all threads (total process memory). This can be tested by starting q with 2 slave threads: total memory in that case should be at least 128 MB as per the first point above, but .Q.w[] still shows about 64 MB (a quick sketch of this test follows at the end of this answer).
That's why, in your case, the memory stats were close to the Unix stats at the start: all the data was in the main thread and nothing was on the other threads. After doing some operations, some threads may have taken memory (used/garbage) that is not shown by .Q.w[].
Garbage collector call:
As mentioned on the wiki, calling the garbage collector on the main thread runs GC on all threads. That likely collected the garbage memory from the slave threads and reduced the total memory usage, which is what the lower Unix memory stats reflected.
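A quick sketch of that test (numbers are approximate and version dependent; the peach expression is just an arbitrary load for the slave threads):
q -s 2                               / start q with 2 slave threads
.Q.w[]                               / used/heap here reflect the main thread only
{sum x} peach 2#enlist til 1000000   / push some work onto the slave threads
.Q.gc[]                              / from 2.7 2011.09.21 this also runs gc in the slave threads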

Why does WinDbg !eeheap -gc show a much smaller managed heap than VMMap.exe?

I have a C# application whose memory usage increases over time. I've taken periodic user-mode dumps and, after loading SOS, run !eeheap -gc to monitor the managed heap size. In windbg/sos I've seen it start at ~14 MB, grow up to 160 MB, then shrink back to 15 MB, but the application's "Private Bytes" never decreases significantly. I have identified the activity that causes the increase in "Private Bytes", so I can control when the memory growth occurs.
I tried running VMMap.exe and noticed it reports a managed heap of ~360 MB; I took a quick dump, and using windbg/sos/!eeheap -gc I only see 15 MB.
Why am I seeing such different values?
Is the managed heap really what vmmap.exe reports?
How can I examine this area of the managed heap in windbg?
You can't break into a .NET application with WinDbg and then run VMMap at the same time; this will result in a hanging VMMap. Nor can you do it in the opposite direction: start VMMap first, then break in with WinDbg, and then refresh the values in VMMap.
Therefore the values shown by VMMap and WinDbg are probably never equal, because the numbers are from different points in time. Different points in time could also mean that the garbage collector has run. If the application is not changing much, the values should be close.
In my tests, the committed part of the managed heap in VMMap is the sum of !eeheap -gc and !eeheap -loader, which sounds reasonable.
Given the output of !eeheap -gc, we get the start of the GC heap at generation 2 (11aa0000) and a size of only 3.6 MB.
Number of GC Heaps: 1
generation 0 starts at 0x0000000011d110f8
generation 1 starts at 0x0000000011cd1130
generation 2 starts at 0x0000000011aa1000
...
GC Heap Size 0x374a00(3623424)
!address gives the details:
...
+ 0`11aa0000 0`11ef2000 0`00452000 MEM_PRIVATE MEM_COMMIT PAGE_READWRITE <unknown>
0`11ef2000 0`21aa0000 0`0fbae000 MEM_PRIVATE MEM_RESERVE <unknown>
0`21aa0000 0`21ac2000 0`00022000 MEM_PRIVATE MEM_COMMIT PAGE_READWRITE <unknown>
0`21ac2000 0`29aa0000 0`07fde000 MEM_PRIVATE MEM_RESERVE <unknown>
+ 0`29aa0000 0`6ca20000 0`42f80000 MEM_FREE PAGE_NOACCESS Free
...
Although not documented, I believe that a new segment starts at 11aa0000, indicated by the + sign. The GC segment ends at 29aa0000, which is also the starting point of the next segment. Cross check: .NET memory should be reported as <unknown> in the last column - ok.
The total GC size (reserved + committed) is
?29aa0000-11aa0000
Evaluate expression: 402653184 = 00000000`18000000
which is 402 MB (393,216 kB), which in my case is very close to the 395,648 kB reported by VMMap.
If you have more GC heaps, the whole process needs more effort. Therefore I typically take a shortcut, which is OK if you know that you don't have anything other than .NET that calls VirtualAlloc(). Type !address -summary and then look at the first <unknown> entry:
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 144 7ff`d8a09000 ( 7.999 Tb) 99.99%
<unknown> 180 0`1a718000 ( 423.094 Mb) 67.17% 0.01%
...
Thank you very much for the detailed answer. Much appreciated.
I'm clear on the WinDbg vs. VMMap access/control of the program. Since I can cause the leak with an external action, I'm pretty sure that once I quiesce the activity, memory won't grow much between samples.
I had been relying on the last line of output from !eeheap -gc:
GC Heap Size: Size: 0xed7458 (15561816) bytes.
I think this number must be the amount of managed heap in use (with un-freed objects in it). I summed all the "size" bytes reported by !eeheap -gc for each SOH and LOH, and it matches the above value.
I ran VMMap, took a snapshot and quit VMMap. Then I attached to the process with WinDbg. Your technique of using !address was most enlightening. I'm using a 12-processor server system, so there are SOHs and LOHs for each processor, i.e. 12 to sum. Taking your lead, the output from !eeheap -gc has the segments for all of the heaps. I fed them all into !address and summed their sizes (plus the size reported by !eeheap -loader). The result was 335,108 K, which is within the variation I'd expect to see given the time elapsed (within 600 K).
The VMMap Managed Heap seems to be the total of all the memory segments committed for use by the managed heap (I didn't check the Reserved numbers). So now I see why the total reported by !eeheap -gc is so much less than what VMMap shows.
Thanks!