windbg diagnosing leaks in 64-bit dumps - !heap not showing memory growth

I am trying to debug a memory leak in a 64-bit C++ native application. The app leaks 1300 bytes 7-10 times a second - via plain malloc().
If I attach to the process with WinDBG and break into it every 60 seconds, !heap does not show any increase in memory allocated.
I did enable User Mode Stack trace database on the process:
gflags /i <process>.exe +ust
In WinDBG (with all the symbols successfully loaded), I'm using:
!heap -stat -h
But the output of the command never changes when I break in even though I can see the Private Bytes increase in Task Manager and a PerfMon trace.
I understand that when allocations are small they go through HeapAlloc(), and when they're bigger they go through VirtualAlloc(). Does !heap not work for HeapAlloc?
This post seems to imply that using DebugDiag might work, but it still boils down to using WinDBG commands to process the dump. I tried that, to no avail.
This post also says that !heap command is broken for 64-bit apps. Could that be the case?
Is there an alternate procedure for diagnosing leaks in 64-bit apps?

!heap does not show any increase in memory allocated.
That may depend on which column you're looking at and how much memory the heap manager has allocated before.
E.g. it's possible that your application has a heap of 100 MB, of which just some blocks of 64kB are moving from the "reserved" column to the "committed" column. If the memory is committed right from the start, you won't see anything at all with a plain !heap command.
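If you want to double-check where the Private Bytes are going independently of the heap statistics, you can tally committed vs. reserved regions yourself. A minimal sketch (my own illustration, assuming you can run code inside the target process or a test double):

#include <windows.h>
#include <cstdio>

int main()
{
    MEMORY_BASIC_INFORMATION mbi;
    SIZE_T committed = 0, reserved = 0;
    unsigned char* addr = nullptr;

    // Walk the whole user-mode address space, one region at a time.
    while (VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi))
    {
        if (mbi.State == MEM_COMMIT)  committed += mbi.RegionSize;
        if (mbi.State == MEM_RESERVE) reserved  += mbi.RegionSize;
        addr = static_cast<unsigned char*>(mbi.BaseAddress) + mbi.RegionSize;
    }
    // "committed" is what Private Bytes / PerfMon track growing.
    printf("committed: %zu bytes, reserved: %zu bytes\n", committed, reserved);
}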
I did enable User Mode Stack trace database on the process
That will help you get the allocation stack traces, but it won't affect the leak in general.
I understand that when allocations are small they go to HeapAlloc(), when they're bigger they go to VirtualAlloc.
Yes, for allocations > 512k.
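To see the difference, you can compare where a small and a large malloc() end up. A sketch (my own illustration, assuming the default CRT heap and the threshold mentioned above):

#include <windows.h>
#include <cstdio>
#include <cstdlib>

static void describe(const char* label, void* p)
{
    MEMORY_BASIC_INFORMATION mbi;
    VirtualQuery(p, &mbi, sizeof(mbi));
    // A small block sits inside a large heap segment; a big block's
    // AllocationBase lies just below the pointer, in a region of its own.
    printf("%s: ptr=%p AllocationBase=%p RegionSize=%zu\n",
           label, p, mbi.AllocationBase, mbi.RegionSize);
}

int main()
{
    describe("small (1300 bytes)", malloc(1300));
    describe("large (8 MB)", malloc(8 * 1024 * 1024));
}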
Does !heap not work for HeapAlloc?
It should. And since C++ malloc() and new both use the Windows Heap manager, they should result in HeapAlloc() sooner or later.
The following code
#include <iostream>
#include <chrono>
#include <thread>
#include <cstdlib> // malloc

int main()
{
    // https://stackoverflow.com/questions/53157722/windbg-diagnosing-leaks-in-64-bit-dumps-heap-not-showing-memory-growth
    //
    // I am trying to debug a memory leak in a 64-bit C++ native application.
    // The app leaks 1300 bytes 7-10 times a second - via plain malloc().
    for (int seconds = 0; seconds < 60; seconds++)
    {
        for (int leakspersecond = 0; leakspersecond < 8; leakspersecond++)
        {
            // Deliberately leak the allocation.
            if (malloc(1300) == nullptr)
            {
                std::cout << "Out of memory. That was unexpected in this simple demo." << std::endl;
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(125));
        }
    }
}
compiled as a 64-bit release build and run under WinDbg 10.0.15063.400 x64 shows
0:001> !heap -stat -h
Allocations statistics for
heap # 00000000000d0000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
514 1a - 8408 (32.24)
521 c - 3d8c (15.03)
[...]
and later
0:001> !heap -stat -h
Allocations statistics for
heap # 00000000000d0000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
514 30 - f3c0 (41.83)
521 18 - 7b18 (21.12)
even without +ust set.
It's 4.5M lines of code.
How do you then know that it leaks 1300 bytes via plain malloc()?

Related

What is the latency of `clwb` and `ntstore` on Intel's Optane Persistent Memory?

In this paper, it is written that 8-byte sequential writes to Optane PM using clwb and ntstore have latencies of 90 ns and 62 ns respectively, and that sequential reads take 169 ns.
But in my test with an Intel 5218R CPU, clwb takes about 700 ns and ntstore about 1200 ns. Of course there are differences between my test method and the paper's, but the results are so much worse that it seems unreasonable. And my test is closer to actual usage.
During the test, did the Write Pending Queue (WPQ) of the CPU's iMC, or the WC buffer in the Optane PM, become the bottleneck, causing stalls and making the measured latency inaccurate? If so, is there a tool to detect it?
#include "libpmem.h"
#include "stdio.h"
#include "x86intrin.h"
//gcc aep_test.c -o aep_test -O3 -mclwb -lpmem
int main()
{
size_t mapped_len;
char str[32];
int is_pmem;
sprintf(str, "/mnt/pmem/pmmap_file_1");
int64_t *p = pmem_map_file(str, 4096 * 1024 * 128, PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
if (p == NULL)
{
printf("map file fail!");
exit(1);
}
if (!is_pmem)
{
printf("map file fail!");
exit(1);
}
struct timeval start;
struct timeval end;
unsigned long diff;
int loop_num = 10000;
_mm_mfence();
gettimeofday(&start, NULL);
for (int i = 0; i < loop_num; i++)
{
p[i] = 0x2222;
_mm_clwb(p + i);
// _mm_stream_si64(p + i, 0x2222);
_mm_sfence();
}
gettimeofday(&end, NULL);
diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;
printf("Total time is %ld us\n", diff);
printf("Latency is %ld ns\n", diff * 1000 / loop_num);
return 0;
}
Any help or correction is much appreciated!
The main reason is that repeatedly flushing the same cache line is delayed dramatically [1].
You are also measuring the average latency instead of the best-case latency, as the FAST '20 paper does.
ntstore is more expensive than clwb, so its latency is higher; I suspect the numbers in your first paragraph are swapped.
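Both issues can be addressed at once: step a whole 64-byte cache line per iteration so no line is flushed twice, and record the best case rather than the average. A sketch (my own illustration, reusing the mapping p from the question's code; clock_gettime itself costs a few tens of ns, so treat the result as an upper bound):

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

// Compile with -O3 -mclwb, as in the question.
static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

void measure_best_case(int64_t *p, int loop_num)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < loop_num; i++)
    {
        int64_t *line = p + i * 8;  // 8 x 8 bytes = one 64-byte cache line
        *line = 0x2222;             // write the line first (the cached-write case)
        uint64_t t0 = now_ns();
        _mm_clwb(line);
        _mm_sfence();
        uint64_t t1 = now_ns();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("best-case clwb + sfence: %llu ns\n", (unsigned long long)best);
}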
Appended on 4/14:
Q: Is there a tool to detect a possible bottleneck in the WPQ or buffers?
A: You can take a baseline while the PM is idle, and use this baseline to indicate a possible bottleneck.
Tools:
Intel Memory Bandwidth Monitoring
Reads two hardware counters from the performance monitoring unit (PMU) in the processor: 1) UNC_M_PMM_WPQ_OCCUPANCY.ALL, which accumulates the number of WPQ entries present in each cycle, and 2) UNC_M_PMM_WPQ_INSERTS, which counts how many entries have been inserted into the WPQ. The average queueing delay of the WPQ is then UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_PMM_WPQ_INSERTS. [2]
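For example (illustrative numbers, not a measurement): if over a sampling window UNC_M_PMM_WPQ_OCCUPANCY.ALL accumulated 2,000,000 while UNC_M_PMM_WPQ_INSERTS counted 100,000 inserts, the average WPQ residency would be 2,000,000 / 100,000 = 20 cycles per entry; a residency well above the idle baseline suggests the WPQ is backing up.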
[1] Chen, Youmin, et al. "Flatstore: An efficient log-structured key-value storage engine for persistent memory." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
[2] Imamura, Satoshi, and Eiji Yoshida. "The analysis of inter-process interference on a hybrid memory system." Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.
https://www.usenix.org/system/files/fast20-yang.pdf describes what they're measuring: the CPU side of doing one store + clwb + mfence for a cached write (see footnote 1). So it's the CPU-pipeline latency of getting a store "accepted" into something persistent.
This isn't the same thing as the data making it all the way to the Optane chips themselves; the Write Pending Queue (WPQ) of the memory controllers is part of the persistence domain on Cascade Lake Intel CPUs like yours. WikiChip reproduces an Intel diagram showing this.
Footnote 1: Also note that clwb on Cascade Lake works like clflushopt - it just evicts. So a store + clwb + mfence loop would test the cache-cold case if you don't do something to load the line before the timed interval. (From the paper's description, I think they do.) Future CPUs will hopefully support clwb properly, but at least Cascade Lake got the instruction supported, so future libraries won't have to check CPU features before using it.
You're doing many stores, which will fill up any buffers in the memory controller or elsewhere in the memory hierarchy. So you're measuring the throughput of a loop, not the latency of one store plus mfence in a previously idle CPU pipeline.
Separate from that, rewriting the same line repeatedly seems to be slower than writing sequentially: this Intel forum post reports "higher latency" for "flushing a cacheline repeatedly" than for flushing different cache lines. (The controller inside the DIMM does do wear leveling, by the way.)
Fun fact: later generations of Intel CPUs (perhaps CPL or ICX) will have even the caches (L3?) in the persistence domain, hopefully making clwb even cheaper. IDK if that would affect back-to-back movnti throughput to the same location, though, or even clflushopt.
During the test, did the Write Pending Queue of CPU's iMC or the WC buffer in the optane PM become the bottleneck, causing blockage, and the measured latency has been inaccurate?
Yes, that would be my guess.
If this is the case, is there a tool to detect it?
I don't know, sorry.

What is the "Other Memory" in db2mtrk

For example if I run db2mtrk -a -v it gives something like
Memory for application 1234
Application Heap is of size 131072 bytes
Other Memory is of size 262144 bytes
Total: 393216 bytes
I can see the Application Heap size in the physical size column when I run db2pd -db foo -mempools, but I can't figure out where the Other Memory total comes from.
I did a google search and couldn't come up with anything. Any ideas?
See the documentation for db2mtrk which states:
"The "Other Memory" reported is the memory associated with the usage
of operating the database management system."
and there are more details on the memory-allocation page showing how that is made up.
It's more convenient to use the following.
select
p.member
, coalesce(a.application_handle, p.application_handle) application_handle
, p.memory_pool_type
, p.edu_id
, p.memory_pool_used, p.memory_pool_used_hwm
, c.application_id, c.coord_member
from table(mon_get_memory_pool(null, current server, -2)) p
left join table(wlm_get_service_class_agents(null, null, null, -2)) a on a.dbpartitionnum=p.member and a.agent_tid=p.edu_id
left join table(mon_get_connection(null, -2)) c on c.application_handle=coalesce(a.application_handle, p.application_handle) and c.member=p.member
where 1234 in (a.application_handle, p.application_handle)
;
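Substitute the application handle you're interested in for the 1234 in the where clause; the query joins each memory pool to its owning agent and connection, so you can see which application the memory belongs to.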

How to interpret avr32-size output?

I have C program running on a AVR32 microcontroller (UC3C0512C).
Issuing the avr32-size -A PROGRAM.elf command generates the following output:
PROGRAM.elf :
section size addr
.reset 8200 2147483648
.rela.got 0 2147491848
.text 99512 2147491848
.exception 512 2147591680
.rodata 5072 2147592192
.dalign 4 4
.data 7036 8
.balign 4 7044
.bss 5856 7048
.heap 48536 12904
.comment 48 0
.debug_aranges 8672 0
.debug_pubnames 14476 0
.debug_info 311236 0
.debug_abbrev 49205 0
.debug_line 208324 0
.debug_frame 23380 0
.debug_str 43961 0
.debug_loc 63619 0
.debug_macinfo 94469328 0
.stack 4096 61440
.data_hram0 512 2684354560
.debug_ranges 8368 0
Total 95379957
Can someone explain how to interpret these values?
How can I calculate the flash and ram usage based on this list?
Update 1:
Without the -A flag, I am getting the following:
text data bss dec hex filename
113296 7548 58496 179340 2bc8c PROGRAM.elf
Update 2:
I'm not using dynamic memory allocation, so according to the avr-libc user manual, the free RAM space should simply be: stack pointer minus __heap_start.
In this case: 61440 - 12904 = 48536 bytes of free RAM.
Can someone confirm that?
(There is a mismatch in the two outputs in your question. The bss number is wildly different.)
If you don't use malloc, and don't count the stack, then yes, the RAM usage is the data plus the bss (plus some alignment spacing). The .data section holds the variables that are initialized in their declaration; the .bss section holds the ones that are not. The C runtime will probably initialize them to 0, but it doesn't have to.
The flash usage will be the text and the data. That is, the flash will include the program instructions and C runtime, but also the values that need to get copied into RAM on startup to initialize those variables. This data is generally tacked onto the end of the program instructions.
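Worked through with the numbers in the question (my arithmetic, using only the figures shown above): flash = text + data = 113296 + 7548 = 120844 bytes. The two outputs are also consistent with each other, which resolves the apparent bss mismatch: text 113296 = .reset 8200 + .text 99512 + .exception 512 + .rodata 5072; data 7548 = .data 7036 + .data_hram0 512; and bss 58496 = .bss 5856 + .heap 48536 + .stack 4096 + 8 bytes of .dalign/.balign padding. On the RAM side, .data + .bss + .heap + .stack plus padding runs from address 4 up to 61440 + 4096 = 65536, i.e. the full 64 kB of SRAM.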
Re: update 2
RAM holds global variables, the heap, and then the stack in that order.
The global variables can be initialized in the program, or not. The .data section is stored in flash, and the C runtime copies these values into the beginning of RAM, where the corresponding variables live, before your code runs. The .bss section of global variables needs space in RAM to hold the values, but they aren't necessarily initialized. The C runtime that comes with avr-gcc does in fact initialize them to 0. The point is that you don't need to store an array of 0s in flash to copy over, as you do with the .data section.
You are not using the heap, but dynamically allocated memory would be obtained from the addresses between __heap_start and __heap_end.
But the stack is not limited in that way. Yes, the stack pointer is initialized at startup, but it moves as your program runs and can grow well into the heap or even into the global variables (a stack overflow). The stack pointer moves whenever a function is called or local variables within a function are used; a large array declared inside a function, for example, will go on the stack.
So in answer to your question, there is no RAM that is guaranteed to remain free.
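To see that layout empirically, printing a few addresses is enough. A sketch (my own illustration; the exact ordering depends on your linker script and C runtime):

#include <cstdio>
#include <cstdlib>

int global_bss;          // uninitialized -> .bss
int global_data = 42;    // initialized   -> .data

int main()
{
    int local = 0;                          // lives on the stack
    int* heap = (int*)malloc(sizeof(int));  // lives on the heap
    // On a typical embedded layout the addresses increase in this order:
    // .data/.bss globals, then the heap, then (near the top of RAM) the stack.
    printf("data: %p  bss: %p  heap: %p  stack: %p\n",
           (void*)&global_data, (void*)&global_bss, (void*)heap, (void*)&local);
    free(heap);
}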
I think you should remove the -A (all) flag, since that gives you the more low-level list you're showing.
The default output is easier to parse, and seems to directly state the values you're after.
Note: I didn't try this, as I don't have a system with an AVR toolchain installed.
I guess that in your linker script you have RAM at 0 and flash at 0x80000000, so everything that needs to go to RAM is at the low addresses (.stack is the last section, at 61440, spanning the next 4k), which takes you right up to the 64k boundary of RAM. Everything else you have is flash.
That is, provided that your linker script is correct.
Also see unwind's comment.
These values are the sections of the compiled C code, as laid out in the ELF file. See the docs for the details. This article is also helpful.
The .text section holds the program instructions, i.e. the compiled assembly. The .data section holds the initialized variables (ints, arrays, etc.). The size column carries the significant information: the size of each section in bytes. The .stack and .heap sections represent the memory set aside for the stack and the heap before the program executes.
You can try
avr32-nm --print-size --radix=d --demangle x.elf
to get the symbol sizes in decimal notation.
Then you can copy & paste into a spreadsheet, filter, sort by the sections, and sum it up.

Suppress iOS Console Output "Unloading xxx unused Assets..."

Is there any way to suppress console output in iPhone player when a new scene is loaded using Application.LoadLevelAdditiveAsync or similar methods?
Unloading 7 Unused Serialized files (Serialized files now loaded: 0 / Dirty serialized files: 0)
Unloading 185 unused Assets to reduce memory usage. Loaded Objects now: 3468. Operation took 377.272217 ms.
System memory in use: 6.7 MB.
Yes, it might not be the most important thing on earth, but it's somewhat annoying when looking for relevant error messages in noisy output.

How can a usage counter in Solaris 10 /proc filesystem decrease?

I'm trying to determine the CPU utilization of specific LWPs in specific processes in Solaris 10 using data from the /proc filesystem. The problem I have is that sometimes a utilization counter decreases.
Here's the gist of it:
// We'll be reading from the file named /proc/<pid>/lwp/<lwpid>/lwpusage,
// where pid and lwpid identify the target process and LWP.
#include <fcntl.h>
#include <unistd.h>
#include <procfs.h>   // prusage_t (Solaris structured /proc)
#include <iostream>
#include <sstream>

std::stringstream filename;
filename << "/proc/" << pid << "/lwp/" << lwpid << "/lwpusage";
int fd = open(filename.str().c_str(), O_RDONLY);
// error checking
while (1)
{
    prusage_t usage;
    ssize_t readResult = pread(fd, &usage, sizeof(prusage_t), 0);
    // error checking
    std::cout << "sec=" << usage.pr_stime.tv_sec
              << " nsec=" << usage.pr_stime.tv_nsec << std::endl;
    // wait
}
close(fd);
The number of nanoseconds reported in the prusage_t struct is derived from timestamps recorded each time an LWP changes state. This feature is called microstate accounting. Sounds good, but every so often the "system call cpu time" counter decreases by roughly 1-10 milliseconds.
Update: it's not just the "system call cpu time" counter; I've since seen other counters decrease as well.
Another curiosity is that it always seems to be exactly one sample that's bogus - never two near each other. All the other samples increase monotonically at the expected rate. This seems to rule out the possibility that the counter is somehow being reset in the kernel.
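Since the bogus samples are isolated one-offs, a simple monotonic clamp on the consumer side would filter them out. A sketch (my own workaround idea, not a fix for the underlying kernel behaviour):

#include <cstdint>

// Convert a prusage-style timestamp to nanoseconds and clamp it so the
// reported value never decreases; a single bogus sample is simply held at
// the previous maximum.
struct MonotonicCounter
{
    uint64_t max_seen = 0;

    uint64_t sample(uint64_t sec, uint64_t nsec)
    {
        uint64_t value = sec * 1000000000ull + nsec;
        if (value > max_seen)
            max_seen = value;
        return max_seen;   // isolated dips are filtered out
    }
};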
Any clues as to what's going on here?
> uname -a
SunOS cdc-build-sol10u7 5.10 Generic_139556-08 i86pc i386 i86pc
If you are on a multicore machine, you might check whether this occurs when the process migrates from one processor core to another. While your processes are running, prstat will show the CPU each one is on. To minimize lock contention, frequently updated data is sometimes kept in a processor-specific memory area and only later synchronized with the copies held by other processors.
Just a guess, but you might want to temporarily disable NTP and see if the problem still appears.