How to benefit from heap tagging by DLL? - windbg

How do I use and benefit from the GFlags setting Enable heap tagging by DLL?
I know how to activate the setting for a process, but I could not find any useful information in the output of !heap -t in WinDbg. I was expecting output like this:
0:000> !heap -t
Index Address Allocated by
1: 005c0000 MyDll.dll
2: 006b0000 AnotherDll.dll
so that I can identify which heap was created by which DLL and then e.g. identify the source of a memory leak.
Is this a misunderstanding of the term "heap tagging by DLL" or do I need some more commands to get to the desired result?
My research so far:
I googled for a tutorial on this topic, but couldn't find a detailed description
I read WinDbg's .hh !heap, but it isn't explained there in detail either; Tag is only used with !heap -b

Again, a very late answer.
To benefit from heap tagging you first need to create a tag in your code.
As far as I know (that is, up to XP SP3) there were no documented APIs to create a tag.
(I haven't mucked with the heap since then, so I am not aware of newer APIs; in Vista and later the heap manager was rewritten, so many of the behaviors I describe below may have been corrected, improved, or had their bugs removed.)
On XP SP3 you can use the undocumented RtlCreateTagHeap to create a new tag for either the process heap or a private heap.
After you create the tag you need to set the global flags 0x8000 | 0x800:
htg - Enable heap tagging (0x800)
htd - Enable heap tagging by DLL (0x8000)
Theoretically, all allocs and frees should then get tagged.
In practice, however, only allocations > 512 KB get tagged on XP SP3 with these basic steps.
Whether it is a bug or a feature, tagging is limited to allocations and frees > 512 KB:
HeapAlloc goes through ZwAllocateVirtualMemory for allocations > 512 KB in a 32-bit process (refer to the HeapCreate / HeapAlloc documentation on MSDN).
As a debugging aid you can patch ntdll.dll on the fly to enable tagging for all allocations and frees.
Below is a sample that demonstrates the tagging and how to view it all in WinDbg.
Compile with cl /Zi /analyze /W4 <src> /link /RELEASE
Execute the app under WinDbg and watch the tagging with the !heap * -t command.
#include <windows.h>
#include <stdio.h>
// Heap tags are either broken or intentionally applied only to
// allocations > 512 KB. Allocations > 512 KB go through the
// VirtualAlloc route for a heap created with maxsize set to 0.
// Uncomment ALLOCSIZE 0xfdfd2 and recompile to watch tagging
// increase by 100%; with ALLOCSIZE 0xfdfd1 only 50 allocs and
// frees that are > 512 KB will be tagged. These magic numbers
// are related to the comment in the HeapCreate documentation
// stating that slightly less than 512 KB will be allocated for
// a 32-bit process. Tagging can be dramatically increased by
// patching ntdll when stopped on the system breakpoint: patch
// 7c94b8a4 (XP SP3 ntdll.dll). Use the command below in WinDbg
// to find the offset of the pattern (the command must be on a
// single line, no line breaks):
// .foreach /pS 4 /ps 4 ( place { !grep -i -e call -c
//     "# call*RtlpUpdateTagEntry 7c900000 l?20000" } ) { ub place }
// The instruction we are searching for to patch is:
// 7c94b8a1 81e3ff0fffff    and ebx,0FFFF0FFFh
// Patch 0f to 00 at the system breakpoint with: eb 7c94b8a1+3 00

#define BUFFERSIZE 100
#define ALLOCSIZE 0xfdfd1
//#define ALLOCSIZE 0xfdfd2

typedef int (__stdcall *g_RtlCreateTagHeap)(
    HANDLE   hHeap,
    void    *unknown,
    wchar_t *BaseString,
    wchar_t *TagString
    );

void HeapTagwithHeapAllocPrivate()
{
    PCHAR pch[BUFFERSIZE] = {};
    HANDLE hHeap = 0;
    ULONG tag1 = 0;
    ULONG tag2 = 0;
    ULONG tag3 = 0;
    ULONG tag4 = 0;
    ULONG tag5 = 0;
    g_RtlCreateTagHeap RtlCreateTagHeap = 0;
    HMODULE hMod = LoadLibrary("ntdll.dll");
    if (hMod)
    {
        RtlCreateTagHeap = (g_RtlCreateTagHeap)
            GetProcAddress(hMod, "RtlCreateTagHeap");
    }
    if (hHeap == 0)
    {
        hHeap = HeapCreate(0, 0, 0);
        if (RtlCreateTagHeap != NULL)
        {
            tag1 = RtlCreateTagHeap(hHeap, 0, L"HeapTag!", L"MyTag1");
            tag2 = RtlCreateTagHeap(hHeap, 0, L"HeapTag!", L"MyTag2");
            tag3 = RtlCreateTagHeap(hHeap, 0, L"HeapTag!", L"MyTag3");
            tag4 = RtlCreateTagHeap(hHeap, 0, L"HeapTag!", L"MyTag4");
        }
    }
    HANDLE DefHeap = GetProcessHeap();
    if ((RtlCreateTagHeap != NULL) && (DefHeap != NULL))
    {
        tag5 = RtlCreateTagHeap(DefHeap, 0, L"HeapTag!", L"MyTag5");
        for (int i = 0; i < BUFFERSIZE; i++)
        {
            pch[i] = (PCHAR) HeapAlloc(DefHeap, HEAP_ZERO_MEMORY | tag5, 1);
            HeapFree(DefHeap, 0, pch[i]);
        }
    }
    if (hHeap)
    {
        for (int i = 0; i < BUFFERSIZE; i++)
        {
            pch[i] = (PCHAR) HeapAlloc(hHeap, HEAP_ZERO_MEMORY | tag1, 1);
            // let's leak all allocs; patch ntdll to see the tagging details
            //HeapFree(hHeap, 0, pch[i]);
        }
        for (int i = 0; i < BUFFERSIZE; i++)
        {
            pch[i] = (PCHAR) HeapAlloc(hHeap, HEAP_ZERO_MEMORY | tag2, 100);
            // let's leak 40% of the allocs; patch ntdll to see the tagging details
            if (i >= 40)
                HeapFree(hHeap, 0, pch[i]);
        }
        // slightly less than 512 KB: no tagging
        for (int i = 0; i < BUFFERSIZE / 2; i++)
        {
            pch[i] = (PCHAR) HeapAlloc(
                hHeap, HEAP_ZERO_MEMORY | tag3, ALLOCSIZE / 2);
        }
        // > 512 KB: tagged by default
        for (int i = BUFFERSIZE / 2; i < BUFFERSIZE; i++)
        {
            pch[i] = (PCHAR) HeapAlloc(
                hHeap, HEAP_ZERO_MEMORY | tag4, ALLOCSIZE);
        }
        for (int i = 0; i < BUFFERSIZE; i++)
        {
            HeapFree(hHeap, 0, pch[i]);
        }
    }
}

void __cdecl main()
{
    HeapTagwithHeapAllocPrivate();
}
Run the compiled exe under the debugger as shown below.
Default execution and inspection (only 50 tags will be visible; all of them are > 512 KB allocations):
cdb -c "g;!heap * -t;q" newheaptag.exe | grep Tag
heaptag:\>cdb -c "g;!heap * -t;q" newheaptag.exe | grep Tag
Tag Name Allocs Frees Diff Allocated
Tag Name Allocs Frees Diff Allocated
Tag Name Allocs Frees Diff Allocated
0004: HeapTag!MyTag4 50 50 0 0
Patching ntdll at the system breakpoint should make all tags visible (eb = write byte).
Patch, run the exe, and on exit inspect the heaps with tags:
cdb -c "eb 7c94b8a1+3 00;g;!heap * -t;q" newheaptag.exe | grep Tag
heaptag:\>cdb -c "eb 7c94b8a1+3 00;g;!heap * -t;q" newheaptag.exe | grep Tag
Tag Name Allocs Frees Diff Allocated
0012: HeapTag!MyTag5 100 100 0 0 <-our tag in process heap
Tag Name Allocs Frees Diff Allocated
Tag Name Allocs Frees Diff Allocated
0001: HeapTag!MyTag1 100 0 100 3200 <--- leak all
0002: HeapTag!MyTag2 100 60 40 5120 <--- leak 40 %
0003: HeapTag!MyTag3 50 50 0 0 <--- clean < 512 kB
0004: HeapTag!MyTag4 50 50 0 0 <----clean > 512 kB


Why is Devel::LeakTrace leaking memory?

I am trying to learn more about how to detect memory leaks in Perl.
I have this program:
p.pl:
#! /usr/bin/env perl
use Devel::LeakTrace;
my $foo;
$foo = \$foo;
Output:
leaked SV(0xac2df8e0) from ./p.pl line 5
leaked SV(0xac2df288) from ./p.pl line 5
Why does this leak two scalars (and not just one)?
Then I run it through valgrind. First I created a debugging version of perl:
$ perlbrew install perl-5.30.0 --as=5.30.0-D3L -DDEBUGGING \
-Doptimize=-g3 -Accflags="-DDEBUG_LEAKING_SCALARS"
$ perlbrew use 5.30.0-D3L
$ cpanm Devel::LeakTrace
Then I ran valgrind setting PERL_DESTRUCT_LEVEL=2 as recommended in perlhacktips:
$ PERL_DESTRUCT_LEVEL=2 valgrind --leak-check=yes perl p.pl
==12479== Memcheck, a memory error detector
==12479== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==12479== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==12479== Command: perl p.pl
==12479==
leaked SV(0x4c27320) from p.pl line 5
leaked SV(0x4c26cc8) from p.pl line 5
==12479==
==12479== HEAP SUMMARY:
==12479== in use at exit: 105,396 bytes in 26 blocks
==12479== total heap usage: 14,005 allocs, 13,979 frees, 3,011,508 bytes allocated
==12479==
==12479== 16 bytes in 1 blocks are definitely lost in loss record 5 of 21
==12479== at 0x483874F: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==12479== by 0x484851A: note_changes (LeakTrace.xs:80)
==12479== by 0x48488E3: XS_Devel__LeakTrace_hook_runops (LeakTrace.xs:126)
==12479== by 0x32F0A2: Perl_pp_entersub (pp_hot.c:5237)
==12479== by 0x2C0C50: Perl_runops_debug (dump.c:2537)
==12479== by 0x1A2FD9: Perl_call_sv (perl.c:3043)
==12479== by 0x1ACEE3: Perl_call_list (perl.c:5084)
==12479== by 0x181233: S_process_special_blocks (op.c:10471)
==12479== by 0x180989: Perl_newATTRSUB_x (op.c:10397)
==12479== by 0x220D6C: Perl_yyparse (perly.y:295)
==12479== by 0x3EE46B: S_doeval_compile (pp_ctl.c:3502)
==12479== by 0x3F4F87: S_require_file (pp_ctl.c:4322)
==12479==
==12479== LEAK SUMMARY:
==12479== definitely lost: 16 bytes in 1 blocks
==12479== indirectly lost: 0 bytes in 0 blocks
==12479== possibly lost: 0 bytes in 0 blocks
==12479== still reachable: 105,380 bytes in 25 blocks
==12479== suppressed: 0 bytes in 0 blocks
==12479== Reachable blocks (those to which a pointer was found) are not shown.
==12479== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==12479==
==12479== For counts of detected and suppressed errors, rerun with: -v
==12479== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
so 16 bytes are lost. However, if I comment out the line use Devel::LeakTrace in p.pl and run valgrind again, the output is:
==12880== Memcheck, a memory error detector
==12880== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==12880== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==12880== Command: perl p.pl
==12880==
==12880==
==12880== HEAP SUMMARY:
==12880== in use at exit: 0 bytes in 0 blocks
==12880== total heap usage: 1,770 allocs, 1,770 frees, 244,188 bytes allocated
==12880==
==12880== All heap blocks were freed -- no leaks are possible
==12880==
==12880== For counts of detected and suppressed errors, rerun with: -v
==12880== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
So the question is: Why is Devel::LeakTrace causing a memory leak?
It seems like there are even more memory leaks than valgrind reported.
Each time a new SV is created, Devel::LeakTrace records the current file name and line number in a 16-byte structure called when:
typedef struct {
char *file;
int line;
} when;
These blocks are allocated at line #80 with malloc(), but the module never frees them. So the more scalars are created, the more memory leaks.
Some background information
The module tries to determine leaked SVs from the END{} phaser. At that point all SVs allocated by the main program should have gone out of scope and had their reference count decremented to zero, which destroys them. However, if for some reason the reference count does not drop to zero, the scalar is not destroyed and freed back to perl's internal memory management pool. In that case the module considers the scalar leaked.
Note that this is not the same as leaked memory as seen from the operating system's memory pool handled by e.g. malloc(). When perl exits, it still frees any leaked scalars (from its internal memory pool) back to the system's memory pool.
This means the module is not meant to detect leaked system memory. For that, we can use e.g. valgrind.
The module hooks into the perl runops loop, and for each OP of type OP_NEXTSTATE it scans all arenas, and all SVs in them, for new SVs (that is, SVs that have been introduced since the previous OP_NEXTSTATE).
For the sample program p.pl in my question I counted 31 arenas, each with space for 71 SVs. Almost all of these SVs were in use during run time (approximately 2150 of them). The module keeps each of these SVs in a hash called used, with the key equal to the address of the SV and the value equal to the when block (see above) recording where the scalar was allocated. For each OP_NEXTSTATE it can then scan all SVs and check whether any of them are not present in the used hash.
The used hash is not a perl hash (I guess this is to avoid conflicts with the allocated SVs the module tries to keep track of); instead, the module uses GLib hash tables.
Patch
In order to keep track of the allocated when blocks, I used a new GLib hash called when_hash. After the module has printed the leaked scalars, the when blocks can be freed by iterating over the keys of when_hash.
I also found that the module never frees the used hash. As far as I can see, it should be calling GLib's g_hash_table_destroy() to release it from the END{} block. Here is the patch:
LeakTrace.xs (patched):
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
#include <glib.h>

typedef struct {
    char *file;
    int line;
} when;

/* a few globals, never mind the mess for now */
GHashTable *used = NULL;
GHashTable *new_used = NULL;

/* cargo from Devel::Leak - wander the arena, see what SVs live */
typedef long used_proc _((void *,SV *,long));

/* PATCH: fix memory leaks */
/***************************/
GHashTable *when_hash = NULL;           /* store the allocated when blocks here */
static int have_run_end_hook = 0;       /* indicator to runops that we are done */
static runops_proc_t save_orig_run_ops; /* original runops function */

/* Called from END{}, i.e. from show_used() after having printed the leaks.
 * Free memory allocated for the when blocks */
static void
free_when_block(gpointer key, gpointer value, gpointer user_data) {
    free(key);
}

static void
do_cleanup() {
    /* this line was missing from the original show_used() */
    if (used) g_hash_table_destroy( used );
    if (when_hash) {
        g_hash_table_foreach( when_hash, free_when_block, NULL );
        g_hash_table_destroy( when_hash );
    }
    PL_runops = save_orig_run_ops;
    have_run_end_hook = 1;
}
/* END PATCH: fix memory leaks */
/*******************************/

static long int
sv_apply_to_used(void *p, used_proc *proc, long n) {
    SV *sva;
    for (sva = PL_sv_arenaroot; sva; sva = (SV *) SvANY(sva)) {
        SV *sv = sva + 1;
        SV *svend = &sva[SvREFCNT(sva)];
        while (sv < svend) {
            if (SvTYPE(sv) != SVTYPEMASK) {
                n = (*proc) (p, sv, n);
            }
            ++sv;
        }
    }
    return n;
}
/* end Devel::Leak cargo */

static long
note_used(void *p, SV* sv, long n) {
    when *old = NULL;
    if (used && (old = g_hash_table_lookup( used, sv ))) {
        g_hash_table_insert(new_used, sv, old);
        return n;
    }
    g_hash_table_insert(new_used, sv, p);
    return 1;
}

static void
print_me(gpointer key, gpointer value, gpointer user_data) {
    when *w = value;
    char *type;
    switch (SvTYPE((SV*)key)) {
        case SVt_PVAV: type = "AV"; break;
        case SVt_PVHV: type = "HV"; break;
        case SVt_PVCV: type = "CV"; break;
        case SVt_RV:   type = "RV"; break;
        case SVt_PVGV: type = "GV"; break;
        default:       type = "SV";
    }
    if (w->file) {
        fprintf(stderr, "leaked %s(0x%x) from %s line %d\n",
                type, key, w->file, w->line);
    }
}

static int
note_changes( char *file, int line ) {
    static when *w = NULL;
    int ret = 0;
    /* PATCH */
    if (have_run_end_hook) return 0; /* do not enter after cleanup is complete */
    /* if (!w) w = malloc(sizeof(when)); */
    if (!w) {
        w = malloc(sizeof(when));
        if (!when_hash) {
            /* store pointers to the allocated blocks here */
            when_hash = g_hash_table_new( NULL, NULL );
        }
        g_hash_table_insert(when_hash, w, NULL); /* store the address of w */
    }
    /* END PATCH */
    w->line = line;
    w->file = file;
    new_used = g_hash_table_new( NULL, NULL );
    if (sv_apply_to_used( w, note_used, 0 )) w = NULL;
    if (used) g_hash_table_destroy( used );
    used = new_used;
    return ret;
}

/* Now this bit of cargo is derived from Devel::Caller */
static int
runops_leakcheck(pTHX) {
    char *lastfile = 0;
    int lastline = 0;
    IV last_count = 0;
    while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
        PERL_ASYNC_CHECK();
        if (PL_op->op_type == OP_NEXTSTATE) {
            if (PL_sv_count != last_count) {
                note_changes( lastfile, lastline );
                last_count = PL_sv_count;
            }
            lastfile = CopFILE(cCOP);
            lastline = CopLINE(cCOP);
        }
    }
    note_changes( lastfile, lastline );
    TAINT_NOT;
    return 0;
}

MODULE = Devel::LeakTrace PACKAGE = Devel::LeakTrace

PROTOTYPES: ENABLE

void
hook_runops()
    PPCODE:
    {
        save_orig_run_ops = PL_runops; /* save so do_cleanup() can restore it */
        note_changes(NULL, 0);
        PL_runops = runops_leakcheck;
    }

void
reset_counters()
    PPCODE:
    {
        if (used) g_hash_table_destroy( used );
        used = NULL;
        note_changes(NULL, 0);
    }

void
show_used()
    CODE:
    {
        if (used) g_hash_table_foreach( used, print_me, NULL );
        /* PATCH */
        do_cleanup(); /* release allocated memory, restore the original runops */
        /* END PATCH */
    }
Testing the patch
$ wget https://www.cpan.org/modules/by-module/Devel/Devel-LeakTrace-0.06.tar.gz
$ tar zxvf Devel-LeakTrace-0.06.tar.gz
$ cd Devel-LeakTrace-0.06
$ perlbrew use 5.30.0-D3L
# replace lib/Devel/LeakTrace.xs with my patch
$ perl Makefile.PL
$ make
$ make install # <- installs the patch
# cd to test folder, then
$ PERL_DESTRUCT_LEVEL=2 valgrind --leak-check=yes perl p.pl
==25019== Memcheck, a memory error detector
==25019== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==25019== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==25019== Command: perl p.pl
==25019==
leaked SV(0x4c26cd8) from p.pl line 5
leaked SV(0x4c27330) from p.pl line 5
==25019==
==25019== HEAP SUMMARY:
==25019== in use at exit: 23,324 bytes in 18 blocks
==25019== total heap usage: 13,968 allocs, 13,950 frees, 2,847,004 bytes allocated
==25019==
==25019== LEAK SUMMARY:
==25019== definitely lost: 0 bytes in 0 blocks
==25019== indirectly lost: 0 bytes in 0 blocks
==25019== possibly lost: 0 bytes in 0 blocks
==25019== still reachable: 23,324 bytes in 18 blocks
==25019== suppressed: 0 bytes in 0 blocks
==25019== Reachable blocks (those to which a pointer was found) are not shown.
==25019== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==25019==
==25019== For counts of detected and suppressed errors, rerun with: -v
==25019== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
First, valgrind reports 16 bytes of leaked memory even in a script containing nothing beyond use Devel::LeakTrace; the leak is independent of the fourth and fifth lines. From your perlhacktips link:
NOTE 3: There are known memory leaks when there are compile-time errors
within eval or require, seeing S_doeval in the call stack is a good sign
of these. Fixing these leaks is non-trivial, unfortunately, but they must be fixed
eventually.
Since I see the line by 0x3F18E5: S_doeval_compile (pp_ctl.c:3502) in my run, and a similar line in your example, I would say that this is why Devel::LeakTrace causes an apparent memory leak.
Second, regarding the original script, Devel::LeakTrace is simply reporting the leak caused by (at least) a circular reference at the fifth line. You can see this by using weaken from Scalar::Util:
#! /usr/bin/env perl
use Devel::LeakTrace;
use Scalar::Util;
my $foo;
$foo = \$foo;
Scalar::Util::weaken($foo);
Then, perl p.pl will not report any leak. My guess is that the first script reports two leaks because, in addition to creating a circular reference, perl loses a pointer at $foo = \$foo. Some magic I don't fully understand happens when you weaken $foo that apparently fixes both issues. You can see this by tweaking the original script:
#! /usr/bin/env perl
use Devel::LeakTrace;
my $foo;
my $bar = \$foo;
$foo = $bar;
The resulting $foo should be identical; we have just created $bar to hold the reference. However, in this case the script reports only one leak.
So, in summary, I would say that 1) Devel::LeakTrace has a bug that shows up as a memory leak in valgrind independently of the code, and 2) perl creates a circular reference and loses a pointer in the original script, which is why Devel::LeakTrace reports two leaks.

Why is accessing every other cache line slower on x86, not matching Intel's documented cache behavior?

According to Intel's optimization manual, the L1 data cache is 32 KiB and 8-way associative with 64-byte lines. I have written the following micro-benchmark to test memory read performance.
I hypothesize that if we access only blocks that can fit in the 32 KiB cache, each memory access will be fast, but if we exceed that cache size, the accesses will suddenly be slower. When skip is 1, the benchmark accesses every line in order.
void benchmark(int bs, int nb, int trials, int skip)
{
    printf("block size: %d, blocks: %d, skip: %d, trials: %d\n", bs, nb, skip, trials);
    printf("total data size: %d\n", nb * bs * skip);
    printf("accessed data size: %d\n", nb * bs);
    uint8_t volatile data[nb * bs * skip];
    clock_t before = clock();
    for (int i = 0; i < trials; ++i) {
        for (int block = 0; block < nb; ++block) {
            data[block * bs * skip];
        }
    }
    clock_t after = clock() - before;
    double ns_per_access = (double)after / CLOCKS_PER_SEC / nb / trials * 1000000000;
    printf("%f ns per memory access\n", ns_per_access);
}
Again with skip = 1, the results match my hypothesis:
~ ❯❯❯ ./bm -s 64 -b 128 -t 10000000 -k 1
block size: 64, blocks: 128, skip: 1, trials: 10000000
total data size: 8192
accessed data size: 8192
0.269054 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 256 -t 10000000 -k 1
block size: 64, blocks: 256, skip: 1, trials: 10000000
total data size: 16384
accessed data size: 16384
0.278184 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 1
block size: 64, blocks: 512, skip: 1, trials: 10000000
total data size: 32768
accessed data size: 32768
0.245591 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 1024 -t 10000000 -k 1
block size: 64, blocks: 1024, skip: 1, trials: 10000000
total data size: 65536
accessed data size: 65536
0.582870 ns per memory access
So far, so good: when everything fits in L1 cache, the inner loop runs about 4 times per nanosecond, or a bit more than once per clock cycle. When we make the data too big, it takes significantly longer. This is all consistent with my understanding of how the cache should work.
Now let's access every other block by letting skip be 2.
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 2
block size: 64, blocks: 512, skip: 2, trials: 10000000
total data size: 65536
accessed data size: 32768
0.582181 ns per memory access
This violates my understanding! It would make sense for a direct-mapped cache, but since our cache is associative, I can't see why the lines should be conflicting with each other. Why is accessing every other block slower?
But if I set skip to 3, things are fast again. In fact, any odd value of skip is fast; any even value is slow. For example:
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 7
block size: 64, blocks: 512, skip: 7, trials: 10000000
total data size: 229376
accessed data size: 32768
0.265338 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 12
block size: 64, blocks: 512, skip: 12, trials: 10000000
total data size: 393216
accessed data size: 32768
0.616013 ns per memory access
Why is this happening?
For completeness: I am on a mid-2015 MacBook Pro running macOS 10.13.4. My full CPU brand string is Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz. I am compiling with cc -O3 -o bm bm.c; the compiler is the one shipped with Xcode 9.4.1. I have omitted the main function; all it does is parse the command-line options and call benchmark.
The cache is not fully associative, it's set-associative: each address maps to a particular set, and the associativity only applies among lines that map to the same set.
By making the step equal 2, you keep half of the sets out of the game, so the fact that you effectively access 32 KiB doesn't matter - you only have 16 KiB available (the even sets, for example), so you still exceed your capacity and start thrashing (and fetching data from the next level).
When the step is 3 the problem is gone, since after wrapping around you can use all the sets. The same goes for any odd step, because the number of sets is a power of two, and more generally for any step coprime with the set count - which is why primes are sometimes used for address hashing.

luajit qsort callback example memory leak

I have the following qsort example to try out callbacks in LuaJIT. However, it has a memory leak (luajit: not enough memory when executing) whose cause is not obvious to me.
Can somebody give me some hints on how to create a proper callback example?
local ffi = require("ffi")
-- ===============================================================================
ffi.cdef[[
void qsort(void *base, size_t nel, size_t width, int (*compar)(const void *, const void *));
]]
function compare(a, b)
return a[0] - b[0]
end
-- ===============================================================================
-- Explicitly convert to a callback via cast
local callback = ffi.cast("int (*)(const char *, const char *)", compare)
local data = "efghabcd"
local size = 8
local loopSize = 1000 * 1000 * 100.
local bytes = ffi.new("char[15]")
-- ===============================================================================
for i=1,loopSize do
ffi.copy(bytes, data, size)
ffi.C.qsort(bytes, size, 1, callback)
end
Platform: OSX 10.8
luajit: 2.0.1
The problem appears to be that Lua never gets a chance to perform a full garbage-collection cycle inside the tight loop. As hinted in the comment, you can correct this by calling collectgarbage() yourself inside the loop.
Note that calling collectgarbage() on every iteration will affect the running time of whatever you are benchmarking. To minimize this overhead, you can set a threshold to limit how often collectgarbage() gets called:
local memthreshold = 2 ^ 20 / 1024
local start = os.clock()
for i = 1, loopSize do
ffi.copy(bytes, data, size)
ffi.C.qsort(bytes, size, 1, callback)
if collectgarbage'count' > memthreshold then
collectgarbage()
end
end
local elapse = os.clock() - start
print("elapsed:", elapse..'s')

Perl: Devel::Gladiator module and memory management

I have a perl script that needs to run in the background constantly. It consists of several .pm module files and a main .pl file. What the program does is to periodically gather some data, do some computation, and finally update the result recorded in a file.
All the critical data structures are declared in the .pl file with our, and there's no package variable declared in any .pm file.
I used the function arena_table() in the Devel::Gladiator module to produce some information about the arena in the main loop, and found that the counts of SVs of type SCALAR and GLOB increase slowly, resulting in a gradual increase in memory usage.
The output of arena_table (reformatted, title omitted; after a long enough period, only the first two numbers keep increasing):
2013-05-17#11:24:34 36235 3924 3661 3642 3376 2401 720 201 27 23 18 13 13 10 2 2 1 1 1 1 1 1 1 1 1 1
After running for some time:
2013-05-17#12:05:10 50702 46169 36910 4151 3995 3924 2401 720 274 201 26 23 18 13 13 10 2 2 1 1 1 1 1 1 1 1 1
The main loop is something like:
our %hash1 = ();
our %hash2 = ();
# and some more package variables ...
# all are hashes
do {
my $nowtime = time();
collect_data($nowtime);
if (calculate() == 1) {
update();
}
sleep 1;
get_mem_objects(); # calls arena_table()
} while (1);
Except for get_mem_objects, these functions operate on the global hashes declared with our. In update, the program does some log rotation; the code is like:
sub rotate_history() {
my $i = $main::HISTORY{'count'};
if ($i == $main::CONFIG{'times'}{'total'}) {
for ($i--; $i >= 1; $i--) {
$main::HISTORY{'data'}{$i} = dclone($main::HISTORY{'data'}{$i-1});
}
} else {
for (; $i >= 1; $i--) {
$main::HISTORY{'data'}{$i} = dclone($main::HISTORY{'data'}{$i-1});
}
}
$main::HISTORY{'data'}{'0'} = dclone(\%main::COLLECT);
if ($main::HISTORY{'count'} < $main::CONFIG{'times'}{'total'}) {
$main::HISTORY{'count'}++;
}
}
If I comment out the calls to this function, then in the final report given by Devel::Gladiator only the count of SCALAR SVs keeps increasing; the number of GLOBs eventually reaches a stable state. I suspect that dclone may cause the problem here.
My questions are,
What exactly does the information given by that module mean? The statements in the perldoc are a little vague for a perl newbie like me.
What are the common techniques to lower the memory usage of long-running perl scripts?
I know that package variables are stored in the arena, but what about lexical variables? How is the memory consumed by them managed?

Munin plugin in perl doesn't fetch correct data

I'm trying to use a Munin plugin for software raid. Here's the plugin's code: https://github.com/munin-monitoring/contrib/blob/master/plugins/disk/raid
Currently my raid is rebuilding, here's the current output:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sda3[0] sdb3[1]
2925544767 blocks super 1.2 [2/2] [UU]
[==>..................] resync = 14.4% (422554560/2925544767) finish=5246.6min speed=7950K/sec
md1 : active raid1 sda2[0] sdb2[1]
524276 blocks super 1.2 [2/2] [UU]
resync=DELAYED
md0 : active raid1 sda1[0] sdb1[1]
4193268 blocks super 1.2 [2/2] [UU]
resync=DELAYED
unused devices: <none>
But when I run the plugin I get the following output (stating that all disks are synced):
# munin-run raid
md2.value 100
md2_rebuild.value 100
md1.value 100
md1_rebuild.value 100
md0.value 100
md0_rebuild.value 100
In the following lines, I understand (I'm no programmer) that at the time the code runs, $pct is >= 100, and so $rpct gets set to 100 (which is the output I get for all raid arrays).
So which values in my cat /proc/mdstat output do $nact and $nmem represent? That would help me find out why $pct is >= 100.
my $pct = 100 * $nact / $nmem;
my $rpct = 100;
if ( $pct < 100 ) {
    my @output = `/sbin/mdadm -D /dev/$dev | grep Rebuild`;
    if ( $output[0] =~ /([0-9]+)% complete/ ) {
        $rpct = $1;
    } else {
        $rpct = 0;
    }
}
I think this regexp holds the answer, but as I said, I'm no programmer :P
while ($text =~ /(md\d+)\s+:\s+active\s+(\(auto-read-only\)\s+|)(\w+)\s+(.*)\n.*\[(\d+)\/(\d+)]\s+\[(\w+)]/ ) {
my($dev,$dummy,$type,$members,$nmem,$nact,$status) = ($1,$2,$3,$4,$5,$6,$7);
Thanks in advance :-)
change this:
if ( $pct < 100 ) {
to this:
if ( $pct <= 100 ) {
and make sure you're running the plugin as root