Need help debugging a minidump with WinDbg - windbg

I've read a lot of similar questions, but I can't seem to find an answer to exactly what my problem is.
I've got a set of minidumps from a 32-bit application that was running on 64-bit Windows 2008. The 32-bit Visual Studio on my 32-Bit Vista Business wouldn't touch them at all, so I've been trying to open them in WinDbg.
I don't have the EXACT corresponding .pdb files (we only started saving them AFTER this particular release), but I have .pdbs built by the same machine with the same code. I also have access to the exact executable that created the minidumps.
I found a nifty little application called ChkMatch that can make .pdbs match an executable... the only difference (according to ChkMatch) was age, so I matched my newer .pdbs to the original executable.
However, when I load it in WinDbg, it still says that it is a "mismatched pdb" then, since I had set .symopts+0x40 it tries to load them anyway. I then get the warning:
*** WARNING: Unable to verify checksum for myexe.exe
I ran !lmi myexe and saw that, indeed, the checksum of the executable was in fact zero. From poking around a bit, I've found that the executable should have been built with the /release flag to have a checksum. That's all well and good, but I can't exactly go back in time and rebuild (if I did though, I'd definitely save the original .pdbs :-P ).
Is there anything I can do here? Seems a little ridiculous I can't make things match here at least enough to get a call stack.

you don't need the checksum to get a call stack - this warning can be safely ignored.
to get the stack you need to issue the stack command (any variant of k).
if the minidumps are any good (i.e. describe an actual fault), you should first try the auto analysis !analyze -v which will get you started.
come back when you have exhausted your expertise :o)

If you're working with minidumps then you have to set your image path (Ctrl+I) to point to a location with the images in the dump. The trouble with minidumps is that they don't contain any code or data from the executables on the target, so you have to supply them yourself.
-scott

Related

How to pinpoint where in the program a crash happened using .dmp and WinDbg?

I have a huge application (made in PowerBuilder) that crashes every once in a while so it is hard to reproduce this error. We have it set up so that when a crash like this occurs, we recieve a .dmp file.
I used WinDbg to analyze my .dmp file with the command !analyze -v. From this I can deduct that the error that occured was an Access Violation C0000005. Based on the [0] and [1] parameters, it attempted to dereference a null pointer.
WinDbg also showed me STACK_TEXT consisting of around 30 lines, but I am not sure how to read it. From what I have seen I need to use some sort of symbols.
First line of my STACK_TEXT is this:
00000000`00efca10 00000000`75d7fa46 : 00000000`10df1ae0 00000000`0dd62828 00000000`04970000 00000000`10e00388 : pbvm!ob_get_runtime_class+0xad
From this, my goal is to analyze this file to figure out where exactly in the program this error happened or which function it was in. Is this something I will be able to find after further analyzing the stack trace?
How can I pinpoint where in the program a crash happened using .dmp and WinDbg so I can fix my code?
If you analyze a crash dump with !analyze -v, the lines after STACK TEXT is the stack trace. The output is equivalent to kb, given you set the correct thread and context.
The output of kb is
Child EBP
Return address
First 4 values on the stack
Symbol
The backticks ` tell you that you are running in 64 bit and they split the 64 bit values in the middle.
On 32 bit, the first 4 parameters on the stack were often equivalent to the first 4 parameters to the function, depending on the calling convention.
On 64 bit, the stack is not so relevant any more, because with the 64 bit calling convention, parameters are passed via registers. Therefore you can probably ignore those values.
The interesting part is the symbol like pbvm!ob_get_runtime_class+0xad.
In front of ! is the module name, typically a DLL or EXE name. Look for something that you built. After the ! and before the + is a method name. After the + is the offset in bytes from the beginning of the function.
As long as you don't have functions with thousands of lines of code, that number should be small, like < 0x200. If the number is larger than that, it typically means that you don't have correct symbols. In that case, the method name is no longer reliable, since it's probably just the last known (the last exported) method name and a faaaar way from there, so don't trust it.
In case of pbvm!ob_get_runtime_class+0xad, pbvm is the DLL name, ob_get_runtime_class is the method name and +0xad is the offset within the method where the instruction pointer is.
To me (not knowing anything about PowerBuilder) PBVM sounds like the PowerBuilder DLL implementation for Virtual Memory. So that's not your code, it's the code compiled by Sybase. You'd need to look further down the call stack to find the culprit code in your DLL.
After reading Wikipedia, it seems that PowerBuilder does not necessarily compile to native code, but to intermediate P-Code instead. In this case you're probably out of luck, since your code is never really on the call stack and you need a special debugger or a WinDbg extension (which might not exist, like for Java). Run it with the -pbdebug command line switch or compile it to native code and let it crash again.

How to load symbol files to BCC profiler

With bcc tools' profile, I am getting mostly "[unknown]" in the profile output on my C program. This is, of course, expected because the program symbol are not loaded. However, I am not sure how to properly load the symbols so that "profile" program can pick it up. I have built my program with debug enabled "-g", but how do I load the debug symbols to "profile"?
Please see the DEBUGGING section in bcc profile's manpage:
See "[unknown]" frames with bogus addresses? This can happen for different
reasons. Your best approach is to get Linux perf to work first, and then to
try this tool. Eg, "perf record -F 49 -a -g -- sleep 1; perf script", and
to check for unknown frames there.
The most common reason for "[unknown]" frames is that the target software has
not been compiled
with frame pointers, and so we can't use that simple method for walking the
stack. The fix in that case is to use software that does have frame pointers,
eg, gcc -fno-omit-frame-pointer, or Java's -XX:+PreserveFramePointer.
Another reason for "[unknown]" frames is JIT compilers, which don't use a
traditional symbol table. The fix in that case is to populate a
/tmp/perf-PID.map file with the symbols, which this tool should read. How you
do this depends on the runtime (Java, Node.js).
If you seem to have unrelated samples in the output, check for other
sampling or tracing tools that may be running. The current version of this
tool can include their events if profiling happened concurrently. Those
samples may be filtered in a future version.
In your case, since it's a C program, I would recommend compiling with -fno-omit-frame-pointer.

Win7-64 %windir%, %path% environment variables disappear, can't reload

I am having recurring intermittent problems with "losing" my environment variables, most onerously %windir% and %path%. The problem occurs when I have locked the keyboard and log back in. Rebooting the system (cold- and warm-boot) does not reliably bring them back, but eventually multiple iterations of booting has (so far) brought everything back.
If I open a command window and type echo %windir% and echo %path% and find that the variables exist and are properly defined, and if I leave that command window open, I can leave my system running for days without a problem.
I have captured the results of set to list all envars, both when the system is broken, and when it is fixed. The broken list is much shorter (%windir% is not even defined, %path% contains the definition from registry HKCU\Environment, but not from HKLM\SYSTEM\CurrentControlSet\Control\SessionManager\Environment).
I am guessing that the boot-up process is getting sidetracked.
Spent all morning with Geek Squad but they had no concrete suggestions. (They did suggest "taking the computer back to a previous restore point", but I fear that could cause more problems... and they didn't have confidence it would help.)
Do I have any options beyond possibly reinstalling everything?
I did finally find the answer, and although I understand this subject was closed by the commenter, I thought others might want to know. This link explains it pretty well:
https://superuser.com/questions/355594/windows-7s-path-and-environment-variables-are-corrupted
The short version is this: My system PATH exceeded a Windows maximum of 2048 bytes (it was more than 2200 bytes long). When that happens, the boot process fails to instantiate PATH and WINDIR.
The "fix" was to run c:\windows\system32\systempropertiesadvanced.exe from a command prompt (because without WINDIR, you can't open the Control Panel app for environment variables), and manually extract anything from the PATH I thought I could live without, until I whittled the PATH string down to under 2048 bytes.

Finding out the call site from hex representation

I'm trying to analyse a crash dump of MS BizTalk service, which is constantly consuming 100% CPU (and I assume that's because of our code :) ). I have a couple of dumps and the stack trace of the busiest threads looks similar - the only problem is, that the top of the stack seems to be missing symbols. It looks like this:
0x642`810b2fd0
So, the question is - how can I find out the module/function from this address? (or at least the module, so that I know what symbol file is missing).
lm in WinDbg dumps list of modules. In your case WinDbg does not find any modules that occupy this address -- otherwise it would have printed +. Some of the libraries generate code dynamically, in this case the body of the function will be placed in the heap and won't have any symbols or even module associated with it. I know MCF at some point did this.
I suggest you try to analyze the frames at the top of the stack that have symbols and try to find out what they might be doing.
Wish I could help more, but the only thing I can suggest is reading this cheat sheet of WinDbg commands. There is one command wt which has a list of params which could help with getting module information about that call site.
Let me know if this is any use for you.

Finding a Perl memory leak

SOLVED see Edit 2
Hello,
I've been writing a Perl program to handle automatic upgrading of local (proprietary) programs (for the company I work for).
Basically, it runs via cron, and unfortunately has a memory leak (or something similar). The problem is that the leak only happens when I'm not looking (aka when run via cron, not via command line).
My code does not contain any circular (or other) references, so the commonly cited tools will not help me (Devel::Cycle, Devel::Peek).
How would I go about figuring out what is using so much memory that the kernel kills it?
Basically, the code SFTPs into a server (using ```sftp...`` `), calls OpenSSL to verify the file, and then SFTPs more if more files are needed, and installs them (untars them).
I have seen delays (~15 sec) before the first SFTP session, but it has never used so much memory as to be killed (in my presence).
If I can't sort this out, I'll need to re-write in a different language, and that will take precious time.
Edit: The following message is printed out by the kernel which led me to believe it was a memory leak:
[100023.123] Out of memory: kill process 9568 (update.pl) score 325406 or a child
[100023.123] Killed Process 9568 (update.pl)
I don't believe it is an issue with cron because of the stalling (for ~15 sec, sometimes) when running it via the command-line. Also, there are no environmental variables used (at least by what I've written, maybe underlying things do?)
Edit 2: I found the issue myself, with help from the below comment by mobrule (in response to this question). It turns out that the script was called from a crontab of a user (non-root) just once a day and that (non-root privs) caused a special infinite loop situation.
Sorry guys, I feel kinda stupid for not finding this before, but thanks.
mobrule, if you submit your comment as an answer, I will accept it as it lead to me finding the problem.
End Edits
Thanks,
Brian
P.S. I may be able to post small snippets of code, but not the whole thing due to company policy.
You could try using Devel::Size to profile some of your objects. e.g. in the main:: scope (the .pl file itself), do something like this:
use Devel::Size qw(total_size);
foreach my $varname (qw(varname1 varname2 ))
{
print "size used for variable $varname: " . total_size($$varname) . "\n";
}
Compare the actual size used to what you think is a reasonable value for each object. Something suspicious might pop out immediately (e.g. a cache that is massively bloated beyond anything that sounds reasonable).
Other things to try:
Eliminate bits of functionality one at a time to see if suddenly things get a lot better; I'd start with the use of any external libraries
Is the bad behaviour localized to just one particular machine, or one particular operating system? Move the program to other systems to see how its behaviour changes.
(In a separate installation) try upgrading to the latest Perl (5.10.1), and also upgrade all your CPAN modules
How do you know that it's a memory leak? I can think of many other reasons why the OS would kill a program.
The first question I would ask is "Does this program always work correctly from the command line?". If the answer is "No" then I'd fix these issues first.
On the other hand if the answer is "Yes", I would investigate all the differences between having the program executed under cron and from the command line to find out why it is misbehaving.
If it is run by cron, that shouldn't it die after iteration? If that is the case, hard for me to see how a memory leak would be a big deal...
Are you sure it is the script itself, and not the child processes that are using the memory? Perhaps it ends up creating a real lot of ssh sessions , instead of doing a bunch of stuff in one session?