app not running after enabling gcUnmanagedToManaged MDA - windbg

Because of heap corruption, I enabled the gcUnmanagedToManaged MDA through a system-wide environment variable. Unfortunately, I can't run my app anymore; I always get something like the following:
(a78.1150): Unknown exception - code e053534f (first chance)
eax=00000000 ebx=00dff8ac ecx=00000000 edx=00000000 esi=00000003 edi=00000000
eip=7779013d esp=00dff85c ebp=00dff8f8 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
ntdll!NtWaitForMultipleObjects+0x15:
7779013d 83c404 add esp,4
All my threads seem to be killed (XXXX):
0:002> !threads
ThreadCount: 5
UnstartedThread: 1
BackgroundThread: 3
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
                                    PreEmptive   GC Alloc              Lock
       ID OSID ThreadOBJ    State   GC           Context      Domain   Count APT Exception
XXXX    1  c98 00401568      6020   Enabled  00000000:00000000 003c7698     4 STA
XXXX    2  4cc 0040be80      b220   Enabled  00000000:00000000 003c7698     0 MTA (Finalizer)
XXXX    3 1150 05525948      7220   Disabled 00000000:00000000 003c7698     0 STA System.StackOverflowException (0265106c)
XXXX    4  e2c 055778c0   200b220   Enabled  00000000:00000000 003c7698     1 MTA
XXXX    5    0 055b4a28      1600   Enabled  00000000:00000000 003c7698     0 Ukn
As soon as I get rid of COMPLUS_MDA=gcUnmanagedToManaged, it works as expected. My system is x86 with .NET 3.5 SP1.
Any hints?
UPDATE:
I also tried to debug it:
0:002> k 0xffff
ChildEBP RetAddr
00eef8cc 757015e9 ntdll!NtWaitForMultipleObjects+0x15
00eef968 75091a2c KERNELBASE!WaitForMultipleObjectsEx+0x100
00eef9b0 75094220 KERNEL32!WaitForMultipleObjectsExImplementation+0xe0
00eef9cc 724a7821 KERNEL32!WaitForMultipleObjects+0x18
00eefa2c 724a777e mscorwks!DebuggerRCThread::MainLoop+0xe9
00eefa5c 724a76a5 mscorwks!DebuggerRCThread::ThreadProc+0xe5
00eefa8c 750933aa mscorwks!DebuggerRCThread::ThreadProcStatic+0x9c
00eefa98 777a9ef2 KERNEL32!BaseThreadInitThunk+0xe
00eefad8 777a9ec5 ntdll!__RtlUserThreadStart+0x70
00eefaf0 00000000 ntdll!_RtlUserThreadStart+0x1b
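For completeness: COMPLUS_MDA set as a system environment variable turns the assistant on for every managed process on the machine. If I read the MDA documentation correctly, it can instead be scoped to a single application by placing an ApplicationName.exe.mda.config file next to the executable and setting COMPLUS_MDA=1 only for that process, so the rest of the system keeps running normally. A minimal sketch (the file name MyApp.exe.mda.config is just an example):
<mdaConfig>
  <assistants>
    <gcUnmanagedToManaged />
  </assistants>
</mdaConfig>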

Related

What is the difference between SWAP and virtual memory in Linux

Why am I not seeing any swap usage in the scenario below, even though virtual memory shows 20 GB?
Tasks: 186 total, 1 running, 185 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 0.2%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32880880k total, 17555744k used, 15325136k free, 300268k buffers
Swap: 16383996k total, 0k used, 16383996k free, 2287700k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10314 abcuser 20 0 20.3g 6.8g 25m S 0.7 21.8 23:47.38 java
25119 abcuser 20 0 20.3g 6.7g 25m S 0.3 21.4 23:38.25 java
                     total       used       free     shared    buffers     cached
Mem:                    31         16         14          0          0          2
-/+ buffers/cache:                 14         17
Swap:                   15          0         15
Total:                  46         16         30
Regards
Sanjeev
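A note on reading those numbers: VIRT in top is reserved address space, not memory in use. The JVM typically reserves its maximum heap (plus code caches, thread stacks and mapped files) up front, so a java process can show 20.3g VIRT while only ~6.8g is resident, and swap is only touched when the kernel has to evict resident anonymous pages, which it has no reason to do with ~15 GB of RAM still free. A small standalone illustration (a hypothetical demo program, nothing to do with the JVM) of VIRT growing without any RAM or swap cost:
/* Hypothetical demo: reserve a large anonymous mapping and watch VIRT grow
 * in top while RES and swap stay flat, because the pages are never touched. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t sz = (size_t)16 << 30;   /* reserve 16 GiB of address space */
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("pid %d: compare VIRT and RES in top now\n", (int)getpid());
    pause();                        /* keep the process alive for inspection */
    return 0;
}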

MT7621 SoC Crypto Engine - IRQ not mapped

I am using the latest OpenWrt trunk firmware (kernel 4.3) and have successfully compiled the driver for the CryptoEngine, an internal IPsec accelerator of the MT7621 SoC (which, as far as I understand, sits on an internal AMBA/APB bus).
The driver seems to work: the CryptoEngine gets configured successfully and sends and receives packets, so it does "react" to its memory-mapped registers.
However, the driver is unable to hook up the IRQ (so it works only in polling mode), because request_irq always fails with error code 89 (EDESTADDRREQ, "Destination address required") on IRQ 19, and the driver log shows no error before that point. I can't understand why; I am not a Linux device driver expert, just someone with a basic understanding who managed to recompile the driver.
Any idea why this happens? Could it be a problem with the board dts file? The relevant output follows; the dmesg log is too long to include, and the DTS files used are the SoC dtsi (dev.openwrt.org/browser/trunk/target/linux/ramips/dts/mt7621.dtsi) and the board dts.
CryptoEngine memory map 0x1E004000-0x1E004FFF
root@OpenWrt:/# lspci
00:00.0 PCI bridge: Device 0e8d:0801 (rev 01)
00:01.0 PCI bridge: Device 0e8d:0801 (rev 01)
00:02.0 PCI bridge: Device 0e8d:0801 (rev 01)
01:00.0 Network controller: MEDIATEK Corp. MT7662E 802.11ac PCI Express Wireless Network Adapter
02:00.0 Network controller: MEDIATEK Corp. MT7662E 802.11ac PCI Express Wireless Network Adapter
03:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)
root@OpenWrt:/# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
8: 63126 63253 63143 63987 MIPS GIC Local 1 timer
10: 1513 0 0 0 MIPS GIC 10 1e100000.ethernet
11: 2 0 0 0 MIPS GIC 11 mt76pci
29: 0 0 0 0 MIPS GIC 29 xhci-hcd:usb1
31: 2 0 0 0 MIPS GIC 31 mt76pci
33: 7976 0 0 0 MIPS GIC 33 serial
63: 1178 0 0 0 MIPS GIC 63 IPI call
64: 0 1305 0 0 MIPS GIC 64 IPI call
65: 0 0 1257 0 MIPS GIC 65 IPI call
66: 0 0 0 1087 MIPS GIC 66 IPI call
67: 144709 0 0 0 MIPS GIC 67 IPI resched
68: 0 25822 0 0 MIPS GIC 68 IPI resched
69: 0 0 97134 0 MIPS GIC 69 IPI resched
70: 0 0 0 43037 MIPS GIC 70 IPI resched
root@OpenWrt:/# cat /proc/ioports
ffffffff-ffffffff : /pcie@1e140000
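I'm not sure this is the cause, but two things stand out. First, the kernel on MIPS uses MIPS errno numbering, which differs from x86; if I remember that table correctly, 89 is ENOSYS there rather than EDESTADDRREQ. Second, on a device-tree based target the number handed to request_irq has to be the Linux virtual IRQ obtained from the device tree (with the GIC as interrupt parent), not the raw SoC interrupt line 19. A minimal sketch of how a platform driver normally obtains it; the names mtk_crypto_probe and mtk_crypto_irq are made up purely for illustration, and the driver registration boilerplate is omitted:
/* Sketch only: take the IRQ from the device tree instead of hard-coding it. */
#include <linux/interrupt.h>
#include <linux/platform_device.h>

static irqreturn_t mtk_crypto_irq(int irq, void *dev_id)
{
    /* a real handler would acknowledge the engine here */
    return IRQ_HANDLED;
}

static int mtk_crypto_probe(struct platform_device *pdev)
{
    int irq, ret;

    /* translates the node's "interrupts" property into a virtual IRQ number */
    irq = platform_get_irq(pdev, 0);
    if (irq < 0)
        return irq;

    ret = devm_request_irq(&pdev->dev, irq, mtk_crypto_irq, 0,
                           dev_name(&pdev->dev), pdev);
    if (ret)
        dev_err(&pdev->dev, "request_irq(%d) failed: %d\n", irq, ret);
    return ret;
}
This also assumes the crypto node in the dts carries an interrupts property pointing at the GIC; if it doesn't, that would be the first thing to check.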

20 seconds difference between NTP-synced servers

I've got several CentOS 6 servers, synced to pool.ntp.org time servers.
But sometimes the time on them goes out of sync by 20-30 seconds, which causes errors in my app.
What could be causing this, and where should I look?
Config
tinker panic 1000 allan 1500 dispersion 15 step 0.128 stepout 900
statsdir /var/log/ntpstats/
leapfile /etc/ntp.leapseconds
driftfile /var/lib/ntp/ntp.drift
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
disable monitor
server 0.pool.ntp.org iburst minpoll 6 maxpoll 10
restrict 0.pool.ntp.org nomodify notrap noquery
server 1.pool.ntp.org iburst minpoll 6 maxpoll 10
restrict 1.pool.ntp.org nomodify notrap noquery
server 2.pool.ntp.org iburst minpoll 6 maxpoll 10
restrict 2.pool.ntp.org nomodify notrap noquery
server 3.pool.ntp.org iburst minpoll 6 maxpoll 10
restrict 3.pool.ntp.org nomodify notrap noquery
restrict default kod notrap nomodify nopeer noquery
restrict 127.0.0.1 nomodify
restrict -6 default kod notrap nomodify nopeer noquery
restrict -6 ::1 nomodify
server 127.127.1.0 # local clock
fudge 127.127.1.0 stratum 10
srv1
remote refid st t when poll reach delay offset jitter
==============================================================================
server01.coloce .STEP. 16 u 11d 1024 0 0.000 0.000 0.000
+mt4.raqxs.net 193.190.230.66 2 u 510 1024 377 6.367 5.984 7.433
+16-164-ftth.ons 193.79.237.14 2 u 217 1024 375 11.339 -0.028 4.564
*services.freshd 213.136.0.252 2 u 419 1024 377 6.735 2.048 4.321
LOCAL(0) .LOCL. 10 l - 64 0 0.000 0.000 0.000
srv2
remote refid st t when poll reach delay offset jitter
==============================================================================
+ntp2.edutel.nl 80.94.65.10 2 u 527 1024 377 11.924 1.469 0.753
-95.211.224.12 193.67.79.202 2 u 364 1024 377 12.989 4.930 0.628
+app.kingsquare. 193.79.237.14 2 u 339 1024 377 5.485 0.493 0.591
*ntp.bserved.nl 193.67.79.202 2 u 206 1024 377 7.007 0.539 0.420
LOCAL(0) .LOCL. 10 l - 64 0 0.000 0.000 0.000
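One thing worth noting about the tinker line in the config: offsets above the 0.128 s step threshold are only stepped after the stepout interval (900 s) of consistently large measurements, and if ntpd ends up slewing instead (for example when run with -x), the correction rate is capped at roughly 500 ppm. A rough calculation, assuming that standard 500 ppm cap:
20 s / 500 ppm = 20 / 0.0005 = 40 000 s, i.e. roughly 11 hours
So once a server is 20-30 s off, it can stay visibly off for hours even though ntpd is nominally working.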

Windbg: Figuring out how many times a breakpoint was hit before a crash

I set the following breakpoint:
bp MSPTLS!LsCreateLine 100
The program crashes before the breakpoint is hit 100 times. When I do bl after the crash, I get the following:
0 e 5dca4b62 0072 (0100) 0:**** MSPTLS!LsCreateLine
I am assuming from this information that the breakpoint was hit 72 times before the crash.
However, when I do bp MSPTLS!LsCreateLine 80, I am able to hit the breakpoint before the crash, which tells me that the breakpoint was hit more than 72 times before the crash. Does this 72 not indicate how many times the breakpoint was hit?
The default number format of WinDbg is hexadecimal. If you want decimal numbers, prefix them with 0n:
0:005> bp ntdll!DbgBreakPoint 0n100
0:005> bl
0 e 7735000c 0064 (0064) 0:**** ntdll!DbgBreakPoint
The counter 0064 shown before the (0064) counts down as the breakpoint is passed. You can easily observe that in any GUI application:
0:000> bl
0 e 74fd78d7 000a (000a) 0:**** USER32!NtUserGetMessage+0x15
1 e 74fd78c2 0064 (0064) 0:**** USER32!NtUserGetMessage
0:000> g
Breakpoint 0 hit
eax=00000001 ebx=00000001 ecx=00000000 edx=00000000 esi=001faf8c edi=74fd787b
eip=74fd78d7 esp=001faf44 ebp=001faf60 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
USER32!NtUserGetMessage+0x15:
74fd78d7 83c404 add esp,4
0:000> bl
0 e 74fd78d7 0001 (000a) 0:**** USER32!NtUserGetMessage+0x15
1 e 74fd78c2 005a (0064) 0:**** USER32!NtUserGetMessage
0:000> ? 5a
Evaluate expression: 90 = 0000005a
In the example, breakpoint 0 has been hit 10 times, leaving breakpoint 1 at 90.
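Applied to the numbers in the question: the 100 and 80 in the bp commands are hex (0x100 = 256 and 0x80 = 128 passes), and the 0072 in the bl output is the remaining count, so the breakpoint had already been passed 0x100 - 0x72 times before the crash, which the expression evaluator puts at:
0:000> ? 100-72
Evaluate expression: 142 = 0000008e
142 is more than 0x80 = 128, which is why the breakpoint set with a pass count of 80 still fires before the crash.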

Hung processes resume if attached to strace

I have a network program written in C using TCP sockets. Sometimes the client program hangs forever waiting for input from the server. Specifically, the client hangs in a select() call set on an fd intended to read characters sent by the server.
I am using strace to find out where the process got stuck. However, sometimes when I attach strace to the hung client process, it immediately resumes its execution and exits properly. Not all hung processes exhibit this behavior; some stay stuck in select() even if I attach strace to them. But most of the processes resume their execution when strace is attached.
I am curious what causes the processes to resume when attached to strace. It might give me clues about why the client processes are hanging.
Any ideas? What causes a hung process to resume its execution when attached to strace?
Update:
Here's the output of strace on hung processes.
> sudo strace -p 25645
Process 25645 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[ Process PID=25645 runs in 32 bit mode. ]
select(6, [3 5], NULL, NULL, NULL) = 2 (in [3 5])
read(5, "\0", 8192) = 1
write(2, "", 0) = 0
read(3, "====Setup set_oldtempbehaio"..., 8192) = 555
write(1, "====Setup set_oldtempbehaio"..., 555) = 555
select(6, [3 5], NULL, NULL, NULL) = 2 (in [3 5])
read(5, "", 8192) = 0
read(3, "", 8192) = 0
close(5) = 0
kill(25652, SIGKILL) = 0
exit_group(0) = ?
Process 25645 detached
_
> sudo strace -p 14462
Process 14462 attached - interrupt to quit
[ Process PID=14462 runs in 32 bit mode. ]
read(0, 0xff85fdbc, 8192) = -1 EIO (Input/output error)
shutdown(3, 1 /* send */) = 0
exit_group(0) = ?
_
> sudo strace -p 7517
Process 7517 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[ Process PID=7517 runs in 32 bit mode. ]
connect(3, {sa_family=AF_INET, sin_port=htons(300), sin_addr=inet_addr("100.64.220.98")}, 16) = -1 ETIMEDOUT (Connection timed out)
close(3) = 0
dup(2) = 3
fcntl64(3, F_GETFL) = 0x1 (flags O_WRONLY)
close(3) = 0
write(2, "dsd13: Connection timed out\n", 30) = 30
write(2, "Error code : 110\n", 17) = 17
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(1) = ?
Process 7517 detached
It is not just select(): the processes (all of the same program) are stuck in various system calls before I attach strace to them, and they suddenly resume after attaching. If I don't attach strace, they just hang there forever.
Update 2:
I learned that strace can start a process that was previously stopped (a process in T state). Now I am trying to understand why these processes went into the 'T' state and what the cause is. Here's the /proc/<pid>/status information:
> cat /proc/12554/status
Name: someone
State: T (stopped)
SleepAVG: 88%
Tgid: 12554
Pid: 12554
PPid: 9754
TracerPid: 0
Uid: 5000 5000 5000 5000
Gid: 48986 48986 48986 48986
FDSize: 256
Groups: 9149 48986
VmPeak: 1992 kB
VmSize: 1964 kB
VmLck: 0 kB
VmHWM: 608 kB
VmRSS: 608 kB
VmData: 156 kB
VmStk: 20 kB
VmExe: 16 kB
VmLib: 1744 kB
VmPTE: 20 kB
Threads: 1
SigQ: 54/73728
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000006
SigCgt: 0000000000004000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed: 00000000,00000000,00000000,0000000f
Mems_allowed: 00000000,00000001
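For what it's worth, the interesting fields in that dump decode as follows (the Sig* values are hex bitmasks in which bit N-1 stands for signal N):
State: T (stopped)   -> stopped by a stop signal (SIGSTOP/SIGTSTP/SIGTTIN/SIGTTOU), not a tracing stop
TracerPid: 0         -> nothing was ptrace-attached when the snapshot was taken
SigPnd/ShdPnd: 0     -> no signals pending; the stop signal had already been delivered
SigIgn: 0x06         -> SIGINT (2) and SIGQUIT (3) are ignored
SigCgt: 0x4000       -> SIGTERM (15) has a handler installed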
strace uses ptrace. The ptrace man page has this:
Since attaching sends SIGSTOP and the tracer usually suppresses it,
this may cause a stray EINTR return from the currently executing system
call in the tracee, as described in the "Signal injection and
suppression" section.
Are you seeing select return EINTR?
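If that is what's happening, the usual client-side pattern is to restart the interrupted call rather than treat EINTR as data or as a fatal error. A minimal sketch (a hypothetical helper, not the original program; server_fd and ctl_fd stand in for the two descriptors in the select set):
/* Wait until one of the two descriptors is readable, restarting on EINTR. */
#include <errno.h>
#include <sys/select.h>

static int wait_readable(int server_fd, int ctl_fd)
{
    int nfds = (server_fd > ctl_fd ? server_fd : ctl_fd) + 1;

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(server_fd, &rfds);
        FD_SET(ctl_fd, &rfds);

        int ready = select(nfds, &rfds, NULL, NULL, NULL);
        if (ready >= 0)
            return ready;        /* at least one descriptor is readable */
        if (errno != EINTR)
            return -1;           /* real error */
        /* interrupted by a signal (e.g. a tracer attaching): retry */
    }
}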