Is there an equivalent register to Intel's MSR_SMI_COUNT on AMD architecture? - x86-64

On recent Intel CPUs it's possible to count the number of SMIs that have occurred by reading MSR 0x34 (MSR_SMI_COUNT).
I have checked the manuals at
https://developer.amd.com/resources/developer-guides-manuals/
for an equivalent register/function, without success.

AMD Zen specifies the LsSmiRx performance counter for System Management Interrupts (SMIs):
PMCx02B [SMIs Received] (Core::X86::Pmc::Core::LsSmiRx)
Counts the number of SMIs received.
(Open-Source Register Reference for AMD Family 17h Processors, Models 00h-2Fh, Rev 3.03, 2018, page 153)
On Linux, you can monitor it like this:
# perf stat -e ls_smi_rx -I 60000
This command prints, once a minute, the number of newly triggered SMIs, aggregated over all CPUs.
That means that for monitoring - unlike with the MSR_SMI_COUNT register available on Intel CPUs - you have to actively program a PMU register to observe the LsSmiRx event.
NB: The above-referenced AMD documentation confirms that AMD Zen doesn't support the SMI_COUNT MSR (0x34), since it isn't included in the list of available MSRs (Chapter 2.1.10, page 77).
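For comparison, on an Intel CPU the cumulative count can be read directly from MSR_SMI_COUNT without touching the PMU. A minimal C sketch, assuming Linux with the msr kernel module loaded (modprobe msr) and root privileges; the msr driver interprets the file offset as the MSR address:

#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* MSR_SMI_COUNT (0x34) counts SMIs since reset on recent Intel CPUs. */
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t smi_count;
    /* The msr driver uses the file offset as the MSR address. */
    if (pread(fd, &smi_count, sizeof smi_count, 0x34) != sizeof smi_count) {
        perror("pread");
        return 1;
    }
    printf("SMI count: %" PRIu64 "\n", smi_count);
    close(fd);
    return 0;
}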

No, but SMI count is available as a PMC (performance counter) on AMD processors.

Related

Memory leak (?) using IO::Socket::Async (on FreeBSD 13.1)

In processing a stream of logs (via UDP) in a raku (v2022.07) app, I'm
hitting what appears to be a memory leak using IO::Socket::Async.
I pulled the code out into a simpler program which I've included below
(~ identical to code at https://docs.raku.org/type/IO::Socket::Async):
#!/usr/bin/env raku
#
my $socket = IO::Socket::Async.bind-udp('localhost', 24225);
react {
    whenever $socket.Supply -> $v {
        print $v if $v.chars > 0;
    };
};
It leaks substantial RAM - I let it run about 12 hours, and
when I checked -- still running (on a 1T RAM machine) -- with
ps auwwx [pid]
it showed 314974456 and 20739784 for VSZ and RSS (so, roughly 300G virtual size and 20G resident).
[btw, the UDP traffic is fairly light - an average of 350 (~100 byte) packets/sec (spikes to ~1000/sec)]
So ... I rewrote the above in Perl 5 (after similar leaky results with
a couple of Raku variants), which stabilizes quickly at about 8M resident - that's fine/stable/etc. -
but I'd prefer this process to feed a Raku channel (without a separate Perl process, file
tailing, etc.).
My environment: FreeBSD 13.1-RELEASE-p2 GENERIC amd64 and raku:
v2022.07 built on MoarVM 2022.07 (installed with rakubrew).
I'm guessing this is unique to Raku on FreeBSD, but I'm not sure.
I did attempt to upgrade (rakubrew) to v2022.12 to see if the problem resolved there,
but in rebuilding modules (zef), too many failed (some issue with
Digest/Digest::HMAC), so I had to revert to 2022.07.
I'd sure be grateful for any suggestions for addressing the leak, or for alternative
ways to read from a UDP port.
Not exactly a solution to your problem, but you can monitor memory usage from within your Raku code using the built-in Telemetry module:
use Telemetry;
say T{"max-rss"};
Also remember that Supply decodes Unicode characters by default. If your protocol is binary, you may add :bin to the socket's Supply call ($socket.Supply(:bin)) to avoid treating binary data as text; the supply then emits binary blobs instead of strings.

How to count cache-misses in mmap-ed memory (using eBPF)?

I would like to get a time series
t0, misses
...
tN, misses
where tN is a timestamp (second resolution) and misses is the number of times the kernel performed disk I/O for my PID to load a missing page of an mmap()-ed memory region when the process accessed that memory. OK, maybe the connection between disk I/O and memory access is harder to track, so let's assume my program cannot do any disk I/O for any reason other than accessing missing mmap-ed memory. I THINK I need to track something called node-load-misses in the perf world.
Any ideas how eBPF can be used to collect such data? What probes should I use?
I tried to use perf record for a similar purpose, but I dislike how much data perf records. As I recall, the attempt was something like this (I also don't remember how I parsed that output.data file):
perf record -p $PID -a -F 10 -e node-loads -e node-load-misses -o output.data
I thought eBPF might provide a way to implement such a thing with less overhead.
Loading an mmap-ed page that is not present in memory is not a hardware event like perf's cache-misses, node-loads, or node-load-misses. When your program accesses a non-present memory address, the hardware generates a page-fault exception, which is handled in software by the Linux kernel. On first access to anonymous memory, a physical page is allocated and mapped at that virtual address; on access to an mmap-ed file, disk I/O is initiated. There are two kinds of page faults in Linux, minor and major; a fault that requires disk I/O is a major page fault.
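To see the two kinds in action, here is a minimal C sketch - the input file name is just a placeholder - that mmaps a file and touches one byte per page. On a cold page cache, running it under the perf stat command shown at the end of this answer should report major faults:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "test.dat";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch one byte per page; each first touch of a non-resident page
       triggers a page fault (major if the kernel must do disk I/O). */
    long page = sysconf(_SC_PAGESIZE);
    unsigned long sum = 0;
    for (off_t off = 0; off < st.st_size; off += page)
        sum += p[off];

    printf("checksum: %lu\n", sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}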
You should try trace-cmd, ftrace, or perf trace. Support for fault tracing was planned for the perf tool in 2012, and patches were proposed at https://lwn.net/Articles/602658/
There is a tracepoint for page faults from userspace code, and this command prints some events with the memory address of each page fault:
echo 2^123456%2 | perf trace -e 'exceptions:page_fault_user' bc
With a recent perf tool (https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/) there is perf trace record, which can record both mmap syscalls and page_fault_user events into perf.data; perf script will then print all events, and they can be counted by some awk or python script.
Some useful links on perf and tracing: http://www.brendangregg.com/perf.html http://www.brendangregg.com/ebpf.html https://github.com/iovisor/bpftrace/blob/master/INSTALL.md
And some bcc tools may be used to trace disk I/O, like https://github.com/iovisor/bcc/blob/master/examples/tracing/disksnoop.py or https://github.com/brendangregg/perf-tools/blob/master/examples/iosnoop_example.txt
And for a simple time-series stat you can use the perf stat -I 1000 command with the appropriate software events:
perf stat -e cpu-clock,page-faults,minor-faults,major-faults -I 1000 ./program
...
#           time             counts unit events
     1.000112251             413.59 msec cpu-clock       #    0.414 CPUs utilized
     1.000112251              5,361      page-faults     #    0.013 M/sec
     1.000112251              5,301      minor-faults    #    0.013 M/sec
     1.000112251                 60      major-faults    #    0.145 K/sec
     2.000490561              16.32 msec cpu-clock       #    0.016 CPUs utilized
     2.000490561                  1      page-faults     #    0.005 K/sec
     2.000490561                  1      minor-faults    #    0.005 K/sec
     2.000490561                  0      major-faults    #    0.000 K/sec

stm32 factory bootloader possibly overwritten with openocd?

tl;dr: flashed firmware to 0x00000000 instead of 0x08000000, am I lost?
Hello,
my device is based on an STM32F103CBTx, which came with a proprietary firmware and had readout protection on.
I connect to it with an ST-Link v2, SWDIO and SWCLK connected to PA13 and PA14, using this command:
sudo openocd -f /usr/share/openocd/scripts/interface/stlink-v2.cfg -f /usr/share/openocd/scripts/target/stm32f1x.cfg
I don't remember how I removed the flash protection, but it worked, as the original firmware didn't run anymore. Then I created a simple hello-world firmware, which pulls three GPIOs up and down in 700 ms intervals, and flashed it.
After flashing, I can't connect with openocd anymore. I forgot to specify the offset; the manual says the offset defaults to 0, and as it worked, I suppose that instead of the boot loader my shitty hello world is pulling up and down some random pins happily… Is this possible? All other threads I found say the boot loader is write protected.
This is the last contact I had:
> halt
halt
target halted due to debug-request, current mode: Handler HardFault
xPSR: 0x01000003 pc: 0xfffffffe msp: 0xffffffdc
> flash write_image erase fw.hex
flash write_image erase fw.hex
auto erase enabled
target halted due to breakpoint, current mode: Handler HardFault
xPSR: 0x61000003 pc: 0x2000003a msp: 0xffffffdc
wrote 4096 bytes from file fw.hex in 0.285697s (14.001 KiB/s)
> reset
reset
jtag status contains invalid mode value - communication failure
Polling target stm32f1x.cpu failed, trying to reexamine
Examination failed, GDB will be halted. Polling again in 100ms
Any directions highly appreciated.
Edit:
What I get now, also tried another st-link:
% sudo openocd -f /usr/share/openocd/scripts/interface/stlink-v2.cfg -f /usr/share/openocd/scripts/target/stm32f1x.cfg
Open On-Chip Debugger 0.10.0
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
Info : auto-selecting first available session transport "hla_swd". To override use 'transport select <transport>'.
Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD
adapter speed: 1000 kHz
adapter_nsrst_delay: 100
none separate
Info : Unable to match requested speed 1000 kHz, using 950 kHz
Info : Unable to match requested speed 1000 kHz, using 950 kHz
Info : clock speed 950 kHz
Info : STLINK v2 JTAG v17 API v2 SWIM v4 VID 0x0483 PID 0x3748
Info : using stlink api v2
Info : Target voltage: 3.244356
Error: init mode failed (unable to connect to the target)
in procedure 'init'
in procedure 'ocd_bouncer'
flashed firmware to 0x00000000 instead of 0x08000000, am I lost?
No, it doesn't matter at all, they are the same.
After reset, the MCU loads the word at address 0 in SP, and the next one at address 4 in PC. The BOOT0 and BOOT1 pins control which memory gets mapped to 0x00000000. Usually, BOOT0 is tied low, and flash memory at 0x08000000 gets mirrored at 0x00000000.
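Those two words come from the start of the vector table, which the linker places at the beginning of the image. A typical startup-file sketch in C (the symbol _estack and the section name .isr_vector are conventions from common linker scripts, not taken from the question):

/* First two entries of a Cortex-M3 vector table, placed at the very
   start of the flash image by the linker script. */
extern unsigned long _estack;          /* top of RAM, from the linker script */
void Reset_Handler(void);

__attribute__((section(".isr_vector"), used))
void (* const vector_table[])(void) = {
    (void (*)(void)) &_estack,  /* word at offset 0: loaded into SP at reset */
    Reset_Handler,              /* word at offset 4: loaded into PC at reset */
};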
instead of the boot loader my shitty hello world is pulling up and down some random pins happily… Is this possible? All other threads I found say the boot loader is write protected.
The factory bootloader is indeed write protected; openocd can't overwrite it.
However, your application could have reconfigured the SWD pins by writing a wrong value into GPIOA->CRH or AFIO->MAPR, thereby preventing openocd from working. This is the most common cause of this problem.
Fortunately, there is a way to recover.
Connect under Reset
If the reset pin of the controller is held low for a while when openocd is started, the application is prevented from starting and from messing up the GPIO configuration.
OpenOCD can do this automatically when
- it's told to do so, i.e. the line reset_config srst_only srst_nogate is present somewhere in the configuration script, and
- the MCU reset pin is connected to the debugger hardware (pin 15 on an official ST-Link/V2).
Or you can do it manually, by whatever means your board provides. If you are lucky, it has a reset button; if not, you must find a way to somehow ground the MCU reset pin.
1. Pull the reset pin low.
2. Start openocd.
3. Wait until the Info : Target voltage line appears. Maybe a second longer.
4. Release the reset pin.
It requires a bit of trial and error, you'll get better with practice.
Then you can flash your improved application, which carefully avoids reconfiguring the SWD pins.
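As a sketch of what "carefully avoids reconfiguring the SWD pins" can look like in C, using ST's CMSIS device header (the header name and the USART1 remap are illustrative assumptions, not taken from the question):

#include "stm32f1xx.h"  /* ST CMSIS device header, assumed build setup */

void remap_periph_keep_swd(void)
{
    /* AFIO registers need the AFIO clock enabled first. */
    RCC->APB2ENR |= RCC_APB2ENR_AFIOEN;

    /* The SWJ_CFG field of AFIO->MAPR is write-only (it reads back
       undefined), so a plain read-modify-write can silently change it.
       Clear the field and set it explicitly: JTAG disabled, SWD kept
       alive on PA13/PA14. Never write AFIO_MAPR_SWJ_CFG_DISABLE unless
       you can live without the debugger. */
    uint32_t mapr = AFIO->MAPR;
    mapr &= ~AFIO_MAPR_SWJ_CFG;            /* discard the undefined read */
    mapr |= AFIO_MAPR_SWJ_CFG_JTAGDISABLE; /* JTAG off, SWD stays on     */
    mapr |= AFIO_MAPR_USART1_REMAP;        /* example remap, illustrative */
    AFIO->MAPR = mapr;

    /* Likewise, when writing GPIOA->CRH, mask only the pins you own and
       leave the PA13/PA14 (SWDIO/SWCLK) configuration bits untouched. */
}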

ERROR: No OpenCL platforms found, check OpenCL installation

I tried to run a Matlab program on a GPU (CentOS 7.3).
This Matlab program uses Caffe.
When I run it from the command line with:
matlab -nodisplay -r "demo, quit"
it runs okay.
When I run it with the LSF command:
bsub -q gpu -R "select[ngpus>0] rusage[ngpus_shared=1]" matlab -nodisplay -r "demo, quit"
I get the error:
ERROR: No OpenCL platforms found, check OpenCL installation
I compared the LD_LIBRARY_PATH on both - they are the same.
What can be the problem?
Any ideas are welcome!
clinfo output:
Number of platforms 1
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 8.0.0
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
Platform Extensions function suffix NV
Platform Name NVIDIA CUDA
Number of devices 1
Device Name Tesla K40m
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 375.26
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Available Yes
Device Profile FULL_PROFILE
Device Topology (NV) PCI-E, 09:00.0
Max compute units 15
Max clock frequency 745MHz
Compute Capability (NV) 3.5
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Compiler Available Yes
Linker Available Yes
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Address bits 64, Little-Endian
Global memory size 11995578368 (11.17GiB)
Error Correction support Yes
Max memory allocation 2998894592 (2.793GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 245760 (240KiB)
Global Memory cache line 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 4096x4096x4096 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max constant buffer size 65536 (64KiB)
Max number of constant args 9
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) No
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [NV]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
My best guess would be that the bsub command from LSF schedules the job on another machine (compute node) in a cluster, where OpenCL is not installed.
Having OpenCL/CUDA on the frontend but not on the compute nodes of a cluster is something I've witnessed quite a few times. Even if the parts of the filesystem with the libraries are shared, the folder /etc/OpenCL/vendors, which is used by OpenCL's ICD mechanism, must be present.
You could try running clinfo via bsub (if you didn't already), or use bsub to execute ls /etc/OpenCL/vendors.
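If clinfo isn't available on the compute nodes, a tiny self-contained C probe does the same job; a minimal sketch (the file name is arbitrary; build with cc probe.c -lOpenCL -o probe, then submit it via bsub):

#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    /* Ask how many OpenCL platforms the ICD loader can see. */
    cl_uint n = 0;
    cl_int err = clGetPlatformIDs(0, NULL, &n);
    if (err != CL_SUCCESS || n == 0) {
        printf("no OpenCL platforms found (err=%d)\n", err);
        return 1;
    }

    cl_platform_id ids[16];
    clGetPlatformIDs(n > 16 ? 16 : n, ids, NULL);
    for (cl_uint i = 0; i < n && i < 16; i++) {
        char name[256];
        clGetPlatformInfo(ids[i], CL_PLATFORM_NAME, sizeof name, name, NULL);
        printf("platform %u: %s\n", i, name);
    }
    return 0;
}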
If you're not sure whether or not the LSF-submitted jobs run on the same machine or not, use the hostname command with and without bsub.
Hope that helps.

mscorjit overlaps mscoree when using windbg

Loading Dump File [C:\Crash_Mode__Date_12-05-2009__Time_15-54-2727\PID-4056__CCNET.EXE__1st_chance_Process_Shut_Down__full_13d0_2009-12-06_00-33-14-734_0fd8.dmp]
User Mini Dump File with Full Memory: Only application data is available
Comment: '1st_chance_Process_Shut_Down_exception_in_CCNET.EXE_running_on_TEST218'
Symbol search path is: srv*E:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Server 2003 Version 3790 (Service Pack 2) MP (2 procs) Free x64
Product: Server, suite: Enterprise TerminalServer SingleUserTS
Machine Name:
Debug session time: Sun Dec 6 00:33:14.000 2009 (GMT+8)
System Uptime: 32 days 12:43:52.414
Process Uptime: 0 days 8:44:37.000
..........................WARNING: mscorjit overlaps mscoree
..............................WARNING: wldap32 overlaps dnsapi
..........WARNING: rasapi32 overlaps dnsapi
...WARNING: tapi32 overlaps rasapi32
.WARNING: rtutils overlaps rasman
..............WARNING: setupapi overlaps winsta
....wow64cpu!CpupSyscallStub+0x9:
00000000`78b842d9 c3 ret
Why does this happen?
This is not related to the CLR; it's 32-bit code in a 64-bit dump (WOW64), as described at http://www.dumpanalysis.org/blog/index.php/2007/09/11/crash-dump-analysis-patterns-part-26/ -- in short, use the following:
.load wow64exts
.effmach x86
With those, kb and !analyze -v will give better results.
I have seen the same thing lately. I don't know for sure, but it's probably some WOW64 artifact, or possibly due to some more aggressive anti-exploitation techniques. At least on Win32, even though the load address of a DLL may be different on occasion, if a DLL (like ntdll/kernel32) is already mapped into another process when your process starts, and your process statically links those DLLs as well, it would always load at the same address until the next reboot.
It seems probable that more recent CLR executables are able to remap various modules per execution. I know this is the case on MSVC10 and Windows 7, but perhaps it's been ported to additional platforms for CLR applications.