STM32F4 Inline Assembly

I am working with an STM32F4 microcontroller, and I am unable to use inline assembly that I am trying to port from another ARM processor. I have no idea where to begin figuring out the problem.

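There is an easy way: you can use the asm keyword.
For example, asm("NOP"); inserts a single no-operation instruction and execution carries on. Do not rely on it for exact timing, though: on Cortex-M cores a NOP is not guaranteed to consume a clock cycle. You can build on this for longer sequences.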
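A minimal sketch of that idea, assuming a GCC- or Clang-compatible toolchain (for example arm-none-eabi-gcc). The helper name `spin_nops` is made up for illustration; the `volatile` qualifier is what keeps the compiler from dropping the instruction.

```c
#include <stdint.h>

/* Hypothetical helper: issues `count` NOP instructions and returns how many
 * were issued. `volatile` stops the compiler from deleting the asm statement.
 * The NOP mnemonic exists on ARM/Thumb and on x86, so this also builds on a
 * host machine for a quick check. */
uint32_t spin_nops(uint32_t count)
{
    uint32_t issued = 0;
    while (issued < count) {
        __asm__ volatile ("nop");
        issued++;
    }
    return issued;
}
```

Calling `spin_nops(5)` issues five NOPs. How long that takes depends on the core and the optimization level, so it should not be treated as a calibrated delay.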

Well, I would normally say that you should post your code, but in this particular case, I would advise you to always do a little homework on processor architecture when working with microcontrollers.
The STM32F4 (Cortex-M4 processor architecture) does not support the classic ARM instruction set that the ARM7 and many other ARM processors use. Cortex-M4 processors execute only Thumb code, using Thumb-2 technology: a mix of 16-bit and 32-bit Thumb instructions that covers most of what the ARM instruction set offers, so there are no arm->thumb or thumb->arm switches (or instructions for them). Inline assembly written for ARM state therefore has to be rewritten for Thumb-2.
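One way to see which instruction set a build is targeting is to look at the compiler's predefined macros. The sketch below assumes GCC or Clang, which define `__thumb__`, `__thumb2__` and `__arm__`; the function name is hypothetical.

```c
/* Reports which instruction set the compiler is targeting, based on macros
 * predefined by GCC and Clang. A Cortex-M4 build (e.g. -mcpu=cortex-m4
 * -mthumb) is expected to take the first branch; a desktop host build takes
 * the last one. */
const char *target_instruction_set(void)
{
#if defined(__thumb2__)
    return "Thumb-2";
#elif defined(__thumb__)
    return "Thumb (16-bit only)";
#elif defined(__arm__)
    return "ARM (A32)";
#else
    return "not a 32-bit ARM target";
#endif
}
```

If ported code only assembles when the ARM (A32) branch is active, that is a strong hint that its inline assembly uses instructions or addressing forms the Cortex-M4 does not have.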

Related

How to decide the registers to be preserved for OS task switching?

When task switch happens in an OS, how to decide which registers should be preserved?
Is this purely decided by hardware architecture? Or also involve the OS implementation?
I once did a naïve implementation on the ARM architecture that preserved all of the R1 ~ R15 registers (if I remember correctly). But that seems like too much.
I also tried the x86 hardware task switching support; the TSS covers a lot of registers, so it doesn't perform well either.
I guess the design philosophy of an OS, especially the implementation of a task's state, should decide this. But I am not sure if there are any best practices or conventions, or other factors.

When task switch happens in an OS, how to decide which registers should be preserved?
Normally most of a scheduler would be written in a higher level language (e.g. C), and the low level task switch code will be written as a small assembly language function (and NOT inline assembly) because there's no sane way to predict what a compiler might do with the stack and local variables.
Because of this, which registers the low level assembly function needs to save/restore depends on the ABI ("calling convention") the compiler felt like using. For example, the System V AMD64 ABI says the callee must preserve RBX, RSP, RBP, and R12 to R15 (and can trash RAX, RCX, RDX, RSI, RDI, and R8 to R11 where they aren't used for return values).
This does depend on the nature of the OS though. E.g. it's possible to design an OS where the kernel runs like a separate task and anything that causes a switch from user-space to kernel-space acts like a task switch and has to save everything before any higher level kernel code is executed.
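To make the System V AMD64 example concrete, here is a sketch of the per-task save area such a hand-written switch routine might use. The struct and function names are hypothetical, and the actual save/restore would still live in a separate assembly function as described above.

```c
#include <stdint.h>

/* Hypothetical save area for a task under the System V AMD64 ABI: only the
 * callee-preserved registers. Caller-saved registers are already dead, or
 * were saved by the compiler, at the call site of the switch function. */
struct sysv_amd64_saved_context {
    uint64_t rbx;
    uint64_t rbp;
    uint64_t r12;
    uint64_t r13;
    uint64_t r14;
    uint64_t r15;
    uint64_t rsp;  /* stack pointer of the suspended task */
};

/* Number of 64-bit registers saved per switch in this sketch. */
unsigned saved_register_count(void)
{
    return (unsigned)(sizeof(struct sysv_amd64_saved_context) / sizeof(uint64_t));
}
```

Seven registers instead of sixteen is the saving the ABI buys you. A kernel that switches on interrupt entry rather than through a function call cannot rely on it and has to save the full set.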

There is a lot of theoretical wiggle room for what registers an OS chooses to preserve. For a "safe" implementation, an OS would save all registers that are accessible to a user and/or kernel thread. We typically think of the R0, R1, Rx, ... (ARM, MIPS, etc.) or RAX, RBX, ... (x86) registers as needing to be preserved. However, the registers used by hardware floating point and vector instructions (e.g. x86 AVX) may also need to be preserved.
This is often where the implementation of the OS has wiggle room. One could simply play it safe and preserve all floating point and vector registers. However, if these registers are not being used by a thread, saving them slows down context switching for no benefit. On top of that, processors in the same family may share the same core instructions and registers but have optional floating point or vector extensions. Thus some operating systems support a per-thread flag that records whether the thread uses floating point or vector instructions, so the OS knows which additional registers to preserve.
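The per-thread flag can be sketched as follows. All names and the two size constants are assumptions chosen for illustration; real sizes depend on the architecture and its extensions.

```c
#include <stdbool.h>
#include <stddef.h>

#define CORE_CONTEXT_BYTES 64u   /* assumed size of the integer register set */
#define FPU_CONTEXT_BYTES  512u  /* assumed size of the FP/vector register set */

/* Hypothetical task descriptor: only the part relevant to this sketch. */
struct task {
    bool uses_fpu;  /* set the first time the task touches FP/vector state */
};

/* Would be called from wherever the OS detects the first FP/vector use. */
void task_mark_fpu_used(struct task *t)
{
    t->uses_fpu = true;
}

/* Bytes of register state the context switch has to save for this task. */
size_t context_bytes_to_save(const struct task *t)
{
    return CORE_CONTEXT_BYTES + (t->uses_fpu ? FPU_CONTEXT_BYTES : 0u);
}
```

With these assumed sizes, a task that never touches the FPU costs 64 bytes per switch, and one that does costs 576.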

How do I add a missing peripheral register to a STM32 MCU model in Renode?

I am trying out this MCU / SoC emulator, Renode.
I loaded their existing model template under platforms/cpus/stm32l072.repl, which just includes the repl file for stm32l071 and adds one little thing.
When I then load & run a program binary built with STM32CubeIDE and ST's LL library, and the code hits the initial function of SystemClock_Config(), where the Flash:ACR register is being probed in a loop, to observe an expected change in value, it gets stuck there, as the Renode Monitor window is outputting:
[WARNING] sysbus: Read from an unimplemented register Flash:ACR (0x40022000), returning a value from SVD: 0x0
This seems to be expected; not all existing templates model everything out of the box. I also found that the stm32l071 model is missing some of the USARTs and NVIC channels. I can see how the latter might be added, but none of the default models seems to define that Flash:ACR register, so I have no example to work from.
How would one add such a missing register for this particular MCU model?
Note1: For this test, I'm using a STM32 firmware binary which works as intended on actual hardware, e.g. a devboard for this MCU.
Note2:
The stated advantage of Renode over QEMU, which apparently does not emulate peripherals, is that it also allows putting together a more complex system out of mocked external devices, e.g. I2C (apparently C# modules, I have not yet looked into it).
They say "use the same binary as on the real system".
That is my reason for trying this out: it sounds like a lot of potential for implementing systems where the hardware is not yet fully available, and also for automated testing.
So the obvious thing, commenting out a lot of parts in init code, to only test some hardware-independent code while sidestepping such issues, would defeat the purpose here.

If you want to just provide the ACR register for the flash to pass your init, use a tag.
You can either provide it via REPL (recommended, like here https://github.com/renode/renode/blob/master/platforms/cpus/stm32l071.repl#L175) or via RESC.
Assuming that your software would like to read the value 0xDEADBEEF, in the repl you'd use:
sysbus:
    init:
        Tag <0x40022000, 0x40022003> "ACR" 0xDEADBEEF
In the resc or in the Monitor it would be just:
sysbus Tag <0x40022000, 0x40022003> "ACR" 0xDEADBEEF
If you want more complex logic, you can use a Python peripheral, as described in the docs (https://renode.readthedocs.io/en/latest/basic/using-python.html#python-peripherals-in-a-platform-description):
```
flash: Python.PythonPeripheral @ sysbus 0x40022000
    size: 0x1000
    initable: false
    filename: "script_with_complex_python_logic.py"
```
If you really need an advanced implementation, then you need to create a complete C# model.
As you correctly mentioned, we do not want you to modify your binary. But we're ok with mocking some parts we're not interested in for a particular use case if the software passes with these mocks.
Disclaimer: I'm one of the Renode developers.

Theoretical embedded linux requirements

I come from a programmer background using Java, C#, C++, Javascript
I got myself a Raspberry Pi (Model 1 A, the one without ethernet) and played around with it for a while. I used Raspbian and Arch Linux ARM (since it was said to be small and customizable). Unfortunately I didn't manage to configure them the way I want them.
I am trying to build a nice looking (embedded) system with the only goal to start (boot) the Raspberry Pi fast and autostart a test application which will be written in C# (Mono), C++ (Qt), Java (Java Runtime) or something in JavaScript/HTML.
Since I was not able to get rid of all the log messages (I got rid of most), the tty login screen, or the attempts to connect to the network (although the Model 1 A does not have ethernet at all), booting was ugly and took a long time (over 1 minute in some cases).
It seems I will have to build a minimal embedded linux, but I lack the theory of the elements of an embedded linux and how they fit together.
My question: What are the theoretically required parts of an embedded linux holding either mono, qt, java runtime on a raspberry pi?
So far I know the following parts:
the hardware (raspberry pi model 1 A) + sd card
the sd card holds 2 partitions, 1 boot partition (fat32), 1 data partition (ext4)
a boot loader
a linux kernel (which can be optimized to the needs of a raspi)
But what then? My research got lost at "use a distro", which I don't want. What are the missing pieces between the kernel and starting an application?

An Embedded Linux system is comprised of many different parts that work together towards the same goal of making things work efficiently.
Ideally, that is not much different from a regular GNU/Linux system, but let's see in detail the building blocks of a generic embedded system.
For the following explanation, I am assuming ARM as the architecture. What is written below may differ slightly from implementation to implementation, but is usually a common track for commercial embedded systems.
Blocks of a GNU/Linux Embedded System
Hardware
SoC
The SoC is where all the processing takes place: it is the main processing unit of the whole system and the only place that has "intelligence". It is in charge of using the other hardware and running your software.
It is made of various heterogeneous sub-blocks:
Core + Caches + MMU - the "real" processor, e.g. ARM Cortex-A9. It's the main thing you will notice when choosing a SoC.
May be assisted by e.g. a SIMD coprocessor like NEON.
Internal RAM - generally very small. Used in the first phase of the boot sequence.
Various "Peripherals" - connected via some interconnect fabric/bus to the Core. These can range from a simple ADC to a 3D Graphics Accelerator. Examples of such IP cores are: USB, PCI-E, SGX, etc.
A low power/real time coprocessor - some systems offer one or more coprocessors intended either to help the main Core with real time tasks (e.g. industrial communication buses) or to handle low power states. Their architecture may or may not be related to the Core's.
External RAM
It is used by the SoC to store temporary data after the system has bootstrapped and during the bootstrap itself. It's usually the memory your embedded system uses during regular operation.
Non-Volatile Memory - optional
May or may not be present. In your case it's the SD card you mentioned. In other cases could be a NAND, NOR or SPI Dataflash memory (or any combination of them).
When present, it is often the regular source of data the SoC will read from and usually stores all the SW components needed for the system to work.
May not be necessary or useful in some kinds of applications.
External Peripherals
Anything not strictly related to the above.
Could be a MAC ID EEPROM, some relays, a webcam or whatever you can possibly imagine.
Software
First of all, we introduce what is called the bootchain, which is what happens as soon as you power up your SoC and - somehow - tell it to start running. In the following list, the bootchain is the sequence of calls from point 1 to point 4.
Apart from specific/exotic implementations, it is more or less always the same:
1. Boot ROM code - a small (usually masked, i.e. programmed at the factory) memory contained in the SoC. The first thing the SoC will do when powered up is to execute the code in it.
This code will - generally according to external configuration pins - decide the so-called "boot strategy" or "boot order", which is where (and in what order) to look for additional code to be executed. The suitable media are varied: USB storage devices, USB hosts, SD cards, NANDs, NORs, SPI dataflashes, Ethernet, UARTs, etc.
If none of the above contains something valid, the Boot ROM will usually issue a soft reset of the SoC, and so on.
The code in the medium is not, of course, executed in place: it gets copied into the Internal RAM then executed.
[The following two are contained in what we will call bootloader medium]
2. 1st stage bootloader - it has just been copied by the Boot ROM into the SoC's Internal RAM. Must be tiny enough to fit that memory (usually well under 100kB). It is needed because the Boot ROM isn't big enough and does not know what kind of External RAM the SoC is attached to. Its main function is initializing the External RAM and the SoC's external memory interface, as well as other peripherals that may be of interest (e.g. disabling watchdog timers). Once done, it copies the next stage to the External RAM and executes it. Depending on the context, it could be called MLO, SPL or something else.
3. 2nd stage bootloader - the "main" bootloader. Bigger (could be x10) than the 1st stage one, it completes the initialization of the relevant peripherals (e.g. ethernet, additional storage media, LCD displays). Allows a much more complicated logic for what to do next and offers - depending on the level of sophistication - high level facilities (filesystem/volume handling, data copy-move-interpretation, LCD output, interactive console, failsafe policies). Most of the time it loads a Linux kernel (and related files) into memory from some medium and passes relevant information to it (e.g. if not embedded, for newer kernels the DTB physical address is put in the r2 register - the Kernel then reads the register and retrieves the DTB).
4. Linux Kernel - the core of the operating system. Depending on the hardware platform, it may or may not be a mainline ("official") version. It is usually completed by built-in or loadable (from an external source - free or not) modules. It initializes all the hardware needed for the complete system to work according to hardcoded configuration and the DT - enables the MMU, orchestrates the whole system and accesses the hardware exclusively. According to the boot arguments (cmdline - usually passed by the previous stage) and/or to compiled options, the Kernel tries to mount a root file system. From the rootfs, it will try to load an init (namely, /sbin/init - where / is the just mounted rootfs).
5. Init and rootfs - init is the first non-Kernel task to be run, and has PID 1. It initializes literally everything you need to use your system. In production embedded systems, it also starts the main application. In such systems it is either BusyBox or a custom crafted application.
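The hand-off from the 2nd stage bootloader to the Kernel mentioned above can be sketched in C. This is an illustration for 32-bit ARM, not how any particular bootloader implements it: the names `kernel_entry_t`, `boot_kernel` and `fake_kernel_entry` are hypothetical, and the value passed in r1 is a placeholder, on the assumption that the machine type is ignored when a DTB is supplied.

```c
#include <stdint.h>

/* Under the ARM procedure call standard the first three integer arguments
 * of a C call are placed in r0, r1 and r2, so calling the kernel image
 * through a function pointer is enough to put the DTB address in r2. */
typedef void (*kernel_entry_t)(uint32_t r0_zero,
                               uint32_t r1_machine_type,
                               uint32_t r2_dtb_addr);

/* Hypothetical hand-off: r0 = 0, r1 = placeholder machine type (assumption),
 * r2 = physical address of the DTB. */
void boot_kernel(kernel_entry_t entry, uint32_t dtb_phys_addr)
{
    entry(0u, 0xFFFFFFFFu, dtb_phys_addr);
}

/* Host-side stand-in for a kernel, only so the sketch can be exercised on a
 * PC. It records what it received; a real kernel entry never returns. */
static uint32_t fake_last_r0, fake_last_r2;

void fake_kernel_entry(uint32_t r0, uint32_t r1, uint32_t r2)
{
    (void)r1;
    fake_last_r0 = r0;
    fake_last_r2 = r2;
}
```

On real hardware `entry` would be the address the kernel image was loaded to, and the bootloader would also have to leave caches and the MMU in the state the kernel expects before jumping.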
More on rootfs and distros
Rootfs contains everything in your GNU/Linux system that is not the Kernel (apart from /lib/modules and other bits).
It contains all the applications that manage peripherals like Ethernet, WiFi, or external UMTS modems.
Contains the interactive part of the system, contains the user interface, and everything else you see when you boot a GNU/Linux system - embedded or not.
A "distro" is just a particular collection of userspace (non-Kernel) programs and libraries (usually) verified to work well with one another, put together by a particular group of people.
Desktop distros usually also ship with a custom-tailored kernel and a bootloader. Examples are Fedora, Ubuntu, Debian, etc.
In the general sense of the term, nothing stops you from creating your own distro, which is what happens every time a custom embedded system goes into production: through tools like Yocto or Buildroot (or by hand), you are able to decide the very particular collection (hence distro, distribution) of software fit for the purpose of the system.
To sum up and answer your question exactly, the missing part you are looking for is init and the process of mounting the rootfs: the Kernel mounts - i.e. makes available to itself, via its drivers and the passed/builtin parameters - a given volume/partition (the ext4 data partition you mention) at the "/" mount point.
In this volume/partition there is a /sbin/init executable, which the Kernel executes.
This is the "Big Bang" of our GNU/Linux userspace system: the place where everything visible starts. Depending on the configuration scripts (usually located under /etc/init.d) the "application" you mention is either run automatically by init or by the user via a terminal/ssh/whatever that - again - init made it possible for you to use.
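For the "custom crafted application" case, a bare-bones init can be little more than a loop that keeps the main application running. The sketch below uses only POSIX calls; the function names and the path /usr/bin/myapp are placeholders, and a production init would also mount /proc and /sys, set up devices and reap orphaned processes.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Starts `path` with `argv` as a child process and waits for it.
 * Returns the child's exit code (127 if the program could not be executed),
 * or -1 if fork/waitpid failed or the child did not exit normally. */
int run_and_wait(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execv(path, argv);
        _exit(127);            /* only reached if execv failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* What a custom /sbin/init could do: restart the application whenever it
 * exits. PID 1 must never return, hence the endless loop. */
void init_main_loop(void)
{
    char *const app_argv[] = { "/usr/bin/myapp", NULL };
    for (;;) {
        run_and_wait(app_argv[0], app_argv);
        sleep(1);              /* avoid a tight restart loop */
    }
}
```

Built statically and installed as /sbin/init on the rootfs, a program whose `main` calls `init_main_loop()` would be the first and only userspace process the Kernel starts.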

trying to know more about verilog language, vhdl,and assembly language

I would like to know what is the difference between verilog and assembly language.
Next semester we will be working with micro-controllers, but I would like to learn a little bit about it before the semester begins. I've been doing a lot of research about low-level programming, and so far I have gained a good understanding in assembly language, but I get confused trying to understand Verilog and VHDL?

Verilog and VHDL are a completely different kind of language: they describe hardware, for example for the purpose of programming FPGAs.
FPGAs are devices that can be programmed on the fly to implement any sort of digital logic (and sometimes analog too).
So using Verilog or VHDL, you can design a circuit with a couple of latches, some two's-complement adders, a mux, and a clock source, and suddenly you've designed a circuit that can calculate. You could then take the output from the VHDL compiler (more precisely, the synthesis tool), "download" it to the FPGA, and now you actually have some hardware that can be used to do calculations.
Of course, you can use FPGAs to implement all sorts of complicated stuff - even a full custom CPU. One uses Verilog and VHDL to design the circuits that are programmed into FPGAs. Those circuits could implement something simple like a ripple counter, or something more complex like an LCD driver, or something even more complex like a USB transceiver. You can go from as simple as a few latches to as complicated as a fully operating CPU; as long as it's digital hardware, you can make whatever you want with VHDL and some FPGAs.
To clarify further -
"Assembly language" typically refers to raw instructions given to some sort of CPU. Of course, there are many different types of CPUs (x86, ARM, SPARC, MIPS) and further many different variants of those types of CPUs. Each CPU has its own instruction set.
Machine code is complete, fully specified, ready-to-be-executed instructions. Assembly languages allow you to type instructions from your CPU's instruction set in plain text, use labels and such, and describe the memory layout of the program. Put the assembly through an assembler and out comes machine code in your CPU's machine instruction set.
You could design your own CPU from scratch using VHDL. As you're designing the CPU, you would have it implement your own custom instruction set. From there, you could take the VHDL for your CPU, compile it, write it to an FPGA and have your own custom CPU. Then you could start writing programs for your made-up CPU using your custom instruction set by writing a custom assembler. Some friends of mine in college did this for giggles.
For example, you know how most CPUs are load-store, register based CPUs? Instructions tend to go something like this:
Load the value '1' into register A
Load the value '2' into register B
Add register A and register B, storing result in register A
(You just added 1 + 2! Heh)
That sort of model of computation happens to be the most popular, but it's not the only way you could do computation. What if you had a stack based CPU, where you push values onto a hardware stack, and then computations work with the values on the top of the stack, pushing results back onto the stack.
For instance:
Push 1 onto the stack (stack currently contains: 1)
Push 2 onto the stack (stack currently contains: 2 1)
Push 3 onto the stack (stack currently contains: 3 2 1)
Add
'Add' takes the top two elements on the stack, adds them together, and pushes the result on the top of the stack.
Stack now contains: 5 1
Add
Stack now contains: 6
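The same sequence can be simulated in a few lines of C. This is only a software model of the idea, with made-up names (`stack_machine`, `sm_push`, `sm_add`, `sm_top`) and an arbitrary depth of 16.

```c
#include <stddef.h>

#define STACK_DEPTH 16

struct stack_machine {
    int    slots[STACK_DEPTH];
    size_t depth;   /* number of values currently on the stack */
};

/* Push a value; returns 0 on success, -1 if the stack is full. */
int sm_push(struct stack_machine *m, int value)
{
    if (m->depth >= STACK_DEPTH)
        return -1;
    m->slots[m->depth++] = value;
    return 0;
}

/* 'Add': pop the top two values and push their sum.
 * Returns -1 if there are fewer than two values on the stack. */
int sm_add(struct stack_machine *m)
{
    if (m->depth < 2)
        return -1;
    int b = m->slots[--m->depth];
    int a = m->slots[--m->depth];
    m->slots[m->depth++] = a + b;
    return 0;
}

/* Value on top of the stack; the caller must ensure depth > 0. */
int sm_top(const struct stack_machine *m)
{
    return m->slots[m->depth - 1];
}
```

Pushing 1, 2 and 3 and calling `sm_add` twice leaves a single value, 6, on the stack, matching the walk-through above. Note that `sm_add` takes no operands at all, which is where the short instructions come from.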
Neat, isn't it? As far as a computation model goes, it has its advantages: the operands are implicit, so instructions tend to be short and need fewer bits. Smaller instructions mean denser code, which can help the CPU go faster.
The problem is that hardly any mainstream processor works like this anymore.
But if you knew what you were doing, you could design one in VHDL, program it into an FPGA, and suddenly you have one of the few operating stack-based processors around.
Say, if you were doing a master's thesis, for instance, you might dig around and find out that virtual-machine-based programming languages like C# and Java compile down to a bytecode for a CPU that doesn't really exist, but the model for that CPU proves useful for making code portable. You might find out that the imaginary machines used by these languages are based on stack-based processor models. If you were looking for something interesting to do, perhaps you write in VHDL a processor that natively implements the Java bytecode language. Now you'd be one of the few people with a computer that can directly run Java.

Verilog and VHDL are both HDLs (Hardware description languages) used mainly for describing digital electronics. Their targets may be FPGA or ASIC (custom silicon).
Assembly, on the other hand, uses a processor's instruction set to perform a series of calculations. Everything executed on a computer eventually ends up as machine instructions, which assembly represents in text form. One example of an instruction set would be the x86 ISA.
Summary: Verilog, VHDL describe hardware. Assembly is the low level program being executed on a processor.

A trivial SYSENTER/SYSCALL question

If a Windows executable makes use of SYSENTER and is executed on a processor implementing AMD64 ISA, what happens? I am both new and newbie to this topic (OSes, hardware/software interaction) but from what I've read I have understood that SYSCALL is the AMD64 equivalent to Intel's SYSENTER. Hopefully this question makes sense.

If you try to use SYSENTER where it is not supported, you'll probably get an "invalid opcode" exception.
Note that this situation is unusual - generally, Windows executables do not directly contain instructions to enter kernel mode.
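Whether a processor advertises these instructions can be read from CPUID. The sketch below only decodes the two feature bits from EDX values the caller has already obtained (with GCC or Clang on x86, for example, via `__get_cpuid` from `<cpuid.h>`); the function names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, EDX bit 11 (SEP): SYSENTER/SYSEXIT are present. */
bool cpu_has_sysenter(uint32_t cpuid_1_edx)
{
    return ((cpuid_1_edx >> 11) & 1u) != 0;
}

/* CPUID leaf 0x80000001, EDX bit 11: SYSCALL/SYSRET are present. */
bool cpu_has_syscall(uint32_t cpuid_80000001_edx)
{
    return ((cpuid_80000001_edx >> 11) & 1u) != 0;
}
```

The feature bits are only part of the picture: whether the instruction actually executes also depends on the vendor and on the mode the processor is running in, so a set bit is not a guarantee for every mode.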

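As far as I know, AMD64 processors have different operating modes, and the answer depends on the mode.
SYSENTER works fine in legacy (32-bit) mode, but AMD processors do not support it in long mode, where SYSCALL is used instead.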
A very useful site to get started about the different modes:
Wikipedia

They got rid of a bunch of rarely used functionality when they developed the AMD64 extensions. One of the main changes is that segmentation is mostly disabled in 64-bit mode: the base and limit of the cs, ds, es, and ss segments are ignored. Normally loading segment registers is an extremely expensive operation (the CPU has to do permission checks, which could involve multiple memory accesses). Entering kernel mode requires loading new segment register values.
The SYSENTER instruction accelerates this by having a set of "shadow registers" which it can copy directly to the (internal, hidden) segment descriptors without doing any permission checks. The vast majority of the benefit is lost with only a couple of segment registers still in play, so most likely the reasoning for removing support for the instruction is that another way of doing the mode switch is faster.
