Reading and writing memory, but having trouble writing to a virtual address

I am trying to write a program that scans a process's memory and can also write to those addresses (just like Cheat Engine). However, I did some research and found out that the memory I was reading is virtual memory. I can read this memory, but I can't write to it, and to translate it I need page tables. So my question is: where can I find these page tables, and is there any other way to write using the virtual address I get?

Virtual memory is an elaborate illusion. What you think is read/write RAM may actually be data in swap space, or "read only, copy on write", or something else.
To maintain the illusion, and for security, and for compatibility (e.g. a 32-bit program running on a 64-bit CPU with a 64-bit kernel), user-space is not given access to page tables.
An OS or kernel might provide an abstract interface to some of the information (with suitable restrictions and limitations for security). One example of this would be the VirtualQuery() and VirtualQueryEx() functions in Windows (see https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualqueryex ).
In a similar way, an OS or kernel might provide an abstract interface to alter a page's permissions (with suitable restrictions and limitations for security). One example of this would be the VirtualProtect() function in Windows (see https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualprotect ).
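Combining those two interfaces, a user-space "Cheat Engine"-style writer on Windows never needs the page tables at all: it queries the page, makes it writable, writes through the OS, then restores the protection. A minimal sketch (the write_remote() helper name is mine; it assumes a handle opened with PROCESS_QUERY_INFORMATION | PROCESS_VM_OPERATION | PROCESS_VM_READ | PROCESS_VM_WRITE, and error handling is abbreviated):
#include <windows.h>

BOOL write_remote(HANDLE hProcess, LPVOID addr, const void *data, SIZE_T len)
{
    MEMORY_BASIC_INFORMATION mbi;
    DWORD oldProt;
    SIZE_T written = 0;

    if (VirtualQueryEx(hProcess, addr, &mbi, sizeof(mbi)) == 0)
        return FALSE;                 /* address not mapped at all */
    if (mbi.State != MEM_COMMIT)
        return FALSE;                 /* nothing committed there */

    /* Temporarily mark the region read/write (may fail for guard
       pages, shared sections, etc.). */
    if (!VirtualProtectEx(hProcess, mbi.BaseAddress, mbi.RegionSize,
                          PAGE_READWRITE, &oldProt))
        return FALSE;

    BOOL ok = WriteProcessMemory(hProcess, addr, data, len, &written)
              && written == len;

    /* Restore the original protection. */
    VirtualProtectEx(hProcess, mbi.BaseAddress, mbi.RegionSize,
                     oldProt, &oldProt);
    return ok;
}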
... and is there any other way to write using the virtual address I get?
If your CPU is an 80x86 CPU that supports Intel's transactional extensions; you can misuse "transactions" to suppress page faults (make them cause a "transaction abort" instead of triggering a page fault).
This won't allow you to write to a read-only or "not present" page; but will allow you to attempt to write without being detected by the OS.
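As a sketch of that trick (assuming an RTM-capable CPU and compiling with -mrtm; this is a misuse of the feature, not a supported interface, and spurious aborts from interrupts or capacity limits will read as "not writable"):
#include <immintrin.h>   /* _xbegin/_xabort */
#include <stdbool.h>

/* Probe whether *p can be written without the OS ever seeing a fault:
   a would-be page fault inside the transaction becomes an abort. */
bool quietly_writable(volatile char *p)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        *p = *p;       /* faulting write aborts instead of trapping */
        _xabort(1);    /* roll the probe write back on purpose */
    }
    /* _XABORT_EXPLICIT set => our _xabort(1) ran, so the write worked */
    return (status & _XABORT_EXPLICIT) != 0;
}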

Related

OS: how does kernel virtual memory help in making swap pages of the page table easier?

Upon reading this chapter from the "Operating Systems: Three Easy Pieces" book, I'm confused by this excerpt:
If in contrast the kernel were located entirely in physical memory, it would be quite hard to do things like swap pages of the page table to disk;
I've been trying to make sense of it for days but still can't understand how kernel virtual memory makes swapping pages of the page table easier. Wouldn't it be the same if the kernel lived completely in physical memory, since the pages of different processes' page tables would end up in physical memory in the end anyway (and thus be swapped to disk if needed)? How is it different when page tables reside in kernel virtual memory vs. in kernel-owned physical memory?
Let's assume the kernel needs to access a string of characters from user-space (e.g. maybe a file name passed to an open() system call).
With the kernel in virtual memory, the kernel would have to check that the virtual address of the string is sane (e.g. not an address for the kernel's own data) and also guard against a different thread in user-space modifying the data the pointer points to (so that the string can't change while the kernel is using it, possibly after the kernel checked that the string is valid but before the kernel uses it). In theory (and likely in practice) this could all be hidden by a "check virtual address range" function. Apart from those things, the kernel can just use the string like normal. If the data is in swap space, the kernel can just let its own page fault handler fetch the data when the kernel attempts to access it, and not worry about it.
With the kernel in physical memory, the kernel would still have to do those things (check the virtual address is sane, guard against another thread modifying the data the pointer points to). In addition, it would have to convert the virtual address to physical addresses itself, and ensure the data is actually in RAM itself. In theory this could still be hidden by a "check virtual address range" function that also converts the virtual address into physical address(es).
However, data that is contiguous in virtual memory (the string) may not be contiguous in physical memory (e.g. the first half of the string in one page and the second half in a different page at a completely unrelated physical address). This means the kernel would have to deal with that problem too (e.g. for a string, even things like strlen() can't work), and it can't be hidden in a "check (and convert) virtual address range" function to make it easier for the rest of the kernel.
To deal with the "not contiguous in physical memory" problem there are mostly only two possibilities:
a) a set of "get char/short/int... at offset N using this list of physical addresses" functions (see the sketch after this list); or
b) refuse to support it in the kernel; which mostly just shifts unwanted burden to user-space (e.g. the open() function in a C library copying the file name string to a new page, if the original string crossed a page boundary, before calling the kernel's open() system call).
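As a hypothetical sketch of option a) (the phys_range type and the phys_read_byte() primitive are invented for illustration; no real kernel exposes exactly this interface):
#include <stddef.h>
#include <stdint.h>

extern uint8_t phys_read_byte(uint64_t phys_addr); /* assumed primitive */
#define PAGE_SIZE 4096

typedef struct {
    uint64_t *pages;   /* physical address of each page backing the range */
    size_t    count;   /* number of pages in the list */
    size_t    offset;  /* offset of the range within the first page */
} phys_range;

/* Fetch the byte at offset n of the range. The string may span several
   non-contiguous physical pages, so strlen()-style walks must go through
   a helper like this instead of a plain pointer. */
int phys_range_byte(const phys_range *r, size_t n, uint8_t *out)
{
    size_t abs  = r->offset + n;
    size_t page = abs / PAGE_SIZE;
    if (page >= r->count)
        return -1;                        /* past the end of the range */
    *out = phys_read_byte(r->pages[page] + abs % PAGE_SIZE);
    return 0;
}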
In an ideal world (which is what we live in now, with kernels that are hundreds of megabytes in size running on machines with gigabytes of physical RAM) the kernel would never swap even parts of itself. But in the old days, when physical memory was a constraint, the less of the kernel was in physical memory, the more of the application could be in physical memory. The more of the application is in physical memory, the fewer page faults in user space.
The Linux kernel has been worked over fairly extensively to keep it compact. Case in point: kernel modules. You can load a module using insmod or modprobe, and that module will become resident, but if nothing uses it, after a while it will get swapped out, and that's no big deal because nothing is using it.

Is UEFI required to map 4k pages on x64?

I am creating a kernel for x64 which boots with UEFI. While the kernel has to be loaded at a low-ish address (I believe, because UEFI requires identity-mapped pages, so it cannot be mapped higher than the highest physical address), I want to relocate it up to the end of memory. During this process I intend to create new paging structures, and in order to reduce memory consumption I wanted to reuse the page tables used to map the image in the lower half. However, these page tables will only exist if 4k paging is used by UEFI, so my question is whether or not UEFI is required to use 4k paging on x64. I believe the answer is no, but I hope otherwise and wanted to see if this is true.
Now, I understand UEFI allocates memory via BootServices->AllocatePages() in 4k chunks it refers to as pages, but is this required to translate to the actual mapping structures used? I noticed that in section 2.3.6 of the UEFI 2.8 specification, the section referring to AArch64 calling conventions, it states
MMU configuration: Implementations must use only 4k pages [...]
There is no similar denotation in section 2.3.4, on the x64 calling conventions, which is why I believe the answer is no.
EDIT:
Based upon what I've already seen and the comment by Peter Cordes, I believe the standard does not specify exactly what it should be. Thus a revised version of the question is: Does the standard specify 4k translation granularity? If not, do most UEFI vendors on x64 use 4k pages?
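One way to answer the "what do vendors actually do" half empirically is to walk the live paging structures the firmware hands you, before tearing them down. A minimal sketch, assuming 4-level paging, the identity mapping UEFI provides while boot services are active, and present entries at every level (presence checks omitted):
#include <stdint.h>

static inline uint64_t read_cr3(void)
{
    uint64_t v;
    __asm__ volatile("mov %%cr3, %0" : "=r"(v));
    return v;
}

/* Returns the page size the firmware used to map vaddr: 4 KiB, 2 MiB,
   or 1 GiB, depending on where the PS bit (bit 7) terminates the walk. */
uint64_t page_size_of(uint64_t vaddr)
{
    uint64_t *pml4 = (uint64_t *)(read_cr3() & ~0xFFFULL);
    uint64_t *pdpt = (uint64_t *)(pml4[(vaddr >> 39) & 0x1FF]
                                  & 0x000FFFFFFFFFF000ULL);
    uint64_t pdpte = pdpt[(vaddr >> 30) & 0x1FF];
    if (pdpte & (1ULL << 7))
        return 1ULL << 30;               /* 1 GiB page */
    uint64_t *pd  = (uint64_t *)(pdpte & 0x000FFFFFFFFFF000ULL);
    uint64_t pde  = pd[(vaddr >> 21) & 0x1FF];
    if (pde & (1ULL << 7))
        return 1ULL << 21;               /* 2 MiB page */
    return 1ULL << 12;                   /* reached a page table: 4 KiB */
}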

What exactly is a machine instruction?

The user's program in main memory consists of machine instructions and data. In contrast, the control memory holds a fixed microprogram that cannot be altered by the occasional user. The microprogram consists of microinstructions that specify various internal control signals for execution of register microoperations. Each machine instruction initiates a series of microinstructions in control memory. These microinstructions generate microoperations to fetch the instruction from main memory; to evaluate the effective address, to execute the operation specified by the instruction, and to return control to the fetch phase in order to repeat the cycle for the next instruction.
I don't exactly understand the difference here between machine instructions, microinstructions and microoperations. I certainly understand that microinstructions, according to the paragraph given, are the intermediate level of instructions, but which of the other two is closer to machine language? Are CLA, ADD, STA, BUN, BSA, AND etc. machine instructions or microoperations?
A CPU presents itself to the outside as a device capable of executing machine instructions. For example,
mov (%esi,%ebx,4), %edx
is a machine instruction that moves 4 bytes of data at address ESI+4*EBX into register EDX. Machine instructions are public - they are published by CPU manufacturer in a user manual. Compilers such as gcc will output files that contain machine instructions, and these will typically end up in EXE/DLL files.
If you look closely at the above instruction, you will see that it is a fairly complex operation. It involves some arithmetic (multiplication and addition) to get the memory address, then moving data from that address into a register. From the CPU's perspective, it makes sense to reuse the arithmetic unit that is already there. So it is natural to break this instruction down into microinstructions. In essence, the mov instruction is implemented internally by the CPU as a microprogram written in microinstructions. This is, however, an implementation detail of the CPU. Microinstructions are internal to the CPU and are invisible to anybody except the CPU manufacturer.
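For illustration only (real micro-operations are undocumented and differ between CPU models), the instruction above might decompose into something like:
tmp1 <- EBX << 2           ; scale the index by 4
tmp2 <- ESI + tmp1         ; compute the effective address
EDX  <- load32 [tmp2]      ; fetch 4 bytes from memory into the register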
Microinstructions have several benefits:
they simplify internal CPU architecture, design and testing, thus lowering cost per unit
they make it easy to create rich and powerful sets of machine instructions (you just have to combine microinstructions in different ways)
they provide a consistent machine language across different CPUs (e.g. Xeon and Pentium both implement basic x86_64 instruction set even though they are very different in hardware)
they enable optimizations (e.g. the same instruction can be implemented directly in hardware on one CPU and emulated in microinstructions on another)
they allow bug fixes (e.g. you can mitigate the Spectre vulnerability while the machine is running, without buying a new CPU and opening up your server)
For more information, see https://en.wikipedia.org/wiki/Micro-operation
I think the answer to your question is in these three sentences:
The user's program in main memory consists of machine instructions and data.
Each machine instruction initiates a series of micro-instructions in control memory.
These micro-instructions generate micro-operations.
So:
The user supplies machine instructions
Those get translated into micro-instructions
Those get translated into micro-operations
The mnemonics you mentioned are what the user might use to write or read a list of machine instructions (the actual instructions just being patterns of bits understood by the processor). The "occasional user" (i.e. everyone other than the chip's designer) never needs to deal directly in micro-instructions or micro-operations, so would never know individual names for them.

How to handle buffer and secondary storage with PostgreSQL Server Programming (SPI)?

I am wondering where/how PostgreSQL (9.6) handles memory movement between secondary storage (e.g. hard drives) and memory buffers.
For example, how does it load the relevant data into memory when some tuples being queried are not in the buffer, and how does it flush some data to disk when the memory buffer is full?
I haven't done server programming before. But when I looked at the Server Programming Interface and the section about memory management, I couldn't find any mention of "secondary storage" or "buffer" etc. Where are such issues handled?
Can anyone give some pointers about this?
I think you are confused here.
The memory management functions you reference above are to allocate and manage memory that remains allocated after your function has finished (but is freed when the calling statement ends), e.g. to contain results to return to the caller of the function.
Storage management and data buffering happen on a different, much lower, level, and you cannot influence that via SPI. SPI is just an interface for C code running in the server to run SQL statements. As far as shared buffers are concerned, it does not make a difference whether you issue a query from psql or via SPI.
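To make that concrete, here is a minimal sketch of an SPI call from C code running in the server (the count_pg_class_rows() helper name is mine; a real SQL-callable function needs the usual PG_FUNCTION_ARGS boilerplate around it). Note there is no buffer or storage management anywhere in it:
#include "postgres.h"
#include "executor/spi.h"

/* SPI_execute() goes through the regular executor, which uses shared
   buffers and the storage layer exactly as the same query typed into
   psql would; any disk I/O is entirely transparent to this code. */
static uint64 count_pg_class_rows(void)
{
    uint64 nrows;

    if (SPI_connect() != SPI_OK_CONNECT)
        elog(ERROR, "SPI_connect failed");

    /* read_only = true, tcount = 0 (no row limit) */
    if (SPI_execute("SELECT * FROM pg_class", true, 0) != SPI_OK_SELECT)
        elog(ERROR, "SPI_execute failed");

    nrows = SPI_processed;    /* number of rows the query produced */

    SPI_finish();
    return nrows;
}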

Why can't DBMSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the quoted one.
My understanding is that when a DBMS wants to read a block x it issues an ordinary read call. There should be no difference from any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page in from disk it will involve at least one system call. At this point most DBMSs place the page into their own buffer. (It also ends up in the OS's buffer, but that's unimportant.)
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
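A sketch of that "check the cache first" logic (find_in_cache(), choose_victim() and cache_insert() are hypothetical buffer-pool helpers; pread() is the only system call involved):
#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 8192

typedef struct page { uint64_t pageno; char data[PAGE_SIZE]; } page;

extern page *find_in_cache(uint64_t pageno);  /* hash lookup, no syscall */
extern page *choose_victim(void);             /* DBMS's own eviction policy */
extern void  cache_insert(page *p);

page *get_page(int fd, uint64_t pageno)
{
    page *p = find_in_cache(pageno);
    if (p != NULL)
        return p;               /* hit: served without any system call */

    p = choose_victim();        /* evict per the DBMS's own policy */
    if (pread(fd, p->data, PAGE_SIZE,
              (off_t)(pageno * PAGE_SIZE)) != PAGE_SIZE)
        return NULL;            /* miss: exactly one system call */
    p->pageno = pageno;
    cache_insert(p);
    return p;
}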
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Next, understand that disk i/o is a layered process. It was in 1981 and is even more so now. At the lowest point, a device driver will issue physical read/write instructions to the hardware. Above that may be the o/s kernel code, then the o/s user space code, then the application. Between a C program's fread() and the disk heads moving there are at least three or four levels, and there might be considerably more. A DBMS seeking to improve performance might bypass some layers and talk directly to the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and what gets cached (a dbms may be able to cache better for its needs than the more generic device buffer caching can).
And then there is the issue that a generic "block" may in fact amount to a considerably larger I/O burden (this depends on partitioning and the like) than what a dbms would ideally like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby be able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
Often, a DBMS has a really good idea about its upcoming access patterns, and it can't communicate these patterns to the kernel. This can lead to lower performance.
The buffer cache is traditionally stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by using the buffer cache a DBMS would be unable to take full advantage of system resources.
I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address space for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
However, many DBMSs including INGRES [20] and System R [4] choose to put a DBMS managed buffer pool in user space to reduce overhead. Hence, each of these systems has gone to the trouble of constructing its own buffer pool manager to enhance performance.
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."