What is an address/range of addresses that is guaranteed not to be used in x86-64?

I am writing a version of malloc that is compatible with multi-threading. It is going to use arenas to help facilitate parallelism.
mmap is being used to create the arenas. Using NULL as the input address to mmap is not working. Is there a range of addresses that is basically guaranteed to be free in x86-64?
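For concreteness, here is a minimal sketch of the kind of mmap call in question, using the NULL hint so the kernel picks an unused region and checking for MAP_FAILED; ARENA_SIZE and arena_create are placeholder names, not part of the original code:

#include <stdio.h>
#include <sys/mman.h>

#define ARENA_SIZE (4 * 1024 * 1024)  /* placeholder arena size: 4 MiB */

/* Ask the kernel to pick a free region rather than hard-coding an address. */
static void *arena_create(void)
{
    void *p = mmap(NULL, ARENA_SIZE,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    return p;  /* kernel-chosen address */
}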

Related

Why is address translation needed?

I am taking an operating systems class, and they introduced the concept of address translation. While a program is running, all memory accesses will be translated from virtual to physical. My question is: a lot of memory addresses are given to the program by the OS, and thus can be the physical addresses themselves. What types of memory requests does the program initiate all by itself (without an address given by the OS), and thus would be virtual addresses?
The address of the stack is pre-set by the OS in the stack pointer before the program starts running, and so the stack pointer can hold the physical address. Heap addresses returned by malloc are returned by the OS -- thus, the OS can return the physical address, since that address is stored in some variable and is transparent to the program. So what addresses does the program itself access that need to be translated to physical addresses? So far, the only examples I can think of are: 1) instruction addresses (jump commands have the instruction address hardcoded in the program code) and 2) maybe static variable addresses (if it's not stored in a register by the OS). Are there any more examples/am I missing something?
Thanks!
Maybe the simplest example why address translation is extremely useful:
The physical address space is usually partitioned into pages, e.g. 4k large.
Processes have their own virtual address space, that is also partitioned into pages of the same size.
Address translation, which is done by a memory management unit under control of the operating system, maps any virtual page to any physical page, for every page independently.
It is thus possible to combine arbitrary pages of a fragmented physical memory into contiguous virtual pages.
This allows the physical memory to be used much more effectively than would be possible without address translation.
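To make the page-level mapping concrete, here is a small hedged C sketch of the translation arithmetic: the toy page table, frame numbers, and function names are made up for illustration, but the split into page number and unchanged offset is the general idea.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u

/* Toy page table: virtual page number -> physical frame number.
 * The frame numbers are deliberately non-contiguous. */
static const uint32_t page_table[] = { 7, 2, 9, 4 };

static uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn    = vaddr / PAGE_SIZE;   /* virtual page number */
    uint64_t offset = vaddr % PAGE_SIZE;   /* unchanged by translation */
    return (uint64_t)page_table[vpn] * PAGE_SIZE + offset;
}

int main(void)
{
    /* Two virtually adjacent pages land in scattered frames 7 and 2. */
    printf("0x%llx -> 0x%llx\n", 0x0FF8ULL, (unsigned long long)translate(0x0FF8));
    printf("0x%llx -> 0x%llx\n", 0x1008ULL, (unsigned long long)translate(0x1008));
    return 0;
}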

Using an AT28C256 as non-volatile SRAM for a Z80

I've been using an AT28C256 as EEPROM 'ROM' for a Z80 project quite successfully. As the AT28C256 can be programmed at 5V using the /WE pin, I was thinking about also using it as a form of non-volatile SRAM, rather than adding another chip.
Yes, the AT28C256 is only 32kB in size, so I'm not using the whole 16-bit address space on the Z80 - but I wanted to know if this is possible?
Could I just OR the /MREQ and /WR lines on the Z80 together for the /WE on the AT28C256? Or am I missing something?
I could then set my Stack Pointer (SP) to the 32k boundary, rather than the usual 0xFFFF.
You can use an EEPROM like a RAM, but only if you take its behavior into account.
You can simply connect:
Z80-/MREQ to EEPROM-/CE, but you will need to gate this
Z80-/WR to EEPROM-/WE
Z80-/RD to EEPROM-/OE
Things to consider, consult the data sheet for details:
If you write a byte (or use the page-write algorithm), the EEPROM will not output the stored values on a read until the self-timed write cycle has passed.
The write cycle takes a few milliseconds.
The EEPROM might wear out after a few tens of thousands of write cycles (thanks, Stefan Paul Noack).
You can't use it for the program that changes the chip's contents, because of point 1.
You can't use it for the stack or any other data that needs to be stored and retrieved quickly, because of point 2.
However, you can use it for the application's data. But you will need another memory for the program to run.
And if your program needs a stack or other variables to be written quickly, you will need an additional RAM. (Note: I remember a Z80 application that implemented a printer queue with just simple DRAM, using only the CPU's registers for the program's variables, and using the DRAM only for the data to buffer.)
To use multiple chips as memory, you will need to gate the /CE pins of these memories depending on their address ranges.
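To illustrate the self-timed write cycle from points 1 and 2, here is a hedged C sketch (e.g. for a Z80 C compiler such as SDCC) of the DATA-polling technique the datasheet describes: while the internal write is in progress, reading the just-written location returns the complement of bit 7. EEPROM_BASE and the function name are assumptions for illustration.

#include <stdint.h>

#define EEPROM_BASE 0x8000u  /* assumed: EEPROM mapped into the upper 32 KiB */

/* Write one byte and busy-wait until the self-timed cycle finishes.
 * Per the datasheet, a read during the write returns the complement
 * of bit 7 of the data last written (DATA polling). */
static void eeprom_write_byte(uint16_t addr, uint8_t value)
{
    volatile uint8_t *p = (volatile uint8_t *)(EEPROM_BASE + addr);
    *p = value;
    while ((*p & 0x80u) != (value & 0x80u))
        ;  /* typically a few milliseconds */
}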

Atomicity of small PCIE TLP writes

Are there any guarantees about how card-to-host writes from a PCIe device targeting regular memory are observed from a software process's perspective, when a single TLP write is fully contained within a single CPU cache line?
I'm wondering about a case where my device writes some number of words of data followed by a byte to indicate that the structure is now valid (for example, an event completion):
struct alignas(SYSTEM_CACHE_LINE_SIZE) PCIE_COMPLETION_T {
    uint64_t data_a;
    uint64_t data_b;
    uint64_t data_c;
    uint64_t data_d;
    uint8_t valid;
};
Can I use a single TLP to write this structure, such that when software sees the valid member change to 1 (having previously been cleared to zero by software), the other data members will also reflect the values that I wrote and not a previous value?
Currently I'm performing 2 writes, first writing the data and secondly marking it as valid, which doesn't have any apparent race conditions but does of course add unwanted overhead.
The most relevant question I can see on this site seems to be Are writes on the PCIe bus atomic? although this appears to relate to the relative ordering of TLPs.
Perusing the PCIe 3.0 specification, I didn't find anything that seemed to explicitly cover my concern; I don't think I particularly need AtomicOps. Given that I'm only concerned with interactions with x86-64 systems, I also dug through the Intel architecture guide, but came away no clearer.
Instinctively it seems that it should be possible for such a write to be perceived atomically -- especially as it is said to be a transaction -- but equally I can't find much in the way of documentation explicitly confirming that view (nor am I quite sure what I'd need to look at, probably the CPU vendor's documentation?). I also wonder whether such a scheme can be extended over multiple cache lines -- i.e. if valid sits on a second cache line written from the same TLP transaction, can I be assured that the first will be perceived no later than the second?
The write may be broken into smaller units, as small as dwords, but if it is, they must be observed in increasing address order.
PCIe revision 4, section 2.4.3:
If a single write transaction containing multiple DWs and the Relaxed Ordering bit Clear is accepted by a Completer, the observed ordering of the updates to locations within the Completer's data buffer must be in increasing address order. This semantic is required in case a PCI or PCI-X Bridge along the path combines multiple write transactions into the single one. However, the observed granularity of the updates to the Completer's data buffer is outside the scope of this specification.
While not required by this specification, it is strongly recommended that host platforms guarantee that when a PCI Express write updates host memory, the update granularity observed by a host CPU will not be smaller than a DW.
As an example of update ordering and granularity, if a Requester writes a QW to host memory, in some cases a host CPU reading that QW from host memory could observe the first DW updated and the second DW containing the old value.
I don't have a copy of revision 3, but I suspect this language is in that revision as well. To help you find it, Section 2.4 is "Transaction Ordering" and section 2.4.3 is "Update Ordering and Granularity Provided by a Write Transaction".
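As a hedged host-side sketch of how that increasing-address-order guarantee is typically exploited: because valid is the last (highest-address) byte of the structure, seeing it become 1 implies the earlier data words have already been updated. The struct mirrors the one in the question (rewritten in C); the function and type names are illustrative, and depending on the platform an explicit read barrier may still be required.

#include <stdint.h>

/* C version of the structure from the question (cache-line aligned in practice). */
struct pcie_completion {
    uint64_t data_a;
    uint64_t data_b;
    uint64_t data_c;
    uint64_t data_d;
    uint8_t  valid;   /* highest address within the TLP payload */
};

/* Poll for completion.  With Relaxed Ordering clear, the write is observed
 * in increasing address order, so data_a..data_d are visible no later than
 * valid.  volatile keeps the compiler from hoisting the data reads above
 * the valid check; some platforms may additionally need a read barrier. */
static int poll_completion(volatile struct pcie_completion *c, uint64_t out[4])
{
    if (c->valid != 1)
        return 0;            /* not ready yet */
    out[0] = c->data_a;
    out[1] = c->data_b;
    out[2] = c->data_c;
    out[3] = c->data_d;
    return 1;
}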

How to handle buffer and secondary storage with PostgreSQL Server Programming (SPI)?

I am wondering where/how to let PostgreSQL (9.6) handle the movement of data between secondary storage (e.g. hard drives) and memory buffers.
For example, how to load relevant data into memory when some tuples being queried are not in the buffer; and how to flush some data to disk when the memory buffer is full?
I haven't done server programming before, but when I looked at the Server Programming Interface and its section about memory management, I couldn't find any mention of "secondary storage" or "buffer" etc. Where are such issues handled?
Can anyone give some pointers about this?
I think you are confused here.
The memory management functions you reference above are to allocate and manage memory that remains allocated after your function has finished (but is freed when the calling statement ends), e.g. to contain results to return to the caller of the function.
Storage management and data buffering happen on a different, much lower, level, and you cannot influence that via SPI. SPI is just an interface for C code running in the server to run SQL statements. As far as shared buffers are concerned, it does not make a difference whether you issue a query from psql or via SPI.
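To illustrate that SPI only runs SQL and never touches the buffer manager, here is a minimal hedged sketch of a server-side C helper; count_rows and my_table are hypothetical names, while SPI_connect, SPI_execute, SPI_processed, SPI_finish and elog are the documented SPI/PostgreSQL calls.

#include "postgres.h"
#include "executor/spi.h"

/* Hypothetical helper: count the rows of some table via SPI.
 * Shared-buffer management and disk I/O happen transparently inside
 * the executor; SPI callers never see them. */
static uint64 count_rows(void)
{
    uint64 n = 0;

    if (SPI_connect() != SPI_OK_CONNECT)
        elog(ERROR, "SPI_connect failed");

    if (SPI_execute("SELECT * FROM my_table", true, 0) == SPI_OK_SELECT)
        n = SPI_processed;

    SPI_finish();
    return n;
}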

Writing to hard disk from contiguous physical memory

I have an ARM based device, running linux, which is connected to a camera, and I'm trying to store captured frames to HD efficiently.
I'm developing in user space, but can modify drivers at will
I'm coding in C
Frames are written into memory using DMA, and I have their physical memory pointers.
I am able to control all the frame capturing flow, and I can tell when the frame buffers are stable (dqueued from the video4linux driver)
Linux version is 3.0.35
I'm familiar with kernel source code, not an expert, but I'm able to find my way in it and figure out things, as long as I get some hints...
I believe I have 2 alternatives:
Find the optimal configuration for my filesystem for opening the file and writing into it. I'm now using ext4 and the standard fopen()/fwrite() functions. I understand I can also use mmap, or add the O_DIRECT flag when calling open(), but didn't try it yet.
Find a way to pass the physical address of the buffer (I can get it from my Video4Linux driver) directly to the filesystem/hard drive driver, so the data will be transferred directly from there.
I found method 1 to be slow, having memory transactions as my bottleneck, since fwrite involves copying data from userspace to kernel space, and then again into some sort of cache, and then on to DMA. Too many memory transactions for a simple store...
Regarding method 2 - I don't know if that's possible, but if I was the one designing this system from scratch, this is what I would do.
Any thoughts?
Regarding method 1 (using open() and write(), mmap() and/or O_DIRECT): can you recommend optimal settings for my purpose?
Is method 2 (storing to HD directly from an existing DMA buffer) possible? If so - can you point me to an example?
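Regarding method 1, a minimal hedged sketch of the O_DIRECT route: the flag bypasses the page cache, but it requires the buffer address, file offset and length to be suitably aligned (typically 512 bytes or the filesystem block size); the path, FRAME_SIZE and store_frame are placeholders, not code from the question.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define FRAME_SIZE (1024u * 1024u)   /* placeholder frame size, must be aligned */

/* Write one frame bypassing the page cache.  buf must come from an aligned
 * allocation (the mmap'd DMA buffer is page aligned, which suffices). */
static int store_frame(const char *path, const void *buf)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, FRAME_SIZE);
    close(fd);
    return (n == (ssize_t)FRAME_SIZE) ? 0 : -1;
}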
The only problem with writing into a file via mmap on UNIX systems is that you either have to deal with signals in case of out-of-disk-space, or you have to make certain that the file is not sparse, so that all the needed disk space is already allocated.
I think an up-to-date G++ provides a method of converting signals into C++ exception handling, but I'm not certain how well supported this is on systems other than macOS.
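A hedged sketch of the second precaution: preallocating the file so it is not sparse before mmap'ing it, which avoids the out-of-space signal while copying into the mapping. The function name is illustrative; posix_fallocate, mmap and msync are the standard calls.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy a frame into a file through mmap, preallocating the space first
 * so the mapping is fully backed and no out-of-space signal can occur. */
static int store_frame_mmap(const char *path, const void *frame, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (posix_fallocate(fd, 0, len) != 0) {   /* allocate real blocks, not a hole */
        close(fd);
        return -1;
    }
    void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (dst == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(dst, frame, len);
    msync(dst, len, MS_SYNC);
    munmap(dst, len);
    close(fd);
    return 0;
}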