I used mmap (just trying to understand how mmap works) to allocate 96K of anonymous memory, but it looks like it split the 96K into 64K and 32K. However, when I allocate 960K, it allocates only one chunk of 960K. When will Solaris split the allocated memory into several parts?
Code:
#include <sys/mman.h>   /* mmap, MAP_FAILED */
#include <stdio.h>

#define PROT (PROT_READ | PROT_WRITE)
#define MAP  (MAP_ANON | MAP_PRIVATE)

if ((src = mmap(0, 88304, PROT, MAP, -1, 0)) == MAP_FAILED)
    printf("mmap error for input\n");
if ((src = mmap(0, 983040, PROT, MAP, -1, 0)) == MAP_FAILED)
    printf("mmap error for input\n");
if ((src = mmap(0, 98304, PROT, MAP, -1, 0)) == MAP_FAILED)
    printf("mmap error for input\n");
Truss:
mmap(0x00000000, 88304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFFFFFFFF7E900000
mmap(0x00000000, 983040, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFFFFFFFF7E800000
mmap(0x00000000, 98304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFFFFFFFF7E700000
Pmap:
FFFFFFFF7E700000  64 - - - rw--- [anon]
FFFFFFFF7E710000  32 - - - rw--- [anon]
FFFFFFFF7E800000 960 - - - rw--- [anon]
FFFFFFFF7E900000  64 - - - rw--- [anon]
FFFFFFFF7E910000  24 - - - rw--- [anon]
FFFFFFFF7EA00000  64 - - - rw--- [anon]
FFFFFFFF7EA10000  32 - - - rw--- [anon]
==> The strange part is that the 96K request shows up as two pieces (64K + 32K).
That is contiguous memory, you can tell by the addresses (F...700000 + 64K = F...710000) so I don't think you have to worry about that. I'm pretty certain that mmap is required to give you contiguous memory in your address space. It would be pretty useless otherwise since it only gives you one base address. With two non-contiguous blocks, there would be no way to find that second block.
So I guess your question is: why does this show up as two blocks in the pmap?
To which my answer would be, "Stuffed if I know". But I can make an intelligent guess which is the best anyone can hope for from me at this time of the morning (pre-coffee).
I would suggest that those blocks had been allocated before to another process (or two) and had been released back to the mmap memory manager. I can see a few possibilities for how that memory manager coalesces blocks to make bigger free blocks:
it does it as soon as the memory is released (not the case, since your output shows that isn't happening);
it does it periodically and hadn't got around to it before you requested your 96K block; or
it doesn't bother at all, because it's smart enough to do the coalescing while allocating a block to you.
I suspect it's the last of these, simply because the memory manager had no problem giving you two blocks for your request, so it's obviously built to handle that. The 960K block is probably not segmented because it came from a much bigger block.
Keep in mind this is speculation (informed, but still speculation). I've seen quite a bit of the internals of UNIX (real UNIXes, not that new kid on the block :-) but I've never had a need to delve into mmap.
I can't remember the term for it (stripes? Slices? wedges? argh) but Solaris allocates different page sizes from pools of various sizes. This turns out to be somewhat more efficient than uniform page sizes, because it makes better use of the memory mapping. One of those sizes is 32K, another 64K, and another is 1024K, I believe. To get 96K you got a 64K and a 32K; to get 960K you got most of a 1024K.
The core resource for this wizardry is the Solaris Internals book. Mine, unfortunately, is in a box in the garage at the moment.
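If you want to check which page sizes your system actually supports, here is a minimal sketch assuming a Solaris release with getpagesizes(3C) (the pagesize -a command prints the same list):

#include <stdio.h>
#include <sys/mman.h>   /* getpagesizes(3C) on Solaris */

int main(void)
{
    size_t sizes[16];
    int n = getpagesizes(sizes, 16);   /* fills in the hardware-supported page sizes */

    if (n == -1) {
        perror("getpagesizes");
        return 1;
    }
    for (int i = 0; i < n; i++)
        printf("supported page size: %zu bytes\n", sizes[i]);
    return 0;
}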
The answer depends on what you mean by contiguous. Solaris, and all modern Unix and Unix-like systems (probably all modern operating systems), divide physical memory into pages, and the memory within a page is contiguous at the physical level. Most modern systems have a hardware MMU (Memory Management Unit) that translates a virtual address to a physical address. So the mmap system call returns a contiguous virtual address range, but that range is managed by the MMU and may be backed by multiple pages, depending on the page size(s) and the size of the mapping.
So while all the virtual addresses within the mapping are contiguous, and the addresses within a single page are also physically contiguous, the pages themselves may not be anywhere near each other physically.
I'm new to perf, and I'm trying to use it to analyse my program.
This is what I got when running perf top:
PerfTop: 296 irqs/sec kernel:62.8% exact: 0.0% [1000Hz cycles:ppp], (all, 6 CPUs)
-----------------------------------------------------------------------------------------------------------------------
65.43% libc-2.23.so [.] __GI_memset
1.55% libopencv_imgcodecs.so.4.4.0 [.] cv::icvCvt_BGR2RGB_8u_C3R
1.54% libc-2.23.so [.] malloc
1.32% libc-2.23.so [.] _int_free
0.92% [kernel] [k] clear_page
0.91% libjpeg.so.8.0.2 [.] 0x000000000001b828
0.90% libc-2.23.so [.] memcpy
So I just wonder: what is costing 65% of my CPU time? Is it really just memset in libc?
If it is, how can it cost this much?
what is __GI_memset?
It's an internal alias for memset.
why does it cost so much CPU resource?
Because you call it a lot, or because you give it a lot of memory to set to some value.
Judging by your next most expensive symbol, cv::icvCvt_BGR2RGB_8u_C3R, you are doing some kind of image processing, and possibly allocating cleared images.
One common mistake is to allocate a cleared image and immediately overwrite it with something else (thus wasting the time spent clearing it). But there is not enough information here to deduce whether that is what you are doing.
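To illustrate that mistake in plain C (the buffer dimensions are made up; the same reasoning applies to allocating cleared images), the zeroing done by calloc below is pure wasted memset time, because every byte is overwritten right away:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { W = 1920, H = 1080, CHANNELS = 3 };

int main(void)
{
    size_t bytes = (size_t)W * H * CHANNELS;

    /* Wasteful: calloc() zeroes the whole buffer (a large memset under the hood),
     * then the next line overwrites every byte anyway. */
    unsigned char *frame = calloc(bytes, 1);
    if (frame == NULL)
        return 1;
    memset(frame, 0xFF, bytes);   /* the zeroing above was thrown away */

    /* Cheaper: plain malloc() when the buffer is about to be fully overwritten. */
    unsigned char *frame2 = malloc(bytes);
    if (frame2 == NULL)
        return 1;
    memset(frame2, 0xFF, bytes);

    printf("%u %u\n", frame[0], frame2[0]);
    free(frame);
    free(frame2);
    return 0;
}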
In this paper, it is written that an 8-byte sequential write to Optane PM has a latency of 90 ns with clwb and 62 ns with ntstore, while a sequential read takes 169 ns.
But in my test with an Intel 5218R CPU, clwb takes about 700 ns and ntstore about 1200 ns. Of course, my test method differs from the paper's, but the results are so much worse that they seem unreasonable, and my test is closer to actual usage.
During the test, did the Write Pending Queue of the CPU's iMC or the WC buffer in the Optane PM become the bottleneck, causing stalls, so that the measured latency is inaccurate? If this is the case, is there a tool to detect it?
#include "libpmem.h"
#include "stdio.h"
#include "x86intrin.h"
//gcc aep_test.c -o aep_test -O3 -mclwb -lpmem
int main()
{
size_t mapped_len;
char str[32];
int is_pmem;
sprintf(str, "/mnt/pmem/pmmap_file_1");
int64_t *p = pmem_map_file(str, 4096 * 1024 * 128, PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
if (p == NULL)
{
printf("map file fail!");
exit(1);
}
if (!is_pmem)
{
printf("map file fail!");
exit(1);
}
struct timeval start;
struct timeval end;
unsigned long diff;
int loop_num = 10000;
_mm_mfence();
gettimeofday(&start, NULL);
for (int i = 0; i < loop_num; i++)
{
p[i] = 0x2222;
_mm_clwb(p + i);
// _mm_stream_si64(p + i, 0x2222);
_mm_sfence();
}
gettimeofday(&end, NULL);
diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;
printf("Total time is %ld us\n", diff);
printf("Latency is %ld ns\n", diff * 1000 / loop_num);
return 0;
}
Any help or correction is much appreciated!
The main reason is that repeatedly flushing the same cache line is delayed dramatically [1].
You are testing the average latency instead of the best-case latency, as the FAST '20 paper does.
ntstore is more expensive than clwb, so its latency is higher; I guess there is a typo in your first paragraph.
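To isolate the repeated-flush effect, you could time two variants of the question's loop, one touching a new cache line each iteration and one hammering a single line. A rough sketch, reusing p and loop_num from the program above and assuming each variant is timed separately:

/* Sketch only, dropped in place of the timed loop in the program above.
 * A 64-byte stride puts every store on its own cache line; the second
 * loop hammers a single line. */
const int stride = 64 / sizeof(int64_t);   /* 8 int64_t per cache line */

/* Case A: flush a different cache line each iteration. */
for (int i = 0; i < loop_num; i++)
{
    p[i * stride] = 0x2222;
    _mm_clwb(p + i * stride);
    _mm_sfence();
}

/* Case B: flush the same cache line every iteration (expected to be slower). */
for (int i = 0; i < loop_num; i++)
{
    p[0] = 0x2222;
    _mm_clwb(p);
    _mm_sfence();
}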
Appended on 4.14:
Q: Are there tools to detect a possible bottleneck in the WPQ or WC buffers?
A: You can get a baseline when the PM is idle, and use this baseline to indicate a possible bottleneck.
Tools:
Intel Memory Bandwidth Monitoring
Read two hardware counters from the performance monitoring unit (PMU) in the processor: 1) UNC_M_PMM_WPQ_OCCUPANCY.ALL, which counts the accumulated number of WPQ entries at each cycle, and 2) UNC_M_PMM_WPQ_INSERTS, which counts how many entries have been inserted into the WPQ. Then calculate the queueing delay of the WPQ as UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_PMM_WPQ_INSERTS. [2]
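For example (illustrative numbers only): if, over a sampling interval, UNC_M_PMM_WPQ_OCCUPANCY.ALL accumulates 2,000,000 entry-cycles while UNC_M_PMM_WPQ_INSERTS counts 20,000 inserts, each write spent on average 2,000,000 / 20,000 = 100 cycles in the WPQ. A ratio that climbs well above the idle baseline under your test load points to the WPQ as the bottleneck.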
[1] Chen, Youmin, et al. "Flatstore: An efficient log-structured key-value storage engine for persistent memory." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
[2] Imamura, Satoshi, and Eiji Yoshida. "The analysis of inter-process interference on a hybrid memory system." Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.
https://www.usenix.org/system/files/fast20-yang.pdf describes what they're measuring: the CPU side of doing one store + clwb + mfence for a cached write (see footnote 1). So it's the CPU-pipeline latency of getting a store "accepted" into something persistent.
This isn't the same thing as making it all the way to the Optane chips themselves; the Write Pending Queue (WPQ) of the memory controller is part of the persistence domain on Cascade Lake Intel CPUs like yours (wikichip reproduces an Intel diagram showing this).
Footnote 1: Also note that clwb on Cascade Lake works like clflushopt - it just evicts. So store + clwb + mfence in a loop test would test the cache-cold case, if you don't do something to load the line before the timed interval. (From the paper's description, I think they do). Future CPUs will hopefully properly support clwb, but at least CSL got the instruction supported so future libraries won't have to check CPU features before using it.
You're doing many stores, which will fill up any buffers in the memory controller or elsewhere in the memory hierarchy. So you're measuring throughput of a loop, not latency of one store plus mfence itself in a previously-idle CPU pipeline.
Separate from that, rewriting the same line repeatedly seems to be slower than sequential write, for example. This Intel forum post reports "higher latency" for "flushing a cacheline repeatedly" than for flushing different cache lines. (The controller inside the DIMM does do wear leveling, BTW.)
Fun fact: later generations of Intel CPUs (perhaps CPL or ICX) will have even the caches (L3?) in the persistence domain, hopefully making clwb even cheaper. IDK if that would affect back-to-back movnti throughput to the same location, though, or even clflushopt.
During the test, did the Write Pending Queue of the CPU's iMC or the WC buffer in the Optane PM become the bottleneck, causing stalls, so that the measured latency is inaccurate?
Yes, that would be my guess.
If this is the case, is there a tool to detect it?
I don't know, sorry.
If mmap() was used to read a file, how can I find the amount of data mapped by mmap()?
float *map = (float *)mmap(NULL, FILESIZE, PROT_READ, MAP_SHARED, fd, 0);
The mmap system call does not read data. It just maps the file's data into your virtual address space (by indirectly configuring your MMU), and that virtual address space is changed by a successful mmap. Later, your program will read that data (or not). In your example, your program might later read map[356] if mmap has succeeded (and you should test for its failure).
Read carefully the documentation of mmap(2). The second argument (in your code, FILESIZE) defines the size of the mapping (in bytes). You might check that it is a multiple of sizeof(float) and divide it by sizeof(float) to get the number of elements in map that are meaningful and obtained from the file. The size of the mapping is rounded up to a multiple of pages. The man page of mmap(2) says:
A file is mapped in multiples of the page size. For a file that is
not a multiple of the page size, the remaining memory is zeroed when
mapped, and writes to that region are not written out to the file.
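As a minimal sketch of that calculation (assuming the file contains raw float values; error handling is deliberately terse), you can take the real file size from fstat(2) instead of a hard-coded FILESIZE and divide it by sizeof(float):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file-of-floats\n", argv[0]);
        return EXIT_FAILURE;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return EXIT_FAILURE; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return EXIT_FAILURE; }

    /* Map the whole file; the kernel rounds the mapping up to whole pages. */
    float *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    size_t nfloats = st.st_size / sizeof(float);   /* meaningful elements */
    printf("%zu floats mapped; first is %f\n", nfloats, nfloats ? map[0] : 0.0f);

    munmap(map, st.st_size);
    close(fd);
    return EXIT_SUCCESS;
}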
Data is mapped in pages. A page is usually 4096 bytes. Read more about paging.
The page size is returned by getpagesize(2) or by sysconf(3) with _SC_PAGESIZE (which usually gives 4096).
Consider reading some book like Operating Systems: Three Easy Pieces (freely downloadable) to understand how virtual memory works and what is a memory mapped file.
On Linux, the /proc/ filesystem (see proc(5)) is very useful to understand the virtual address space of some process: try cat /proc/$$/maps in your terminal, and read more to understand its output. For a process of pid 1234, try also cat /proc/1234/maps
From inside your process, you could even read sequentially the /proc/self/maps pseudo-file to understand its virtual address space, like here.
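A tiny sketch of doing exactly that, simply echoing the pseudo-file (Linux-specific):

#include <stdio.h>

int main(void)
{
    char line[512];
    FILE *f = fopen("/proc/self/maps", "r");   /* this process's own mappings */

    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof line, f) != NULL)
        fputs(line, stdout);
    fclose(f);
    return 0;
}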
I'm using a STM32F401VCT6U "discovery" board, and I need to provide a way for the user to write addresses in memory at runtime.
I wrote what can be simplified to the following function:
uint8_t Write(uint32_t address, uint8_t* values, uint8_t count)
{
    uint8_t index;
    for (index = 0; index < count; ++index) {
        if (IS_FLASH_ADDRESS(address + index)) {
            /* flash write */
            FLASH_Unlock();
            if (FLASH_ProgramByte(address + index, values[index]) != FLASH_COMPLETE) {
                return FLASH_ERROR;
            }
            FLASH_Lock();
        } else {
            /* ram write */
            ((uint8_t*)address)[index] = values[index];
        }
    }
    return NO_ERROR;
}
In the above, address is the base address, values is a buffer of size at least count containing the bytes to write to memory, and count is the number of bytes to write.
Now, my problem is the following: when the above function is called with a base address in flash and count = 100, it works normally the first few times, writing the passed values buffer to flash. After those first few calls, however, I can no longer write arbitrary values: I can only reset bits in the values already in flash. For example, an attempt to write 0xFF over 0x7F will leave 0x7F in the flash, writing 0xFE over 0x7F will leave 0x7E, and writing 0x00 over any value will succeed (but no other value can be written to that address afterwards).
I can still write normally to other addresses in the flash by changing the base address, but again only a few times (two or three calls with count = 100).
This behaviour suggests that the flash's maximum write count has been reached, but I cannot imagine it being reached so quickly. I'd expect at the very least 10,000 writes before exhaustion.
So what am I doing wrong?
You have misunderstood how flash works: it is not, for example, as straightforward as writing EEPROM. The behaviour you are describing is normal for flash.
To repeatedly write the same flash address, the whole sector must first be erased using FLASH_EraseSector. Generally, any data that needs to be preserved during this erase has to be buffered either in RAM or in another flash sector.
If you are repeatedly writing a small block of data and are worried about flash burnout due to too many erase/write cycles, you would want to write an interface to the flash where, on each write, you move your data along the flash sector to unwritten flash, keeping track of its current offset from the start of the sector. Only when you run out of bytes in the sector would you need to erase and start again at the start of the sector.
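A rough sketch of that append-along-the-sector idea, assuming the same Standard Peripheral Library calls as in the question (the sector bounds are placeholders, and a real implementation would also scan for the first erased 0xFF byte at boot to recover next_free):

#include <stdint.h>
#include "stm32f4xx_flash.h"   /* FLASH_Unlock/Lock, FLASH_ProgramByte (SPL) */

#define SECTOR_START  0x08060000u   /* placeholder sector bounds: adjust to a */
#define SECTOR_END    0x08080000u   /* sector your application doesn't use    */

static uint32_t next_free = SECTOR_START;

/* Append one byte at the next unwritten location; returns 0 on success,
 * -1 when the sector is full or programming failed (the caller must then
 * erase the sector and rewrite whatever data is still live). */
int AppendByte(uint8_t value)
{
    if (next_free >= SECTOR_END)
        return -1;                       /* out of room: erase needed */

    FLASH_Unlock();
    FLASH_Status status = FLASH_ProgramByte(next_free, value);
    FLASH_Lock();

    if (status != FLASH_COMPLETE)
        return -1;

    ++next_free;
    return 0;
}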
ST's "right way" is detailed in AN3969: EEPROM emulation in STM32F40x/STM32F41x microcontrollers
This is more or less the process:
Reserve two Flash pages
Write the latest data to the next available location along with its 'EEPROM address'
When you run out of room on the first page, write all of the latest values to the second page and erase the first
Begin writing values where you left off on page 2
When you run out of room on page 2, repeat on page 1
This is insane, but I didn't come up with it.
I have a working and tested solution, but it is rather different from @Ricibob's answer, so I decided to make this an answer of its own.
Since my user can write anywhere in a select flash sector, my application cannot take on the responsibility of erasing the sector when needed while buffering to RAM only the data that needs to be preserved.
As a result, I transferred to my user the responsibility of erasing the sector when a write to it doesn't work (this way, the user remains free to use another address in the sector to avoid too many write/erase cycles).
Solution
Basically, I expose a write(uint32_t startAddress, uint8_t count, uint8_t* values) function that returns WRITE_SUCCESSFUL on success and CANNOT_WRITE_FLASH on failure.
I also provide my user with a getSector(uint32_t address) function that returns the id, start address and end address of the sector containing the address passed as a parameter. This way, the user knows what range of addresses is affected by the erase operation.
Lastly, I expose an eraseSector(uint8_t sectorID) function that erases the flash sector whose id has been passed as a parameter.
Erase Policy
The policy for a failed write is different from @Ricibob's suggestion of "erase if the value in flash is different from FF", because the Flash programming manual documents that a write will succeed as long as it only resets bits from '1' to '0' (which matches the behaviour I observed in the question):
Note: Successive write operations are possible without the need of an erase operation when
changing bits from ‘1’ to ‘0’.
Writing ‘1’ requires a Flash memory erase operation.
If an erase and a program operation are requested simultaneously, the erase operation is
performed first.
So I use the macro CAN_WRITE(a,b), where a is the original value in flash and b the desired value. The macro is defined as:
!(~a & b)
which works because:
the logical NOT (!) turns 0 into true and everything else into false, so ~a & b must equal 0 for the macro to be true;
any bit that is 1 in a is 0 in ~a, so the corresponding bit of ~a & b is 0 whatever its value in b (you can turn a 1 into either 1 or 0);
if a bit is 0 in a, then it is 1 in ~a; if the corresponding bit of b is 1, then ~a & b != 0 and we cannot write, while if b's bit is 0 it's OK (you can turn a 0 into 0 only, not into 1).
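As a sketch of how that check could be wired into the exposed write() (the constants' values and the volatile read-back are illustrative; CAN_WRITE, CANNOT_WRITE_FLASH, WRITE_SUCCESSFUL and the SPL calls are the names used above):

#include <stdint.h>
#include "stm32f4xx_flash.h"

#define CAN_WRITE(a, b)  (!(~(a) & (b)))   /* true if b only clears bits of a */

#define WRITE_SUCCESSFUL    0   /* illustrative values */
#define CANNOT_WRITE_FLASH  1

/* Refuse the request up front if any byte would need a 0 -> 1 transition,
 * so the caller knows it must erase the sector (via eraseSector()) and retry. */
uint8_t write(uint32_t startAddress, uint8_t count, uint8_t *values)
{
    for (uint8_t index = 0; index < count; ++index) {
        uint8_t current = *(volatile uint8_t *)(startAddress + index);
        if (!CAN_WRITE(current, values[index]))
            return CANNOT_WRITE_FLASH;
    }

    FLASH_Unlock();
    for (uint8_t index = 0; index < count; ++index) {
        if (FLASH_ProgramByte(startAddress + index, values[index]) != FLASH_COMPLETE) {
            FLASH_Lock();
            return CANNOT_WRITE_FLASH;
        }
    }
    FLASH_Lock();

    return WRITE_SUCCESSFUL;
}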
List of flash sectors in the STM32F4
Lastly, and for future reference (as it is not that easy to find), the list of flash sectors in the STM32F4 can be found on page 7 of the Flash programming manual.
I have a C program running on an AVR32 microcontroller (UC3C0512C).
Issuing the avr32-size -A PROGRAM.elf command generates the following output:
PROGRAM.elf :
section              size        addr
.reset               8200  2147483648
.rela.got               0  2147491848
.text               99512  2147491848
.exception            512  2147591680
.rodata              5072  2147592192
.dalign                 4           4
.data                7036           8
.balign                 4        7044
.bss                 5856        7048
.heap               48536       12904
.comment               48           0
.debug_aranges       8672           0
.debug_pubnames     14476           0
.debug_info        311236           0
.debug_abbrev       49205           0
.debug_line        208324           0
.debug_frame        23380           0
.debug_str          43961           0
.debug_loc          63619           0
.debug_macinfo   94469328           0
.stack               4096       61440
.data_hram0           512  2684354560
.debug_ranges        8368           0
Total            95379957
Can someone explain how to interpret these values?
How can I calculate the flash and ram usage based on this list?
Update 1:
Without the -A flag, I am getting the following:
text data bss dec hex filename
113296 7548 58496 179340 2bc8c PROGRAM.elf
Update 2:
I'm not using dynamic memory allocation, so according to the avr-libc user manual, the free RAM space should simply be: stack pointer minus __heap_start.
In this case: 61440 - 12904 = 48536 bytes of free RAM.
Can someone confirm that?
(There is a mismatch in the two outputs in your question. The bss number is wildly different.)
If you don't use malloc, and don't count the stack, then yes, the RAM usage is the data plus the bss (plus some alignment spacing). The data are the variables that are set in a declaration, and the bss are the variables that are not. The C runtime will probably initialize them to 0, but it doesn't have to.
The flash usage will be the text and the data. That is, the flash will include the program instructions and C runtime, but also the values that need to get copied into RAM on startup to initialize those variables. This data is generally tacked onto the end of the program instructions.
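For example, with the numbers from Update 1 (text = 113296, data = 7548, bss = 58496), that rule of thumb gives: RAM ≈ data + bss = 7548 + 58496 = 66044 bytes, and flash ≈ text + data = 113296 + 7548 = 120844 bytes.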
Re: update 2
RAM holds global variables, the heap, and then the stack in that order.
The global variables can be initialized in the program, or not. The .data section is stored in flash, and the C runtime copies these values into the beginning of RAM, where the corresponding variables live, before your code runs. The .bss section of global variables needs space in RAM to hold the values, but they aren't necessarily initialized. The C runtime that comes with avr-gcc does actually initialize them to 0. The point is that you don't need to store an array of 0s to copy over, as you do with the .data section.
You are not using the heap, but dynamically allocated memory would be obtained from the addresses between __heap_start and __heap_end.
But the stack is not limited. Yes, the stack-pointer is initialized at startup, but it changes as your program runs, and can move well into the heap or even into the global variables (stack overflow). The stack pointer moves whenever a function is called, or local variables within a function are used. For example, a large array declared inside a function will go on the stack.
So in answer to your question, there is no RAM that is guaranteed to remain free.
I think you should remove the -A (all) flag, since that gives you the more low-level list you're showing.
The default output is easier to parse, and seems to directly state the values you're after.
Note: I didn't try this, as I'm not at a system with an AVR toolchain installed.
I guess that in your linker script you have RAM at 0 and flash at 0x80000000, so everything that needs to go to RAM is at addresses from 0 upward (.stack is the last, at 61440, spanning the next 4K). So you would need a bit more than 64k of RAM. Everything else you have is flash.
That is provided that your linker script is correct.
Also see unwind's comment.
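Working that out from the -A listing: the last RAM section is .stack at address 61440 with size 4096, so everything placed in RAM ends at 61440 + 4096 = 65536 bytes.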
These values are the sections of the compiled ELF file. See the docs for the details. This article is also helpful.
The section titled .text holds the program instructions, i.e. the compiled machine code. The .data section represents the size of the initialized variables (ints, arrays, etc.). The size column has the significant information: the size of each section in bytes. The .stack and .heap sections represent the memory reserved for the program's runtime stack and heap.
You can try
avr32-nm --print-size --radix d --demangle PROGRAM.elf
to get the sizes in decimal notation.
Then you can copy & paste into a spreadsheet, filter, sort by the sections, and sum it up.