What is `__GI_memset`? Why does it cost so much CPU? - libc

I'm new to perf, and I'm trying to use it to analyse my program.
This is what I got when running perf top:
PerfTop: 296 irqs/sec kernel:62.8% exact: 0.0% [1000Hz cycles:ppp], (all, 6 CPUs)
-----------------------------------------------------------------------------------------------------------------------
65.43% libc-2.23.so [.] __GI_memset
1.55% libopencv_imgcodecs.so.4.4.0 [.] cv::icvCvt_BGR2RGB_8u_C3R
1.54% libc-2.23.so [.] malloc
1.32% libc-2.23.so [.] _int_free
0.92% [kernel] [k] clear_page
0.91% libjpeg.so.8.0.2 [.] 0x000000000001b828
0.90% libc-2.23.so [.] memcpy
So I just wonder what is costing 65% of my CPU: is it really just memset in libc?
If it is, how come it costs this much?

What is __GI_memset?
It's an internal alias for memset: glibc gives its internal callers a hidden __GI_ alias so that calls from inside libc don't go through the PLT, but it is the same memset code.
Why does it cost so much CPU resource?
Because you call it a lot, or because you hand it a lot of memory to set to some value.
Judging by your next most expensive symbol, cv::icvCvt_BGR2RGB_8u_C3R, you are doing some kind of image processing, and possibly allocating cleared images.
One common mistake is to allocate a cleared image and immediately set it to something else (thus wasting the time spent clearing it), as in the sketch below. But there is not enough information here to tell whether that is what you are doing.
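For illustration only (plain C with hypothetical function names, not your actual code or the OpenCV API), the wasteful pattern and the cheaper alternative look roughly like this:

#include <stdlib.h>
#include <string.h>

/* Wasteful: calloc() clears the whole buffer (that clearing is what shows up
 * as memset or clear_page in a profile), and then every byte is overwritten
 * anyway, so the clearing work is thrown away. Error checks omitted. */
unsigned char *copy_pixels_wasteful(const unsigned char *src, size_t n)
{
    unsigned char *dst = calloc(n, 1);   /* zero fill, immediately wasted */
    memcpy(dst, src, n);                 /* overwrites every byte */
    return dst;
}

/* Cheaper: skip the clearing when every byte will be written anyway. */
unsigned char *copy_pixels_better(const unsigned char *src, size_t n)
{
    unsigned char *dst = malloc(n);      /* no zero fill */
    memcpy(dst, src, n);
    return dst;
}

Whether something like the first version is happening in your program depends on how your image buffers are created; a profile with call graphs (perf record -g) would show who is calling memset.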

Related

What is the latency of `clwb` and `ntstore` on Intel's Optane Persistent Memory?

In this paper, it is written that an 8-byte sequential write with clwb or ntstore on Optane PM has a latency of 90 ns or 62 ns respectively, and a sequential read takes 169 ns.
But in my test with an Intel Xeon Gold 5218R CPU, clwb takes about 700 ns and ntstore about 1200 ns. Of course my test method differs from the paper's, but the results are so much worse that something seems unreasonable, and my test is closer to actual usage.
During the test, did the Write Pending Queue (WPQ) of the CPU's iMC or the WC buffer in the Optane PM become the bottleneck, causing stalls, so that the measured latency is inaccurate? If this is the case, is there a tool to detect it?
#include "libpmem.h"
#include "stdio.h"
#include "x86intrin.h"
//gcc aep_test.c -o aep_test -O3 -mclwb -lpmem
int main()
{
size_t mapped_len;
char str[32];
int is_pmem;
sprintf(str, "/mnt/pmem/pmmap_file_1");
int64_t *p = pmem_map_file(str, 4096 * 1024 * 128, PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
if (p == NULL)
{
printf("map file fail!");
exit(1);
}
if (!is_pmem)
{
printf("map file fail!");
exit(1);
}
struct timeval start;
struct timeval end;
unsigned long diff;
int loop_num = 10000;
_mm_mfence();
gettimeofday(&start, NULL);
for (int i = 0; i < loop_num; i++)
{
p[i] = 0x2222;
_mm_clwb(p + i);
// _mm_stream_si64(p + i, 0x2222);
_mm_sfence();
}
gettimeofday(&end, NULL);
diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;
printf("Total time is %ld us\n", diff);
printf("Latency is %ld ns\n", diff * 1000 / loop_num);
return 0;
}
Any help or correction is much appreciated!
The main reason is that repeatedly flushing the same cache line is delayed dramatically [1].
You are measuring the average latency under load, not the best-case latency as the FAST '20 paper does.
ntstore is more expensive than clwb, so its latency is higher; I guess the ordering in your first paragraph is a typo.
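As a hedged illustration of the first point: in your loop, eight consecutive iterations write and flush the same 64-byte line, because p[i] advances by only 8 bytes. Striding by a whole cache line touches a fresh line on every iteration, for example:

// Sketch only: same loop as yours, but each iteration hits a different cache line.
// 8 * sizeof(int64_t) = 64 bytes, the x86 cache-line size.
for (int i = 0; i < loop_num; i++)
{
    p[i * 8] = 0x2222;
    _mm_clwb(p + i * 8);
    _mm_sfence();
}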
Appended on 4.14:
Q: Is there a tool to detect a possible bottleneck in the WPQ or the buffers?
A: You can take a baseline measurement while the PM is idle, and use that baseline to spot a possible bottleneck.
Tools:
Intel Memory Bandwidth Monitoring
It reads two hardware counters from the performance monitoring unit (PMU) in the processor: 1) UNC_M_PMM_WPQ_OCCUPANCY.ALL, which accumulates the number of WPQ entries in each cycle, and 2) UNC_M_PMM_WPQ_INSERTS, which counts how many entries have been inserted into the WPQ. Then calculate the queueing delay of the WPQ as UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_PMM_WPQ_INSERTS. [2]
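For example, with a recent Linux perf whose event list includes the Cascade Lake uncore aliases (check perf list first; the exact event names here are my assumption), something like:

# system-wide while the test runs; the iMC counters are uncore events
perf stat -a -e unc_m_pmm_wpq_occupancy.all,unc_m_pmm_wpq_inserts ./aep_test

Dividing the reported occupancy by the inserts then gives the average queueing delay in iMC cycles.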
[1] Chen, Youmin, et al. "Flatstore: An efficient log-structured key-value storage engine for persistent memory." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
[2] Imamura, Satoshi, and Eiji Yoshida. "The analysis of inter-process interference on a hybrid memory system." Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.
https://www.usenix.org/system/files/fast20-yang.pdf describes what they're measuring: the CPU side of doing one store + clwb + mfence for a cached write (see footnote 1). So it is the CPU-pipeline latency of getting a store "accepted" into something persistent.
This isn't the same thing as making it all the way to the Optane media itself; the Write Pending Queue (WPQ) of the memory controllers is part of the persistence domain on Cascade Lake Intel CPUs like yours; wikichip quotes an Intel diagram showing this.
Footnote 1: Also note that clwb on Cascade Lake works like clflushopt - it just evicts. So a store + clwb + mfence loop tests the cache-cold case, unless you do something to load the line before the timed interval. (From the paper's description, I think they do.) Future CPUs will hopefully support clwb properly, but at least CSL got the instruction supported, so future libraries won't have to check CPU features before using it.
You're doing many stores back to back, which will fill up any buffers in the memory controller or elsewhere in the memory hierarchy. So you're measuring the throughput of a loop, not the latency of one store plus mfence in a previously idle CPU pipeline.
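To get closer to a one-operation number (this is my sketch, not the paper's exact harness), you could warm the line with a load and then time a single store + clwb + sfence with the TSC, instead of averaging a long back-to-back loop:

// Sketch: one persist, timed in TSC cycles (x86intrin.h is already included).
volatile int64_t warm = p[0];           // bring the line into cache first
(void)warm;
_mm_mfence();
unsigned long long t0 = __rdtsc();
p[0] = 0x2222;
_mm_clwb(p);
_mm_sfence();
unsigned long long t1 = __rdtsc();
printf("one persist: %llu TSC cycles\n", t1 - t0);

Repeating that on idle lines and taking the minimum gives something closer to a best-case latency figure.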
Separate from that, rewriting the same line repeatedly appears to be slower than writing sequential lines; for example, this Intel forum post reports higher latency for "flushing a cacheline repeatedly" than for flushing different cache lines. (The controller inside the DIMM does do wear leveling, by the way.)
Fun fact: later generations of Intel CPUs (perhaps CPL or ICX) will have even the caches (L3?) in the persistence domain, hopefully making clwb even cheaper. IDK if that would affect back-to-back movnti throughput to the same location, though, or even clflushopt.
During the test, did the Write Pending Queue of CPU's iMC or the WC buffer in the optane PM become the bottleneck, causing blockage, and the measured latency has been inaccurate?
Yes, that would be my guess.
If this is the case, is there a tool to detect it?
I don't know, sorry.

Cannot enforce memory limits in SLURM

I am using Slurm on a single node (control and compute) and I cannot seem to limit memory correctly. The script calls SBATCH with small memory values (3G), but I see values in top that exceed 25G. squeue shows the requested values correctly:
squeue -o "%C %m"
CPUS MIN_MEMORY
2 3G
This is my slurm.conf:
#
SlurmctldHost=schopenhauer
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
#ProctrackType=proctrack/cgroup
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300000
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm/slurm_jobacct.log
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
JobCompLoc=/var/log/slurm/slurm_jobcomp.log
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=schopenhauer CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 State=UNKNOWN
PartitionName=short Nodes=schopenhauer Default=YES MaxTime=INFINITE State=UP
Did I misunderstand something? Why does it say minimum memory when I want that value to be both the minimum and the maximum?
EDIT: I just noticed, by setting the requested memory to a larger value, that this doesn't work as a minimum either, i.e. many tasks were started even though there was enough RAM for only 12 of them (I requested 40G and I have 500G). Is this the same problem?
Slurm controls memory through the Linux cgroup functionality. You need to set TaskPlugin=task/cgroup in slurm.conf (Cf. https://slurm.schedmd.com/cgroups.html) and ConstrainRAMSpace=yes in cgroup.conf (Cf. https://slurm.schedmd.com/cgroup.conf.html). Then the memory requested by jobs with --mem or --mem-per-cpu becomes effectively a hard limit in addition to being a resource request.
The %m field of squeue shows the memory requested by the job. As a request, it is considered a minimum requirement. But once you configure cgroups it effectively also becomes the maximum.
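A minimal sketch of the relevant settings (every other option in both files depends on your installation):

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
CgroupAutomount=yes
ConstrainRAMSpace=yes

After changing them, restart slurmctld and slurmd so the cgroup plugins are actually loaded.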
I don't think Slurm enforces memory or CPU usage by itself here. The request is just there as an indication of what you think your job's usage will be. To set a binding memory limit you could use ulimit, something like ulimit -v 3145728 (the value is in kilobytes, so that is 3G) at the beginning of your script.
Just be aware that this will likely cause problems with your program if it actually requires the amount of memory it requests, because then it won't finish successfully.

How to interpret avr32-size output?

I have a C program running on an AVR32 microcontroller (UC3C0512C).
Issuing the avr32-size -A PROGRAM.elf command generates the following output:
PROGRAM.elf :
section size addr
.reset 8200 2147483648
.rela.got 0 2147491848
.text 99512 2147491848
.exception 512 2147591680
.rodata 5072 2147592192
.dalign 4 4
.data 7036 8
.balign 4 7044
.bss 5856 7048
.heap 48536 12904
.comment 48 0
.debug_aranges 8672 0
.debug_pubnames 14476 0
.debug_info 311236 0
.debug_abbrev 49205 0
.debug_line 208324 0
.debug_frame 23380 0
.debug_str 43961 0
.debug_loc 63619 0
.debug_macinfo 94469328 0
.stack 4096 61440
.data_hram0 512 2684354560
.debug_ranges 8368 0
Total 95379957
Can someone explain how to interpret these values?
How can I calculate the flash and RAM usage based on this list?
Update 1:
Without the -A flag, I am getting the following:
text data bss dec hex filename
113296 7548 58496 179340 2bc8c PROGRAM.elf
Update 2:
I'm not using dynamic memory allocation, so according to the avr-libc user manual, the free RAM space should simply be the stack pointer minus __heap_start.
In this case: 61440 - 12904 = 48536 bytes of free RAM.
Can someone confirm that?
(There is a mismatch between the two outputs in your question: the bss numbers are wildly different.)
If you don't use malloc, and don't count the stack, then yes, the RAM usage is the data plus the bss (plus some alignment spacing). The data are the variables that are set in a declaration, and the bss are the variables that are not. The C runtime will probably initialize them to 0, but it doesn't have to.
The flash usage will be the text plus the data. That is, the flash holds the program instructions and the C runtime, but also the initial values that need to be copied into RAM on startup to initialize those variables. This data is generally tacked onto the end of the program instructions.
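Roughly, using the numbers from the default (no -A) output in the question:

flash ≈ text + data = 113296 + 7548 = 120844 bytes
RAM ≈ data + bss = 7548 + 58496 = 66044 bytes (plus whatever you reserve for the stack and heap)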
Re: update 2
RAM holds the global variables, then the heap, and then the stack, in that order.
The global variables can be initialized in the program, or not. The .data section is stored in flash, and the C runtime copies those values into the beginning of RAM, where the corresponding variables live, before your code runs. The .bss section of global variables also needs space in RAM to hold values, but they aren't necessarily initialized; the C runtime that comes with avr-gcc does actually initialize them to 0. The point is that you don't need to store an array of zeros in flash to copy over, as you do with the .data section.
You are not using the heap, but dynamically allocated memory would be obtained from the addresses between __heap_start and __heap_end.
The stack, however, is not limited. Yes, the stack pointer is initialized at startup, but it changes as your program runs, and it can move well into the heap or even into the global variables (a stack overflow). The stack pointer moves whenever a function is called or local variables within a function are used. For example, a large array declared inside a function will go on the stack.
So in answer to your question, there is no RAM that is guaranteed to remain free.
I think you should drop the -A (all) flag, since it gives you the more low-level list you're showing.
The default output is easier to parse, and seems to directly state the values you're after.
Note: I haven't tried this; I don't have a system with an AVR toolchain installed.
I guess that in your linker script you have RAM at 0 and flash at 0x80000000, so everything that needs to go to RAM sits at the low addresses (.stack is the last section, at 61440, spanning the next 4k, so the highest RAM address used is 61440 + 4096 = 65536). So you would need about 64k of RAM. Everything else you have is flash.
That is provided that your linker script is correct.
Also see unwind's comment.
These values are the sections of the compiled C program. See the docs for the details. This article is also helpful.
The .text section holds the instructions, i.e. the machine code. The .data section holds the initialized variables (ints, arrays, etc.). The size column is the significant one: it gives the size of each section in bytes. The .stack and .heap sections represent the memory reserved for the stack and the heap when the program runs.
You can try
avr-nm --print-size --radix d --demangle x.elf
to get the sizes in decimal notation.
Then you can copy & paste into a spreadsheet, filter, sort by the sections, and sum it up.

Will mmap use contiguous memory? (on Solaris)

I used mmap (just trying to understand how mmap works) to allocate 96k of anonymous memory, but it looks like it split the 96k into 64k and 32k. However, when I allocate 960k, it allocates only one chunk of 960k. When will Solaris split the allocated memory into several parts?
Code:
#include <sys/mman.h>
#include <stdio.h>

#define PROT PROT_READ | PROT_WRITE
#define MAP MAP_ANON | MAP_PRIVATE

void *src;
if ((src = mmap(0, 88304, PROT, MAP, -1, 0)) == MAP_FAILED)
    printf("mmap error for input");
if ((src = mmap(0, 983040, PROT, MAP, -1, 0)) == MAP_FAILED)
    printf("mmap error for input");
if ((src = mmap(0, 98304, PROT, MAP, -1, 0)) == MAP_FAILED)
    printf("mmap error for input");
Truss:
mmap(0x00000000, 88304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0)
= 0xFFFFFFFF7E900000
mmap(0x00000000, 983040, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0)
= 0xFFFFFFFF7E800000
mmap(0x00000000, 98304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0)
= 0xFFFFFFFF7E700000
Pmap:
FFFFFFFF7E700000 64 - - - rw--- [anon]
==> strange: the 96k allocation was broken into two parts.
FFFFFFFF7E710000 32 - - - rw--- [anon]
FFFFFFFF7E800000 960 - - - rw--- [anon]
FFFFFFFF7E900000 64 - - - rw--- [anon]
FFFFFFFF7E910000 24 - - - rw--- [anon]
FFFFFFFF7EA00000 64 - - - rw--- [anon]
FFFFFFFF7EA10000 32 - - - rw--- [anon]
That is contiguous memory; you can tell by the addresses (F...700000 + 64K = F...710000), so I don't think you have to worry about that. I'm pretty certain that mmap is required to give you contiguous memory in your address space. It would be pretty useless otherwise, since it only gives you one base address: with two non-contiguous blocks, there would be no way to find that second block.
So I guess your question is: why does this show up as two blocks in the pmap?
To which my answer would be, "Stuffed if I know". But I can make an intelligent guess which is the best anyone can hope for from me at this time of the morning (pre-coffee).
I would suggest that those blocks had been allocated before to another process (or two) and had been released back to the mmap memory manager. I can see a few possibilities for how that memory manager coalesces blocks to make bigger free blocks:
it does it as soon as the memory is released (not the case, as your output shows that's not happening);
it does it periodically, and it hadn't got around to it before you requested your 96K block; or
it doesn't bother at all, because it's smart enough to do it while allocating a block to you.
I suspect it's the last one, simply because the memory manager had no problem giving you two blocks for your request, so it's obviously built to handle it. The 960K block is probably not segmented because it came from a much bigger free block.
Keep in mind this is speculation (informed, but still speculation). I've seen quite a bit of the internals of UNIX (real UNIXes, not that new kid on the block :-) but I've never had a need to delve into mmap.
I can't remember the term for it (stripes? slices? wedges? argh), but Solaris allocates pages of different sizes from pools of various sizes. This turns out to be somewhat more efficient than a uniform page size, because it uses the memory mapping hardware better. One of those sizes is 32K, another 64K, and another is 1024K, I believe. To get 96K, you got a 64K and a 32K; to get 960K, you got most of a 1024K.
The core resource for this wizardry is the Solaris Internals book. Mine, unfortunately, is in a box in the garage at the moment.
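If you want to see which page sizes your Solaris box actually supports, here is a small sketch (getpagesizes(3C) is Solaris-specific):

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t sizes[16];
    int n = getpagesizes(sizes, 16);    /* fills in the supported page sizes */
    for (int i = 0; i < n; i++)
        printf("supported page size: %zu bytes\n", sizes[i]);
    return 0;
}

pmap -s <pid> should also show which page size each mapping actually got.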
The answer depends on what you mean by contiguous. Solaris and all modern Unix and unix-like systems (probably all modern operating systems) will divide physical memory into pages, and the memory within a 'page' will be contiguous at the physical level. Most modern systems have a hardware MMU (Memory Management Unit) which will translate a virtual address to a physical address. So the mmap system call will return a contiguous virtual address space but that virtual address will be managed by an MMU which may use multiple pages depending on the size of the page(s) and the size of the memory mapping.
While all the virtual addresses within the mapping will be contiguous, and the addresses within a page will also be physically contiguous, the pages themselves, and the transitions between pages, may not be anywhere near each other physically.

How can I prevent GD from running out of memory?

I'm not sure if memory is the culprit here. I am trying to instantiate a GD image from data in memory (it previously came from a database). I try a call like this:
my $image = GD::Image->new($image_data);
$image comes back as undef. The POD for GD says that the constructor will return undef for cases of insufficient memory, so that's why I suspect memory.
The image data is in PNG format. The same thing happens if I call newFromPngData.
This works for very small images, like under 30K. However, slightly larger images, like ~70K, cause the problem. I wouldn't think that a 70K image should cause these problems, even after it is decompressed.
This script is running under CGI through Apache 2.0, on OS X 10.4, if that matters at all.
Are there any memory limitations imposed by Apache by default? Can they be increased?
Thanks for any insight!
EDIT: For clarification, the GD::Image object never gets created, so clearing out the $image_data from memory isn't really an option.
The GD library eats many bytes of memory per byte of image size; it's well over a 10:1 ratio!
When a user uploads an image to our system, we start by checking the file size before loading it into a GD image. If it's over a threshold (1 Megabyte) we don't use it but instead report an error to the user.
If we really cared we could dump it to disk, use the command line "convert" tool to rescale it to a sane size, then load the output into the GD library and remove the temporary file.
convert -define jpeg:size=800x800 tmpfile.jpg -thumbnail '800x800' -
This will scale the image so that it fits within an 800x800 square; its longest edge is now 800px, which should load safely. The command above sends the shrunk JPEG to STDOUT. The -define jpeg:size= option tells convert not to bother holding the huge original in memory, just enough of it to scale down to 800x800.
I've run into the same problem a few times.
One of my solutions was simply to increase the amount of memory available to my scripts. The other was to free the source image as soon as I was done with it:
Original Script:
$src_img = imagecreatefromstring($userfile2);
imagecopyresampled($dst_img,$src_img,0,0,0,0,$thumb_width,$thumb_height,$origw,$origh);
Edited Script:
$src_img = imagecreatefromstring($userfile2);
imagecopyresampled($dst_img,$src_img,0,0,0,0,$thumb_width,$thumb_height,$origw,$origh);
imagedestroy($src_img); // free the source image as soon as it has been resampled
By destroying the first $src_img, enough memory was freed up to handle further processing.