I want to profile my Perl script for CPU time. I found Devel::NYTProf and Devel::SmallProf, but the first one cannot show CPU time and the second one works badly; at least I couldn't find what I need.
Can you advise any tool for my purposes?
UPD: I need per-line profiling, since my script takes a lot of CPU time and I want to improve the relevant part of it.
You could try your system's time utility (not the shell's built-in one); the leading \ is not a typo, it bypasses the shell built-in:
$ \time -v perl collatz.pl
13 40 20 10 5 16 8 4 2 1
23 70 35 106 53 160 80 40
837799 525
Command being timed: "perl collatz.pl"
User time (seconds): 3.79
System time (seconds): 0.06
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.94
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 171808
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 9
Minor (reclaiming a frame) page faults: 14851
Voluntary context switches: 16
Involuntary context switches: 935
Swaps: 0
File system inputs: 1120
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
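As a sanity check, the "Percent of CPU" line is just (user + system) CPU time divided by wall-clock time. A minimal sketch, using the numbers from the output above:
# Reproduce the "Percent of CPU this job got" line from the \time -v report.
user_s, system_s, elapsed_s = 3.79, 0.06, 3.94  # seconds, from the output above

cpu_percent = (user_s + system_s) / elapsed_s * 100
print(f"{int(cpu_percent)}%")  # 97%; time appears to truncate rather than round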
I am trying to benchmark a Kafka cluster. I am a newbie. I built a 3-node cluster; each node has one partition. I did not change the default broker settings, and I just copied the producer and consumer code directly from the official website.
When I create a topic with replication 1 and 3 partitions, I was able to get 170 MB per second throughput. When I create a topic with replication 3 and 3 partitions, I hardly see 30 MB per second throughput.
Then I applied the production config from this link: https://kafka.apache.org/documentation#prodconfig. The result got worse.
Can you share your experience with me?
disk type  replication  insert count  message length (B)  elapsed time (s)  req per sec  concurrency  throughput (MB/s)
hdd        1            10,000,000    250                 25                400,000      1            95.37
hdd        1            10,000,000    250                 28                357,000      2            170.23
hdd        1            10,000,000    250                 55                175,000      4            166.89
hdd        1             1,000,000    250                 22                 45,400      8            86.59
hdd        1            10,000,000    250                 22                 85,000      8            162.12
hdd        3             1,000,000    250                 10                100,000      1            23.84
hdd        3             1,000,000    250                 19                 55,000      2            26.23
hdd        3             1,000,000    250                 30                 32,000      4            30.52
hdd        3             1,000,000    250                 45                 20,000      8            38.15
hdd        3            10,000,000    250                559                 18,000      8            34.33
You should expect performance to decrease when increasing replication. Your initial run had such high throughput because Kafka didn't need to copy the message data to replicas on other brokers. When you increase the replication factor, you're basically trading speed for durability.
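For what it's worth, the throughput column is consistent with req/sec * message size * concurrency, converted to MiB/s. A quick sketch verifying a few rows (the formula is inferred from the table, not taken from any Kafka tool):
# Check the throughput column: requests/sec * message bytes * concurrent
# producers, in MiB/s. Rows copied from the table above.
rows = [
    (400_000, 250, 1, 95.37),   # replication 1
    (357_000, 250, 2, 170.23),  # replication 1
    (100_000, 250, 1, 23.84),   # replication 3
]
for req_s, msg_b, conc, reported in rows:
    mib_s = req_s * msg_b * conc / 2**20
    print(f"computed {mib_s:6.2f} MB/s, reported {reported}")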
I'm using perf stat for some purposes, and to better understand the working of the tool, I wrote a program that copies a file's contents into another file. I ran the program on a 750 MB file and the stats are below.
31691336329 L1-dcache-loads
44227451 L1-dcache-load-misses
15596746809 L1-dcache-stores
20575093 L1-dcache-store-misses
26542169 cache-references
13410669 cache-misses
36859313200 cycles
75952288765 instructions
What are the units of each number? What I mean is: are they bits, bytes, or something else? Thanks in advance.
The unit is a single cache access for loads, stores, references and misses. Loads correspond to the count of load instructions executed by the processor; the same goes for stores. Misses is the count of loads and stores that were unable to get their data from the cache at that level: the L1 data cache for L1-dcache-* events, and the Last Level Cache (usually L2 or L3, depending on your platform) for cache-* events.
31 691 336 329 L1-dcache-loads
44 227 451 L1-dcache-load-misses
15 596 746 809 L1-dcache-stores
20 575 093 L1-dcache-store-misses
26 542 169 cache-references
13 410 669 cache-misses
Cycles is the total count of CPU ticks for which the CPU executed your program. If you have a 3 GHz CPU, there will be at most around 3 000 000 000 cycles per second. If the machine was busy, there will be fewer cycles available for your program.
36 859 313 200 cycles
This is the total count of instructions executed by your program:
75 952 288 765 instructions
(I will use the G suffix as an abbreviation for billions.)
From the numbers we can conclude: 76G instructions were executed in 37G cycles (around 2 instructions per CPU tick; a rather high level of IPC). You gave no information about your CPU and its frequency, but assuming a 3 GHz CPU, the running time was near 12 seconds.
Of the 76G instructions, 31G are load instructions (42%) and 15G are store instructions (21%); so only 37% of the instructions were non-memory instructions. I don't know what the size of the memory references was (byte loads and stores, 2-byte, or wide SSE movs), but 31G load instructions looks too high for a 750 MB file (the mean would be 0.02 bytes per load, while the shortest possible load or store is a single byte). So I think your program made several copies of the data, or the file was bigger.
750 MB in 12 seconds looks rather slow (60 MB/s), but this can be true if the first file was read and the second file was written to disk without caching by the Linux kernel (do you have an fsync() call in your program? Are you profiling your CPU or your HDD?). With cached files and/or a RAM drive (tmpfs, the filesystem stored in RAM), this speed should be much higher.
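To make the arithmetic explicit, here is a small sketch deriving the IPC, the estimated runtime (the 3 GHz clock is an assumption, since the CPU model was not given), and the implied copy bandwidth:
# Derived metrics from the perf counters quoted above.
instructions = 75_952_288_765
cycles       = 36_859_313_200
loads        = 31_691_336_329
stores       = 15_596_746_809
file_mb      = 750                         # size of the copied file

ipc      = instructions / cycles           # ~2.06 instructions per cycle
seconds  = cycles / 3e9                    # ~12.3 s, assuming a 3 GHz CPU
mem_frac = (loads + stores) / instructions # ~62% of instructions touch memory
mb_s     = file_mb / seconds               # ~61 MB/s if the file was copied once

print(f"IPC={ipc:.2f}  runtime~{seconds:.1f}s  "
      f"memory instructions={mem_frac:.0%}  implied {mb_s:.0f} MB/s")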
Modern versions of perf do some simple calculations in perf stat and may also print units, as shown here: http://www.bnikolic.co.uk/blog/hpc-prof-events.html
perf stat -d md5sum *
578.920753 task-clock # 0.995 CPUs utilized
211 context-switches # 0.000 M/sec
4 CPU-migrations # 0.000 M/sec
212 page-faults # 0.000 M/sec
1,744,441,333 cycles # 3.013 GHz [20.22%]
1,064,408,505 stalled-cycles-frontend # 61.02% frontend cycles idle [30.68%]
104,014,063 stalled-cycles-backend # 5.96% backend cycles idle [41.00%]
2,401,954,846 instructions # 1.38 insns per cycle
# 0.44 stalled cycles per insn [51.18%]
14,519,547 branches # 25.080 M/sec [61.21%]
109,768 branch-misses # 0.76% of all branches [61.48%]
266,601,318 L1-dcache-loads # 460.514 M/sec [50.90%]
13,539,746 L1-dcache-load-misses # 5.08% of all L1-dcache hits [50.21%]
0 LLC-loads # 0.000 M/sec [39.19%]
(wrongevent?)0 LLC-load-misses # 0.00% of all LL-cache hits [ 9.63%]
0.581869522 seconds time elapsed
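The derived columns in that output are simple ratios of the raw counters; a sketch recomputing three of them from the numbers above:
# Recompute a few derived columns printed by perf stat above.
task_clock_ms = 578.920753
cycles        = 1_744_441_333
instructions  = 2_401_954_846
branches      = 14_519_547

ghz  = cycles / (task_clock_ms * 1e6)    # cycles per nanosecond == GHz, ~3.013
ipc  = instructions / cycles             # ~1.38 insns per cycle
br_m = branches / (task_clock_ms * 1e3)  # branches per microsecond == M/sec, ~25.08

print(f"{ghz:.3f} GHz, {ipc:.2f} insns per cycle, {br_m:.3f} M branches/sec")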
UPDATE Apr 18, 2014
Please explain why cache-references do not correlate with the L1-dcache numbers
cache-references DOES correlate with the L1-dcache numbers: it is fed by L1-dcache-load-misses and L1-dcache-store-misses. Why are the numbers not equal? Because in your CPU (Core i5-2320) there are 3 levels of cache: L1, L2, L3; and the LLC (last level cache) is L3. A load or store instruction first tries to get/save its data in the L1 cache (L1-dcache-loads, L1-dcache-stores). If the address was not cached in L1, the request goes to L2 (L1-dcache-load-misses, L1-dcache-store-misses). In this run we have no exact data on how many requests were served by L2 (those counters were not included in the default set of perf stat), but we can assume that some loads/stores were served and some were not. The requests not served by L2 go on to L3 (the LLC), and we see that there were 26M references to L3 (cache-references), half of which (13M) were L3 misses (cache-misses, served by main RAM). The other half were L3 hits.
44M + 20M = 64M misses from L1 were passed to L2. 26M requests were passed from L2 to L3; those are the L2 misses. So 64M - 26M = 38 million requests were served by L2 (L2 hits).
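The same accounting, written out as a sketch over the counters from the question:
# Cache-hierarchy accounting for the counters above (L1 -> L2 -> L3 -> RAM).
l1_load_misses  = 44_227_451
l1_store_misses = 20_575_093
llc_references  = 26_542_169   # requests that reached L3
llc_misses      = 13_410_669   # requests that went all the way to RAM

to_l2   = l1_load_misses + l1_store_misses  # ~64M requests left L1
l2_hits = to_l2 - llc_references            # ~38M served by L2
l3_hits = llc_references - llc_misses       # ~13M served by L3

print(f"L2 hits ~{l2_hits/1e6:.0f}M, L3 hits ~{l3_hits/1e6:.0f}M, "
      f"RAM ~{llc_misses/1e6:.0f}M")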
I am using WinDbg to debug a memory issue on Win7.
I used !heap -s and got the following output.
0:002> !heap -s
LFH Key : 0x6573276f
Termination on corruption : ENABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-----------------------------------------------------------------------------
000f0000 00000002 1024 552 1024 257 7 1 0 0 LFH
00010000 00008000 64 4 64 2 1 1 0 0
00330000 00001002 1088 160 1088 5 2 2 0 0 LFH
00460000 00001002 256 4 256 2 1 1 0 0
012c0000 00001002 1088 408 1088 8 10 2 0 0 LFH
00440000 00001002 1088 188 1088 24 9 2 0 0 LFH
01990000 00001002 1088 188 1088 24 9 2 0 0 LFH
00420000 00001002 1088 152 1088 5 2 2 0 0 LFH
01d20000 00001002 64 12 64 3 2 1 0 0
01c80000 00001002 64 12 64 1 2 1 0 0
012e0000 00001002 776448 118128 776448 109939 746 532 0 0 LFH
External fragmentation 93 % (746 free blocks)
Virtual address fragmentation 84 % (532 uncommited ranges)
01900000 00001002 256 4 256 1 1 1 0 0
01fa0000 00001002 256 108 256 58 3 1 0 0
01c40000 00001002 64 16 64 4 1 1 0 0
03140000 00001002 64 12 64 3 2 1 0 0
33f40000 00001002 64 4 64 2 1 1 0 0
340f0000 00001002 1088 164 1088 3 5 2 0 0 LFH
-----------------------------------------------------------------------------
My question is: what is External fragmentation, and what is Virtual address fragmentation?
And what do the 93% and 84% mean?
Thank you in advance.
The fragmentation figures in the WinDbg output refer to the heap listed just before them, in your case heap 012e0000.
External fragmentation = 1 - (largest free block / total free size)
This means that the largest free block in that heap is only about 7.6 MB, although the total free size is about 107 MB (109,939 KB). In practice it means that you can't allocate more than roughly 7.6 MB at once from that heap.
For a detailed description of external fragmentation, see also Wikipedia.
Virtual address fragmentation: 1 - (commit size / virtual size)
While I have not found a good explanation of virtual address fragmentation, this is an interpretation of the formula: the virtual size is the total memory available to the heap, the commit size is what is actually in use, and the difference is unusable.
You can go into more details on that heap using !heap -f -stat -h <heap> (!heap -f -stat -h 012e0000 in your case).
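Both percentages can be reproduced from the 012e0000 row of the !heap -s output above (Commit 118128 K, Virt 776448 K, Free 109939 K); a small sketch:
# Reproduce the two fragmentation figures for heap 012e0000.
free_kb   = 109_939    # "Free" column
commit_kb = 118_128    # "Commit" column
virt_kb   = 776_448    # "Virt" column

# External fragmentation = 1 - largest free block / total free size.
# Solving for the largest free block at the reported (rounded) 93%:
largest_kb = (1 - 0.93) * free_kb
print(f"largest free block ~{largest_kb / 1024:.1f} MB")   # ~7.5 MB

# Virtual address fragmentation = 1 - commit size / virtual size.
vaf = 1 - commit_kb / virt_kb
print(f"virtual address fragmentation {vaf:.1%}")          # ~84.8%, shown as 84%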
If you are trying to debug a memory fragmentation problem, you should take a look at VMMap from Sysinternals.
http://technet.microsoft.com/en-us/sysinternals/dd535533
Not only can you see the exact size of the largest free block there, but you can also open its "Fragmentation view" to get a visual presentation of how fragmented your memory is.
Thanks to Stas Sh's answer.
I am using VMMap to analyze the memory used by a process, but I am confused by the Private Data displayed in VMMap.
I wrote a demo app that uses HeapCreate to create a private heap and then allocates a lot of small blocks from that heap with HeapAlloc.
I analyzed this demo app with VMMap; the following information is from VMMap.
Process: HeapOS.exe
PID: 2320
Type Size Committed Private Total WS Private WS Shareable WS Shared WS Locked WS Blocks Largest
Total 928,388 806,452 779,360 782,544 779,144 3,400 2,720 188
Heap 1,600 500 488 460 452 8 8 13 1,024
Private Data 888,224 774,016 774,016 774,016 774,012 4 4 24 294,912
I found that Heap is very small, but Private Data is very large.
However, the Help of VMMap explains that:
Private
Private memory is memory allocated by VirtualAlloc and not suballocated either by the Heap Manager or the .NET run time.
It cannot be shared with other processes, is charged against the system commit limit, and typically contains application data.
So I guess that Private Data is memory allocated by VirtualAlloc from the process's virtual address space that simply can't be shared with other processes, and that Private Data may be allocated by the app's own code, by the OS Heap Manager, or by the .NET runtime.
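For reference, a minimal sketch of a demo app like the one described above, using Python's ctypes to call the Win32 heap API (Windows only; the block size and count are arbitrary assumptions, not taken from the question):
# Create a private heap and carve many small blocks out of it, then inspect
# the process with VMMap while it waits.
import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.HeapCreate.restype  = wintypes.HANDLE
kernel32.HeapCreate.argtypes = (wintypes.DWORD, ctypes.c_size_t, ctypes.c_size_t)
kernel32.HeapAlloc.restype   = wintypes.LPVOID
kernel32.HeapAlloc.argtypes  = (wintypes.HANDLE, wintypes.DWORD, ctypes.c_size_t)

heap = kernel32.HeapCreate(0, 0, 0)          # growable private heap
blocks = [kernel32.HeapAlloc(heap, 0, 64)    # many small 64-byte blocks
          for _ in range(100_000)]
input("Attach VMMap to this process, then press Enter to exit...")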
I'm facing difficulty with the following question:
Consider a disk drive with the following specifications.
16 surfaces, 512 tracks/surface, 512 sectors/track, 1 KB/sector, rotation speed 3000 rpm. The disk is operated in cycle stealing mode whereby whenever a 1-byte word is ready it is sent to memory; similarly, for writing, the disk interface reads a 4-byte word from the memory in each DMA cycle. The memory cycle time is 40 ns. What is the maximum percentage of time that the CPU gets blocked during DMA operation?
The solution to this question provided on the only site I found is:
Revolutions Per Min = 3000 RPM
or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50
= 6553600 ............. (1)
Interrupt = 6553600 takes 0.2621 sec
Percentage Gain = (0.2621/1)*100
= 26 %
I have understood up to (1).
Can anybody explain to me where 0.2621 comes from? How is the interrupt time calculated? Please help.
Reversing from the numbers you've given: it's 6553600 * 40 ns that gives 0.2621 sec.
One quite obvious problem is that the comments in the calculation are somewhat wrong. It's not
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50 <- WRONG
The numbers are 512K / 4 * 50, so it's in bytes. How could that be called a 'number of tracks'? Reading a full track takes 1 full rotation, so the number of tracks readable in 1 second is 50, as there are 50 RPS.
The total bytes readable in 1 s would then be just 512K * 50, since 512K is the amount of data on a track.
But then it is further divided by 4...
So, I guess, the actual comments should be:
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
Interrupts per second = (2^19/2^2) * 50 = 6553600 (*)
Each interrupt triggers one memory op, so then:
total wasted: 6553600 * 40ns = 0.2621 sec.
However, I don't really like how the 'number of interrupts per second' is calculated; I currently don't see how or why it's just bytes/4.
The only VAGUE explanation of that "divide by 4" I can think of is:
for each byte written to the controller's memory, an event is triggered. However, the DMA controller can only read PACKETS of 4 bytes, so the hardware DMA controller must WAIT until there are at least 4 bytes ready to be read. Only then does the DMA kick in and halt the bus (or part of it) for the duration of the one memory cycle needed to copy the data. While the bus is frozen, the processor MAY have to wait. It doesn't NEED to; it can be doing its own ops and working on the cache, but if it tries touching memory, it will have to wait until the DMA finishes.
However, I don't like a few things in this "explanation", so I cannot guarantee you that it is valid. It really depends on what architecture you are analyzing and how the DMA/CPU/bus are organized.
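Putting the whole calculation in one place (following the quoted solution's assumption that one DMA transfer, and thus one potential CPU stall, happens per 4-byte word):
# DMA cycle-stealing arithmetic for the disk question above.
rpm          = 3000
track_bytes  = 512 * 1024      # 512 sectors/track * 1 KB/sector
word_bytes   = 4               # one DMA transfer moves a 4-byte word
mem_cycle_ns = 40

rps = rpm / 60                                # 50 rotations per second
transfers_s = track_bytes / word_bytes * rps  # (2**19 / 4) * 50 = 6,553,600
stolen_s = transfers_s * mem_cycle_ns * 1e-9  # ~0.2621 s stolen per second

print(f"{transfers_s:,.0f} transfers/s, CPU blocked ~{stolen_s:.0%} of the time")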
The only mistake is that it's not
no. of tracks read
It's actually the no. of interrupts that occurred (the no. of times the DMA came up with its data; that many times the CPU will be blocked).
But again, I don't know why it has been multiplied by 50, probably because of the 1-second window, but I wish to solve this without multiplying by 50.
My Solution:
Here, in 1 rotation the interface can read 512 KB of data, and 1 rotation takes 0.02 sec, so the preparation time for one byte of data is 38.1 ns, and for 4 B it takes 152.6 ns. The memory cycle time is 40 ns, so the percentage of time the CPU gets blocked is 40/(40+152.6) = 0.208 ~= 21%. But in the answer booklet the options are given as A) 10 B) 25 C) 40 D) 50. Tell me if I'm doing it wrong?
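The same per-word model as a sketch (with the arithmetic corrected; blocked fraction = memory cycle / (memory cycle + word preparation time)):
# The alternative model from "My Solution" above.
rotation_s  = 60 / 3000            # 0.02 s per rotation
track_bytes = 512 * 1024           # bytes read in one rotation

byte_ns = rotation_s / track_bytes * 1e9   # ~38.1 ns to prepare one byte
word_ns = 4 * byte_ns                      # ~152.6 ns per 4-byte word
blocked = 40 / (40 + word_ns)              # memory cycle / total

print(f"{byte_ns:.1f} ns/byte, {word_ns:.1f} ns/word, blocked ~{blocked:.0%}")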
Suppose that one instruction requires 10 clock cycles from the fetch stage to the write-back stage, and we want to calculate the time required to execute 1,000,000 instructions. Each clock cycle takes 2 ns.
(a) Calculate the time required.
The answer says 1,000,009 * 2 ns; the extra 9 is the number of clock cycles needed to fill the pipeline. Why is this? I thought that since an instruction fetch happens in each clock cycle, it would be 1,000,000 * 2 ns.
1 2 3 4 5 6 7 8 9 10
  1 2 3 4 5 6 7 8 9 10
    1 2 3 4 5 6 7 8 9 10
Let's consider these three instructions. You can see that the first instruction takes 10 clock cycles, while the next two each finish one clock cycle after the previous one, adding only 2 more clock cycles. Likewise, the remaining 999,999 instructions take 999,999 more clock cycles, so 1,000,000 instructions take (10 + 999,999) = 1,000,009 clock cycles.
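In general, a pipeline with a fill depth of k cycles executes n instructions in k + (n - 1) cycles. A quick check of the numbers:
# Pipelined execution time: the first instruction takes the full 10 cycles;
# each later one completes one cycle after its predecessor.
depth, n, cycle_ns = 10, 1_000_000, 2

total_cycles = depth + (n - 1)            # 1,000,009 cycles
total_ms = total_cycles * cycle_ns / 1e6
print(f"{total_cycles:,} cycles -> {total_ms:.6f} ms")   # 2.000018 ms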