External fragmentation and virtual address fragmentation in WinDbg

I am using WinDbg to debug a memory issue on Win7.
I ran !heap -s and got the following output.
0:002> !heap -s
LFH Key                   : 0x6573276f
Termination on corruption : ENABLED
  Heap     Flags   Reserv  Commit   Virt   Free  List   UCR  Virt  Lock  Fast
                     (k)     (k)     (k)    (k) length blocks cont. heap
-----------------------------------------------------------------------------
000f0000 00000002    1024    552    1024    257     7     1    0      0   LFH
00010000 00008000      64      4      64      2     1     1    0      0
00330000 00001002    1088    160    1088      5     2     2    0      0   LFH
00460000 00001002     256      4     256      2     1     1    0      0
012c0000 00001002    1088    408    1088      8    10     2    0      0   LFH
00440000 00001002    1088    188    1088     24     9     2    0      0   LFH
01990000 00001002    1088    188    1088     24     9     2    0      0   LFH
00420000 00001002    1088    152    1088      5     2     2    0      0   LFH
01d20000 00001002      64     12      64      3     2     1    0      0
01c80000 00001002      64     12      64      1     2     1    0      0
012e0000 00001002  776448 118128  776448 109939   746   532    0      0   LFH
    External fragmentation  93 % (746 free blocks)
    Virtual address fragmentation  84 % (532 uncommited ranges)
01900000 00001002     256      4     256      1     1     1    0      0
01fa0000 00001002     256    108     256     58     3     1    0      0
01c40000 00001002      64     16      64      4     1     1    0      0
03140000 00001002      64     12      64      3     2     1    0      0
33f40000 00001002      64      4      64      2     1     1    0      0
340f0000 00001002    1088    164    1088      3     5     2    0      0   LFH
-----------------------------------------------------------------------------
My question is: what is external fragmentation, and what is virtual address fragmentation?
And what do the 93 % and 84 % mean?
Thank you in advance.

The fragmentation output of WinDbg refers to the heap listed just before those lines, in your case heap 012e0000.
External fragmentation = 1 - (largest free block / total free size)
This means that the largest free block in that heap is 7.63 MB, although the total free size is 109 MB. Typically this also means that you can't allocate more than 7.63 MB from that heap at once.
For a detailed description of external fragmentation, see also Wikipedia.
Virtual address fragmentation = 1 - (commit size / virtual size)
While I have not found a good explanation for virtual address fragmentation, this is an interpretation of the formula: virtual size is the total address range reserved for the heap, commit size is the part that is actually in use, and the remainder (1 - commit/virtual) is reserved but unusable.
You can go into more details on that heap using !heap -f -stat -h <heap> (!heap -f -stat -h 012e0000 in your case).
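To make the two formulas concrete, here is a small C sketch (my own illustration; the figures are copied from the Commit, Virt and Free columns of heap 012e0000 above, and the largest-free-block size is a placeholder chosen to reproduce the reported 93 %, since !heap -s does not print it directly):

#include <stdio.h>

/* The two formulas from above, with all sizes in KB. */
static double external_fragmentation(double largest_free, double total_free)
{
    return 1.0 - largest_free / total_free;
}

static double virtual_address_fragmentation(double commit, double virt)
{
    return 1.0 - commit / virt;
}

int main(void)
{
    /* Heap 012e0000 from the !heap -s output: Commit, Virt and Free columns (KB). */
    double commit_kb = 118128.0;
    double virt_kb   = 776448.0;
    double free_kb   = 109939.0;

    /* Placeholder: a largest free block of roughly 7.6 MB reproduces the 93 %. */
    double largest_free_kb = 7690.0;

    /* WinDbg rounds these down and shows 93 % and 84 % respectively. */
    printf("external fragmentation:        %.1f %%\n",
           100.0 * external_fragmentation(largest_free_kb, free_kb));
    printf("virtual address fragmentation: %.1f %%\n",
           100.0 * virtual_address_fragmentation(commit_kb, virt_kb));
    return 0;
}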

If you are trying to debug a memory fragmentation problem, you should take a look at VMMap from Sysinternals.
http://technet.microsoft.com/en-us/sysinternals/dd535533
Not only can you see the exact size of the largest free block there, but you can also open its "Fragmentation view" to get a visual representation of how fragmented your memory is.

Thanks for Stas Sh's answer.
I am using VMMap to analyze the memory used by a process.
But I am confused by the Private Data shown in VMMap.
I wrote a demo app that uses HeapCreate to create a private heap and then allocates a lot of small blocks from that heap with HeapAlloc.
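A minimal sketch of such a demo (the block size and count here are arbitrary illustration values):

#include <windows.h>
#include <stdio.h>

/* Sketch: create a private heap and fill it with many small blocks. */
int main(void)
{
    HANDLE heap = HeapCreate(0, 0, 0);   /* growable private heap, default sizes */
    if (heap == NULL) {
        printf("HeapCreate failed: %lu\n", GetLastError());
        return 1;
    }

    for (int i = 0; i < 1000000; i++) {
        void *block = HeapAlloc(heap, 0, 64);   /* many small allocations */
        if (block == NULL) {
            printf("HeapAlloc failed after %d blocks\n", i);
            break;
        }
    }

    printf("done, attach VMMap now\n");
    getchar();                 /* keep the process alive for inspection */

    HeapDestroy(heap);
    return 0;
}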
I used VMMap to analyze this demo app, and the following information is from VMMap.
Process: HeapOS.exe
PID: 2320
Type          Size     Committed  Private  Total WS  Private WS  Shareable WS  Shared WS  Locked WS  Blocks  Largest
Total         928,388  806,452    779,360  782,544   779,144     3,400         2,720                 188
Heap          1,600    500        488      460       452         8             8                     13      1,024
Private Data  888,224  774,016    774,016  774,016   774,012     4             4                     24      294,912
I found that Heap is very small but Private Data is very large.
However, the VMMap Help explains:
Private
Private memory is memory allocated by VirtualAlloc and not suballocated either by the Heap Manager or the .NET run time.
It cannot be shared with other processes, is charged against the system commit limit, and typically contains application data.
So I guess that Private Data is memory allocated with VirtualAlloc from the process's virtual address space that simply can't be shared with other processes. Private Data may be allocated by the app's own code, by the OS Heap Manager, or by the .NET runtime.
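For comparison, memory that a process commits directly with VirtualAlloc should show up under Private Data in VMMap; a tiny sketch (the 64 MB size is arbitrary):

#include <windows.h>
#include <stdio.h>

/* Sketch: a direct VirtualAlloc allocation, which VMMap should report
   under "Private Data" rather than "Heap". */
int main(void)
{
    SIZE_T size = 64 * 1024 * 1024;   /* 64 MB */
    void *mem = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (mem == NULL) {
        printf("VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }

    printf("committed %llu bytes at %p, attach VMMap now\n",
           (unsigned long long)size, mem);
    getchar();                 /* keep the process alive for inspection */

    VirtualFree(mem, 0, MEM_RELEASE);
    return 0;
}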

Related

PostgreSQL: maintaining the database and understanding how autovacuum, backend_xmin, and datfrozenxid relate to and impact each other

I am a newbie to PostgreSQL. My project handles financial transactions and has a few tables with huge amounts of transaction data that see frequent insert/update/delete activity.
When I started, I came across an error saying that autovacuum was not working properly, and I noticed that the database had grown huge and was causing memory issues, so the team increased the RAM from 16 GB to 24 GB; the hard disk is 324 GB.
I went through the system and pg_log errors, which were all related to vacuum not working properly and showed the error below:
TopMemoryContext: 61752 total in 8 blocks; 8400 free (10 chunks); 53352 used
TopTransactionContext: 8192 total in 1 blocks; 7856 free (24 chunks); 336 used
TOAST to main relid map: 57344 total in 3 blocks; 34480 free (11 chunks); 22864 used
AV worker: 24576 total in 2 blocks; 17608 free (9 chunks); 6968 used
Autovacuum Portal: 8192 total in 1 blocks; 8168 free (0 chunks); 24 used
Vacuum: 8192 total in 1 blocks; 8128 free (0 chunks); 64 used
Operator class cache: 8192 total in 1 blocks; 4864 free (0 chunks); 3328 used
smgr relation table: 8192 total in 1 blocks; 2808 free (0 chunks); 5384 used
TransactionAbortContext: 32768 total in 1 blocks; 32744 free (0 chunks); 24 used
Portal hash: 8192 total in 1 blocks; 3904 free (0 chunks); 4288 used
PortalMemory: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Relcache by OID: 8192 total in 1 blocks; 2848 free (0 chunks); 5344 used
CacheMemoryContext: 516096 total in 6 blocks; 231840 free (1 chunks); 284256 used
log_trace_data_date_time_index: 1024 total in 1 blocks; 584 free (0 chunks); 440 used
log_trace_data_destination_conn_id_index: 1024 total in 1 blocks; 584 free (0 chunks); 440 used
log_trace_data_source_conn_id_index: 1024 total in 1 blocks; 584 free (0 chunks); 440 used
log_trace_data_trace_number_index: 1024 total in 1 blocks; 584 free (0 chunks); 440 used
log_trace_data_reference_retrieval_number_index: 1024 total in 1 blocks; 584 free (0 chunks); 440 used....
pg_database_datname_index: 1024 total in 1 blocks; 544 free (0 chunks); 480 used
pg_replication_origin_roiident_index: 1024 total in 1 blocks; 584 free (0 chunks); 440 used
pg_auth_members_role_member_index: 1024 total in 1 blocks; 512 free (0 chunks); 512 used
pg_database_oid_index: 1024 total in 1 blocks; 544 free (0 chunks); 480 used
pg_authid_rolname_index: 1024 total in 1 blocks; 584 free (0 chunks); 440 used
WAL record construction: 49528 total in 2 blocks; 6872 free (0 chunks); 42656 used
PrivateRefCount: 8192 total in 1 blocks; 5960 free (0 chunks); 2232 used
MdSmgr: 8192 total in 1 blocks; 7448 free (0 chunks); 744 used
LOCALLOCK hash: 8192 total in 1 blocks; 4928 free (0 chunks); 3264 used
Timezones: 104064 total in 2 blocks; 5960 free (0 chunks); 98104 used
ErrorContext: 8192 total in 1 blocks; 8168 free (0 chunks); 24 used
xxxx 00:00:03 EST ERROR: out of memory
xxxx 00:00:03 EST DETAIL: Failed on request of size 503316480.
xxxx 00:00:03 EST CONTEXT: automatic vacuum of table "novus.log.trace_data"
I got to know that vacuum locks the database, but I was not sure of any other way to proceed, so I tried running "vacuum full tablename" once (on those 3 huge tables that see frequent DML) to check whether anything changed.
Output: the database size, which had grown to 125 GB, was reduced to 70 GB after the vacuum, and autovacuum worked fine until a few days ago. I am not sure if I did it right, and now I see some messages related to the oldest xmin in the log. As I am still going through various articles and learning PostgreSQL, could you please help me understand a few things better, and tell me whether I did that right or whether there is another way to resolve this autovacuum error? The logs are now displaying various types of errors and warnings; also, some servers have the pgsql web version and I am not able to read those log files properly.
Some errors displayed in pg_log files:
Checkpoints are occurring too frequently => this appeared and later vanished on its own (I had also tried toying with max_wal_size)
could not rename temporary statistics file "pg_stat_tmp/global.tmp" to "pg_stat_tmp/global.stat": Permission denied
canceling autovacuum task
I would like to know more about:
How do I resolve the bloat?
Even though I see that autovacuum has started working, I am not sure whether records are actually getting deleted and space is being released. Are the deleted transactions not reaching the invisible/frozen state?
Are these all side effects of the initial changes I tried?
Project: the database is on 2 servers; transactions are saved in the production database's "spool" table and later moved to the support server's database "transaction" table. Now the issue is that the spool table is getting huge and the transactions table is not able to keep up, because autovacuum was not working properly.

How can I read Total bytes per sector value using PowerShell or Cmd?

I heard that the disk sector's size is not 512 bytes, but 571 bytes.
Total bytes per sector: 571 bytes (59 B for Header + 512 B for Data)
Data bytes per sector: 512 bytes
And I want to query that number, '571', using PowerShell or the Command Prompt.
P.S. I failed with fsutil fsinfo ntfsinfo [Drive Letter].

Units of perf stat statistics

I'm using perf stat for some purposes, and to better understand how the tool works, I wrote a program that copies one file's contents into another. I ran the program on a 750 MB file and the stats are below:
31691336329 L1-dcache-loads
44227451 L1-dcache-load-misses
15596746809 L1-dcache-stores
20575093 L1-dcache-store-misses
26542169 cache-references
13410669 cache-misses
36859313200 cycles
75952288765 instructions
26542163 cache-references
What is the unit of each number? What I mean is: is it bits, bytes, or something else? Thanks in advance.
The unit is a single cache access for loads, stores, references and misses. Loads correspond to the count of load instructions executed by the processor; the same goes for stores. Misses is the count of loads and stores that were unable to get their data from the cache at this level: the L1 data cache for the L1-dcache- events, and the Last Level Cache (usually L2 or L3, depending on your platform) for the cache- events.
31 691 336 329 L1-dcache-loads
44 227 451 L1-dcache-load-misses
15 596 746 809 L1-dcache-stores
20 575 093 L1-dcache-store-misses
26 542 169 cache-references
13 410 669 cache-misses
Cycles is the total count of CPU ticks during which the CPU executed your program. If you have a 3 GHz CPU, there will be around 3 000 000 000 cycles per second at most. If the machine was busy, fewer cycles will be available to your program.
36 859 313 200 cycles
This is the total count of instructions executed by your program:
75 952 288 765 instructions
(I will use the G suffix as an abbreviation for billion.)
From the numbers we can conclude: 76G instructions were executed in 37G cycles (around 2 instructions per CPU tick, a rather high level of IPC). You gave no information about your CPU and its frequency, but assuming a 3 GHz CPU, the running time was close to 12 seconds.
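The same arithmetic as a tiny worked example (the 3 GHz frequency is an assumption, as said above):

#include <stdio.h>

/* Back-of-the-envelope numbers from the counters above; 3 GHz is assumed. */
int main(void)
{
    double instructions = 76e9;
    double cycles       = 37e9;
    double freq_hz      = 3e9;          /* assumed CPU frequency */

    printf("IPC          ~= %.1f instructions per cycle\n", instructions / cycles);
    printf("running time ~= %.0f seconds\n", cycles / freq_hz);
    return 0;
}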
Of the 76G instructions, you have 31G load instructions (42%) and 15G store instructions (21%); so only 37% of the instructions were not memory instructions. I don't know what the size of the memory references was (byte loads and stores, 2-byte, or wide SSE moves), but 31G load instructions looks too high for a 750 MB file (that's a mean of about 0.02 bytes per access, while the shortest possible load or store is a single byte). So I think your program either made several copies of the data, or the file was bigger. 750 MB in 12 seconds looks rather slow (60 MB/s), but this can be true if the first file was read and the second file was written to disk without caching by the Linux kernel (do you have an fsync() call in your program? Are you profiling your CPU or your HDD?). With cached files and/or a RAM drive (tmpfs, the filesystem stored in RAM), this speed should be much higher.
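For reference, a plain read/write copy loop like the following sketch (not your actual program, of course) is the kind of workload these counters describe; a final fsync() is what forces the copy to actually reach the disk instead of staying in the page cache:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of a simple file copy; the buffer size is arbitrary. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int in  = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) {
        perror("open");
        return 1;
    }

    char buf[64 * 1024];
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (write(out, buf, (size_t)n) != n) {   /* short writes not retried, for brevity */
            perror("write");
            return 1;
        }
    }

    fsync(out);   /* force the data to disk; without this the kernel may only cache it */
    close(in);
    close(out);
    return 0;
}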
Modern versions of perf do some simple calculations in perf stat and may also print derived units, as shown here: http://www.bnikolic.co.uk/blog/hpc-prof-events.html
perf stat -d md5sum *
578.920753 task-clock # 0.995 CPUs utilized
211 context-switches # 0.000 M/sec
4 CPU-migrations # 0.000 M/sec
212 page-faults # 0.000 M/sec
1,744,441,333 cycles # 3.013 GHz [20.22%]
1,064,408,505 stalled-cycles-frontend # 61.02% frontend cycles idle [30.68%]
104,014,063 stalled-cycles-backend # 5.96% backend cycles idle [41.00%]
2,401,954,846 instructions # 1.38 insns per cycle
# 0.44 stalled cycles per insn [51.18%]
14,519,547 branches # 25.080 M/sec [61.21%]
109,768 branch-misses # 0.76% of all branches [61.48%]
266,601,318 L1-dcache-loads # 460.514 M/sec [50.90%]
13,539,746 L1-dcache-load-misses # 5.08% of all L1-dcache hits [50.21%]
0 LLC-loads # 0.000 M/sec [39.19%]
(wrongevent?)0 LLC-load-misses # 0.00% of all LL-cache hits [ 9.63%]
0.581869522 seconds time elapsed
UPDATE Apr 18, 2014
Please explain why cache-references do not correlate with the L1-dcache numbers.
cache-references DOES correlate with the L1-dcache numbers: it is fed by L1-dcache-load-misses and L1-dcache-store-misses. Why aren't the numbers equal? Because in your CPU (Core i5-2320) there are 3 levels of cache: L1, L2 and L3, and the LLC (last level cache) is L3. A load or store instruction first tries to get/save its data in the L1 cache (L1-dcache-loads, L1-dcache-stores). If the address was not cached in L1, the request goes on to L2 (L1-dcache-load-misses, L1-dcache-store-misses). In this run we have no exact data on how many requests were served by L2 (those counters are not included in perf stat's default set), but we can assume that some loads/stores were served there and some were not. The requests not served by L2 then go to L3 (the LLC), and we see that there were 26M references to L3 (cache-references), half of which (13M) were L3 misses (cache-misses, served by main RAM). The other half were L3 hits.
44M + 20M = 64M misses from L1 were passed on to L2. 26M requests were passed from L2 to L3; these are the L2 misses. So 64M - 26M = 38 million requests were served by L2 (L2 hits).
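The same accounting as a tiny calculation (counts rounded to millions, taken from the output above):

#include <stdio.h>

/* Cache-miss accounting from the counters above (rounded to millions). */
int main(void)
{
    long l1_load_misses  = 44;   /* L1-dcache-load-misses  */
    long l1_store_misses = 20;   /* L1-dcache-store-misses */
    long llc_references  = 26;   /* cache-references: requests reaching L3 */
    long llc_misses      = 13;   /* cache-misses: served by main RAM */

    long l2_requests = l1_load_misses + l1_store_misses;   /* 64M went to L2 */
    long l2_misses   = llc_references;                     /* what L2 passed on to L3 */
    long l2_hits     = l2_requests - l2_misses;            /* ~38M served by L2 */
    long l3_hits     = llc_references - llc_misses;        /* ~13M served by L3 */

    printf("L2 hits: %ldM, L3 hits: %ldM, RAM accesses: %ldM\n",
           l2_hits, l3_hits, llc_misses);
    return 0;
}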

Perl CPU profiling

I want to profile my Perl script for CPU time. I found Devel::NYTProf and Devel::SmallProf,
but the first one cannot show CPU time and the second one works badly. At least I couldn't find what I need.
Can you advise any tool for my purposes?
UPD: I need per-line profiling, since my script takes a lot of CPU time and I want to improve the relevant parts of it.
You could try your system's time utility (not the shell's built-in; the leading \ is not a typo):
$ \time -v perl collatz.pl
13 40 20 10 5 16 8 4 2 1
23 70 35 106 53 160 80 40
837799 525
Command being timed: "perl collatz.pl"
User time (seconds): 3.79
System time (seconds): 0.06
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.94
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 171808
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 9
Minor (reclaiming a frame) page faults: 14851
Voluntary context switches: 16
Involuntary context switches: 935
Swaps: 0
File system inputs: 1120
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

PCIe 64-bit Non-Prefetchable Spaces

I've been reading through the horror that is the PCIe spec, and still can't get any kind of resolution to the following question pair.
Does PCIe allow for mapping huge (say 16GB) 64-bit non-prefetchable memory spaces up above the 4GB boundary? Or are they still bound to the same 1GB that they were in the 32-bit days, and there's just no way to call for giant swaths of non-prefetchable space?
Assuming that the spec allows for it (and to my reading it does), do widely available BIOSes support it? Or is it allowed in theory but not done in practice?
TL;DR / Short Answer
No. BAR requests for non-prefetchable memory are limited to using the low 32-bit address space.
http://www.pcisig.com/reflector/msg03550.html
Long Answer
The reason why the answer is no has to do with PCI internals. The data structure which describes the memory ranges that a PCI bus encompasses only reserves enough space to store 32-bit base and limit addresses for non-prefetchable memory and for I/O memory ranges. However, it does reserve enough space to store a 64-bit base and limit for prefetchable memory.
Even Longer Answer
Specifically, look at http://wiki.osdev.org/PCI#PCI_Device_Structure, Figure 3 (PCI-to-PCI bridge). This shows a PCI Configuration Space Header Type 0x01 (the header format for a PCI-to-PCI bridge). Notice that starting at register 1C in that table, there are:
1C: 8 (middle) bits for I/O base address. Only top 4 bits are usable.
1D: 8 (middle) bits for I/O limit address. Only top 4 bits are usable.
Ignore 1E-1F.
20: 16 bits for non-prefetchable memory base address. Only top 12 bits are usable.
22: 16 bits for non-prefetchable memory limit address. Only top 12 bits are usable.
24: 16 (middle) bits for prefetchable memory base address
26: 16 (middle) bits for prefetchable memory limit address
28: 32 upper bits for extended prefetchable memory base address
2C: 32 upper bits for extended prefetchable memory limit address
30: 16 upper bits for extended I/O base address
32: 16 upper bits for extended I/O limit address
The actual addresses are created by concatenating (parts of) these registers together with either 0s (for base addresses) or 1s (for limit addresses). The I/O and non-prefetchable base and limit addresses are 32 bits and formed thus:
                        Bit#:  31..16           15..12          11..0
I/O Base:                    [ 16 upper bits  :  4 middle bits : 12 zeros ]
I/O Limit:                   [ 16 upper bits  :  4 middle bits : 12 ones  ]

                        Bit#:  31..20    19..0
Non-prefetchable Base:       [ 12 bits : 20 zeros ]
Non-prefetchable Limit:      [ 12 bits : 20 ones  ]
The prefetchable base and limit addresses are 64-bit and formed thus:
                        Bit#:  63..32
Prefetchable Base:           [ 32 upper bits ]
                        Bit#:  31..20            19..0
                             [ 12 middle bits  : 20 zeros ]

                        Bit#:  63..32
Prefetchable Limit:          [ 32 upper bits ]
                        Bit#:  31..20            19..0
                             [ 12 middle bits  : 20 ones  ]
As you can see, only the prefetchable memory base and limit registers are given enough bits to express a 64-bit address. All the other ones are limited to only 32.
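To make the concatenation concrete, here is a small C sketch (my own illustration, with hypothetical register values) that assembles a non-prefetchable window and a prefetchable window from the Type 1 header registers listed above:

#include <stdint.h>
#include <stdio.h>

/* Illustration only: form bridge window addresses from Type 1 header registers.
   The register values below are hypothetical. */
int main(void)
{
    uint16_t mem_base_reg     = 0xFFF0;      /* offset 0x20: non-prefetchable base, top 12 bits usable */
    uint16_t mem_limit_reg    = 0xFFF0;      /* offset 0x22: non-prefetchable limit */
    uint16_t pref_base_reg    = 0x0001;      /* offset 0x24: low bits = 0x1 means 64-bit capable */
    uint16_t pref_limit_reg   = 0x0001;      /* offset 0x26 */
    uint32_t pref_base_upper  = 0x00000004;  /* offset 0x28: bits 63:32 of prefetchable base */
    uint32_t pref_limit_upper = 0x00000004;  /* offset 0x2C: bits 63:32 of prefetchable limit */

    /* Non-prefetchable: the top 12 bits of the register become address bits 31:20,
       padded with 20 zeros (base) or 20 ones (limit). */
    uint32_t np_base  = (uint32_t)(mem_base_reg  & 0xFFF0) << 16;
    uint32_t np_limit = ((uint32_t)(mem_limit_reg & 0xFFF0) << 16) | 0xFFFFF;

    /* Prefetchable: 64-bit; the upper register supplies bits 63:32. */
    uint64_t p_base  = ((uint64_t)pref_base_upper  << 32) |
                       ((uint64_t)(pref_base_reg  & 0xFFF0) << 16);
    uint64_t p_limit = ((uint64_t)pref_limit_upper << 32) |
                       ((uint64_t)(pref_limit_reg & 0xFFF0) << 16) | 0xFFFFF;

    printf("non-prefetchable window: 0x%08x - 0x%08x\n",
           (unsigned)np_base, (unsigned)np_limit);
    printf("prefetchable window:     0x%016llx - 0x%016llx\n",
           (unsigned long long)p_base, (unsigned long long)p_limit);
    return 0;
}

As the sketch shows, only the prefetchable window can ever produce an address above the 4 GB boundary, because only it has an upper-32-bit register to contribute bits 63:32.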
PCIe can define 64-bit memory addressing.
The definition and usage of the BARs (Base Address Registers) is in the PCI 3.0 spec (chapter 6.2.5.1, "Address Maps"), not in the PCIe spec.