Concept of "block size" in a cache - operating-system

I am just beginning to learn the concept of direct-mapped and set-associative caches.
I have some very elementary doubts. Here goes.
Suppose addresses are 32 bits long and I have a 32 KB cache with a 64-byte block size and 512 frames. How much data is actually stored inside a "block"? If I have an instruction that loads a value from a memory location, and that value is a 16-bit integer, does one of the 64-byte blocks now store only a 16-bit (2-byte) integer value? What about the other 62 bytes within the block? If I now have another load instruction that also loads a 16-bit integer, this value goes into a block of another frame depending on the load address (and if the address maps to the same frame as the previous instruction, the previous value is evicted and the block again stores only 2 bytes out of 64). Correct?
Please forgive me if this seems like a very stupid doubt; it's just that I want to get my concepts right.

I typed up this email for someone to explain caches, but I think you might find it useful as well.
You have 32-bit addresses that can refer to bytes in RAM.
You want to be able to cache the data you access so you can use it later.
Let's say you want a 1 MiB (2^20 bytes) cache.
What do you do?
You have 2 restrictions you need to meet:
Caching should be as uniform as possible across all addresses. i.e. you don't want to bias toward any particular kind of address.
How do you do this? Use remainder! With mod, you can evenly distribute any integer over whatever range you want.
You want to help minimize bookkeeping costs. That means e.g. if you're caching in blocks of 1 byte, you don't want to store 4 bytes of metadata just to keep track of where that 1 byte belongs.
How do you do that? You store blocks that are bigger than just 1 byte.
Let's say you choose 16-byte (2^4-byte) blocks. That means you can cache 2^20 / 2^4 = 2^16 = 65,536 blocks of data.
You now have a few options:
You can design the cache so that data from any memory block could be stored in any of the cache blocks. This would be called a fully-associative cache.
The benefit is that it's the "fairest" kind of cache: all blocks are treated completely equally.
The tradeoff is speed: To find where to put the memory block, you have to search every cache block for a free space. This is really slow.
You can design the cache so that data from any memory block could only be stored in a single cache block. This would be called a direct-mapped cache.
The benefit is that it's the fastest kind of cache: you do only 1 check to see if the item is in the cache or not.
The tradeoff is that, now, if you happen to have a bad memory access pattern, you can have 2 blocks kicking each other out successively, with unused blocks still remaining in the cache.
You can do a mixture of both: map each memory block to any one of a small set of cache blocks. This is what real processors do -- they have N-way set-associative caches.
Direct-mapped cache:
Now you have 65,536 blocks of data, each block being 16 bytes.
You store it as 65,536 "rows" inside your cache, with each "row" consisting of the data itself, along with the metadata (regarding where the block belongs, whether it's valid, whether it's been written to, etc.).
Question:
How does each block in memory get mapped to each block in the cache?
Answer:
Well, you're using a direct-mapped cache, using mod. That means addresses 0 to 15 will be mapped to block 0 in the cache; 16-31 get mapped to block 1, etc... and it wraps around as you reach the 1-MiB mark.
So, given memory address M, how do you find the row number N? Easy: N = (M mod 2^20) / 2^4, using integer division.
But that only tells you where to store the data, not how to retrieve it. Once you've stored it and try to access it again, you have to know which 1-MiB region of memory the block came from, right?
So that's one piece of metadata: the tag bits. If a block is in row N, all you need to know is what the quotient of that mod operation was. For a 32-bit address, that quotient is 12 bits (since the remainder is 20 bits).
So your tag becomes 12 bits long -- specifically, the topmost 12 bits of any memory address.
And you already knew that the lowermost 4 bits are used for the offset within a block (since memory is byte-addressed, and a block is 16 bytes).
That leaves 16 bits for the "index" bits of a memory address, which can be used to find which row the address belongs to. (It's just a division + remainder operation, but in binary.)
You also need other bits: e.g. you need to know whether a block is in fact valid or not, because when the CPU is turned on, it contains invalid data. So you add 1 bit of metadata: the Valid bit.
There's other bits you'll learn about, used for optimization, synchronization, etc... but these are the basic ones. :)
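To make the bit arithmetic concrete, here's a minimal Python sketch of the address split described above (1-MiB direct-mapped cache, 16-byte blocks, 32-bit addresses); the constants and the function name are just illustrative, not any particular CPU's layout:

    OFFSET_BITS = 4    # 16-byte blocks  -> low 4 bits select the byte within a block
    INDEX_BITS  = 16   # 65,536 rows     -> next 16 bits select the cache row
    TAG_BITS    = 12   # 32 - 16 - 4     -> top 12 bits identify which 1-MiB region

    def split_address(addr):
        """Split a 32-bit address into (tag, index, offset) for this cache."""
        offset = addr & ((1 << OFFSET_BITS) - 1)
        index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag    = addr >> (OFFSET_BITS + INDEX_BITS)
        return tag, index, offset

    # Two addresses exactly 1 MiB apart share the same index but differ in tag,
    # so they compete for the same row in a direct-mapped cache.
    print(split_address(0x0001_2340))   # (0, 0x1234, 0)
    print(split_address(0x0011_2340))   # (1, 0x1234, 0)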

I'm assuming you know the basics of tag, index, and offset, but here's a short explanation as I learned it in my computer architecture class. Data is transferred in whole 64-byte blocks, so every time a new block is brought into the cache it replaces all 64 bytes, regardless of whether you only need one byte. That's why the address presented to the cache includes an offset that specifies which byte you want from the block. Take your example: if only a 16-bit integer is being loaded, the cache will find the frame by the index, check the tag to make sure it's the right data, and then return the bytes at the given offset. Now if you load another 16-bit value, let's say with the same index but a different tag, it will replace the whole 64-byte block with the new block and return the data at the specified offset (assuming direct-mapped).
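Here is a rough Python sketch of that behaviour, using the question's geometry (64-byte blocks, 512 frames, direct-mapped); the class, the fake memory, and the example addresses are made up purely for illustration:

    BLOCK_SIZE = 64          # bytes per block
    NUM_FRAMES = 512         # 512 frames * 64 B = 32 KB cache

    class DirectMappedCache:
        def __init__(self, memory):
            self.memory = memory                  # backing store: a bytes-like object
            self.lines = [None] * NUM_FRAMES      # each entry: (tag, 64-byte block) or None

        def load_u16(self, addr):
            """Return the 16-bit value at addr, filling the whole 64-byte block on a miss."""
            offset = addr % BLOCK_SIZE
            index = (addr // BLOCK_SIZE) % NUM_FRAMES
            tag = addr // (BLOCK_SIZE * NUM_FRAMES)
            line = self.lines[index]
            if line is None or line[0] != tag:          # miss: evict whatever was in this frame
                base = addr - offset
                block = bytes(self.memory[base:base + BLOCK_SIZE])
                self.lines[index] = (tag, block)
                print(f"miss: filled frame {index} with bytes {base}..{base + BLOCK_SIZE - 1}")
            _, block = self.lines[index]
            return int.from_bytes(block[offset:offset + 2], "little")

    mem = bytearray(64 * 1024)           # 64 KB of fake RAM
    cache = DirectMappedCache(mem)
    cache.load_u16(0x0040)               # fills frame 1 with bytes 0x0040..0x007F
    cache.load_u16(0x0042)               # hit: same block, different offset
    cache.load_u16(0x8040)               # same index, different tag: evicts the previous block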
I hope this helps! If you need more info or this is still fuzzy let me know, I know a couple of good sites that do a good job of teaching this.

Related

Maximum String Length in Intersystems Cache. Does changing it affect speed?

I have been having <MAXSTRING> errors returned for some of our existing Intersystems Cache Classes
I have read here that by default, the length of max string is set to around 32k. Running the script WRITE $SYSTEM.SYS.MaxLocalLength() does confirm this at 32767, the minimum max-string length.
My question is, if we change this setting in Intersystems Cache (for example raising it to its maximum, around 3 MB), will it negatively affect the speed of the server in general, or won't it make much difference?
Around 500 people use the system regularly and make use of the class methods mentioned, if that matters.
The documents mention the following:
When a process actually uses a long string, the memory for the string comes from the operating system’s malloc() buffer, not from the partition memory space for the process. Thus the memory allocated for actual long string values is not subject to the limit set by the maximum memory per process (Maximum per Process Memory (KB)) parameter and does not affect the $STORAGE value for the process.
However, I am not entirely sure what this means if we change the size of the string.
We switched to long strings (3MB) a few years ago and did not notice any difference in performance.

What is the Riak per-key overhead using the Bitcask backend?

It's a simple question with apparently a multitude of answers.
Findings have ranged anywhere from:
a. 22 bytes as per Basho's documentation:
http://docs.basho.com/riak/latest/references/appendices/Bitcask-Capacity-Planning/
b. 450~ bytes over here:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-August/005178.html
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-May/004292.html
c. And anecdotal records that state overheads anywhere in the range of 45 to 200 bytes.
Why isn't there a straight answer to this? I understand it's an intricate problem - one of the mailing list entries above makes it clear! - but is even coming up with a consistent ballpark so difficult? Why isn't Basho's documentation clear about this?
I have another set of problems related to how I am to structure my logic based on the key overhead (storing lots of small values versus "collecting" them in larger structures), but I guess that's another question.
The static overhead is stated on our capacity planner as 22 bytes because that's the size of the C struct. As noted on that page, the capacity planner is simply providing a rough estimate for sizing.
The old post on the mailing list by Nico that you link to is probably the most complete accounting of Bitcask internals you will find, and it is accurate. Figuring in the 8 bytes for a pointer to the entry and the 13 bytes of Erlang overhead on the bucket/key pair, you arrive at 43 bytes per key on a 64-bit system.
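As a back-of-the-envelope illustration only (not an official Basho formula), the in-memory keydir cost can be sketched like this, where 43 bytes is the static per-key overhead worked out above and the key counts and sizes are hypothetical:

    STATIC_OVERHEAD = 43   # bytes/key on 64-bit: 22-byte struct + 8-byte pointer + 13 bytes Erlang overhead

    def bitcask_keydir_ram(num_keys, avg_bucket_key_bytes):
        """Rough RAM (in bytes) needed to hold the Bitcask keydir for these keys on one node."""
        # Note: replication (n_val) multiplies the total number of key copies across the cluster.
        return num_keys * (STATIC_OVERHEAD + avg_bucket_key_bytes)

    # e.g. 100 million keys with 30-byte bucket/key pairs -> roughly 6.8 GiB of keydir RAM
    print(bitcask_keydir_ram(100_000_000, 30) / 2**30)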
As for there not being a straight answer ... actually asking us (via email, the mailing list, IRC, carrier pigeon, etc) will always produce an actual answer.
Bitcask requires all keys to be held in memory. As far as I can see the overhead referenced in a) is the one to be used when estimating the total amount of RAM bitcask will require across the cluster due to this requirement.
When writing data to disk, Riak stores the actual value together with various metadata, e.g. the vector clock. The post mentioning 450 bytes listed in b) appears to be an estimate of the storage overhead on disk and would therefore probably apply also to other backends.
Nico's post seems to contain a good and accurate explanation.

Need a 9 char length unique ID

My application uses a 9-digit number (it can also be alphanumeric). I can start with any number and then increment it at startup. But my application is not a single-instance application, so if I run the exe as another instance, it should increment the latest value, and the previous instance should in turn increment that latest value when it next needs one. I mean that at all times, the value should be the latest incremented value among all the instances that are open.
That is half of the problem. The other half is that the exes can run on any machine on the network, and each instance should keep incrementing (just like time never goes back) for another 2 years. My restriction is that I can't use files to store and retrieve the latest value in a common place.
How can I do that?
A 9-char/digit UNIQUE NUMBER also works for sure. The whole idea is to assign a number (a string of 9 characters) to each "confidential file" (and then encrypt it and so on, which is not my job).
I tried with:
A GUID, which is unique across all 128 bits, but not in just its first or last 9 chars
A tick count, which is more than 9 digits
A MAC address, which is unique only if all 12 chars are used
ISBN (book numbering system)
And so on ...
I think the best approach might be to have a unique-number server which each instance of your application queries over the network to get unique numbers.
First, you need to remove the distributed aspect from the problem. Like user Hugo suggested, using the last 2 or 3 bytes of the IP address should work. Your problem is now reduced to a local problem for each single machine.
Your algorithm probably needs to be able to deal with a restart, and not start handing out the same numbers after a reboot. You state that you do not have the option to use a file to store and retrieve information about this mechanism via a file system. This means that a random number generator alone would not be good enough, and you need a time-based component in your number generator as well. If you use 4 bytes containing the number of seconds elapsed since some date you will have more than 100 years of uniqueness in that. However, ideally the time-scale to use here depends on the expected handout-frequency of your numbers. Your problem is now reduced to a local problem for each single machine for each single second.
The final 2 or 3 bytes are then available to ensure local uniqueness within the second. Depending on your requirements and operating system, there are multiple IPC mechanisms to manage this, like pipes, sockets or shared memory. Or you could think of more creative ways. If you know the number of participating processes on a node, you could assign a sequence number to each process at startup or configuration time, and 1 of the 2 or 3 bytes is used for that. Your uniqueness problem has now become local to your process for one second only, which should be doable.
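To make that concrete, here is a rough Python sketch of one way to pack those pieces into exactly 9 alphanumeric characters; the epoch, field widths, and function name are arbitrary choices for illustration, and the per-second sequence still has to be coordinated locally through one of the IPC mechanisms mentioned above:

    import time
    import string

    ALPHABET = string.digits + string.ascii_uppercase   # base 36; 36**9 > 2**46, so 46 bits fit in 9 chars

    def make_id(host_byte, seq):
        """Pack host id (8 bits) + seconds since 2020-01-01 (32 bits) + per-second sequence (6 bits)."""
        seconds = int(time.time()) - 1577836800         # 32 bits of seconds lasts well over 100 years
        value = (host_byte & 0xFF) << 38 | (seconds & 0xFFFFFFFF) << 6 | (seq & 0x3F)
        chars = []
        for _ in range(9):                               # always emit exactly 9 base-36 characters
            value, digit = divmod(value, 36)
            chars.append(ALPHABET[digit])
        return "".join(reversed(chars))

    # host_byte comes from the machine's IP address, seq from local IPC (shared memory, socket, ...)
    print(make_id(host_byte=42, seq=0))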
Why does it have to be EXACTLY 9? UUIDs would be great if that didn't limit you.
In any case, your best shot is to generate a random number. If all your PCs are in the same network, use the host digits of the IP address at the beginning to avoid collisions. This should be no more than 16 or 24 bits in most cases anyway, so you have 6 remaining digits.

Lucene: Loading Index files while searching?

Can anyone explain how index files are loaded in memory while searching?
Is the whole file (fnm, tis, fdt etc) loaded at once or in chunks?
How are individual segments loaded, and in which order?
How to encrypt Lucene index?
The main point of having the index segments is that you can rarely load the whole index into memory.
The most important limitation taken into account while designing the index format is that disk seek time is relatively long (on platter-based hard drives, which are still the most widely used). A good estimate is that the transfer time per byte is about 0.01 to 0.02 μs, while the average seek time of the disk head is about 5 ms!
So the part that is kept in memory is typically only the dictionary, used to find out the beginning block of the postings list on the disk*. The other parts are loaded only on-demand and then purged from the memory to make room for other searches.
As for encryption, it depends on whether you need to keep the index encrypted all the time (even when in memory) or if it suffices to encrypt only the index files. As for the latter, I think that an encrypted file system will be enough. As for the former, it is also certainly possible, as different index compression techniques are already in place. However, I don't think it's widely used, as the first and foremost requirement for full-text engine is speed.
[*] It's not really that simple, as we're performing binary searches against the dictionary, so we'd need to ensure that all entries in the first structure have equal length. As that's clearly not the case for normal words in a dictionary, and applying padding would be far too costly (think of the word lengths of some chemical substances), we actually maintain two levels of dictionary: the first one (which needs to fit in memory and is stored in .tii files) keeps a sorted list of starting positions of terms in the second index (.tis files). The second index is then a concatenated array of all terms in increasing order, along with a pointer into the .frq file. The second index often fits in memory and is loaded at start-up, but that can be impossible, e.g. for bigram indexes. Also note that for some time Lucene has by default not used individual files, but so-called compound files (with the .cfs extension), to cut down the number of open files.
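For illustration only, here is a tiny Python sketch of that two-level lookup idea (this is not Lucene's actual .tii/.tis on-disk format, just the shape of the algorithm, with a made-up index interval of 2):

    import bisect

    # tis: all terms in sorted order, each with a pointer to its postings; read from disk on demand.
    # tii: every Nth term plus its position in tis; small enough to keep in memory.
    tis = [("apple", 0), ("banana", 40), ("cherry", 85), ("durian", 130), ("elder", 170)]
    tii = [(term, i) for i, (term, _) in enumerate(tis) if i % 2 == 0]

    def lookup(term):
        # 1. binary-search the small in-memory index for the last entry <= term
        keys = [t for t, _ in tii]
        start = tii[max(bisect.bisect_right(keys, term) - 1, 0)][1]
        # 2. scan the "on-disk" term list from that position (a short, bounded scan)
        for t, postings_ptr in tis[start:start + 2]:
            if t == term:
                return postings_ptr
        return None

    print(lookup("cherry"))   # 85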

operating systems - TLBs

I'm trying to get my head around this (okay, to be honest, cramming the night before the exam :) but I can't figure out (nor find a good high-level overview on the net) the following:
'Page table entries can be mapped to more than one TLB entry... if, for example, every page table entry is mapped to two TLB entries, this is known as a 2-way set-associative TLB.'
My question is, why would we want to map this more than once? Surely we want to have the maximum number of possible entries represented in the TLB, and duplication would waste space, right? What am I missing?
Many thanks
It doesn't mean you would load the same entry into two places in the table -- it means a particular entry can be loaded into either of two places in the table. The alternative, where you can only map an entry to one place in the table, is a direct-mapped TLB.
The primary disadvantage of a direct-mapped TLB arises if you're copying from one part of memory to another, and (by whatever direct-mapping scheme the CPU uses) the translations for both have to be mapped to the same spot in the TLB. In this case, you end up re-loading the TLB entry every time, so the TLB is doing little or no good at all. By having a two-way set-associative TLB, you can guarantee that any two entries can be in the TLB at the same time, so (for example) a block move from point A to point B can't ruin your day -- but if you read from two areas, combine them, and write the results to a third, it could (if all three used translations that map to the same set of TLB entries).
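Here is a toy Python model (not any real CPU's TLB) that shows why such a copy loop hurts a direct-mapped TLB but not a 2-way set-associative one; the sizes and page numbers are made up:

    ENTRIES = 64   # total TLB entries in both models

    def misses_direct_mapped(page_trace):
        tlb = [None] * ENTRIES
        misses = 0
        for vpn in page_trace:
            slot = vpn % ENTRIES                         # each page maps to exactly one slot
            if tlb[slot] != vpn:
                tlb[slot] = vpn
                misses += 1
        return misses

    def misses_two_way(page_trace):
        sets = [[] for _ in range(ENTRIES // 2)]         # each set holds up to 2 entries, LRU order
        misses = 0
        for vpn in page_trace:
            s = sets[vpn % (ENTRIES // 2)]
            if vpn in s:
                s.remove(vpn)                            # hit: refresh LRU position
            else:
                misses += 1
                if len(s) == 2:
                    s.pop(0)                             # evict least recently used
            s.append(vpn)
        return misses

    # Source page 0 and destination page 64 collide in the direct-mapped TLB (64 % 64 == 0).
    trace = [0, 64] * 1000                               # alternating src/dst accesses during the copy
    print(misses_direct_mapped(trace))                   # 2000 -- every access misses
    print(misses_two_way(trace))                         # 2    -- only the first touch of each page misses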
The shortcoming of having a multiway TLB (like any other multiway cache) is that you can't directly compute which position might hold a particular entry at a given time -- you basically have to search across the ways to find the right entry. For two-way, that's rarely a problem -- but four ways is typically about the useful limit; 8-way set-associative TLBs (or caches) aren't common at all, partly because searching across 8 possible locations for the data starts to become excessive.
Over time, though, the number of ways it makes sense to use in a cache or TLB tends to rise. The differential in speed between memory and processors continues to grow, and the greater the differential, the more cycles the CPU can spend searching and still produce a result within a single memory clock cycle (or a specified number of memory clock cycles, even if that's more than one).