Can someone explain what slabs are in memcached?

Really, I just want to know what slabs are in memcached. It would be even better if someone who works with it could answer me.
Thanks for your answers...

Applications that run for long periods of time, like memcached, run into memory fragmentation issues the longer the service runs. On top of that, caching applications have the added issue that some pieces of memory have been cached for a long time while other pieces were allocated only recently.
Memcached has a "slab" allocator that attempts to reduce memory fragmentation in the memcached process. At a high level a slab is a 1MB piece of memory that contains the values of the key-value pairs you store in memcached. There are also different slabs for different value sizes. There might be a slab for 16B values, a slab for 32B values, a slab for 1024B values, etc. When a new key-value pair is added, memcached puts the value in the smallest slab that will hold it. By allocating memory like this, memcached is able to reduce memory fragmentation and, as a result, reduce the overall amount of memory used by memcached.
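To make the size-class idea concrete, here is a minimal sketch in C (not memcached's actual code) of picking the smallest slab class that fits a value. The class sizes and power-of-two growth here are illustrative assumptions; memcached grows its classes by a configurable factor (1.25 by default).

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative slab classes: each class serves one chunk size, and chunks
 * are carved out of fixed-size (e.g. 1MB) pages owned by that class.
 * The sizes and 2x growth used here are assumptions for the sketch. */
static const size_t slab_class_sizes[] = { 64, 128, 256, 512, 1024, 2048 };
#define NUM_CLASSES (sizeof(slab_class_sizes) / sizeof(slab_class_sizes[0]))

/* Return the index of the smallest class whose chunk size fits the item,
 * or -1 if the item is too large for any class. */
static int pick_slab_class(size_t item_size) {
    for (size_t i = 0; i < NUM_CLASSES; i++) {
        if (item_size <= slab_class_sizes[i])
            return (int)i;
    }
    return -1; /* too big for any class in this toy example */
}

int main(void) {
    size_t value_size = 300; /* e.g. a 300-byte value */
    int cls = pick_slab_class(value_size);
    if (cls >= 0)
        printf("%zu-byte value -> class %d (%zu-byte chunks)\n",
               value_size, cls, slab_class_sizes[cls]);
    return 0;
}
```

The real allocator additionally carves each 1MB page into chunks of the chosen class and tracks free chunks per class, which is the bookkeeping the stats output exposes.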
Slabs and the slab allocator are internal memcached implementation details. You can get information about them through the stats command, but unless you're trying to debug an issue with memcached itself, inspecting the slab information is unlikely to be useful.
For more details about slabs and the slab allocator, I found the blog post linked below helpful.
https://holmeshe.me/understanding-memcached-source-code-I/
If you're particularly interested in this kind of architecture then look into how memory allocators work in general since the concepts are similar.

Related

How to keep 32 bit mongodb memory usage down on changing dataset

I'm using MongoDB on a 32 bit production system, which sucks but it's out of my control right now. The challenge is to keep the memory usage under ~2.5GB since going over this will cause 32 bit systems to crash.
According to the MongoDB team, the best way to track the memory usage is to use your operating system's process tracking tools (i.e. ps or htop on Unix systems; Process Explorer on Windows) and look at virtual memory size.
The DB mainly consists of one table which is continually cycling data, i.e. receiving data at regular intervals from sensors, and every day a cron job wipes all data from before the last 3 days. Over a period of time, the memory usage slowly increases. I took some notes over time using db.serverStats(), db.lectura.totalSize() and ps, shown in the chart below. Note that the size of the table in question has reduced in the last month but the memory usage increased nonetheless.
Now, there is some scope for adjustment in how many days of data I store. Today I deleted basically half of the data, and then restarted mongodb, and yet the mem virtual / mem mapped and most importantly memory usage according to ps have hardly changed! Why do these not reduce when I wipe data (and restart)? I read some other questions where people said that mongo isn't really using all the memory that it might appear to be using, and that you can't clear the cache or limit memory use. But then how can I ensure I stay under the 2.5GB limit?
Unless there is a way to stem this gradual increase in memory usage regardless of dataset size, it seems to me that the 32-bit version of Mongo is unusable. Note: I don't mind losing a bit of performance if it solves the problem.
To answer why the mapped and virtual memory usage does not decrease with the deletes: the mapped number is actually what you get when you mmap() the entire set of data files. This does not shrink when you delete records, because although the space is freed up inside the data files, the files themselves are not reduced in size; they simply contain more free space afterwards.
Virtual memory will also include journal files, connections, and other non-data-related memory usage, but the same principle applies there. This, and more, is described here:
http://www.mongodb.org/display/DOCS/Checking+Server+Memory+Usage
So, the 2GB storage size limitation on 32-bit will actually apply to the data files whether or not there is data in them. To reclaim deleted space, you will have to run a repair. This is a blocking operation and will require the database to be offline/unavailable while it runs. It will also need up to 2x the original size in free disk space, since it essentially rewrites the files from scratch.
This limitation, and the problems it causes, is why the 32-bit version should not be run in production; it is just not suitable. I would recommend moving to a 64-bit version as soon as possible.
By the way, neither of these figures (mapped or virtual) actually represents your resident memory usage, which is what you really want to look at. The best way to do this over time is via MMS, which is the free monitoring service provided by 10gen - it will graph virtual, mapped and resident memory for you over time as well as plenty of other stats.
If you want an immediate view, run mongostat and check out the corresponding memory columns (res, mapped, virtual).
In general, when using 64-bit builds with essentially unlimited storage, the data will usually greatly exceed the available memory. Therefore, mongod will use all of the available memory it can in terms of resident memory (which is why you should always have swap configured, so the OOM killer does not come into play).
Once that is used, the OS does not stop allocating memory, it will just have the oldest items paged out to make room for the new data (LRU). In other words, the recycling of memory will be done for you, and the resident memory level will remain fairly constant.
Your options for stretching 32-bit are limited, but you can try some things. The thing you run out of is address space, and the increasing sizes of additional database files mean that you want to avoid crossing the boundary from "n" files to "n+1". It may be worth structuring your data into more, smaller databases so that you can get the maximum amount of actual data into memory and as little "dead space" as possible.
For example, if your database named "mydatabase" consists of the files mydatabase.ns (the namespace file) at 16 MB, mydatabase.0 at 64 MB, mydatabase.1 at 128 MB and mydatabase.2 at 256 MB, then the next file created for this database will be mydatabase.3 at 512 MB. If instead of adding to mydatabase you instead created an additional database "mynewdatabase" it would start life with mynewdatabase.ns at 16 MB and mynewdatabase.0 at 64 MB ... quite a bit smaller than the 512 MB that adding to the original database would be. In fact, you could create 4 new databases for less space than would be consumed by adding a new file to the original database, and because the files are smaller they would be easier to fit into contiguous blocks of memory.
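As a rough back-of-the-envelope sketch of that progression (assuming the classic defaults mentioned above: a 16 MB namespace file and data files starting at 64 MB that double in size, capped at 2 GB each), this toy program compares the mapped footprint of one growing database against several small ones:

```c
#include <stdio.h>

/* Cumulative on-disk (and therefore mmap'ed) footprint of one database
 * after n data files, under the classic defaults: 16MB namespace file,
 * then data files of 64MB, 128MB, 256MB, ... capped at 2GB each. */
static long long footprint_mb(int n_data_files) {
    long long total = 16;           /* the .ns file */
    long long file = 64;            /* first data file */
    for (int i = 0; i < n_data_files; i++) {
        total += file;
        if (file < 2048) file *= 2; /* doubling, capped at 2GB */
    }
    return total;
}

int main(void) {
    /* One database with 4 data files vs four databases with 1 file each. */
    printf("1 db, 4 files:  %lld MB mapped\n", footprint_mb(4));      /* 16+64+128+256+512 */
    printf("4 dbs, 1 file:  %lld MB mapped\n", 4 * footprint_mb(1));  /* 4 * (16+64) */
    return 0;
}
```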
It is a well-known fact that 32-bit should not be used in production.
Use 64-bit systems.
Period.

How to make APC cache based on distributed hash tables (like memcache)?

I've read an article about Distributed Hash Tables and it seems it's possible to implement something like memcache with APC. As you know, APC is much faster than memcache if we're fetching keys from a single server. So if we make APC distributed we get both performance and distribution. I need some thoughts to start it. Could someone who is familiar with hash tables explain how to do that? How to make APC work like memcache?
If you know something about keyspace partitioning and overlay networks, that would be even better.
Although on the surface both pieces of software provide a comparable service, their underpinnings are entirely different, and that explains the dramatic difference in performance.
APC is basically a system that allows you to store objects (be it user objects or parsed opcode chunks) in shared memory. Shared memory, in all systems I know of, is as fast as local RAM once you have obtained a pointer to it.
So, in short, what APC has to do to write or read an object is:
request shm access and obtain a pointer to it
calculate object offset and size in the shm
memcpy that memory zone into a buffer or vice versa
done
Simple, and, taking into account that memory bandwidth nowadays is tens of gigabytes per second, quick.
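To illustrate how cheap that local path is, here is a minimal POSIX shared-memory sketch (not APC's actual implementation); the segment name, size and offset are made up for the example:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_cache"   /* hypothetical segment name */
#define SHM_SIZE 4096

int main(void) {
    /* 1. obtain access to the shared memory segment and map it */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) < 0) { perror("ftruncate"); return 1; }
    char *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    /* 2. "store": compute the object's offset and memcpy into the segment */
    const char *value = "hello from shm";
    size_t offset = 128;                      /* assumed offset for the demo */
    memcpy(shm + offset, value, strlen(value) + 1);

    /* 3. "fetch": memcpy back out of the segment into a local buffer */
    char buf[64];
    memcpy(buf, shm + offset, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    printf("fetched: %s\n", buf);

    /* cleanup */
    munmap(shm, SHM_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);
    return 0;
}
```

Everything here happens inside one machine's memory: no serialization, no network round trip, which is the whole point of the comparison that follows.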
Due to its distributed nature, more needs to be done in a memcache scenario:
client encodes and transmits request
server receives and decodes request
server calculates object offset and size in memcached's memory
server memcpy's that memory zone into a buffer or vice versa
server transmits the buffer
client receives and decodes buffer
Now, if we want to distribute APC, the client and server will need to talk to each other. And all of a sudden we find ourselves in a scenario that, with the exception of a few less important details, is identical to the one used by memcache. And all the expensive operations become necessary again, i.e. all the copying around, including sending data through the network stack.
That also explains why, even with a memcache instance running on localhost (so without the horribly slow gigabit Ethernet between nodes), there is considerable overhead in what needs to be done to make a distributed system work.
And that's why I'm convinced you're looking at the wrong suspect here: make APC distributed and it will end up in the same performance/throughput category as memcache.
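Since the question asks specifically about keyspace partitioning: whatever software runs on the nodes, the client-side part usually boils down to hashing the key and mapping the hash onto one node. Below is a minimal sketch with an illustrative node list and a naive hash-modulo scheme; real clients typically use consistent hashing so that adding or removing a node only remaps a small slice of the keyspace.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache nodes for the example. */
static const char *nodes[] = {
    "10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"
};
#define NUM_NODES (sizeof(nodes) / sizeof(nodes[0]))

/* FNV-1a: a simple, well-known string hash. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Naive keyspace partitioning: hash modulo node count.
 * (Consistent hashing replaces this so that adding/removing a node
 * only remaps roughly 1/N of the keys instead of nearly all of them.) */
static const char *node_for_key(const char *key) {
    return nodes[fnv1a(key) % NUM_NODES];
}

int main(void) {
    const char *keys[] = { "user:42", "session:abc", "page:/index" };
    for (size_t i = 0; i < 3; i++)
        printf("%-12s -> %s\n", keys[i], node_for_key(keys[i]));
    return 0;
}
```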

What is the rationale for small stack even when memory is available?

Recently, I was asked in an interview, why would you have a smaller stack when the available memory has no limit? Why would you have it in 1KB range even when you might have 4GB physical memory? Is this a standard design practice?
The other answers are good; I just thought I'd point out an important misunderstanding inherent in the question. How much physical memory you have is completely irrelevant. Having more physical memory is just an optimization; it prevents having to use disk as storage. The precious resource consumed by a stack is address space, not physical memory. The bits of the stack that aren't being used right now are not even going to reside in physical memory; they'll be paged out to disk. But as soon as they are committed, they are consuming virtual address space.
The smaller your stacks, the more of them you can have. A 1kB stack is pretty useless, as I can't think of an architecture that has pages that small. A more typical size is 128kB-1MB.
Since each thread has its own stack, the number of stacks you can have is an upper limit on the number of threads you can have. Some people complain about the fact that they can't create more than 2000 threads in a standard 2GB address space of a 32-bit Windows process, so it's not surprising that some people would want even smaller stacks to allow even more threads.
Also, consider that if a stack has to be completely reserved ahead of time, it is carving a chunk out of your address space that can't be returned until the stack isn't used anymore (i.e. the thread exits). That chunk of reserved address space then limits the size of a contiguous allocation you can make.
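To make the trade-off concrete, here is a small POSIX threads sketch that creates a thread with an explicitly reduced stack; the 128 KB figure is only an example and is clamped to the platform minimum.

```c
#include <limits.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    /* Keep per-thread stack usage small: no big local arrays, no deep recursion. */
    printf("hello from a small-stack thread\n");
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);

    /* Reserve far less address space per thread than the typical default
     * (often 1MB-8MB). 128KB is an illustrative choice. */
    size_t stack_size = 128 * 1024;
    if (stack_size < PTHREAD_STACK_MIN)
        stack_size = PTHREAD_STACK_MIN;
    pthread_attr_setstacksize(&attr, stack_size);

    pthread_t t;
    if (pthread_create(&t, &attr, worker, NULL) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```

Compile with -pthread. With roughly 128 KB of address space per stack instead of 1 MB, the same 2 GB address space that holds about 2000 default-size stacks could in principle hold many times more threads.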
I don't know the "real" answer, but my guess is:
It's committed on-demand.
Do you really need it?
If the system uses 1 MiB for a stack, then a typical system with 1024 threads would be using 1 GiB of memory for (mostly) nothing... which may not be what you want, especially since you don't really need it.
One reason is, even though memory is huge these days, it is still not unlimited. A 32-bit process is normally limited to 4GB of address space (yes, you can use PAE to increase that, but that requires support from the OS and a return to a segmented memory model.) Each thread uses up some of that memory for its stack, and if a stack is megabytes in size -- whether it's paged in or not -- it's taking up a significant part of the app's address space.
The smaller the stack, the more threads you can squeeze into the app, and the more memory you have available for everything else. Ideally, you want a stack just large enough to handle all possible control flows through the thread, but small enough that you don't have wasted address space.
There are two things here. First, the limit on stack size puts a limit on the number of processes/threads in the system. And even then, the limit is not because of the size of physical memory but because of the limit on addressable virtual memory. Secondly, processes/threads rarely need more stack than that, and if they do, they can ask for it (libraries handle this seamlessly). So, when starting a new process/thread, it makes sense to give it a small stack.
Other answers here already mention the core concept: the most significant resource consumed by a stack is address space (since its implementation requires chunks of contiguous address space), and the default space consumed on Windows for each thread is not insignificant.
However the full story is extremely nuanced (and can and will change over time) over many layers and levels.
This article by Mark Russinovich as part of his "Pushing the limits of Windows" series goes into extremely detailed levels of analysis. The work is in no way an introductory article though, and most people would not consider it the sort of thing that would be expected to be known in a job interview unless perhaps you were interviewing for a job in that particular field.
Maybe because every time you call a function the OS has to allocate memory for that function's stack. Because functions can chain, several function calls will incur more stack allocations. A large default stack size, like 4 GiB, would be impractical. But that's just my guess...

Reduce Membase quota per bucket to 5 MB

In Heroku, I notice that they limit my free Memcached Bucket (actually Membase) to 5MB. However, I tried it on my own server and cannot set Bucket quota to less than 64MB (per node, and for Memcached bucket type). For Membase bucket type, it's even more: 100MB.
Hmm, my server has a humble amount of RAM, and I need to allocate only a very small amount of memory to Memcached. Please advise.
Heroku is running a slightly modified version of our memcached software that lets them keep the bucket overhead very low. Unfortunately the "productized" version has some limits imposed to prevent the software from getting itself into trouble.
Especially for Membase buckets, we need at least 100MB in order to run safely.
You may be able to reduce/eliminate these limits if you recompile the source, but that wouldn't be a supported configuration.
Perry
Sorry for the delay in getting back to this...
As with any piece of software, there are internal data structures that need RAM to run...that's what gets allocated immediately with Membase.
If you install memcached, it will use as much RAM as you configure it to use...no more, no less.

Does it make sense to cache data obtained from a memory mapped file?

Or would it be faster to re-read that data from the mapped memory again, since the OS might implement its own cache?
The nature of data is not known in advance, it is assumed that file reads are random.
I wanted to mention a few things I've read on the subject. The answer is no; you don't want to second-guess the operating system's memory manager.
The first comes from the idea that you want your program (e.g. MongoDB, SQL Server) to try to limit your memory based on a percentage of free RAM:
Don't try to allocate memory until there is only x% free
Occasionally, a customer will ask for a way to design their program so it continues consuming RAM until there is only x% free. The idea is that their program should use RAM aggressively, while still leaving enough RAM available (x%) for other use. Unless you are designing a system where you are the only program running on the computer, this is a bad idea.
(read the article for the explanation of why it's bad, including pictures)
Next comes from some notes from the author of Varnish, a reverse proxy:
Varnish Cache - Notes from the architect
So what happens with squid's elaborate memory management is that it gets into fights with the kernel's elaborate memory management, and like any civil war, that never gets anything done.
What happens is this: Squid creates an HTTP object in "RAM" and it gets used some times rapidly after creation. Then after some time it gets no more hits and the kernel notices this. Then somebody tries to get memory from the kernel for something and the kernel decides to push those unused pages of memory out to swap space and use the (cache-RAM) more sensibly for some data which is actually used by a program. This, however, is done without squid knowing about it. Squid still thinks that these http objects are in RAM, and they will be, the very second it tries to access them, but until then, the RAM is used for something productive.
Imagine you do cache something from a memory-mapped file. At some point in the future that memory holding that "cache" will be swapped out to disk.
the OS has written to the hard-drive something which already exists on the hard drive
Next comes a time when you want to perform a lookup from your "cache" memory, rather than the "real" memory. You attempt to access the "cache", and since it has been swapped out of RAM the hardware raises a PAGE FAULT, and cache is swapped back into RAM.
your cache memory is just as slow as the "real" memory, since both are no longer in RAM
Finally, you want to free your cache (perhaps your program is shutting down). If the "cache" has been swapped out, the OS must first swap it back in so that it can be freed. If instead you just unmapped your memory-mapped file, everything is gone (nothing needs to be swapped in).
in this case your cache makes things slower
Again from Raymond Chen: If your application is closing - close already:
When DLL_PROCESS_DETACH tells you that the process is exiting, your best bet is just to return without doing anything
I regularly use a program that doesn't follow this rule. The program allocates a lot of memory during the course of its life, and when I exit the program, it just sits there for several minutes, sometimes spinning at 100% CPU, sometimes churning the hard drive (sometimes both). When I break in with the debugger to see what's going on, I discover that the program isn't doing anything productive. It's just methodically freeing every last byte of memory it had allocated during its lifetime.
If my computer wasn't under a lot of memory pressure, then most of the memory the program had allocated during its lifetime hasn't yet been paged out, so freeing every last drop of memory is a CPU-bound operation. On the other hand, if I had kicked off a build or done something else memory-intensive, then most of the memory the program had allocated during its lifetime has been paged out, which means that the program pages all that memory back in from the hard drive, just so it could call free on it. Sounds kind of spiteful, actually. "Come here so I can tell you to go away."
All this anal-retentive memory management is pointless. The process is exiting. All that memory will be freed when the address space is destroyed. Stop wasting time and just exit already.
The reality is that programs no longer run in "RAM"; they run in memory - virtual memory.
You can make use of a cache, but you have to work with the operating system's virtual memory manager:
you want to keep your cache within as few pages as possible
you want to ensure they stay in RAM, by the virtue of them being accessed a lot (i.e. actually being a useful cache)
Accessing:
a thousand 1-byte locations around a 400GB file
is much more expensive than accessing
a single 1000-byte location in a 400GB file
In other words: you don't really need to cache data, you need a more localized data structure.
If you keep your important data confined to a single 4k page, you will play much nicer with the VMM; Windows is your cache.
When you add 64-byte quad-word aligned cache-lines, there's even more incentive to adjust your data structure layout. But then you don't want it too compact, or you'll start suffering performance penalties of cache flushes from False Sharing.
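As a concrete sketch of "the OS is your cache": map the file and read the region you need directly through the mapping instead of copying it into a second, user-level cache. POSIX calls are shown here (on Windows, CreateFileMapping/MapViewOfFile play the same role); the file name, offset and length are placeholders.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    /* "data.bin" is a placeholder path for the example. */
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    if (st.st_size == 0) { fprintf(stderr, "empty file\n"); return 1; }

    /* Map the whole file read-only: the kernel's page cache now *is*
     * the cache, and frequently touched pages tend to stay resident. */
    const unsigned char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                     MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Read a localized record directly through the mapping rather than
     * memcpy'ing it into a separate user-level cache that can itself
     * be paged out. Offset and length are illustrative. */
    size_t offset = 0, len = 16;
    if ((off_t)(offset + len) <= st.st_size) {
        unsigned long sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += data[offset + i];
        printf("checksum of record: %lu\n", sum);
    }

    munmap((void *)data, (size_t)st.st_size);
    close(fd);
    return 0;
}
```

If you do need an application-level structure on top of this, keep it small and hot (few pages, frequently touched) so the VMM has every reason to keep it resident, which is exactly the point made above.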
The answer is highly OS-specific. Generally speaking, there is no sense in caching this data: both the "cached" copy and the memory-mapped data can be paged out at any time.
If there is any difference, it will be specific to an OS; unless you need that granularity, there is no sense in caching the data.