How are B-trees optimized for systems that read and write large blocks of data?

It probably looks like a simple question to you, but please help me clarify my understanding of the whole workflow. One of the B-tree's advantages is that a "B-tree is optimized for systems that read and write large blocks of data". I wonder if the following is correct:
Case where the B-tree's nodes contain both keys and data (as opposed to containing keys and "pointers" to data stored on disc). Then the whole B-tree has to be stored on disc, since a large DB implemented by this tree does not fit in memory. It is then clear why reading large blocks makes a difference: when we are looking for a specific key, we retrieve a block with several "suspicious" keys in one disc access. OK, but what if a new key is inserted: how can we be sure it ends up in the same block as the other keys of the node? Wiki suggests using partially full blocks to speed up insertions and deletions. Is that the only reasonable way?
Case where the B-tree's nodes contain keys and "pointers" to data stored on disc. Then the B-tree and the data are "separated". The B-tree can be stored on disc or in memory. If it is stored on disc, the situation is basically the same. If the B-tree is in memory, then I do not see any advantage in reading large blocks of data. Am I missing something very simple, or is such a scheme (data on disc, index B-tree in memory) not used at all?
And yes, I did check SO, but still did not quite get it.
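To make the "one block per node" idea concrete, here is a rough back-of-the-envelope sketch. The block size, key size and pointer size are assumptions picked for illustration, and the functions are hypothetical helpers, not part of any library. It estimates how many children fit in one block-sized node and how many levels (roughly, block reads) a lookup needs.

```python
def btree_fanout(block_size, key_size, pointer_size):
    """Children per node when one node fills one disc block.

    A node with n keys has n + 1 child pointers, so we need
    n * key_size + (n + 1) * pointer_size <= block_size.
    """
    max_keys = (block_size - pointer_size) // (key_size + pointer_size)
    return max_keys + 1


def levels_needed(num_keys, fanout):
    """Smallest number of levels h such that fanout ** h >= num_keys (an estimate)."""
    levels, reachable = 1, fanout
    while reachable < num_keys:
        reachable *= fanout
        levels += 1
    return levels


# Assumed sizes: 4 KiB blocks, 8-byte keys, 8-byte child pointers.
fanout = btree_fanout(block_size=4096, key_size=8, pointer_size=8)
print(fanout)                           # 256
print(levels_needed(10**9, fanout))     # 4
print(levels_needed(10**9, 2))          # 30 (a binary tree, for comparison)
```

Under these assumptions a lookup among a billion keys touches about 4 blocks instead of about 30 nodes, which is where the benefit of large block reads comes from.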


Why do Kafka index files use memory-mapped files, but log files don't?

We know that Kafka uses memory-mapped files for its index files; however, its log files don't use memory mapping.
My question is: why do index files use memory-mapped files while log files don't?
Implementing both log and index appends with mmap would introduce a data consistency problem. mmap does not guarantee that data is flushed from memory to the file (assuming the flush relies on the OS rather than on an explicit msync(2) call). If an index update gets flushed but the log data does not, for whatever reason, the data in the log can no longer be interpreted.
BTW, for append-only data, in the write direction we only need to care about the next block (buffer) to be written, so the sheer size of the data should not matter here.
How many bytes can be mapped into memory depends on the address space. For example, a 32-bit architecture can address at most 4 GB, so only files (or portions of files) smaller than that can be mapped. Kafka logs, which are often large, might have only portions mapped at a time, which complicates reading them.
However, index files are sparse, which means they are relatively small. Mapping them into memory can speed up lookups, and that is the primary benefit memory-mapped files offer.
Logs are where the messages are stored; the index files point to positions in the logs.
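As a rough illustration of the sparse-index idea above, here is a minimal sketch, not Kafka's actual code. The entry layout (two 4-byte integers), the file name and all function names are assumptions made for the example. It memory-maps a small index file and binary-searches it for the last entry at or before a target offset; a reader would then scan the log forward from the returned position.

```python
import mmap
import os
import struct
import tempfile

# Assumed entry layout for this sketch: two big-endian 4-byte integers,
# (message offset relative to the segment start, byte position in the log file).
ENTRY = struct.Struct(">ii")


def write_index(path, entries):
    """Write sorted (relative_offset, position) pairs as fixed-size entries."""
    with open(path, "wb") as f:
        for relative_offset, position in entries:
            f.write(ENTRY.pack(relative_offset, position))


def lookup(index, target_offset):
    """Binary-search for the last entry with relative_offset <= target_offset.

    Falls back to (0, 0), i.e. the start of the log, when the target
    precedes every entry in the index.
    """
    lo, hi = 0, len(index) // ENTRY.size - 1
    best = (0, 0)
    while lo <= hi:
        mid = (lo + hi) // 2
        relative_offset, position = ENTRY.unpack_from(index, mid * ENTRY.size)
        if relative_offset <= target_offset:
            best = (relative_offset, position)
            lo = mid + 1
        else:
            hi = mid - 1
    return best


def lookup_in_mapped_file(entries, target_offset):
    """Write the entries to a temporary index file, mmap it, and look up the target."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "segment.index")  # hypothetical file name
        write_index(path, entries)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as index:
                return lookup(index, target_offset)


# One index entry per 100 messages: the index stays small even for a big log.
sparse_entries = [(0, 0), (100, 4096), (200, 8192)]
print(lookup_in_mapped_file(sparse_entries, 150))  # (100, 4096)
```

Because every entry has the same size, the binary search only needs pointer arithmetic on the mapped region, which is what makes mmap a natural fit for the index.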
There is a nice, colorful blog post, explaining what is going on.
Having a fast index to improve read performance is a common optimization in databases where writes are append-only (almost all LSM-tree databases do some form of this). Also, as others have pointed out:
Indexes are sparse, so they have a smaller memory footprint. Even the sparsity of the index is configurable, which is useful as data grows.
Append-only write patterns are faster than random seeks (especially true for SSDs), and therefore don't need much attention for optimization.
If the log file were memory-mapped, then, since physical memory is limited, it could cause frequent page faults, which are a seriously expensive overhead. Using the sendfile system call is more suitable.
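For what the sendfile approach looks like from user code, here is a small sketch using Python's socket.sendfile, which uses os.sendfile where the platform supports it and falls back to plain send() otherwise. The file name, payload and helper function are made up for the demo; this is not how Kafka itself is written.

```python
import os
import socket
import tempfile


def send_file_over_socket(payload):
    """Write payload to a temp file, push it through a socket with sendfile,
    and return what the other end of the socket received."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "segment.log")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(payload)

        sender, receiver = socket.socketpair()
        try:
            with open(path, "rb") as f:
                # The file contents go to the socket without being read
                # into a buffer in this program first (where supported).
                sent = sender.sendfile(f)

            chunks, received = [], 0
            while received < sent:
                chunk = receiver.recv(65536)
                if not chunk:
                    break
                chunks.append(chunk)
                received += len(chunk)
        finally:
            sender.close()
            receiver.close()
    return b"".join(chunks)


data = b"message-batch " * 64  # small payload so it fits in the socket buffer
assert send_file_over_socket(data) == data
```

The point of the design is that the log bytes never have to be mapped or copied into the application's address space just to be forwarded to a consumer.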

Lucene: Loading Index files while searching?

Can anyone explain how index files are loaded into memory while searching?
Is the whole file (fnm, tis, fdt etc.) loaded at once or in chunks?
How are individual segments loaded, and in which order?
How to encrypt Lucene index?
The main point of having index segments is that you can rarely load the whole index into memory.
The most important limitation taken into account when designing the index format is that disk seek time is relatively long (on platter-based hard drives, which are still the most widely used). A good estimate is that the transfer time per byte is about 0.01 to 0.02 μs, while the average seek time of the disk head is about 5 ms!
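To put those two numbers side by side, here is a quick calculation using the figures above (5 ms per seek, 0.01 to 0.02 μs per byte). The workload sizes in the second half are arbitrary examples chosen for illustration.

```python
SEEK_NS = 5_000_000          # 5 ms average seek, in nanoseconds
TRANSFER_NS_PER_BYTE = 20    # 0.02 microseconds per byte (the slower estimate)


def bytes_transferable_per_seek(transfer_ns_per_byte):
    """How many bytes could be streamed in the time one seek takes."""
    return SEEK_NS // transfer_ns_per_byte


def read_time_ns(num_reads, bytes_per_read):
    """Total time if every read pays one seek plus its transfer time."""
    return num_reads * (SEEK_NS + bytes_per_read * TRANSFER_NS_PER_BYTE)


print(bytes_transferable_per_seek(20))   # 250000 bytes at 0.02 us/byte
print(bytes_transferable_per_seek(10))   # 500000 bytes at 0.01 us/byte

# Example workload: the same 1,000,000 bytes, read in one piece or in 1000 pieces.
print(read_time_ns(1, 1_000_000))        # 25000000 ns   = 25 ms
print(read_time_ns(1000, 1_000))         # 5020000000 ns = about 5 s
```

So a single seek costs as much as transferring a few hundred kilobytes, which is why the format tries to keep related data contiguous.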
So the part that is kept in memory is typically only the dictionary, used to find out the beginning block of the postings list on the disk*. The other parts are loaded only on-demand and then purged from the memory to make room for other searches.
As for encryption, it depends on whether you need to keep the index encrypted all the time (even when in memory) or whether it suffices to encrypt only the index files. For the latter, I think an encrypted file system will be enough. The former is certainly possible too, as various index compression techniques are already in place. However, I don't think it's widely used, as the first and foremost requirement for a full-text engine is speed.
[*] It's not really that simple: since we perform binary searches against the dictionary, we need all entries in the first structure to have equal length. That is clearly not the case for ordinary dictionary words, and padding would be far too costly (think of the word lengths of some chemical substances), so we actually maintain two levels of dictionary. The first one (which needs to fit in memory and is stored in .tii files) keeps a sorted list of the starting positions of terms in the second index (.tis files). The second index is then a concatenated array of all terms in increasing order, along with a pointer to the sector in the .frq file. The second index often fits in memory and is loaded at the start, but that can be impossible, e.g. for bigram indexes. Also note that for some time now Lucene has by default not used individual files but so-called compound files (with the .cfs extension) to cut down the number of open files.
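The two-level dictionary described in the footnote can be sketched roughly as follows. This is a toy model, not Lucene's real file format: the record layout (1-byte length, term bytes, 4-byte pointer) and all names are assumptions. The first level is a list of fixed-size start positions; the second is one concatenated blob of variable-length terms.

```python
import struct


def build_dictionary(term_to_pointer):
    """Return (starts, blob): start positions plus the concatenated term records."""
    starts, blob = [], bytearray()
    for term in sorted(term_to_pointer):
        encoded = term.encode("utf-8")        # this sketch assumes <= 255 bytes
        starts.append(len(blob))              # first level: fixed-size entry
        blob += struct.pack(">B", len(encoded)) + encoded
        blob += struct.pack(">I", term_to_pointer[term])
    return starts, bytes(blob)


def read_entry(blob, pos):
    """Decode the (term, postings pointer) record starting at pos."""
    (length,) = struct.unpack_from(">B", blob, pos)
    term = blob[pos + 1:pos + 1 + length].decode("utf-8")
    (pointer,) = struct.unpack_from(">I", blob, pos + 1 + length)
    return term, pointer


def find_postings(starts, blob, term):
    """Binary search over the start positions; returns the pointer or None."""
    lo, hi = 0, len(starts) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate, pointer = read_entry(blob, starts[mid])
        if candidate == term:
            return pointer
        if candidate < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


starts, blob = build_dictionary(
    {"acid": 0, "index": 512, "methylenedioxymethamphetamine": 1024}
)
print(find_postings(starts, blob, "index"))    # 512
print(find_postings(starts, blob, "missing"))  # None
```

The long chemical name costs only its own length in the blob, while the structure being binary-searched still has equal-length entries.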

Is there a data storage where I can access data directly via array index instead of hash key? Redis? MongoDB?

I need an external, memory-efficient (!) C/C++ data storage for a Java app which does not have the downside of a normal database lookup (B-tree) but which uses my IDs as array indexes. Is there an open source solution for this? I implemented this in C++ in-memory only, but I would like to have a "storage to disc" option in case of a crash or for backup. A Java binding would also be cool.
E.g. Redis looks good, but when reading the docs I see that in general things are accessed by hash keys, which are O(1) only in theory. Or can I somehow force the hashing scheme to match the storage index? Lists are also not appropriate, as they are implemented as linked lists. Or what about MongoDB?
And yes, I really need that fast read access (write can be "okayish slow" :)) - it is no premature optimization but if there is no alternative I'll try redis before rolling my own. Also Java is not possible (as I said: memory efficient ;))
With a remote key-value store, the overhead is very often dominated by the network and protocol management rather than data access itself. That's why with efficient key-value stores (like Redis for instance), almost all the operations actually have the same cost.
The Redis benchmark page contains a good illustration of this point.
In other words, in the context of an in-memory remote store, and considering only the latency, a random access array will have exactly the same performance as a hash table, and even less efficient O(log n) containers like red-black trees, B-trees, etc. will be quite close.
If you really want maximum performance, I would suggest using an embedded (i.e. in-process) store. For instance, both BerkeleyDB and Tokyo Cabinet provide disk-based random access containers for fixed-length records.
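To show what "random access container for fixed-length records" means in practice, here is a hand-rolled sketch. It does not use the BerkeleyDB or Tokyo Cabinet APIs; the record layout and class name are invented for the example. The ID is used directly as an array index: record i lives at byte offset i * record_size in the file.

```python
import os
import struct
import tempfile

# Hypothetical record layout: one 8-byte integer and one 8-byte float.
RECORD = struct.Struct("<qd")


class FixedRecordFile:
    """Disk-backed array of fixed-size records, addressed by integer ID."""

    def __init__(self, path, capacity):
        self.capacity = capacity
        self._file = open(path, "w+b")
        self._file.truncate(capacity * RECORD.size)  # preallocate all slots

    def _seek(self, record_id):
        if not 0 <= record_id < self.capacity:
            raise IndexError(record_id)
        self._file.seek(record_id * RECORD.size)     # no hashing, no tree walk

    def put(self, record_id, count, score):
        self._seek(record_id)
        self._file.write(RECORD.pack(count, score))

    def get(self, record_id):
        self._seek(record_id)
        return RECORD.unpack(self._file.read(RECORD.size))

    def close(self):
        self._file.close()


with tempfile.TemporaryDirectory() as tmpdir:
    store = FixedRecordFile(os.path.join(tmpdir, "records.bin"), capacity=1000)
    store.put(42, 7, 0.5)
    store.put(999, -1, 2.25)
    print(store.get(42))    # (7, 0.5)
    print(store.get(999))   # (-1, 2.25)
    store.close()
```

A lookup is one seek plus one small read, and the OS page cache keeps frequently read records in memory, so an embedded store along these lines avoids the network and protocol overhead discussed above.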
KDB is the go-to solution for this problem in the financial systems (algo trading) world. Be prepared to have your brain melted by the syntax though. Oh, and it is not open source.

mongoDB vs relational databases when data can't fit into memory?

First of all, I apologize for my potentially shallow understanding of NoSQL architecture (and databases in general) so try to bear with me.
I'm thinking of using mongoDB to store resources associated with a UUID. The resources can be things such as large image files (tens of megabytes), so it makes sense to store them as files and store just links in my database along with the associated metadata. There's also the added flexibility of decoupling the actual location of the resource files, so I can use a different third party to store the files if I need to.
Now, one document which describes a resource would be about 1 kB. At first I expect a couple hundred thousand resource documents, which would equal some hundreds of megabytes in database size, easily fitting into server memory. But in the future I might have to scale this to the order of tens of MILLIONS of documents. This would be tens of gigabytes, which I can't squeeze into server memory anymore.
Only the index could still fit in memory, being around a gigabyte or two. But if I understand correctly, I'd have to read from disk every time I did a lookup on a UUID. Is there a substantial speed benefit from mongoDB over a traditional relational database in such a situation?
BONUS QUESTION: is there an existing, established way of doing what I'm trying to achieve? :)
MongoDB doesn't suddenly become slow the second the entire database no longer fits into physical memory. MongoDB currently uses a storage engine based on memory-mapped files. This means data that is accessed often will usually be in memory (OS managed, but assume an LRU scheme or something similar).
As such, it may not slow down at all at that point, or only slightly; it really depends on your data access patterns. It's a similar story with indexes: if you (right) balance your index appropriately, and if your use case allows it, you can have a huge index with only a fraction of it in physical memory and still have very decent performance, with the majority of index hits happening in physical memory.
Because you're talking about UUIDs, this might all be a bit hard to achieve, since there's no guarantee that the same limited group of users is generating the vast majority of throughput. In those cases sharding really is the most appropriate way to maintain quality of service.
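The dependence on access patterns can be illustrated with a toy simulation. It assumes a plain LRU cache, which is only an approximation of what an OS page cache does, and the page counts and the 90% skew are made-up parameters. The cache holds only 10% of the pages in both runs.

```python
import random
from collections import OrderedDict


def hit_rate(accesses, cache_pages):
    """Fraction of accesses served from an LRU cache of the given size."""
    cache, hits = OrderedDict(), 0
    for page in accesses:
        if page in cache:
            hits += 1
            cache.move_to_end(page)          # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)    # evict the least recently used
    return hits / len(accesses)


def simulate(total_pages=10_000, cache_pages=1_000, hot_pages=500,
             num_accesses=50_000, seed=0):
    """Return (uniform_hit_rate, skewed_hit_rate) for the same cache size."""
    rng = random.Random(seed)
    uniform = [rng.randrange(total_pages) for _ in range(num_accesses)]
    # Skewed: 90% of accesses go to a small hot set, the rest anywhere.
    skewed = [
        rng.randrange(hot_pages) if rng.random() < 0.9
        else rng.randrange(total_pages)
        for _ in range(num_accesses)
    ]
    return hit_rate(uniform, cache_pages), hit_rate(skewed, cache_pages)


uniform_rate, skewed_rate = simulate()
print(f"uniform access: {uniform_rate:.0%} of reads served from memory")
print(f"skewed access:  {skewed_rate:.0%} of reads served from memory")
```

With uniformly random keys (which is what random UUID lookups tend to look like) the hit rate stays close to the fraction of data that fits in memory, while a skewed workload is served mostly from memory even though the data set is ten times larger than the cache.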
"This would be tens of gigabytes which I can't squeeze into server memory anymore."
That's why MongoDB gives you sharding to partition your data across multiple mongod instances (or replica sets).
In addition to considering sharding, or maybe even before, you should also try to use covered indexes as much as possible, especially if they fit your use cases.
This way you do not HAVE to load entire documents into memory. Your indexes can help out.
If you have to display your entire document all the time based on the id, then the general rule of thumb is to try to keep the working set in memory.
This is one of the resources that talks about that. There is a video on mongodb's site too that speaks about this.
If you size the RAM so that the working set fits in memory, you will not have to shard right away; you can always add sharding later, which will improve the scalability of your app over time.
Again, these are not absolute statements; they are general guidelines. You should think through your usage patterns and make sure the guidelines are relevant to what you are doing.
Personally, I have not had the need to fit everything in ram.

How does memcache store data?

I am a newbie to caching and have no idea how data is stored in a cache. I have tried to read a few examples online, but everybody provides code snippets for storing and getting data rather than explaining how data is cached using memcache. I have read that it stores data in key-value pairs, but I am unable to understand where those key-value pairs are stored.
Also, could someone explain why data going into the cache is hashed or encrypted? I am a little confused between serialising data and hashing data.
A couple of quotes from the Memcache page on Wikipedia:
"Memcached's APIs provide a giant hash table distributed across multiple machines. When the table is full, subsequent inserts cause older data to be purged in least recently used (LRU) order."
"The servers keep the values in RAM; if a server runs out of RAM, it discards the oldest values. Therefore, clients must treat Memcached as a transitory cache; they cannot assume that data stored in Memcached is still there when they need it."
The rest of the page on Wikipedia is pretty informative, and it might help you get started.
They are stored in memory on the server. That way, if you use the same key/value often and you know it won't change for a while, you can keep it in memory for faster access.
I'm not deeply familiar with memcached, so take what I have to say with a grain of salt :-)
Memcached is a separate process or set of processes that keep a key-value store in memory so the values can be easily accessed later. In a sense, they provide another global scope that can be shared by different parts of your program, enabling a value to be calculated once and used in many distinct and separate areas of your program. In another sense, they provide a fast, forgetful database that can be used to store transient data. The data is not stored permanently, but in general it will be stored beyond the life of a particular request (it is possible for Memcached to never store your data, so every read will be a miss, but that's generally an indication that you do not have it set up correctly for your use case).
The data going into cache does not have to be hashed or encrypted (but both things can happen to the data, depending on the caching mechanism.)
Serializing data actually has nothing to do with either concept -- instead, it is the process of changing data from one format (generally one suited for in-memory storage) to another one (generally suitable for storage in a persistent medium.) Another term for this process is marshalling and unmarshalling.
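To make the difference between hashing and serializing concrete, here is a small sketch. The server names are made up, the helper functions are hypothetical, and real memcached client libraries differ in the hash function and serialization format they use. The key is hashed only to decide which server holds the entry; the value is serialized so it can be stored as bytes and turned back into the original object later.

```python
import hashlib
import json

# Hypothetical server list; a real client would get this from configuration.
SERVERS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]


def pick_server(key, servers=SERVERS):
    """Hashing: map a key to one server. One-way, the key cannot be recovered."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


def serialize(value):
    """Serializing: turn an in-memory object into bytes. Fully reversible."""
    return json.dumps(value).encode("utf-8")


def deserialize(raw):
    """Inverse of serialize: rebuild the object from the stored bytes."""
    return json.loads(raw.decode("utf-8"))


profile = {"name": "Ada", "visits": 3}
stored_bytes = serialize(profile)

print(pick_server("user:42") in SERVERS)        # True
print(deserialize(stored_bytes) == profile)     # True
```

Neither step is encryption: anyone who can read the server's memory can read the serialized value as-is.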