Is there any java memory structures that will automatically page data to disk? - memcached

Basically, I am caching a bunch of files in memory. The problem is, if I get too many files cached into memory I can run out of memory.
Is there any sort of java memory structure that will automatically page part of itself to disk?
For example, I would like to set a 2 mb limit the size of the files in memory. After that limit I would like some of the files to be written to disk.
Is there any library that does this sort of thing?
Grae

‎"files in memory". Conventionally, in memory data is stored in some data structure like a HashMap or whatever, and referred to as a 'file' when its written to disc. You could code a data storage class which did this programmatically. I don't know of any library which did this. It would be pretty useful. In effect you would be implementing virtual memory.
This link will might help you :
http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html

EhCache is general purpose caching library for Java. One of the option is to have disk-backed cache, which overflows to file system. Seems to be exactly what you need.

Related

why kafka index files use memory mapped files ,but log files don't?

We know that kafka use memory mapped files for it's index files ,however it's log files don't use the memory mapped files technology.
My question is why index files use memory mapped files, however log files don't ?
Implementing both log and index appending with mmap approach will bring data consistency problem. mmap is not 100% guarantee to flush the data from memory to file(assuming the flush reply on OS instead of an explicitly calling on munmap(2)), if the index update get flushed but log data not get flushed successfully due to some reason, the data in the log can not be understood anymore.
BTW, for a append-only data, in the write direction, we only need to care about next-to-write block(buffer), so the huge data should not impact this.
That how many bytes can be mapped into the memory relates to the address space. For example, a 32-bit architecture can only address 4GB or even smaller portions of files. Kafka logs which are often larger enough might have only portions mapped at a time, therefore complicating reading them.
However, index files are sparse which means they are relatively small in size. Mapping them into the memory could speed up the lookup process and that's the primary benefit memory-mapped files offer.
Logs are where the messages are stored, the index files point to the position in the logs.
There is a nice, colorful blog post, explaining what is going on.
Having a fast index to improve read performance is a common optimization in databases where writes are append-only(Almost all LSTM databases do some form of this). Also as others have pointed out:
indexes are sparse, so smaller memory footprint. Even the sparsity of the index is configurable, which is useful as data grows.
Append only write patterns are faster than random seeks(especially true for SSDs), and therefore don't need a lot of attention for optimization.
if mmap log file, as physical memory is limited, it may cause page fault frequently which is a seriously expensive overhead. use sendFile system call is more suitable

Memory usage of zfs for mapped files

I read the following on https://blogs.oracle.com/roch/entry/does_zfs_really_use_more
There is one peculiar workload that does lead ZFS to consume more
memory: writing (using syscalls) to pages that are also mmaped. ZFS
does not use the regular paging system to manage data that passes
through reads and writes syscalls. However mmaped I/O which is closely
tied to the Virtual Memory subsystem still goes through the regular
paging code . So syscall writting to mmaped pages, means we will keep
2 copies of the associated data at least until we manage to get the
data to disk. We don't expect that type of load to commonly use large
amount of ram
What does this mean exactly? does this mean that zfs will "uselessly" double cache any memory region that is backed by a memory mapped file? or does "using syscalls" mean writing using some other method of writing that I am not familiar with.
If so, am I better off keeping the working directories of files written this way on a ufs partition?
Does this mean that zfs will "uselessly" double cache any memory region that is backed by a memory mapped file?
Hopefully, no.
or does "using syscalls" mean writing using some other method of writing that I am not familiar with.
That method is just regular low level write(fd, buf, nbytes) system calls and similars and not what memory mapped files are designed to support: accessing file content just with reading / writing memory by using pointers, using the file data as a byte array or whatever.
If so, am I better off keeping the working directories of files written this way on a ufs partition?
No, unless memory mapped files that are also written to using system calls sum to a significant part of your RAM workload, which is quite unlikely to happen.
PS: Note that this blog is almost ten years old. There might have been changes in the implementation since that time.

What is memory map in mongodb?

I read about this topic at
http://docs.mongodb.org/manual/faq/storage/#faq-storage-memory-mapped-files
But didn't understand point .Does it is used to keep query data in physical memory ? How it is related with virtual memory ? Why it is important and how it effect at performance ?
I'll try to explain in a simple way.
MongoDB (and other storage systems) stores data in files. Each database has its own files, created as they are needed. The first file weights 64 MB, the next 128 and so up to 2 GB. Then, new files created weigh 2 GB. Each of these files are logically divided into different blocks, that correspond with one virtual memory block.
When MongoDB needs to access a file or a part of it, loads all virtual blocks corresponding to that file or parts of the files into memory using mmap.On the other hand, mmap is a way for applications to leverage the system cache (linux).
So what really happens when you are doing a query is that MongoDB "tells" the OS to load the part it needs with the data requested, so the next time is requested will be faster. As you can imagine this is a very important feature to boost performance in databases like MongoDB, because accessing RAM is way faster than hard drive.
Another benefit of using mmap is that MongoDB memory will grow as it needs and the system memory is free.

MongoDB as file storage

i'm trying to find the best solution to create scalable storage for big files. File size can vary from 1-2 megabytes and up to 500-600 gigabytes.
I have found some information about Hadoop and it's HDFS, but it looks a little bit complicated, because i don't need any Map/Reduce jobs and many other features. Now i'm thinking to use MongoDB and it's GridFS as file storage solution.
And now the questions:
What will happen with gridfs when i try to write few files
concurrently. Will there be any lock for read/write operations? (I will use it only as file storage)
Will files from gridfs be cached in ram and how it will affect read-write perfomance?
Maybe there are some other solutions that can solve my problem more efficiently?
Thanks.
I can only answer for MongoDB here, I will not pretend I know much about HDFS and other such technologies.
The GridFs implementation is totally client side within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself, effectively MongoDB itself does not even understand they are files ( http://docs.mongodb.org/manual/applications/gridfs/ ).
This means that querying for any part of the files or chunks collection will result in the same process as it would for any other query, whereby it loads the data it needs into your working set ( http://en.wikipedia.org/wiki/Working_set ) which represents a set of data (or all loaded data at that time) required by MongoDB within a given time frame to maintain optimal performance. It does this by paging it into RAM (well technically the OS does).
Another point to take into consideration is that this is driver implemented. This means that the specification can vary, however, I don't think it does. All drivers will allow you to query for a set of documents from the files collection which only houses the files meta data allowing you to later serve the file itself from the chunks collection with a single query.
However that is not the important thing, you want to serve the file itself, including its data; this means that you will be loading the files collection and its subsequent chunks collection into your working set.
With that in mind we have already hit the first snag:
Will files from gridfs be cached in ram and how it will affect read-write perfomance?
The read performance of small files could be awesome, directly from RAM; the writes would be just as good.
For larger files, not so. Most computers will not have 600 GB of RAM and it is likely, quite normal in fact, to house a 600 GB partition of a single file on a single mongod instance. This creates a problem since that file, in order to be served, needs to fit into your working set however it is impossibly bigger than your RAM; at this point you could have page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ) whereby the server is just page faulting 24/7 trying to load the file. The writes here are no better as well.
The only way around this is to starting putting a single file across many shards :\.
Note: one more thing to consider is that the default average size of a chunks "chunk" is 256KB, so that's a lot of documents for a 600GB file. This setting is manipulatable in most drivers.
What will happen with gridfs when i try to write few files concurrently. Will there be any lock for read/write operations? (I will use it only as file storage)
GridFS, being only a specification uses the same locks as on any other collection, both read and write locks on a database level (2.2+) or on a global level (pre-2.2). The two do interfere with each other as well, i.e. how can you ensure a consistent read of a document that is being written to?
That being said the possibility for contention exists based on your scenario specifics, traffic, number of concurrent writes/reads and many other things we have no idea about.
Maybe there are some other solutions that can solve my problem more efficiently?
I personally have found that S3 (as #mluggy said) in reduced redundancy format works best storing a mere portion of meta data about the file within MongoDB, much like using GridFS but without the chunks collection, let S3 handle all that distribution, backup and other stuff for you.
Hopefully I have been clear, hope it helps.
Edit: Unlike what I accidently said, MongoDB does not have a collection level lock, it is a database level lock.
Have you considered saving meta data onto MongoDB and writing actual files to Amazon S3? Both have excellent drivers and the latter is highly redundant, cloud/cdn-ready file storage. I would give it a shot.
I'll start by answering the first two:
There is a write lock when writing in to GridFS, yes. No lock for reads.
The files wont be cached in memory when you query them, but their metadata will.
GridFS may not be the best solution for your problem. Write locks can become something of a pain when you're dealing with this type of situation, particularly for huge files. There are other databases out there that may solve this problem for you. HDFS is a good choice, but as you say, it is very complicated. I would recommend considering a storage mechanism like Riak or Amazon's S3. They're more oriented around being storage for files, and don't end up with major drawbacks. S3 and Riak both have excellent admin facilities, and can handle huge files. Though with Riak, last I knew, you had to do some file chunking to store files over 100mb. Despite that, it generally is a best practice to do some level of chunking for huge file sizes. There are a lot of bad things that can happen when transferring files in to DBs- From network time outs, to buffer overflows, etc. Either way, your solution is going to require a fair amount of tuning for massive file sizes.

Is Tie::File lazily loading a file?

I'm planning on writing a simple text viewer, which I'd expect to be able to deal with very large sized files. I was thinking of using Tie::File for this, and kind of paginate the lines. Is this loading the lines lazily, or all of them at once?
It won't load the whole file. From the documentation:
The file is not loaded into memory, so this will work even for gigantic files.
As far as I can see from its source code it stores only used lines in memory. And yes, it loads data only when needed.
You can limit amount of used memory with memory parameter.
It also tracks offsets of all lines in the file to optimize disk access.