I am using Apache Ignite 2.0 as a key-value store. My keys are 10- to 15-digit numbers and each value is a single integer. I intend to store around 500 million such key-value pairs.
On benchmarking, I found that storing 1 million such key-value entries uses around 90MB of memory, i.e., around 90 bytes per record even though the value is a single integer. I don't want to spend that much memory just to store an integer.
Can someone shed some light on how I can reduce the memory overhead of the keys? Are there any configuration options? Since I don't need any compression, can I cut down the BinaryObjectImpl wrapper, which uses 40 bytes, or some of the other metadata?
Alternatively, pointers to other key-value stores or technologies I could explore to store a huge number of records (even a few billion) with a smaller memory footprint would be appreciated.
Update:
I am using example-cache.xml, shipped as-is without any changes, as my configuration.
Thanks in advance.
If you only need a key-value store (no data grid), have you tried MapDB? http://www.mapdb.org/
It offers off-heap storage and scales well. We've used it successfully in a couple of projects for cache storage.
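For what it's worth, a minimal sketch of what that could look like (assuming MapDB 3.x; the map name and sample key are made up):

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.HTreeMap;
import org.mapdb.Serializer;

public class OffHeapExample {
    public static void main(String[] args) {
        // Off-heap (direct memory) store: data lives outside the Java heap,
        // bounded by -XX:MaxDirectMemorySize rather than -Xmx.
        DB db = DBMaker.memoryDirectDB().make();

        // Keys are 10-15 digit numbers, so a long fits; values are plain ints.
        HTreeMap<Long, Integer> map = db
                .hashMap("counts", Serializer.LONG, Serializer.INTEGER)
                .createOrOpen();

        map.put(123456789012345L, 42);
        System.out.println(map.get(123456789012345L)); // prints 42

        db.close();
    }
}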
I am considering using memcached in conjunction with my PHP app to store 5 million key-value pairs. My objective is to avoid going back and forth to the DB (which in my case is the filesystem). I may have 100-500 accesses per second to the key-value pairs. The keys and values are both MD5 hashes, in the form:
array( 'MD5X' => 'MD5Y', ... )
I am not sure how the data is stored internally, but if we take 5 million * 16 bytes (keys) + 5 million * 16 bytes (values), we get ~160MB.
(EDIT: after trying with a real memcached instance, I used up 750MB to store all the items.)
The dataset is fixed so I will only read from it.
Questions:
Is this a good or bad design?
Can I force memcached to never (unless the server crashes) have to reload the data, assuming the memory cap is higher than the amount of data stored? If not, what techniques can I employ to achieve the same goal?
Thanks a lot!
Will you get the performance you need? Definitely. Memcache is blazing fast.
We store about 10 million keys and we access memcache about 700 times a sec. It has never let us down.
You can load all the keys into memcache when you start the application and set the expiration to a very long time. The thing you have to remember is that memcache is ultimately a cache, and it should not be used as a storage engine. You have to design for the possibility of not finding the data (key) that you need, and make a DB call in that case.
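As an illustration of that cache-aside pattern, here is a rough sketch using the spymemcached Java client (the question is about PHP, but the idea is identical; loadFromDb is a hypothetical fallback):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class Md5Lookup {
    private final MemcachedClient client;

    Md5Lookup(MemcachedClient client) {
        this.client = client;
    }

    String lookup(String md5Key) {
        // Try the cache first.
        Object cached = client.get(md5Key);
        if (cached != null) {
            return (String) cached;
        }
        // Cache miss (cold start, eviction, restart): fall back to the real store.
        String value = loadFromDb(md5Key);   // hypothetical fallback to the filesystem/DB
        client.set(md5Key, 0, value);        // 0 = "no expiration" (entries can still be evicted)
        return value;
    }

    private String loadFromDb(String md5Key) {
        return "...";                        // placeholder for the real lookup
    }

    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        System.out.println(new Md5Lookup(client).lookup("MD5X"));
        client.shutdown();
    }
}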
Alternatively, you can look at a NoSQL database like Cassandra. It has excellent read and write speeds that should cater to your needs. The only thing is that Cassandra is a bit more difficult to fine-tune than memcache.
MongoDB is a document-oriented database.
Meteor pub/sub/call communicates the data through JSON.
JSON uses a key–value pair style formalism.
This means that each time data is sent, the attribute names (keys) are sent along with the values.
length(json sent) = (length(attribute values) + length(attribute names)) * Xdoc, where Xdoc is the number of documents sent.
Let's simplify and say that, on average, keys and values have the same length.
length(json sent) = 2 * length(attribute values) * Xdoc
This means that roughly half of the data (and I am not even counting the =, ,, { and } characters) is redundant.
Document-based storage is not table/SQL-like, and attributes of documents in the same collection may differ completely.
But does it really make no sense to try to optimize this?
For instance, by building a key dictionary, using a binary encoding, or optimizing for size the way Google Protocol Buffers does?
Why this question? Because I have 10MB collections that the client needs, and it's getting slow. Of course I would optimize with pagination and filtering keys, but I want to know :)
-- A meteor/mongo noob.
PS: I am not looking for a walkthrough, but for an explanation of why no optimisation could be done on the json data length sent.
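To make the key-dictionary idea from the question concrete, here is a toy sketch (plain Java, not Meteor/Mongo code; the attribute names and the 10,000-document count are made up) showing how much of the payload is just repeated attribute names:

public class KeyDictionaryDemo {
    public static void main(String[] args) {
        // The same document serialized with full attribute names...
        String verbose = "{\"firstName\":\"Ada\",\"lastName\":\"Lovelace\",\"favoriteColor\":\"green\"}";
        // ...and with a dictionary mapping firstName->a, lastName->b, favoriteColor->c.
        String compact = "{\"a\":\"Ada\",\"b\":\"Lovelace\",\"c\":\"green\"}";

        int docs = 10_000;
        System.out.println("verbose payload: " + (long) verbose.length() * docs + " bytes");
        System.out.println("compact payload: " + (long) compact.length() * docs + " bytes");
        // The saving per document is roughly the total length of the attribute names,
        // which is why key overhead can be half (or more) of a wide, flat document.
    }
}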
The answer is that the actual number of bytes transferred is very seldom the issue. 10MB is not a lot of data unless you have some limitation on your network.
Transferring 10MB in one, or a few, frames (as in request/response round trips) is no problem. But if you're doing it in 10,000 frames at 1KB per frame, it will take a lot of milliseconds.
And then this data is stored (expanded) client side, most often several times with additional overhead, eating a lot of RAM and making the app slow too.
So unless you have some funny corner case, working on keeping the number of request/responses down and limiting the amount of copying of the data that goes on in your client code will have a significantly larger impact than worrying about sending 10M bytes over the wire.
I have a 215MB CSV file which I have parsed and stored in Core Data, wrapped in my own custom objects. The problem is that my Core Data SQLite file is around 260MB. The CSV file contains about 4.5 million lines of data on my city's transit system (bus stops, times, routes, etc.).
I have tried modifying attributes so that arrays of strings representing stop times are stored as NSData instead, but for some reason the file size still remains at around 260MB.
I can't ship an app this size. I doubt anyone would want to download a 260MB app even if it means they have the whole city's transit schedule on it.
Are there any ways to compress or minimize the storage space used (even if it means not using core data, I am willing to hear suggestions)?
EDIT: I just want to provide an update right now because I have been staring at the file size in disbelief. With some clever manipulation involving strings, indexing and database normalization in general, I have managed to reduce the size down to 6.5MB or 2.6MB when compressed. About 105,000 objects stored in Core Data containing the full details of the city's transit system. I'm almost in tears right now D':
Unless your original CSV is encoded in a really foolish manner, it seems unlikely that the size is going to get below 100MB, no matter how much you compress it. That's still really large for an app. The solution is to move your data to a web service. You may want to download and cache significant parts, but if you're talking about millions of records, then fetching from a server seems best. Besides, I have to believe that the transit system changes from time to time, and it would be frustrating to have to upgrade a many-tens-of-MB app every time there is a single stop adjustment.
Having said that, there are actually some things you may consider:
Move booleans into bit fields. You can put 64 booleans into an NSUInteger. (And don't use a full 64-bit integer if you just need 8 bits. Store the smallest thing you can.)
Compress how you store times. There are only 1440 minutes in a day. You can store that in 2 bytes. Transit times are generally not to the second; they don't need a CGFloat.
Days of the week and dates can similarly be compressed.
Obviously you should normalize any strings. Look at the CSV for duplicated string values on many lines.
I generally would recommend raw sqlite rather than core data for this kind of problem. Core Data is more about object persistence than raw data storage. The fact that you're seeing a 20% bloat over CSV (which is not itself highly efficient) is not a good direction for this problem.
If you want to get even tighter, and don't need very good searching capabilities, you can create packed data blobs. I used to do this on phone switches where memory was extremely tight. You create a bit field struct and allocate 5 bits for one variable, and 7 bits for another, etc. With that, and some time shuffling things so they line up correctly on word boundaries, you can get pretty tight.
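As an illustration of that kind of bit-field packing (a sketch in Java rather than a C struct, with field widths chosen to fit the transit-schedule discussion above):

public class PackedStopTime {
    // Layout (low bits first): 11 bits minute-of-day (0-1439), 7 bits day-of-week mask,
    // 1 bit "express" flag. 19 bits total, so it fits comfortably in a 32-bit int.
    static int pack(int minuteOfDay, int dayMask, boolean express) {
        return (minuteOfDay & 0x7FF)
             | ((dayMask & 0x7F) << 11)
             | ((express ? 1 : 0) << 18);
    }

    static int minuteOfDay(int packed) { return packed & 0x7FF; }
    static int dayMask(int packed)     { return (packed >> 11) & 0x7F; }
    static boolean express(int packed) { return ((packed >> 18) & 1) == 1; }

    public static void main(String[] args) {
        int packed = pack(8 * 60 + 15, 0b0011111, false); // 08:15, Mon-Fri, not express
        System.out.printf("minute=%d days=%s express=%b%n",
                minuteOfDay(packed), Integer.toBinaryString(dayMask(packed)), express(packed));
    }
}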
Since you care most about your initial download size, and may be willing to expand your data later for faster access, you can consider very domain-specific compression. For example, in the above discussion, I mentioned how to get down to 2 bytes for a time. You could probably get down to 1 byte in many cases by storing times as delta minutes since the last time (since most of your times are going to be always increasing by fairly small steps if they're bus and train schedules). Abandoning the database, you could create a very tightly encoded data file that you could extract into a database on first launch.
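A sketch of that delta-minutes idea (assuming the times along one trip are sorted and every delta fits in one byte; a real encoder would need an escape value for larger gaps):

public class DeltaTimes {
    // Encode sorted minute-of-day values as: first value (2 bytes) + one delta byte per later stop.
    static byte[] encode(int[] minutes) {
        byte[] out = new byte[2 + (minutes.length - 1)];
        out[0] = (byte) (minutes[0] >> 8);
        out[1] = (byte) minutes[0];
        for (int i = 1; i < minutes.length; i++) {
            int delta = minutes[i] - minutes[i - 1];   // assumed to be 0..255 here
            out[1 + i] = (byte) delta;
        }
        return out;
    }

    static int[] decode(byte[] data, int count) {
        int[] minutes = new int[count];
        minutes[0] = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        for (int i = 1; i < count; i++) {
            minutes[i] = minutes[i - 1] + (data[1 + i] & 0xFF);
        }
        return minutes;
    }

    public static void main(String[] args) {
        int[] times = {6 * 60, 6 * 60 + 7, 6 * 60 + 15, 6 * 60 + 26};  // 06:00, 06:07, 06:15, 06:26
        byte[] packed = encode(times);
        System.out.println(packed.length + " bytes instead of " + times.length * 4);
        System.out.println(java.util.Arrays.toString(decode(packed, times.length)));
    }
}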
You also can use domain-specific knowledge to encode your strings into smaller tokens. If I were encoding the NY subway system, I would notice that some strings show up a lot, like "Avenue", "Road", "Street", "East", etc. I'd probably encode those as unprintable ASCII like ^A, ^R, ^S, ^E, etc. I'd probably encode "138 Street" as two bytes (0x8A13). This of course is based on my knowledge that è (0x8a) never shows up in the NY subway stops. It's not a general solution (in Paris it might be a problem), but it can be used to highly compress data that you have special knowledge of. In a city like Washington DC, I believe their highest numbered street is 38th St, and then there's a 4-value direction. So you can encode that in two bytes, first a "numbered street" token, and then a bit field with 2 bits for the quadrant and 6 bits for the street number. This kind of thinking can potentially significantly shrink your data size.
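And a toy version of that token substitution (the token bytes and the word list are arbitrary; a real encoder would be driven by a frequency scan of the actual stop names):

import java.util.LinkedHashMap;
import java.util.Map;

public class StreetTokens {
    // Hypothetical dictionary of frequent words -> single-character tokens.
    static final Map<String, Character> TOKENS = new LinkedHashMap<>();
    static {
        TOKENS.put("Street", '\u0001');
        TOKENS.put("Avenue", '\u0002');
        TOKENS.put("Road",   '\u0003');
        TOKENS.put("East",   '\u0004');
    }

    static String encode(String name) {
        for (Map.Entry<String, Character> e : TOKENS.entrySet()) {
            name = name.replace(e.getKey(), e.getValue().toString());
        }
        return name;
    }

    public static void main(String[] args) {
        String stop = "East 138 Street";
        System.out.println(stop.length() + " -> " + encode(stop).length() + " chars");
    }
}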
You might be able to perform some database normalization.
Look for anything that might be redundant, or the same values being stored in multiple rows. You will probably need to restructure your database so these duplicate values (if any) are stored in separate tables and then referenced from their original rows by means of IDs.
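For example, a hypothetical sketch of that kind of normalized schema (shown here as plain SQL issued over JDBC with the xerial sqlite-jdbc driver on the classpath; the table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class NormalizeExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:transit.db");
             Statement st = conn.createStatement()) {
            // Store each distinct route/stop name exactly once...
            st.execute("CREATE TABLE IF NOT EXISTS route (id INTEGER PRIMARY KEY, name TEXT UNIQUE)");
            st.execute("CREATE TABLE IF NOT EXISTS stop  (id INTEGER PRIMARY KEY, name TEXT UNIQUE)");
            // ...and reference it by id from the big table instead of repeating the
            // strings on every one of the millions of rows.
            st.execute("CREATE TABLE IF NOT EXISTS stop_time (" +
                       " route_id INTEGER REFERENCES route(id)," +
                       " stop_id  INTEGER REFERENCES stop(id)," +
                       " minute_of_day INTEGER)");
        }
    }
}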
How big is the sqlite file compressed? If it's satisfactorily small, the simplest thing would be to ship it compressed, then uncompress it to NSCachesDirectory.
We're still evaluating Cassandra for our data store. As a very simple test, I inserted a value for 4 columns into the Keyspace1/Standard1 column family on my local machine amounting to about 100 bytes of data. Then I read it back as fast as I could by row key. I can read it back at 160,000/second. Great.
Then I put in a million similar records, all with keys of the form X.Y where X is in (1..10) and Y is in (1..100,000), and I queried for a random record. Performance fell to 26,000 queries per second. This is still well above the number of queries we need to support (about 1,500/sec).
Finally I put ten million records in from 1.1 up through 10.1000000 and randomly queried for one of the 10 million records. Performance is abysmal at 60 queries per second and my disk is thrashing around like crazy.
I also verified that if I ask for a subset of the data, say the 1,000 records between 3,000,000 and 3,001,000, it returns slowly at first and then as they cache, it speeds right up to 20,000 queries per second and my disk stops going crazy.
I've read all over that people are storing billions of records in Cassandra and fetching them at 5-6k per second, but I can't get anywhere near that with only 10mil records. Any idea what I'm doing wrong? Is there some setting I need to change from the defaults? I'm on an overclocked Core i7 box with 6gigs of ram so I don't think it's the machine.
Here's my code to fetch records, which I run in 8 threads, each asking for one value from one column via row key:
// utf8Encoding is a System.Text.UTF8Encoding instance; sRand is a shared System.Random
ColumnPath cp = new ColumnPath();
cp.Column_family = "Standard1";
cp.Column = utf8Encoding.GetBytes("site");
// Pick a random key of the form X.Y with X in 1..10 and Y in 1..1,000,000
string key = (1 + sRand.Next(10)) + "." + (1 + sRand.Next(1000000));
ColumnOrSuperColumn logline = client.get("Keyspace1", key, cp, ConsistencyLevel.ONE);
Thanks for any insights
Purely random reads are about worst-case behavior for the caching that your OS (and Cassandra, if you set up a key or row cache) tries to do.
If you look at contrib/py_stress in the Cassandra source distribution, it has a configurable stdev to perform random reads with some keys hotter than others. This will be more representative of most real-world workloads.
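For example, a rough sketch of that kind of skewed key chooser (not py_stress itself, just the same idea in Java; the stdev fraction is arbitrary):

import java.util.Random;

public class HotKeyChooser {
    private static final Random RAND = new Random();

    // Returns a key index in [0, keyCount), centred on the middle of the key space,
    // so keys near the centre are requested far more often than keys at the edges.
    static long nextKey(long keyCount, double stdevFraction) {
        double mean = keyCount / 2.0;
        double stdev = keyCount * stdevFraction;          // e.g. 0.1 => 10% of the key space
        long k = Math.round(mean + RAND.nextGaussian() * stdev);
        return Math.max(0, Math.min(keyCount - 1, k));    // clamp to the valid range
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println("key-" + nextKey(10_000_000L, 0.1));
        }
    }
}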
Add more Cassandra nodes and give them lots of memory (-Xms / -Xmx). The more Cassandra instances you have, the more the data will be partitioned across the nodes, and the more likely it is to be in memory or at least more easily accessed from disk. You'll be very limited trying to scale on a single workstation-class CPU. Also, check the default -Xms/-Xmx setting; I think the default is 1GB.
It looks like you haven't got enough RAM to store all the records in memory.
If you swap to disk then you are in trouble, and performance is expected to drop significantly, especially if you are random reading.
You could also try benchmarking some other popular alternatives, like Redis or VoltDB.
VoltDB can certainly handle this level of read performance as well as writes and operates using a cluster of servers. As an in-memory solution you need to build a large enough cluster to hold all of your data in RAM.
I use memcached to store the integer result of a complex calculation. I've got hundreds of integer objects that I could cache! Should I cache them under a single key in a more complex object or should I use hundreds of different keys for the objects? (the objects I'm caching do not need to be invalidated more than once a day)
I would say lots of little keys. This way you can get the exact result you want in 1 call with minimal serialization effort.
If you store it in another object (an array for example) you will have to fetch the array from cache and then fetch the item you actually want again from that array, plus you have the overhead of serializing/deserializing the whole complex object again. Depending on your language of choice this might mean manually writing a serialization/deserialization function from scratch.
I wrote a somewhat large analysis at http://dammit.lt/2008/12/25/memcached-for-small-objects/ that outlines how to optimize memcached for small-object storage; it may shed quite a bit of light on the issue.
It depends on your application. While memcached is very fast, it does require some request transmission and memory lookup time per request. Those numbers increase depending on whether or not the server is on the local machine (localhost), on the local network, or across a wide area. The size of your cache generally doesn't affect the lookup speed.
So, if your application is using MANY objects per processing unit (per request, method, or what-have-you), then it's generally better to define your cache in a way which lowers total number of hits to the cache while at the same time trying not to duplicate cache data. Like everything else, it's a balance.
i.e. If you have a web request which pulls a list of blog posts, it would be more beneficial to cache the entire object list as one memcached key, rather than (and this is a somewhat bad example, obviously) caching an array of cache keys for that list, which relate to individually memcached objects.
The less processing you have to do of the cached values, the better. So why not just dump them into the cache individually?
I would say you should store the values individually and use some kind of helper class that retrieves them with a multiget and assembles the complex data object for you.
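A rough sketch of such a helper using the spymemcached Java client (the key prefix and class name are made up; getBulk is the client's multiget):

import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import net.spy.memcached.MemcachedClient;

public class ResultCache {
    private final MemcachedClient client;

    ResultCache(MemcachedClient client) {
        this.client = client;
    }

    // Fetch many individually cached integers in one network round trip.
    Map<String, Object> fetch(List<Integer> ids) {
        List<String> keys = new ArrayList<>();
        for (int id : ids) {
            keys.add("calc:" + id);          // one small key per cached calculation result
        }
        return client.getBulk(keys);         // multiget: single call, many values
    }

    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        System.out.println(new ResultCache(client).fetch(List.of(1, 2, 3)));
        client.shutdown();
    }
}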
It depends on what those numbers are. If you could, for example, group them into ranges, then you could optimize the storage. If you could hash them into a map or hashtable and store that map serialized in memcached, that would be good too.
In any case, you can store many little keys; just make sure you configure the slabs to have small chunk sizes so you don't waste memory.