Delete rows in Memcached based on date - perl

Hello, I am implementing a simple garbage collector over memcached in Perl.
I want to delete all rows (each key's value is a serialized (payload, date) pair) dated before or after a given date.
What is the most efficient implementation? Should I fetch all the data and then check the date in a for loop? The data could get very big, so I think that would be slow and inefficient.
Any other ideas or opinions?
Thanks, Cospel

You cannot iterate over memcached keys efficiently (I mean there is no "good" way to do that). The best solution is to set a proper expiry on each entry, so entries are expired/deleted automatically. It is also good to delete a key as soon as it is no longer needed.
Internally memcached uses LRU, so when no memory is available, the least recently used items are discarded. These can include entries with a long TTL (expire time), so that is probably a parameter to tune for your needs.
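The two behaviours described above (expiry chosen at write time, plus LRU eviction when the store is full) can be illustrated with a toy in-memory stand-in. This is only a sketch of the semantics, not memcached itself: the TinyCache class, its max_items limit and the injectable clock are assumptions made for illustration, and the example is in Python rather than Perl.

```python
import time
from collections import OrderedDict


class TinyCache:
    """Toy in-memory stand-in for memcached: per-key expiry plus LRU eviction."""

    def __init__(self, max_items, clock=time.time):
        self.max_items = max_items
        self.clock = clock
        self.items = OrderedDict()  # key -> (value, expires_at), oldest use first

    def set(self, key, value, ttl):
        # The expiry is decided at write time, so no later sweep is needed.
        self.items[key] = (value, self.clock() + ttl)
        self.items.move_to_end(key)
        while len(self.items) > self.max_items:
            self.items.popitem(last=False)  # evict the least recently used key

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.items[key]  # lazily drop the expired entry
            return None
        self.items.move_to_end(key)  # mark as recently used
        return value

    def delete(self, key):
        # Explicit removal for keys that are no longer needed.
        self.items.pop(key, None)
```

With a real client the same idea comes down to passing an expiry with every set, so the date-based sweep from the question never has to run.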

Related

When's the time to create dedicated collections in MongoDB to avoid difficult queries?

I am asking a question that I assume does not have a simple black-and-white answer, but the principle I'm asking about is clear.
Sample situation:
Let's say I have a collection of 1 million books, and I consistently want to pull the top 100 rated.
Let's assume that I need to perform an aggregate function every time I perform this query, which makes it a little expensive.
Is it reasonable that, instead of running the query for every request (100-1000 a second), I create a dedicated collection that stores only the top 100 books and gets updated every minute or so? Instead of running a difficult query 100 times every second, I would run it only once a minute, and otherwise pull from a small collection that holds just the 100 books and requires no query (just get everything).
That is the principle I am questioning.
Should I create a dedicated collection for EVERY query that is often used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
Are there any guidelines for best practice in those types of situations?
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one does have to differentiate between "real-time data" vis-a-vis data which does not require immediate and real-time presenting of information. The rules for "real-time" systems are obviously much different.
Now to your example, starting from the end: the cache of query results. The answer is not specific to MongoDB. Data architects often use Redis or memcached (or other cache systems) to hold all types of information. This, obviously, is a function of how much memory is available to your system and the DB. You do not want to cripple the DB by giving your cache too much of the available memory, and you do not want your cache to be useless by giving it too little.
In the case of the top 100 books, since it is certainly not a real-time endeavor, it makes sense to cache the query result and serve requests from that cache. You could update the cache from a cron job, or based on an update flag (which you create to tell your program that the top 100 have changed), after which the system runs the aggregation in the background.
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data that has to be searched to aggregate your response. Again, it also depends on your memory limitations and, let me add, the whole server setup in terms of speed, cores, and memory. IMHO a cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I don't think anyone can give a black-and-white answer to that question for your system. Is a complicated query just a single aggregation? Or is it an $unwind followed by a whole slew of $group and other stages? This is really up to the dataset and how much information must actually be read, sifted, and manipulated. It will affect your IO and, yes, again, the memory.
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See the answers above; this is directly connected to your other questions.
Finally:
Are there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the IO, study actual reads and writes on the collections.
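As a rough illustration of "time the procedures in your code", here is a minimal Python sketch. The timed() helper and the top_rated() function are hypothetical stand-ins (top_rated plays the role of the expensive aggregation); against a real database you would wrap the actual query call instead.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results.setdefault(label, []).append(time.perf_counter() - start)


def top_rated(books, n):
    # Hypothetical stand-in for the expensive aggregation.
    return sorted(books, key=lambda b: b["rating"], reverse=True)[:n]


timings = {}
books = [{"title": f"book-{i}", "rating": (i * 37) % 101} for i in range(10_000)]
for _ in range(5):
    with timed("top_rated", timings):
        top = top_rated(books, 100)

average = sum(timings["top_rated"]) / len(timings["top_rated"])
print(f"top_rated: {average * 1000:.2f} ms on average over 5 runs")
```

Collecting several samples per label makes it easier to compare a direct query against a cached read before deciding whether a dedicated collection or cache is worth it.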
Hope this helps.
Use a cache to store objects. For example, in Redis use Redis Lists.
Redis Lists are simply lists of strings, sorted by insertion order.
Then set the expiry to either a timeout or a specific time.
Now whenever you have a miss in Redis, run the query in MongoDB and repopulate your cache. Since the cache resides in memory, your fetches will be extremely fast compared to dedicated collections in MongoDB.
In addition, you don't need a dedicated machine; just deploy it on your application machine.
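The miss-then-repopulate flow described above is often called cache-aside. A minimal in-process sketch in Python follows; CachedQuery, its loader callable and the injectable clock are assumptions for illustration, standing in for Redis and the MongoDB query respectively.

```python
import time


class CachedQuery:
    """Cache-aside wrapper: serve a stored result until it expires, then reload."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader          # hypothetical callable that runs the real query
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.value = None
        self.expires_at = None        # None means nothing is cached yet

    def get(self):
        now = self.clock()
        if self.expires_at is None or now >= self.expires_at:
            # Cache miss: run the expensive query and repopulate the cache.
            self.value = self.loader()
            self.expires_at = now + self.ttl_seconds
        return self.value

    def invalidate(self):
        # Call this from the code path that changes the underlying data.
        self.expires_at = None
```

With ttl_seconds=60 the expensive query runs at most about once a minute no matter how many requests arrive, and invalidate() covers the case where the data changes sooner.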

How to handle large mongodb collection

We have a collection that is potentially going to be very large. This collection is used to store bill-related data, so it is often used for reporting/analytics purposes.
Please let me know the best approach to handle this large collection.
1) Can I split and archive the old data (say, a 12-month period)? The old data is still required for analytic reports: I want to query it to show the sales comparison for the past 2 years.
2) Can I have a new collection with the old data (12 months)? Then for every 12 months I have to create a new collection. For report generation I have to access all these documents to query. Will this cause performance problems?
3) Can I go for sharding?
There are many variables to account for, the clearest being what hardware you use, how the data is structured, and how it is queried. A distributed network ought to be able to chew through your data faster than a single machine, but before diving into that solution I recommend generating an absurd amount of mock data comparable to what you are expecting, and then testing various approaches. Seriously. Create a bunch of data, and try to break things. It's fun! Soon enough you'll know more about what your problem requires than any website could tell you.
As for direct responses:
1) Perhaps, before archiving the data, appropriate stats summaries can be generated (or updated). Those summaries/simplifications can be used for sales comparisons without reloading all of the archived data they represent.
2) This strikes me as sensible. By splitting up the sales data, you have more control over how much data needs to be accessed. After all, a user won't always wish to see 3 years of data; they may only wish to see last week's.
3) Move to sharding when you actually need it. As is stated on the MongoDB site:
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
You'll know it's time when your memory-map approaches the server's RAM limit. MongoDB supports reading and writing to databases too large to keep in memory, but I'm sure you already know that is SLOW.
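The first suggestion above (generate summaries before archiving) can be sketched in a few lines of Python. The bill fields (date, amount_cents) and both function names are hypothetical; the real schema is not given in the question.

```python
from collections import defaultdict
from datetime import date


def summarize_by_month(bills):
    """Roll bills up into one summary row per (year, month)."""
    totals = defaultdict(lambda: {"count": 0, "amount_cents": 0})
    for bill in bills:
        key = (bill["date"].year, bill["date"].month)
        totals[key]["count"] += 1
        totals[key]["amount_cents"] += bill["amount_cents"]
    return dict(totals)


def split_for_archive(bills, cutoff):
    """Separate bills older than `cutoff` from the ones that stay live."""
    old = [b for b in bills if b["date"] < cutoff]
    live = [b for b in bills if b["date"] >= cutoff]
    return old, live


bills = [
    {"date": date(2012, 1, 5), "amount_cents": 1000},
    {"date": date(2012, 1, 20), "amount_cents": 2500},
    {"date": date(2013, 1, 7), "amount_cents": 4000},
]
old, live = split_for_archive(bills, cutoff=date(2013, 1, 1))
summaries = summarize_by_month(old)  # keep these; archive `old` itself
```

A year-over-year sales comparison can then read the small summaries instead of the archived documents they were built from.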

Redis GET vs. SQL SELECT

I am pretty new to NoSQL, but I always liked the idea of it. I took a look at Redis and have a few questions about the best ways of storing and retrieving multiple hashes.
Assuming the following scenario:
Store a list of objects (redis 'Hashes') and select them by their timestamp.
To achieve this in SQL, it would require one table and two simple queries (INSERT & SELECT).
Trying to do this in Redis, I ended up creating the following structure:
Key object:$id (hash) containing the object
Key index:timestamp:$id (sorted set)
score equals timestamp and value includes id
While I can live with the additional maintenance work of two keys instead of one table (SQL), I am curious about the process of selecting multiple objects:
ZRANGEBYSCORE index:timestamp:$id timestampStart timestampEnd
This returns an array of all IDs which got created between timestampStart and timestampEnd. To get the object itself I am requesting every single one by:
GET object:$id
Is this the right way of doing it?
In comparison with an SQL database: is it still appreciably faster, or might it even become slower because of the high number of GETs?
A ZRANGEBYSCORE costs O(log(N) + M) where N = |items in your set| and M = |items you're selecting|. So, doing the ZRANGEBYSCORE and then M GET operations is just O(log(N)+M+M) = O(log(N)+M) and would at most be twice as slow. The network back-and-forth could have been a major slowdown, but since each of your GETs is an independent operation, you can just pipeline them. You can also put the whole thing in a Lua script and have just one round trip, which would be optimal. I'd say with 99% certainty this would be faster than doing the same thing in SQL.
Also, if this is a very frequent operation for you, you can get even more speed up by just storing the entire object in your sorted set instead of just the id. You'd have key = object encoded as json, score = timestamp. This would save you O(M) on your operation in terms of not needing to do any GETs.
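To make the two layouts concrete, here is a small in-memory Python sketch, not Redis itself. TimestampIndex is a hypothetical stand-in for a sorted set: the range query uses two binary searches plus a slice, which mirrors the O(log(N) + M) cost quoted above (inserts into a Python list are O(N), unlike Redis).

```python
import bisect
import json


class TimestampIndex:
    """Parallel sorted lists queried the way ZRANGEBYSCORE queries a sorted set."""

    def __init__(self):
        self.scores = []   # timestamps, kept sorted
        self.members = []  # member stored at the same position as its score

    def add(self, timestamp, member):
        # List insertion is O(N) here; only the query cost mirrors Redis.
        pos = bisect.bisect_right(self.scores, timestamp)
        self.scores.insert(pos, timestamp)
        self.members.insert(pos, member)

    def range_by_score(self, start, end):
        # Two binary searches find the bounds in O(log N); the slice costs O(M).
        lo = bisect.bisect_left(self.scores, start)
        hi = bisect.bisect_right(self.scores, end)
        return self.members[lo:hi]


index = TimestampIndex()
objects = {}  # plays the role of the object:$id keys
for obj_id, ts in [("a", 100), ("b", 200), ("c", 300)]:
    objects[obj_id] = {"id": obj_id, "created": ts}
    index.add(ts, obj_id)

# Variant 1: range query for ids, then one lookup per id (M extra operations).
ids = index.range_by_score(150, 300)
found = [objects[i] for i in ids]

# Variant 2: store the JSON-encoded object as the member, so no lookups follow.
inline = TimestampIndex()
for obj in objects.values():
    inline.add(obj["created"], json.dumps(obj, sort_keys=True))
found_inline = [json.loads(m) for m in inline.range_by_score(150, 300)]
```

Both variants return the same objects; the second trades the M follow-up lookups for larger members in the index and for having to rewrite the member whenever the object changes.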
Whether or not this is a good way of doing things really depends on your use case. How much speed do you really need, and how important are other features of a traditional database to you? Remember, Redis is much more a set of data structures accessible by clients than a traditional database, and it must store everything in RAM. To know whether it's the right thing for you, we'd need more information.

How should I implement "get objects changed since" pattern with MongoDB?

I have a collection of objects, let's say they are "posts," and those objects can be modified. I'd like to display a list on the client side that updates dynamically. So on the client side, if doing this via polling, the client would invoke an API like:
getPostsChangedSince(serial)
where serial could be a monotonically increasing number, probably a timestamp. The client gets back a list of posts that have changed since that time, stores a new latest-serial, and next time the client polls it requests changes since that latest serial.
I think the basic idea is the same in this question (which is about ASP.NET): How to implement "get latests changed items" with ADO.NET Data Services?
I'm trying to find the best way to implement this in MongoDB.
I like the idea of using the time for the serial, since it automatically works at least mostly correctly even if there are multiple app servers. The serial would be stored in each post object, and updated whenever the object is modified.
The timestamp-based serial could be implemented as:
a Date (I think this is stored as a 64-bit milliseconds since epoch?)
a Timestamp http://www.mongodb.org/display/DOCS/Timestamp+Data+Type
something "by hand" e.g. store milliseconds as a number
Some nice features to have in a solution would include:
ensure that creating then immediately updating an object within the OS timer resolution will still increment the serial despite it being the same time
even better would be a guaranteed monotonic increase globally for all objects, not just a guarantee that changing a given object will bump the serial on that object (absent this, getPostsChangedSince() calls probably need a fuzz backward in time to avoid missing changes, at the price of getting some changes twice)
mongodb-side timestamps might be nice because getting the time in the app creates a gap between when you get the time, and when the new object is saved and available in queries
update using findAndModify() with a query including the old serial, so "conflicts" (two changes at once) will throw an error allowing the app to retry
I realize some of the corner cases here are a little bit "academic" and can likely be fudged around in real life.
My approach so far is:
use the Date type for the serial
when modifying an object, get the current time, and if it matches the object's old serial, add 1 millisecond (yes this breaks if you make two modifications quickly without re-fetching from mongodb, but that seems OK)
use findAndModify(), but based on https://jira.mongodb.org/browse/JAVA-276 there may not be a way to detect if it ends up not finding anything to modify (i.e. second change is ignored, in case of conflict)
Questions:
I feel like I should use Timestamp instead; true? Any downsides?
if you had a mongo cluster, might time in milliseconds be more unique and correct than Timestamp's time in seconds plus a number, while with one mongod Timestamp is more unique?
is there a way to detect whether findAndModify() updated anything?
any general advice / experiences with this problem? how would you do it?
Have you considered "externalizing" the serial number generator? Time with MongoDB precision is good, but can become difficult to synchronize when involving multiple machines. One choice is that you can use memcached or something similar which is memory based, extremely fast and can be serialized (memcached has a CAS operation).
So what you would do is store a "seed" in memcached under a key, say counter.
Every time an app needs to do an insert, it gets the next number from memcached and increments the counter.
On second thought, you can even do away with memcached and just use a single-row (sorry, single-document) collection that just holds the counter. You can get the counter and increment it, which will be an extremely fast operation, mimicking memcached.
And then, naturally, you can index the data appropriately. However, I wonder whether this would make the index very imbalanced (lopsided to the right). Depending on the situation, it might be worthwhile to explore using a capped collection: when you insert data into your main collection, also insert it into the capped collection and read data from that collection.
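A minimal in-memory Python sketch of the counter idea follows. SerialCounter and PostStore are hypothetical stand-ins for the counter document and the posts collection; against MongoDB the read-and-increment step would likely be a single findAndModify() with an $inc update, which is assumed here rather than shown.

```python
import threading


class SerialCounter:
    """In-memory stand-in for the single counter document."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next(self):
        # Read and increment as one atomic step so no two callers share a serial.
        with self._lock:
            self._value += 1
            return self._value


class PostStore:
    """Stand-in for the posts collection; every save stamps a fresh serial."""

    def __init__(self):
        self.counter = SerialCounter()
        self.posts = {}  # post id -> post dict including its latest serial

    def save(self, post_id, body):
        serial = self.counter.next()
        self.posts[post_id] = {"id": post_id, "body": body, "serial": serial}
        return serial

    def changed_since(self, serial):
        # Everything stamped after the client's last seen serial, oldest first.
        changed = [p for p in self.posts.values() if p["serial"] > serial]
        return sorted(changed, key=lambda p: p["serial"])
```

A polling client keeps the highest serial it has seen and passes it to changed_since() on the next call. Because the serial is an integer rather than a clock reading, two quick edits to the same post still get distinct values.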
You could continue to use your regular collection, as you do now, and after each update additionally insert the ID of the post into a special TTL collection. See http://docs.mongodb.org/manual/tutorial/expire-data/ for more info on using such a collection. Mongo will take care of all timing issues, you don't need to worry about serial numbers, and you can very quickly access time based lists of objects by their IDs.
Caveat:
use the blocking form of findAndModify, to ensure the changes have really been processed:
Blocking/Safe Writes
Unless you specify the "new" parameter as true the write operation will not block, and will not return an error (if there is one). If you do want the "new" document returned then the operation will wait until the write is done to return the new document, or an error.
For a "safe" (blocking) write operation you must call getLastError (if not using "new").

Memcached best practices - small objects and lots of keys or big objects and few keys?

I use memcached to store the integer result of a complex calculation. I've got hundreds of integer objects that I could cache! Should I cache them under a single key in a more complex object or should I use hundreds of different keys for the objects? (the objects I'm caching do not need to be invalidated more than once a day)
I would say lots of little keys. This way you can get the exact result you want in 1 call with minimal serialization effort.
If you store it in another object (an array for example) you will have to fetch the array from cache and then fetch the item you actually want again from that array, plus you have the overhead of serializing/deserializing the whole complex object again. Depending on your language of choice this might mean manually writing a serialization/deserialization function from scratch.
I wrote a somewhat large analysis at http://dammit.lt/2008/12/25/memcached-for-small-objects/ - it outlines how to optimize memcached for small object storage and may shed some light on the issue.
It depends on your application. While memcached is very fast, it does require some request transmission and memory lookup time per request. Those numbers increase depending on whether or not the server is on the local machine (localhost), on the local network, or across a wide area. The size of your cache generally doesn't affect the lookup speed.
So, if your application is using MANY objects per processing unit (per request, method, or what-have-you), then it's generally better to define your cache in a way which lowers total number of hits to the cache while at the same time trying not to duplicate cache data. Like everything else, it's a balance.
For example, if you have a web request that pulls a list of blog posts, it would be more beneficial to cache the entire object list under one memcached key, rather than (and this is a somewhat bad example, obviously) caching an array of cache keys for that list, each of which refers to an individually cached object.
The less processing you have to do of the cached values, the better. So why not just dump them into the cache individually?
I would say you should store values individually and use some kind of helper class to retrieve values with multiget and generate a complex data object for you.
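A helper along those lines might look like the following Python sketch. FakeCache is a dict-backed stand-in with a hypothetical get_multi() method (real client libraries name their multiget call differently), and ResultLoader, the "calc:" prefix and slow_calculation are all assumptions for illustration.

```python
class FakeCache:
    """Dict-backed stand-in for a memcached client (hypothetical interface)."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def set(self, key, value):
        self.data[key] = value

    def get_multi(self, keys):
        # One round trip returns every key that is present; misses are omitted.
        self.round_trips += 1
        return {k: self.data[k] for k in keys if k in self.data}


class ResultLoader:
    """Helper that reads many small keys at once and fills in the misses."""

    def __init__(self, cache, compute, prefix="calc:"):
        self.cache = cache
        self.compute = compute  # hypothetical function for the complex calculation
        self.prefix = prefix

    def load(self, names):
        key_to_name = {self.prefix + name: name for name in names}
        hits = self.cache.get_multi(list(key_to_name))
        results = {key_to_name[key]: value for key, value in hits.items()}
        for name in names:
            if name not in results:
                # Miss: compute once, then store under its own small key.
                results[name] = self.compute(name)
                self.cache.set(self.prefix + name, results[name])
        return results


cache = FakeCache()
computed = []


def slow_calculation(name):
    computed.append(name)
    return len(name) * 10


loader = ResultLoader(cache, slow_calculation)
first = loader.load(["alpha", "beta"])
second = loader.load(["alpha", "beta", "gamma"])
```

Each load() costs one multiget round trip regardless of how many keys it asks for, and only the missing values are recomputed, so the small-keys layout does not have to mean hundreds of separate requests.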
It depends on what those numbers are. If you could, for example, group them in ranges, then you could optimize the storage. If you could hash them into a map or hashtable, storing that map serialized in memcached would be good too.
Anyway, you can save many little keys; just make sure you configure the slabs to have small chunks so you do not waste memory.