using postgres server-side cursor for caching - postgresql

To speed page generation for pages based on large postgres collections, we cache query results in memcache. However, for immutable collections that are very large, or that are rarely accessed, I'm wondering if saving server side cursors in postgres would be a viable alternate caching strategy.
The idea is that after having served a page in the middle of a collection "next" and "prev" links are much more likely to be used than a random query somewhere else in the collection. Could I have a cursor "WITH HOLD" in the neighborhood to avoid the (seemingly unavoidable) large startup costs of the query?
I wonder about resource consumption on the server. If the collection is immutable, saving the cursor shouldn't need very many resources, but I wonder how optimized postgres is in this respect. Any links to further documentation would be appreciated.

You're going to run into a lot of issues.
You 'd have to ensure the same user gets the same sql connection
You have to create a cleanup strategy
The cursors will be holding up vacuum operations.
You have to convince the connection pool to not clear the cursors
Probably other issues I have not mentioned.
In short: don't do it. How about precalculating the next/previous page in background, and storing it in memcached?

A good answer to this has previously been made Best way to fetch the continuous list with PostgreSQL in web
The questions are similar, essentially you store a list of PKs on the server with a pagination-token and an expiration.

Related

When's the time to create dedicated collections in MongoDB to avoid difficult queries?

I am asking a question that I assume does not have a simple black and white question but the principal of which I'm asking is clear.
Sample situation:
Lets say I have a collection of 1 million books, and I consistently want to always pull the top 100 rated.
Let's assume that I need to perform an aggregate function every time I perform this query which makes it a little expensive.
It is reasonable, that instead of running the query for every request (100-1000 a second), I would create a dedicated collection that only stores the top 100 books that gets updated every minute or so, thus instead of running a difficult query a 100 times every second, I only run it once a minute, and instead pull from a small collection of books that only holds the 100 books and that requires no query (just get everything).
That is the principal I am questioning.
Should I create a dedicated collection for EVERY query that is often
used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough
to leave as is?
Is there any guidelines for best practice in those types of
situations?
Is there a point where if a query runs so often and the data doesn't
change very often that I should keep the data in the server's memory
for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one does have to differentiate between "real-time data" vis-a-vis data which does not require immediate and real-time presenting of information. The rules for "real-time" systems are obviously much different.
Now to your example starting from the end. The cache of query results. The answer is not only for MongoDB. Data architects often use Redis, or memcached (or other cache systems) to hold all types of information. This though, obviously, is a function of how much memory is available to your system and the DB. You do not want to cripple the DB by giving your cache too much of available memory, and you do not want your cache to be useless by giving it too little.
In the book case, of 100 top ones, since it is certainly not a real time endeavor, it would make sense to cache the query and feed that cache out to requests. You could update the cache based upon a cron job or based upon an update flag (which you create to inform your program that the 100 have been updated) and then the system will run an $aggregate in the background.
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data which has to be searched to $aggregate your response. And again, it also depends upon your memory limitations and btw let me add the whole server setup in terms of speed, cores and memory. MHO - cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I dont think anyone can really black and white answer to that question for your system. Is a complicated query just an $aggregate? Or is it $unwind and then a whole slew of $group etc. options following? this is really up to the dataset and how much information must actually be read and sifted and manipulated. It will effect your IO and, yes, again, the memory.
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See answers above this is directly connected to your other questions.
Finally:
Is there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the IO, study actual reads and writes on the collections.
Hope this helps.
Use a cache to store objects. For example in Redis use Redis Lists
Redis Lists are simply lists of strings, sorted by insertion order
Then set expiry to either a timeout or a specific time
Now whenever you have a miss in Redis, run the query in MongoDB and re-populate your cache. Also since cache resids in memory therefore your fetches will be extremely fast as compared to dedicated collections in MongoDB.
In addition to that, you don't have to keep have a dedicated machine, just deploy it within your application machine.

Using memcache infront of a mongodb server

I am trying to understand how mongo's internal cache works and if it does eliminate using memcache. Our database size is around 200G and index fits in the memory but after the index not much free memory left on the server.
One of my colleague says mongo's internal cache will be as fast as memcache so no need to introduce another level of complexity by using memcache.
The scenario in my head is when we read the data from db, it's saved in memcache and next time it's directly read from the cache instead of going back to db server. If the data is changed and needs to be saved/updated, it's done on both memcache server and database server.
I have been reading about this but couldn't convince myself yet. So I'd really appreciate if someone could shed some light on this.
First thing is that a cache storage is different to a database. So MongoDB and SQL are different in purpose and usage when compared to Memcache.
Memcache is really good at lowering working set sizes for queries. For example: imagine a huge aggregated query with subselects and CASE statements and what not in SQL (think of the most complex query you can), doing this query in realtime all the time could cause the computer(s) to "thrash" (not to mention the problems client side).
However as everyone knows you need only summarise this query to another collection/table for it to be instantly faster. The real speed of memcache comes from the fact that it is a in memory key value store. This is where MongoDB could fail in speed because it is not memory stored, it is memory mapped but not stored.
MongoDB does no self caching, providing the query is "hot" and in LRU (this is where your working set comes in) you shouldn't notice much of a difference in response times. A good way to ensure a query is "hot" is to run it. Some people have a script of their biggest queries that they run to warm up the cache.
As I said memcache is a cache layer this is why:
If the data is changed and needs to be saved/updated, it's done on both memcache server and database server.
Makes me die a little inside. Many do blur the line between the DB and the cache layer.

Using "Cursors" for paging in PostgreSQL [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to provide an API client with 1,000,000 database results?
Wondering of the use of Cursors is a good way to implement "paging" using PostgreSQL.
The use case is that we have upwards 100,000 rows that we'd like to make available to our API clients. We thought a good way to make this happen would be to allow the client to request the information in batches ( pages ). The client could request 100 rows at a time. We would return the 100 rows as well as a cursor, then when the client was ready, they could request the next 100 rows using the cursor that we sent to them.
However, I'm a little hazy on how cursors work and exactly how and when cursors should be used:
Do the cursors require that a database connection be left open?
Do the cursors run inside a transaction, locking resources until they are "closed"?
Are there any other "gotchas" that I'm not aware of?
Is there another, better way that this situation should be handled?
Thanks so much!
Cursors are a reasonable choice for paging in smaller intranet applications that work with large data sets, but you need to be prepared to discard them after a timeout. Users like to wander off, go to lunch, go on holiday for two weeks, etc, and leave their applications running. If it's a web-based app there's even the question of what "running" is and how to tell if the user is still around.
They are not suitable for large-scale applications with high client counts and clients that come and go near-randomly like in web-based apps or web APIs. I would not recommend using cursors in your application unless you have a fairly small client count and very high request rates ... in which case sending tiny batches of rows will be very inefficient and you should think about allowing range-requests etc instead.
Cursors have several costs. If the cursor is not WITH HOLD you must keep a transaction open. The open transaction can prevent autovacuum from doing its work properly, causing table bloat and other issues. If the cursor is declared WITH HOLD and the transaction isn't held open you have to pay the cost of materializing and storing a potentially large result set - at least, I think that's how hold cursors work. The alternative is just as bad, keeping the transaction implicitly open until the cursor is destroyed and preventing rows from being cleaned up.
Additionally, if you're using cursors you can't hand connections back to a connection pool. You'll need one connection per client. That means more backend resources are used just maintaining session state, and sets a very real upper limit on the number of clients you can handle with a cursor-based approach.
There's also the complexity and overhead of managing a stateful, cursor-based setup as compared to a stateless connection-pooling approach with limit and offset. You need to have your application expire cursors after a timeout or you face potentially unbounded resource use on the server, and you need to keep track of which connections have which cursors for which result sets for which users....
In general, despite the fact that it can be quite inefficient, LIMIT and OFFSET can be the better solution. It can often be better to search the primary key rather than using OFFSET, though.
By the way, you were looking at the documentation for cursors in PL/pgSQL. You want normal SQL-level cursors for this job.
Do the cursors require that a database connection be left open?
Yes.
Do the cursors run inside a transaction, locking resources until they
are "closed"?
Yes unless they are WITH HOLD, in which case they consume other database resources.
Are there any other "gotchas" that I'm not aware of?
Yes, as the above should explain.
For HTTP clients, don't use cursors to implement paging. For scalability, you don't want server resources tied up between requests.
Instead, use LIMIT and OFFSET on your queries; see LIMIT and OFFSET in the Pg docs.
But make sure that the indexing on your tables will support efficient queries of this form.
Design a RESTful API, so that the client can invoke the "next_url" (also passed in the response) to get the next set of rows.

SQL vs NoSQL for an inventory management system

I am developing a JAVA based web application. The primary aim is to have inventory for products being sold on multiple websites called channels. We will act as manager for all these channels.
What we need is:
Queues to manage inventory updates for each channel.
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
The solutions i am looking at are postgres(our db till now in a synchronous replication mode), NoSQL solutions like Cassandra, Redis, CouchDB and MongoDB.
My constraints are:
Inventory updates cannot be lost.
Job Queues should be executed in order and preferably never lost.
Easy/Fast development and future maintenance.
I am open to any suggestions. thanks in advance.
Queues to manage inventory updates for each channel.
This is not necessarily a database issue. You might be better off looking at a messaging system(e.g. RabbitMQ)
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
session data should probably be put in a separate database more suitable for the task(e.g. memcached, redis, etc)
There is no one-size-fits-all DB
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
My constraints are:
1. Inventory updates cannot be lost.
There are 3 ways to answer this question:
This feature must be provided by your application. The database can guarantee that a bad record is rejected and rolled back, but not guarantee that every query will get entered.
The app will have to be smart enough to recognize when an error happens and try again.
some DBs store records in memory and then flush memory to disk peridocally, this could lead to data loss in the case of a power failure. (e.g Mongo works this way by default unless you enable journaling. CouchDB always appends to the records(even a delete is a flag appended to the record so data loss is extremely difficult))
Some DBs are designed to be extremely reliable, even if an earthquake, hurricane or other natural disaster strikes, they remain durable. these include Cassandra, Hbase, Riak, Hadoop, etc
Which type of durability are your referring to?
Job Queues should be executed in order and preferably never lost.
Most noSQL solutions prefer to run in parallel. so you have two options here.
1. use a DB that locks the entire table for every query(slower)
2. build your app to be smarter or evented(client side sequential queuing)
Easy/Fast development and future maintenance.
generally, you will find that SQL is faster to develop at first, but changes can be harder to implement
noSQL may require a little more planning, but is easier to do ad hoc queries or schema changes.
The questions you probably need to ask yourself are more like:
"Will I need to have intense queries or deep analysis that a Map/Reduce is better suited to?"
"will I need to my change my schema frequently?
"is my data highly relational? in what way?"
"does the vendor behind my chosen DB have enough experience to help me when I need it?"
"will I need special feature such as GeoSpatial indexing, full text search, etc?"
"how close to realtime will I need my data? will it hurt if I don't see the latest records show up in my queries until 1sec later? what level of latency is acceptable?"
"what do I really need in terms of fail-over"
"how big is my data? will it fit in memory? will it fit on one computer? is each individual record large or small?
"how often will my data change? is this an archive?"
If you are going to have multiple customers(channels?) each with their own inventory schemas, a document based DB might have it's advantages. I remember one time I looked at an ecommerce system with inventory and it had almost 235 tables!
Then again, if you have certain relational data, a SQL solution can really have some advantages too.
I can certainly see how I could build a solution using mongo, couch, riak or orientdb with the given constraints. But as for which is the best? I would try talking directly DB vendors, and maybe watch the nosql tapes
Addressing your constraints:
Most NoSQL solutions give you a configurable tradeoff of consistency vs. performance. In MongoDB, for instance, you can decide how durable a write should be. If you want to, you can force the write to be fsync'ed on all your replica set servers. On the other extreme, you can choose to send the command and don't even wait for the server's response.
Executing job queues in order seems to be an application code issue. I'd say a timestamp in the db and an order by type of query should do for most applications. If you have multiple application servers and your queues need to be perfect, you'd have to use a truly distributed algorithm that provides ordering, but that is not a typical requirement, and it's very tricky indeed.
We've been using MongoDB for some time now, and I'm convinced this gives your app development speed a real boost. There's no big difference in maintenance, maintaining data is a pain either way. Not having a schema gives you added flexibility (lazy migrations), but it's more elaborate and requires some care.
In summary, I'd say you can do it both ways. The NoSQL is more code driven, and transactions and relational integrity are mostly managed by your code. If you're uncomfortable with that, go for a relational DB.
However, if you're data grows huge, you'll have to code some of this logic manually because you probably wouldn't want to do real-time joins on a 10B row database. Still, you can implement that with SQL as well.
A good way to find the boundary for different databases is to consider what you can cache. Data that can be cached and reconstructed at any time are a great way to start introducing a new layer, because there's no big risks there. Also, cached data usually doesn't keep any relations so you're not sacrificing any consistency here.
NoSQL is not correct for this application.
I mean, you can use it sure, but you will end up re-implementing a lot of what SQL offers for you. For example I see a lot of relations there. You also want ACID (although some NoSQL solutions do offer that).
There is no reason you can't use both - keep relational data in relational databases, and non-relational data in key/value stores.

MongoDB vs. Redis vs. Cassandra for a fast-write, temporary row storage solution

I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. For as long as there is RAM to use on a server, Redis will not slow down pushing to the end of your lists which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate with data sets larger than the amount of RAM you have (it only writes to disk, reading is for restarting the server or in case of a system crash) and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messenging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list is the same structure; link clicked may have a user token, link name and URL, while the page viewed may only have the user token and URL. Your first concern is just getting the fact it happened and whatever absolutely neccesary data you need is pushed.
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands, by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.)
MongoDB is good at document storage. Unlike Redis it is able to deal with databases larger than RAM and it supports sharding/replication on it's own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema, you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo for the "step one" phase of holding data for processing and use a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to go "Show me everyone who views this page" or "How many pages did this person view on this given day" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing fast speed, the other option is to keep several impressions in local memory and then flush them disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a real compelling reason to move.
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. 250 reads is (or should be) no problem.
The more important question is, what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attribution, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers where the first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out for each entry and run it through a client-side regex.
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becames incredibly slow, probably the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performances by exposing a constant time insertion list type, but even with the linear time array type (which can guarantee constant time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. Best advice probably would be to benchmark on your typical dataset and see yourself.
According to the Benchmarking Top NoSQL Databases (download here)
I recommend Cassandra.
If you have the choice (and need to move away from flat fies) I would go with Redis. Its blazingly fast, will comfortably handle the load you're talking about, but more importantly you won't have to manage the flushing/IO code. I understand its pretty straight forward but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
Flat files are good. Summary statistics (eg total hits per page) can be obtained from flat files in a scalable manner using merge-sorty map-reducy type algorithms. It's not too hard to roll your own.
SQLite now supports Write Ahead Logging, which may also provide adequate performance.
I have hand-on experience with mongodb, couchdb and cassandra. I converted a lot of files to base64 string and insert these string into nosql.
mongodb is the fastest. cassandra is slowest. couchdb is slow too.
I think mysql would be much faster than all of them, but I didn't try mysql for my test case yet.