Why does paginating with offset using PSQL make sense? - postgresql

I've been looking into pagination (paginating by timestamp) with a PostgreSQL DBMS. My current approach is to build a B+-tree index to greatly reduce the cost of finding the start of the next chunk. But everywhere I look, in tutorials and in NPM modules like express-paginate (https://www.npmjs.com/package/express-paginate), people seem to fetch chunks using offset one way or another, or fetch all the data anyway and simply send it out in chunks, which to me doesn't seem like the complete optimization that pagination is meant for.
I can see that they're still making an optimization by lazy loading and streaming the chunks (thus saving bandwidth and any download/processing time on the client side), but offset in PostgreSQL still requires scanning the previous rows. In the worst case, where a user wants to view all the data, doesn't this approach have a very high server cost? If you have, say, n chunks, you're accessing the first chunk n times, the second chunk n-1 times, the third chunk n-2 times, and so on. I understand that this is really in terms of I/Os, so it's not that expensive, but it still bothers me.
Am I missing something very obvious here? I feel like I am, because a lot of established and experienced engineers seem to be using this approach. I'm guessing there is some part of the equation or mechanism that I'm just missing from my understanding.

No, you understand this quite well.
The reason why so many people and tools still advocate pagination with OFFSET and LIMIT (or FETCH FIRST n ROWS ONLY, to use the standard's language) is that they don't know a lot about databases. It is easy to understand LIMIT and OFFSET even if the word “index” means nothing more to you than ”the last pages in a book”.
There is another reason: to implement key set pagination, you must have an ORDER BY clause in your query, that ORDER BY clause has to contain a unique column, and you have to create an index that supports that ordering.
Moreover, your database has to be able to handle conditions like
... WHERE (name, id) > ('last_found', 42)
and support a multi-column index scan for them.
Since many tools strive to support several database systems, they are likely to go for the simple but inefficient method that works with every query on most database systems.
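For illustration, here is a minimal sketch of keyset pagination driven from Python with psycopg2; the items table, its columns and the connection string are made-up assumptions, and an index on (name, id) is presumed to exist:

import psycopg2  # sketch only; connection string, table and column names are assumptions

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()
PAGE_SIZE = 50

def next_page(last_name=None, last_id=None):
    # First page: there is no key to start from yet.
    if last_name is None:
        cur.execute("SELECT id, name FROM items ORDER BY name, id LIMIT %s", (PAGE_SIZE,))
    else:
        # Row-value comparison: with an index on (name, id), Postgres jumps
        # straight to the start of the page instead of skipping OFFSET rows.
        cur.execute(
            "SELECT id, name FROM items"
            " WHERE (name, id) > (%s, %s)"
            " ORDER BY name, id LIMIT %s",
            (last_name, last_id, PAGE_SIZE),
        )
    return cur.fetchall()

page = next_page()
while page:
    last_id, last_name = page[-1]          # remember the last row of this page
    page = next_page(last_name, last_id)   # and start the next page right after it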

Related

Table scan vs GSI when not scanning a lot

Is there any metric for when to choose a full table scan over a GSI, or the other way around?
I know the basic concept behind both, but the pricing model for a GSI is so dependent on the table itself that I'm having a hard time deciding, and more importantly, I'm unsure how that would scale with different table sizes, or how much scanning is too inefficient and calls for a GSI instead.
By the by, I'm having a hard time finding good resources on filter expressions for query and scan in DynamoDB; any good recommendations? ("#v >= :num" is what I mean; I'm probably not searching with the correct term.)
In general the decision to use a query versus a scan comes down to how much of the data you need. If the answer is that you need most of the data (which in practice can only really be the case for relatively small tables), then use a scan. Otherwise, use a query, pretty much every time.
It’s impossible to give a hard threshold for what ‘most of the data’ means. I’d say definitely more than 50% and that this threshold tends to 100% as the table size grows.
The exception to the above would be one-off operations that can be performed in the background and where you're willing to trade time for cost. The corollary is that if you are getting data for a customer-facing request, your aim should be to read as little from the database as possible to keep request times short.
All this being said, parallel scans can be super quick to pull in a lot of data, if you really need it and you are in a position to consume extra capacity. Even on largish tables, as long as you have the capacity to spare you can pull in hundreds of thousands of items in just a few seconds.
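On the side question about filter expressions, here is a rough boto3 sketch of the "#v >= :num" style the question mentions; the table, key and attribute names are invented:

import boto3  # illustrative only; table, key and attribute names are made up
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("clicks")

# The key condition narrows the read by partition key; the filter expression
# then drops items after they are read (you still consume capacity for them).
resp = table.query(
    KeyConditionExpression=Key("user_id").eq(501),
    FilterExpression="#v >= :num",
    ExpressionAttributeNames={"#v": "views"},
    ExpressionAttributeValues={":num": 10},
)
print(resp["Items"])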

Is it bad practice to keep everything in one table?

Looking for some feedback: I am building social-networking-type software; one of the features allows users to post news stories and have friends comment. In the past I have kept different tables for things like news, comments, calendar events, etc. However, a friend has turned me on to the WordPress-type database structure of "POSTS" and "post_types", where everything is in one table and has a "post_type".
This would mean that news stories, comments, events, etc. are all in the same table. I love the efficiency of creating functions that update one table. HOWEVER, a single table in my old software was 1.5 million rows, and I'd expect this new table to grow to about 10 million in the first year.
Does MySQL handle this size of data okay as long as indexes are properly set, or is it smarter to break everything into separate tables for this reason?
There is no general answer. It depends.
MySQL has no problem dealing with large tables. However, it will not do miracles for you. In the end, it's all about efficiency. It means you need to optimize your design for multiple, mutually exclusive goals. What you want to find is a sweet spot between complexity, performance, extensibility and maintenance costs. This is different for every project and is kind of an art.
Generally, you don't want to mix things that are too different. This is why they teach data normalization in just about every database book or CS course. If your data is small, this does not really matter. But if you have a lot of data and a lot of requests, you will almost certainly want to squeeze every last drop of performance from your database. So not only will you be separating tables, scrutinizing indexes, inspecting execution plans, updating statistics, defragmenting pages and measuring performance, but you will also be using partitioning, clustering, materialized views, read-only replicas, I/O and CPU parallelism, SSDs, Memcached and a variety of other tools. This will all be much more challenging if you have started with a bad data model. In my personal experience, locking is something that really bites you in the ass with large tables, unless you can somehow live without transactions.
To make any kind of estimate, you need to have some performance baseline. Just knowing the number of records is not enough. How many requests will there be? What will the queries be doing? Where do you expect the heaviest load? Can you prepare the most common queries that the system will be running most of the time? What about peak hours? What hardware will be available to run this load? What is the ratio of reads to writes? Etc.
To make optimizations, you need some kind of goal. As always, you will find out that in order to get there, you have to sacrifice something. Because you probably don't have all those answers yet, try following the principle of minimalism - start small, measure, analyze, improve, repeat.

When's the time to create dedicated collections in MongoDB to avoid difficult queries?

I am asking a question that I assume does not have a simple black-and-white answer, but the principle behind what I'm asking is clear.
Sample situation:
Let's say I have a collection of 1 million books, and I consistently want to pull the top 100 rated.
Let's assume that I need to perform an aggregate function every time I perform this query which makes it a little expensive.
It seems reasonable that, instead of running the query for every request (100-1000 a second), I would create a dedicated collection that stores only the top 100 books and gets updated every minute or so. Instead of running a difficult query 100 times every second, I only run it once a minute, and I pull from a small collection that holds only those 100 books and requires no real query (just get everything).
That is the principle I am questioning.
Should I create a dedicated collection for EVERY query that is often used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
Are there any guidelines for best practice in those types of situations?
Is there a point where, if a query runs so often and the data doesn't change very often, I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one does have to differentiate between "real-time data" and data which does not require immediate, real-time presentation. The rules for "real-time" systems are obviously much different.
Now to your example, starting from the end: the cache of query results. The answer is not only for MongoDB. Data architects often use Redis, or memcached (or other cache systems) to hold all types of information. This, though, is obviously a function of how much memory is available to your system and the DB. You do not want to cripple the DB by giving your cache too much of the available memory, and you do not want your cache to be useless by giving it too little.
In the book case, with the top 100, since it is certainly not a real-time endeavor, it would make sense to cache the query result and serve that cache out to requests. You could update the cache based on a cron job or based on an update flag (which you create to inform your program that the top 100 have changed), and then the system would run the $aggregate in the background.
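As a rough sketch of the dedicated-collection variant with pymongo, refreshed from a cron job or background task (database, collection and field names are invented):

from pymongo import MongoClient  # sketch; db, collection and field names are assumptions

db = MongoClient()["library"]

def refresh_top_books():
    # Run the expensive aggregation once and materialize the result into a
    # small dedicated collection; $out replaces "top_books" when it finishes.
    db.books.aggregate([
        {"$sort": {"avg_rating": -1}},
        {"$limit": 100},
        {"$out": "top_books"},
    ])

# Called once a minute from a scheduler; request handlers then just do:
top_100 = list(db.top_books.find())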
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data which has to be searched to $aggregate your response. And again, it also depends upon your memory limitations and, let me add, the whole server setup in terms of speed, cores and memory. IMHO, a cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I don't think anyone can really give a black-and-white answer to that question for your system. Is a complicated query just an $aggregate? Or is it an $unwind followed by a whole slew of $group etc. stages? This really depends on the dataset and how much information must actually be read, sifted and manipulated. It will affect your I/O and, yes, again, the memory.
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See the answers above; this is directly connected to your other questions.
Finally:
Are there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the I/O, and study actual reads and writes on the collections.
Hope this helps.
Use a cache to store objects. For example, in Redis, use Redis Lists:
Redis Lists are simply lists of strings, sorted by insertion order
Then set expiry to either a timeout or a specific time
Now whenever you have a miss in Redis, run the query in MongoDB and re-populate your cache. Also, since the cache resides in memory, your fetches will be extremely fast compared to dedicated collections in MongoDB.
In addition to that, you don't have to have a dedicated machine for it; just deploy it within your application machine.
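A minimal cache-aside sketch of that idea with redis-py and pymongo; the key name, TTL and aggregation pipeline are placeholders (a plain JSON string is used here instead of a Redis List, which works just as well for this purpose):

import json
import redis  # sketch; key name, TTL and pipeline details are assumptions
from pymongo import MongoClient

r = redis.Redis()
db = MongoClient()["library"]

def top_books():
    cached = r.get("top_books")
    if cached is not None:
        return json.loads(cached)            # cache hit: MongoDB is never touched
    # Cache miss: run the expensive query once, then re-populate the cache.
    books = list(db.books.aggregate([
        {"$sort": {"avg_rating": -1}},
        {"$limit": 100},
        {"$project": {"_id": 0, "title": 1, "avg_rating": 1}},
    ]))
    r.set("top_books", json.dumps(books), ex=60)  # expire after 60 seconds
    return books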

NoSQL & AdHoc Queries - Millions of Rows

I currently run a MySQL-powered website where users promote advertisements and gain revenue every time someone completes one. We log every time someone views an ad ("impression"), every time a user clicks an ad ("click"), and every time someone completes an ad ("lead").
Since we get so much traffic, we have millions of records in each of these respective tables. We then have to query these tables to let users see how much they have earned, so we end up performing multiple queries on tables with millions and millions of rows multiple times in one request, hundreds of times concurrently.
We're looking to move away from MySQL to a key-value store or something along those lines. We need something that will let us store all these millions of rows, query them in milliseconds, and MOST IMPORTANTLY, use ad-hoc queries where we can query any single column, so we could do things like:
FROM leads WHERE country = 'US' AND user_id = 501 (the NoSQL equivalent, obviously)
FROM clicks WHERE ad_id = 1952 AND user_id = 200 AND country = 'GB'
etc.
Does anyone have any good suggestions? I was considering MongoDB or CouchDB, but I'm not sure if they can handle querying millions of records multiple times a second and the type of ad-hoc queries we need.
Thanks!
With those requirements, you are probably better off sticking with SQL and setting up replication/clustering if you are running into load issues. You can set up indexing on a document database so that those queries are possible, but you don't really gain anything over your current system.
NoSQL systems generally improve performance by leaving out some of the more complex features of relational systems. This means that they will only help if your scenario doesn't require those features. Running ad hoc queries on tabular data is exactly what SQL was designed for.
CouchDB's map/reduce is incremental which means it only processes a document once and stores the results.
Let's assume, for a moment, that CouchDB is the slowest database in the world. Your first query with millions of rows takes, maybe, 20 hours. That sounds terrible. However, your second query, your third query, your fourth query, and your hundredth query will take 50 milliseconds, perhaps 100 including HTTP and network latency.
You could say CouchDB fails the benchmarks but gets honors in the school of hard knocks.
I would not worry about performance, but rather if CouchDB can satisfy your ad-hoc query requirements. CouchDB wants to know what queries will occur, so it can do the hard work up-front before the query arrives. When the query does arrive, the answer is already prepared and out it goes!
All of your examples are possible with CouchDB. A so-called merge-join (lots of equality conditions) is no problem. However CouchDB cannot support multiple inequality queries simultaneously. You cannot ask CouchDB, in a single query, for users between age 18-40 who also clicked fewer than 10 times.
The nice thing about CouchDB's HTTP and JavaScript interface is that it's easy to do a quick feasibility study. I suggest you try it out!
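To make the "hard work up-front" idea concrete, here is a rough sketch of defining and querying a CouchDB view over plain HTTP; the database name, field names and local URL are invented, and the map function itself is JavaScript because that is what CouchDB expects:

import requests  # sketch; database name, fields and URL are assumptions

DB = "http://localhost:5984/clicks"

# The map function is declared up-front; CouchDB builds and maintains the index,
# so equality lookups on (ad_id, user_id, country) become a single key read.
design = {
    "views": {
        "by_ad_user_country": {
            "map": "function (doc) { emit([doc.ad_id, doc.user_id, doc.country], 1); }"
        }
    }
}
requests.put(DB + "/_design/stats", json=design)

# Querying the precomputed view: the answer is already prepared when this arrives.
resp = requests.get(
    DB + "/_design/stats/_view/by_ad_user_country",
    params={"key": '[1952, 200, "GB"]'},
)
print(resp.json()["rows"])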
Most people would probably recommend MongoDB for a tracking/analytic system like this, for good reasons. You should read the „MongoDB for Real-Time Analytics” chapter from the „MongoDB Definitive Guide” book. Depending on the size of your data and scaling needs, you could get all the performance, schema-free storage and ad-hoc querying features. You will need to decide for yourself if issues with durability and unpredictability of the system are risky for you or not.
For a simpler tracking system, Redis would be a very good choice, offering rich functionality, blazing speed and real durability. To get a feel for how such a system would be implemented in Redis, see this gist. The downside is that you'd need to define all the „indices” by yourself, not get them for „free”, as is the case with MongoDB. Nevertheless, there's no free lunch, and MongoDB indices are definitely not a free lunch.
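To illustrate the hand-rolled „indices” point, here is a tiny redis-py sketch of tracking clicks; the key layout is an assumption, and every dimension you want to query later has to be maintained yourself:

import redis  # sketch; the key layout below is made up

r = redis.Redis()

def record_click(ad_id, user_id, country):
    # Each counter is, in effect, a hand-maintained "index".
    r.incr(f"clicks:ad:{ad_id}")
    r.incr(f"clicks:ad:{ad_id}:country:{country}")
    r.incr(f"clicks:user:{user_id}")

record_click(1952, 200, "GB")
print(r.get("clicks:ad:1952:country:GB"))  # b'1'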
I think you should have a look at what ElasticSearch would give you:
Blazing speed
Schema-free storage
Sharding and distributed architecture
Powerful analytic primitives in the form of facets
Easy implementation of „sliding window”-type of data storage with index aliases
It is at heart a „fulltext search engine”, but don't let that confuse you. Read the „Data Visualization with ElasticSearch and Protovis“ article for a real-world use case of ElasticSearch as a data mining engine.
Have a look at these slides for a real-world use case of the „sliding window” scenario.
There are many client libraries for ElasticSearch available, such as Tire for Ruby, so it's easy to get off the ground with a prototype quickly.
For the record (with all due respect to @jhs :), based on my experience, I cannot imagine an implementation where CouchDB is a feasible and useful option. It would be an awesome backup storage for your data, though.
If your working set can fit in memory, and you index the right fields in the documents, you'd be all set. Your ask is not something very typical, but I am sure that with proper hardware, the right collection design (denormalize!) and indexing you should be good to go. Read up on Mongo querying, and use explain() to test your queries. Stay away from IN and NOT IN clauses; that'd be my suggestion.
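For example, the two ad-hoc queries from the question translate roughly like this in pymongo; the database name, collection names and index choices are assumptions:

from pymongo import MongoClient, ASCENDING  # sketch; names and indexes are assumptions

db = MongoClient()["tracking"]

# Compound indexes matching the two example queries from the question.
db.leads.create_index([("country", ASCENDING), ("user_id", ASCENDING)])
db.clicks.create_index([("ad_id", ASCENDING), ("user_id", ASCENDING), ("country", ASCENDING)])

us_leads = db.leads.find({"country": "US", "user_id": 501})
gb_clicks = db.clicks.find({"ad_id": 1952, "user_id": 200, "country": "GB"})

# explain() shows whether the planner actually picked the index (look for IXSCAN).
print(db.clicks.find({"ad_id": 1952, "user_id": 200, "country": "GB"}).explain())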
It really depends on your data sets. The number one rule of NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data, you can look into the various NoSQL solutions out there. The default unit of distribution is the key. Therefore you need to remember that you must be able to split your data between your node machines effectively, otherwise you will end up with a horizontally scalable system with all the work still being done on one node (albeit with better queries, depending on the case).
You also need to think back to the CAP theorem: most NoSQL databases are eventually consistent (CP or AP) while traditional relational DBMSs are CA. This will impact the way you handle data and the creation of certain things; for example, key generation can become tricky.
Also remember that in some systems, such as HBase, there is no indexing concept. All your indexes will need to be built by your application logic, and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly, and there is also the possibility to integrate Solr with Mongo. You don't need to query just by ID in Mongo like you do in HBase, which is a column-family (aka Google BigTable-style) database where you essentially have nested key-value pairs.
So once again it comes down to your data: what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. In the work I am involved with, we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it, etc. We don't just use one system but many which are best suited to the job at hand. For this process we use different systems at different stages, as it gives us fast access where we need it, provides the ability to stream and analyse data in real time and, importantly, keeps track of everything as we go (as data loss in a prod system is a big deal). I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that productionizing a system using these technologies is a bit harder than installing MySQL on a server; some releases are not as stable, and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one has mentioned thus far is NewSQL, i.e. horizontally scalable RDBMSs. There are a few out there, like MySQL Cluster (I think) and VoltDB, which may suit your cause.
Again it comes down to understanding your data and the access patterns. NoSQL systems are also non-relational and are therefore better suited to non-relational data sets. If your data is inherently relational and you need SQL query features that really have to do things like Cartesian products (aka joins), then you may well be better off sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. However, for your use case I think a column-family database may be the best solution; I think there are a few places which have implemented similar solutions to very similar problems (I think the NYTimes is using HBase to monitor user page clicks). Another great example is Facebook; they are using HBase for this. There is a really good article here which may help you along your way and further explain some points above: http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
The final point would be that NoSQL systems are not the be-all and end-all. Putting your data into a NoSQL database does not mean it's going to perform any better than MySQL, Oracle or even text files... For example, see this blog post: http://mysqldba.blogspot.com/2010/03/cassandra-is-my-nosql-solution-but.html
I'd have a look at:
MongoDB - Document - CP
CouchDB - Document - AP
Redis - In memory key-value (not column family) - CP
Cassandra - Column Family - Available & Partition Tolerant (AP)
HBase - Column Family - Consistent & Partition Tolerant (CP)
Hadoop/Hive - Also have a look at Hadoop streaming...
Hypertable - Another CF CP DB.
VoltDB - A really good-looking product: a relational database that is distributed and might work for your case (it may be an easier move). They also seem to provide enterprise support, which may be more suited for a prod env (i.e. give business users a sense of security).
Anyway, that's my 2c. Playing around with the systems is really the only way you're going to find out what really works for your case.

A question about indexes and their impact on inserts & updates in a database

I have a question about the fine line between the cost of an index on a table that is growing steadily in size every month and the gain that index provides for queries.
The situation is, that I’ve two tables, Table1 and Table2. Each table grows slowly but regularly each month (with about 100 new rows for Table1 and a couple of rows for Table2).
My concrete question is whether to have an index or to drop it. I've measured that a covering index on Table2 improves some of my SELECT queries rather a lot, but again, I have to weigh the pros and cons, and I'm having a really hard time deciding.
For Table1 it might not be necessary to have an index, because SELECT queries there are not that common.
I would appreciate any suggestion, tips or just good advice to what is a good solution.
By the way, I’m using IBM DB2 version 9.7 as my Database system
Sincerely
Mestika
Any additional index will make your inserts slower and your queries faster.
To take a smart decision, you will have to measure exactly by how much, with the amount of data that you expect to see. If you have multiple clients accessing the database at the same time, it may make sense to write a small multithreaded application that simulates the maximum load, both for inserts and for queries.
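A rough sketch of what such a load simulator could look like; the connection helper, table and column names are placeholders (run it once with the index and once without, and compare the timings):

import statistics
import threading
import time

def get_connection():
    # Placeholder: return a DB-API connection to your DB2 instance,
    # e.g. via the ibm_db_dbi driver.
    raise NotImplementedError

INSERT = "INSERT INTO Table2 (col_a, col_b) VALUES (?, ?)"   # assumed columns
SELECT = "SELECT col_a, col_b FROM Table2 WHERE col_a = ?"   # the covered query

def worker(sql, params, timings, iterations=1000):
    conn = get_connection()
    cur = conn.cursor()
    for _ in range(iterations):
        start = time.perf_counter()
        cur.execute(sql, params)
        if sql.lstrip().upper().startswith("SELECT"):
            cur.fetchall()
        else:
            conn.commit()
        timings.append(time.perf_counter() - start)

insert_times, select_times = [], []
threads = [threading.Thread(target=worker, args=(INSERT, (1, "x"), insert_times)) for _ in range(4)]
threads += [threading.Thread(target=worker, args=(SELECT, (1,), select_times)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("median insert:", statistics.median(insert_times))
print("median select:", statistics.median(select_times))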
Your results will depend on the nature of your data and on the hardware that you are running. If you want to know the best answer for your use case, there is no way around testing accurately yourself, with your data and your hardware.
Then you will have to ask yourself:
Which query performance do I need?
If the query performance is good enough without the index anyway, easy: Don't add the index!
Which insert performance do I need?
Can it drop below the needed limit with the additional index? If not, easy: Add the index!
If you discover that you absolutely need the index for query performance and you can't get the required insert performance with the index, you may need to buy better hardware. Solid state discs can do wonders for database servers and they are getting affordable.
If your system is running fine for everyone anyway, worry less, let it run as is.