PostgreSQL: Efficient way to get Average of records - postgresql

I have a table called "ITEM_REVIEW" with a column named "score".
I know that I can get average of score by:
SELECT AVG(score) FROM "ITEM_REVIEW" WHERE "item_id"=1
Is this the right(and efficient) way to calculate average even when records are piled up? or is it better to fetch data and calculate average on my NodeJS server?

I'd simplify it to
SELECT AVG(ir.score) FROM "ITEM_REVIEW" ir
Yes it's massively more efficient to keep a million rows in the db on a fast hard disk, and calculate a single float that you pass over a network connection, than it is to transmit a million floats over a network connection into a shared server with more varied responsibilities and use some slow JavaScript to calculate the average
Databases are incredibly good at storing, collating, connecting and processing data; it's their sole purpose/entire reason for being. Teams of smart people spend all their time implementing and improving the performance of data processing features to ensure their db stays top of the pile. Leave your data in a database wherever possible, and drag the smallest number of bytes you can over slow network links (summarise and filter in the db always)

If you really care, you should try it and see. That way you get the correct answer for your versions, hardware, configuration, etc. Pay special attention to memory usage, as NodeJS might store all the data in memory before it gets around to averaging it.
But in general trying to re-implement the database outside of the database is going to lose. Not always, but in general.

Related

Table scan vs GSI when not scanning a lot

Is there any metric for when to choose a full table scan over GSI or the other way around?
I know the basic concept behind both but the pricing model for GSI is very dependent on the table itself that i'm having hard time deciding
and more importantly, how would that scale with different table sizes, or how much scanning is too inefficient and requiring a GSI instead
By the by, I'm having a hard time finding good resources for filtering expressions for the query and scan on dynamodb, any good recommendations? ("#v >= :num" this is what i mean, probably not searching with the correct term)
In general the decision to use a query versus a scan comes down to how much of the data you need. If the answer is that you need most of the data (which in practice can only really be case for relatively small tables) then use a scan. Otherwise, use a query — pretty much every time.
It’s impossible to give a hard threshold for what ‘most of the data’ means. I’d say definitely more than 50% and that this threshold tends to 100% as the table size grows.
The exception to the above would be one-off operations that can be performed in the background and where you you’re willing to trade time for cost. And the corollary, that if you are getting data for a customer facing request your aim should be to read as little from the database as possible to keep request times short.
All this being said, parallel scans can be super quick to pull in a lot of data, if you really need it and you are in a position to consume extra capacity. Even on largish tables, as long as you have the capacity to spare you can pull in hundreds of thousands of items in just a few seconds.

storing huge amounts of data in mongo

I am working on a front end system for a radius server.
The radius server will pass updates to the system every 180 seconds. Which means if I have about 15,000 clients that would be around 7,200,000 entries per day...Which is a lot.
I am trying to understand what the best possible way to store and retrieve this data will be. Obviously as time goes on, this will become substantial. Will MongoDB handle this? Typical document is not much, something this
{
id: 1
radiusId: uniqueId
start: 2017-01-01 14:23:23
upload: 102323
download: 1231556
}
However, there will be MANY of these records. I guess this is something similar to the way that SNMP NMS servers handle data which as far as I know they use RRD to do this.
Currently in my testing I just push every document into a single collection. So I am asking,
A) Is Mongo the right tool for the job and
B) Is there a better/more preferred/more optimal way to store the data
EDIT:
OK, so just incase someone comes across this and needs some help.
I ran it for a while in mongo, I was really not satisfied with performance. We can chalk this up to the hardware I was running on, perhaps my level of knowledge or the framework I was using. However I found a solution that works very well for me. InfluxDB pretty much handles all of this right out of the box, its a time series database which is effectively the data I am trying to store (https://github.com/influxdata/influxdb). Performance for me has been like night & day. Again, could all be my fault, just updating this.
EDIT 2:
So after a while I think I figured out why I never got the performance I was after with Mongo. I am using sailsjs as framework and it was searching by id using regex, which obviously has a huge performance hit. I will eventually try migrate back to Mongo instead of influx and see if its better.
15,000 clients updating every 180 seconds = ~83 insertions / sec. That's not a huge load even for a moderately sized DB server, especially given the very small size of the records you're inserting.
I think MongoDB will do fine with that load (also, to be honest, almost any modern SQL DB would probably be able to keep up as well). IMHO, the key points to consider are these:
Hardware: make sure you have enough RAM. This will primarily depend on how many indexes you define, and how many queries you're doing. If this is primarily a log that will rarely be read, then you won't need much RAM for your working set (although you'll need enough for your indexes). But if you're also running queries then you'll need much more resources
If you are running extensive queries, consider setting up a replica set. That way, your master server can be reserved for writing data, ensuring reliability, while your slaves can be configured to serve your queries without affecting the write reliability.
Regarding the data structure, I think that's fine, but it'll really depend on what type of queries you wish to run against it. For example, if most queries use the radiusId to reference another table and pull in a bunch of data for each record, then you might want to consider denormalizing some of that data. But again, that really depends on the queries you run.
If you're really concerned about managing the write load reliably, consider using the Mongo front-end only to manage the writes, and then dumping the data to a data warehouse backend to run queries on. You can partially do this by running a replica set like I mentioned above, but the disadvantage of a replica set is that you can't restructure the data. The data in each member of the replica set is exactly the same (hence the name, replica set :-) Oftentimes, the best structure for writing data (normalized, small records) isn't the best structure for reading data (denormalized, large records with all the info and joins you need already done). If you're running a bunch of complex queries referencing a bunch of other tables, using a true data warehouse for the querying part might be better.
As your write load increases, you may consider sharding. I'm assuming the RadiusId points to each specific server among a pool of Radius servers. You could potentially shard on that key, which would split the writes based on which server is sending the data. Thus, as you increase your radius servers, you can increase your mongo servers proportionally to maintain write reliability. However, I don't think you need to do this right away as I bet one reasonably provisioned server should be able to manage the load you've specified.
Anyway, those are my preliminary suggestions.

When's the time to create dedicated collections in MongoDB to avoid difficult queries?

I am asking a question that I assume does not have a simple black and white question but the principal of which I'm asking is clear.
Sample situation:
Lets say I have a collection of 1 million books, and I consistently want to always pull the top 100 rated.
Let's assume that I need to perform an aggregate function every time I perform this query which makes it a little expensive.
It is reasonable, that instead of running the query for every request (100-1000 a second), I would create a dedicated collection that only stores the top 100 books that gets updated every minute or so, thus instead of running a difficult query a 100 times every second, I only run it once a minute, and instead pull from a small collection of books that only holds the 100 books and that requires no query (just get everything).
That is the principal I am questioning.
Should I create a dedicated collection for EVERY query that is often
used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough
to leave as is?
Is there any guidelines for best practice in those types of
situations?
Is there a point where if a query runs so often and the data doesn't
change very often that I should keep the data in the server's memory
for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one does have to differentiate between "real-time data" vis-a-vis data which does not require immediate and real-time presenting of information. The rules for "real-time" systems are obviously much different.
Now to your example starting from the end. The cache of query results. The answer is not only for MongoDB. Data architects often use Redis, or memcached (or other cache systems) to hold all types of information. This though, obviously, is a function of how much memory is available to your system and the DB. You do not want to cripple the DB by giving your cache too much of available memory, and you do not want your cache to be useless by giving it too little.
In the book case, of 100 top ones, since it is certainly not a real time endeavor, it would make sense to cache the query and feed that cache out to requests. You could update the cache based upon a cron job or based upon an update flag (which you create to inform your program that the 100 have been updated) and then the system will run an $aggregate in the background.
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data which has to be searched to $aggregate your response. And again, it also depends upon your memory limitations and btw let me add the whole server setup in terms of speed, cores and memory. MHO - cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I dont think anyone can really black and white answer to that question for your system. Is a complicated query just an $aggregate? Or is it $unwind and then a whole slew of $group etc. options following? this is really up to the dataset and how much information must actually be read and sifted and manipulated. It will effect your IO and, yes, again, the memory.
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See answers above this is directly connected to your other questions.
Finally:
Is there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the IO, study actual reads and writes on the collections.
Hope this helps.
Use a cache to store objects. For example in Redis use Redis Lists
Redis Lists are simply lists of strings, sorted by insertion order
Then set expiry to either a timeout or a specific time
Now whenever you have a miss in Redis, run the query in MongoDB and re-populate your cache. Also since cache resids in memory therefore your fetches will be extremely fast as compared to dedicated collections in MongoDB.
In addition to that, you don't have to keep have a dedicated machine, just deploy it within your application machine.

Why don't most NoSQL DBMSs have “pointers”?

What is the objective reason fo why don't most NoSQL storage solutions have some kind of "pointers" for ultra-efficient joins, like the pre-relational DBMSs had?
I mean, I partially understand the theoretical reasons for why classical RDBMSs have ditched pointers (need to update them and double sync them for memory and disk, no "disks" fast enough to be treatable like random-access for some use cases, like modern SSDs can, etc.).
But of the many NoSQL solutions out there, why do just so few of them realize that this model would be awesome (exception I know of would be OrientDB and Neo4j) for many practical cases, not only ones that need graph traversals. I mean, when you need things like multi-joins, you need to ping pong Mongo and do N queries instead of one.
Isn't the use case of a NoSQL document-db overlapping enough with the one of graph DBs that such a feature would make sense and would just provide all the practical features of SQL-joins to the NoSQL solutions with not much extra cost, and for most queries would make indexes useless, and take up much less space for huge datasets?
(...and as a bonus any NoSQL solution would be ready to use as a graph db, and doing a ~100 nodes path length traversal of a graph stored in Mongo would just automagically work fast enough)
I believe the key problem is data locality and horizontal scalability. A premise of NoSQL is that the read-heavy models of RBDMSs, i.e. those that require joins, lead to bottlenecks.
Think of Twitter: the original data model was read-heavy, but the joins you need to make are insanely large (billions of tweets x hundreds of millions of users x tens of billions of follower-followee relations that are wildly varying in size [1-10M, or whatever aplusk has these days]).
When even the ids you'll want to join don't fit in a reasonable machine's RAM, calculating the overlap of ids becomes terribly expensive. If you take the actual data into account, horizontal scalability becomes next to impossible because there's no a priori knowledge which shards / machines will need to be hit. Storing all follower pointers in every follower-list would require insane bookkeeping for trivial changes, while not exploiting creation-time locality (or at least, creation-time locality per feed).
In a multi-tenant application, you can always shard by the tenants, or by the sales region or by agents or maybe even by time: You can find some locality criterion that is good for like > 95% of the cases.
With graphs, that becomes a lot more complicated, especially those which have certain connection properties (scale-free networks with small diameter / small world phenomenon): A simple post, say by a celebrity, can quickly spread through a large portion of the entire network, meaning that practically every query must hit the one node that holds the post.
Sure, the post itself would be cached by the web servers, but add likes and comments, or favorites and retweets and the story becomes a nightmare (writes!) Add in notification emails, content ranking and filtering and you're in true horror.
doing a ~100 nodes path length traversal of a graph stored in Mongo would just automagically work fast enough
If that data happens to be on 100 different nodes, the sheer network overhead will be in the range of 50ms, even in a single datacenter with no congestion and idle machines. If this spreads across the world or individual queries take a little longer, you'll quickly end up at 5000ms. Also, the query would fail if only one machine is down.
This depends too much on the details of the network, which is why the problem should be solved by application code, not by the data store.
when you need things like multi-joins, you need to ping pong Mongo and do N queries instead of one
When you need multi-joins in MongoDB, you're using the wrong tool for your data model, or vice versa. Multi-Join means normalized means read-heavy which battles the key concept of MongoDB. However, you can store quite large association lists even in MongoDB. But the tool becomes almost irrelevant here: If you look at Facebook TAO, for instance, there's little technology dependence in that.

MongoDB High Avg. Flush Time - Write Heavy

I'm using MongoDB with approximately 4 million documents and around 5-6GB database size. The machine has 10GB of RAM, and free only reports around 3.7GB in use. The database is used for a video game related ladder (rankings) website, separated by region.
It's a fairly write heavy operation, but still gets a significant number of reads as well. We use an updater which queries an outside source every hour or two. This updater then processes the records and updates documents on the database. The updater only processes one region at a time (see previous paragraph), so approximately 33% of the database is updated.
When the updater runs, and for the duration that it runs, the average flush time spikes up to around 35-40 seconds, and we experience general slowdowns with other queries. The updater is RAN on a SEPARATE MACHINE and only queries MongoDB at the end, when all the data has been retrieved and processed from the third party.
Some people have suggested slowing down the number of updates, or only updating players who have changed, but the problem comes down to rankings. Since we support ties between players, we need to pre-calculate the ranks - so if only a few users have actually changed ranks, we still need to update the rest of the users ranks accordingly. At least, that was the case with MySQL - I'm not sure if there is a good solution with MongoDB for ranking ~800K->1.2 million documents while supporting ties.
My question is: how can we improve the flush and slowdown we're experiencing? Why is it spiking so high? Would disabling journaling (to take some load off the i/o) help, as data loss isn't something I'm worried about as the database is updated frequently regardless?
Server status: http://pastebin.com/w1ETfPWs
You are using the wrong tool for the job. MongoDB isn't designed for ranking large ladders in real time, at least not quickly.
Use something like Redis, Redis have something called a "Sorted List" designed just for this job, with it you can have 100 millions entries and still fetch the 5000000th to 5001000th at sub millisecond speed.
From the official site (Redis - Sorted sets):
Sorted sets
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
In short with sorted sets you can do a lot of tasks with great performance that are really hard to model in other kind of databases.
With Sorted Sets you can:
Take a leader board in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE, you can also, given an user name, return its rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
Sorted Sets are often used in order to index data that is stored inside Redis. For instance if you have many hashes representing users, you can use a sorted set with elements having the age of the user as the score and the ID of the user as the value. So using ZRANGEBYSCORE it will be trivial and fast to retrieve all the users with a given interval of ages.
Sorted Sets are probably the most advanced Redis data types, so take some time to check the full list of Sorted Set commands to discover what you can do with Redis!
Without seeing any disk statistics, I am of the opinion that you are saturating your disks.
This can be checked with iostat -xmt 2, and checking the %util column.
Please don't disable journalling - you will only cause more issues later down the line when your machine crashes.
Separating collections will have no effect. Separating databases may, but if you're IO bound, this will do nothing to help you.
Options
If I am correct, and your disks are saturated, adding more disks in a RAID 10 configuration will vastly help performance and durability - more so if you separate the journal off to an SSD.
Assuming that this machine is a single server, you can setup a replicaset and send your read queries there. This should help you a fair bit, but not as much as the disks.