Best PostgreSQL hierarchical tree for both performance and moving nodes from GUI? - postgresql

Since I'm using PostgreSQL, there is a module called ltree which satisfies at least one of my needs: performance. (I don't know about scalability; some say materialized path trees do not scale well.)
Since the application I'm developing is a CMS built entirely around a big tree (nodes, subtrees, etc.), performance in querying these nodes is absolutely essential. But since it's a large (and growing) hierarchical tree that you work on and manipulate from the GUI (CRUD), I also want to make it possible for users to drag and drop to reorder nodes, subtrees etc. while updating the tree (child records) in the database correctly.
As I understand it, moving and reordering nodes/subtrees is not really what ltree/materialized path trees are good at. So what I hope you can help me with is either to point me to the tree structure model that is best for performance AND for moving subtrees and nodes, or, if ltree is indeed not a leftover from the past and is still worth using, to explain how you could achieve this with PostgreSQL's ltree module. And why or why not use ltree in this case?
Requirements:
Query performance is of course my top priority (all nodes, subtrees, leaves).
The tree should support deep nesting and sorting.
And of course the tree should support growing large and scaling big.
I can live with a little waiting time while reordering from the GUI, if one "jack-of-all-trades" tree implementation doesn't exist or is too complex to be worth it.
I'm also considering Closure tables aka Bridge tables (a lot!), Nested Intervals (not sure I understand exactly how to implement them, and no good examples or gists currently exist?) or B-tree models. I'm just not quite sure yet how these will satisfy my above 4 requirements. Re-organizing subtrees and nodes in nested intervals seems straightforward and performance seems good. It's quite hard to choose the right one to go with.
Since I definitely need performance (query / read performance), scalability and sorting, I kind of thought that Closure tables WITH sort order could be very close, but I just can't imagine how big the closure tables and disk-space overhead will become as my tree and nodes grow large. I'm just not too sure about Closure tables and scalability. Am I wrong in worrying about this, and what might the best solution for this task be?

The typical data structures used to index trees stored in SQL are designed and optimized for read performance on sets that don't change often.
As an example, if you're using the nested set model, adding or deleting a node would involve updating the entire tree (which typically means rewriting the entire table): great for reads, not so great for writes.
When write performance is important for you, you'll usually be better off working on the raw (id, parent_id) tuples with recursive queries, while setting tree indexes you know for sure are dirty to null as you go. In those areas of the app where read-performance is more important, do a sanity check by checking for null values in the tree index, and re-index the tree as needed before actually using it. That way, you'll avoid incessant rewrites of your tree, and instead re-index it only when needed for a read.
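As a rough illustration of that "mark dirty, re-index on demand" idea, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table name `node`, the derived `path` column and the helper names are assumptions made up for this example, not something from the question; in PostgreSQL the same recursive query could live in a plpgsql function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),
        name      TEXT NOT NULL,
        path      TEXT              -- derived tree index; NULL means "dirty"
    );
    INSERT INTO node (id, parent_id, name) VALUES
        (1, NULL, 'root'), (2, 1, 'docs'), (3, 1, 'files'), (4, 2, 'intro');
""")


def move(node_id, new_parent_id):
    """Cheap write: repoint the parent, then null the index of the moved subtree."""
    conn.execute("UPDATE node SET parent_id = ? WHERE id = ?",
                 (new_parent_id, node_id))
    conn.execute("""
        WITH RECURSIVE sub(id) AS (
            SELECT ?
            UNION ALL
            SELECT n.id FROM node AS n JOIN sub ON n.parent_id = sub.id
        )
        UPDATE node SET path = NULL WHERE id IN (SELECT id FROM sub)
    """, (node_id,))


def reindex_if_dirty():
    """Rebuild all paths with one recursive query, but only if a NULL is found."""
    dirty = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM node WHERE path IS NULL)").fetchone()[0]
    if not dirty:
        return False
    conn.execute("""
        WITH RECURSIVE walk(id, path) AS (
            SELECT id, CAST(id AS TEXT) FROM node WHERE parent_id IS NULL
            UNION ALL
            SELECT n.id, walk.path || '.' || CAST(n.id AS TEXT)
            FROM node AS n JOIN walk ON n.parent_id = walk.id
        )
        UPDATE node SET path = (SELECT walk.path FROM walk WHERE walk.id = node.id)
    """)
    return True


reindex_if_dirty()          # initial build: every path starts out NULL
move(4, 3)                  # drag 'intro' from 'docs' to 'files'
reindex_if_dirty()          # a read-heavy code path would call this first
print(conn.execute("SELECT path FROM node WHERE id = 4").fetchone()[0])
```

The write path only touches the moved subtree, and the full rebuild happens at most once per batch of writes, right before a read that needs the index.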
An alternative, albeit (much) more difficult, approach is to use a variation of e.g. nested sets or nested intervals, but using reals or floats instead of integers. This allows you to insert, move and delete nodes for free, at the cost of some storage and arithmetic/read overhead and the loss of some properties such as child node counts in the case of nested sets. However, it also requires that you keep an eye out for pathological edge cases. Namely, you'll need to periodically -- and sometimes preemptively -- "garbage collect" and re-index large enough chunks of the tree's index in order to fit new nodes when you run into the floating point type's precision limits.
(A variation of the latter is to use a numeric without any precision in order to try to dodge the problem. But that's just kicking the can down the road, in the sense that you'll still be limited by Postgres internals to a few thousand digits of precision. And in my own tests from a few years back, the storage and arithmetic overheads become material compared to just using a floating point type long before you run into that limit.)
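The precision limit mentioned for the floating-point variant is easy to reproduce. The sketch below is only an illustration in plain Python (whose floats are IEEE 754 doubles); the function name is made up for this example. It models the pathological case of always inserting a new boundary at the same spot, halving the remaining gap each time.

```python
def midpoint_inserts_until_exhausted(left, right):
    """Count how many keys fit between two float bounds by repeated halving."""
    count = 0
    while True:
        mid = (left + right) / 2
        if mid <= left or mid >= right:
            # No representable value is left between the two neighbours:
            # this is the point where a re-index of the region is needed.
            return count
        right = mid          # worst case: the next insert lands in the same gap
        count += 1


inserts = midpoint_inserts_until_exhausted(1.0, 2.0)
print(inserts)
```

With double precision the gap is exhausted after a few dozen inserts in the same place, which is why the periodic "garbage collection" described above is needed even though ordinary, spread-out inserts rarely hit the limit.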
As for a "The Best" structure or approach, there really is no magic bullet... Each has pros and cons based on the use case (frequency of reads vs writes) and the size of the set. There's plenty of literature on the web that compares and explains each of them, which I'm sure you've found already.
That being said, for a CMS I'd advise that you go with whichever method you're most comfortable with. Either re-index the tree on the fly as writes occur, or mark the tree as dirty on writes and then re-index it on demand. The point here is that, if re-indexing is done right (= using a plpgsql function or equivalent, rather than a gazillion queries issued by your app), re-indexing an entire tree of a few hundred thousand nodes will take a few hundred milliseconds at most. Assuming the tree isn't constantly getting updated, that's a perfectly acceptable overhead for end users.

Related

Why does paginating with offset using PSQL make sense?

I've been looking into pagination (paginating by timestamp) with a PSQL DBMS. My approach currently is to build a B+ tree index to greatly reduce the cost of finding the start of the next chunk. But everywhere I look, in tutorials and in NPM modules like express-paginate (https://www.npmjs.com/package/express-paginate), people seem to get chunks using offset one way or another, or fetch all the data anyway and simply send it in chunks, which to me doesn't seem to be the complete optimization that pagination is meant to be.
I can see that they're still making an optimization by lazy loading and streaming the chunks (thus saving bandwidth and any download/processing time on the client side), but offset on PSQL still requires scanning the previous rows. In the worst case, where a user wants to view all the data, doesn't this approach have a very high server cost? If you have, say, n chunks, you're accessing the first chunk n times, the second chunk n-1 times, the third chunk n-2 times, etc. I understand that this is really in terms of IOs so it's not that expensive, but it still bothers me.
Am I missing something very obvious here? I feel like I am, because there seem to be many more established and experienced engineers who use this approach. I'm guessing there is some part of the equation or mechanism that I'm just missing from my understanding.
No, you understand this quite well.
The reason why so many people and tools still advocate pagination with OFFSET and LIMIT (or FETCH FIRST n ROWS ONLY, to use the standard's language) is that they don't know a lot about databases. It is easy to understand LIMIT and OFFSET even if the word “index” has no other meaning to you than “the last pages in a book”.
There is another reason: to implement keyset pagination, you must have an ORDER BY clause in your query, that ORDER BY clause has to contain a unique column, and you have to create an index that supports that ordering.
Moreover, your database has to be able to handle conditions like
... WHERE (name, id) > ('last_found', 42)
and support a multi-column index scan for them.
Since many tools strive to support several database systems, they are likely to go for the simple but inefficient method that works with every query on most database systems.
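To make the keyset approach concrete, here is a small sketch using Python's sqlite3 module as a stand-in for PostgreSQL. The `person` table, its index and the helper functions are assumptions for this example only, and the row-value comparison needs SQLite 3.15 or later. Each page starts from the last (name, id) pair seen, so earlier rows are never scanned again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
# The index matches the ORDER BY and ends in a unique column (id).
conn.execute("CREATE INDEX person_name_id ON person (name, id)")
conn.executemany("INSERT INTO person (id, name) VALUES (?, ?)",
                 [(i, f"name{i % 7}") for i in range(1, 51)])


def page_after(last_name, last_id, size):
    """Fetch the next page strictly after the last row of the previous page."""
    return conn.execute(
        "SELECT name, id FROM person "
        "WHERE (name, id) > (?, ?) "
        "ORDER BY name, id LIMIT ?",
        (last_name, last_id, size)).fetchall()


def all_pages(size):
    """Walk the whole table page by page, carrying the key of the last row."""
    rows, cursor = [], ("", 0)      # sorts before every real (name, id) pair
    while True:
        page = page_after(cursor[0], cursor[1], size)
        if not page:
            return rows
        rows.extend(page)
        cursor = page[-1]


print(len(all_pages(8)))
```

The trade-off is that a client can only ask for "the page after this row", not jump to an arbitrary page number, which is part of why generic tools fall back to OFFSET.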

R-tree - Remove algorithm using reinsertion

I am trying to implement an R-tree in Scala following the guidelines from the original paper about the R-tree structure. The deletion algorithm section states:
Reinsert all entries of nodes in set Q. Entries from eliminated leaf nodes are reinserted in tree leaves as described in Insert, but entries from higher level nodes must be placed higher in the tree, so that leaves of their dependent subtrees will be on the same level as leaves of the main tree.
I can't wrap my head around the last part. What is meant by "higher level nodes must be placed higher in the tree"? How is that implemented? My idea was that I remove nodes that underflow, add them (their entries) to the set Q, and in the end reinsert their entries using Insert. Is this incorrect, or partially correct but in need of something extra? If you can explain using examples as well, that would be great.
Nodes must be reinserted at the correct height, or the tree will become invalid. Remember that all leaves must be at the same level. In other words, an entry taken from an eliminated node at some level has to be inserted into a node at that same level, not into a leaf, so that the subtree hanging off it still ends at leaf level.
Inserting and removing values in an R-tree is quite an expensive operation when you need to keep it optimally balanced for fast window or nearest-neighbor queries, especially in a multi-threaded environment.
A more efficient approach is using one writer (actor or thread) which gathers updates in batches, packs a new R-Tree instance and publishes it in some volatile variable for reading.
Here is a comparison of some R-Tree implementations that can be used in such a way from Scala.

PostgreSQL ltree- vs tree module vs integer/string arrays or string delimited path

As you may know, there's a module for PostgreSQL called ltree. You also have the possibility to use the Array type for integers (*1, see comment below), which in this test actually turns out to perform a little slower with its recursive queries compared to ltree, except for the string indexing (*2, see comment below).
I'm not too sure about the credibility of these test results though.
My biggest question here is actually about the relatively unknown, and almost undocumented tree module. Described here (where the documentation also can be found!!) as:
support for hierachical data types (sort of lexicographical trees),
should go to contrib/tree, pending because of lack of proper
documentation.
After reading through the documentation I'm a little bit confused as to whether I should base my big application (a CMS where everything will be stored in a hierarchical tree structure, not only content but also files etc., so you can see this quickly scales up) on ltree, on a normal Materialized Path (Path Enumeration) with a delimited string or integer array as the path, or whether the relatively unknown "tree" module should in theory be the faster, more scalable and better solution.
I've already analysed the different tree structure models. Since query performance, scalability and reordering of nodes and subtrees are my main requirements, I've been able to rule out Adjacency Lists (recursive CTEs will not solve performance as the tree scales huge), Nested Sets/Intervals (not fast enough in some queries, considering their disadvantages when manipulating the tree), Closure Tables (terrible at scaling big in complex trees, not useful for such a large project as mine) etc. and decided to go with the Materialized Path, which is super fast for read operations and makes it easy to move subtrees and nodes around the hierarchy. So the question is only about the best of the proposed implementations for Materialized Path.
I'm especially curious in hearing your theories or experiences with "tree" in PostgreSQL.
As far as I have read, contrib/tree was never officially released, whereas ltree ships with PostgreSQL as a contrib module.
I understand both use the same idea of a labeled path, but tree only allowed integer labels, whereas ltree allows text labels, which permit full-text searches, though the full path length is limited (65 kB max, 2 kB preferred).
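For comparison, the plain delimited-string variant of a materialized path that the question mentions can be sketched as follows. This is only an illustration with Python's sqlite3 module; the `node` table and the helper names are assumptions for the example, and labels are integer ids so the LIKE pattern contains no wildcard characters of its own. With ltree the subtree test would use its own operators (such as `<@`) instead of LIKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (path TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO node (path) VALUES (?)",
                 [("1",), ("1.2",), ("1.2.4",), ("1.2.4.5",), ("1.3",)])


def subtree(path):
    """A node and all of its descendants: a prefix match on the path."""
    return [row[0] for row in conn.execute(
        "SELECT path FROM node WHERE path = ? OR path LIKE ? ORDER BY path",
        (path, path + ".%"))]


def move_subtree(old_path, new_parent_path):
    """Moving a subtree means rewriting the prefix of every path in it."""
    label = old_path.rsplit(".", 1)[-1]
    new_path = f"{new_parent_path}.{label}"
    conn.execute(
        "UPDATE node SET path = ? || substr(path, ?) "
        "WHERE path = ? OR path LIKE ?",
        (new_path, len(old_path) + 1, old_path, old_path + ".%"))


move_subtree("1.2.4", "1.3")
print(subtree("1.3"))
```

Reads are a single indexed prefix scan, while a move costs one UPDATE that touches every row in the moved subtree.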

Too much data duplication in mongodb?

I'm new to this whole NOSQL stuff and have recently been intrigued with mongoDB. I'm creating a new website from scratch and decided to go with MONGODB/NORM (for C#) as my only database. I've been reading up a lot about how to properly design your document model database and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site and I'm starting to see issues with data duplication/sync that I need to deal with over and over again. From what I read, this is expected in the document model, and for performance it makes sense. I.E. you stick embedded objects into your document so it's fast to read - no joins; but of course you can't always embed, so mongodb has this concept of a DbReference which is basically analogous to a foreign key in relational DBs.
So here's an example: I have Users and Events; both get their own document. Users attend events, and Events have user attendees. I decided to embed a list of Events with limited data into the User objects. I also embedded a list of Users into the Event objects as their "attendees". The problem here is that I now have to keep the Users in sync with the list of Users that is also embedded in the Event object. As I read it, this seems to be the preferred approach, and the NOSQL way to do things. Retrieval is fast, but the drawback is that when I update the main User document, I also need to go into the Event objects, possibly find all references to that user, and update those as well.
So the question I have is, is this a pretty common problem people need to deal with? How much does this problem have to happen before you start saying "maybe the NOSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage because you're having a hard time keeping data in sync in embedded objects and doing multiple reads to the DB to do so?
Well, that is the trade-off with document stores. You can store data in a normalized fashion like in any standard RDBMS, and you should strive for normalization as much as possible. It's only where it's a performance hit that you should break normalization and flatten your data structures. The trade-off is read efficiency vs update cost.
Mongo has really efficient indexes, which can make normalizing easier, like in a traditional RDBMS (most document stores do not give you this for free, which is why Mongo is more of a hybrid than a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a surrogate table in a tabular data store. Index the event and user fields and it should be pretty quick and will help you normalize your data better.
I like to plot the efficiency of flattening a structure vs keeping it normalized in terms of the time it takes me to update a record's data vs reading out what I need in a query. You can do it in terms of big O notation but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data and get a good gut feeling about how much work is required.
Basically, what I do is first try to predict how many updates a record will have vs. how often it's read. Then I try to predict what the cost of an update is vs. a read when it's normalized or flattened (or maybe some partial combination of the two that I can conceive... lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I've plotted all the variables, if keeping it flat saves me a bunch, then I will keep it flat.
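That back-of-the-envelope comparison can be written down in a few lines. The sketch below is only an illustration: the function name and every number in it are hypothetical placeholders, to be replaced with estimates or measurements from your own workload.

```python
def expected_cost(reads, updates, read_cost, update_cost):
    """Total cost over some period: each read and each update has a unit cost."""
    return reads * read_cost + updates * update_cost


# Hypothetical workload: 10,000 reads and 100 updates per hour.
# Normalized: reads need several lookups, updates touch one document.
normalized = expected_cost(reads=10_000, updates=100, read_cost=3.0, update_cost=1.0)
# Flattened: reads are a single lookup, updates fan out to embedded copies.
flattened = expected_cost(reads=10_000, updates=100, read_cost=1.0, update_cost=25.0)

print(normalized, flattened)
```

With these made-up numbers flattening wins clearly; raising the update rate or the fan-out per update in the same formula shows where the balance tips the other way.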
A few tips:
If you require lookups to be quick and atomic (perfectly up to date), you may want to favor flattening over normalization and take the hit on the update.
If you require updates to be quick and immediately accessible, then favor normalization.
If you require fast lookups but don't require perfectly up to date data, consider building out your normalized data in batch jobs (using map/reduce possibly).
If your queries need to be fast, updates are rare, and you do not necessarily require an update to be accessible immediately or require transaction-level locking to guarantee that it went through 100% of the time (i.e. that your update was written to disk), you can consider writing your updates to a queue and processing them in the background. (In this model, you will probably have to deal with conflict resolution and reconciliation later.)
Profile different models. Build out a data query abstraction layer (like an ORM in a way) in your code so you can refactor your data store structure later.
There are a lot of other ideas that you can employ. There are a lot of great blogs online that go into it, like highscalabilty.org, and make sure you understand the CAP theorem.
Also consider a caching layer, like Redis or memcache. I will put one of those products in front of my data layer. When I query mongo (which is storing everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I will invalidate any data in the cache that references what I'm updating. (Although you have to factor the time it takes to invalidate data, and to track which cached data is affected by an update, into your scaling considerations.) Someone once said "The two hardest things in Computer Science are naming things and cache invalidation."
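The cache-in-front pattern described above can be sketched with plain dictionaries standing in for both the normalized store and the cache. Everything here (the names, the key format, the reverse index) is an assumption for illustration; with Redis or memcache the `cache` dictionary would be replaced by get, set and delete calls against the cache server.

```python
# Normalized "source of truth" (stand-in for the database collections).
users = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
events = {10: {"title": "Meetup", "attendee_ids": [1, 2]}}

cache = {}                   # flattened views, keyed by "event:<id>"
cache_keys_by_user = {}      # reverse index: user id -> cache keys embedding it


def get_event_view(event_id):
    """Return a flattened event, building and caching it on a miss."""
    key = f"event:{event_id}"
    if key not in cache:
        event = events[event_id]
        cache[key] = {
            "title": event["title"],
            "attendees": [users[u]["name"] for u in event["attendee_ids"]],
        }
        for u in event["attendee_ids"]:
            cache_keys_by_user.setdefault(u, set()).add(key)
    return cache[key]


def rename_user(user_id, name):
    """Update the normalized record, then drop every cached view that used it."""
    users[user_id]["name"] = name
    for key in cache_keys_by_user.pop(user_id, set()):
        cache.pop(key, None)


get_event_view(10)           # first read builds the flattened view
rename_user(1, "Anna")       # the write invalidates it
print(get_event_view(10))    # rebuilt from the normalized data
```

The reverse index is the "tracking" cost mentioned above: it is what lets an update find the cached views it has to invalidate without scanning the whole cache.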
Try adding an IList of type UserEvent property to your User object. You didn't specify much about how your domain model is designed. Check the NoRM group http://groups.google.com/group/norm-mongodb/topics for examples.

Database Optimization techniques for amateurs

Can we get a list of basic optimization techniques going (anything from modeling to querying, creating indexes, views to query optimization). It would be nice to have a list of these, one technique per answer. As a hobbyist I would find this to be very useful, thanks.
And for the sake of not being too vague, let's say we are using a mainstream DB such as MySQL or Oracle, and that the DB will contain 500,000-1m or so records across ~10 tables, some with foreign key constraints, all using the most typical storage engines (e.g. InnoDB for MySQL). And of course, the basics such as PKs are defined, as well as FK constraints.
Learn about indexes, and use them properly. Generally speaking*, follow these guidelines:
Every table should have a clustered index
Fields used for filters and sorts are good candidates for indexing
More selective fields are better candidates for indexing
For best performance on crucial queries, design "covering indexes" for those queries
Make sure your indexes are actually being used, and remove those that aren't
If your table has 15 fields, and you make 15 indexes, each with only a single field, you're doing it wrong :)
*There are some exceptions to these rules if you know what you're doing. My experience is with Microsoft SQL Server, but I would presume most of this advice would still apply to a different RDBMS.
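The guidelines about covering indexes and checking that indexes are actually used can be tried out with any engine that exposes its query plan. The sketch below uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN purely as a stand-in (the answer itself is based on SQL Server, where you would read the execution plan instead); the `orders` table and the `plan` helper are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        total       REAL NOT NULL,
        note        TEXT
    )
""")
# One composite index that contains everything the crucial query needs.
conn.execute("CREATE INDEX orders_customer_total ON orders (customer_id, total)")


def plan(sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Filter and selected column are both in the index: no table lookup needed.
covered = plan("SELECT total FROM orders WHERE customer_id = ?", (1,))
# 'note' is not in the index, so each match needs an extra table lookup.
not_covered = plan("SELECT note FROM orders WHERE customer_id = ?", (1,))

print(covered)
print(not_covered)
```

Reading the plan like this is also a simple way to spot indexes that are never chosen and can be removed.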
IMO, by far the best optimization is to have the data model fit the problem domain for which it was built. When it does not, the resulting symptom is difficult-to-write or convoluted queries to get the information desired, and that typically rears its head when reports are built against the database. Thus, in designing a database it helps to have an idea as to the types and nature of the information, such as reports, that the users will want from the system.
When it comes to database design, read up on database normalization, e.g. the Wikipedia article on normal forms.
If you have a good design and still you need to optimize for performance, try Denormalisation.
If you have specific needs which are not covered by relational model efficiently, look at other models covered by the term NoSQL.
Some query/schema optimizations:
Be mindful when using DISTINCT or GROUP BY. I find that many new developers will use DISTINCT in places where it really is not needed or could be rewritten more efficiently using an Exists statement or a derived query.
Be mindful of Left Joins. All too often I find new SQL developers will ignore the schema in place and use Left Joins where they really are not necessary. For example:
Select ...
From Orders
    Left Join Customers
        On Customers.Id = Orders.CustomerId
If Orders.CustomerId is a required column, then it is not necessary to use a left join.
Be a student of new features. For example, MySQL versions before 8.0 did not support common-table expressions, which meant that some types of queries were cumbersome and probably slower to write than they would have been with CTEs. MySQL 8.0 added them. Keep up on new syntax features in MySQL which might be used to make existing queries more efficient.
You do not have to use surrogate keys everywhere. There might be tables better suited to an intelligent key (e.g. US State abbreviations, Currency Codes etc) which would enable developers to avoid additional joins in many cases.
If possible, find ways of archiving data to an OLAP or reporting server. The smaller you can make the production data, the faster it will run.
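The point about Left Joins above can be checked directly: when Orders.CustomerId is required and constrained by a foreign key, the left join and the inner join return the same rows. The sketch below uses Python's sqlite3 module; the column lists and sample rows are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    PRAGMA foreign_keys = ON;
    CREATE TABLE Customers (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Orders (
        Id         INTEGER PRIMARY KEY,
        CustomerId INTEGER NOT NULL REFERENCES Customers(Id)  -- required column
    );
    INSERT INTO Customers (Id, Name) VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Orders (Id, CustomerId) VALUES (100, 1), (101, 1), (102, 2);
""")

query = """
    SELECT Orders.Id, Customers.Name
    FROM Orders
    {join} Customers ON Customers.Id = Orders.CustomerId
    ORDER BY Orders.Id
"""

left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()

# Every order has a customer, so the outer join cannot add any rows.
print(left_rows == inner_rows)
```

Since the results are identical, the inner join states the intent more plainly and leaves the optimizer free to pick either join order.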
A design that concisely models your problem is always a good start. Overgeneralizing the data model can lead to performance problems. For example, I've heard reports of projects striving for uber-flexibility that use the RDBMS as a dumb "name/value" store - and resulting performance was appalling.
Once a good design is in place, then use the tools provided by the RDBMS to help it achieve good performance. Single field PKs (no composites), but composite business keys as an index with unique constraint, use of appropriate data types, e.g. using appropriate numeric types for numeric values rather than char or similar. Physical attributes of the hardware the RDBMS is running on should also be considered, since the bulk of query time is often disk I/O - but of course don't take this for granted - use a profiler to find out where the time is going.
Depending upon the update/query ratio, materialized views/indexed views can be useful in improving performance for slow running queries. A poor-man's alternative is to use triggers to invoke a procedure that populates the table with a result of a slow-running, infrequently-changed view.
Query optimization is a bit of a black art since it is often database-dependent, but some rules of thumb are given here - Optimizing SQL.
Finally, although possibly outside the intended scope of your question, use a good data access layer in your application, and avoid the temptation to roll your own - there are surely tested and performant implementations available for all major languages. Use of caching at the data access layer, middle tier and application layer can help improve performance considerably.
Use fewer queries whenever possible. Use "JOIN", and group your tables so that a single query gives you your results.
A good example is Modified Preorder Tree Traversal (MPTT), which gets all of a tree node's ancestors, in order, in a single query.
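As an illustration of that single query, here is a minimal MPTT (nested set) sketch using Python's sqlite3 module. The `category` table, its `lft`/`rgt` columns and the sample tree are assumptions for this example. A node's ancestors are exactly the rows whose interval encloses the node's own interval.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        name TEXT PRIMARY KEY,
        lft  INTEGER NOT NULL,
        rgt  INTEGER NOT NULL
    );
    -- root(1,8) contains docs(2,5) and files(6,7); docs contains intro(3,4)
    INSERT INTO category (name, lft, rgt) VALUES
        ('root', 1, 8), ('docs', 2, 5), ('intro', 3, 4), ('files', 6, 7);
""")


def ancestors(name):
    """All ancestors of a node, outermost first, in one self-join."""
    return [row[0] for row in conn.execute("""
        SELECT parent.name
        FROM category AS node
        JOIN category AS parent
          ON parent.lft < node.lft AND parent.rgt > node.rgt
        WHERE node.name = ?
        ORDER BY parent.lft
    """, (name,))]


print(ancestors("intro"))
```

The same lookup on a plain (id, parent_id) table would need either one query per level or a recursive query.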
Take a holistic approach to optimization.
Consider the impact of slow disks, network latency, lack of memory, and server load.