PostgreSQL ltree vs tree module vs integer/string arrays or delimited string path

As you may know, there's a module for PostgreSQL called ltree. You also have the option of using an array type for integers (*1, see comment below), which in this test actually turns out to perform a little slower with its recursive queries than ltree, except for the string indexing (*2, see comment below).
I'm not too sure about the credibility of these test results, though.
My biggest question here is actually about the relatively unknown, and almost undocumented, tree module. It is described here (where the documentation can also be found!) as:
support for hierachical data types (sort of lexicographical trees),
should go to contrib/tree, pending because of lack of proper
documentation.
After reading through the documentation I'm a little confused as to whether I should base my big application (a CMS, where everything will be stored in a hierarchical tree structure - not only content but also files etc., so you can see this quickly scales up) around ltree, around a normal Materialized Path (Path Enumeration) with a delimited string or integer array as the path - or whether the relatively unknown "tree" module should in theory be the faster-performing, more scalable and overall better solution.
I've already analysed the different tree structure models, and since query performance, scalability and reordering of nodes and subtrees are my main requirements, I've been able to rule out Adjacency Lists (recursive CTEs will not keep performance up as the tree grows huge), Nested Sets/Intervals (not fast enough for some queries, considering their disadvantages when manipulating the tree), Closure Tables (terrible at scaling big in complex trees - not useful for such a large project as mine) etc. and decided to go with the Materialized Path, which is super fast for read operations and makes it easy to move subtrees and nodes around the hierarchy. So the question is only about the best of the proposed implementations of Materialized Path.
I'm especially curious to hear your theories or experiences with "tree" in PostgreSQL.

As far as I've read, contrib/tree was never officially released, whereas ltree ships with PostgreSQL as a contrib module.
I understand both use the same idea of a labeled path, but tree only allowed integer labels, while ltree allows text labels that permit full-text-style searches, though the full path length is limited (65 kB max, 2 kB preferred).
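For reference, a minimal sketch of the ltree-based materialized path approach might look like this (table, index and label names are made up, not from the original post):

CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE node (
    id   serial PRIMARY KEY,
    path ltree NOT NULL
);

CREATE INDEX node_path_gist  ON node USING GIST (path);   -- for <@, @>, ~ operators
CREATE INDEX node_path_btree ON node USING BTREE (path);  -- for equality and ordering

INSERT INTO node (path) VALUES
    ('root'),
    ('root.docs'),
    ('root.docs.intro'),
    ('root.files');

-- whole subtree under root.docs
SELECT * FROM node WHERE path <@ 'root.docs';

-- direct children of root (lquery: exactly one label below root)
SELECT * FROM node WHERE path ~ 'root.*{1}';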

Related

Scala: Lensing vs mutable design

My basic understanding of lensing is that "a lens is a value representing maps between a complex type and one of its constituents. This map works both ways - we can get or "access" the constituent and set or "mutate" it."
I came across this when I was designing a machine learning library (neural nets), which requires keeping a big data structure of parameters, groups of which need to be updated at different stages of the algorithm. I wanted to make the whole parameter data structure immutable, but changing a single group of parameters requires copying all the parameters and recreating a new data structure, which sounds inefficient. Not surprisingly, other people have thought about this too. Some people suggest using lensing, which in a sense lets you modify immutable data structures, while others suggested just using mutables for this. Unfortunately I couldn't find anything comparing these two paradigms speed-wise, space-wise, code-complexity-wise etc.
Now the question is: what are the pros/cons of using lensing vs a mutable design?
The trade-offs between the two are pretty much as you surmised. Lenses are less complex than tracking the changes to a large immutable data structure manually, but they still require more complex code than a mutable data structure, and there is some amount of runtime overhead. To know how much, you would have to measure, but it's probably less than you think, because a lot of the updated structure isn't copied but shared.
Mutable data structures are simpler and somewhat faster to modify, but harder to reason about, because now you have to take into account the order in which functions are called, worry about concurrency, and so forth.
Your third option is to make a bunch of small immutable data structures instead of one big one. Mutability often forces a single large data structure because of the need for a single source of truth, and to ensure that all references to data change at the same time. With immutability, this is a lot easier to control.
For example, you can have two separate Maps with the same type of key and different types of simple values, instead of one Map with a more complex value. Not only does this have performance benefits, it also makes it much easier to modularize your code.

Elastic Binary Search Tree in HAProxy

I was just looking at the source of HAProxy to learn how it is implemented, and I came across an interesting data structure called an Elastic Binary Search tree. It seems to be very similar to a binary search tree, but I would like to know what the difference is, and the reason behind choosing this data structure for a load balancer.
You'll find the implementation details here: http://1wt.eu/articles/ebtree/
In short, the main difference between a regular binary tree and an ebtree is that in a regular binary tree you need to allocate intermediary nodes to attach leaves, and in some environments having to allocate a node in the middle just to insert a leaf is not convenient. With ebtrees, each structure is both a node and a leaf, and thanks to some pointer manipulation both of them can be used separately. This possibility comes with a number of interesting properties described in the article above, such as O(1) removal, support for duplicate keys, etc.
The benefit of using ebtrees in haproxy compared to rbtrees is the O(1) removal, which makes ebtrees much faster than rbtrees for the scheduler, where entries are constantly added/removed. And compared to a BST (which was the original design leading to ebtrees), insertion is very fast (no malloc) and removal doesn't require a free().
A new version is under development to save space. It will have the same complexity as rbtrees but with smaller memory usage. This will be useful to store lots of data which are often looked up and rarely removed (eg: haproxy's stick tables, caches, ...).

Best PostgreSQL hierarchical tree for both performance and moving nodes from a GUI?

Since I'm using PostgreSQL there is a module called ltree, which satisfies at least one of my needs, performance (I don't know about scalability - some say materialized path trees don't scale well...).
Since the application I'm developing is a CMS built entirely around a big tree of nodes, subtrees etc., performance in querying these nodes is absolutely essential, but since it's a large (as it grows) hierarchical tree you're working on and manipulating from the GUI (CRUD), I also want to make it possible for users to drag and drop to reorder nodes, subtrees etc. while updating the tree (child records) in the database correctly.
As I understand it, moving and reordering nodes/subtrees in a tree is not really what ltree/materialized path trees are good at, so what I hope you can help me with is to either point me to the tree structure model that is best for performance AND moving subtrees and nodes, or, if ltree is indeed not a leftover from the past but still worth using, to explain how you could achieve this with PostgreSQL's ltree module. And why/why not use ltree in this case?
Requirements:
Query performance is of course my top priority (all nodes, subtrees, leaves).
The tree should support deep nesting and sorting.
And of course the tree should have support for growing large and scaling well.
I can live with a little waiting time while reordering from the GUI, if a single "jack-of-all-trades" tree implementation doesn't exist or is too complex to be worth it.
I'm also considering Closure Tables aka Bridge Tables (a lot!), Nested Intervals (not sure I understand exactly how to implement them, and no good examples or gists currently exist?) or B-tree models. I'm just not quite sure yet how these will satisfy my four requirements above. Re-organizing subtrees and nodes in nested intervals seems straightforward, and performance seems good... It's quite hard to choose the right one to go with.
Since I definitely need performance (query/read performance), scalability and sorting, I kind of thought that Closure Tables WITH a sort order could come very close, but I just can't imagine how big the closure tables and the disk-space overhead will become as my tree and nodes grow large. Closure tables and scalability, I'm just not too sure of. Am I wrong in worrying about this, and what might the best solution for this task be?
The typical data structures used to index trees stored in SQL are designed and optimized for read performance on sets that don't change often.
As an example, if you're using the nested set model, adding or deleting a node would involve updating the entire tree (which typically means rewriting the entire table): great for reads, not so great for writes.
When write performance is important to you, you'll usually be better off working on the raw (id, parent_id) tuples with recursive queries, while nulling out the tree-index values you know for sure are dirty as you go. In those areas of the app where read performance is more important, do a sanity check by checking for null values in the tree index, and re-index the tree as needed before actually using it. That way you'll avoid incessant rewrites of your tree, and instead re-index it only when needed for a read.
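A minimal sketch of that dirty-flag approach, assuming a simple (id, parent_id) table with a cached text path (all names and ids here are hypothetical):

-- Raw adjacency list plus a cached path column; NULL path = dirty.
CREATE TABLE node (
    id        serial PRIMARY KEY,
    parent_id integer REFERENCES node (id),
    name      text NOT NULL,
    path      text              -- cached tree index, nulled out on writes
);

-- On writes: move the subtree cheaply and mark the cached index dirty.
-- (Descendants' cached paths are now stale too; the rebuild below fixes them.)
UPDATE node SET parent_id = 7, path = NULL WHERE id = 42;

-- Before a read that needs the index: rebuild it with a recursive query.
WITH RECURSIVE t (id, path) AS (
    SELECT id, id::text FROM node WHERE parent_id IS NULL
    UNION ALL
    SELECT n.id, t.path || '.' || n.id::text
    FROM node n JOIN t ON n.parent_id = t.id
)
UPDATE node SET path = t.path
FROM t
WHERE node.id = t.id AND node.path IS DISTINCT FROM t.path;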
An alternative, albeit (much) more difficult, approach is to use a variation of e.g. nested sets or nested intervals, but using reals or floats instead of integers. This allows you to insert, move and delete nodes for free, at the cost of some storage and arithmetic/read overhead and the loss of some properties, such as child node counts in the case of nested sets. However, it also requires that you keep an eye out for pathological edge cases: you'll need to periodically - and sometimes preemptively - "garbage collect" and re-index large enough chunks of the tree's index in order to fit new nodes when you run into the floating point type's precision limits.
(A variation of the latter is to use a numeric without any precision in order to try to dodge the problem. But that's really just kicking the can down the road, in the sense that you'll still be limited by Postgres internals to a few thousand digits of precision. And in my own tests from a few years back, the storage and arithmetic overheads became material compared to just using a floating point type long before you run into that limit.)
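A rough sketch of the floating point variant: each new node is squeezed into the free gap between its parent's rightmost descendant and the parent's own right bound, so no existing rows have to be rewritten (table and ids are invented for illustration):

CREATE TABLE ns_node (
    id  serial PRIMARY KEY,
    lft double precision NOT NULL,
    rgt double precision NOT NULL
);

-- Root spans an arbitrary interval.
INSERT INTO ns_node (lft, rgt) VALUES (0.0, 1.0);

-- Append a new last child under parent id = 1 without touching any other row:
-- split the free gap (rightmost descendant's rgt .. parent's rgt) into thirds.
INSERT INTO ns_node (lft, rgt)
SELECT lo + (hi - lo) / 3.0,
       lo + 2.0 * (hi - lo) / 3.0
FROM (
    SELECT COALESCE(MAX(c.rgt), p.lft) AS lo, p.rgt AS hi
    FROM ns_node p
    LEFT JOIN ns_node c ON c.lft > p.lft AND c.rgt < p.rgt   -- descendants of p
    WHERE p.id = 1
    GROUP BY p.lft, p.rgt
) AS gap;

Each append shrinks the remaining gap, which is exactly why the periodic re-indexing mentioned above becomes necessary once you approach the float's precision.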
As for a "The Best" structure or approach, there really is no magic bullet... Each has pros and cons based on the use-case (frequency of reads vs writes) and the size of the set. There's plenty of literature on the web that compare and explain each of them, which I'm sure you've found already.
That being said, for a CMS I'd advise that you go with whichever method you're most comfortable with. Either re-index the tree on the fly as writes occur, or mark the tree as dirty on writes and then re-indexing it on demand. The point here is that, if re-indexing is done right (= using a plpgsql function or equivalent, rather than a gazillion queries issued by your app), re-indexing an entire tree of a few hundred thousand nodes will a few hundred milliseconds at most. Assuming the tree isn't constantly getting updated, that's a perfectly acceptable overhead for end-users.
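To illustrate the "one plpgsql function rather than a gazillion app-issued queries" point, the recursive rebuild from the earlier sketch could be wrapped server-side roughly like this (function and table names are assumptions):

CREATE OR REPLACE FUNCTION reindex_tree() RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    -- Same recursive rebuild as in the earlier sketch, run in a single server-side call.
    WITH RECURSIVE t (id, path) AS (
        SELECT id, id::text FROM node WHERE parent_id IS NULL
        UNION ALL
        SELECT n.id, t.path || '.' || n.id::text
        FROM node n JOIN t ON n.parent_id = t.id
    )
    UPDATE node SET path = t.path
    FROM t
    WHERE node.id = t.id AND node.path IS DISTINCT FROM t.path;
END;
$$;

-- SELECT reindex_tree();  -- call on demand, or whenever dirty rows are detected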

When to use composite types and arrays and when to normalize a database?

Is there any guideline on when to normalize a database or just use composite types and arrays?
When using arrays and composite types, I can use just a single table. I can also normalize the database and use a couple of tables and joins.
How do you decide which option is best?
Most of the time, stick to normalization. Among other things, keeping your database fairly well normalized helps with lock granularity. For example, if you have a "parent" object with two arrays in it, you cannot have transactions that are simultaneously adding/updating/modifying members of the arrays. If they're regular side tables, you can. (You can still SELECT ... FOR UPDATE the parent row before updating child objects if you want the serialized behaviour, though).
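As a rough illustration of the lock-granularity point (schema and values are made up): with a child table, two transactions can update different children of the same parent concurrently, and you can still opt into serialization by locking the parent row first.

CREATE TABLE parent (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE child (
    id        serial PRIMARY KEY,
    parent_id integer NOT NULL REFERENCES parent (id),
    qty       integer NOT NULL
);

BEGIN;
-- Optional: row-lock the parent to serialize concurrent writers on this parent only.
SELECT id FROM parent WHERE id = 1 FOR UPDATE;
UPDATE child SET qty = qty + 1 WHERE parent_id = 1 AND id = 10;
COMMIT;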
Updating an array to add/replace/delete a value is expensive, as PostgreSQL must rewrite the whole tuple the array is in as an MVCC update. (It has a few TOAST tricks up its sleeve that can help, but not tons). Ditto composite types embedded in rows.
Big wide rows full of arrays and composites mean slower table scans, meaning slower fetches for commonly used values.
IIRC you can't define a foreign key into a field of a composite type, so you'll find yourself working around that or giving up on referential integrity where it'd be good to have. Ditto arrays (there was work to get foreign keys to arrays working, but I don't think it ever got committed).
Many client drivers (PgJDBC, psqlODBC, psycopg2, etc etc etc) have incomplete to nonexistent support for arrays and composites, so you'll often land up expanding them into tuples for client driver interaction anyway. Some things, like arrays of composite types, are really quite painful to work with.
Most ORMs, including common ones like Hibernate, totally suck at using anything beyond the most utterly simplistic lowest-common-denominator SQL features. Sooner or later, someone's going to want to point one of those at your data model, at which point much wailing and gnashing of teeth will ensue. OTOH, don't accommodate garbage ORMs to the point where you avoid using features that'll greatly improve the data model and solve real world problems - for example, if you have the choice of storing native hstore fields or using an EAV schema, consider just using hstore (or better, in 9.4, jsonb).
(Perversely, this means that people who have the most "object oriented" programs often have the most purely relational databases because their tools suck).
Things like report generation tools will similarly struggle with composites and arrays, so you'll often land up creating views to present a normalized appearance for the DB anyway. Then ON INSERT OR UPDATE OR DELETE ... DO INSTEAD triggers on the views to enable writes. At which point it gets ugly.
Personally I recommend keeping composites for times when it's logical to model something as a "type". Consider, say, if your data model required you to track timestamps in their original time zone. There's no built-in type for this (no, that's not what "timestamp with time zone" does, despite the name, thanks SQL committee), so you might create a composite type that stored (timestamp without time zone, tzname) and use that consistently in your data model.
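A sketch of that kind of type (the type, table and values are invented for illustration):

-- Composite type: a local timestamp plus the zone it was recorded in.
CREATE TYPE ts_with_origin_tz AS (
    ts     timestamp without time zone,
    tzname text
);

CREATE TABLE event (
    id       serial PRIMARY KEY,
    occurred ts_with_origin_tz
);

INSERT INTO event (occurred)
VALUES (ROW('2014-07-01 09:30', 'Europe/Oslo')::ts_with_origin_tz);

-- Reconstruct the absolute instant only when you need it.
SELECT (occurred).ts AT TIME ZONE (occurred).tzname AS absolute_time
FROM event;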
Similarly, I tend to use arrays in queries a lot, but not in the data model much. They're useful when you want to intentionally denormalize something for performance, but that's often done in a materialized view or similar. Even if it's a change to the main data model, it's the sort of thing you should be doing based on proper performance review, not just "optimizing" stuff you don't know is slow yet.
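For instance, a query-side denormalization might look like this (table and column names are hypothetical):

-- Arrays built in a materialized view, not stored in the base tables.
CREATE MATERIALIZED VIEW order_summary AS
SELECT o.id,
       array_agg(i.product_id ORDER BY i.line_no) AS product_ids
FROM orders o
JOIN order_items i ON i.order_id = o.id
GROUP BY o.id;

-- REFRESH MATERIALIZED VIEW order_summary;  -- after bulk changes to the base tables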

Database Optimization techniques for amateurs

Can we get a list of basic optimization techniques going (anything from modeling to querying, creating indexes and views, to query optimization)? It would be nice to have a list of these, one technique per answer. As a hobbyist I would find this very useful, thanks.
And for the sake of not being too vague, let's say we are using a mainstream DB such as MySQL or Oracle, and that the DB will contain 500,000-1m or so records across ~10 tables, some with foreign key constraints, all using the most typical storage engines (e.g. InnoDB for MySQL). And of course, the basics such as PKs are defined, as well as FK constraints.
Learn about indexes, and use them properly. Generally speaking*, follow these guidelines:
Every table should have a clustered index
Fields used for filters and sorts are good candidates for indexing
More selective fields are better candidates for indexing
For best performance on crucial queries, design "covering indexes" for those queries (see the sketch after this list)
Make sure your indexes are actually being used, and remove those that aren't
If your table has 15 fields, and you make 15 indexes, each with only a single field, you're doing it wrong :)
*There are some exceptions to these rules if you know what you're doing. My experience is with Microsoft SQL Server, but I would presume most of this advice still applies to other RDBMSs.
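For example, a covering index for one specific hot query might look roughly like this (table and column names are made up, and the exact syntax varies a little between RDBMSs):

-- The index contains every column the query touches, so the table itself
-- never has to be read (a "covering" index / index-only scan).
-- Some engines (SQL Server, PostgreSQL 11+) also offer INCLUDE (...) for non-key columns.
CREATE INDEX idx_orders_customer_date_total
    ON orders (customer_id, order_date, total_amount);

SELECT order_date, total_amount
FROM orders
WHERE customer_id = 42
ORDER BY order_date DESC;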
IMO, by far the best optimization is to have the data model fit the problem domain for which it was built. When it does not, the resulting symptom is difficult-to-write or convoluted queries to get the information desired, and that typically rears its head when reports are built against the database. Thus, in designing a database it helps to have an idea of the types and nature of the information, such as reports, that the users will want from the system.
When talking database design, check out database normalization, e.g. the Wikipedia article on normal forms.
If you have a good design and you still need to optimize for performance, try denormalisation.
If you have specific needs which are not covered efficiently by the relational model, look at the other models covered by the term NoSQL.
Some query/schema optimizations:
Be mindful when using DISTINCT or GROUP BY. I find that many new developers will use DISTINCT in places where it really is not needed or where the query could be rewritten more efficiently using an EXISTS statement or a derived query (see the sketch after this list).
Be mindful of Left Joins. All too often I find new SQL developers will ignore the schema in place and use Left Joins where they really are not necessary. For example:
Select *
From Orders
Left Join Customers
On Customers.Id = Orders.CustomerId
If Orders.CustomerId is a required (non-nullable) column referencing Customers, then an inner join returns the same rows and it is not necessary to use a left join.
Be a student of new features. Currently, MySQL does not support common-table expressions which means that some types of queries are cumbersome and probably slower to write than they would be if CTEs were supported. However, that will not be true forever. Keep up on new syntax features in MySQL which might be used to make existing queries more efficient.
You do not have to use surrogate keys everywhere. There might be tables better suited to an intelligent key (e.g. US State abbreviations, Currency Codes etc) which would enable developers to avoid additional joins in many cases.
If possible, find ways of archiving data to an OLAP or reporting server. The smaller you can make the production data, the faster it will run.
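As a sketch of the DISTINCT point above (invented tables, in the same style as the join example): when a join exists only to filter rows, EXISTS usually reads better and performs better than DISTINCT over the join.

-- Often seen: DISTINCT papering over row duplication caused by the join.
Select Distinct Customers.Id
From Customers
Join Orders
On Orders.CustomerId = Customers.Id

-- Usually better: the join was only a filter, so express it as EXISTS.
Select Customers.Id
From Customers
Where Exists (
    Select 1
    From Orders
    Where Orders.CustomerId = Customers.Id
)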
A design that concisely models your problem is always a good start. Overgeneralizing the data model can lead to performance problems. For example, I've heard reports of projects striving for uber-flexibility that use the RDBMS as a dumb "name/value" store - and resulting performance was appalling.
Once a good design is in place, use the tools provided by the RDBMS to help it achieve good performance: single-field PKs (no composites) with the composite business key as a unique index, appropriate data types (e.g. proper numeric types for numeric values rather than char or similar), and so on. The physical attributes of the hardware the RDBMS is running on should also be considered, since the bulk of query time is often disk I/O - but of course don't take this for granted - use a profiler to find out where the time is going.
Depending upon the update/query ratio, materialized views/indexed views can be useful in improving performance for slow-running queries. A poor man's alternative is to use triggers to invoke a procedure that populates a table with the result of a slow-running, infrequently-changed view.
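For example, in PostgreSQL-style syntax (table and view names invented):

-- Precompute a slow, infrequently-changing aggregate once instead of on every query.
CREATE MATERIALIZED VIEW daily_sales AS
SELECT order_date, SUM(total) AS revenue
FROM orders
GROUP BY order_date;

-- Refresh on a schedule, or from a trigger/procedure when the source data changes.
REFRESH MATERIALIZED VIEW daily_sales;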
Query optimization is a bit of a black art since it is often database-dependent, but some rules of thumb are given here - Optimizing SQL.
Finally, although possibly outside the intended scope of your question, use a good data access layer in your application, and avoid the temptation to roll your own - there are surely tested and performant implementations available for all major languages. Use of caching at the data access layer, middle tier and application layer can help improve performance considerably.
Use fewer queries whenever possible. Use JOINs and group your tables so that a single query gives you your results.
A good example is the Modified Preorder Tree Traversal (MPTT) technique for getting all of a tree node's ancestors, ordered, in a single query.
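A sketch of that single-query ancestor lookup under MPTT, assuming a hypothetical category table with lft/rgt columns:

-- Every ancestor's (lft, rgt) interval encloses the node's own interval.
SELECT parent.*
FROM category AS node
JOIN category AS parent
    ON node.lft BETWEEN parent.lft AND parent.rgt
WHERE node.id = 42
  AND parent.id <> node.id
ORDER BY parent.lft;   -- root first, immediate parent last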
Take a holistic approach to optimization.
Consider the impact of slow disks, network latency, lack of memory, and server load.