R-tree - Remove algorithm using reinsertion - Scala

I am trying to implement an R-tree in Scala, following the guidelines from the original paper on the R-tree structure. In the section on the deletion algorithm it is stated:
Reinsert all entries of nodes in set Q. Entries from eliminated leaf nodes are reinserted in tree leaves as described in Insert, but entries from higher-level nodes must be placed higher in the tree, so that leaves of their dependent subtrees will be on the same level as leaves of the main tree.
I can't wrap my head around the last part. What is meant by higher-level nodes must be placed higher in the tree? How is that implemented? My idea was to remove the nodes that underflow, add their entries to the set Q, and at the end reinsert those entries using Insert. Is this incorrect, or partially correct but missing something extra? If you can explain using examples as well, that would be great.

Entries must be reinserted at the correct height, or the tree will become invalid: remember that all leaves of an R-tree must be at the same level. An entry removed from an internal node at level L is not a single data record but the root of a whole subtree whose leaves sit L levels further down, so it has to be re-attached at level L again; pushing it through the ordinary leaf-level Insert would leave those leaves deeper than the rest of the tree. So your plan is only partially correct: when collecting entries into Q, also record the level each one came from, reinsert data entries from eliminated leaves with the normal Insert, and reinsert entries from eliminated internal nodes at their original level.
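To make that concrete, here is a minimal sketch in Scala with hypothetical, heavily simplified node types (no bounding boxes, no ChooseSubtree heuristic, no node splitting), only to show the level-aware re-attachment of an orphaned subtree:

    // Hypothetical, simplified types: level 0 = leaf, data entries are plain Ints.
    sealed trait Node { def level: Int }
    case class Leaf(entries: Vector[Int]) extends Node { val level = 0 }
    case class Branch(children: Vector[Node]) extends Node {
      val level: Int = children.head.level + 1
    }

    // Re-attach an orphaned subtree so its leaves stay on the main tree's leaf level:
    // a subtree whose root sits at level L must become a child of a node at level L + 1.
    // Data entries from eliminated *leaf* nodes would instead go through the normal Insert.
    def reinsertSubtree(root: Branch, orphan: Node): Branch =
      if (root.level == orphan.level + 1)
        root.copy(children = root.children :+ orphan) // right height: attach here (splitting omitted)
      else {
        // a real R-tree would use ChooseSubtree (least enlargement); the sketch takes the first child
        val chosen = root.children.head.asInstanceOf[Branch]
        root.copy(children = reinsertSubtree(chosen, orphan) +: root.children.tail)
      }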

Inserting and removing values in an R-tree is quite an expensive operation when you need to keep it optimally balanced for fast window or nearest-neighbour queries, especially in a multi-threaded environment.
A more efficient approach is to use a single writer (actor or thread) that gathers updates in batches, packs a new R-tree instance, and publishes it in a volatile variable for readers.
Here is a comparison of some R-tree implementations that can be used in this way from Scala.
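A minimal sketch of that single-writer / volatile-publish pattern (the RTree type here is only a hypothetical stand-in for whichever implementation you pick) could look like this:

    import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

    // "RTree" is a placeholder; the point is the single-writer / volatile-publish pattern around it.
    final case class RTree(points: Vector[(Double, Double)]) {
      def insertAll(batch: Seq[(Double, Double)]): RTree = RTree(points ++ batch) // pack a fresh instance
    }

    object RTreeIndex {
      @volatile private var current = RTree(Vector.empty)    // readers always see the last published tree
      private val pending = new ConcurrentLinkedQueue[(Double, Double)]()

      def submit(p: (Double, Double)): Unit = pending.add(p) // any thread may enqueue an update
      def snapshot: RTree = current                          // lock-free read for window / nearest queries

      // the single writer drains the queue in batches and publishes a new immutable tree
      private val writer = Executors.newSingleThreadScheduledExecutor()
      writer.scheduleWithFixedDelay(new Runnable {
        def run(): Unit = {
          val batch = Iterator.continually(pending.poll()).takeWhile(_ != null).toVector
          if (batch.nonEmpty) current = current.insertAll(batch)
        }
      }, 0, 10, TimeUnit.MILLISECONDS)
    }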

Related

Elastic Binary Search Tree in HAProxy

I have just been looking at the source of HAProxy to learn how it is implemented, and I came across an interesting data structure called the Elastic Binary Search tree. It seems to be very similar to a binary search tree, but I would like to know what the difference is and the reason behind choosing this data structure for a load balancer.
You'll find the implementation details here: http://1wt.eu/articles/ebtree/
In short, the main difference between a regular binary tree and an ebtree is that in a regular binary tree, you need to allocate intermediary nodes to attach leaves, and in some environments, having to allocate a node in the middle just to insert a leaf is not convenient. With ebtrees, each structure is both a node and a leaf, and thanks to some pointer manipulation, both of them can be used separately. This possibility comes with a number of interesting properties described in the article above, such as O(1) removal, support for duplicate keys, etc.
The benefit of using ebtrees in HAProxy compared to rbtrees is the O(1) removal, which makes ebtrees much faster than rbtrees for the scheduler, where entries are constantly added and removed. And compared to a BST (which was the original design leading to ebtrees), insertion is very fast (no malloc) and removal doesn't require a free().
A new version is under development to save space. It will have the same complexity as rbtrees but with smaller memory usage. This will be useful to store lots of data which are often looked up and rarely removed (e.g. HAProxy's stick tables, caches, ...).

Best PostgreSQL hierarchical tree for both performance and moving nodes from the GUI?

Since I'm using PostgreSQL, there is a module called ltree, which satisfies at least one of my needs: performance (I'm not sure about scalability; some say materialized path trees do not scale well).
Since the application I'm developing is a CMS built entirely around a big tree (nodes, subtrees, etc.), performance in querying these nodes is absolutely essential. But since it is a large hierarchical tree (as it grows) that users work on and manipulate from the GUI (CRUD), I also want to make it possible for users to drag and drop to reorder nodes, subtrees, etc., while updating the tree (child records) in the database correctly.
As I understand it, moving and reordering nodes/subtrees in a tree is not really what ltree/materialized path trees are good for, so what I hope you can help me with is to either point me to the tree structure model that is best for both performance AND moving subtrees and nodes, or, if ltree is indeed not a leftover from the past but still worth using, explain how you could achieve this with PostgreSQL's ltree module. And why/why not use ltree in this case?
Requirements:
- Query performance is of course my top priority (all nodes, subtrees, leaves).
- The tree should support deep nesting and sorting.
- And of course the tree should have support for growing large and scaling big.
- I can live with a little waiting time while reordering from the GUI, if one "jack-of-all-trades" tree implementation doesn't exist or is too complex to be worth it.
I'm also considering Closure tables aka Bridge tables (a lot!), Nested Intervals (I'm not sure I understand exactly how to implement them, and no good examples or gists currently exist?) or B-tree models. I'm just not quite sure yet how these will satisfy my above 4 requirements. Reorganizing subtrees and nodes in nested intervals seems straightforward and performance seems good, but it's quite hard to choose the right one to go with.
Since I definitely need performance (query/read performance), scalability and sorting, I kind of thought that Closure tables WITH sort order could be very close, but I just can't imagine how big the closure tables and disk-space overhead will become as my tree and nodes grow large. Closure tables and scalability is what I'm not too sure of. Am I wrong in worrying about this, and what might the best solution for this task be?
The typical data structures used to index trees stored in SQL are designed and optimized for read performance on sets that don't change often.
As an example, if you're using the nested set model, adding or deleting a node would involve updating the entire tree (which typically means rewriting the entire table): great for reads, not so great for writes.
When write performance is important for you, you'll usually be better off working on the raw (id, parent_id) tuples with recursive queries, while setting tree indexes you know for sure are dirty to null as you go. In those areas of the app where read-performance is more important, do a sanity check by checking for null values in the tree index, and re-index the tree as needed before actually using it. That way, you'll avoid incessant rewrites of your tree, and instead re-index it only when needed for a read.
An alternative, albeit (much) more difficult, approach is to use a variation of e.g. nested sets or nested intervals, but using reals or floats instead of integers. This allows you to insert, move and delete nodes for free, at the cost of some storage and arithmetic/read overhead and the loss of some properties such as child node counts in the case of nested sets. However, it also requires that you keep an eye out for pathological edge cases: you'll need to periodically, and sometimes pre-emptively, "garbage collect" and re-index large enough chunks of the tree's index in order to fit new nodes when you run into the floating point type's precision limits.
(A variation of the latter is to use a numeric without any precision in order to try to dodge the problem. But that is really just kicking the can down the road, in the sense that you'll still be limited by Postgres internals to a few thousand digits of precision. And in my own tests from a few years back, the storage and arithmetic overheads became material long before running into that limit, compared to just using a floating point type.)
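As a quick, database-agnostic illustration of that precision limit, repeatedly inserting a new key between the same two neighbours halves the gap each time, so an IEEE-754 double runs out of distinct midpoints after roughly fifty insertions:

    // Each new node gets the midpoint between its neighbours' keys; keep inserting
    // "just after" the previous node and count how long the keys stay distinct.
    var lo = 0.0
    val hi = 1.0
    var inserts = 0
    var key = (lo + hi) / 2.0
    while (key > lo && key < hi) {  // stops once the midpoint collides with a neighbour
      inserts += 1
      lo = key
      key = (lo + hi) / 2.0
    }
    println(s"distinct keys exhausted after $inserts insertions")  // a little over 50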
As for a "The Best" structure or approach, there really is no magic bullet... Each has pros and cons based on the use-case (frequency of reads vs writes) and the size of the set. There's plenty of literature on the web that compare and explain each of them, which I'm sure you've found already.
That being said, for a CMS I'd advise that you go with whichever method you're most comfortable with. Either re-index the tree on the fly as writes occur, or mark the tree as dirty on writes and then re-index it on demand. The point here is that, if re-indexing is done right (= using a plpgsql function or equivalent, rather than a gazillion queries issued by your app), re-indexing an entire tree of a few hundred thousand nodes will take a few hundred milliseconds at most. Assuming the tree isn't constantly getting updated, that's a perfectly acceptable overhead for end-users.

PostgreSQL ltree vs. tree module vs. integer/string arrays or string-delimited path

As you may know, there's a module for PostgreSQL called ltree. You also have the possibility of using the Array type for integers (*1, see comment below), which in this test actually performs a little slower than ltree with its recursive queries, except for the string indexing (*2, see comment below).
I'm not too sure about the credibility of these test results, though.
My biggest question here is actually about the relatively unknown, and almost undocumented, tree module, described here (where the documentation can also be found!) as:
support for hierachical data types (sort of lexicographical trees), should go to contrib/tree, pending because of lack of proper documentation.
After reading through the documentation, I'm a little confused as to whether I should base my big application (a CMS where everything will be stored in a hierarchical tree structure, not only content but also files etc., so you can see this quickly scales up) around ltree, a normal Materialized Path (Path Enumeration) with a delimited string or integer array as the path, or the relatively unknown "tree" module, if in theory it should be the faster-performing, more scalable and better solution of the two.
I've already analysed the different tree structure models, and since query performance, scalability and reordering of nodes and subtrees are my main requirements, I've been able to rule out Adjacency Lists (a recursive CTE will not solve performance as the tree scales huge), Nested Sets/Intervals (not fast enough in some queries, considering their disadvantages when manipulating the tree), Closure Tables (terrible at scaling big in complex trees, not useful for such a large project as mine) etc., and decided to go with the Materialized Path, which is super fast for read operations and makes it easy to move subtrees and nodes around the hierarchy. So the question is only about the best of the proposed implementations for the Materialized Path.
I'm especially curious to hear your theories or experiences with "tree" in PostgreSQL.
As far as I have read, contrib/tree was never officially released, whereas ltree was merged into PostgreSQL's core.
I understand both use the same idea of a labeled path, but tree only allowed integer labels, whereas ltree allows text labels that permit full-text searches, though the full path length is limited (65 kB max, 2 kB preferred).

scala/akka/stm design for large shared state?

I am new to Scala and Akka and am considering using it to solve a problem. Suppose I have a calculation engine (that searches for a solution). I'd like to parallelize that search both across cpus and across nodes by giving each cpu on each node its own engine instance.
The engine inputs consist of a small number of scalar inputs and a very large hash table. Each engine instance would use its scalar inputs to make some small local change to the hash table, calculate a goodness, then discard its changes (they do not need to be committed/seen by any other engine instance). The goodness value would be returned to some coordinator that would choose among the results.
I was reading some about the STM TransactionalMap as a vehicle for shared state. This seems ideal, but I don't really see any complete examples using it as shared state.
Questions:
Does the actor/stm model seem right for this problem?
Can you show a specific example of how to distribute the shared state? (Is it Ref[TransactionalMap[,]] passed as a message?)
Is there anything different about distributing the shared state within a node as opposed to across different nodes?
Inquiring Minds Want to Know,
Allan
In terms of handling shared memory it doesn't sound like STM would be the right fit here because you don't want the changes made in engine instances to commit to the shared copy of the hash table.
Instead, an immutable HashMap might be a better fit. The elements that do not change in the map can be shared by the engine instances with only the differences in each map taking additional memory space.
The actor model would fit what you want to do very well. Set up one actor for each engine instance you want and pass it a message with the scalar values and the hash map. Then have it return the results to the coordinator.
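A minimal Akka (classic actors) sketch of that shape, with hypothetical message and scoring types standing in for the real engine, might look like this:

    import akka.actor.{Actor, ActorRef, ActorSystem, Props}

    // Hypothetical message types; the large hash table travels as an immutable Map,
    // so each engine's local "change" is just a new map that shares structure with
    // the original and is simply discarded after scoring.
    final case class Evaluate(scalars: Vector[Double], table: Map[String, Double])
    final case class Goodness(value: Double)

    class EngineActor extends Actor {
      def receive: Receive = {
        case Evaluate(scalars, table) =>
          // small local change that never touches the shared copy
          val localView = table.updated("tweak", scalars.sum)
          val goodness  = localView.values.sum          // stand-in for the real scoring
          sender() ! Goodness(goodness)
      }
    }

    class Coordinator(engines: Seq[ActorRef], jobs: Seq[Evaluate]) extends Actor {
      private var best = Double.MinValue
      engines.zip(jobs).foreach { case (engine, job) => engine ! job }

      def receive: Receive = {
        case Goodness(v) => best = math.max(best, v)    // choose among the results
      }
    }

    object Search extends App {
      val system  = ActorSystem("search")
      val table   = Map("a" -> 1.0, "b" -> 2.0)         // the big shared hash table
      val engines = (1 to 4).map(_ => system.actorOf(Props[EngineActor]()))
      val jobs    = engines.indices.map(i => Evaluate(Vector(i.toDouble), table))
      system.actorOf(Props(new Coordinator(engines, jobs)))
    }

Distributing across nodes is the same message shape; the practical difference is that the table must then be serialized to each remote engine instead of being shared in-process.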

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with the Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)
There are a few things you might want to try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
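For what it's worth, here is a minimal sketch of the sizeHint idea: build the merged set through a builder that is told the final size up front, so the hash table is allocated once instead of being resized (and every element rehashed) several times along the way:

    import scala.collection.mutable

    // Merge two sets via a builder with a size hint (upper bound; overlap only wastes a little space).
    def merge[A](a: Set[A], b: Set[A]): mutable.HashSet[A] = {
      val builder = mutable.HashSet.newBuilder[A]
      builder.sizeHint(a.size + b.size)
      builder ++= a
      builder ++= b
      builder.result()
    }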
Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here's a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).
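Here is a minimal sketch of that first strategy (not the exact code referred to above): once both inputs are sorted arrays, a single two-pointer pass produces the union without any hashing, and it can be parallelized by partitioning one array and binary-searching the matching boundaries in the other:

    // Two-pointer union of two pre-sorted arrays, emitting common elements once.
    def sortedUnion(a: Array[Int], b: Array[Int]): Array[Int] = {
      val out = Array.newBuilder[Int]
      var i = 0
      var j = 0
      while (i < a.length && j < b.length) {
        if (a(i) < b(j))      { out += a(i); i += 1 }
        else if (a(i) > b(j)) { out += b(j); j += 1 }
        else                  { out += a(i); i += 1; j += 1 } // present in both, emit once
      }
      while (i < a.length) { out += a(i); i += 1 }
      while (j < b.length) { out += b(j); j += 1 }
      out.result()
    }

    // usage: sort the two sets once, then merge
    // val merged = sortedUnion(setOne.toArray.sorted, setTwo.toArray.sorted)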