Evaluate every possible tree constructible from preorder and postorder traversals

It's clear that we can't construct a unique binary search tree from its preorder and postorder traversals alone; those two traversals determine a unique tree only when it is a full binary tree. However, I'm curious whether there is any way to find the number of possible trees.
For example, I can draw 4 possible trees for the following input: preorder [5,3,2,1,4,7,6,8,9] and postorder [1,2,4,3,6,9,8,7,5]. I'm looking for a faster way than drawing them out every time.
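For what it's worth, the count can be computed directly. The traversal pair is ambiguous exactly at nodes that have a single child, since that child can hang to the left or to the right, and each such node independently doubles the count, so the answer is 2^k for k one-child nodes. A node X has exactly one child Y precisely when Y immediately follows X in preorder and immediately precedes X in postorder. A minimal sketch (Scala; it assumes the two traversals do describe at least one valid binary tree):

def countTrees(pre: Array[Int], post: Array[Int]): Long = {
  val postIdx = post.zipWithIndex.toMap // value -> position in postorder
  val oneChildNodes = pre.sliding(2).count {
    case Array(x, y) => postIdx(y) + 1 == postIdx(x) // y is x's only child
    case _           => false
  }
  1L << oneChildNodes // each one-child node doubles the count
}

// countTrees(Array(5,3,2,1,4,7,6,8,9), Array(1,2,4,3,6,9,8,7,5)) == 4

For the example above it finds two such nodes (2 over 1, and 8 over 9), giving 2^2 = 4 trees, matching the hand count.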

Keeping track of moves when pre-processing graphs

I am writing an algorithm in MATLAB to pre-process a large graph for use with a path-finding algorithm, and I am curious about the best way to keep track of my moves so that I can reconstruct the solution and project it onto the original graph.
The pre-processing methods I am using so far are relatively simple; the 3 techniques are:
1) Remove long edges:
Any edge (a,b) that is longer than some two-edge path through a third vertex, i.e. len(a,b) > len(a,c) + len(c,b), is removed.
2) Remove vertices with degree 1:
If a vertex with only one edge coming out of it is neither the start nor the end point of the path, then that vertex will never be part of the path, and it can be removed.
3) Remove vertices with degree 2:
If a vertex b has exactly two edges coming out of it, then b can be removed and the edges (a,b) and (b,c) can be replaced by a single edge (a,c) with length len(a,b) + len(b,c).
The algorithm iterates through these 3 techniques until no further changes are possible in the graph, at which point it removes all the empty rows and columns in the graph adjacency matrix and returns the reduced graph for use with the path-finding algorithm.
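In sketch form, that reduction loop looks roughly like this (Scala used here for concreteness rather than MATLAB; it operates on a symmetric weighted adjacency matrix where inf means "no edge", and all names are illustrative):

val inf = Double.PositiveInfinity

// neighbours of v, i.e. the vertices it has finite-length edges to
def neighbours(g: Array[Array[Double]], v: Int): Seq[Int] =
  g(v).indices.filter(u => u != v && g(v)(u) < inf)

def reduce(g: Array[Array[Double]], start: Int, goal: Int): Unit = {
  val n = g.length
  var changed = true
  while (changed) {
    changed = false
    // 1) remove any edge dominated by a two-edge path through a third vertex
    for (a <- 0 until n; b <- 0 until n if b != a; c <- 0 until n if c != a && c != b)
      if (g(a)(b) < inf && g(a)(b) > g(a)(c) + g(c)(b)) {
        g(a)(b) = inf; g(b)(a) = inf; changed = true
      }
    for (v <- 0 until n if v != start && v != goal) neighbours(g, v) match {
      // 2) degree-1 dead end: drop the vertex
      case Seq(u) =>
        g(v)(u) = inf; g(u)(v) = inf; changed = true
      // 3) degree-2 pass-through: contract it into a single edge (a,c)
      case Seq(a, c) =>
        val len = g(a)(v) + g(v)(c)
        if (len < g(a)(c)) { g(a)(c) = len; g(c)(a) = len }
        g(v)(a) = inf; g(a)(v) = inf; g(v)(c) = inf; g(c)(v) = inf
        changed = true
      case _ => // leave higher-degree vertices alone
    }
  }
}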
The pre-processing algorithm works well: in some cases I achieve a reduction of around 70% in graph size, and my path-finding algorithm finds a path of the same quality as on the unprocessed graph, an order of magnitude faster.
My problem now is in reconstructing the solution on the original graph, so-called "post-processing".
I feel like I should be keeping track of all the moves my pre-processing algorithm makes and then applying them in reverse order after it has finished; I am just not quite sure how I should go about that.
Here is what I had in mind:
First, keep track of all the empty rows and columns I removed from the matrix after pre-processing and re-insert them.
Then have a simple vector whose indices represent the move number and whose values represent the type of move.
Then have one cell array for each of the 3 move types, containing the data from each move in the order they were performed, each with its own iteration counter.
Then, if I iterate backwards over the move list, it will tell me which cell array to access, and I can apply the reverse operation that is next on that list (kind of like a stack data structure).
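For illustration, the whole move log could also live on a single stack of tagged moves rather than a move-type vector plus three separate cell arrays; a sketch (Scala for concreteness, though a MATLAB struct array used as a stack would have the same shape; all names are illustrative):

import scala.collection.mutable

sealed trait Move
case class RemovedLongEdge(a: Int, b: Int, len: Double) extends Move
case class RemovedLeaf(v: Int, neighbour: Int, len: Double) extends Move
case class ContractedVertex(b: Int, a: Int, c: Int, lenAB: Double, lenBC: Double) extends Move

val moves = mutable.Stack.empty[Move]
// during pre-processing, push each move as it is made, e.g.
//   moves.push(ContractedVertex(b, a, c, lenAB, lenBC))

// Splice b back in wherever the found path uses the contracted edge (a,c).
def expandEdge(path: List[Int], a: Int, c: Int, b: Int): List[Int] = path match {
  case x :: y :: rest if (x == a && y == c) || (x == c && y == a) =>
    x :: expandEdge(b :: y :: rest, a, c, b)
  case x :: rest => x :: expandEdge(rest, a, c, b)
  case Nil       => Nil
}

// After path-finding, pop and undo the moves in reverse order.
def postProcess(path0: List[Int]): List[Int] = {
  var path = path0
  while (moves.nonEmpty) moves.pop() match {
    case ContractedVertex(b, a, c, _, _) => path = expandEdge(path, a, c, b)
    case _ => // removed long edges and dead ends were never on the found path
  }
  path
}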
This seems a bit unwieldy to me, so I was wondering if anyone else had any ideas for a good method of keeping track of my moves that is easily reversible?
EDIT: I thought about posting this on the Computer Science Stack Exchange, but my question isn't really about the pre-processing methods themselves; it's about data storage and retrieval and the implementation itself. But feel free to migrate it if you think it would be better suited elsewhere.

Structural Sharing in Scala Vector

Structural sharing in Scala List is straightforward and easy to understand. But Scala Vector is a more complicated data structure than a list. How is structural sharing achieved in Scala Vector?
Vector is basically a tree (trie) with 32-way branching at each level. If you have a Vector of, say, 3000 elements and you want to index element 2069, which is 100000010101 in binary, the index is decomposed into 5-bit blocks that serve as indices into the tree: 10 (i.e. 2) in the first branch, then 00000 (i.e. 0) in the next, and finally 10101 (i.e. 21) in the terminal branch, and there's the data.
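In code, that decomposition is just shifts and masks; an illustrative sketch:

val i = 2069                // binary 100000010101
val top = (i >>> 10) & 31   // 10    -> 2:  branch taken at the root
val mid = (i >>> 5) & 31    // 00000 -> 0:  branch in the middle node
val low = i & 31            // 10101 -> 21: slot in the leaf array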
Given this structure, it's easy to see how to structurally share things: you can share any subtree that isn't changed. So if you make a new vector with a different element 2069, you don't have to change all 3000 elements; you recreate "only" three arrays of size 32: the terminal one is replaced by a copy with its element 21 updated; then its parent is replaced by a copy with this new child at index 0; then its parent is replaced by a copy with the correct subtree at index 2.
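A stripped-down sketch of that path copying (illustrative only, not the actual scala.collection.immutable.Vector internals):

final case class Node(children: Array[AnyRef]) // each slot holds a Node or an element

// Copy only the nodes on the path from the root down to the updated slot.
def updatedTrie(node: Node, index: Int, value: AnyRef, shift: Int): Node = {
  val slot = (index >>> shift) & 31
  val copy = node.children.clone()
  copy(slot) =
    if (shift == 0) value
    else updatedTrie(node.children(slot).asInstanceOf[Node], index, value, shift - 5)
  Node(copy)
}
// For a depth-3 trie, updatedTrie(root, 2069, x, 10) allocates exactly three
// 32-slot arrays; every other node is shared with the original tree.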
Now, this provides quite a lot of structural sharing as long as you have far more than 32 elements in your vector, but it's still a pretty big overhead. Because of this, additions to the end of the vector are special-cased so that you just add to the existing array. The old Vectors still point to that array, but they think the end is earlier (and that part is unchanged) so it works out okay.
There's a more complex but similar scheme to allow addition at the front of a vector in a similar fashion (basically, by leaving space at the front and keeping track of where to point via indices and offsets in addition to the indexing scheme).
The trick as implemented doesn't work for alternating additions to the front and the back, though, so in that case you effectively rebuild the tree on every addition. Making a version with even better structural sharing would be possible, but it would probably be a bit slower to access.

Algorithm generation

I have a rather large (possibly 50+) set of conditions that must be placed on a set of data (or rather, the data should be manipulated to fit the conditions).
For example, suppose I have a sequence of binary digits of length n;
if n = 5, then an element of the data might be {0,1,1,0,0} or {0,0,0,1,1}, etc.
But there might be a set of conditions such as:
x_3 + x_4 = 2
sum(x_even) <= 2
x_2*x_3 = x_4 mod 2
etc...
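For concreteness, conditions of this kind are easy to render as executable predicates; a sketch in Scala (assuming 1-based subscripts, so x_k is x(k - 1)):

def valid(x: Vector[Int]): Boolean =
  x(2) + x(3) == 2 &&                                                   // x_3 + x_4 = 2
  x.zipWithIndex.collect { case (v, i) if i % 2 == 1 => v }.sum <= 2 && // sum(x_even) <= 2
  (x(1) * x(2)) % 2 == x(3) % 2                                         // x_2 * x_3 = x_4 mod 2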
Because the conditions come from experiment (although they can be written down in logical form), are quite complex, and are hard to diagnose, I would instead like to use a large sample set of valid data, i.e., data I know satisfies the conditions, of which I have a pretty large amount. In other words, it is easier to collect the data than it is to deduce the conditions that the data must abide by.
Having said that, what I'm doing is basically very similar to a neural network. The difference is that I would like an actual algorithm, in some sense optimal, in some form of code that I can run instead of the network.
It might not be clear what I'm actually trying to do. What I have is a set of data in some raw format that is unique and unambiguous but not appropriate for my needs (in a sense, the amount of data is too large).
I need to map the data into another set that actually is ambiguous to some degree, but that also has a certain specific set of constraints that all the data follows (certain things just cannot happen while others are preferred).
The unique constraints and preferences are hard to figure out; that is, the mapping from the non-ambiguous set to the ambiguous set is hard to describe (which is why it is ambiguous). The goal, actually, is to arrive at an unambiguous map by supplying the right constraints, if at all possible.
So, in the vein of my initial example, I'm given (or supply) a set of elements and need some way to derive a list of constraints similar to what I've listed.
In a sense, I simply have a set of valid data and train on it, very much like a neural network.
Then, after this "training", I'm given a mapping function that I can use on any element in my dataset, and it will produce a new element satisfying the constraints, or, if it can't, will give as close to an unambiguous result as possible.
The main difference between neural networks and what I'm trying to achieve is that I'd like an algorithm, in code, to be used instead of having to run a neural network. The algorithm would probably be a lot less complex, would not need potential retraining, and would be a lot faster.
Here is a simple example.
Suppose my "training set" are the binary sequences and mappings
01000 => 10000
00001 => 00010
01010 => 10100
00111 => 01110
then from the "Magical Algorithm Finder"(tm) I would get a mapping out like
f(x) = x rol 1 (rol = rotate left)
or whatever way one would want to express it.
Then I could simply apply f(x) to any other element, such as x = 011100 and could apply f to generate a hopefully unambiguous output.
Of course there are many such functions that will work on this example, but the goal is to supply enough of the dataset to narrow it down to, hopefully, a few functions that make the most sense (and that, at the very least, always map the training set correctly).
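As a toy illustration of that narrowing-down, one can brute-force a small hypothesis space against the training pairs and keep only the consistent functions. A sketch in Scala, where the hypothesis space is just the five rotation amounts (purely illustrative):

val training = Seq("01000" -> "10000", "00001" -> "00010",
                   "01010" -> "10100", "00111" -> "01110")

def rol(s: String, k: Int): String = s.drop(k) + s.take(k) // rotate left by k

// Keep every rotation amount that maps all training inputs correctly.
val consistent = (0 until 5).filter { k =>
  training.forall { case (in, out) => rol(in, k) == out }
}
// consistent == Vector(1): only "rotate left by 1" survives the training set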
In my specific case I could easily convert my problem into mapping the set of binary strings of length m to the set of base-B strings of length n. The constraints prevent some numbers from having a preimage; i.e., the mapping is injective but not surjective.
My algorithm could be a simple collection of if statements acting on the digits, if need be.
I think what you are looking for here is an application of Learning Classifier Systems (LCS; see the Wikipedia article). There are actually quite a few open-source LCS implementations available, but you may need to experiment with the parameters in order to get a good result.
LCS/XCS/ZCS have the features that you are looking for, including individual rules that can be heavily optimized, pressure to reduce the rule set, and, of course, a human-readable/understandable set of rules (unlike a neural net).

How is a bitmapped vector trie faster than a plain vector?

It's supposedly faster than a vector, but I don't really understand how locality of reference is supposed to help this (since a vector is by definition the most locally packed data possible -- every element is packed next to the succeeding element, with no extra space between).
Is the benchmark assuming a specific usage pattern or something similar?
How is this possible?
Bitmapped vector tries aren't strictly faster than normal vectors, at least not at everything; it depends on what operation you are considering.
Conventional vectors are faster, for example, at accessing a data element at a specific index. It's hard to beat a straight indexed array lookup. And from a cache locality perspective, big arrays are pretty good if all you are doing is looping over them sequentially.
However, a bitmapped vector trie will be much faster for other operations, thanks to structural sharing. For example, creating a new copy with a single changed element, without affecting the original data structure, is O(log32 n) vs. O(n) for a traditional vector. That's a huge win.
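A quick illustration with Scala's standard immutable Vector (which is a bitmapped vector trie):

val v1 = Vector.tabulate(1000000)(identity) // 0, 1, ..., 999999
val v2 = v1.updated(2069, -1) // copies only the path of nodes down to slot 2069

assert(v1(2069) == 2069) // the original is untouched
assert(v2(2069) == -1)   // the new vector shares everything else with v1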
Here's an excellent video well worth watching on the topic, which includes a lot of the motivation of why you might want these kind of structures in your language: Persistent Data Structures and Managed References (talk by Rich Hickey).
There is a lot of good stuff in the other answers, but nobody answers your question. PersistentVectors are only fast for lots of random lookups by index (when the array is big). "How can that be?" you might ask. "A normal flat array only needs to move a pointer; the PersistentVector has to go through multiple steps."
The answer is "Cache Locality".
The cache always fetches a range from memory. If you have a big array, it does not fit in the cache, so if you want to get item x and item y you have to reload the whole cache. That's because the array is always sequential in memory.
Now with the PVector it's different. There are lots of small arrays floating around, and the JVM is smart about that and puts them close to each other in memory. So random accesses are fast; if you run through it sequentially, it's much slower.
I have to say that I'm not an expert on hardware or on how the JVM handles cache locality, and I have never benchmarked this myself; I am just retelling what I've heard from other people :)
Edit: mikera mentions that too.
Edit 2: See this talk about functional data structures; skip to the last part if you are only interested in the vector. http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala
A bitmapped vector trie (aka a persistent vector) is a data structure invented by Rich Hickey for Clojure that has been implemented in Scala since 2010 (v2.8). It is its clever bitwise indexing strategy that allows for highly efficient access and modification of large data sets.
From Understanding Clojure's Persistent Vectors:

Mutable vectors and ArrayLists are generally just arrays which grow and shrink when needed. This works great when you want mutability, but it is a big problem when you want persistence. You get slow modification operations because you'll have to copy the whole array all the time, and it will use a lot of memory. It would be ideal to somehow avoid redundancy as much as possible without losing performance when looking up values, along with fast operations. That is exactly what Clojure's persistent vector does, and it is done through balanced, ordered trees.

The idea is to implement a structure which is similar to a binary tree. The only difference is that the interior nodes in the tree have a reference to at most two subnodes and do not contain any elements themselves. The leaf nodes contain at most two elements. The elements are in order, which means that the first element is the first element in the leftmost leaf, and the last element is the rightmost element in the rightmost leaf. For now, we require that all leaf nodes are at the same depth. As an example, take a look at the tree below: it has the integers 0 to 8 in it, where 0 is the first element and 8 the last. The number 9 is the vector size:

[tree diagram]

If we wanted to add a new element to the end of this vector and we were in the mutable world, we would insert 9 in the rightmost leaf node, like this:

[tree diagram]

But here's the issue: we cannot do that if we want to be persistent. And this would obviously not work if we wanted to update an element! We would need to copy the whole structure, or at least parts of it.

To minimize copying while retaining full persistence, we perform path copying: we copy all nodes on the path down to the value we're about to update or insert, and replace the value with the new one when we're at the bottom. A result of multiple insertions is shown below. Here, the vector with 7 elements shares structure with a vector with 10 elements:

[tree diagram]

The pink coloured nodes are shared between the vectors, whereas the brown and blue ones are separate. Other vectors not visualized may also share nodes with these vectors.
More info
Besides Understanding Clojure's Persistent Vectors, the ideas behind this data structure and its use cases are also explained pretty well in David Nolen's 2014 lecture Immutability, interactivity & JavaScript. Or if you really want to dive deep into the technical details, see Phil Bagwell's Ideal Hash Trees, the paper upon which Hickey's initial Clojure implementation was based.
What do you mean by "plain vector"? Just a flat array of items? That's great if you never update it, but if you ever change a 1M-element flat-vector you have to do a lot of copying; the tree exists to allow you to share most of the structure.
Short explanation: it exploits the fact that the JVM optimizes reads, writes, and copies of array data structures very aggressively. The key aspect, IMO, is that once your vector grows past a certain size, index management becomes a bottleneck. This is where the persistent vector's very clever algorithm comes into play: on very large collections it outperforms the standard variant. So basically it is a functional data structure that performs so well only because it is built on small, mutable, highly optimized JVM data structures.
For further details see here (at the end)
http://topsy.com/vimeo.com/28760673
Judging by the title of the talk, it's talking about Scala vectors, which aren't even close to "the most locally packed data possible": see source at https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/src/library/scala/collection/immutable/Vector.scala.
Your definition only applies to Lisps (as far as I know).

R-tree implementation in matlab

Please, can anyone tell me how to implement an R-tree structure in MATLAB to speed up an image retrieval system? My database stores a (multidimensional) color histogram feature vector per image, and I also have a distance vector for the similarity measure.
Thanks
I don't use MATLAB, so I have no idea how much overhead index structures carry in MATLAB; it doesn't appear to be designed for such things.
R-trees do seem to make quite a difference, though. Judging from http://elki.dbs.ifi.lmu.de/wiki/Benchmarking, some algorithms can benefit immensely from a good index structure. The numbers on that page show speedups of 5 to 7 times on a 110,250-image color histogram data set.
From my experience, R-trees can indeed be quite hard to get right, but only if you want to go the full way. If you have a static database, you can easily get away with a bulk-loaded R-tree; neither the bulk loading nor the queries are very hard to do. R-trees get messy once you want the R*-tree optimizations with complex split strategies, reinsertions, and balancing, and want to do all of this efficiently, on disk, with smart caching. But as long as you are operating in memory and do not dynamically add objects, an STR bulk-loaded R-tree will help a lot and be a lot easier to implement.
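To illustrate how simple the bulk-loading part can be, here is a minimal STR (Sort-Tile-Recursive) sketch that packs 2-D points into leaf pages (Scala, illustrative only; a real R-tree then builds the directory levels the same way over the leaf bounding boxes):

def strLeaves(points: Seq[(Double, Double)], capacity: Int): Seq[Seq[(Double, Double)]] =
  if (points.isEmpty) Seq.empty
  else {
    val pages  = math.ceil(points.size.toDouble / capacity).toInt
    val slices = math.ceil(math.sqrt(pages.toDouble)).toInt
    points.sortBy(_._1)                           // sort all points by x
      .grouped(slices * capacity).toSeq           // cut into vertical slices
      .flatMap(_.sortBy(_._2).grouped(capacity))  // per slice: sort by y, pack pages
  }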
You might still be better off building on something that already has a working R-tree, say SQLite with the rtree module, or the ELKI mentioned above.
Implementing an R-tree is not really a simple task. You can use the MATLAB binding for the LidarK library; it should be fast enough. The code is here:
http://graphics.cs.msu.ru/en/science/research/3dpoint/lidark
If you decide to use a kd-tree instead (which is typical for image retrieval), there's a good implementation too:
http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
I'm not familiar with R-trees specifically, but in general trees are dynamic data structures, and MATLAB doesn't really do dynamic data structures unless you start using its OO facilities. If you don't want to do that, you can flatten your tree into a cell array. For example, I'll write a (strictly) binary tree flattened into a cell array, which will save me having to draw a tree. Here goes:
{1,{2},{3}}
which represents a binary tree with root 1, branching left to 2 and right to 3. I can make this deeper:
{1,{2,{5},{6}},{3,{7},{8}}}
which adds another level to the previous tree. If you want to add data at any of the nodes, then your (first) tree might look like this:
{1,[a b c],{2,[e f]},{3,[h i j k l]}}
An alternative to this would be to define your nodes separately, like this:
node1 = [a b c]; node2 = [e f]; node3 = [h i j k l];
then your tree becomes
{node1, node2, node3}
Your problem then becomes writing functions to build and to traverse the tree in your chosen representation. Most tree functions are best written as recursions. Any good text, and lots of Internet sites, will tell you all that you want to know about such functions.
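For illustration, here is the shape such a recursive traversal takes, sketched in Scala with an explicit tree type (in MATLAB the same recursion works directly on the cell arrays, using iscell to tell subtrees from values):

sealed trait Tree
case class Leaf(value: Int) extends Tree
case class Branch(value: Int, left: Tree, right: Tree) extends Tree

// Preorder walk: visit the node's value, then the left and right subtrees.
def preorderWalk(t: Tree): List[Int] = t match {
  case Leaf(v)         => List(v)
  case Branch(v, l, r) => v :: preorderWalk(l) ::: preorderWalk(r)
}

// The tree {1,{2,{5},{6}},{3,{7},{8}}} from above:
val t = Branch(1, Branch(2, Leaf(5), Leaf(6)), Branch(3, Leaf(7), Leaf(8)))
// preorderWalk(t) == List(1, 2, 5, 6, 3, 7, 8)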