I'm trying to figure out how to do a preorder traversal of a B-tree. I know that generally preorder traversal works like this:
preorder(node)
{
print value in node
preorder(left child)
preorder(right child)
}
What's confusing to me is how to make this work with a B-tree, since each node holds multiple values and multiple child pointers. When printing, do all the values in the node get printed before descending into the left child?
Each node looks like this:
child1 value1 child2 value2 child3 value3 child4
Also, why would anyone want to do a preorder traversal of a B-tree, since an inorder traversal is what displays the values in ascending order?
Print all the values in the current node in some defined order (which is up to you, really, though left-to-right is a sensible default) then visit each child node (again, the order is up to you).
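A minimal sketch of this in Python, assuming a node type with a keys list and a children list (keys[i] lies between children[i] and children[i+1]; the class and names are mine, not from any particular B-tree library):

```python
# Preorder traversal of a multi-way B-tree node: emit all keys in the
# current node first, then descend into each child, left to right.
class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # e.g. [value1, value2, value3]
        self.children = children or []    # e.g. [child1, child2, child3, child4]

def preorder(node, out):
    if node is None:
        return
    out.extend(node.keys)                 # all values in this node first...
    for child in node.children:           # ...then each subtree in order
        preorder(child, out)

# Example: root [10, 20] with three leaf children
root = BTreeNode([10, 20], [BTreeNode([1, 5]),
                            BTreeNode([12, 15]),
                            BTreeNode([25, 30])])
result = []
preorder(root, result)
print(result)  # [10, 20, 1, 5, 12, 15, 25, 30]
```

Note the contrast with inorder, which would interleave keys and subtrees and print 1, 5, 10, 12, 15, 20, 25, 30.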
In the documentation it isn't clear to me whether I need to iterate through in order with either next or perhaps foldl (it is mentioned that foldr goes in the opposite order to ordered_set, so presumably foldl goes in the same order), or whether I can use select and rely upon it being ordered (assuming an ordered_set table).
can I use select and rely upon it being ordered (assuming ordered_set table)
ets:select/2:
For tables of type ordered_set, objects are visited in the same order as in a first/next traversal. This means that the match specification is executed against objects with keys in the first/next order and the corresponding result list is in the order of that execution.
ets:first/1:
Returns the first key Key in table Tab. For an ordered_set table, the first key in Erlang term order is returned.
Table Traversal:
Traversals using match and select functions may not need to scan the entire table depending on how the key is specified. A match pattern with a fully bound key (without any match variables) will optimize the operation to a single key lookup without any table traversal at all. For ordered_set a partially bound key will limit the traversal to only scan a subset of the table based on term order.
It would make no sense to me for a table of type ordered_set to return search results in a random order.
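The "first/next traversal" the documentation refers to is just a walk from the smallest key upward. A toy sketch in Python of that order (an analogy only, not actual Erlang/ETS code):

```python
import bisect

# A toy "ordered_set" table: keys kept sorted, as ETS keeps them in
# Erlang term order.
table = {30: "c", 10: "a", 20: "b"}
sorted_keys = sorted(table)

def first(keys):
    # analogous to ets:first/1 - the smallest key, or none
    return keys[0] if keys else None

def next_key(keys, key):
    # analogous to ets:next/2 - the smallest key strictly greater than key
    i = bisect.bisect_right(keys, key)
    return keys[i] if i < len(keys) else None

# A select over an ordered_set visits objects in exactly this order:
visited = []
k = first(sorted_keys)
while k is not None:
    visited.append(k)
    k = next_key(sorted_keys, k)
print(visited)  # [10, 20, 30]
```

Since select follows this same order for ordered_set, its result list comes back sorted by key, which is what the quoted documentation guarantees.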
I am a newbie in the graph databases world. I made a query to get the leaves of a tree, and I also have a list of ids. I want to merge both lists of leaves, remove duplicates, and sum a property of each. I cannot merge the first two sets of vertices.
g.V().hasLabel('Group').has('GroupId','G001').repeat(
outE().inV()
).emit().hasLabel('User').as('UsersList1')
.V().has('UserId', within('001','002')).as('UsersList2')
.select('UsersList1','UsersList2').dedup().values('petitions').sum().unfold()
Regards
There are several things wrong in your query:
you call V().has('UserId', within('001','002')) for every user that was found by the first part of the traversal
the traversal could emit more than just the leaves
select('UsersList1','UsersList2') creates pairs of users
values('petitions') tries to access the property petitions of each pair; this will always fail
The correct approach would be:
g.V().has('User', 'UserId', within('001','002')).fold().
union(unfold(),
V().has('Group','GroupId','G001').
repeat(out()).until(hasLabel('User'))).
dedup().
values('petitions').sum()
I didn't test it, but I think the following will do:
g.V().union(
hasLabel('Group').has('GroupId','G001').repeat(
outE().inV()
).until(hasLabel('User')),
has('UserId', within('001','002')))
.dedup().values('petitions').sum()
In order to get only the tree leaves, it is better to use until. Using emit will output all inner tree nodes as well.
union merges the two inner traversals.
In a regular B-tree index, the leaf node contains a key and a pointer to the heap tuple (user table row), which means that in a B-tree the relationship between index tuple and user table row is one-to-one.
Just like in a B-tree, a GiST leaf node also contains a key datum and info about where the heap tuple is stored, but GiST leaves may or may not contain the entire row data in their keys (please correct me if I'm wrong). So, if I were able to store one part of my table data in one leaf node and the other part in another leaf node and make both of them point to one heap tuple, would that be possible? That would make the relationship between GiST index tuple and heap tuple many-to-one.
Is all this correct?
A GiST index is a generalization of a B-tree index.
In a non-leaf block of a B-tree index, two consecutive index entries define the boundary for the indexed values in the subtree at the destination of the pointer between these index entries.
In other words, each pointer to the next lower level is labeled with an interval that contains all values in the subtree.
This only works for data types with a total ordering.
The GiST index extends that concept. Each entry in a non-leaf node has a condition that the subtree under that index entry has to satisfy.
When scanning a GiST index, I search the index page for all entries that may contain values matching my search condition. Since there is no total ordering, it is possible (but of course not desirable) that the conditions somehow “overlap” so that something I search for can have matches in more than one of the entries. In that case I have to descend into all the referenced subtrees, but I can skip those where the entry's condition guarantees that the subtree cannot contain entries that match my search condition.
This is a little abstract, so let's flesh it out with an example.
One of the classical examples of a GiST index is an R-tree index, a kind of geographical index such as the one used by PostGIS.
Here the condition of an index entry is a bounding box that contains the bounding boxes of all geometries in the subtree of the index entry. So when searching for a geometry, I take its bounding box and see which of the index entries in a page contain this bounding box. These are the subtrees into which I have to descend.
One thing that can be seen in this example is that a GiST index can be lossy, that is, it gives me a necessary, but not sufficient, condition for a hit. The leaf entries found in a GiST index scan always have to be rechecked to confirm that the actual table entry also satisfies the condition (not every geometry is a rectangle). This is why a GiST index scan is always a bitmap index scan in PostgreSQL.
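The descend-and-recheck logic just described can be sketched like this (a toy R-tree in Python; the node layout and names are mine, assuming axis-aligned bounding boxes):

```python
# Toy R-tree search: descend into every subtree whose bounding box
# intersects the query box, then recheck the exact geometry at the
# leaves (the "lossy" step). Boxes are (xmin, ymin, xmax, ymax).
def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def exact_match(geom, query):
    # stand-in for the recheck against the actual table row
    return intersects(geom, query)

def search(node, query, hits):
    if node["leaf"]:
        for bbox, geom in node["entries"]:
            # a bbox hit is necessary but not sufficient: recheck
            if intersects(bbox, query) and exact_match(geom, query):
                hits.append(geom)
    else:
        for bbox, child in node["entries"]:
            if intersects(bbox, query):   # may hold for several entries
                search(child, query, hits)  # descend into every candidate

# Two leaf pages under one root; geometries here are their own bboxes.
leaf1 = {"leaf": True, "entries": [((0, 0, 1, 1), (0, 0, 1, 1)),
                                   ((2, 2, 3, 3), (2, 2, 3, 3))]}
leaf2 = {"leaf": True, "entries": [((5, 5, 6, 6), (5, 5, 6, 6))]}
root = {"leaf": False, "entries": [((0, 0, 3, 3), leaf1),
                                   ((5, 5, 6, 6), leaf2)]}

hits = []
search(root, (0, 0, 2, 2), hits)
print(hits)  # both leaf1 geometries intersect the query; leaf2 is never visited
```

Unlike a B-tree search, which follows exactly one pointer per level, this search may follow several pointers on one page whenever their bounding boxes overlap the query.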
This all sounds nice and simple. The hard part about a good GiST index is the picksplit algorithm, which decides, upon an index page split, which of the index entries goes into which of the two new index pages. The better this works, the more efficient the index will be.
So you see, a GiST index is “somewhat like” a B-tree index in many respects. You can see a B-tree index as an optimized special case of a GiST index (see the btree-gist contrib module).
This lets me answer your questions:
GiST leaf node also contains key datum and info about where the heap tuple is stored
This is true.
GiST leaves may or may not contain entire row data in its keys
Of course the index entry does not contain the entire row. But I think you mean the right thing. The condition in the GiST leaf can be broader than the actual object in the table, like a bounding box is bigger than a geometry.
if I am able to store one part of my table data in one leaf node and the other part in another leaf node and make both of them point to one heap tuple, would it be possible? This will make the relationship between GiST index tuple and heap tuple many to one.
This is wrong. Even though a value may satisfy several of the entries in a GiST index page, it is only contained in one of the subtrees, and only one leaf page entry points to any given table row. It is a one-to-one relationship, just like in a B-tree index.
I am trying to get an insight into how the B-tree is created.
Let's say I am using a number as an index variable. Will the tree be created with depth = 1, or would it be like this: http://bit.ly/ygwlEp
If so, what would be the depth of the tree, and what would be the maximum number of children?
For compound keys (say 2 index variables), will there be two trees? Or would it be a single tree with the first level as the first key and the second level as the second key?
Say I take a timestamp as the index key. Can I make it a tree with the first layer as years, the second as months, and the third as days? Can MongoDB automatically parse this information out?
How will the tree be created with depth =1 or Would it be like this - http://bit.ly/ygwlEp
Your picture shows a "binary tree" not a "b-tree", these are different.
"B-tree" works by creating buckets of a given size (believe MongoDB uses 4k) and ordering items within those buckets.
If so what would be the depth of the tree and what would be the maximum number of children
Please take a look at the Wikipedia entry on B-trees, it should provide a definitive answer for you.
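For a rough feel of the numbers: a B-tree in which every node has up to m children (and m-1 keys) can hold at most m^(h+1) - 1 keys at height h, so the height grows logarithmically with the key count. A back-of-the-envelope sketch in Python (the branching factor 100 is an illustrative assumption, not MongoDB's actual bucket fan-out):

```python
# Smallest height h (a lone root is height 0) at which a B-tree with
# branching factor m can hold n_keys keys: we need m**(h + 1) - 1 >= n_keys.
def min_height(n_keys, m):
    h = 0
    while m ** (h + 1) - 1 < n_keys:
        h += 1
    return h

# With ~100 children per node, half a million keys fit at height 2,
# i.e. any lookup touches at most 3 nodes.
print(min_height(500_000, 100))  # 2
```

This is why B-trees stay shallow even for huge collections: each extra level multiplies the capacity by the branching factor.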
For compound keys (say 2 index variables), will there be two trees.
Only one tree. However, the key stored in the tree is basically the BSON representation of both items "mushed" together.
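That "mushed together" key orders like a tuple: the first field is compared first, and the second field only breaks ties. A quick Python analogy (tuples standing in for the BSON-encoded compound key, not MongoDB's actual encoding):

```python
# Compound index entries sort like tuples: first field first,
# second field only on ties.
entries = [("b", 2), ("a", 3), ("a", 1), ("b", 1)]
print(sorted(entries))  # [('a', 1), ('a', 3), ('b', 1), ('b', 2)]
```

This is also why a compound index can serve queries on the first field alone, but not efficiently on the second field alone: the second field's order only exists within groups that share the first field.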
Say i take timestamp as the index key. Can i make it as a tree with first layer as years , second as month , and third as day . Can mongoDB automatically parse this information out?
No, you have no control over the indexing structure.
No, MongoDB does not support any special parsing of dates in indexes.
If you do a comparison operation for timestamps, you will need to send in another timestamp.
I'm a total newbie with PostgreSQL, but I have good experience with MySQL. I was reading the documentation and discovered that PostgreSQL has an array type. I'm quite confused, since I can't understand in which context this type can be useful within an RDBMS. Why would I choose this type instead of using a classical one-to-many relationship?
Thanks in advance.
I've used them to make working with trees (such as comment threads) easier. You can store the path from the tree's root to a single node in an array; each number in the array is the branch number of that node. Then, you can do things like this:
SELECT id, content
FROM nodes
WHERE tree = X
ORDER BY path -- The array is here.
PostgreSQL compares arrays element by element in the natural fashion, so ORDER BY path will dump the tree in a sensible linear display order; you then check the length of path to figure out a node's depth, which gives you the indentation to render it correctly.
The above approach gets you from the database to the rendered page with one pass through the data.
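Python lists happen to compare element by element the same way, so the effect of that ORDER BY can be previewed outside the database (the paths and comment texts here are hypothetical):

```python
# Materialized-path rows: (path, content). Sorting by path element by
# element yields the tree in depth-first display order; len(path) is
# the node's depth, which drives the indentation.
rows = [
    ([1],       "first top-level comment"),
    ([2],       "second top-level comment"),
    ([1, 1],    "reply to the first comment"),
    ([1, 2],    "another reply"),
    ([1, 1, 1], "reply to the reply"),
]

for path, content in sorted(rows):
    depth = len(path) - 1
    print("    " * depth + content)
```

A shorter path that is a prefix of a longer one sorts first, so every parent comes out immediately before its subtree, which is exactly the order you want for rendering.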
PostgreSQL also has geometric types, simple key/value types, and supports the construction of various other composite types.
Usually it is better to use traditional association tables but there's nothing wrong with having more tools in your toolbox.
One SO user is using it for what appears to be machine-aided translation. The comments to a follow-up question might be helpful in understanding his approach.
I've been using them successfully to aggregate recursive tree references using triggers.
For instance, suppose you've a tree of categories, and you want to find products in any of categories (1,2,3) or any of their subcategories.
One way to do it is to use an ugly with recursive statement. Doing so will output a plan stuffed with merge/hash joins on entire tables and an occasional materialize.
with recursive categories as (
select id
from categories
where id in (1,2,3)
union all
...
)
select products.*
from products
join product2category on...
join categories on ...
group by products.id, ...
order by ... limit 10;
Another is to pre-aggregate the needed data:
categories (
id int,
parents int[] -- (array_agg(parent_id) from parents) || id
)
products (
id int,
categories int[] -- array_agg(category_id) from product2category
)
index on categories using gin (parents)
index on products using gin (categories)
select products.*
from products
where categories && array(
select id from categories where parents && array[1,2,3]
)
order by ... limit 10;
One issue with the above approach is that the row estimates for the && operator are junk. (Its selectivity is a stub function that has yet to be written and yields something like 1/200 of the rows irrespective of the values in your aggregates.) Put another way, you may very well end up with an index scan where a seq scan would be correct.
To work around this, I increased the statistics target on the gin-indexed column, and I periodically look into pg_stats to extract more appropriate stats. When a cursory look at those stats reveals that using && for the specified values would produce an incorrect plan, I rewrite the applicable occurrences of && with arrayoverlap() (which has a stub selectivity of 1/3), e.g.:
select products.*
from products
where arrayoverlap(cat_id, array(
select id from categories where arrayoverlap(parents, array[1,2,3])
))
order by ... limit 10;
(The same goes for the <# operator...)