How is a B-tree created in MongoDB? - mongodb

I am trying to get some insight into how the B-tree is created.
Let's say I am using a number as the index variable. How will the tree be created: with depth = 1, or would it look like this - http://bit.ly/ygwlEp
If so, what would be the depth of the tree, and what would be the maximum number of children?
For compound keys (say, 2 index variables), will there be two trees? Or would it be a single tree with the first key as the first level and the second key as the second level?
Say I take a timestamp as the index key. Can I build the tree with years as the first layer, months as the second, and days as the third? Can MongoDB automatically parse this information out?

How will the tree be created with depth =1 or Would it be like this - http://bit.ly/ygwlEp
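Your picture shows a binary tree, not a B-tree; these are different structures.
A B-tree works by creating buckets (nodes) of a given size (I believe MongoDB uses 4k) and keeping the items within each bucket in order. Because each bucket holds many keys and can have many children, the tree stays shallow.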
If so what would be the depth of the tree and what would be the maximum number of children
Both depend on the bucket size and the size of the keys. Please take a look at the Wikipedia entry on B-trees; it gives the exact bounds.
For compound keys (say 2 index variables), will there be two trees.
Only one tree. However, the key stored in the tree is essentially the BSON representation of both values joined together, so entries sort by the first field and then by the second.
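The single-tree behaviour for compound keys can be sketched in a few lines. This is only an illustration, not MongoDB's implementation: a sorted Python list stands in for the tree, tuples stand in for the combined key, and the names `entries` and `prefix_range` are made up for the example. It also assumes the first field is an integer.

```python
import bisect

# Hypothetical index entries for a compound index on (a, b).
# Each entry is ONE key made of both values, kept in sorted order,
# which is how a single ordered structure can serve a two-field index.
entries = sorted([(2, "x"), (1, "z"), (1, "a"), (3, "m"), (2, "b")])

def prefix_range(keys, first):
    """Return all compound keys whose first field equals `first`.

    Assumes the first field is an integer, so `first + 1` is the next prefix.
    """
    lo = bisect.bisect_left(keys, (first,))
    hi = bisect.bisect_left(keys, (first + 1,))
    return keys[lo:hi]

print(entries)
print(prefix_range(entries, 2))
```

Because the combined keys sort by the first field and then the second, a lookup on the first field alone is a contiguous range of the same structure, so no second tree is needed.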
Say i take timestamp as the index key. Can i make it as a tree with first layer as years , second as month , and third as day . Can mongoDB automatically parse this information out?
No, you have no control over the indexing structure.
MongoDB also does not support any special parsing of dates in indexes.
If you do a comparison operation on timestamps, you will need to send in another timestamp. To select a year, a month, or a day, query for the range between two timestamps.
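A rough sketch of why a year/month/day hierarchy is not needed: with timestamps kept in sorted order, any calendar period is just a range between two timestamps. The sorted list below is a stand-in for the index, and `stamps` and `in_range` are hypothetical names chosen for the example.

```python
import bisect
from datetime import datetime

# Hypothetical indexed timestamps, kept sorted as an ordered index would keep them.
stamps = sorted([
    datetime(2011, 12, 31, 23, 59),
    datetime(2012, 1, 15, 8, 0),
    datetime(2012, 2, 1, 0, 0),
    datetime(2012, 2, 20, 12, 30),
    datetime(2013, 1, 1, 0, 0),
])

def in_range(keys, start, end):
    """Return keys with start <= key < end using two binary searches."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return keys[lo:hi]

# "Everything in February 2012" is a range between two timestamps.
feb_2012 = in_range(stamps, datetime(2012, 2, 1), datetime(2012, 3, 1))
print(feb_2012)
```

In a MongoDB query the same idea would be expressed with `$gte` and `$lt` on the indexed field, passing the two boundary timestamps.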

Related

How can I create an arbitrary ranking of records in Postgres?

The Problem
I'm looking to create a user defined ranking of records in Postgres.
That is, the order in which the records are ranked is not defined by some underlying score but rather via the choices of a collection of users.
These choices are subject to frequent changes and the ranking will be constantly changing with both new records being added and existing records being moved to new positions.
For the sake of argument, assume that these operations occur with high frequency.
Furthermore, given an arbitrary subset of all records, we need to be able to determine how they should be ordered according to the ranking.
A Naive Solution
A very naive solution would be to track the ranking as an integer directly on the model and 'push' all the higher ranked records up by one when inserting a new record. This is obviously not ideal as we would need to modify potentially the entire table at once.
A Better Solution
Maintain a 'score' on each record in the interval [0, 1]. This can be indexed using a BTREE and used to rank the records. The first two records would have the scores 0 and 1. When inserting a new record some intermediate value would be chosen (e.g. 0.5) and the record inserted. This choice could be optimised in order to minimise the number of digits in the score.
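The scoring scheme described above could look roughly like the following. It is a minimal sketch under stated assumptions: exact fractions are used so that the midpoint is always representable, the midpoint is only one possible choice of intermediate value, and the names `scores`, `score_between` and `ranked` are invented for the example.

```python
from fractions import Fraction

def score_between(lower, upper):
    """Pick a score strictly between two neighbours (here: the midpoint)."""
    return (lower + upper) / 2

# Hypothetical ranking: record id -> score in [0, 1].
scores = {"first": Fraction(0), "last": Fraction(1)}

# Insert a record between "first" and "last".
scores["middle"] = score_between(scores["first"], scores["last"])
# Insert another record between "first" and "middle".
scores["second"] = score_between(scores["first"], scores["middle"])

def ranked(ids):
    """Order an arbitrary subset of record ids by score."""
    return sorted(ids, key=scores.__getitem__)

print(ranked(["last", "second", "first"]))
```

Only the inserted record is written, and any subset can be ordered by sorting on the score. If the score were stored as a floating-point number instead, repeatedly splitting the same gap would eventually run out of precision, so some occasional renumbering step would likely be needed.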
A Question
The above seems like a complex solution to a common problem. Furthermore, the problem is actually being solved by the underlying BTREE index with the score something of a hack to create the index.
Is there a neater way to solve the problem?

PostgreSQL - Compare ts_vector fields

I have two tables holding data that comes from two different sources. One of the fields in each table contains the title of a movie but, for reasons outside my control, the titles are not always exactly the same.
So I use a tsvector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two tsvectors without taking the numeric values into account, looking only at the text content. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare tsvectors.
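To make the comparison concrete, here is a small model of the idea outside the database. It is not how PostgreSQL stores a tsvector; the dictionaries, lexemes and positions are made up, and `strip_positions` is a hypothetical helper that only mimics the effect of strip().

```python
# Hypothetical stand-ins for two tsvector values: lexeme -> positions.
# The lexemes and positions are invented for illustration.
title_a = {"lord": [2], "ring": [5], "fellowship": [8]}
title_b = {"fellowship": [1], "lord": [4], "ring": [7]}

def strip_positions(vector):
    """Keep only the lexemes, discarding positions (the idea behind strip())."""
    return frozenset(vector)

# Comparing the full values also compares positions, so these differ.
same_with_positions = title_a == title_b
# Comparing only the lexemes treats the two titles as equal.
same_after_strip = strip_positions(title_a) == strip_positions(title_b)

print(same_with_positions, same_after_strip)
```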
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.

Maintain N latest points in Postgres

My goal is to keep the N latest points in Postgres in one row. When a new point is added, the oldest point is removed.
For example, say each row holds N+1 values: 1 userid and N integers.
When a new integer arrives for a user, I want to remove the oldest entry and add the new one. The obvious issue here is performance. Since I am adding only one new integer, I want a fast way to do it.
I tried one very naive way by keeping two columns
userid, json
where json was the list of all integers. I would remove the first entry, append the new entry, and write the whole json back to Postgres. Unsurprisingly, this does not perform well.
Please suggest a good way to do this. Does Postgres have some kind of min-heap data structure that does this in much better than linear time?
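The behaviour being asked for (append the newest value, drop the oldest) is that of a bounded ring buffer. The sketch below shows only that behaviour in memory; it says nothing about how Postgres would store it, and `N`, `latest` and `add_point` are hypothetical names for the example.

```python
from collections import deque

N = 3  # hypothetical window size

# One bounded buffer per user; deque(maxlen=N) drops the oldest item
# automatically when a new one is appended to a full buffer.
latest = {}

def add_point(userid, value):
    """Record a new value for a user, keeping only the N most recent."""
    latest.setdefault(userid, deque(maxlen=N)).append(value)

for v in [10, 20, 30, 40]:
    add_point("u1", v)

print(list(latest["u1"]))
```

Each append is constant time because nothing is shifted or rewritten; the cost in the json approach comes from rewriting the whole list on every update.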

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything by storing these 100 values in an integer[] column in PostgreSQL, as compared to storing them in separate columns?
Plus, since we can add indexes on array elements,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference for queries using such an index as compared to a regular index on a column?
As far as I've read, such a performance difference would show up in arrays with variable-length elements, but I'm not sure about fixed-length ones.
Don't go for the lazy way.
If you need to store 100 or more values as an array, that is fine, provided an array makes sense for your application and your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performance, and you should use columns. This will also help you when you have to delete a "column" in the middle or redesign the table.
Anyway, as Frank wrote in the comments, if the values are all of the same type, consider modelling them in a separate table (provided they also have the same meaning).

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data, only the column keys matter. Neither the stored data nor the auto-generated timestamp is taken into account. The CompareWith attribute is what counts here: if you set CompareWith to UTF8Type, the keys are interpreted as UTF8Type; if you set it to TimeUUIDType, the keys are automatically interpreted as timestamps. You do not have to specify the data type anywhere else.
Look at the SlicePredicate and SliceRange definitions on this page, which is a good place to start: http://wiki.apache.org/cassandra/API
You might also find this article useful: http://www.sodeso.nl/?p=80 (around the third part, the author talks about slice-ranging his queries).
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users create vastly more data than all other users!).
If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and Keys can be of any type, since the row key is just the first column.
Conceptually, the cluster is a circular ring of hash keys, and the partitioner hashes each key to decide where on the ring, and therefore on which node, it lives.
Beware of using dates as row keys, however, since even the randomization of the default RandomPartitioner is limited and you could end up clustering your data.
What's more, if that date changes, you would have to delete the previous row, since you can only do inserts in C*.
Here is what we know:
A slice range is a range of columns in a row with a start value and an end value; it is used mostly for wide rows, since columns are ordered. Column names that are defined in the CF are indexed, however, so they can be retrieved by specifying their names.
A key slice is a key together with the sliced column range, as returned by Cassandra.
The equivalent of a where clause uses secondary indexes. You may use inequality operators there, but there must be at least ONE equals clause in your statement (also see https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with a Random Partitioner, as the MD5 hash of your key doesn't keep lexical ordering.
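The point about MD5 and ordering can be checked with a few lines. This is only an illustration of hashing, not Cassandra's partitioner code: `token` is a hypothetical helper and the date-like row keys are invented for the example.

```python
import hashlib

def token(key):
    """Hash a row key to an integer token, as a hash-based partitioner might."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

# Hypothetical row keys that sort lexically by date.
keys = ["2010-09-01", "2010-09-02", "2010-09-03", "2010-09-04"]

by_key = sorted(keys)               # lexical order of the keys
by_token = sorted(keys, key=token)  # order in which they sit on the ring

print(by_key)
print(by_token)
```

The two printed orders will generally differ, which is why a range over tokens says nothing useful about a range over the original keys.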
What you want to use is a Column Family based index using a Wide Row:
CompositeType(TimeUUID | UserID)
To keep this row from becoming hot, add a meaningful leading key (a "shard key"), such as the user type or the region, that splits the data across nodes.
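A rough model of that layout, under loud assumptions: plain integers stand in for TimeUUIDs, a region name is used as the shard key, and a dictionary of sorted lists stands in for the wide rows. The names `index`, `record_entry` and `entries_since` are invented for the sketch and are not part of any Cassandra API.

```python
import bisect
from collections import defaultdict

# Hypothetical wide-row index: shard key -> sorted list of (timestamp, user_id).
# The shard key (here a region name) spreads writes over several rows.
index = defaultdict(list)

def record_entry(region, timestamp, user_id):
    """Add an entry to the shard's row, keeping columns ordered by time."""
    bisect.insort(index[region], (timestamp, user_id))

def entries_since(region, last_seen):
    """Slice one shard's row for columns with timestamp > last_seen.

    Assumes integer timestamps, so `last_seen + 1` is the next possible value.
    """
    row = index[region]
    start = bisect.bisect_left(row, (last_seen + 1,))
    return row[start:]

record_entry("eu", 100, "alice")
record_entry("us", 105, "bob")
record_entry("eu", 110, "carol")
record_entry("eu", 120, "dave")

print(entries_since("eu", 100))
```

A consumer that wants every new entry would remember the last timestamp it processed per shard and slice each shard's row from there, which is a column slice rather than a key range.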
Having more data than strictly necessary in Cassandra is not a problem; that is how it is designed. So ask yourself "what do I need to query?" and then design a Column Family for that query, rather than trying to fit everything into one CF as you would in an RDBMS.