Is the relationship between an index tuple in a GiST index and a user table row many-to-one or one-to-one?

In a regular B-tree index, a leaf node contains a key and a pointer to the heap tuple (user table row), which means that in a B-tree the relationship between index tuple and user table row is one-to-one.
Just like in a B-tree, a GiST leaf node also contains a key datum and information about where the heap tuple is stored, but GiST leaves may or may not contain the entire row data in their keys (please correct me if I'm wrong). So would it be possible to store one part of my table data in one leaf node and another part in another leaf node, and have both of them point to the same heap tuple? That would make the relationship between GiST index tuples and heap tuples many-to-one.
Is all this correct?

A GiST index is a generalization of a B-tree index.
In a non-leaf block of a B-tree index, two consecutive index entries define the boundary for the indexed values in the subtree at the destination of the pointer between these index entries.
In other words, each pointer to the next lower level is labeled with an interval that contains all values in the subtree.
This only works for data types with a total ordering.
The GiST index extends that concept. Each entry in a non-leaf node has a condition that the subtree under that index entry has to satisfy.
When scanning a GiST index, I search the index page for all entries that may contain values matching my search condition. Since there is no total ordering, it is possible (but of course not desirable) that the conditions somehow “overlap” so that something I search for can have matches in more than one of the entries. In that case I have to descend into all the referenced subtrees, but I can skip those where the entry's condition guarantees that the subtree cannot contain entries that match my search condition.
This is a little abstract, so let's flesh it out with an example.
One of the classic examples of a GiST index is an R-tree index, a kind of spatial index like the one used by PostGIS.
Here the condition of an index entry is a bounding box that contains the bounding boxes of all geometries in the subtree of the index entry. So when searching for a geometry, I take its bounding box and see which of the index entries on a page contain this bounding box. Those are the subtrees into which I have to descend.
One thing that can be seen in this example is that a GiST index can be lossy, that is, it gives me a necessary, but not sufficient, condition for a hit. The leaf entries found in a GiST index scan always have to be rechecked to see whether the actual table row also satisfies the search condition (not every geometry is a rectangle). This is why the results of such a GiST index scan are always rechecked against the heap in PostgreSQL.
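To make this concrete, here is a minimal sketch (assuming PostGIS is installed; the table and column names are made up for illustration):

-- hypothetical table of geometries
CREATE TABLE shapes (id bigint PRIMARY KEY, geom geometry);
CREATE INDEX shapes_geom_idx ON shapes USING gist (geom);

-- the index scan finds candidate rows by bounding box,
-- and ST_Intersects rechecks the actual geometries
EXPLAIN
SELECT id FROM shapes
WHERE ST_Intersects(geom, ST_MakeEnvelope(0, 0, 10, 10, 4326));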
This all sounds nice and simple. The hard part of a good GiST index is the picksplit algorithm that decides, when an index page splits, which of the index entries goes into which of the two new pages. The better this works, the more efficient the index will be.
So you see, a GiST index is “somewhat like” a B-tree index in many respects. You can see a B-tree index as an optimized special case of a GiST index (see the btree-gist contrib module).
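For example, the btree-gist module lets you index ordinary scalar values with GiST, which is handy for exclusion constraints (a sketch; the table is made up):

CREATE EXTENSION btree_gist;

-- btree_gist supplies the GiST opclass for the plain integer
-- column; combined with a range column: no two rows may have the
-- same room with overlapping time ranges
CREATE TABLE events (id bigint, room int, during tsrange);
ALTER TABLE events ADD EXCLUDE USING gist (room WITH =, during WITH &&);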
This lets me answer your questions:
a GiST leaf node also contains a key datum and information about where the heap tuple is stored
This is true.
GiST leaves may or may not contain the entire row data in their keys
Of course the index entry does not contain the entire row. But I think you mean the right thing: the condition in the GiST leaf can be broader than the actual object in the table, just as a bounding box is bigger than the geometry it contains.
would it be possible to store one part of my table data in one leaf node and another part in another leaf node, and have both of them point to the same heap tuple? That would make the relationship between GiST index tuples and heap tuples many-to-one.
This is wrong. Even though a value may satisfy several of the entries in a GiST index page, it is only contained in one of the subtrees, and only one leaf page entry points to any given table row. It is a one-to-one relationship, just like in a B-tree index.

Related

GIN Index implementation

Generally, trigram indexes are supposed to store the trigrams of the indexed values.
I understand the structure of a GIN index and how it stores values.
One thing I am stuck on is whether it stores the trigrams of the given texts or the texts themselves.
I've read some articles, and they all show a GIN index storing words with tsvector.
Now if this is the case, a GIN index shouldn't work for searches like
SELECT * FROM table WHERE data LIKE '%word%';
But it seems to work for such a case too. I have used a database of a million rows where the column I'm searching on is a random text of size 30. I haven't used tsvector since the column is just a single word of size 30.
Example Column Value: bVeADxRVWpCeEHyNLxxfkfVkSAKkKw
But after creating a GIN index on this column with gin_trgm_ops,
the fuzzy search is much, much faster. It works well.
But if GIN just stores the words as-is, it shouldn't work for %word%, yet it does, which leads me to ask the question: are GIN indexes simply made up of the text values themselves or of the trigrams of the text values?
My whole question can be simplified into this:
If I create an index on a column with values like 'bVeADxRVWpCeEHyNLxxfkfVkSAKkKw', would GIN simply index this value, or would it store the trigrams of the values in its index tree (bVe, VeA, eAD, ..., kKw)?
The G in GIN stands for "generalized". GIN just works with a list of tokens per indexed tuple field, but what a token actually represents is up to the operator class, which defines and extracts the tokens. The default operator class for tsvector uses stemmed words; the operator class gin_trgm_ops (which is for text, but not the default one for text) uses trigrams. An example based on one will have limited applicability to the other. To understand it in a generalized way, you need to consider the tokens to just be labels. One token can point to many rows, and one row can be pointed to by many tokens. What the tokens mean is the business of the operator class, not of the GIN machinery itself.
When using gin_trgm_ops, '%word%' breaks down to 'wor' and 'ord', both of which must be present in the index (for the same row) in order for '%word%' to possibly match. But 'ordinary worry' also has both of those trigrams in it, so it would pass the bitmap index scan but then be rejected by the recheck.
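A quick way to see this at work (a sketch; assumes the pg_trgm extension and a made-up table):

CREATE EXTENSION pg_trgm;

-- show_trgm() reveals the tokens the operator class extracts
SELECT show_trgm('word');   -- {"  w"," wo",ord,"rd ",wor}

CREATE TABLE docs (id bigint, data text);
CREATE INDEX docs_trgm_idx ON docs USING gin (data gin_trgm_ops);

-- the planner extracts the trigrams of 'word' from the pattern,
-- finds candidate rows via the index, then rechecks the pattern
EXPLAIN
SELECT * FROM docs WHERE data LIKE '%word%';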

Postgres Materialized Path - What are the benefits of using ltree?

Materialized Path is a method for representing hierarchy in SQL. Each node contains the path itself and all its ancestors (grandparent/parent/self).
The django-treebeard implementation of MP (docs):
Each step of the path is a fixed length for consistent performance.
Each node contains depth and numchild fields (fast reads at minimal cost to writes).
The path field is indexed (with a standard b-tree index):
The materialized path approach makes heavy use of LIKE in your database, with clauses like WHERE path LIKE '002003%'. If you think that LIKE is too slow, you’re right, but in this case the path field is indexed in the database, and all LIKE clauses that don’t start with a % character will use the index. This is what makes the materialized path approach so fast.
Implementation of get_ancestors (link):
Match nodes with a path that contains a subset of the current path (steplen is the fixed length of a step).
paths = [
    self.path[0:pos]
    for pos in range(0, len(self.path), self.steplen)[1:]
]
return get_result_class(self.__class__).objects.filter(
    path__in=paths).order_by('depth')
Implementation of get_descendants (link):
Match nodes with a depth greater than self and a path which starts with current path.
return cls.objects.filter(
    path__startswith=parent.path,
    depth__gte=parent.depth
).order_by('path')
Potential downsides to this approach:
A deeply nested hierarchy will result in long paths, which can hurt read performance.
Moving a node requires updating the path of all descendants.
Postgres includes the ltree extension which provides a custom GiST index (docs).
I am not clear on which benefits ltree provides over django-treebeard's implementation. This article argues that only ltree can answer the get_ancestors question, but as demonstrated earlier, figuring out the ancestors (or descendants) of a node is trivial.
[As an aside, I found this Django ltree library: https://github.com/mariocesar/django-ltree.]
Both approaches use an index (django-treebeard uses b-tree, ltree uses a custom GiST). I am interested in understanding the implementation of the ltree GiST and why it might be a more efficient index than a standard b-tree for this particular use case (materialized path).
Additional links
What are the options for storing hierarchical data in a relational database?
https://news.ycombinator.com/item?id=709970
TL;DR Reusable labels, complex search patterns, and ancestry searches against multiple descendant nodes (or a single node whose path hasn't yet been retrieved) can't be accomplished using a materialized path index.
For those interested in the gory details...
Firstly, your question is only relevant if you are not reusing any labels in your node description. If you were, the l-tree is really the only option of the two. But materialized path implementations don't typically need this, so let's put that aside.
One obvious difference will be in the flexibility in the types of searches that l-tree gives you. Consider these examples (from the ltree docs linked in your question):
foo        Match the exact label path foo
*.foo.*    Match any label path containing the label foo
*.foo      Match any label path whose last label is foo
The first query is obviously achievable with materialized path. The last is also achievable, where you'd adjust the query as a sibling lookup. The middle case, however, isn't directly achievable with a single index lookup. You'd either have to break this up into two queries (all descendants + all ancestors), or resort to a table scan.
And then there are really complex queries like this one (also from the docs):
Top.*{0,2}.sport*@.!football|tennis.Russ*|Spain
A materialized path index would be useless here, and a full table scan would be required to handle this. l-tree is the only option if you want to perform this as a SARGable query.
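For illustration, here is a sketch of how such queries look with ltree (the extension setup and table are assumptions):

CREATE EXTENSION ltree;

CREATE TABLE nodes (id bigint, path ltree);
CREATE INDEX nodes_path_gist_idx ON nodes USING gist (path);

-- lquery patterns like these are supported by the GiST index
SELECT * FROM nodes WHERE path ~ '*.foo.*';
SELECT * FROM nodes WHERE path ~ 'Top.*{0,2}.sport*@.!football|tennis.Russ*|Spain';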
But for the standard hierarchical operations, finding any of:
parent
children
descendants
root nodes
leaf nodes
materialized path will work just as well as l-tree. Contrary to the article linked above, searching for all descendants of a common ancestor is very doable using a b-tree. The query format WHERE path LIKE 'A.%' is SARGable provided your index is prepared properly (I had to explicitly tag my path index with varchar_pattern_ops to get this to work).
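Concretely, something like this (a sketch; the table is made up):

CREATE TABLE nodes_mp (id bigint, path varchar);
CREATE INDEX nodes_mp_path_idx ON nodes_mp (path varchar_pattern_ops);

-- the prefix search is SARGable: the planner turns LIKE 'A.%'
-- into an index range scan
EXPLAIN
SELECT * FROM nodes_mp WHERE path LIKE 'A.%';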
What is missing from this list is finding all ancestors for a descendant. The query format WHERE 'A.B.C.D' LIKE path || '.%' is unfortunately not going to use the index. One workaround that some libraries implement is to parse out the ancestor nodes from the path, and query them directly: WHERE id IN ('A', 'B', 'C'). However, this will only work if you're targeting ancestors of a specific node whose path you have already retrieved. l-tree is going to win on this one.
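With ltree, by contrast, the ancestor search stays indexable (a sketch, reusing the nodes table from the earlier example):

-- @> means "is an ancestor of or equal to" and is supported by
-- the GiST index, so no path parsing is needed
SELECT * FROM nodes WHERE path @> 'A.B.C.D'::ltree;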

Sphinx: centralize multiple tables into a single index

I have multiple tables (MySQL) and I want a single index for them.
Each table has a primary key of the int autoincrement type.
The structure of the collected data is the same for each table (so no conflict there), but since the IDs collide, it seems that I have to query each index separately (unless you can give me a hint on how to avoid the ID collision).
The question is: if I query each index separately, does that mean the weights of the returned results are comparable between indexes?
unless you can give me a hint on how to avoid the ID collision
See for example
http://sphinxsearch.com/forum/view.html?id=13078
You can just arrange for the IDs to be offset differently. The 'sphinx document id' doesn't have to match the real primary key, but having a simple mapping keeps the application simpler.
You have a choice between one index, one source (using a single SQL query to UNION all the tables together); one index, many sources (a source per table, all feeding one index); or many indexes (one index per table, each with its own source). Whichever way you choose will give the same query results.
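A common trick for the offset (a sketch; the multiplier and table names are assumptions, adapt them to your schema):

-- source for table1: sphinx document id = real id * 10 + 1
SELECT id * 10 + 1 AS id, title, body FROM table1
-- source for table2: sphinx document id = real id * 10 + 2
SELECT id * 10 + 2 AS id, title, body FROM table2
-- the application recovers the real key as id DIV 10 and the
-- source table as id MOD 10 (MySQL syntax)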
If I query each index separately, does that mean the weights of the returned results are comparable between indexes?
Pretty much. The difference should be negligible, so it doesn't matter which way round you do it.

How to formulate index_name in SQL?

I am trying to create an index on one of my tables with an accurate label. Here is how I am trying it, expecting "sysname" to resolve to the column or table name. But after I run this command and view it in Object Explorer, it is listed as
"[<Name of Missing Index, sysname + '_prod',>]".
How do you define index names in a more descriptive fashion? (I am trying to add the suffix "_prod" to the index name, since an index with that name already exists.)
USE [AMDMetrics]
GO
CREATE NONCLUSTERED INDEX
[<Name of Missing Index, sysname + '_prod',>]
ON [SMARTSOLVE].[V_CXP_CUSTOMER_PXP] ([QXP_UDF_STRING_8], [QXP_REPORT_DATE],
[QXP_XRS_DESCRIPTION])
INCLUDE ([QXP_ID], [QXP_EXCEPTION_NO], [QXP_BASE_EXCEPTION], [QXP_CATEGORY],
[QXP_OCCURENCE_DATE], [QXP_COORD_ID], [QXP_SHORT_DESC], [QXP_ROOT_CAUSE],
[QXP_DESCRIPTION], [QXP_QEI_ID], [PXP_LOT_NUMBER], [CXP_ID], [CXP_AWARE_DATE],
[QXP_XSV_CODE], [QXP_COORD_NAME], [PXP_PRODUCT_CODE], [PXP_PRODUCT_NAME],
[QXP_ORU_NAME], [QXP_RESOLUTION_DESC], [QXP_CLOSED_DATE], [CXP_CLIENT_CODE],
[CXP_CLIENT_NAME])
I'm not 100% sure what you are trying to do, but it seems like you are trying to find a way to properly name your index (or find a good naming convention). Conventions are best when they are easy to follow and make sense to people without having to be explained. A lot of different conventions fit that description, but the most common one is this:
Index Type Prefix Complete Index name
-------------------------------------------------------------------
Index (not unique, non clustered) IDX_ IDX_<name>_<column>
Index (unique, non clustered) UDX_ UDX_<name>_<column>
Index (not unique, clustered) CIX_ CIX_<name>_<column>
Index (unique, clustered) CUX_ CUX_<name>_<column>
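Applied to your script, you would simply replace the template placeholder with a literal name of your choosing (a sketch; the name below is just one possibility under the IDX_ convention, and the INCLUDE list is trimmed for brevity):

CREATE NONCLUSTERED INDEX [IDX_V_CXP_CUSTOMER_PXP_QXP_UDF_STRING_8_prod]
ON [SMARTSOLVE].[V_CXP_CUSTOMER_PXP] ([QXP_UDF_STRING_8], [QXP_REPORT_DATE],
[QXP_XRS_DESCRIPTION])
INCLUDE ([QXP_ID], [QXP_EXCEPTION_NO]);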
On a different note, though, I have to question why you have so many columns in your INCLUDE list. Without knowing the size of those columns, there are some drawbacks to adding so many:
Avoid adding unnecessary columns. Adding too many index columns,
key or nonkey, can have the following performance implications:
- Fewer index rows will fit on a page. This could create I/O increases
and reduced cache efficiency.
- More disk space will be required to store the index. In particular,
adding varchar(max), nvarchar(max), varbinary(max), or xml data types
as nonkey index columns may significantly increase disk space requirements.
This is because the column values are copied into the index leaf level.
Therefore, they reside in both the index and the base table.
- Index maintenance may increase the time that it takes to perform modifications,
inserts, updates, or deletes, to the underlying table or indexed view.
You will have to determine whether the gains in query performance outweigh
the effect on performance during data modification and the additional disk
space requirements.
From here: http://msdn.microsoft.com/en-us/library/ms190806.aspx

indexes needed for table inheritance in postgres?

This is a fairly simple question, but it's one I can't find a firm answer on.
I have a parent table in PostgreSQL and several child tables that have been defined. A trigger has been established, and data is only inserted into a child table if a field, say field x, meets a certain criterion.
When I query the parent table with a condition on x, PostgreSQL knows to go straight to the child table that is related to that particular value of x.
That all being said, I don't need to specify a particular index on the column x, do I? PostgreSQL already knows how to sort on it, and by adding an index on x to the parent, PostgreSQL would therefore generate unique indexes on x for each of the new child tables.
Creating that index is a bit redundant, right?
Creating an index on the child table for x, if x only has one value (or a very, very small number of values), is probably a loss, yes. The planner would scan the whole table anyway.
If x is a timestamp and you're specifying a timeframe that may not be a whole partition, or if x is another range or set of values, an index would be a win most likely.
Edit: when I say one value or range of values, I mean per child table.
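As a sketch of the second case (all names and the date range are made up): a child table that holds one month of timestamps benefits from its own index when queries cover only part of that month:

-- hypothetical parent/child setup with x as a timestamp
CREATE TABLE measurements (x timestamp, value numeric);
CREATE TABLE measurements_2015_01 (
    CHECK (x >= '2015-01-01' AND x < '2015-02-01')
) INHERITS (measurements);

-- worthwhile when queries hit sub-ranges of the month
CREATE INDEX measurements_2015_01_x_idx ON measurements_2015_01 (x);

SELECT * FROM measurements
WHERE x >= '2015-01-10' AND x < '2015-01-11';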