Materialized Path is a method for representing hierarchy in SQL. Each node stores a path that encodes its entire ancestry (grandparent/parent/self).
The django-treebeard implementation of MP (docs):
Each step of the path is a fixed length for consistent performance.
Each node contains depth and numchild fields (fast reads at minimal cost to writes).
The path field is indexed (with a standard b-tree index):
The materialized path approach makes heavy use of LIKE in your database, with clauses like WHERE path LIKE '002003%'. If you think that LIKE is too slow, you’re right, but in this case the path field is indexed in the database, and all LIKE clauses that don’t start with a % character will use the index. This is what makes the materialized path approach so fast.
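As a rough sketch (my own table and column names, not treebeard's actual schema), the layout and that descendant lookup might look like this:

-- Hypothetical materialized-path table with a fixed steplen of 3.
CREATE TABLE category (
    id       serial PRIMARY KEY,
    path     varchar(255) NOT NULL,  -- '002', '002003', '002003001', ...
    depth    integer NOT NULL,
    numchild integer NOT NULL DEFAULT 0
);

-- Standard b-tree index on the path.
CREATE INDEX category_path_idx ON category (path);

-- All descendants of the node at '002003': the prefix LIKE (no leading
-- '%') can be answered from the index. (On PostgreSQL with a non-C
-- collation this needs varchar_pattern_ops; more on that below.)
SELECT * FROM category WHERE path LIKE '002003%';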
Implementation of get_ancestors (link):
Match nodes whose path is a prefix of the current node's path (steplen is the fixed length of a step).
# Collect every proper prefix of the node's path: each ancestor's path
# is a prefix of self.path, one steplen shorter per level. The [1:]
# drops position 0 (the empty prefix).
paths = [
    self.path[0:pos]
    for pos in range(0, len(self.path), self.steplen)[1:]
]
return get_result_class(self.__class__).objects.filter(
    path__in=paths).order_by('depth')
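For a node at path '002003001' with a steplen of 3, the ORM call above boils down to SQL along these lines (a sketch against the hypothetical table from earlier):

-- Every proper prefix of '002003001' is an ancestor's path.
SELECT * FROM category
WHERE path IN ('002', '002003')
ORDER BY depth;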
Implementation of get_descendants (link):
Match nodes with a depth greater than or equal to self's and a path that starts with the current path.
# Descendants all share the parent's path as a prefix; ordering by
# path yields depth-first order. Note depth__gte (not gt) also
# matches the parent node itself.
return cls.objects.filter(
    path__startswith=parent.path,
    depth__gte=parent.depth
).order_by('path')
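And the corresponding SQL sketch for descendants of the node at '002003' (depth 2):

-- Prefix match plus a depth filter; ORDER BY path walks the subtree
-- depth-first. The parent itself matches too because of >=.
SELECT * FROM category
WHERE path LIKE '002003%'
  AND depth >= 2
ORDER BY path;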
Potential downsides to this approach:
A deeply nested hierarchy will result in long paths, which can hurt read performance.
Moving a node requires updating the path of every descendant (see the sketch below).
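To make the second point concrete, here is the kind of bulk rewrite a move entails (a sketch; treebeard's actual implementation is more involved):

-- Move the subtree rooted at '002003' under '005', where it becomes
-- '005001'. The LIKE prefix match catches the moved node itself as
-- well as every descendant. (If the node changes level, depth must
-- be updated too.)
UPDATE category
SET path = '005001' || substring(path FROM char_length('002003') + 1)
WHERE path LIKE '002003%';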
Postgres includes the ltree extension which provides a custom GiST index (docs).
I am not clear on what benefits ltree provides over django-treebeard's implementation. This article argues that only ltree can answer the get_ancestors question, but as demonstrated above, finding the ancestors (or descendants) of a node is trivial.
[As an aside, I found this Django ltree library: https://github.com/mariocesar/django-ltree.]
Both approaches use an index (django-treebeard uses b-tree, ltree uses a custom GiST). I am interested in understanding the implementation of the ltree GiST and why it might be a more efficient index than a standard b-tree for this particular use case (materialized path).
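For comparison, a minimal ltree setup looks like this (a sketch; see the ltree docs for the full operator list):

CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE category_lt (
    id   serial PRIMARY KEY,
    path ltree NOT NULL  -- e.g. 'Top.Science.Astronomy'
);

-- The custom GiST index.
CREATE INDEX category_lt_path_gist ON category_lt USING gist (path);

-- Descendants of 'Top.Science' (the node itself included):
SELECT * FROM category_lt WHERE path <@ 'Top.Science';

-- Ancestors of 'Top.Science.Astronomy' (again, self included):
SELECT * FROM category_lt WHERE path @> 'Top.Science.Astronomy';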
Additional links
What are the options for storing hierarchical data in a relational database?
https://news.ycombinator.com/item?id=709970
TL;DR Reusable labels, complex search patterns, and ancestry searches against multiple descendant nodes (or a single node whose path hasn't yet been retrieved) can't be accomplished using a materialized path index.
For those interested in the gory details...
Firstly, your question is only relevant if you are not reusing any labels in your node description. If you were, the l-tree is really the only option of the two. But materialized path implementations don't typically need this, so let's put that aside.
One obvious difference will be in the flexibility in the types of searches that l-tree gives you. Consider these examples (from the ltree docs linked in your question):
foo        Match the exact label path foo
*.foo.*    Match any label path containing the label foo
*.foo      Match any label path whose last label is foo
The first query is obviously achievable with materialized path. The last is also achievable if you restructure the query as a sibling lookup. The middle case, however, isn't directly achievable with a single index lookup. You'd either have to break it into two queries (all descendants + all ancestors), or resort to a table scan.
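With ltree, that middle case is a single indexed lquery match (a sketch, assuming an ltree column named path with a GiST index on it):

-- Any label path containing the label foo, in one index-supported pass:
SELECT * FROM tree WHERE path ~ '*.foo.*'::lquery;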
And then there are really complex queries like this one (also from the docs):
Top.*{0,2}.sport*@.!football|tennis.Russ*|Spain
A materialized path index would be useless here, and a full table scan would be required to handle this. l-tree is the only option if you want to perform this as a SARGable query.
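For completeness, even that pattern stays a single indexed match in ltree (same assumed table as above):

SELECT * FROM tree
WHERE path ~ 'Top.*{0,2}.sport*@.!football|tennis.Russ*|Spain'::lquery;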
But for the standard hierarchical operations, finding any of:
parent
children
descendants
root nodes
leaf nodes
materialized path will work just as well as l-tree. Contrary to the article linked above, searching for all descendants of a common ancestor is very doable using a b-tree. The query format WHERE path LIKE 'A.%' is SARGable provided your index is prepared properly (I had to explicitly tag my path index with varchar_pattern_ops to get this to work).
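That preparation looks like this (placeholder table/column names):

-- The *_pattern_ops operator classes compare strings character by
-- character, ignoring collation, which is what makes LIKE 'A.%'
-- able to use the index under a non-C collation.
CREATE INDEX tree_path_pattern_idx ON tree (path varchar_pattern_ops);

SELECT * FROM tree WHERE path LIKE 'A.%';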
What is missing from this list is finding all ancestors for a descendant. The query format WHERE 'A.B.C.D' LIKE path || '.%' is unfortunately not going to use the index. One workaround that some libraries implement is to parse out the ancestor nodes from the path, and query them directly: WHERE id IN ('A', 'B', 'C'). However, this will only work if you're targeting ancestors of a specific node whose path you have already retrieved. l-tree is going to win on this one.
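A sketch of both sides of that trade-off (placeholder names again):

-- Materialized path: only possible once the node's path is already in
-- hand, because the ancestor keys are parsed out client-side.
SELECT * FROM tree WHERE id IN ('A', 'B', 'C');

-- ltree: one indexed ancestor query, no client-side parsing required.
SELECT * FROM tree WHERE path @> 'A.B.C.D';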
Related
I have an ltree column containing a tree with a depth of 3. I'm trying to write a query that can select all children at a specific depth (level 1 = get all parents, 2 = get all children, 3 = get all grandchildren). I know this is pretty straightforward with n_level:
SELECT path FROM hierarchies
WHERE
nlevel(path) = 1
LIMIT 1000;
I have 200,000 dummy records and it's pretty fast (~170 ms). However, this query uses a sequential scan. I think it'd be better to write it in a way that takes advantage of the ltree operators supported by the GiST index. Frustratingly, I can't seem to wrap my brain around them, and I haven't found a similar question on SO or DBA (besides this one on finding leaves)
Any advice is appreciated!
The only index that could support your query is a simple b-tree index on an expression.
create index on hierarchies((nlevel(path)))
Note however that it is quite possible for the planner to choose a sequential scan anyway, for example when rows at level 1 account for a large share of the table.
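You can check what the planner does with it via EXPLAIN (a sketch):

EXPLAIN
SELECT path FROM hierarchies
WHERE nlevel(path) = 1
LIMIT 1000;
-- Expect an index scan on the expression index when level-1 rows are
-- rare, and a seq scan when they make up much of the table.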
I have a Postgres query where we have several indices set up, including one on a text field where we have a GIN index. My understanding of this based on the pg_trgm documentation is that it's only applicable if the search string is made up of alphanumeric text. Testing bears this out and in a database with tens of millions of records, doing something like the following works great:
SELECT * FROM my_table WHERE target_field LIKE '%foo%'
I've read in various places that anything that's not an alphanumeric string is treated as a separate word in the trigram search, so something like the following also works quite well:
SELECT * FROM my_table WHERE target_field LIKE '%foo & bar%'
However, someone ran a search that was literally just three question marks in a row, and it triggered a full table scan. For some reason, when multiple ampersands or question marks are used on their own, they're treated differently than a single one placed next to or among actual alphanumeric characters.
The research I've done implies it might be how some database drivers handle the question mark: they sometimes interpret it as a parameter that needs to be supplied, then get confused when they can't find the parameters and trigger a table scan. I don't really believe this is the case. I'd be more inclined to believe it would throw an error rather than complete the query; running it anyway seems like a design flaw.
What makes more sense is that a question mark isn't an alpha-numeric character and thus it's treated differently. In some technologies, common symbols such as & are considered alpha-numeric, but I don't think that's the case with Postgres. In fact, the documentation suggests that non-alphanumeric characters are treated as word boundaries in a GIN-based index.
What's weird is that I can search for %foo & bar%, which seems to work fine. I can even search for %&% and it returns quickly, though not with the results I wanted. But if I put (for example) three of them together like this: %&&&%, it triggers a full table scan.
After running various experiments, here's what I've seen:
1. %%: uses the index
2. %&%: uses the index
3. %?%: uses the index
4. %foo & bar%: uses the index
5. %foo ? bar%: uses the index
6. %foo && bar%: uses the index
7. %foo ?? bar%: uses the index
8. %&&%: triggers a full table scan
9. %??%: triggers a full table scan
10. %foo&bar%: uses the index, but returns no results
I think that all of those make sense until you get to #8 and #9. And if the ampersand were a word boundary, shouldn't #10 return results?
Anyone have an explanation of why multiple consecutive punctuation characters would be treated differently than a single punctuation character?
I can't reproduce this in v11 on a table full of md5 hashes: I get seq scans (full table scans) for the first 3 of your patterns.
If I force them to use the index by setting enable_seqscan=false, then I do get it to use the index, but it is actually slower than doing the seq scan. So the planner made the right call there. How about for you? You shouldn't force it to use the index on principle when it is actually slower.
It would be interesting to see the estimated number of rows it thinks it will return for all of those examples.
In fact, the documentation suggests that non-alphanumeric characters are treated as word boundaries in a GIN-based index.
The G in GIN is for "generalized". You can't make blanket statements like that about something which is generalized. They don't even need to operate on text at all. But in your case, you are using the LIKE operator, and the LIKE operator doesn't care about word boundaries. Any GIN index which claims to support the LIKE operator must return the correct results for the LIKE operator. If it can't do that, then it is a bug for it to claim to support it.
It is true that pg_trgm treats & and ? the same as white space when extracting trigrams, but it is obliged to insulate LIKE from the effects of this decision. It does this in two ways. One is that it returns "MAYBE" results, meaning all the tuples it reports must be rechecked to see if they actually satisfy the LIKE. So '%foo&bar%' and '%foo & bar%' will return the same set of tuples to the heap scan, but the heap scan will recheck them and so finally return a different set to the user, depending on which ones survive the recheck. The second is that if pg_trgm can't extract any trigrams at all out of the query string, then it must return the entire table to then be rechecked. This is what would happen with '%%', '%?%', '%??%', etc. Of course rechecking all rows is slower than just doing the seq scan in the first place.
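pg_trgm's show_trgm function makes this easy to see for yourself (a sketch; run it locally to confirm):

-- Trigrams are extracted per alphanumeric "word"; & and ? act as
-- separators, so these two produce the same trigram set:
SELECT show_trgm('foo & bar');
SELECT show_trgm('foo&bar');

-- A punctuation-only string yields no trigrams at all, so the index
-- can only answer "maybe" for every row:
SELECT show_trgm('&&');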
I have two tables with data coming from two different sources. One field in each table contains the title of a movie, but for reasons out of my control, the titles are not always exactly the same.
So I use tsvectors to get rid of the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two tsvectors without taking the numeric values into account, just the text content. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare tsvectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.
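For reference, the comparison itself would look something like this (a sketch against the fiddle's tbl1/tbl2 with their ts_title columns):

SELECT t1.*, t2.*
FROM tbl1 t1
JOIN tbl2 t2
  ON strip(t1.ts_title) = strip(t2.ts_title);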
In a regular b-tree index, a leaf node contains a key and a pointer to the heap tuple (user table row), which means that in a b-tree the relationship between index tuple and table row is one-to-one.
Just like in a b-tree, a GiST leaf node also contains a key datum and info about where the heap tuple is stored, but GiST leaves may or may not contain the entire row data in their keys (please correct me if I'm wrong). So would it be possible to store one part of my table data in one leaf node and the other part in another leaf node, and make both of them point to the same heap tuple? That would make the relationship between GiST index tuple and heap tuple many-to-one.
Is all this correct?
A GiST index is a generalization of a B-tree index.
In a non-leaf block of a B-tree index, two consecutive index entries define the boundary for the indexed values in the subtree at the destination of the pointer between these index entries:
In other words, each pointer to the next lower level is labeled with an interval that contains all values in the subtree.
This only works for data types with a total ordering.
The GiST index extends that concept. Each entry in a non-leaf node has a condition that the subtree under that index entry has to satisfy.
When scanning a GiST index, I search the index page for all entries that may contain values matching my search condition. Since there is no total ordering, it is possible (but of course not desirable) that the conditions somehow “overlap” so that something I search for can have matches in more than one of the entries. In that case I have to descend into all the referenced subtrees, but I can skip those where the entry's condition guarantees that the subtree cannot contain entries that match my search condition.
This is a little abstract, so let's flesh it out with an example.
One of the classical examples of a GiST index is an R-tree index, a kind of geographical index like it is used by PostGIS:
Here the condition of an index entry is a bounding box that contains the bounding boxes of all geometries in the subtree of that index entry. So when searching for a geometry, I take its bounding box and see which of the index entries in a page contain this bounding box. These are the subtrees into which I have to descend.
One thing that can be seen in this example is that a GiST index can be lossy, that is, it gives me a necessary, but not sufficient, condition for a hit. The leaf entries found in a GiST index scan always have to be rechecked to see whether the actual table entry also satisfies the condition (not every geometry is a rectangle). This is why a GiST index scan is always a bitmap index scan in PostgreSQL.
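In SQL terms, the R-tree example corresponds to something like this (a sketch using PostgreSQL's built-in geometric box type, which has GiST support out of the box):

CREATE TABLE shapes (
    id   serial PRIMARY KEY,
    bbox box NOT NULL  -- bounding box of some geometry
);

CREATE INDEX shapes_bbox_gist ON shapes USING gist (bbox);

-- "Which bounding boxes overlap this search box?" The scan descends
-- only into subtrees whose entry condition (an enclosing box) can
-- overlap the search box.
SELECT * FROM shapes WHERE bbox && box '((0,0),(10,10))';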
This all sounds nice and simple. The hard part of a good GiST index is the picksplit algorithm, which decides, when an index page splits, which index entries go into which of the two new pages. The better this works, the more efficient the index will be.
So you see, a GiST index is “somewhat like” a B-tree index in many respects. You can see a B-tree index as an optimized special case of a GiST index (see the btree-gist contrib module).
This lets me answer your questions:
GiST leaf node also contains key datum and info about where the heap tuple is stored
This is true.
GiST leaves may or may not contain entire row data in its keys
Of course the index entry does not contain the entire row. But I think you mean the right thing. The condition in the GiST leaf can be broader than the actual object in the table, like a bounding box is bigger than a geometry.
if I am able to store one part of my table data in one leaf node and the other part in another leaf node and make both of them point to one heap tuple, would it be possible? This will make the relationship between GiST index tuple and heap tuple many to one.
This is wrong. Even though a value may satisfy several of the entries in a GiST index page, it is only contained in one of the subtrees, and only one leaf page entry points to any given table row. It is a one-to-one relationship, just like in a B-tree index.
I am trying to create an index on one of my tables with an accurate label. Here is how I am trying it, expecting "sysname" to resolve to the column or table name. But after I run this command and view it in Object Explorer, it is listed as
"[<Name of Missing Index, sysname + '_prod',>]".
How do you define index names in a more descriptive fashion? (I am trying to add the suffix "_prod" to the index name, since an index with that name already exists.)
USE [AMDMetrics]
GO
CREATE NONCLUSTERED INDEX
[<Name of Missing Index, sysname + '_prod',>]
ON [SMARTSOLVE].[V_CXP_CUSTOMER_PXP] ([QXP_UDF_STRING_8], [QXP_REPORT_DATE],
[QXP_XRS_DESCRIPTION])
INCLUDE ([QXP_ID], [QXP_EXCEPTION_NO], [QXP_BASE_EXCEPTION], [QXP_CATEGORY],
[QXP_OCCURENCE_DATE], [QXP_COORD_ID], [QXP_SHORT_DESC], [QXP_ROOT_CAUSE],
[QXP_DESCRIPTION], [QXP_QEI_ID], [PXP_LOT_NUMBER], [CXP_ID], [CXP_AWARE_DATE],
[QXP_XSV_CODE], [QXP_COORD_NAME], [PXP_PRODUCT_CODE], [PXP_PRODUCT_NAME],
[QXP_ORU_NAME], [QXP_RESOLUTION_DESC], [QXP_CLOSED_DATE], [CXP_CLIENT_CODE],
[CXP_CLIENT_NAME])
I'm not 100% sure what you are trying to do, but it seems like you are trying to find a way to properly name your index (or find a good naming convention). Conventions are best when they are easy to follow, and make sense to people without having to explain it to them. A lot of different conventions fit this MO, but the one that is most common is this:
Index Type Prefix Complete Index name
-------------------------------------------------------------------
Index (not unique, non clustered) IDX_ IDX_<name>_<column>
Index (unique, non clustered) UDX_ UDX_<name>_<column>
Index (not unique, clustered) CIX_ CIX_<name>_<column>
Index (unique, clustered) CUX_ CUX_<name>_<column>
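Applied to your statement, you would simply replace the generated placeholder with a hand-written name following the convention, e.g. (the name below is just an illustration):

CREATE NONCLUSTERED INDEX IDX_V_CXP_CUSTOMER_PXP_QXP_UDF_STRING_8_prod
ON [SMARTSOLVE].[V_CXP_CUSTOMER_PXP]
    ([QXP_UDF_STRING_8], [QXP_REPORT_DATE], [QXP_XRS_DESCRIPTION])
INCLUDE ([QXP_ID], [QXP_EXCEPTION_NO]);  -- plus the rest of your INCLUDE list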
Although, on a different note, I have to question why you have so many columns in your INCLUDE list. Without knowing the size of those columns, there are some drawbacks to adding so many:
Avoid adding unnecessary columns. Adding too many index columns,
key or nonkey, can have the following performance implications:
- Fewer index rows will fit on a page. This could create I/O increases
and reduced cache efficiency.
- More disk space will be required to store the index. In particular,
adding varchar(max), nvarchar(max), varbinary(max), or xml data types
as nonkey index columns may significantly increase disk space requirements.
This is because the column values are copied into the index leaf level.
Therefore, they reside in both the index and the base table.
- Index maintenance may increase the time that it takes to perform modifications,
inserts, updates, or deletes, to the underlying table or indexed view.
You will have to determine whether the gains in query performance outweigh
the affect to performance during data modification and in additional disk
space requirements.
From here: http://msdn.microsoft.com/en-us/library/ms190806.aspx