What is the correct way to utilize a previously created vertex index within a traversal?
I want to find the vertex with a specific key/value, so I created a unique index on a specific vertex class.
My traversal works, but I have the feeling the index is not being utilized (I inspected the call tree).
Vertex v = getGraph().getRawTraversal().V().has("testkey", P.eq("testvalue")).next();
My main reference is the orientdb-gremlin unit tests:
https://github.com/orientechnologies/orientdb-gremlin/blob/58e604b1b5f5d26e545f7a4beb9640b8a342d822/driver/src/test/java/org/apache/tinkerpop/gremlin/orientdb/OrientGraphIndexTest.java#L157
The main doc does not yet contain any references to index usage (presumably because 3.x has not yet been released):
https://orientdb.com/docs/3.0.x/tinkerpop3/OrientDB-TinkerPop3.html
Materialized Path is a method for representing hierarchy in SQL. Each node stores its full path, which encodes the node itself and all of its ancestors (grandparent/parent/self).
The django-treebeard implementation of MP (docs):
Each step of the path is a fixed length for consistent performance.
Each node contains depth and numchild fields (fast reads at minimal cost to writes).
The path field is indexed (with a standard b-tree index):
The materialized path approach makes heavy use of LIKE in your database, with clauses like WHERE path LIKE '002003%'. If you think that LIKE is too slow, you’re right, but in this case the path field is indexed in the database, and all LIKE clauses that don’t start with a % character will use the index. This is what makes the materialized path approach so fast.
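To illustrate why a LIKE clause with a literal prefix can use a b-tree index, here is a minimal sketch that stands in for the index with a sorted Python list. The function name prefix_range_scan and the sample paths are made up for this illustration; a real database does the equivalent range scan internally, and this is not how django-treebeard or any database is implemented.

```python
from bisect import bisect_left


def prefix_range_scan(sorted_paths, prefix):
    """Return all paths starting with `prefix` using two binary searches.

    `sorted_paths` plays the role of the b-tree: because entries are kept in
    sorted order, every string sharing a prefix sits in one contiguous run.
    Assumes `prefix` is non-empty.
    """
    lo = bisect_left(sorted_paths, prefix)
    # Smallest string that sorts after every string with this prefix:
    # bump the last character of the prefix by one code point.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_paths, upper)
    return sorted_paths[lo:hi]


paths = sorted([
    "001", "001001", "001002",
    "002", "002003", "002003001", "002004",
    "003",
])
matches = prefix_range_scan(paths, "002003")
```

A pattern that starts with % has no literal prefix, so there is no contiguous run to locate and every entry has to be checked.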
Implementation of get_ancestors (link):
Match nodes whose path is a proper prefix of the current path (steplen is the fixed length of a step).
paths = [
    self.path[0:pos]
    for pos in range(0, len(self.path), self.steplen)[1:]
]
return get_result_class(self.__class__).objects.filter(
    path__in=paths).order_by('depth')
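As a standalone sketch of the same slicing logic, without Django, the ancestor paths can be computed from a path string alone. The function name ancestor_paths and the sample path are assumptions for illustration only.

```python
def ancestor_paths(path, steplen):
    """List the paths of all ancestors of `path`, root first.

    Each ancestor's path is a prefix whose length is a multiple of `steplen`
    and strictly shorter than the node's own path.
    """
    return [path[:pos] for pos in range(steplen, len(path), steplen)]


# A node three levels deep, with 4-character steps.
ancestors = ancestor_paths("000100020003", 4)
```

The resulting list is what gets passed to the path__in filter, which is an exact-match lookup on the indexed column rather than a LIKE.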
Implementation of get_descendants (link):
Match nodes with a depth greater than or equal to the parent's depth and a path which starts with the parent's path.
return cls.objects.filter(
    path__startswith=parent.path,
    depth__gte=parent.depth
).order_by(
    'path'
)
Potential downsides to this approach:
A deeply nested hierarchy will result in long paths, which can hurt read performance.
Moving a node requires updating the path of all descendants.
Postgres includes the ltree extension which provides a custom GiST index (docs).
It is not clear to me what benefits ltree provides over django-treebeard's implementation. This article argues that only ltree can answer the get_ancestors question, but as demonstrated earlier, figuring out the ancestors (or descendants) of a node is trivial.
[As an aside, I found this Django ltree library - https://github.com/mariocesar/django-ltree].
Both approaches use an index (django-treebeard uses b-tree, ltree uses a custom GiST). I am interested in understanding the implementation of the ltree GiST and why it might be a more efficient index than a standard b-tree for this particular use case (materialized path).
Additional links
What are the options for storing hierarchical data in a relational database?
https://news.ycombinator.com/item?id=709970
TL;DR Reusable labels, complex search patterns, and ancestry searches against multiple descendant nodes (or a single node whose path hasn't yet been retrieved) can't be accomplished using a materialized path index.
For those interested in the gory details...
Firstly, your question is only relevant if you are not reusing any labels in your node description. If you were, the l-tree is really the only option of the two. But materialized path implementations don't typically need this, so let's put that aside.
One obvious difference will be in the flexibility in the types of searches that l-tree gives you. Consider these examples (from the ltree docs linked in your question):
foo        Match the exact label path foo
*.foo.*    Match any label path containing the label foo
*.foo      Match any label path whose last label is foo
The first query is obviously achievable with materialized path. The last is also achievable, where you'd adjust the query as a sibling lookup. The middle case, however, isn't directly achievable with a single index lookup. You'd either have to break this up into two queries (all descendants + all ancestors), or resort to a table scan.
And then there are really complex queries like this one (also from the docs):
Top.*{0,2}.sport*#.!football|tennis.Russ*|Spain
A materialized path index would be useless here, and a full table scan would be required to handle this. l-tree is the only option if you want to perform this as a SARGable query.
But for the standard hierarchical operations, finding any of:
parent
children
descendants
root nodes
leaf nodes
materialized path will work just as well as l-tree. Contrary to the article linked above, searching for all descendants of a common ancestor is very doable using a b-tree. The query format WHERE path LIKE 'A.%' is SARGable provided your index is prepared properly (I had to explicitly tag my path index with varchar_pattern_ops to get this to work).
What is missing from this list is finding all ancestors for a descendant. The query format WHERE 'A.B.C.D' LIKE path || '.%' is unfortunately not going to use the index. One workaround that some libraries implement is to parse out the ancestor nodes from the path, and query them directly: WHERE id IN ('A', 'B', 'C'). However, this will only work if you're targeting ancestors of a specific node whose path you have already retrieved. l-tree is going to win on this one.
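The workaround of parsing ancestors out of an already-retrieved path can be sketched as follows. The function name ancestor_label_paths is hypothetical, and it returns full ancestor paths rather than bare ids, on the assumption that the path column is what you would match against.

```python
def ancestor_label_paths(path, sep="."):
    """Split a dotted label path into the full paths of its ancestors."""
    labels = path.split(sep)
    # Every proper prefix of the label list is one ancestor, root first.
    return [sep.join(labels[:i]) for i in range(1, len(labels))]


ancestors = ancestor_label_paths("A.B.C.D")
```

The result can be used in an exact-match IN clause, which a b-tree handles well. The limitation described above still applies: the path has to be in hand before the query is built.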
UPDATE GeoAgentSummary set out = #45:0, in = #21:0, _2015 = sum(_2015, 10.0f) upsert where out = #45:0 and in = #21:0
I am using the above query to either create an edge (if it does not exist) or update the edge if it already exists in OrientDB.
An edge is created between #45:0 and #21:0.
But the Agent vertex #45:0 does not show any outgoing edges.
Agent is a vertex class (with clusters 45, 46, 47 and 48).
I know that this question is three years old, but for anyone else who finds it through a search:
You can use “upsert” for edges since version 3.0.1 and it will work properly, but you need to do the following:
Create a unique index on edge_class (out, in). Strangely, the order of the properties is important!
To do this, you need to create the in and out properties first; otherwise the database can't create the index and the “Create index” command will throw an exception.
Then, use command CREATE EDGE UPSERT FROM TO .
In this case the edge will be created only if it does not already exist, and the in and out properties will be created for the vertex classes.
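The role the unique (out, in) index plays in an upsert can be sketched with an in-memory stand-in. The class EdgeStore and its upsert method are invented for this illustration and are not OrientDB's API or implementation; the dictionary key simply mimics the unique index.

```python
class EdgeStore:
    """Toy edge store where a dict keyed on (out, in) acts as the unique index."""

    def __init__(self):
        self._by_endpoints = {}

    def upsert(self, out_rid, in_rid, **props):
        """Create the edge if no edge joins these endpoints, else update it.

        Returns (edge, created) where `created` says whether a new edge was made.
        """
        key = (out_rid, in_rid)
        edge = self._by_endpoints.get(key)
        created = edge is None
        if created:
            edge = {"out": out_rid, "in": in_rid}
            self._by_endpoints[key] = edge
        edge.update(props)
        return edge, created


store = EdgeStore()
first, first_created = store.upsert("#45:0", "#21:0", _2015=10.0)
second, second_created = store.upsert("#45:0", "#21:0", _2015=20.0)
```

Without a unique key on the endpoint pair there is nothing to decide whether an incoming edge is new or a repeat, which is why the index has to exist first.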
But it still doesn't work for the UPDATE command because, as the authors said, “The UPDATE/UPSERT works at document level, so it doesn't create the connections from the vertices. Using it, you will have a broken graph”, and that is still the case.
The UPDATE command acts like a normal document update without taking care of keeping the edge-vertex "synchronization". To do that you'd have to use the UPDATE EDGE that, however, doesn't support the UPSERT.
There is an open issue on GitHub about that: https://github.com/orientechnologies/orientdb/issues/4436
Read also this https://github.com/orientechnologies/orientdb/issues/1114
I'm new to Titan and looking for the best way to iterate over the entire set of vertices with a given label without running out of memory. I come from a strong SQL background so I am still working on switching my way of thinking away from SQL-type thinking. Let's say I have 1 million profile vertices. I would like to iterate over each one and perform some type of statistical analysis of the information linked to each profile. I don't really care how long the entire analysis process takes, but I need to iterate over all of the profiles. In SQL I would do SELECT * FROM MY_TABLE, using a scroll-sensitive result, fetch the next result, grab and process the info linked to that row, then fetch the next result. I also don't care if the result is real-time accurate as it is just for gathering general stats, so if a new profile is added during iteration and I miss it, that's ok.
Even if there is a way to grab all the values for a given property, that would probably work too because then I could go through that list and grab each vertex by its ID for example.
I believe Titan does lazy loading, so you should be able to just iterate over the whole graph:
GraphTraversal<Vertex, Vertex> it = graph.traversal().V();
while (it.hasNext()) {
    Vertex v = it.next();
    // Do what you want here
}
Another option would be to use the range step so that you explicitly choose the range of vertices you need. For example:
List<Vertex> vertices = graph.traversal().V().range(0, 3).toList();
//Do what you want with your batch of vertices.
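The batching idea is independent of the graph database: any lazy iterator can be consumed in fixed-size chunks so that only one chunk is held in memory at a time. A minimal sketch, where the function name in_batches is made up and a plain range stands in for the vertex iterator:

```python
from itertools import islice


def in_batches(iterable, batch_size):
    """Yield lists of up to `batch_size` items without materializing the source."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a lazy stream of vertices.
batches = list(in_batches(range(7), 3))
```

Each batch can be processed and discarded before the next one is pulled from the iterator.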
With regard to getting vertices of a specific type, you can query vertices based on their internal properties. For example, if you have an internal property "TYPE" which defines the type you are interested in, you can query for those vertices with:
graph.traversal().V().has("TYPE", "A"); //Gets vertices of type A
graph.traversal().V().has("TYPE", "B"); //Gets vertices of type B
I'm taking my first steps as a TitanDB user. That said, I'd like to know how to do an upsert / conditionally insert a vertex inside a TitanTransaction (in the style of "get or create").
I have a unique index on the vertex/property I want to create/lookup.
Here's a one-liner "getOrCreate" for Titan 1.0 and TinkerPop 3:
getOrCreate = { id ->
    g.V().has('userId', id).tryNext().orElseGet{ g.addV('userId', id).next() }
}
This is taken from the new TinkerPop "Getting Started" tutorial. Here is the same code translated to Java:
public Vertex getOrCreate(Object id) {
    // Java string literals need double quotes; single quotes are char literals.
    return g.V().has("userId", id).tryNext().orElseGet(() -> g.addV("userId", id).next());
}
Roughly speaking, every Cassandra insert is an "upsert". If you look at the Titan representation of vertices and edges in a Cassandra-like model, you'll find vertices and edges each get their own rows. This means a blind write of an edge will have the given behavior you're looking for: what you write is what will win. Doing this with a vertex isn't supported directly by Titan.
But I don't think this is what you're looking for. If you're looking to enforce uniqueness, why not use the unique() modifier on a Titan composite index? (From the documentation):
mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
With a Cassandra storage backend, you need to enable consistency locking. As with any database, managing the overhead of uniqueness comes at a cost which you need to consider when writing your data. This way if you insert a vertex which violates your uniqueness requirement, the transaction will fail.
If I create an index according to the docs (http://s3.thinkaurelius.com/docs/titan/0.5.4/indexes.html) without making it unique, is it possible to make it unique afterwards? I have not added any vertices or edges to the graph, just created the index.
Something like:
index = mgmt.getGraphIndex('name')
index.unique()
I am using the Gremlin console to make these changes.
Is it possible to do this somehow?
This is a documented limitation of Titan.
Ref : http://s3.thinkaurelius.com/docs/titan/0.5.0/limitations.html
section - 14.2.1. Unable to Drop Indices
Since no vertices or edges have been added to the graph, try the Gremlin commands below.
g.V.remove() or g.V.each{g.removeVertex(it)}
g.commit()
Then try to create the indexes again with .unique().
If you are still unable to re-create the indices, try clearing the storage backend.
In the case of Cassandra: "DROP KEYSPACE titan;"
This should work; I tried it in Titan 0.4 and it worked.