Filtering a return group by traversal in ArangoDB

I'm in the process of evaluating ArangoDB to be used instead of OrientDB. My dataset is essentially a forest of not-necessarily connected trees (a family tree).
Because the dataset is a directed acyclic graph (a tree), it's always more efficient to walk up the tree looking for something than down the tree.
In earlier versions of OrientDB, before they removed this critical feature for me, I was able to do the following query:
SELECT FROM Person WHERE haircolor = "Red" and in traverse(0, -1, "in") (birth_country = "Ireland")
Since haircolor is an indexed field, it's efficient to get all of those vertices. The magic is in the traverse operator within the WHERE clause, which stops traversal and immediately returns TRUE if it locates any ancestor from Ireland.
Yes, you can turn it around and look for all those from Ireland, and then walk downward looking for those pesky redheads, returning them, but it is substantially less efficient, since you have to evaluate every downward path, which potentially expands exponentially.
Since OrientDB shot themselves in the foot (in my opinion) by taking that feature out, I'm wondering if there's an ArangoDB query that would do a similar task without walking down the tree.
Thanks in advance for your help!

In AQL, it would go something like this:
FOR redhead IN Person // Find start vertices
  FILTER redhead.haircolor == "Red"
  FOR v, e, p IN 1..99 INBOUND redhead Ancestor // Traversal up to depth 99
    PRUNE v.birth_country == "Ireland" // Don't walk further if condition is met
    RETURN p // Return the entire path
This assumes that the relations (edges) are stored in an edge collection called Ancestor.
PRUNE prevents further traversal along the path (downward, or here: upward), but the vertex at which the condition matched is still included.
https://www.arangodb.com/docs/stable/aql/graphs-traversals.html#pruning
Note that the variable-depth traversal returns not only the longest paths but also the "intermediate" paths of the same route. You may want to filter them out on the client side, or take a look at this AQL solution, at the cost of additional traversals: https://stackoverflow.com/a/64931939/2044940
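If you only want the complete paths that actually end at an Irish ancestor, a minimal variant of the query above (same assumed Person and Ancestor collections) pairs PRUNE with a FILTER on the same condition:
FOR redhead IN Person
  FILTER redhead.haircolor == "Red"
  FOR v, e, p IN 1..99 INBOUND redhead Ancestor
    PRUNE v.birth_country == "Ireland" // stop walking up at the first match
    FILTER v.birth_country == "Ireland" // emit only paths that end at a match
    RETURN p
PRUNE alone still emits the shorter paths visited on the way up; the added FILTER restricts the output to paths whose last vertex satisfies the condition.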

Related

How to ensure that only one item is added to janusgraph

Is there a way that I can ensure that any creation of a vertex in janusgraph with a given set of properties only results in one such vertex being created?
Right now, what I do is I traverse the graph and ensure that the number of vertices I find with particular properties is only one. For example:
val g = graph.traversal
val vertices = g.V().has("type", givenType).has("name", givenName).toList
if (vertices.size > 1) {
  // the vertex is not unique, cannot add vertex
}
This can be done with the so-called "get or create" traversal, which is described in TinkerPop's Element Existence recipe and in the section Using coalesce to only add a vertex if it does not exist of the Practical Gremlin book.
For your example, this traversal would look like this:
g.V().has("type", givenType).has("name", givenName).
  fold().
  coalesce(unfold(),
           addV("yourVertexLabel").
             property("type", givenType).
             property("name", givenName))
Note, however, that it depends on the graph provider whether this is an atomic operation or not. In the case of JanusGraph, the existence check and the conditional vertex addition are executed as two different operations, which can lead to a race condition when two threads execute this traversal at the same time; you can then still end up with two vertices with these properties. So you currently need to ensure that two threads can't execute this traversal for the same properties in parallel, e.g., with locks in your application.
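As an illustration only (the key scheme, class, and label below are made up, not JanusGraph API), a single-JVM way to serialize the get-or-create per (type, name) pair could look like this:
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.addV;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.unfold;

public class GetOrCreate {
    // one lock per (type, name) key, so only writers of the same key block each other
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS =
            new ConcurrentHashMap<>();

    public static void getOrCreate(Graph graph, GraphTraversalSource g,
                                   String givenType, String givenName) {
        ReentrantLock lock = LOCKS.computeIfAbsent(
                givenType + "|" + givenName, k -> new ReentrantLock());
        lock.lock();
        try {
            g.V().has("type", givenType).has("name", givenName).
                    fold().
                    coalesce(unfold(),
                             addV("yourVertexLabel").
                                     property("type", givenType).
                                     property("name", givenName)).
                    next();
            graph.tx().commit(); // commit before releasing the lock
        } finally {
            lock.unlock();
        }
    }
}
This only serializes threads within a single JVM; across multiple application instances you would need the distributed locking mentioned below.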
I just published a blog post about exactly this topic, How to Avoid Doppelgängers in a Graph Database, if you want more information in general. It also describes distributed locking as a way to implement locks for distributed systems and discusses possible improvements to better support upserts in JanusGraph in the future.

Bipartite graph distributed processing with dynamic programming

I am trying to figure out an efficient algorithm for processing Documents in a distributed (FaaS, to be more precise) environment.
A brute-force approach would be O(D * F * R), where:
D is the number of Documents to process
F is the number of Filters
R is the highest number of Rules in a single Filter
I can assume that:
a single Filter has no more than 10 Rules
some Filters may share Rules (so it's an N-to-N relation)
Rules are boolean functions (predicates), so I can take advantage of early cutting: if I have f() && g() && h() with f() evaluating to false, then I do not have to evaluate g() and h() at all and can return false immediately
in a single Document the number of Fields is always the same (about 5-10)
Filters, Rules, and Documents are already in the database
every Filter has at least one Rule
Using sharing (second assumption), my first idea was to evaluate every Rule against the Document, and then, for every Filter, compute its result from the already-computed Rules. This way, if a Rule is shared, I compute it only once. However, it doesn't take advantage of early cutting (third assumption).
My second idea is to use early cutting as a slightly optimized brute force, but then it doesn't use Rule sharing.
Rule sharing looks like subproblem sharing, so memoization and dynamic programming will probably be helpful (see the sketch below).
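As a sketch only (the Document and Rule shapes here are made up for illustration), lazy memoization can combine both ideas: a shared Rule is evaluated at most once per Document, while the short-circuiting loop still gives early cutting for conjunctive Filters:
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class FilterEvaluator {
    // ruleId -> predicate over a Document (modeled here as a field map)
    private final Map<String, Predicate<Map<String, String>>> rules;

    FilterEvaluator(Map<String, Predicate<Map<String, String>>> rules) {
        this.rules = rules;
    }

    // The cache is per Document and shared across all Filters, so a shared Rule
    // is evaluated at most once; Rules nobody needs are never evaluated at all.
    boolean matches(List<String> filterRuleIds, Map<String, String> doc,
                    Map<String, Boolean> cache) {
        for (String ruleId : filterRuleIds) {
            boolean value = cache.computeIfAbsent(ruleId, id -> rules.get(id).test(doc));
            if (!value) {
                return false; // early cutting: remaining Rules are skipped
            }
        }
        return true;
    }
}
Create a fresh HashMap<String, Boolean> cache per Document and pass it to matches for every Filter; early cutting then only skips Rules that no other Filter has already forced into the cache.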
I have noticed that the Filter-Rule relation is a bipartite graph. I'm not quite sure whether that can help me, though. I have also noticed that I could use reverse sets and store the corresponding Filter set in every Rule. However, this would create a circular dependency and may cause desynchronization problems in the database.
The default idea is that Documents are streamed, and every single one of them is an event that spins up a FaaS instance to process it. However, this would probably force every FaaS instance to query for all Filters, which leaves me at O(F * D) queries because of the shared-nothing architecture.
Sample Filter:
{
  'normalForm': 'CONJUNCTIVE',
  'rules': [
    {
      'isNegated': true,
      'field': 'X',
      'relation': 'STARTS_WITH',
      'value': 'G',
    },
    {
      'isNegated': false,
      'field': 'Y',
      'relation': 'CONTAINS',
      'value': 'KEY',
    },
  ],
}
or in more condensed form:
document -> !document.x.startsWith("G") && document.y.contains("KEY")
for Document:
{
  'x': 'CAR',
  'y': 'KEYBOARD',
  'z': 'PAPER',
}
the Filter evaluates to true ('CAR' does not start with 'G', and 'KEYBOARD' does contain 'KEY').
I can slightly change the data model, stream something else instead of Documents (e.g., Filters), and use any NoSQL database and tooling to help. Apache Flink (event processing), MongoDB (a single query to retrieve a Filter with its Rules), or maybe Neo4j (as the model looks like a bipartite graph) look like they could help, but I'm not sure about it.
Can it be processed efficiently (with regard to - probably - database queries)? What tools would be appropriate?
I have been also wondering, if maybe I am trying to solve special case of some more general (math) problem that may have useful theorems and algorithms.
EDIT: My newest idea: gather all Documents in a cache like Redis. Then a single event starts up and publishes N functions (as in Function as a Service), and every function selects F/N Filters (the number of Filters divided by the number of instances, i.e., just evenly distributing Filters across instances). This way every Filter is fetched from the database only once.
Now every instance streams all Documents from the cache (one Document should be less than 1 MB, and at the same time I should have 1-10k of them, so they should fit in the cache). This way every Document is selected from the database only once (into the cache).
I have reduced database read operations (some Rules are still selected multiple times), but I am still not taking advantage of Rule sharing across Filters. I could intentionally ignore it by using a document database: by selecting a Filter I would also get its Rules. Still, I would have to recalculate their values.
I guess that's what I get for using a shared-nothing, scalable architecture?
I realized that although my graph is indeed (in theory) bipartite, in practice it's going to be a set of disjoint bipartite graphs (as not all Rules are going to be shared). This means that I can process those disjoint parts independently on different FaaS instances without recalculating the same Rules (see the sketch below).
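For illustration (the data shapes are assumed, not from the original post), the disjoint parts can be found with a standard union-find over the Filter-Rule edges:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FilterComponents {
    // parent pointers for union-find over "F:<filterId>" and "R:<ruleId>" nodes
    private final Map<String, String> parent = new HashMap<>();

    private String find(String x) {
        parent.putIfAbsent(x, x);
        String p = parent.get(x);
        if (!p.equals(x)) parent.put(x, p = find(p)); // path compression
        return p;
    }

    private void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    // filterRules: filterId -> ids of the Rules it uses
    Map<String, List<String>> components(Map<String, List<String>> filterRules) {
        filterRules.forEach((f, rules) ->
                rules.forEach(r -> union("F:" + f, "R:" + r)));
        Map<String, List<String>> byRoot = new HashMap<>();
        filterRules.keySet().forEach(f ->
                byRoot.computeIfAbsent(find("F:" + f), k -> new ArrayList<>()).add(f));
        return byRoot; // each value is one independently processable group of Filters
    }
}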
This reduces my problem to processing a single connected bipartite graph. Now, I can only benefit from dynamic programming and share the result of a Rule computation if memory is shared, so I cannot divide (and distribute) this problem further without sacrificing that benefit. So I thought this way: if I have already decided that I will have to recompute some Rules, then let their number be low compared to the disjoint parts I will get.
This is actually the minimum cut problem, which (fortunately) has a known polynomial-time algorithm.
However, this may not be ideal in my case, because I don't want to cut off just any part of the graph; I would like to cut the graph ideally in half (a divide-and-conquer strategy that could be reapplied recursively until the graph is so small that it can be processed within seconds in a FaaS instance, which is time-bounded).
This means that I am looking for a cut that would create two disjoint bipartite graphs, each with roughly the same number of vertices.
This is the sparsest cut problem, which is NP-hard but has an O(sqrt(log N))-approximation algorithm that also favors fewer cut edges.
Currently, this does look like a solution to my problem; however, I would love to hear any suggestions, improvements, and other answers.
Maybe it can be done better with another data model or algorithm? Maybe I can reduce it further with some theorem? Maybe I could transform it into another (simpler) problem, or at least one that is easier to divide and distribute across nodes?
This idea and analysis strongly suggest using a graph database.

Optimizing a Prefix Tree in OrientDB

In my project, I have a fairly large prefix tree, potentially containing millions of nodes (about 250K nodes in my development instance), managed in OrientDB (pointing to other vertices in my graph).
The nodes of the prefix tree are represented by a Token vertex type. Each Token has a 'key' property and is connected to its child vertices by a 'child' edge type. So, a sequence like "hello world" would be represented as:
root -child-> "hello" -child-> "world"
Currently, I have a NOTUNIQUE_HASH_INDEX on Token.key and I am querying the data structure like this:
SELECT EXPAND(OUT('child')[key=:k]) FROM :p
where k is the child key I am looking for and p is the RID of the parent node.
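For reference, an index like the one described above can be created with OrientDB SQL along these lines (a sketch; the property definition is assumed):
CREATE PROPERTY Token.key STRING
CREATE INDEX Token.key NOTUNIQUE_HASH_INDEX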
Generally, performance is pretty good, but I am looking for ideas on improving the query, the indexing, or both for this use case. In particular, queries starting at the root node, which has many children, take noticeably longer than queries starting at other, less-connected nodes.
Any suggestions? Thanks in advance!
Luigi Dell'Aquila from the OrientDB team provided an excellent answer on the OrientDB Google Group. To summarize, the following query (suggested by Luigi) dramatically improved performance.
SELECT FROM Token where key = :k AND in('Child') contains :p
I just ran a realistic test and query time was reduced by 97%! See https://groups.google.com/forum/#!topic/orient-database/mUkz6Z7hSwk for more details.

Retrieve all paths from a node

I'm using OrientDB Community Edition 2.1.16.
This is the graph of my data:
I'm trying to retrieve all paths for a given node using:
select $path from (traverse out('E1') from #13:5)
But what I get is quite strange:
I would have expected that every path passing through the second-level nodes (#13:1, #13:2, #13:3) would have reached the root node (#13:0).
Something like:
(#13:5).out[0](#13:4).out[0](#13:1).out[0](#13:0)
(#13:5).out[0](#13:4).out[1](#13:2).out[0](#13:0)
(#13:5).out[0](#13:4).out[2](#13:3).out[0](#13:0)
Is that correct or not?
If yes, is there a possibility to get this result?
I mean, to have a complete path from #13:5 to #13:0 passing through the second-level nodes.
Thanks
The result you get depends on the traversal strategy. You can set two types: DEPTH_FIRST, the default, and BREADTH_FIRST. I think the difference between the two strategies may interest you. For more info you can look at this link.
DEPTH_FIRST strategy
This is the default strategy used by OrientDB for traversal. It explores as far as possible along each branch before backtracking, and it's implemented using recursion. To know more, look at the Depth-First algorithm. Below are the ordered steps executed while traversing the graph using the DEPTH_FIRST strategy:
[figure: ordered depth-first traversal of a sample tree]
BREADTH_FIRST strategy
It inspects all the neighboring nodes first; then, for each of those neighbor nodes in turn, it inspects their unvisited neighbor nodes, and so on. Compare BREADTH_FIRST with the equivalent but more memory-efficient iterative-deepening depth-first search, and contrast it with DEPTH_FIRST search. To know more, look at the Breadth-First algorithm. Below are the ordered steps executed while traversing the graph using the BREADTH_FIRST strategy:
[figure: ordered breadth-first traversal of a sample tree]
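For example, the strategy can be selected directly in the TRAVERSE statement (a sketch reusing the query from the question; see the TRAVERSE syntax in the OrientDB docs):
select $path from (traverse out('E1') from #13:5 strategy BREADTH_FIRST)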
Using your query
select $path from (traverse out('E1') from #13:5)
you get the path relative to every record returned by the traverse. You can verify that by adding *:
select *, $path from (traverse out('E1') from #13:5)
This way you get all the vertices traversed and, for each of them, the path taken to reach it from the starting node. Note that TRAVERSE visits each record only once, so you will not get every distinct path to the root; $path records only the route by which each vertex was first reached.

Gremlin: What's an efficient way of finding an edge between two vertices?

So obviously, a straightforward way to find an edge between two vertices is:
graph.traversal().V(outVertex).bothE(edgeLabel).filter(__.otherV().is(inVertex))
I feel that the filter step will have to iterate through all the edges, making it really slow for applications with a lot of edges.
Another way could be:
traversal = graph.traversal()
    .V(outVertex)
    .bothE(edgeLabel)
    .as("x")
    .otherV()
    .is(inVertex) // uses index?
    .select("x");
I'm assuming the second approach could be much faster, since it will use the ID index, which should make it faster than the first approach.
Which one is faster and more efficient (in terms of IO)?
I'm using Titan, so you could also make your answer Titan specific.
Edit
In terms of time, it seems like the first approach is faster (vertex b had about 20k edges):
gremlin> clock(100000){g.V(b).bothE().filter(otherV().is(a))}
==>0.0016451789999999999
gremlin> clock(100000){g.V(b).bothE().as("x").otherV().is(a).select("x")}
==>0.0018231140399999999
How about IO?
I would expect the first query to be faster. However, a few things:
Neither query is optimal, since both of them enable path calculation. If you need to find a connection in both directions, then use 2 queries (I will give an example below).
When you use clock(), be sure to iterate() your traversals, otherwise you'll only measure how long it takes to do nothing.
These are the queries I would use to find an edge in both directions:
g.V(a).outE(edgeLabel).filter(inV().is(b))
g.V(b).outE(edgeLabel).filter(inV().is(a))
If you expect to get at most one edge:
edge = g.V(a).outE(edgeLabel).filter(inV().is(b)).tryNext().orElseGet {
    // fall back to the reverse direction; yields null if no edge exists either way
    g.V(b).outE(edgeLabel).filter(inV().is(a)).tryNext().orElse(null)
}
This way you get rid of path calculations. How those queries perform will pretty much depend on the underlying graph database. Titan's query optimizer recognizes that query pattern and should return a result in almost no time.
Now if you want to measure the runtime, do this:
clock(100) {
    g.V(a).outE(edgeLabel).filter(inV().is(b)).iterate()
    g.V(b).outE(edgeLabel).filter(inV().is(a)).iterate()
}
In case one does not know the vertex IDs, another solution might be:
g.V().has('propertykey','value1').outE('thatlabel').as('e').inV().has('propertykey','value2').select('e')
This is also only unidirectional, so one needs to reformulate the query for the opposite direction, as shown below.
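For completeness, the opposite-direction query just swaps the two property values (same hypothetical property names and label as above):
g.V().has('propertykey','value2').outE('thatlabel').as('e').inV().has('propertykey','value1').select('e')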