Gremlin - find and connect sub-graphs - OrientDB

My graph contains undirected topological network data, and my goal is to build a query that finds all sub-networks that satisfy specific networking rules, creates a vertex for each sub-network, and connects those sub-network vertices that have a path between them. The intention is to shrink the big graph by collapsing each sub-network subgraph into a single vertex.
To find all sub-networks I took the "connected components" query from the Gremlin recipes and added my networking rules to the stopping conditions. But right now I'm having a hard time connecting these sub-networks to each other.
I'm providing a sample graph script here (using a different networking domain) that contains PC, Router, and other equipment nodes. The query should find all LANs by grouping connected PCs and, for each LAN, return the IDs of the other LANs that have a path to it.
Direction has no meaning in this graph, and a path between subgraphs may contain many types of nodes (routers, equipment, etc.).
My GraphDB is OrientDB.
Networking Graph Image
Result should look like this:
==>LAN 1: {pcs: [1, 2, 3], connected LANs: [LAN 2, LAN 3]}
==>LAN 2: {pcs: [4, 5, 6], connected LANs: [LAN 1]}
==>LAN 3: {pcs: [8, 7], connected LANs: [LAN 1]}
This is the query's first part (finding all sub-networks):
g.V().hasLabel('PC').emit(cyclicPath().or().not(both())).          // emit PCs with no neighbours (or on a cycle)
  repeat(__.where(without('a')).store('a').both()).                // walk outward, visiting each vertex only once
  until(or(cyclicPath(), hasLabel('Router'))).                     // stop on a cycle or when a Router is reached
  group().
    by(path().unfold().limit(1)).                                  // key each component by its starting PC
    by(path().local(unfold().filter(hasLabel('PC')).values('id')).
       unfold().dedup().fold()).                                   // collect the distinct PC ids per component
  unfold()
My questions are:
I can identify connectivity between sub-networks by traversing from some arbitrary node of each sub-network until I reach a node that exists in another sub-network. How do I write that in Gremlin?
How can I create a new graph out of this query's results?
What is the performance of this type of query in a big graph, say 30M nodes?
Create graph script:
g = TinkerGraph.open().traversal()
g.addV("PC").property("id","1").as("pc1").
addV("PC").property("id","2").as("pc2").
addV("PC").property("id","3").as("pc3").
addV("PC").property("id","4").as("pc4").
addV("PC").property("id","5").as("pc5").
addV("PC").property("id","6").as("pc6").
addV("PC").property("id","7").as("pc7").
addV("PC").property("id","8").as("pc8").
addV("Router").property("id","9").as("router1").
addV("Router").property("id","10").as("router2").
addV("Equipment").property("id","11").as("eq1").
addV("Equipment").property("id","12").as("eq2").
addV("Equipment").property("id","13").as("eq3").
addV("Equipment").property("id","14").as("eq4").
addE("Line").from("pc1").to("pc2").
addE("Line").from("pc1").to("eq3").
addE("Line").from("pc2").to("pc3").
addE("Line").from("pc3").to("eq1").
addE("Line").from("pc3").to("eq3").
addE("Line").from("pc4").to("pc5").
addE("Line").from("pc4").to("pc6").
addE("Line").from("pc5").to("pc6").
addE("Line").from("pc7").to("pc8")
addE("Line").from("router1").to("pc7").
addE("Line").from("router1").to("pc8").
addE("Line").from("router1").to("eq2").
addE("Line").from("router2").to("eq4").
addE("Line").from("eq1").to("router1").
addE("Line").from("eq3").to("router2").
addE("Line").from("eq4").to("pc4").
iterate()

This isn't a great answer, because I think I have to jump to your last question and ignore the first two of the three:
What is the performance of this type of query in a big graph, say 30M nodes?
If you modified the "Connected Component" recipe found here, then I assume you read further down about the general expense of this sort of query for both OLTP and OLAP. I'd imagine that for 30M vertices you should be looking at OLAP-based processing (as opposed to the script you presented above). I suppose you might be able to do it with TinkerGraph/GraphComputer on a large enough machine with a lot of memory, but this might just be a job for SparkGraphComputer, as suggested toward the end of the recipe.
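For what it's worth, here is a minimal OLAP sketch, assuming TinkerPop 3.4+ (where the connectedComponent() step wraps ConnectedComponentVertexProgram) and a Hadoop/Spark properties file whose path is invented here. Note that it computes components over the whole graph, so the Router stopping rule would still have to be folded in, e.g. by running it over a filtered subgraph:
graph = GraphFactory.open('conf/hadoop-graph.properties')     // hypothetical config path
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().connectedComponent().                                   // runs the vertex program over the graph
  hasLabel('PC').
  group().
    by('gremlin.connectedComponentVertexProgram.component').  // the step's default component-id key
    by(values('id').fold())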
I think your first two questions depend on your approach to, and success with, the third, and that those initial questions might become more focused or even change a bit once you get that far. Perhaps it would be best to get your OLAP approach to "connected components" settled and then come back with more specific questions.
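That said, just to sketch what questions 1 and 2 might look like in OLTP on your small sample graph (a rough illustration only, not a vetted solution - the "LAN" label and "memberOf" edge label are invented, and it reuses your component query, keeping vertices instead of ids):
// collect the components as a map of "first PC" -> [member PC vertices]
lans = g.V().hasLabel('PC').emit(cyclicPath().or().not(both())).
         repeat(__.where(without('a')).store('a').both()).
         until(or(cyclicPath(), hasLabel('Router'))).
         group().by(path().unfold().limit(1)).
         by(path().local(unfold().filter(hasLabel('PC'))).unfold().dedup().fold()).next()

// question 2: materialize one new "LAN" vertex per component and link its member PCs to it
lans.eachWithIndex { seed, pcs, i ->
    lan = g.addV('LAN').property('name', "LAN ${i + 1}").next()
    pcs.each { pc -> g.V(pc).addE('memberOf').to(lan).iterate() }
}

// question 1: two LANs are connected when a simple path leads from one LAN's PC
// to a PC of another LAN (routers, equipment etc. in between are fine)
g.V().hasLabel('LAN').as('a').
  in('memberOf').
  repeat(both('Line').simplePath()).until(hasLabel('PC')).
  out('memberOf').where(neq('a')).
  as('b').select('a', 'b').by('name').dedup()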

Related

OR-Tools: Difference between Path Cheapest Arc and Global Cheapest Arc

The OR-Tools documentation (https://developers.google.com/optimization/routing/routing_options) for the VRP routing options describes the two first solution strategies the following way:
PATH_CHEAPEST_ARC: Starting from a route "start" node, connect it to the node which produces the cheapest route segment, then extend the route by iterating on the last node added to the route.
GLOBAL_CHEAPEST_ARC: Iteratively connect two nodes which produce the cheapest route segment.
Can someone explain to me what the difference between the two heuristics is? Unfortunately I haven't found any other information on the internet or in the documentation.
The first one grows a single route by repeatedly extending it from its last node.
The second one repeatedly connects the closest pair of nodes, wherever they are, until the segments join up into routes.
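Roughly, and only as an illustration (a hedged Groovy sketch, not how OR-Tools implements it; it ignores details such as vehicle capacities and subtour handling):
def cost = [[0, 2, 9, 10],
            [2, 0, 6, 4],
            [9, 6, 0, 8],
            [10, 4, 8, 0]]
def n = cost.size()

// PATH_CHEAPEST_ARC: grow one route from the start node, always hopping
// to the cheapest not-yet-visited node (nearest-neighbour extension)
def pathCheapestArc = { int start ->
    def route = [start]
    def unvisited = (0..<n).toList() - start
    while (unvisited) {
        def next = unvisited.min { cost[route.last()][it] }
        route << next
        unvisited -= next
    }
    route
}

// GLOBAL_CHEAPEST_ARC: repeatedly pick the globally cheapest arc whose
// endpoints still have a free slot, regardless of where routes currently end
def globalCheapestArc = {
    def arcs = []
    (0..<n).each { i -> (0..<n).each { j -> if (i != j) arcs << [i, j] } }
    def outUsed = [] as Set, inUsed = [] as Set, chosen = []
    arcs.sort { cost[it[0]][it[1]] }.each { i, j ->
        if (!(i in outUsed) && !(j in inUsed)) {   // each node: one outgoing, one incoming arc
            chosen << [i, j]; outUsed << i; inUsed << j
        }
    }
    chosen
}

println "path-cheapest-arc route:  ${pathCheapestArc(0)}"
println "global-cheapest-arc arcs: ${globalCheapestArc()}"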

Bipartite graph distributed processing with dynamic programming

I am trying to figure out an efficient algorithm for processing Documents in a distributed (FaaS, to be more precise) environment.
A brute-force approach would be O(D * F * R), where:
D is the number of Documents to process
F is the number of Filters
R is the highest number of Rules in a single Filter
I can assume that:
a single Filter has no more than 10 Rules
some Filters may share Rules (so it is an N-to-N relation)
Rules are boolean functions (predicates), so I can try to take advantage of early cutting, meaning that if I have f() && g() && h() with f() evaluating to false, then I do not have to evaluate g() and h() at all and can return false immediately
in a single Document the number of Fields is always the same (about 5-10)
Filters, Rules and Documents are already in a database
every Filter has at least one Rule
Using sharing (the second assumption), my first idea was to evaluate the Document against every Rule first, and then, for every Filter, compute its result from the already-evaluated Rules. This way, a Rule that is shared is computed only once. However, it doesn't take advantage of early cutting (the third assumption).
The second idea is to use early cutting as a slightly optimized brute force, but then it won't use Rule sharing.
Rule sharing looks like subproblem sharing, so memoization and dynamic programming will probably be helpful.
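For instance (a hedged Groovy sketch, all class and property names invented), the two ideas can be combined: evaluate Rules lazily so that a conjunctive Filter still short-circuits, but memoize the results per Document so that a Rule shared by several Filters is evaluated at most once:
class Rule {
    String id
    Closure<Boolean> test              // the predicate, e.g. { doc -> doc.y.contains('KEY') }
}

class Filter {
    String id
    List<Rule> rules                   // CONJUNCTIVE: all rules must pass
}

// memoized, short-circuiting evaluation of all Filters against one Document
def evaluate = { List<Filter> filters, Map doc ->
    def memo = [:]                     // ruleId -> cached result for this document
    filters.collectEntries { f ->
        // 'every' stops at the first false rule (early cutting), and
        // computeIfAbsent reuses the result of a shared rule
        [f.id, f.rules.every { r -> memo.computeIfAbsent(r.id) { r.test(doc) } }]
    }
}

// usage: the shared rule 'notG' is evaluated once for both filters
def notG   = new Rule(id: 'notG',   test: { doc -> !doc.x.startsWith('G') })
def hasKey = new Rule(id: 'hasKey', test: { doc -> doc.y.contains('KEY') })
def result = evaluate([new Filter(id: 'f1', rules: [notG, hasKey]),
                       new Filter(id: 'f2', rules: [notG])],
                      [x: 'CAR', y: 'KEYBOARD', z: 'PAPER'])
assert result == [f1: true, f2: true]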
I have noticed that the Filter-Rule relation is a bipartite graph, though I am not quite sure whether that can help me. I have also noticed that I could use reverse sets and store the corresponding Set in every Rule. However, this would create a circular dependency and may cause desynchronization problems in the database.
The default idea is that Documents are streamed, and every single one of them is an event that spins up a FaaS instance to process it. However, this would probably force every FaaS instance to query for all Filters, which leaves me at O(F * D) queries because of the shared-nothing architecture.
Sample Filter:
{
  'normalForm': 'CONJUNCTIVE',
  'rules': [
    {
      'isNegated': true,
      'field': 'X',
      'relation': 'STARTS_WITH',
      'value': 'G',
    },
    {
      'isNegated': false,
      'field': 'Y',
      'relation': 'CONTAINS',
      'value': 'KEY',
    },
  ],
}
or in a more condensed form:
document -> !document.x.startsWith("G") && document.y.contains("KEY")
for the Document:
{
  'x': 'CAR',
  'y': 'KEYBOARD',
  'z': 'PAPER',
}
the Filter evaluates to true.
I can slightly change the data model, stream something else instead of Documents (e.g. Filters), and use any NoSQL database and tools to help. Apache Flink (event processing) and MongoDB (a single query to retrieve a Filter with its Rules), or maybe Neo4j (as the model looks like a bipartite graph), look like they could help, but I am not sure about it.
Can it be processed efficiently (with regard to - probably - database queries)? What tools would be appropriate?
I have also been wondering whether I am trying to solve a special case of some more general (math) problem that may have useful theorems and algorithms.
EDIT: My newest idea: gather all Documents in a cache like Redis. Then a single event starts up and publishes N functions (as in Function-as-a-Service), and every function selects F/N Filters (the number of Filters divided by the number of instances, i.e. just evenly distributing Filters across instances). This way every Filter is fetched from the database only once.
Now every instance streams all Documents from the cache (one Document should be less than 1 MB, and at the same time I should have 1-10k of them, so they should fit in the cache). This way every Document is selected from the database only once (into the cache).
I have reduced database read operations (though some Rules are still selected multiple times), but I am still not taking advantage of Rule sharing across Filters. I could intentionally ignore it by using a document database: this way, by selecting a Filter I would also get its Rules. Still, I would have to recalculate their values.
I guess that's what I get for using a shared-nothing scalable architecture?
I realized that although my graph is indeed (in theory) bipartite, in practice it is going to be a set of disjoint bipartite graphs (as not all Rules are going to be shared). This means that I can process those disjoint parts independently on different FaaS instances without recalculating the same Rules.
This reduces my problem to processing a single connected bipartite graph. Now, I can only use the benefits of dynamic programming and share the result of a Rule computation if memory is shared, so I cannot divide (and distribute) this problem further without sacrificing this benefit. So I thought: if I have already decided that I will have to recompute some Rules, then let their number be low compared to the disjoint parts I will get.
This is actually the minimum cut problem, which (fortunately) has known algorithms of polynomial complexity.
However, this may not be ideal in my case, because I don't want to cut off just any part of the graph - I would like to cut the graph ideally in half (a divide-and-conquer strategy that could be reapplied recursively until the graph is so small that it can be processed within seconds by a FaaS instance, which is time-bound).
This means that I am looking for a cut that would create two disjoint bipartite graphs, each with possibly the same (or at least a similar) number of vertices.
This is the sparsest cut problem, which is NP-hard but has an O(sqrt(log N))-approximation algorithm that also favors fewer cut edges.
Currently, this does look like a solution to my problem, but I would love to hear any suggestions, improvements and other answers.
Maybe it can be done better with another data model or algorithm? Maybe I can reduce it further with some theorem? Maybe I could transform it into another (simpler) problem, or at least one that is easier to divide and distribute across nodes?
This idea and analysis strongly suggest using a graph database.

Optimizing a Prefix Tree in OrientDB

In my project I have a fairly large prefix tree, potentially containing millions of nodes (about 250K nodes in my development instance), managed in OrientDB (with nodes pointing to other vertices in my graph).
The nodes of the prefix tree are represented by a Token vertex type. Each Token has a 'key' property and is connected to its child vertices by a 'child' edge type. So, a sequence like "hello world" would be represented as:
root -child-> "hello" -child-> "world"
Currently, I have a NOTUNIQUE_HASH_INDEX on Token.key and I am querying the data structure like this:
SELECT EXPAND(OUT('child')[key=:k]) FROM :p
where k is the child key I am looking for and p is the RID of the parent node.
Generally, performance is pretty good, but I am looking for ideas on improving the query, the indexing, or both for this use case. In particular, queries starting at the root node, which has many children, take noticeably longer than those starting at other, less-connected nodes.
Any suggestions? Thanks in advance!
Luigi Dell'Aquila from the OrientDB team provided an excellent answer on the OrientDB Google Group. To summarize, the following query (suggested by Luigi) dramatically improved performance:
SELECT FROM Token where key = :k AND in('Child') contains :p
I just ran a realistic test and query time was reduced by 97%! See https://groups.google.com/forum/#!topic/orient-database/mUkz6Z7hSwk for more details.
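Presumably the win comes from letting OrientDB drive the lookup from the Token.key index and then filter the (usually few) candidates by parent RID, instead of expanding every child edge of a supernode like the root. As a hedged sketch, the schema assumed above would look roughly like this in OrientDB SQL:
CREATE CLASS Token EXTENDS V
CREATE PROPERTY Token.key STRING
CREATE INDEX Token.key NOTUNIQUE_HASH_INDEX
SELECT FROM Token WHERE key = :k AND in('Child') CONTAINS :p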

some questions about designing on OrientDB

We were looking for the most suitable database for our innovative “collaboration application”. Sorry, we don't know how to name it in a generally understood way. In essence, highly complicated relationships among tenants, roles, users, tasks and bills need to be handled effectively.
After reading about five DBs (PostgreSQL, MongoDB, CouchDB, ArangoDB and Neo4j), when the words “... relationships among things are more important than the things themselves” came to my eyes, I made up my mind to dig into OrientDB. Both the design philosophy and the innovative features of OrientDB (multi-model, clusters, OO, native graph, full graph API, SQL-like language, LiveQuery, multi-master replication, auditing, simple RIDs and version numbers...) keep intensifying my enthusiasm.
OrientDB enlightens me to re-think and try to model from a totally different viewpoint!
We are now designing the data structure based on OrientDB. However, there are some questions puzzling me.
LINK vs. EDGE
Take the case where a CLIENT may place thousands of ORDERs: how should we choose between LINKs and EDGEs to store the relationships? I prefer EDGEs, but they seem to store thousands of RIDs of ORDERs in the CLIENT record.
Embedded records’ Security
Can an embedded record be authorized independently from its container record?
Record-level Security
How does activating Record-level Security affect query performance?
I hope I have expressed this clearly. Any words will be truly appreciated.
LINK vs EDGE
If you don't have properties on your arcs you can use links; if you do, use edges. You really need edges if you need to traverse the relationship in both directions; with a link list you can traverse in only one direction (just like a hyperlink on the web), but without the overhead of edges. Edges are the right choice if you need to walk through a graph, though they require more storage space than a link list.
Another difference is deletion behavior: if two vertices are linked, A --> (link) B, and you delete B, the link does not disappear - it remains, but points at nothing. It is designed this way because, when you delete a document, finding all the other documents that link to it would mean doing a full scan of the database, which typically takes ages to complete. The Graph API, with its bi-directional links, is specifically designed to resolve this problem, so in general we suggest customers use it, or else be careful and manage link consistency at the application level.
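To make the trade-off concrete, here is a hedged OrientDB SQL sketch of both options for the CLIENT/ORDER case (all class, property and edge names are invented, and #13:1 stands in for some order's RID). The LINK approach stores a one-way list of RIDs on the client:
CREATE CLASS Client
CREATE CLASS Purchase
CREATE PROPERTY Client.purchases LINKLIST Purchase
UPDATE Client ADD purchases = #13:1 WHERE name = 'acme'
The EDGE approach uses graph classes instead and can be traversed in both directions:
CREATE CLASS GClient EXTENDS V
CREATE CLASS GPurchase EXTENDS V
CREATE CLASS Placed EXTENDS E
CREATE EDGE Placed FROM (SELECT FROM GClient WHERE name = 'acme') TO (SELECT FROM GPurchase WHERE num = 42)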
RECORD-LEVEL SECURITY
Using 1 million vertices and an admin user called Luke, a query like select from ... where title = ? backed by a NOTUNIQUE_HASH_INDEX had an execution time of 0.027 sec.
OrientDB has the concept of users and roles, as well as Record Level Security. It also supports token based authentication, so it's possible to use OrientDB as your primary means of authorizing/authenticating users.
EMBEDDED RECORDS' SECURITY
I've made this example to try to answer your question. I have this structure: a User class with an embedded property ('prop') that holds the record describing a car.
If I want to access the embedded data, I have to run this command: select prop from User
because if I try to access it through the class that contains the type of the car, I won't get any result:
select from Car
UPDATE
OrientDB supports that kind of authorization/authentication, but it's a little bit different from your example. For example: if a user A without admin permission inserts a record, another non-admin user B can't see the record inserted by user A. A user can see only the records he has inserted himself.
Hope it helps

Is it possible to connect two separate nodes by two links?

I am wondering if it is possible to do this, as I am trying to build a traffic simulation model and may need to utilise this feature, should it exist, in my model.
There are two, and only two, conditions under which a pair of turtles may be connected by more than one link:
If the links are directed, you can have two links, going in opposite directions.
If the links are different breeds.
You might consider alternatives, like having a single link but adding a links-own variable to the links containing a weight, count, or other information.
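For example, a minimal NetLogo sketch of the breed-based option, with a links-own variable thrown in (all breed and variable names are invented):
breed [junctions junction]
directed-link-breed [roads road]
directed-link-breed [rails rail]
links-own [traffic]

to setup
  clear-all
  create-junctions 2
  ;; two links between the same pair of turtles: different breeds
  ;; (two directed links in opposite directions would also work)
  ask junction 0 [ create-road-to junction 1 [ set traffic 10 ] ]
  ask junction 0 [ create-rail-to junction 1 [ set traffic 3 ] ]
end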