Graph projections with Gremlin and Titan

I would like to extract graph projections from a graph so that I can build smaller graphs from large ones for specific use cases (in most of the cases I can think of, the projection will be smaller than the source graph). Consider the following simple example in my original graph (all nodes have different types, a.k.a. labels):
A -> B -> C -> D
Since the path from A to D exists, in my new graph I would create the nodes A and D and the edge between them. These paths could be easily discovered with a traversal (or even with the subgraph step I think):
g.V().as('a').out('aToB').out('bToC').out('cToD').as('d').select('a', 'd');
The thing is that these kinds of traversals, where one does not start from a specific set of nodes (for which an index would be used), are not very efficient, since they require a full graph scan:
16:46:27 WARN com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx - Query requires iterating over all vertices. For better performance, use indexes
What kind of actions can be done here so that graph projections can be accomplished in an efficient way?

Related

Why am I getting different community detection results for NetworkX and Gephi?

I've got a network with undirected, weighted edges. I'm interested in whether nodes A and B are in the same community. When I run "modularity" in Gephi, the modularity class for node A is usually, though not always, distinct from that of B. However, when I switch over to Python and run, on the exact same underlying data, either louvain_communities() (from the networkx.algorithms.community module) or community_louvain.best_partition() (from the community module), A is always in the same community as B. I've tried this at various resolutions, but keep getting similar results: the Python modules are grouping A and B significantly more often than Gephi.
My question is: What is the Gephi method doing differently? My understanding is that Gephi uses the Louvain method; the variables I can see (resolution, using weights, etc.) seem to be the same. Why the disparity?
Edit:
I was asked to provide some code for this.
Basically I have edge tuples with weights like so:
edge_tuples = [(A,B,5),(A,C,11),(B,C,41),…]
I have nodes as a list:
nodes = [A,B,C,…]
I use networkx to make a graph:
import networkx as nx

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_weighted_edges_from(edge_tuples)
If I’m using community, I get the partition like so:
partition = community.community_louvain.best_partition(G, resolution=0.7)
The resolution could be anything; 0.7 is in line with what I've tried before.
In Gephi, I'm just using ordinary node and edge tables. These are generated and exported as CSVs in the process of creating the edge_tuples described above (i.e. I build both from the same data and export the CSVs before making the networkx graph), so I don't see where the underlying data would differ, although I'm certainly open to correction.
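Worth ruling out when comparing tools: Louvain is a randomized heuristic, so unseeded runs can legitimately differ from one another. A minimal sketch of both Python routes mentioned above with the RNG pinned (parameter names as in python-louvain and networkx >= 2.8; the toy data and seed value are placeholders):

import networkx as nx
import community as community_louvain          # python-louvain package
from networkx.algorithms import community as nx_comm

nodes = ['A', 'B', 'C']                        # stand-ins for the real data
edge_tuples = [('A', 'B', 5), ('A', 'C', 11), ('B', 'C', 41)]

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_weighted_edges_from(edge_tuples)

# python-louvain: random_state pins the RNG so repeated runs agree
partition = community_louvain.best_partition(G, weight='weight',
                                             resolution=0.7, random_state=42)

# networkx equivalent; seed plays the same role here
communities = nx_comm.louvain_communities(G, weight='weight',
                                          resolution=0.7, seed=42)

If the two Python calls agree with each other but not with Gephi even when seeded, the difference is more likely in the exported data or in Gephi's own settings than in the algorithm itself.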

Checking if a directed graph has only one topological sort

I'm trying to write pseudo-code for an algorithm that is supposed to check whether a directed graph has exactly one topological ordering. I've already come up with the pseudo-code for a topological sort (using DFS), but it does not seem to help me much. One thought: if the graph has no sinks, then it must contain a cycle, so it has no topological ordering at all (might that help?).
This is an improvement of this answer, as the best possible runtime is improved by starting at a vertex with no incoming edges.
If your directed graph has N vertices, and you have exactly one starting vertex with indegree 0,
Do DFS (but only on the starting vertex) to get a topological sort L.
If L doesn't have N vertices, either some vertices are unreachable (those are part of a cycle) or you need another starting vertex (so there are multiple toposorts).
For i in [0, N-2]:
    If there is no edge from L[i] to L[i+1]:
        Return false.
Return true.
Alternatively, modify Kahn's algorithm: if more than one vertex is ever in the set of indegree-0 vertices while the algorithm is running, return false; otherwise, return true.
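A minimal sketch of that Kahn-style check, assuming vertices are labelled 0..n-1 and the graph is given as an edge list (the function name and representation are mine):

from collections import deque

def has_unique_topological_sort(n, edges):
    # Kahn's algorithm: the ordering is unique iff the frontier of
    # indegree-0 vertices never holds more than one vertex at a time.
    adj = [[] for _ in range(n)]
    indegree = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indegree[v] += 1
    frontier = deque(v for v in range(n) if indegree[v] == 0)
    visited = 0
    while frontier:
        if len(frontier) > 1:
            return False          # two valid next vertices -> multiple sorts
        u = frontier.popleft()
        visited += 1
        for w in adj[u]:
            indegree[w] -= 1
            if indegree[w] == 0:
                frontier.append(w)
    return visited == n           # False if a cycle left vertices unvisited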

Understanding Titan Traversals

I am trying to write a highly scalable system with titandb. I have a situation where some nodes are highly connected.
Imagine the following example at much larger scale.
Now I have the following situations:
1. I want to find all the friends of node X.
2. I want to find a specific friend of node X, for example node 5.
For scenario 1 I do: g.V(X).out('friend').toList(). For scenario 2 I do: g.V(X).out('friend').hasId(5).next(). Both of these traversals will work, but they scale poorly as X gets more friends. Can I optimise this situation by putting more information on the edge label? For example, if I change the label of the edge between X and 5 to friend_with_5, will the following be faster:
`g.V(X).out('friend_with_5').next()`
From my understanding this will be faster, as only one edge will be traversed. However, if I make such a change to my edge labels, how would I find all the friends of X?
You could encode data into your edge label, but doing so complicates your graph schema considerably and, as you note, makes it hard to do simple things like "find all my friends". I don't think you should take that approach.
The preferred method for dealing with this is vertex-centric indices. If you denormalize any data onto your edges, do it with those indices in mind (and not by encoding the data into the edge label). Put some unique identifier for the friend on the "friend" edge as a property (say, a uid) and build a vertex-centric index on it; then a traversal along the lines of `g.V(X).outE('friend').has('uid', 5).inV()` can find the specific friend without iterating over every friend edge.
If your supernodes are especially large (millions+ edges) you should also consider Titan's vertex partitioning feature.

Keeping track of moves when pre-processing graphs

I am writing an algorithm in MATLAB to pre-process a large graph for use with a path-finding algorithm, and I am curious as to the best way that I can keep track of my moves in order to be able to reconstruct the solution and project it onto the original graph.
The pre-processing methods I am using so far are relatively simple; 3 techniques I am using are:
1) Remove long edges:
Any edge (a,b) for which there is a two-edge path (a,c,b) with len(a,b) > len(a,c) + len(c,b) is removed, since the direct edge can never be part of a shortest path.
2) Remove vertices with degree 1
If a vertex with only one edge attached to it is neither the start nor the end-point of the path, then that vertex can never be part of the path, and it can be removed.
3) Remove vertices with degree 2
If a vertex b has exactly two edges attached to it, then b can be removed, and edges (a,b) and (b,c) can be replaced by a single edge (a,c) with length len(a,b) + len(b,c).
The algorithm iterates through these 3 techniques until no further changes are possible in the graph, at which point it removes all the empty rows and columns in the graph adjacency matrix and returns the reduced graph for use with the path-finding algorithm.
The pre-processing algorithm works great: in some cases I am able to achieve a reduction of around 70% in the graph size, and my path-finding algorithm finds a path of the same quality as on the un-processed graph, an order of magnitude faster.
My problem now is in reconstructing the solution on the original graph, so-called "post-processing".
I feel like I should be keeping track of all the moves my pre-processing algorithm makes and then applying them in reverse order after it has finished; I am just not quite sure how I should go about that.
Here is what I had in mind:
First, keep track of all the empty rows and columns I removed from the matrix after pre-processing, and re-insert them.
Then have a simple vector whose indices represent the move number and whose values represent the move type.
Then have one cell array for each of the 3 move "types", containing the data from each move in the order they were performed, each with its own iteration counter.
Then, if I iterate backwards over the move list, it will tell me which cell array to access, and I can apply the reverse operation that is next on that list (kind of like a stack data structure).
This seems a bit unwieldy to me, so I was wondering if anyone else has ideas for a good method of keeping track of my moves that is easily reversible? (A single-stack sketch follows the edit below.)
EDIT: I thought about posting this on the Computer Science Stack Exchange, but my question isn't really about the pre-processing methods themselves; it's about data storage and retrieval and the implementation itself. Feel free to migrate it if you think it would be better suited elsewhere.
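One way to make the bookkeeping above less unwieldy is a single undo stack: every reduction pushes one record holding exactly what is needed to invert it, and post-processing simply pops until the stack is empty, so the separate move-type vector and per-type cell arrays collapse into one structure. A rough sketch, in Python for brevity (in MATLAB the stack maps to a growing cell array); it stores the graph as a dict of dicts with G[a][b] holding the edge length, and assumes the degree-2 contraction never creates a duplicate edge:

undo = []   # one stack of (tag, ...) records; the newest move is on top

def remove_edge(G, a, b):
    # Techniques 1 and 2: delete edge (a, b), remembering its length.
    # (A vertex left isolated corresponds to an empty row/column.)
    undo.append(('edge', a, b, G[a][b]))
    del G[a][b]
    del G[b][a]

def splice_out(G, b):
    # Technique 3: remove degree-2 vertex b, bridging its neighbours.
    a, c = G[b]                               # b's two neighbours
    undo.append(('splice', b, a, c, G[a][b], G[b][c]))
    G[a][c] = G[c][a] = G[a][b] + G[b][c]
    del G[a][b], G[b][a], G[b][c], G[c][b], G[b]

def undo_all(G):
    # Post-processing: pop moves in reverse order and invert each one.
    while undo:
        record = undo.pop()
        if record[0] == 'edge':
            _, a, b, w = record
            G[a][b] = G[b][a] = w
        else:                                 # 'splice'
            _, b, a, c, w_ab, w_bc = record
            del G[a][c], G[c][a]
            G[b] = {a: w_ab, c: w_bc}
            G[a][b] = w_ab
            G[c][b] = w_bc

Popping in reverse chronological order guarantees that a vertex is always restored before any earlier move that touched its edges is undone.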

Best Way to Store/Access a Directed Graph

I have around 3500 flood control facilities that I would like to represent as a network to determine flow paths (essentially a directed graph). I'm currently using SQL Server and a CTE to recursively examine all the nodes and their upstream components, and this works as long as the upstream path doesn't fork a lot. However, some queries take exponentially longer than others, even when the facilities are not much farther down the path physically (i.e. two or three segments "downstream"), because of the added upstream complexity; in some cases I've let a query run for over ten minutes before killing it. I'm using a simple two-column table, one column being the facility itself and the other being the facility that is upstream of the one listed in the first column.
I tried adding an index on the current facility to help speed things up, but that made no difference. As for the possible connections in the graph, any node can have multiple upstream connections and can be reached from multiple "downstream" nodes.
It is certainly possible that there are cycles in the data, but I have not yet figured out a good way to verify this (other than when the CTE query reported hitting the maximum recursion count; those cases were easy to fix).
So, my question is, am I storing this information wrong? Is there a better way other than a CTE to query the upstream points?
The best way to store graphs is of course to use a native graph db :-)
Take a look at neo4j.
It's implemented in Java and has Python and Ruby bindings as well.
I wrote up two wiki pages with simple examples of domain models represented as graphs using neo4j: assembly and roles. More examples are found on the domain modeling gallery page.
I know nothing about flood control facilities, but I would take the first facility and use a temp table and a WHILE loop to generate the path.
-- Pseudo Code
-- TempTable(LastNode, CurrentNode, N)
DECLARE @intN INT
SET @intN = 1

-- Insert the first item in the list, the one with nothing upstream
-- (call this the initial condition)
INSERT INTO TempTable (LastNode, CurrentNode, N)
SELECT LastNode, CurrentNode, @intN
FROM your table
WHERE node has nothing upstream

WHILE @intN <= 3500
BEGIN
    SET @intN = @intN + 1
    INSERT INTO TempTable (LastNode, CurrentNode, N)
    SELECT LastNode, CurrentNode, @intN
    FROM your table
    WHERE LastNode IN (SELECT CurrentNode FROM TempTable WHERE N = @intN - 1)
    IF @@ROWCOUNT = 0
        BREAK
END
If we assume that every node points to one child, then this should take no more than 3500 iterations. If multiple nodes share the same upstream provider, it will take fewer. But more importantly, this lets you do this:
SELECT LastNode, CurrentNode, N
FROM TempTable
ORDER BY N
And that will let you see if there are any loops or other issues with your providers. Incidentally, 3500 rows is not that many, so even in the worst case of each provider pointing to a different upstream provider, this should not take long.
Traditionally, graphs are represented either by an adjacency matrix or by adjacency lists. The matrix takes more space but is easier to process (3500 × 3500 entries in your case); adjacency lists take less space (3500 entries, each holding a list of the nodes it connects to).
Does that help you?
I think your data structure is fine (for SQL Server), but a CTE may not be the most efficient solution for your queries. You might try making a stored procedure that traverses the graph using a temp table as a queue instead; this should be more efficient.
The temp table can also be used to eliminate cycles in the graph, though there shouldn't be any.
Yes (maybe). Your data set sounds relatively small; you could load the graph into memory as an adjacency matrix or adjacency list and query it directly, assuming you are writing a program.
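For instance, here is a sketch of that in-memory approach, assuming the two-column table has been read into a list of (facility, upstream_facility) pairs (all names are hypothetical):

from collections import defaultdict, deque

def build_upstream_index(rows):
    # rows: (facility, upstream_facility) pairs from the two-column table
    upstream = defaultdict(list)
    for facility, parent in rows:
        upstream[facility].append(parent)
    return upstream

def all_upstream(upstream, start):
    # Breadth-first walk over upstream links; the seen-set both avoids
    # revisiting shared branches and guards against cycles in the data.
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in upstream[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(start)
    return seen

rows = [('B', 'A'), ('C', 'A'), ('D', 'B'), ('D', 'C')]   # toy data
print(all_upstream(build_upstream_index(rows), 'D'))      # {'A', 'B', 'C'}

Because every facility is visited at most once, a heavily forking upstream path costs linear time rather than the blow-up the recursive CTE can hit on shared branches.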
As far as on-disk format, DOT is fairly portable/popular among others. It also seems pretty common to store a list of edges in a flat file format like:
vertex1 vertex2 {edge_label1}+
Here the first line of the file contains the number of vertices in the graph, and every line after that describes an edge. Whether the edges are directed or undirected is up to the implementor. If you want an undirected edge to be explicit, describe it with a pair of directed edges:
vertex1 vertex2
vertex2 vertex1
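For example, a three-vertex cycle in that flat format might look like this (assuming integer vertex ids, one directed edge per line, and no edge labels):

3
1 2
2 3
3 1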
My experiences with storing something like you described in a SQL Server database:
I was storing a distance matrix telling how long it takes to travel from point A to point B. I started with the naive representation and stored the entries directly in a table called distances with columns A, B, distance, time.
This was very slow on simple retrieval. I found it is a lot better to store the whole matrix as text, retrieve it into memory before the computations, create a matrix structure in memory, and work with it there.
I could provide some code, but it would be C#.