Best Way to Store/Access a Directed Graph - rdbms

I have around 3,500 flood control facilities that I would like to represent as a network to determine flow paths (essentially a directed graph). I'm currently using SQL Server and a recursive CTE to examine all the nodes and their upstream components, and this works as long as the upstream path doesn't fork a lot. However, some queries take dramatically longer than others, even when their starting points are only two or three segments farther "downstream", because of the added upstream complexity; in some cases I've let a query run for over ten minutes before killing it. I'm using a simple two-column table: one column is the facility itself, and the other is a facility that is immediately upstream of the one listed in the first column.
I tried adding an index on the current-facility column to help speed things up, but that made no difference. As for the possible connections in the graph, any node can have multiple upstream connections and can be reached from multiple "downstream" nodes.
It is certainly possible that there are cycles in the data, but I have not yet found a good way to verify this (other than when the CTE query hit its maximum recursion limit; those cases were easy to fix).
So my question is: am I storing this information the wrong way? Is there a better way than a CTE to query the upstream points?
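Roughly, the setup looks like this (table, column, and variable names here are made up for illustration; the real schema is just the two columns described above):
-- Hypothetical version of the two-column table: each row says "upstream_id is immediately upstream of facility_id".
CREATE TABLE facility_links (
    facility_id INT NOT NULL,
    upstream_id INT NOT NULL
);

-- Everything upstream of one starting facility, via a recursive CTE.
DECLARE @start_facility INT = 123;

WITH upstream AS (
    SELECT f.facility_id, f.upstream_id, 1 AS depth
    FROM facility_links AS f
    WHERE f.facility_id = @start_facility
    UNION ALL
    SELECT f.facility_id, f.upstream_id, u.depth + 1
    FROM facility_links AS f
    JOIN upstream AS u ON f.facility_id = u.upstream_id
)
SELECT DISTINCT upstream_id, depth
FROM upstream
OPTION (MAXRECURSION 1000);  -- the default cap of 100 is what produced the "maximum recursion" errors mentioned above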

The best way to store graphs is of course to use a native graph db :-)
Take a look at neo4j.
It's implemented in Java and has Python and Ruby bindings as well.
I wrote up two wiki pages with simple examples of domain models represented as graphs using neo4j: assembly and roles. More examples are found on the domain modeling gallery page.

I know nothing about flood control facilities, but I would start from the first facility and use a temp table and a WHILE loop to generate the path.
-- Pseudo code ("YourTable" stands in for your two-column table: LastNode is the upstream node, CurrentNode is the node it feeds)
CREATE TABLE #TempTable (LastNode INT, CurrentNode INT, N INT);

DECLARE @intN INT;
SET @intN = 1;

-- Initial condition: start from the edges whose upstream node has nothing upstream of it.
INSERT INTO #TempTable (LastNode, CurrentNode, N)
SELECT LastNode, CurrentNode, @intN
FROM YourTable
WHERE LastNode NOT IN (SELECT CurrentNode FROM YourTable);

WHILE @intN <= 3500
BEGIN
    SET @intN = @intN + 1;

    INSERT INTO #TempTable (LastNode, CurrentNode, N)
    SELECT LastNode, CurrentNode, @intN
    FROM YourTable
    WHERE LastNode IN (SELECT CurrentNode FROM #TempTable WHERE N = @intN - 1);

    IF @@ROWCOUNT = 0
        BREAK;
END
If we assume that every node points to one child, then this should take no more than 3,500 iterations. If multiple nodes share the same upstream provider, it will take fewer. But more importantly, this lets you do this...
SELECT LastNode, CurrentNode, N
FROM #TempTable
ORDER BY N;
That will let you see whether there are any loops or any other issues with your providers. Incidentally, 3,500 rows is not that much, so even in the worst case of each provider pointing to a different upstream provider, this should not take long.
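As a quick sanity check on that output (a sketch using the temp table from above): a node that shows up at more than one level N is either reachable by several paths or part of a cycle, and anything still being re-inserted when the iteration cap is hit is almost certainly on a cycle.
-- Nodes that appear at more than one level of the traversal.
SELECT CurrentNode,
       COUNT(DISTINCT N) AS LevelsSeen,
       MAX(N)            AS DeepestLevel
FROM #TempTable
GROUP BY CurrentNode
HAVING COUNT(DISTINCT N) > 1
ORDER BY DeepestLevel DESC;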

Traditionally, graphs are represented either by an adjacency matrix or by adjacency lists (a "vector" per node). The matrix takes more space but is easier to process (3,500 × 3,500 entries in your case); the adjacency lists take less space (3,500 entries, each holding a list of the nodes it connects to).
Does that help you?

I think your data structure is fine (for SQL Server), but a CTE may not be the most efficient way to run your queries. You might try writing a stored procedure that traverses the graph using a temp table as a queue instead; this should be more efficient.
The temp table can also be used to eliminate cycles in the graph, though there shouldn't be any.
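A rough sketch of that approach, reusing the made-up facility_links(facility_id, upstream_id) names from the question and walking upstream from a single starting facility:
-- Breadth-first traversal with a temp table as the queue/frontier.
-- Already-visited nodes are skipped, which also neutralizes any cycles.
DECLARE @start INT = 123;      -- hypothetical starting facility
DECLARE @level INT = 0;
DECLARE @rows INT = 1;

CREATE TABLE #visited (facility_id INT PRIMARY KEY, level INT NOT NULL);
INSERT INTO #visited (facility_id, level) VALUES (@start, 0);

WHILE @rows > 0
BEGIN
    SET @level = @level + 1;

    INSERT INTO #visited (facility_id, level)
    SELECT DISTINCT f.upstream_id, @level
    FROM facility_links AS f
    JOIN #visited AS v
      ON v.facility_id = f.facility_id
     AND v.level = @level - 1                        -- expand only the newest frontier
    WHERE f.upstream_id IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM #visited AS x
                      WHERE x.facility_id = f.upstream_id);  -- skip anything already visited

    SET @rows = @@ROWCOUNT;
END

SELECT facility_id, level
FROM #visited
ORDER BY level, facility_id;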

Yes (maybe). Your data set sounds relatively small, so you could load the graph into memory as an adjacency matrix or adjacency list and query it directly, assuming you're comfortable writing a program to do so.
As far as on-disk formats go, DOT is fairly portable and popular, among others. It also seems pretty common to store a list of edges in a flat-file format like:
vertex1 vertex2 {edge_label1}+
where the first line of the file contains the number of vertices in the graph, and every line after that describes one edge. Whether the edges are directed or undirected is up to the implementer. If the edges are directed and you want a connection in both directions, list each direction explicitly:
vertex1 vertex2
vertex2 vertex1
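If you ever needed to pull such a flat file into the two-column SQL Server table from the original question, a minimal sketch might look like this (file path, table names, and delimiters are assumptions, and it assumes no edge labels):
-- Stage the "vertex1 vertex2" lines, skipping the header row that holds the vertex count.
CREATE TABLE #edge_staging (from_vertex VARCHAR(50), to_vertex VARCHAR(50));

BULK INSERT #edge_staging
FROM 'C:\data\edges.txt'
WITH (FIELDTERMINATOR = ' ', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- Map the edge direction onto the facility/upstream columns as appropriate for your data.
INSERT INTO facility_links (facility_id, upstream_id)
SELECT CAST(from_vertex AS INT), CAST(to_vertex AS INT)
FROM #edge_staging;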

My experience with storing something like what you describe in a SQL Server database:
I was storing a distance matrix describing how long it takes to travel from point A to point B. I started with the naive representation and stored the values directly in a table called distances with columns A, B, distance, time.
This was very slow even for simple retrieval. I found it was a lot better to store the whole matrix as text, retrieve it into memory before the computations, build a matrix structure in memory, and work with it there.
I could provide some code, but it would be C#.

Related

How to pass a vector from Tableau to R

I need to pass a vector of arguments to Rserve from Tableau. Specifically, I am doing IRR calculations in R (on Rserve), and I want to pass a vector of cash flows that sit as columns in my table (instead of rows/measures). So I want to collect all those cash flows into a vector and pass it to Rserve in one go; passing them one at a time slows down I/O.
SCRIPT_REAL("r_func(c(.arg1, .arg2, .arg3))",sum(cf1), sum(cf2), sum(cf3))
cf1..cfn are cash flows corresponding to various periods. The code above works well when there are only a few cash flows, but it takes a long time when I have a few hundred. Moreover, the time is spent not in calculation but in I/O while communicating with the remote Rserve: with a local Rserve the calculation finishes in a few seconds, while on the remote one it takes well over a minute.
I also want to point out that Tableau/Rserve set one argument after another, and that takes time. My expectation is that once I have a vector, there would be just one transfer and one argument-setting step, which should speed things up.
The first step in understanding how Tableau interacts with R or Python is understanding how Tableau's table calcs work.
Tableau's SCRIPT_XXX() functions are table calculations, which means you invoke them on a vector of aggregate query results, and the corresponding R or Python code needs to return a vector, usually of the same size. (I think you may be able to return a scalar or a smaller vector that gets replicated to look like a vector of the same size as the argument, but I'm not certain.)
You can control how your data is partitioned into vectors, and also the ordering of data in the vectors, by editing the table calc to specify the partitioning and addressing for that calc.
Partitioning determines how your aggregate query results are broken up into vectors for calculation purposes. Addressing determines how the elements of each vector are ordered. You can either do that based on the physical layout of the table structure, or (better) based on the specific dimensions.
See the Tableau online help for table calcs for more info, and look for online training videos from Tableau or blog entries (especially from anyone named Bora).
One way to test your understanding of these concepts is to create a Tableau table (i.e., a viz with a mark type of text) with several dimensions on the row and column shelves. Then create calculated fields for INDEX() and SIZE() and display them on text. Finally, change the partitioning and addressing in different ways by editing those table calcs. Try several different permutations. When you can confidently predict what those functions will produce for different settings, you're ready for more complex tasks, such as talking to R.
It is also instructive to experiment with FIRST(), LAST(), LOOKUP(), WINDOW_SUM(), etc., and finally to dig into PREVIOUS_VALUE(). A warning: PREVIOUS_VALUE() is a bit odd and does not behave the way you probably assume it does. Still, it is useful, it can implement a recursive calculation, and it is about as close to a for loop as Tableau gets.

Understanding Titan Traversals

I am trying to write a highly scalable system with TitanDB. I have a situation where some nodes are highly connected.
Imagine the following example at much larger scale.
Now I have the following situations:
I want to find all the friends of node X.
I want to find a specific friend of node X, for example node 5.
For scenario 1 I do g.V(X).out(friend).toList(). For scenario 2 I do g.V(X).out(friend).hasId(5).next(). Both of these traversals work but scale poorly as X gets more friends. Can I optimise this by putting more information in the edge label? For example, if on the edge between X and 5 I change the label to friend_with_5, will the following be faster:
`g.V(X).out(friend_with_5).next()`
From my understanding this will be faster, as only one edge will be traversed. However, if I make such a change to my edge labels, how would I find all the friends of X?
You could encode data into your edge labels, but doing so comes at the cost of complicating your graph schema considerably and, as you note, makes it hard to do simple things like "find all my friends". I don't think you should take that approach.
The preferred method for dealing with this is vertex-centric indices. If you denormalize any data onto your edges, you should do it with those indices in mind (and not by encoding that data into the edge label). Put some unique identifier for the friend on the "friend" edge and index that.
If your supernodes are especially large (millions of edges or more), you should also consider Titan's vertex partitioning feature.

What is the best way to reduce the complexity of geometries

So I'm playing around with the http://www.gadm.org/ dataset.
I want to go from lat & lon to a country and state (or equivalent).
To simplify the data I'm grouping it up and unioning the geometries; so far so good. The results are great: I can pull back Belgium and it is fine.
I pull back Australia and I get Victoria, because the thing is too damn large.
Now, I honestly don't care too much about the level of detail; if lines are within 1 km of where they should be, I'm OK (as long as shapes end up bigger, not smaller).
What is the best approach to reducing the complexity of my geospatial objects so I end up with a quick and simple view of the world?
All data is kept as Geometry data.
As you've tagged the question with "tsql", I'm assuming you're using SQL Server. In that case you already have a handy method called Reduce() which you can apply to the geometry data type to simplify it.
For example (copied from the Reduce() documentation):
DECLARE @g geometry;
SET @g = geometry::STGeomFromText('LINESTRING(0 0, 0 1, 1 0, 2 1, 3 0, 4 1)', 0);
SELECT @g.Reduce(.75).ToString();
The method takes a tolerance argument that sets the simplification threshold.
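Once the shapes are simplified, the lat/lon-to-region lookup itself can stay simple. A sketch with made-up table and column names (the SRID on the point must match the SRID of your stored GADM geometries):
-- Find the country/state shape(s) containing a point (note: X Y order is longitude latitude here).
DECLARE @p geometry = geometry::STGeomFromText('POINT(144.9631 -37.8136)', 4326);

SELECT country_name, state_name          -- hypothetical columns
FROM admin_regions                       -- hypothetical table holding the simplified shapes
WHERE shape.STContains(@p) = 1;          -- STIntersects(@p) = 1 works just as well for points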
I suppose complexity is determined only by the number of vertices in a shape. There are quite a number of shape-simplification algorithms to choose from (and probably some source code, too).
As a simplistic approach, you can iterate over the vertices and reject concave ones when doing so does not introduce too large an error (e.g. in terms of added area), preferably merging smaller segments into larger ones. A more sophisticated approach might break an existing segment to better remove smaller ones.

How to handle new data for recommendation system?

Here's a theoretical question. Let's assume that I have implemented two types of collaborative filtering: user-based CF and item-based CF (in the form of Slope One).
I have a nice data set for these algorithms to run on. But then I want to do two things:
I'd like to add a new rating to the data set.
I'd like to edit an existing rating.
How should my algorithms handle these changes (without doing a lot of unnecessary work)? Can anyone help me with that?
For both cases, the strategy is very similar:
user-based CF:
update all similarities for the affected user (that is, one row and one column in the similarity matrix)
if your neighbors are precomputed, compute the neighbors for the affected user (for a complete update, you may have to recompute all neighbors, but I would stick with the approximate solution)
Slope-One:
update the frequency (only in the 'add' case) and the diff matrix entries for the affected item (again, one row and one column)
Remark: If your 'similarity' is asymmetric, you need to update one row and one column. If it is symmetric, updating one row automatically results in updating the corresponding column.
For Slope One, the matrices are symmetric (frequency) and skew-symmetric (diffs), so you only need to update one row or column and you get the other one for free (if your matrix storage works like this).
If you want to see an example of how this could be implemented, have a look at MyMediaLite (disclaimer: I am the main author): https://github.com/zenogantner/MyMediaLite/blob/master/src/MyMediaLite/RatingPrediction/ItemKNN.cs
The interesting code is in the method RetrainItem(), which is called from AddRatings() and UpdateRatings().
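If the frequency/diff matrices happen to live in SQL tables rather than in memory, the 'add' step can be sketched roughly like this (hypothetical table and column names, not MyMediaLite's implementation):
-- ratings(user_id, item_id, rating)
-- slope_one(item_i, item_j, freq, diff_sum)   -- diff_sum accumulates rating(item_i) - rating(item_j)
DECLARE @user_id INT = 42, @item_id INT = 7, @rating FLOAT = 4.0;

MERGE slope_one AS t
USING (
    SELECT r.item_id AS item_j, r.rating AS rating_j
    FROM ratings AS r
    WHERE r.user_id = @user_id AND r.item_id <> @item_id
) AS s
ON t.item_i = @item_id AND t.item_j = s.item_j
WHEN MATCHED THEN
    UPDATE SET freq = t.freq + 1,
               diff_sum = t.diff_sum + (@rating - s.rating_j)
WHEN NOT MATCHED THEN
    INSERT (item_i, item_j, freq, diff_sum)
    VALUES (@item_id, s.item_j, 1, @rating - s.rating_j);

-- The mirrored (item_j, @item_id) entries have the same freq and a negated diff_sum,
-- so they can either be stored with a second, mirrored MERGE or derived on the fly.
INSERT INTO ratings (user_id, item_id, rating) VALUES (@user_id, @item_id, @rating);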
The general concept here is called online algorithms.
Instead of retraining the whole predictor, it can be updated "online" (while remaining usable) with the new data only.
If you google for "online slope one predictor" you should be able to find some relevant approaches in the literature.

Spatial index slowing down query

Background
I have a table that contains POLYGONS/MULTIPOLYGONS which represent customer territories:
The table contains roughly 8,000 rows
Approximately 90% of the polygons are circles
The remainder of the polygons represent one or more states, provinces, or other geographic regions. The raw polygon data for these shapes was imported from US census data.
The table has a spatial index and a clustered index on the primary key. No changes to the default SQL Server 2008 R2 settings were made. 16 cells per object, all levels medium.
Here's a simplified query that will reproduce the issue that I'm experiencing:
DECLARE @point GEOGRAPHY = GEOGRAPHY::STGeomFromText('POINT (-76.992188 39.639538)', 4326)
SELECT terr_offc_id
FROM tbl_office_territories
WHERE terr_territory.STIntersects(@point) = 1
What seems like a simple, straightforward query takes 12 or 13 seconds to execute, and has what seems like a very complex execution plan for such a simple query.
In my research, several sources have suggested adding an index hint to the query, to ensure that the query optimizer is properly using the spatial index. Adding WITH(INDEX(idx_terr_territory)) has no effect, and it's clear from the execution plan that it is referencing my index regardless of the hint.
Reducing polygons
It seemed possible that the territory polygons imported from the US Census data are unnecessarily complex, so I created a second column, and tested reduced polygons (w/ Reduce() method) with varying degrees of tolerance. Running the same query as above against the new column produced the following results:
No reduction: 12649ms
Reduced by 10: 7194ms
Reduced by 20: 6077ms
Reduced by 30: 4793ms
Reduced by 40: 4397ms
Reduced by 50: 4290ms
Clearly headed in the right direction, but dropping precision seems like an inelegant solution. Isn't this what indexes are supposed to be for? And the execution plan still seems strangely complex for such a basic query.
Spatial Index
Out of curiosity, I removed the spatial index, and was stunned by the results:
Queries were faster WITHOUT an index (sub 3 sec w/ no reduction, sub 1 sec with reduction tolerance >= 30)
The execution plan looked far, far simpler:
My questions
Why is my spatial index slowing things down?
Is reducing my polygon complexity really necessary in order to speed up my query? Dropping precision could cause problems down the road, and doesn't seem like it will scale very well.
Other Notes
SQL Server 2008 R2 Service Pack 1 has been applied
Further research suggested running the query inside a stored procedure. I tried this, and nothing appeared to change.
My first thought is to check the bounding coordinates of the index and see whether they cover the entirety of your geometries. Second, spatial indexes left at the default 16 MMMM (16 cells per object, all grid levels MEDIUM) perform very poorly in my experience. I'm not sure why that is the default. I have written something about spatial index tuning in this answer.
First, make sure the index covers all of the geometries. Then try reducing CELLS_PER_OBJECT to 8. If neither of those two things offers any improvement, it might be worth your time to run the spatial index tuning procedure in the answer I linked above.
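For example, recreating the index with fewer cells per object might look like this (index, table, and column names are taken from the question; the grid levels shown are just the defaults and can be tuned further):
CREATE SPATIAL INDEX idx_terr_territory
ON tbl_office_territories (terr_territory)
USING GEOGRAPHY_GRID
WITH (
    GRIDS = (LEVEL_1 = MEDIUM, LEVEL_2 = MEDIUM, LEVEL_3 = MEDIUM, LEVEL_4 = MEDIUM),
    CELLS_PER_OBJECT = 8,
    DROP_EXISTING = ON
);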
A final thought: state boundaries have so many vertices, and you are testing for intersection against so many state-boundary polygons, that it very well could take that long without reducing them.
Oh, and since it has been two years: starting with SQL Server 2012 there is now an auto-grid tessellation (GEOMETRY_AUTO_GRID / GEOGRAPHY_AUTO_GRID) that does the grid tuning for you and does a great job most of the time.
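Since the column in the question is geography, the 2012+ equivalent would be roughly:
-- SQL Server 2012+: let the engine choose the grid densities.
CREATE SPATIAL INDEX idx_terr_territory
ON tbl_office_territories (terr_territory)
USING GEOGRAPHY_AUTO_GRID
WITH (DROP_EXISTING = ON);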
This might just be due to the simpler execution plan being executed in parallel, whereas the other one is not. However, there is a warning on the first execution plan that might be worth investigating.