What is the best way to reduce complexity of geometries - tsql

so I'm playing around with the http://www.gadm.org/ dataset;
I want to go from lat & lon to a country and state (or equivalent).
So to simplify the data I'm grouping it up and unioning the geometries; so far so good. The results are great: I can pull back Belgium and it is fine.
I pull back Australia and I get Victoria because the thing is too damn large.
Now I honestly don't care too much about the level of detail; if lines are within 1 km of where they should be I'm OK (as long as shapes are bigger, not smaller).
What is the best approach to reduce the complexity of my geospatial objects so I end up with a quick and simple view of the world?
All data is kept as Geometry data.

As you've tagged the question with "tsql" I'm assuming you're using SQL Server. Thus, you already have a handy function called Reduce which you can apply to the geometry data type to simplify it.
For example (copied from the above link):
DECLARE @g geometry;
SET @g = geometry::STGeomFromText('LINESTRING(0 0, 0 1, 1 0, 2 1, 3 0, 4 1)', 0);
SELECT @g.Reduce(.75).ToString();
The function receives a tolerance argument with the simplification threshold.
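To get a feel for what a tolerance-based simplification does, here is a minimal Python sketch of the classic Ramer-Douglas-Peucker algorithm applied to the same linestring. This is only an illustration of the general technique: SQL Server's Reduce may break ties or handle edge cases differently, so its output can differ from this sketch. The function names are made up for the example.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def simplify(points, tolerance):
    """Ramer-Douglas-Peucker: drop points closer than tolerance to the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, worst = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > worst:  # strict comparison: the first of several equal maxima wins
            index, worst = i, d
    if worst <= tolerance:
        return [first, last]
    # Keep the farthest point and simplify both halves around it.
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right

line = [(0, 0), (0, 1), (1, 0), (2, 1), (3, 0), (4, 1)]
print(simplify(line, 0.75))
```

A larger tolerance removes more vertices; a tolerance larger than every deviation collapses the line to its two endpoints.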

I suppose complexity is determined only by the number of vertices in a shape. There are quite a number of shape simplifying algorithms to choose from (and maybe some source too).
As a simplistic approach, you can iterate over vertices and reject concave ones if the result does not introduce too large an error (e.g. in terms of added area), preferably merging smaller segments into larger ones. A more sophisticated approach might break an existing segment to better remove smaller ones.
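A minimal Python sketch of that simplistic approach, under some loud assumptions: the polygon is a single ring given in counter-clockwise order without a repeated closing vertex, and nothing checks that removing a vertex leaves the ring free of self-intersections. The name `fill_small_notches` and the `max_added_area` parameter are invented for this example. Because only concave vertices are removed, the shape can only grow, which matches the "bigger, not smaller" requirement in the question.

```python
def signed_area(poly):
    """Shoelace formula; positive for counter-clockwise rings."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return 0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)

def cross(a, b, c):
    """Z component of (b - a) x (c - b); negative at a concave vertex of a CCW ring."""
    return (b[0] - a[0]) * (c[1] - b[1]) - (b[1] - a[1]) * (c[0] - b[0])

def fill_small_notches(poly, max_added_area):
    """Remove concave vertices whose removal adds at most max_added_area each."""
    pts = list(poly)
    changed = True
    while changed and len(pts) > 3:
        changed = False
        for i in range(len(pts)):
            a, b, c = pts[i - 1], pts[i], pts[(i + 1) % len(pts)]
            turn = cross(a, b, c)
            # -turn / 2 is the area of triangle (a, b, c) that gets filled in.
            if turn < 0 and -turn / 2 <= max_added_area:
                del pts[i]
                changed = True
                break  # neighbours changed, so rescan from the start
    return pts

notched = [(0, 0), (4, 0), (4, 4), (2, 3), (0, 4)]  # square with a dent on top
filled = fill_small_notches(notched, max_added_area=2.5)
print(filled, signed_area(notched), signed_area(filled))
```

The threshold here applies per removed vertex; a cumulative budget over the whole shape would be a small change.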

Related

Determine in which polygons a point is

I have tremendous flows of point data (in 2D) (thousands every second). On this map I have several fixed polygons (dozens to a few hundreds of them).
I would like to determine in real time (the order of a few milliseconds on a rather powerful laptop) for each point in which polygons it lies (polygons can intersect).
I thought I'd use the ray casting algorithm.
Nevertheless, I need a way to preprocess the data, to avoid scanning every polygon.
I therefore consider using tree approaches (PM quadtree or Rtree ?). Is there any other relevant method ?
Is there a good PM Quadtree implementation you would recommend (in whatever language, preferably C(++), Java or Python) ?
I have developed a library of several multi-dimensional indexes in Java, it can be found here. It contains R*Tree, STR-Tree, 4 quadtrees (2 for points, 2 for rectangles) and a critbit tree (can be used for spatial data by interleaving the coordinates). I also developed the PH-Tree.
These are all rectangle/point-based trees, so you would have to convert your polygons into rectangles, for example by calculating the bounding box. For every returned bounding box you would then have to check yourself whether the polygon really contains your point.
If your rectangles are not too elongated, this should still be efficient.
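The two-step idea (cheap bounding-box filter first, exact test second) can be sketched in Python as follows. Note the assumptions: the linear scan over the boxes merely stands in for the tree query that a real spatial index would answer, the exact test is the ray casting algorithm mentioned in the question, and all function names are invented for the example.

```python
def bounding_box(polygon):
    """Axis-aligned box (min_x, min_y, max_x, max_y) of a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def point_in_polygon(point, polygon):
    """Ray casting: count edge crossings of a ray going from point towards +x."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def polygons_containing(point, polygons, boxes):
    """Indices of all polygons containing point; boxes are precomputed once."""
    x, y = point
    hits = []
    for idx, (min_x, min_y, max_x, max_y) in enumerate(boxes):
        # Cheap rejection first; a spatial index would return only these candidates.
        if min_x <= x <= max_x and min_y <= y <= max_y:
            if point_in_polygon(point, polygons[idx]):
                hits.append(idx)
    return hits

polygons = [
    [(0, 0), (4, 0), (4, 4), (0, 4)],  # square
    [(2, 2), (6, 2), (2, 6)],          # triangle overlapping the square
]
boxes = [bounding_box(p) for p in polygons]
print(polygons_containing((3, 2.5), polygons, boxes))
```

Points lying exactly on a polygon edge are a known weak spot of ray casting and are not treated specially here.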
I usually find the PH-Tree the most efficient tree, it has fast building times and very fast query times if a point intersects with 100 rectangles or less (even better with 10 or less). STR/R*-trees are better with larger overlap sizes (1000+). The quadtrees are a bit unreliable, they have problems with numeric precision when inserting millions of elements.
Assuming a 3D tree with 1 million rectangles and on average one result per query, the PH-Tree requires about 3 microseconds per query on my desktop (i7 4xxx), i.e. 300 queries per millisecond.

Understanding Titan Traversals

I am trying to write a highly scalable system with titandb. I have a situation where some nodes are highly connected.
Imagine the following example at much larger scale.
Now I have the following situations:
I want to find all the friends of node X.
I want to find a specific friend of node X for example 5.
For scenario 1 I do: g.V(X).out(friend).toList(). For scenario 2 I do: g.V(X).out(friend).hasId(5).next(). Both of these traversals will work but scale poorly as X gets more friends. Can I optimise this situation by putting more information on the edge label ? For example if on the edge between X and 5 I change the label to freind_with_5 will the following be faster:
`g.V(X).out(freind_with_5).next()`
From my understanding this will be faster as only 1 edge will be traversed. However, if I make such a change to my edge labels how would I find all the friends of X ?
You could encode data into your edge label, but I would say that you'd do that at the cost of complicating your graph schema considerably and, as you note, making it hard to do simple things like "find all my friends". I don't think you should take that approach.
The preferred method for dealing with this is with vertex-centric indices. If you denormalize any data to your edges, you should do it with those indices in mind (and not by encoding that data into the edge label). Put some unique identifier for the friend on the "friend" edge and index that.
If your supernodes are especially large (millions+ edges) you should also consider Titan's vertex partitioning feature.

Keeping track of moves when pre-processing graphs

I am writing an algorithm in MATLAB to pre-process a large graph for use with a path-finding algorithm, and I am curious as to the best way that I can keep track of my moves in order to be able to reconstruct the solution and project it onto the original graph.
The pre-processing methods I am using so far are relatively simple; 3 techniques I am using are:
1) Remove long edges:
Any edge (a,b) that can be reached by sequence (a,c,b) where (a,b) > (a,c)+(c,b), is removed
2) Remove vertices with degree 1
If a vertex with one edge coming out of it is not either the start or end-point of the path, then that vertex will never be part of the path, and it can be removed
3) Remove vertices with degree 2
If a vertex b has two edges coming out of it, then b can be removed and edges (a,b) and (b,c) can be replaced by a single edge (a,c) with length (a,b) + (b,c).
The algorithm iterates through these 3 techniques until no further changes are possible in the graph, at which point it removes all the empty rows and columns in the graph adjacency matrix and returns the reduced graph for use with the path-finding algorithm.
The pre-processing algorithm works great, in some cases I am able to achieve a reduction of around 70% in the graph size, and my path-finding algorithm is able to find a path of the same quality as the un-processed graph but an order of magnitude faster.
My problem now is in reconstructing the solution on the original graph, so-called "post-processing".
I feel like I should be keeping track of all the moves my pre-processing algorithm makes and then applying them in reverse order after it has finished; I am just not quite sure how I should go about that.
Here is what I had in mind:
First, keep track of all the empty rows and columns I removed from the matrix after pre-processing and re-insert them.
Then have a simple vector with indices representing the move number and the value representing what type of move.
Then have one cell array for each of the 3 move "types" containing the data from each move in the order they were performed, with their own iteration counter.
Then if I iterate backwards over the move list, it will tell me which cell array to access, and then I can apply the reverse operation that is next on that list (kind of like a stack data structure).
This seems a bit unwieldy to me, so I was wondering if anyone else had any ideas as to a good method of keeping track of my moves that is easily reversible?
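One way to make this less unwieldy is a single log of self-describing records used as a stack, instead of a type vector plus three separate cell arrays. Below is a minimal sketch in Python (not MATLAB) that only projects a path found on the reduced graph back onto the original vertices. It is a sketch under assumptions: records are tuples whose first field names the move type, the record layout and the name `expand_path` are invented, and a path edge (a,c) is taken to come from the most recent degree-2 removal that created it (which is ambiguous if a parallel edge (a,c) already existed). Removed long edges and degree-1 vertices are logged but never need undoing for the path, since the path cannot use them.

```python
def expand_path(path, moves):
    """Undo logged moves in reverse order on a path through the reduced graph."""
    path = list(path)
    for move in reversed(moves):
        if move[0] != "contract":
            continue  # removed edges and degree-1 vertices never lie on the path
        _, a, b, c = move  # vertex b was removed and (a,b),(b,c) became (a,c)
        expanded = [path[0]]
        for node in path[1:]:
            prev = expanded[-1]
            if (prev, node) in ((a, c), (c, a)):
                expanded.append(b)  # put the removed vertex back between a and c
            expanded.append(node)
        path = expanded
    return path

# Log built while pre-processing the chain 1-2-3-4-5 (vertex 9 was a dead end).
moves = [
    ("contract", 1, 2, 3),   # remove vertex 2, creating edge (1,3)
    ("remove_leaf", 9),      # remove a degree-1 vertex
    ("contract", 1, 3, 4),   # remove vertex 3, creating edge (1,4)
]
print(expand_path([1, 4, 5], moves))
```

Restoring the full adjacency matrix, rather than just the path, would additionally require storing the removed edge lengths in each record.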
EDIT: I thought about posting this on the computer science stack exchange; but my question isn't really about the pre-processing methods themselves, but about data storage and retrieval and the implementation itself. But feel free to migrate it if you think it would be better suited elsewhere

Insert a circle into geometry data type

I'm about to start using geometry or geography data types for the first time now that we have a development baseline of 2008R2 (!)
I'm struggling to find how to store the representation for a circle. We currently have the lat and long of the centre of the circle along with the radius, something like :-
[Lat] [float] NOT NULL,
[Long] [float] NOT NULL,
[Radius] [decimal](9, 4) NOT NULL,
Does anyone know the equivalent way to store this using the STGeomFromText method, ie which Well-Known Text (WKT) representation to use? I've looked at circular string (LINESTRING) and curve, but can't find any examples....
Thanks.
One thing you can do if you are using SQL Server 2008 is to buffer a point and store the resulting Polygon (as well-known binary, internally). For example,
declare @g geometry
set @g=geometry::STGeomFromText('POINT(0 0)', 4326).STBuffer(1)
select @g.ToString()
select @g.STNumPoints()
select @g.STArea()
This outputs the WKT,
POLYGON ((0 -1, 0.051459848880767822 -0.99869883060455322, 0.10224419832229614 -0.99483710527420044, 0.15229016542434692 -0.98847776651382446, 0.20153486728668213 -0.97968357801437378, 0.24991559982299805 -0.96851736307144165,... , 0 -1))
the number of points, 129, from which it can be seen that buffering a point uses 128 points plus a repeated start point, and the area, 3.1412, which is accurate to 3 decimal places and differs from the real value by 0.01%, which would be acceptable for many use cases.
If you want less accuracy (i.e., fewer points), you can use the Reduce function to decrease the number of points, e.g.,
declare @g geometry
set @g=geometry::STGeomFromText('POINT(0 0)', 4326).STBuffer(1).Reduce(0.01)
which now produces a circle approximation with 33 points and an area of 3.122 (now 0.6% less than the real value of PI).
Fewer points will reduce storage and make queries such as STIntersects and STIntersection faster, but, obviously, at the cost of accuracy.
EDIT 1: As Jon Bellamy has pointed out, if you choose to use the Reduce function, the parameter needs to be scaled proportionally to the circle/buffer radius, as it is a sensitivity factor for removing points, based on the Ramer-Douglas-Peucker algorithm.
EDIT 2: There is also a function, BufferWithTolerance, which can be used to approximate a circle with a polygon. The second parameter, tolerance, affects how close this approximation will be: the lower the value, the more points and the better the approximation. The third parameter is a bit indicating whether the tolerance is relative or absolute in relation to the buffer radius. This function could be used instead of the STBuffer and Reduce combination to create a circle with more points.
The following query produces,
declare @g geometry
set @g=geometry::STGeomFromText('POINT(0 0)', 4326).BufferWithTolerance(1,0.0001,1)
select @g.STNumPoints()
select @g.STArea()
a "circle" of 321 points with an area of 3.1424, i.e., within 0.02% of the true value of PI (but now larger) and actually less accurate than the simple buffer above. Tightening the tolerance further does not lead to any significant improvement in accuracy, which suggests there is an upper limit to this approach.
As MvG has said, there is no CircularString or CompoundCurve until SQL Server 2012, which would allow you to store circles more compactly and accurately, by building a CompoundCurve made up of two semi-circles, ie, using two CircularStrings.
As far as I can tell from the docs, CircularString was only added for SQL Server 2012. The only other instantiable curve appears to be LineString which, as the name suggests, encodes a sequence of line segments. So your best bet would be approximating the circle as a (possibly regular) polygon with a sufficient number of corners. If that is not acceptable, you might have to keep your current data structures in place, either exclusively or in addition to spatial data types to verify that a match there indeed matches the circle.
This answer was written purely from the docs, with no experience to support it.
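If you go with the regular-polygon approximation and want to control the number of corners yourself, the WKT can also be generated outside the database and passed to STGeomFromText. The following Python sketch shows one way to do it; the function names and the `segments` parameter are invented for this example, and it assumes the centre and radius are expressed in the same planar units (which is not the case for raw lat/long degrees and a radius in metres). The vertices lie on the circle, so the polygon is slightly smaller than the true circle.

```python
import math

def circle_wkt(cx, cy, radius, segments=64):
    """WKT polygon with `segments` vertices placed evenly on the circle."""
    ring = []
    for i in range(segments):
        angle = 2 * math.pi * i / segments
        ring.append((cx + radius * math.cos(angle), cy + radius * math.sin(angle)))
    ring.append(ring[0])  # a WKT ring repeats its first point to close itself
    coords = ", ".join(f"{x:.6f} {y:.6f}" for x, y in ring)
    return f"POLYGON (({coords}))"

def inscribed_area(radius, segments):
    """Area of the regular polygon above: (n / 2) * r^2 * sin(2 * pi / n)."""
    return 0.5 * segments * radius ** 2 * math.sin(2 * math.pi / segments)

print(circle_wkt(0, 0, 1, segments=8))
for n in (32, 128, 512):
    shortfall = 1 - inscribed_area(1, n) / math.pi
    print(n, f"{shortfall:.4%} below the true area")
```

The area formula gives a quick way to pick the smallest vertex count that meets a given accuracy target before storing anything.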

Best Way to Store/Access a Directed Graph

I have around 3500 flood control facilities that I would like to represent as a network to determine flow paths (essentially a directed graph). I'm currently using SQL Server and a CTE to recursively examine all the nodes and their upstream components, and this works as long as the upstream path doesn't fork a lot. However, some queries take exponentially longer than others even when they are not much farther physically down the path (i.e. two or three segments "downstream") because of the added upstream complexity; in some cases I've let it go over ten minutes before killing the query. I'm using a simple two-column table, one column being the facility itself and the other being the facility that is upstream from the one listed in the first column.
I tried adding an index using the current facility to help speed things up but that made no difference. And, as for the possible connections in the graph, any nodes could have multiple upstream connections and could be connected to from multiple "downstream" nodes.
It is certainly possible that there are cycles in the data but I have not yet figured out a good way to verify this (other than when the CTE query reported a maximum recursive count hit; those were easy to fix).
So, my question is, am I storing this information wrong? Is there a better way other than a CTE to query the upstream points?
The best way to store graphs is of course to use a native graph db :-)
Take a look at neo4j.
It's implemented in Java and has Python and Ruby bindings as well.
I wrote up two wiki pages with simple examples of domain models represented as graphs using neo4j: assembly and roles. More examples are found on the domain modeling gallery page.
I know nothing about flood control facilities. But I would take the first facility. And use a temp table and a while loop to generate the path.
-- Pseudo Code
TempTable (LastNode, CurrentNode, N)
DECLARE @intN INT
SET @intN = 1
INSERT INTO TempTable(LastNode, CurrentNode, N)
-- Insert first item in list with no upstream items...call this the initial condition
SELECT LastNode, CurrentNode, @intN
FROM your table
WHERE node has nothing upstream
WHILE @intN <= 3500
BEGIN
    SET @intN = @intN + 1
    INSERT INTO TempTable(LastNode, CurrentNode, N)
    SELECT LastNode, CurrentNode, @intN
    FROM your table
    WHERE LastNode IN (SELECT CurrentNode FROM TempTable WHERE N = @intN - 1)
    IF @@ROWCOUNT = 0
        BREAK
END
If we assume that every node points to one child, then this should take no longer than 3500 iterations. If multiple nodes have the same upstream provider then it will take fewer. But more importantly, this lets you do this...
SELECT LastNode, CurrentNode, N
FROM TempTable
ORDER BY N
And that will let you see if there are any loops or any other issues with your provider. Incidentally 3500 rows is not that much so even in the worst case of each provider pointing to a different upstream provider, this should not take that long.
Traditionally graphs are either represented by a matrix or a vector. The matrix takes more space but is easier to process (3500x3500 entries in your case); the vector takes less space (3500 entries, each holding a list of the nodes it connects to).
Does that help you?
I think your data structure is fine (for SQL Server), but a CTE may not be the most efficient solution for your queries. You might try making a stored procedure that traverses the graph using a temp table as a queue instead; this should be more efficient.
The temp table can also be used to eliminate cycles in the graph, though there shouldn't be any.
Yes (maybe). Your data set sounds relatively small, you could load the graph to memory as an adjacency matrix or adjacency list and query the graph directly - assuming you program.
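As a rough illustration of the in-memory approach, here is a minimal Python sketch that loads the two-column table into an adjacency list and collects everything upstream of one facility with a breadth-first search. The function names and the sample rows are made up for the example. The `seen` set means each facility is visited once, no matter how many paths lead to it, and it also makes the search terminate if the data contains a cycle; a recursive CTE, by contrast, tends to re-expand shared upstream nodes once per path, which is a likely cause of the slow queries on heavily forking networks.

```python
from collections import defaultdict, deque

def build_upstream_index(rows):
    """rows: (facility, upstream_facility) pairs, as in the two-column table."""
    upstream = defaultdict(list)
    for facility, parent in rows:
        upstream[facility].append(parent)
    return upstream

def all_upstream(upstream, start):
    """Every facility upstream of start, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for parent in upstream.get(node, ()):
            if parent not in seen:  # skip shared nodes and break cycles
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

# Hypothetical data: D is fed by B and C, both fed by A; the last row is a cycle.
rows = [("D", "B"), ("D", "C"), ("B", "A"), ("C", "A"), ("A", "D")]
index = build_upstream_index(rows)
print(all_upstream(index, "D"))
```

With about 3500 facilities the whole index fits comfortably in memory, so it can be rebuilt from the table whenever the network changes.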
As far as on-disk format, DOT is fairly portable/popular among others. It also seems pretty common to store a list of edges in a flat file format like:
vertex1 vertex2 {edge_label1}+
Where the first line of the file contains the number of vertices in the graph, and every line after that describes edges. Whether the edges are directed or undirected is up to the implementor. If you want explicit directed edges, then describe them using directed edges like:
vertex1 vertex2
vertex2 vertex1
My experiences with storing something like you described in a SQL Server database:
I was storing a distance matrix, telling how long it takes to travel from point A to point B. I used the naive representation and stored the entries directly in a table called distances with columns A, B, distance, time.
This is very slow on simple retrieval. I found it is a lot better to store my whole matrix as text, then retrieve it into memory before the computations, create a matrix structure in memory and work with it there.
I could provide with some code, but it would be C#.