Postgis ST_Split not splitting roads by all points - postgresql

I have a table containing the street network of Chicago, and I also have a table of crimes committed in Chicago. I am trying to to create k-means clusters for the crimes by assigning them to the cluster centre that is the shortest road distance away.
Firstly, I interpolated all the crimes onto the closest road. So far, so good.
Now what I'm trying to do is to split each road by all the crime points that fall on it, so that I can then create a network topology using pgrouting and route from one crime location to another.
Problem is, the ST_Split function does not seem to be splitting most of the roads and I have no idea why. Given I have a million crime points, the roads should be split into a large amount of segments, but I'm only getting about a thousand rows more than there are in the original street network table. This is the command I'm using:
CREATE TABLE algorithms.crime_network AS
SELECT road.id AS road_id, (ST_Dump(ST_Split(road.geom, road.crime_points))).geom
FROM (SELECT r.geom as geom, r.gid id, ST_Multi(ST_Collect(c.geom)) AS crime_points FROM public.transportation r INNER JOIN chicago_data.interpolated_crimes c ON c.road_id = r.gid GROUP BY r.gid) AS road;
I am using Postgis version 2.2.2 so the fact that I'm splitting by multipoints isn't a problem..
Any help would be appreciated!

As commented, some functions like all of the overlay operators and ST_Split require exact noding to perform as expected. This means that annoying floating point differences of geometry overlays will have vertexes from different geometries very close to each other (on the order of <1e-12), but not exact.
Use ST_Snap to get exact noding of one geometry on another, which will help functions like ST_Split operate as expected.

Related

pgRouting with custom network?

I have a cost network, but it's not a street mapping network. I know the nodes and edges as I defined them. pgRouting looks like a good choice, but every single example I can find uses Open Street Map as the data. I don't have GPS coordinates. The x1,y1 for nodes makes no sense in my graphs, my nodes have specific ids, not coordinates. The costs aren't calculated from the coordinates, they're assigned by me on the various edges based on domain knowledge specific to my domain.
Are there any examples of how to create a custom network in pgRouting? I'm really struggling because the examples are "and then you use this tool to import OSM data"...which doesn't help me at all.
#Chris Kessel
I don't know if this is still relevant, but it may help others:
Basically, what you need to have is a table with edges, where in column 'source' is the id of a node on one end of the edge and in column 'target' - id of the node on the other end. You also have to have a defined cost for the edge, I'm not sure what this will be for you - usually it's distance or time units.
Ususally this is done with geo info using pgr_createTopology function, but in your case you will need to just create this yourself, I suppose.
I think this link can help you:
https://anitagraser.com/2011/02/07/a-beginners-guide-to-pgrouting/
The answer to the question "Are there any examples of how to create a custom network in pgRouting?" is Yes there are.

Openstreetmaps - continuous road?

I imported the OSM data for Switzerland in Postgres and I am interested in getting the road data of a continuous part of a highway (I know the name),that is, the part that connects two specific cities. The highway is quite big (A1) and connects a lot of cities together.
I am not sure how the sequence of road segments is stored in postgres (ie, how one knows that one road segment is directly after the other). How should I query Postgres to get a linestring with the route from on city to another? I can visualize the data of the whole highway (which spans multiple cities) in QuantumGis by doing the query:
select osm_id,way from planet_osm_roads where highway='motorway' and ref='A1';
but I don't know how to only get the osm_ids that I am interested in, in the order they appear in the route. I do not want to do a bounding box constraint in the where clauses because I am looking for a general solution and also, I am still not sure how the order of the sequence of road segments is saved.
The way I did that was to use pgrouting, namely, their pgr_dijkstra algorithm. I loaded the OSM data into a format fit for use by pgrouting using the osm2pgrouting tool.

Find points near LineString in mongodb sorted by distance

I have an array of points representing a street (black line) and points, representing a places on map (red points). I want to find all the points near the specified street, sorted by distance. I also need to have the ability to specify max distance (blue and green areas). Here is a simple example:
I thought of using the $near operator but it only accepts Point as an input, not LineString.
How mongodb can handle this type of queries?
As you mentioned, Mongo currently doesn't support anything other than Point. Have you come across the concept of a route boxer? 1 It was very popular a few years back on Google Maps. Given the line that you've drawn, find stops that are within dist(x). It was done by creating a series of bounding boxes around each point in the line, and searching for points that fall within the bucket.
I stumbled upon your question after I just realised that Mongo only works with points, which is reasonable I assume.
I already have a few options of how to do it (they expand on what #mnemosyn says in the comment). With the dataset that I'm working on, it's all on the client-side, so I could use the routeboxer, but I would like to implement it server-side for performance reasons. Here are my suggestions:
break the LineString down into its individual coordinate sets, and query for $near using each of those, combine results and extract an unique set. There are algorithms out there for simplifying a complex line, by reducing the number of points, but a simple one is easy to write.
do the same as above, but as a stored procedure/function. I haven't played around with Mongo's stored functions, and I don't know how well they work with drivers, but this could be faster than the first option above as you won't have to do roundtrips, and depending on the machine that your instance(s) of Mongo is(are) hosted, calculations could be faster by microseconds.
Implement the routeboxer approach server-side (has been done in PHP), and then use either of the above 2 to find stops that are $within the resulting bounding boxes. Heck since the routeboxer method returns rectangles, it would be possible to merge all these rectangles into one polygon covering your route, and just do a $within on that. (What #mnemosyn suggested).
EDIT: I thought of this but forgot about it, but it might be possible to achieve some of the above using the aggregation framework.
It's something that I'm going to be working on soon (hopefully), I'll open-source my result(s) based on which I end up going with.
EDIT: I must mention though that 1 and 2 have the flaw that if you have 2 points in a line that are say 2km apart, and you want points that are within 1.8km of your line, you'll obviously miss all the points between that part of your line. The solution is to inject points onto your line when simplifying it (I know, beats the objective of reducing points when adding new ones back in).
The flaw with 3 then is that it won't always be accurate as some points within your polygon are likely to have a distance greater than your limit, though the difference wouldn't be a significant percentage of your limit.
[1] google maps utils routeboxer
As you said Mongo's $near only works on points not lines as the centre point however if you flip your premise from find points near the line to find the line near the point then you can use your points as the centre and line as the target
this is the difference between
foreach line find points near it
and
foreach point find line near it
if you have a large number of points to check you can combine this with nevi_me's answer to reduce the list of points that need checking to a much smaller subset

What SRID should I use for my application and how?

I'm using PostgreSQL with PostGIS. All my data has already decimal lat/long attached to it (i.e. -87.34554 33.12321) but to use PostGIS I need to convert it to a certain type of SRID.
The majority of my queries are looking for data inside a certain radius.
What SRID should I use? I created already a geometry column with SRID 4269.
In this example:
link text the author is converting SRID 4269 to SRID 32661. I'm very confused about how and when to use these SRIDs. Any lite on the subject would be truly appreciated.
As long as you never intend to reproject/transform the data to another coordinate system, it doesn't technically matter what srid you use. However assuming you don't want to throw away that important metadata, and you do want to transform it, you will want to ensure your assigned srid matches the data, so postgis knows what to do when the time comes.
So why would you want to reproject from epsg:4269? The answer is because certain types of queries (such as distance) make no sense in this 'unprojected' world. Your units are in decimal degrees, and a straight measurement of x decimal degrees is a different real distance depending where in the planet you are.
In your example above, someone is using epsg:32661 as they believe it will give them better accuracy for the are they're working in. If your data is in a specific area of the globe, you can select a projection that's accurate for that area. If it spans the entire globe, you have to choose a projection that does 'ok' for your needs.
Now fortunately PostGIS has a few ways of making all this easier. For approx distances you can just use the st_distance_sphere function which, as you might guess, assumes the earth is a sphere. Or the more accurate st_distance_spheroid. Using these, you don't need to reproject and you will probably be fine for your distance queries except in edge cases. Newer versions of PostGIS also let you use geography columns
tl;dr - use st_distance_spheroid for your distance queries, store your data in geography columns, or transform it to a local projection (when storing, or on the fly, depending on your needs).
Take a look at this question: How do you know what SRID to use for a shp file?
The SRID is just a way of storing the WKT inside the database (you may have noticed that, altough you store lat/long points, the preferred storing is a long string with number and capital letters).
The SRID or EPSG can be different for the country/state/... altough there are some very common ones especially the 2 mentioned by you. If you need specific info what area uses what SRID, there is a database for handling that.
Inside your database, you have a table spatial_ref_sys that has the information on what SRID PostGIS knows about.

Best Way to Store/Access a Directed Graph

I have around 3500 flood control facilities that I would like to represent as a network to determine flow paths (essentially a directed graph). I'm currently using SqlServer and a CTE to recursively examine all the nodes and their upstream components and this works as long as the upstream path doesn't fork alot. However, some queries take exponentially longer than others even when they are not much farther physically down the path (i.e. two or three segments "downstream") because of the added upstream complexity; in some cases I've let it go over ten minutes before killing the query. I'm using a simple two-column table, one column being the facility itself and the other being the facility that is upstream from the one listed in the first column.
I tried adding an index using the current facility to help speed things up but that made no difference. And, as for the possible connections in the graph, any nodes could have multiple upstream connections and could be connected to from multiple "downstream" nodes.
It is certainly possible that there are cycles in the data but I have not yet figured out a good way to verify this (other than when the CTE query reported a maximum recursive count hit; those were easy to fix).
So, my question is, am I storing this information wrong? Is there a better way other than a CTE to query the upstream points?
The best way to store graphs is of course to use a native graph db :-)
Take a look at neo4j.
It's implemented in Java and has Python and Ruby bindings as well.
I wrote up two wiki pages with simple examples of domain models represented as graphs using neo4j: assembly and roles. More examples are found on the domain modeling gallery page.
I know nothing about flood control facilities. But I would take the first facility. And use a temp table and a while loop to generate the path.
-- Pseudo Code
TempTable (LastNode, CurrentNode, N)
DECLARE #intN INT
SET #intN = 1
INSERT INTO TempTable(LastNode, CurrentNode, N)
-- Insert first item in list with no up stream items...call this initial condition
SELECT LastNode, CurrentNode, #intN
FROM your table
WHERE node has nothing upstream
WHILE #intN <= 3500
BEGIN
SEt #intN = #intN + 1
INSERT INTO TempTable(LastNode, CurrentNode, N)
SELECT LastNode, CurrentNode, #intN
FROM your table
WHERE LastNode IN (SELECT CurrentNode FROM TempTable WHERE N = #intN-1)
IF ##ROWCOUNT = 0
BREAK
END
If we assume that every node points to one child. Then this should take no longer than 3500 iterations. If multiple nodes have the same upstream provider then it will take less. But more importantly, this lets you do this...
SELECT LastNode, CurrentNode, N
FROM TempTable
ORDER BY N
And that will let you see if there are any loops or any other issues with your provider. Incidentally 3500 rows is not that much so even in the worst case of each provider pointing to a different upstream provider, this should not take that long.
Traditionally graphs are either represented by a matrix or a vector. The matrix takes more space, but is easier to process(3500x3500 entries in your case); the vector takes less space(3500 entries, each have a list of who they connect to).
Does that help you?
i think your data structure is fine (for SQL Server) but a CTE may not be the most efficient solution for your queries. You might try making a stored procedure that traverses the graph using a temp table as a queue instead, this should be more efficient.
the temp table can also be used to eliminate cycles in the graph, though there shouldn't be any
Yes (maybe). Your data set sounds relatively small, you could load the graph to memory as an adjacency matrix or adjacency list and query the graph directly - assuming you program.
As far as on-disk format, DOT is fairly portable/popular among others. It also seems pretty common to store a list of edges in a flat file format like:
vertex1 vertex2 {edge_label1}+
Where the first line of the file contains the number of vertices in the graph, and every line after that describes edges. Whether the edges are directed or undirected is up to the implementor. If you want explicit directed edges, then describe them using directed edges like:
vertex1 vertex2
vertex2 vertex1
My experiences with storing something like you described in a SQL Server database:
I was storing a distance matrix, telling how long does it take to travel from point A to point B. I have done the naive representation and stored them directly into a table called distances with columns A,B,distance,time.
This is very slow on simple retreival. I found it is lot better to store my whole matrix as text. Then retreive it into memory before the computations, create an matrix struxture in memory and work with it there.
I could provide with some code, but it would be C#.