Clarification on OVERPASS_MAX_QUERY_AREA_SIZE default (OSMnx, Overpass API) - openstreetmap

I am using OSMnx to query the Overpass API. I've noticed that it has a fairly large default for the maximum query area size:
OVERPASS_MAX_QUERY_AREA_SIZE = 50*1000*50*1000
This value is used to subdivide "larger" polygons into chunks to submit to the Overpass API.
I'd like to understand why the area is so large. For example, the entirety of San Francisco (~50 sq miles) is "simplified" to a single query.
Key questions:
Is there any advantage to reducing query sizes submitted to the Overpass API?*
Is there any advantage to reducing the complexity of shapes/polygons being submitted to the Overpass API (that is, using rectangles with just 4 corner coordinates), versus more complex polygons?**
*Note: Example query that I would be running (looking for the ways that would constitute a walk network):
[out:json][timeout:180];(way["highway"]["area"!~"yes"]["highway"!~"cycleway|motor|proposed|construction|abandoned|platform|raceway"]["foot"!~"no"]["service"!~"private"]["access"!~"private"](37.778007,-122.445467,37.783454,-122.438958);>;);out;
**Note: This question is partially answered in this other post. That said, that question does not focus completely on the performance implications, and is not asked in the context of the variable area threshold used in OSMnx to subdivide "larger" geometries.

max_query_area_size appears to be a heuristic value that someone came up with after a number of test runs. From the Overpass API's side, this figure has pretty much no meaning on its own.
It may be completely off for different kinds of queries, or even for a different area than SF. As an example: for infrequent tags, it's usually better to go ahead with a rather large bounding box rather than firing off a huge number of queries with tiny bounding boxes.
For some statement types, though, a large bounding box may cause significantly longer processing times. In that case, splitting the area into smaller pieces may help. Some queries might even consume too much memory, which also forces you to split your bounding box into smaller pieces.
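As a purely illustrative sketch (not OSMnx's actual implementation), splitting a bounding box so that no tile exceeds a given area could look like this in Python:

```
import math

def split_bbox(south, west, north, east, max_area_m2=50_000 * 50_000):
    """Split a lat/lon bounding box into a grid of sub-boxes whose
    approximate area does not exceed max_area_m2. Purely illustrative;
    OSMnx has its own subdivision logic."""
    # crude metres-per-degree approximation near the bbox centre
    lat_m = 111_320.0
    lon_m = 111_320.0 * math.cos(math.radians((south + north) / 2))
    area = (north - south) * lat_m * (east - west) * lon_m
    n = max(1, math.ceil(math.sqrt(area / max_area_m2)))  # tiles per side
    dlat = (north - south) / n
    dlon = (east - west) / n
    return [(south + i * dlat, west + j * dlon,
             south + (i + 1) * dlat, west + (j + 1) * dlon)
            for i in range(n) for j in range(n)]

# e.g. a San Francisco-sized bbox stays a single tile under the 50 km x 50 km default
tiles = split_bbox(37.70, -122.52, 37.83, -122.35)
```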
As you didn't mention the kind of query you want to run, it's very difficult to give general advice. It's like asking for the best way to write SQL statements without providing any additional context.
Using bounding boxes instead of (poly:...) has performance advantages. If you can specify a bounding box, use the respective bounding box filter rather than providing 4 lat/lon pairs to the poly filter.
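For illustration, here is the question's walk-network query phrased both ways in Python (an assumption: posting directly to a public Overpass endpoint with requests; the poly coordinates are simply the four corners of the same bounding box):

```
import requests  # assumption: querying Overpass directly over HTTP

filters = ('["highway"]["area"!~"yes"]'
           '["highway"!~"cycleway|motor|proposed|construction|abandoned|platform|raceway"]'
           '["foot"!~"no"]["service"!~"private"]["access"!~"private"]')

# bounding-box filter: generally the cheaper option
bbox_query = ('[out:json][timeout:180];'
              f'(way{filters}(37.778007,-122.445467,37.783454,-122.438958);>;);out;')

# poly filter describing the same four corners: works, but costs more to evaluate
corners = '37.778007 -122.445467 37.783454 -122.445467 37.783454 -122.438958 37.778007 -122.438958'
poly_query = ('[out:json][timeout:180];'
              f'(way{filters}(poly:"{corners}");>;);out;')

resp = requests.post('https://overpass-api.de/api/interpreter', data=bbox_query)
ways = resp.json()['elements']
```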

Related

Determine in which polygons a point is

I have a tremendous flow of 2D point data (thousands of points every second). On this map I have several fixed polygons (dozens to a few hundred of them).
I would like to determine in real time (on the order of a few milliseconds on a rather powerful laptop), for each point, which polygons it lies in (polygons can intersect).
I thought I'd use the ray casting algorithm.
Nevertheless, I need a way to preprocess the data, to avoid scanning every polygon.
I am therefore considering tree approaches (a PM quadtree or an R-tree?). Is there any other relevant method?
Is there a good PM quadtree implementation you would recommend (in any language, preferably C/C++, Java or Python)?
I have developed a library of several multi-dimensional indexes in Java; it can be found here. It contains an R*-tree, an STR-tree, 4 quadtrees (2 for points, 2 for rectangles) and a critbit tree (which can be used for spatial data by interleaving the coordinates). I also developed the PH-Tree.
These are all rectangle/point-based trees, so you would have to convert your polygons into rectangles, for example by calculating the bounding box. For every returned bounding box you would then have to check manually whether the polygon really intersects with your point.
If your rectangles are not too elongated, this should still be efficient.
I usually find the PH-Tree the most efficient: it has fast build times and very fast query times if a point intersects with 100 rectangles or fewer (even better with 10 or fewer). STR-/R*-trees are better with larger overlaps (1000+). The quadtrees are a bit unreliable; they have problems with numeric precision when inserting millions of elements.
Assuming a 3D tree with 1 million rectangles and on average one result per query, the PH-Tree requires about 3 microseconds per query on my desktop (i7 4xxx), i.e. 300 queries per millisecond.
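As a sketch of the bounding-box-prefilter-plus-exact-check idea described above, here is the same pattern in Python using Shapely's STRtree (an assumption; the Java library mentioned in the answer works along the same lines):

```
from shapely.geometry import Point, Polygon
from shapely.strtree import STRtree

polygons = [Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]),
            Polygon([(2, 2), (8, 2), (8, 8), (2, 8)])]

# STR-tree built once over the polygons' bounding boxes
tree = STRtree(polygons)

def containing_polygons(x, y):
    p = Point(x, y)
    # coarse filter: bounding-box hits only (Shapely 2.x returns indices)
    candidates = tree.query(p)
    # exact point-in-polygon test on the (usually few) candidates
    return [int(i) for i in candidates if polygons[i].covers(p)]

print(containing_polygons(3, 3))  # -> [0, 1]: the point lies in both polygons
```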

Understanding Titan Traversals

I am trying to write a highly scalable system with TitanDB. I have a situation where some nodes are highly connected.
Imagine the following example at a much larger scale.
Now I have the following situations:
1. I want to find all the friends of node X.
2. I want to find a specific friend of node X, for example friend 5.
For scenario 1 I do `g.V(X).out('friend').toList()`. For scenario 2 I do `g.V(X).out('friend').hasId(5).next()`. Both of these traversals work, but they scale poorly as X gets more friends. Can I optimise this by putting more information on the edge label? For example, if I change the label on the edge between X and 5 to friend_with_5, will the following be faster:
`g.V(X).out('friend_with_5').next()`
From my understanding this would be faster, as only one edge would be traversed. However, if I make such a change to my edge labels, how would I find all the friends of X?
You could encode data into your edge labels, but doing so comes at the cost of complicating your graph schema considerably and, as you note, makes it hard to do simple things like "find all my friends". I don't think you should take that approach.
The preferred method for dealing with this is with vertex-centric indices. If you denormalize any data to your edges, you should do it with those indices in mind (and not by encoding that data into the edge label). Put some unique identifier for the friend on the "friend" edge and index that.
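A hedged sketch of the query side of that approach, written with gremlin_python purely for illustration (the friend_id edge property, the vertex id and the index itself are hypothetical; in Titan the vertex-centric index would be created separately with the management API, e.g. mgmt.buildEdgeIndex):

```
# Sketch only: assumes each 'friend' edge carries a hypothetical 'friend_id'
# property covered by a vertex-centric index on that label.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(
    DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

x_id = 1  # hypothetical id of vertex X

# Scenario 1: all friends of X -- unchanged, a single 'friend' label
friends = g.V(x_id).out('friend').toList()

# Scenario 2: one specific friend -- filter on the indexed edge property,
# so the graph can seek the matching edge instead of scanning every 'friend' edge
friend_5 = g.V(x_id).outE('friend').has('friend_id', 5).inV().next()
```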
If your supernodes are especially large (millions+ edges) you should also consider Titan's vertex partitioning feature.

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced to this area by doing some tests, and as a first experiment I'd like to create clusters of tweets based on content similarity. The basic idea for the experiment would be storing tweets in a database and periodically calculating the clustering (i.e. using a cron job). Please note that the database would receive new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Compute the Jaccard similarity against each of the existing clusters. If the result > threshold, the tweet is assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Leaving computational cost aside, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4 and T10, which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold, but the similarity with C1 is higher. In that case Tn should go to C1. This could also be done by brute force, assigning the tweet to the cluster with maximum similarity.
And last of all, there's the computational issue. I've been reading a bit about MinHash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend which approach I should aim for? I've read some mentions of LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guidance.
From what I'm reading, one tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data comes in. Is this correct?
Please note that I'm not looking for anything complicated. My use case would be being able to cluster similar tweets into groups without any prior information. For example, tweets from Foursquare ("I'm checking in at ...") would be one group, and "My Klout score is ..." tweets another. Also note that I'd like this to be language independent, so I'm not interested in having to deal with language-specific issues.
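A minimal sketch of steps 1-2 of the idea above (character trigrams plus Jaccard similarity; the texts and the N=3 choice are only examples):

```
def ngrams(text, n=3):
    """Character n-grams of a whitespace-normalised tweet (step 1)."""
    text = ' '.join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two n-gram sets (step 2)."""
    return len(a & b) / len(a | b) if a | b else 0.0

t1 = ngrams("I'm checking in at Starbucks, San Francisco")
t2 = ngrams("I'm checking in at Starbucks, New York")
print(jaccard(t1, t2))  # clearly higher than for two unrelated tweets
```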
It looks to me like you are trying to address two different problems in one: "syntactic" and "semantic" clustering. They are quite different problems, especially in the realm of short-text analysis (and Twitter is the king of short text, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using an N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find meaningful tweets that have as few as 3-4 features). Normalization is also an important factor here; removing the RT tag, mentions and hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it is unlikely to work if you try to aggregate a random sample of tweets, since they usually carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (e.g. those matching a keyword or a hashtag). LSA could be useful here, while it is useless for syntactic clustering.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is that you need online clustering, not static clustering. The classical clustering algorithms that work well in the static case (like hierarchical clustering, or union-find) aren't really suited for online clustering, unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution in my experience, because you need to retain all the information of every cluster member to update the "average" every time new data comes in. Also, algorithms like hierarchical clustering and union-find work well because they can join pre-existing clusters if a link of similarity is found between them, rather than simply assigning a new element to the "closest" cluster, which is what you suggested doing in your post.
Algorithms like MinHash (or SimHash) are indeed better suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(n log n) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
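A sketch of that index-plus-similarity-graph idea, assuming the datasketch library for MinHash/LSH (names, tokenisation and thresholds are all illustrative, not a prescription):

```
from datasketch import MinHash, MinHashLSH

def sketch(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):   # word shingles; character n-grams work too
        m.update(token.encode('utf8'))
    return m

tweets = {                                     # toy stand-in for the database
    't1': "I'm checking in at Starbucks San Francisco",
    't2': "I'm checking in at Starbucks New York",
    't3': "My Klout score went up to 52",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)  # the "index" of saved sketches
edges = []                                     # similarity graph as an edge list

for tweet_id, text in tweets.items():
    m = sketch(text)
    for match in lsh.query(m):                 # previously indexed tweets above the threshold
        edges.append((tweet_id, match))
    lsh.insert(tweet_id, m)

print(edges)  # likely [('t2', 't1')]: t1 and t2 end up in the same connected component
```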
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
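A compact sketch of that canopy idea (the distance function and the two radii are placeholders; for tweets the distance could be 1 minus the n-gram Jaccard similarity):

```
def canopy(points, distance, t_outer, t_inner):
    """Canopy pre-clustering: the first object starts a canopy; anything within
    t_outer joins it (membership may overlap), and anything within t_inner of an
    existing centre never starts a new canopy. Requires t_inner < t_outer."""
    canopies = []                            # list of (centre, members)
    for p in points:
        near_inner = False
        for centre, members in canopies:
            d = distance(p, centre)
            if d < t_outer:
                members.append(p)            # overlapping, non-disjoint membership
            if d < t_inner:
                near_inner = True
        if not near_inner:
            canopies.append((p, [p]))        # p starts (and belongs to) a new canopy
    return canopies

# e.g. with 1-D points and absolute difference as the distance
print(canopy([0.0, 0.1, 0.5, 5.0], lambda a, b: abs(a - b), t_outer=1.0, t_inner=0.3))
```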
However, don't expect useful results from clustering tweets. Tweet data is just too noisy. Most tweets have only a few words, too few to define a good similarity. On the other hand, you have the various retweets that are near duplicates, but those are trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?

Find points near LineString in mongodb sorted by distance

I have an array of points representing a street (black line) and points representing places on the map (red points). I want to find all the points near the specified street, sorted by distance. I also need to be able to specify a max distance (blue and green areas). Here is a simple example:
I thought of using the $near operator but it only accepts Point as an input, not LineString.
How mongodb can handle this type of queries?
As you mentioned, Mongo currently doesn't support anything other than Point. Have you come across the concept of a route boxer [1]? It was very popular a few years back on Google Maps: given the line that you've drawn, find stops that are within dist(x). It was done by creating a series of bounding boxes around each point in the line and searching for points that fall within each bucket.
I stumbled upon your question just after realising that Mongo only works with points, which is reasonable, I assume.
I already have a few options for how to do it (they expand on what @mnemosyn says in the comments). With the dataset that I'm working on, it's all on the client side, so I could use the routeboxer, but I would like to implement it server-side for performance reasons. Here are my suggestions:
1. Break the LineString down into its individual coordinates, query for $near using each of those, then combine the results and extract a unique set (see the sketch after this list). There are algorithms out there for simplifying a complex line by reducing the number of points, but a simple one is easy to write.
2. Do the same as above, but as a stored procedure/function. I haven't played around with Mongo's stored functions and I don't know how well they work with drivers, but this could be faster than the first option as you won't have to do round trips, and depending on the machine your Mongo instance(s) are hosted on, calculations could be faster by microseconds.
3. Implement the routeboxer approach server-side (it has been done in PHP), and then use either of the above to find stops that are $within the resulting bounding boxes. In fact, since the routeboxer method returns rectangles, it would be possible to merge all these rectangles into one polygon covering your route and just do a $within on that (what @mnemosyn suggested).
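A sketch of option 1 with pymongo (the collection and field names are assumptions; it needs a 2dsphere index on the location field, and GeoJSON coordinates are [lng, lat]):

```
from pymongo import MongoClient, GEOSPHERE

db = MongoClient().mydb                        # hypothetical database name
db.places.create_index([("location", GEOSPHERE)])

def places_near_line(line_coords, max_dist_m):
    """Option 1: run $near once per (simplified) line vertex, merge, de-duplicate."""
    seen, results = set(), []
    for lng, lat in line_coords:
        cursor = db.places.find({
            "location": {"$near": {
                "$geometry": {"type": "Point", "coordinates": [lng, lat]},
                "$maxDistance": max_dist_m}}})
        for doc in cursor:
            if doc["_id"] not in seen:
                seen.add(doc["_id"])
                results.append(doc)
    return results  # note: global ordering by true distance to the line is lost here
```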
EDIT: I had thought of this but forgot about it: it might be possible to achieve some of the above using the aggregation framework.
It's something that I'm going to be working on soon (hopefully); I'll open-source the result based on whichever approach I end up going with.
EDIT: I must mention, though, that options 1 and 2 have the flaw that if you have two points in a line that are, say, 2 km apart, and you want points that are within 1.8 km of your line, you'll obviously miss the points along the middle of that segment of your line. The solution is to inject points into your line when simplifying it (I know, it defeats the objective of reducing points when you add new ones back in).
The flaw with 3 then is that it won't always be accurate as some points within your polygon are likely to have a distance greater than your limit, though the difference wouldn't be a significant percentage of your limit.
[1] Google Maps utility library: RouteBoxer
As you said, Mongo's $near only works with a point (not a line) as the centre. However, if you flip your premise from "find points near the line" to "find the line near the point", then you can use your points as the centre and the line as the target.
this is the difference between
foreach line find points near it
and
foreach point find line near it
If you have a large number of points to check, you can combine this with nevi_me's answer to reduce the list of points that need checking to a much smaller subset.
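A sketch of that flipped query with pymongo (again, collection and field names are assumptions; the streets collection needs a 2dsphere index so $near can measure the distance from each point to the stored LineStrings):

```
from pymongo import MongoClient, GEOSPHERE

db = MongoClient().mydb
db.streets.create_index([("geometry", GEOSPHERE)])  # geometry holds GeoJSON LineStrings

def streets_near_point(lng, lat, max_dist_m):
    """For one point, find the stored lines within max_dist_m, nearest first."""
    return list(db.streets.find({
        "geometry": {"$near": {
            "$geometry": {"type": "Point", "coordinates": [lng, lat]},
            "$maxDistance": max_dist_m}}}))
```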

What are the limitations of GeoServer and OpenLayers when showing a large number of points?

We are trying to show a map with a large number of points (ranging from 1000 up to 20000 depending on the user's criteria) using OpenLayers and GeoServer. The points are stored in a PostgreSQL database.
Whilst the application seems to have little problem displaying the lower range, its practical limit seems to be around 5000 points. The SLD we are applying is also huge (listing all the points individually by criteria that isn’t the feature Id). At higher numbers, the image is not guaranteed to be returned, and the request sometimes crashes GeoServer, requiring the service to be reset.
Does anyone know if such a thing is feasible, and if so, of any configuration tips?
We have applied a btree index on the field used for filtering.
What type of layer are you adding to OpenLayers?
You could use a WMS layer rather than having the points as vector features:
http://dev.openlayers.org/docs/files/OpenLayers/Layer/WMS-js.html
GeoServer would then generate an image of the points and would only need to return a PNG or JPEG of a few KB rather than geometry and styling info, which would be a lot larger. You'd lose some of the client-side functionality though (mouse-over events etc.).
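Under the hood, an OpenLayers WMS layer just issues standard WMS GetMap requests. A rough Python equivalent (the GeoServer URL, workspace, layer name and bbox are placeholders) shows how little data comes back compared with thousands of vector features:

```
import requests  # placeholder values throughout; adjust to your GeoServer setup

params = {
    "service": "WMS",
    "version": "1.1.1",
    "request": "GetMap",
    "layers": "myworkspace:points_layer",   # hypothetical layer name
    "styles": "",                           # default style, or a named SLD
    "srs": "EPSG:4326",
    "bbox": "-122.52,37.70,-122.35,37.83",  # minx,miny,maxx,maxy
    "width": 768,
    "height": 576,
    "format": "image/png",
    "transparent": "true",
}
resp = requests.get("http://localhost:8080/geoserver/wms", params=params)
open("points.png", "wb").write(resp.content)  # a few KB instead of 20000 geometries
```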
If you are already doing this, then there may be a separate problem. 5000 points should be fine to handle on the server.
Alternatively you may want to rethink how you are displaying the points. 5000 points at one time sounds as though it could be very confusing. Perhaps using different-sized circles to represent 10, 100, 500 points etc. would be easier in terms of processing and visualisation.