Fast method of geospatial indexing of half a million polygons in Lucene - postgresql

I am trying to find the intersecting geohashes (up to precision 6) of around half a million polygons. For every polygon I have to find all the geohashes (up to precision 6) inside that polygon and index them. I have tried using PostGIS ST_GeoHash and ST_Intersects and then storing the result in Redis, but it is very slow for my use case. I need to index half a million polygons' geohashes in 10 minutes.
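Simplified, what I have looks roughly like this (the geohash_grid table of precision-6 cells and the table/column names here are just placeholders for my actual schema):

```java
import java.sql.*;
import java.util.*;

// Simplified sketch of the current PostGIS approach: for one polygon, keep the
// precision-6 grid cells it intersects and label each cell with its geohash.
// 'geohash_grid(cell geometry)' and 'polygons(id, geom)' are placeholder tables.
static List<String> geohashesForPolygon(Connection conn, long polygonId) throws SQLException {
    String sql =
        "SELECT ST_GeoHash(ST_Centroid(g.cell), 6) AS hash " +
        "FROM geohash_grid g JOIN polygons p ON ST_Intersects(g.cell, p.geom) " +
        "WHERE p.id = ?";
    List<String> hashes = new ArrayList<>();
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, polygonId);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) hashes.add(rs.getString("hash"));
        }
    }
    return hashes; // these then get written to Redis, one set per polygon
}
```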
I read that it is possible to do this using Lucene. I tried searching for 'geo-spatial indexing a polygon' but couldn't find a good link.
I am a beginner with Elasticsearch and Lucene.
Kindly tell me how to do it or point me in the right direction.
Regards,

Related

Find closest match from 2 columns in postgresql

I have a table "ways" containing coordinate (lat/lon) values. Suppose I have a coordinate (x, y) and I want to find the closest match from the table ways. I have seen some similar questions, like this one: Is there a postgres CLOSEST operator?
However, in my case I have 2 columns instead of 1. Is there a query to do something like this?
You could store the data as PostGIS 'point' type instead of coordinate values:
https://postgis.net/docs/ST_MakePoint.html
https://postgis.net/docs/ST_GeomFromText.html
This would empower you with all the PostGIS functionality such as:
https://postgis.net/docs/ST_DWithin.html
Then you could create a GiST index and use the PostGIS <-> operator to take advantage of index-assisted nearest-neighbor result sets. The 'nearest neighbor' functionality is pretty common. Here is a great explanation:
https://postgis.net/workshops/postgis-intro/knn.html
“KNN” stands for “K nearest neighbours”, where “K” is the number of neighbours you are looking for.
KNN is a pure index based nearest neighbour search. By walking up and down the index, the search can find the nearest candidate geometries without using any magical search radius numbers, so the technique is suitable and high performance even for very large tables with highly variable data densities.
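For example, a minimal sketch from Java, assuming a ways table with an id column and a geom point column in SRID 4326, plus a GiST index on geom:

```java
import java.sql.*;

// Minimal KNN sketch: order by the <-> (index-assisted distance) operator and
// take the first row. Assumes: CREATE INDEX ON ways USING GIST (geom);
static long closestWay(Connection conn, double lon, double lat) throws SQLException {
    String sql =
        "SELECT id FROM ways " +
        "ORDER BY geom <-> ST_SetSRID(ST_MakePoint(?, ?), 4326) " +
        "LIMIT 1";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setDouble(1, lon);   // ST_MakePoint takes longitude first
        ps.setDouble(2, lat);
        try (ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getLong("id");
        }
    }
}
```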

Accuracy of MongoDB Intersect Query

I read somewhere that MongoDB doesn't accurately do spatial searches, as it creates a bounding box around the objects and checks if they intersect, rather than checking whether the original shapes actually intersect. Annoyingly I can't find this webpage again.
Has anyone else had this experience?
UPDATE:
I'm trying to decide whether to use MongoDB or PostGIS for a scalable web system (Java Spring Boot back-end), which requires accurate intersection queries. So for PostGIS I'd probably use ST_Overlaps, and for MongoDB $geoIntersects. I will also use the spatial 'near' functions, although their accuracy isn't as important.
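For reference, the two queries I have in mind would look roughly like this (collection, table and field names are just placeholders):

```java
import java.util.Arrays;
import java.util.List;
import org.bson.Document;

// MongoDB side: a $geoIntersects filter against a GeoJSON polygon field named
// "area". The triangle below is just dummy data.
static Document intersectsFilter() {
    List<List<Double>> ring = Arrays.asList(
            Arrays.asList(0.0, 0.0), Arrays.asList(1.0, 0.0),
            Arrays.asList(1.0, 1.0), Arrays.asList(0.0, 0.0));
    Document geometry = new Document("type", "Polygon")
            .append("coordinates", Arrays.asList(ring));
    return new Document("area",
            new Document("$geoIntersects", new Document("$geometry", geometry)));
}

// PostGIS side: exact topological test on the geometry column.
static final String SQL =
        "SELECT id FROM places WHERE ST_Overlaps(area, ST_GeomFromText(?, 4326))";
```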
Thanks :)

How to do in-memory search for polygons that contain a given point?

I have a PostgreSQL table that has a geometry-type column, in which different simple polygons (possibly intersecting) are stored. The polygons are all areas within a city. I receive a point (latitude-longitude pair) as input and need to find the list of polygons that contain the given point. What I have currently:
Unclustered GiST index defined on the polygon column.
Use ST_Contains(#param_Point, table.Polygon) on the whole table.
It is quite slow, so I am looking for a more performant in-memory alternative. I have the following ideas:
Maintain a dictionary of polygons in Redis, keyed by their geohash. Polygons with the same geohash would be saved as a list. When I receive the point, calculate its geohash and trim it to a desired level. Then search the Redis map, trimming the point's geohash until I find the first result (or enough results); a rough sketch follows below.
Have a trie of geohashes loaded from the database. Update the trie periodically or by receiving update events. Calculate the point's geohash, search in the trie until I find enough results. I prefer this because the map may have long lists for a geohash, given the nature of the polygons.
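To make the first idea concrete, a rough sketch of the lookup (computing the point's geohash string is left out, and JTS geometry types are used purely for illustration):

```java
import java.util.*;
import org.locationtech.jts.geom.Point;
import org.locationtech.jts.geom.Polygon;

// Sketch of idea 1: polygons bucketed by their precision-6 geohash. The map
// would be loaded from Redis / the database; computing the point's geohash
// string is assumed to happen elsewhere (any geohash library will do).
static List<Polygon> containing(Map<String, List<Polygon>> byGeohash,
                                String pointGeohash, Point p) {
    for (int len = Math.min(pointGeohash.length(), 6); len > 0; len--) {
        List<Polygon> bucket = byGeohash.get(pointGeohash.substring(0, len));
        if (bucket == null) continue;           // keep trimming the geohash
        List<Polygon> hits = new ArrayList<>();
        for (Polygon poly : bucket) {
            if (poly.covers(p)) hits.add(poly); // exact check; the bucket is only a filter
        }
        if (!hits.isEmpty()) return hits;
    }
    return Collections.emptyList();
}
```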
Any other approaches?
I have read about libraries like GeoTrie and Polygon Geohasher but can't seem to integrate them with the database and the above ideas.
Any cues or starting points, please?
Have you tried using ST_Within? Not sure if it meets your criteria, but I believe it is meant to be faster than ST_Contains.
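i.e. something roughly like this (table and column names assumed); it is the same test as ST_Contains with the arguments swapped:

```java
// ST_Within(point, polygon) is the same test as ST_Contains(polygon, point);
// "areas" and its "polygon" column are placeholder names.
static final String SQL =
        "SELECT id FROM areas " +
        "WHERE ST_Within(ST_SetSRID(ST_MakePoint(?, ?), 4326), polygon)";
```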

Determine in which polygons a point is

I have a tremendous flow of point data (in 2D): thousands of points every second. On this map I have several fixed polygons (dozens to a few hundred of them).
I would like to determine in real time (on the order of a few milliseconds on a rather powerful laptop), for each point, which polygons it lies in (polygons can intersect).
I thought I'd use the ray casting algorithm.
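That is, the classic even-odd test, roughly:

```java
// Even-odd (ray casting) point-in-polygon test for a simple polygon given as
// parallel arrays of vertex coordinates: count how many edges a horizontal
// ray from (px, py) crosses; an odd count means the point is inside.
static boolean contains(double[] xs, double[] ys, double px, double py) {
    boolean inside = false;
    for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
        boolean crosses = (ys[i] > py) != (ys[j] > py)
                && px < (xs[j] - xs[i]) * (py - ys[i]) / (ys[j] - ys[i]) + xs[i];
        if (crosses) inside = !inside;
    }
    return inside;
}
```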
Nevertheless, I need a way to preprocess the data, to avoid scanning every polygon.
I am therefore considering tree approaches (PM quadtree or R-tree?). Is there any other relevant method?
Is there a good PM quadtree implementation you would recommend (in whatever language, preferably C(++), Java or Python)?
I have developed a library of several multi-dimensional indexes in Java; it can be found here. It contains an R*-tree, an STR-tree, 4 quadtrees (2 for points, 2 for rectangles) and a CritBit tree (which can be used for spatial data by interleaving the coordinates). I also developed the PH-Tree.
These are all rectangle/point-based trees, so you would have to convert your polygons into rectangles, for example by calculating the bounding box. For each returned bounding box you would then have to check manually whether the polygon really intersects your point.
If your rectangles are not too elongated, this should still be efficient.
I usually find the PH-Tree the most efficient tree; it has fast build times and very fast query times if a point intersects with 100 rectangles or fewer (even better with 10 or fewer). STR/R*-trees are better with larger overlap sizes (1000+). The quadtrees are a bit unreliable; they have problems with numeric precision when inserting millions of elements.
Assuming a 3D tree with 1 million rectangles and on average one result per query, the PH-Tree requires about 3 microseconds per query on my desktop (i7 4xxx), i.e. 300 queries per millisecond.
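Whichever tree you pick, the filter-then-refine pattern is the same. A rough sketch using the JTS STRtree as a stand-in (not the library above; the names below are only placeholders):

```java
import java.util.ArrayList;
import java.util.List;
import org.locationtech.jts.geom.*;
import org.locationtech.jts.index.strtree.STRtree;

// Filter-then-refine: index only the bounding boxes, then run the exact
// point-in-polygon test on the few candidates the tree returns.
static STRtree buildIndex(List<Polygon> polygons) {
    STRtree index = new STRtree();
    for (Polygon poly : polygons) {
        index.insert(poly.getEnvelopeInternal(), poly); // bounding box -> polygon
    }
    index.build(); // build once, then reuse for every incoming point
    return index;
}

static List<Polygon> polygonsContaining(STRtree index, Point p) {
    @SuppressWarnings("unchecked")
    List<Polygon> candidates = index.query(p.getEnvelopeInternal());
    List<Polygon> hits = new ArrayList<>();
    for (Polygon poly : candidates) {
        if (poly.covers(p)) hits.add(poly); // exact check after the bbox filter
    }
    return hits;
}
```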

Algorithm for returning similar documents represented in Vector space model

I have a DB containing tf-idf vectors of about 30,000 documents.
I would like to return for a given document a set of similar documents - about 4 or so.
I thought about implementing K-Means (a clustering algorithm) on the data (with cosine similarity), but I don't know whether it's the best choice because of many uncertainties: I'm not sure what to put in my initial clusters, I don't know how many clusters to create, I fear the clusters will be too unbalanced, I'm not sure the quality of the results will be good, etc.
Any advice and help from experienced users will be greatly appreciated.
Thank you,
Katie
I would like to return for a given document a set of similar documents - about 4 or so.
Then don't do k-means. Just return the four closest documents by tf-idf similarity, as any search engine would do. You can implement this as a k-nearest neighbor search, or more easily by installing a search engine library and using the initial document as a query. Lucene comes to mind.
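The brute-force version is tiny. A rough sketch, assuming the tf-idf vectors are stored as sparse termId -> weight maps and are already L2-normalised, so cosine similarity is just a dot product:

```java
import java.util.*;

// k-nearest-neighbour search over sparse tf-idf vectors. Vectors are assumed
// L2-normalised, so cosine similarity reduces to a sparse dot product.
static List<Integer> mostSimilar(Map<Integer, Double> query,
                                 List<Map<Integer, Double>> docs, int k) {
    double[] score = new double[docs.size()];
    for (int d = 0; d < docs.size(); d++) {
        double s = 0;
        for (Map.Entry<Integer, Double> e : query.entrySet()) {
            Double w = docs.get(d).get(e.getKey());
            if (w != null) s += w * e.getValue();   // only shared terms contribute
        }
        score[d] = s;
    }
    List<Integer> ids = new ArrayList<>();
    for (int d = 0; d < docs.size(); d++) ids.add(d);
    ids.sort((a, b) -> Double.compare(score[b], score[a])); // highest similarity first
    return ids.subList(0, Math.min(k, ids.size()));
}
```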
If I understand, you:
read 30k records from a bigger DB into a cache file / into memory, then
compute cosine similarities, 10 terms * 30k records -> best 4.
Can you estimate the runtimes of these phases separately?
Read or cache: how often will this be done, and how big are the 30k vectors all together?
10 * 30k multiply-adds: in your C / Java / ... or in some opaque DB? In C or Java, that should take < 1 second.
In general, make some back-of-the-envelope estimates before getting fancy.
(By the way, I find best-4 faster and simpler in straight-up C than std::partial_sort; YMMV.)