Is there a database supporting 2 geospatial indexes? - mongodb

I am looking for a database that supports 2 geospatial indexes, or that allows them to be simulated efficiently.
Motivation: our application deals with vectors rather than locations, and we often need to find all the records where the source is near one point and the destination is near another.
MongoDB does not have this. Is there a database which does?
Maybe it could be simulated with the MongoDB map-reduce feature, where the database looks up all the records satisfying the source constraint and then passes them through map-reduce to keep only those that also satisfy the destination constraint. Has anyone done this?
Thanks.

It might be possible to fake this with MapReduce in Mongo, but this would only be suitable for use as a batch job and not likely to perform well as an application level query.
A workaround would be to store the source and destination points in separate Mongo collections. Then you could do a query on the source collection using $near to pull the closest points to the source point, then do another $near query against the destination collection, and compute the intersection in memory.
Another option: since you can use the geospatial index on a field which contains an array of points, store both the source and destination points as elements of a single array. Then issue two queries against that collection (one for the source point, one for the destination) and scan through the two result sets to compute the final result (the queries won't distinguish between which match was the source and which was the destination, so you'd have to check that on the client side).
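A rough sketch of that second approach (pymongo is assumed; the collection and field names are illustrative, not part of the original question):
import math
from pymongo import GEO2D, MongoClient

client = MongoClient()
vectors = client.mydb.vectors  # docs like {"points": [[src_lng, src_lat], [dst_lng, dst_lat]]}
vectors.create_index([("points", GEO2D)])  # one 2d index over the array of points

def _close(p, q, max_dist):
    # rough planar distance check in degrees; fine for a sketch
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= max_dist

def find_vectors(src, dst, max_dist):
    near_src = {d["_id"]: d for d in
                vectors.find({"points": {"$near": src, "$maxDistance": max_dist}})}
    results = []
    for doc in vectors.find({"points": {"$near": dst, "$maxDistance": max_dist}}):
        match = near_src.get(doc["_id"])
        if match is None:
            continue
        source, destination = match["points"]
        # the index can't tell which array element matched, so verify on the
        # client that the source really is near src and the destination near dst
        if _close(source, src, max_dist) and _close(destination, dst, max_dist):
            results.append(match)
    return results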

Related

MongoDB performance of getting a document when the document does not exist

We are storing lots of data in MongoDB, let's say 30M docs, and these documents do not get modified very often. There is a high number of read queries (~15k QPS), and many of these queries (by the _id field) will return an empty result because of the nature of our use case.
I want to understand whether MongoDB does some sort of optimisation for detecting that a doc is not present in the database or index. Are there any plugins to enable this? The other option I see is to use an application-level Bloom filter, but that would be another piece to maintain. AFAIK HBase has support for Bloom filters to check whether a document is present or not.
Finding a non-existent document is the worst case of finding a document. Same as in real life: if what you're looking for doesn't exist, you'll need more time to check all the places than if the thing actually existed somewhere.
All of the find optimizations apply equally to finding documents that end up not existing (indexes, shard keys, etc.).
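If you do go the application-level Bloom filter route, a minimal sketch (pymongo assumed; sizes, collection and field names are illustrative) could look like this:
import hashlib
from pymongo import MongoClient

class BloomFilter:
    # a tiny Bloom filter: k hash positions over a fixed-size bit array
    def __init__(self, size_bits=300_000_000, num_hashes=7):  # ~10 bits/element for 30M ids
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

client = MongoClient()
docs = client.mydb.docs

bloom = BloomFilter()
for doc in docs.find({}, {"_id": 1}):      # build once at startup from the existing _ids
    bloom.add(str(doc["_id"]))

def get_doc(doc_id):
    if not bloom.might_contain(str(doc_id)):   # definitely absent: skip the db round trip
        return None
    return docs.find_one({"_id": doc_id})      # maybe present: do the real lookup
New _ids would also have to be added to the filter on every insert, which is exactly the extra piece of maintenance mentioned above.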

data model for tree structure (file system): document model vs graph model

I'm evaluating a NoSQL solution to implement a file-system-like structure with millions of items, where the key feature has to be:
speed in finding the "parents", "direct children" or "subtree children" of an item, filtered by n item properties, with paged results sorted by an item property.
Given these requirements, I split the problem into 2 tasks:
model the recursive item structure for searching children / subtree children
model the item structure for searching over item properties
The schema-free nature of NoSQL is a good fit for storing different properties for each file, which covers point 2.
For point 1, however, I have some doubts about the pros / cons of using a document database (e.g. MongoDB) with a single collection of items and a materialized-path design pattern, versus using a graph database (e.g. ArangoDB) with 2 collections: items for data (document collection) and itemsParents for the parent-child relation (edge collection), plus a graph traversal function.
Are there performance advantages to using a graph database for my requirements?
Is a graph traversal more efficient than a materialized-path filter for my task?
If yes, can you explain why?
Thanks
A graph database would certainly be a great choice for a hierarchical structure like a filesystem. Speaking specifically of Neo4j you could have a schema such as:
(:Folder)-[:IN_FOLDER]->(:Folder)
(:File)-[:IN_FOLDER]->(:Folder)
Finding a file or a folder is as simple as the following Cypher:
MATCH (file:File {path: '/dir/file'})
RETURN file
To find all of the files/folders directly under a folder:
MATCH (folder:Folder {path: '/dir'})<-[:IN_FOLDER]-(file_or_folder)
RETURN file_or_folder
If you wanted to find all files/folders recursively you could do:
MATCH (folder:Folder {path: '/dir'})<-[:IN_FOLDER*1..5]-(file_or_folder)
RETURN file_or_folder
The 1..5 adjusts the depth (from one to five levels) to which you are searching.
For all of these you'd want an index on the path property for both the Folder and File labels. Of course you wouldn't need to do it this way, depending on your use-case.
The reason that Neo4j can be so much faster in this case is because once you find a node on disk the relationships can be traversed with just a few file accesses as opposed to searching an entire table or index for each hop. I recommend checking out the free book Graph Databases by O'Reilly for details on the internals of Neo4j.
Your use case is better served by a multi-model database, which is both a document store and a graph database. With such a data store you can put all your items as vertices in one collection and have the relations for the hierarchy as edges in a separate collection. Additionally, you could store the path with every item and have a sorted index and, if constant time matters, a hash index on the path attribute.
You would get
O(1) (constant time) lookup for an item by its path (using the hash index)
O(1) lookup for a parent either by a graph neighbour or by a lookup by (truncated) path
finding all n direct children in time O(n) using graph neighbours
finding a complete subtree by a range lookup in the sorted index in time proportional to the number of items in the result
near-arbitrarily fast item access for other lookups by adding further secondary indexes
Items 1.-4. are the best possible, because the complexity cannot be better than the size of the result set.
Let me explain the performance arguments in a bit more detail:
Cases 1. and 2. are good with both approaches, since you only need a single result, which you can access directly. Whether or not you use a hash index or let a sorted index suffice will probably matter very little.
Case 3. (direct children) is faster with a graph database, since it has "find all direct neighbours" as a very fast primitive. If you rely on the materialized path and a sorted index, you cannot get the direct children with a range query and so it will be slower.
Case 4. (whole subtree) is faster with a materialized path and a range query using a sorted index, since that can just emit (using an iterator if you want) a complete range. A graph database will have to perform a graph traversal which will involve (temporary) data mutations on the server for the enumeration.
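To illustrate the materialized-path side of cases 2 and 4, here is a minimal sketch in MongoDB/pymongo terms (the collection layout and field names are assumptions; the same idea works in any store with a sorted index):
import re
from pymongo import ASCENDING, MongoClient

client = MongoClient()
items = client.fs.items  # docs like {"path": "/dir/sub/file.txt", "name": "file.txt", ...}
items.create_index([("path", ASCENDING)])

def subtree(prefix):
    # case 4: a whole subtree is one anchored-prefix query, which the sorted
    # index on "path" can answer as a range scan
    pattern = re.compile("^" + re.escape(prefix.rstrip("/") + "/"))
    return items.find({"path": pattern})

def parent(path):
    # case 2: the parent is a single lookup by the truncated path
    return items.find_one({"path": path.rsplit("/", 1)[0] or "/"})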
The advantage with a multi-model database is that you can combine the advantages of the two approaches (3. and 4.) you suggest in a single data store.
One possibility for such a database is the ArangoDB database.
Disclaimer: I am one of the main developers.

MongoDB: Optimization of Performance: Aggregation Pipeline (one collection) vs Aggregation plus Additional Query on Separate Collection

I would like to know which is faster in terms of querying in MongoDB.
Let's say I would like to search for income information based on areas.
A person can have many residences in different states, and each polygon area would have an associated income for that individual.
I have outlined two options for querying this information, and I would like to know which would be faster to search.
1) Have a single collection which contains two types of documents.
Document1: has polygons with a 2dsphere geospatial index on them. It will be searched with aggregation to return ids that link to Document2, essentially taking the place of a relation in MySQL.
Document2: has other information (let's say the income amount) and different indexes, but has an id which the first document references. It also has an index on the income amount.
The two document types are searched with an aggregation pipeline.
Stage 1 of the pipeline: search Document1 geospatially and get the id value.
Stage 2 of the pipeline: use the id found in Document1 to search the second document type, also filtering by income type.
2) Separate the documents so that each type has its own collection, avoiding aggregation: query collection1 geospatially and use the person ids found to query collection2 for the income info.
3) A third option involving a polyglot setup, a combination of MongoDB and PostGIS: query PostGIS for the id and then use that to search the MongoDB collection. I am including this option since I believe PostGIS is faster for geospatial querying than Mongo, but I am curious whether the speed of PostGIS will be cancelled out by the latency of now querying two databases.
The end goal is to pull back data based on a geospatial radius. One geospatial polygon, representing the area where the person lives and does business for that income, maps to 1 relational id, and each relational id maps to many sets of data. Essentially I have a many-to-1-to-many relationship: many geospatial areas map to 1 person, which maps to many data sets.
You should generally keep collections limited to a single type of document.
Solution 1 will not work. You cannot use the aggregation pipeline the way you are describing (if I'm understanding you correctly). Also, it sounds as though you are thinking in a relational way about a solution using a non-relational database.
Solution 2 will work but it will not have optimum performance. This solution sounds even more like a relational database solution where the collections are being treated like tables.
Solution 3 will probably work but as you said it will now require two databases.
All three of these solutions are progressively pulling these two types of documents further and further apart. I believe the best solution for a document database like MongoDB is to embed this data. Without a real example of your documents and a clear understanding of your application it's impossible to suggest an exact solution, but in general embedding data is preferred over creating relationships between documents in MongoDB. As long as no document will ever grow beyond 16MB, it's worth looking into whether embedding is the right solution.
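To make the embedding idea concrete, here is a rough sketch (pymongo assumed; the document shape, field names and numbers are purely illustrative, not a recommendation for your actual data):
from pymongo import GEOSPHERE, MongoClient

client = MongoClient()
people = client.mydb.people
# example shape: {"name": "Alice", "residences": [
#     {"area": {"type": "Polygon", "coordinates": [[...]]}, "income": 55000, "state": "CA"}, ...]}
people.create_index([("residences.area", GEOSPHERE)])

radius_km = 10
query = {
    "residences": {
        "$elemMatch": {  # the same residence must satisfy both conditions;
                         # if $elemMatch gives trouble, the income check can move to the client
            "area": {"$geoWithin": {"$centerSphere": [[-73.97, 40.77], radius_km / 6378.1]}},
            "income": {"$gte": 50000},
        }
    }
}
for person in people.find(query):
    print(person["name"])
One geospatial query now answers the whole question, with no second collection, no aggregation join and no second database.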

Serve clients with geospatial news

I am developing an app where the user will receive geo-based information depending on his position. The server should be able to handle a huge number of connected clients (>100k).
I came up with 4 approaches for handling the users' positions.
Approach 1 - Without a geospatial index:
The app server just holds a list of connected clients and their locations.
Whenever there is information available, the server loops over the whole list and checks whether each client is within a given radius.
Doubts: Very expensive
Approach 2 - Handle the geospatial index in the app server:
The app server maintains an R-tree with all connected clients and their locations.
For this I was looking at JSI (Java Spatial Index).
Doubts: It is very expensive to update the geospatial index with JSI
Approach 3 - Let the database (MongoDB) do the geospatial indexing / calculation:
The app server only holds a reference to the connected client (connection) and saves the key to that reference together with its location into MongoDB.
When new information is available, the server can query the database to get the keys of all clients nearby.
Pro: I guess MongoDB has a much better implementation of geospatial indexes than I could ever write in the app server.
Doubts: Clients are traveling around, which forces me to update the geospatial index frequently. Can I do that, or will I run into a performance problem?
Approach 4 - Own "index" using a 2-dimensional array:
Today I was thinking about creating a very simple index using a two-dimensional array, where the outer array is for the longitude and the inner one for the latitude.
Let's say 3 longitude / latitude degrees would be enough precision.
I could get a list of users in a given area with:
ArrayList<User> userList = data[91][35]; // 91.2548980712891, 35.60869979858
// I would also need to get the users in the surrounding cells: data[90][35], data[92][35], ...
// if I need more precision I could use one more decimal: data[912][356]
Pro: I would have fast read and write access without a query to the database
Doubts: The radius covered by a degree is shorter at the poles. An ugly hack?
I would be very grateful if someone could point me in "the" right direction.
The index used by MongoDB for geospatial indexing is based on a geohash, which essentially converts a two-dimensional space into a one-dimensional key, suitable for B-tree indexing. While this is somewhat less efficient than an R-tree index, it will be vastly more efficient than your scenario 1. I would also argue that filtering the data at the db level with a spatial query will be more efficient and easier to maintain than creating your own spatial indexing strategy on top.
The main thing to be aware of with MongoDB is that you cannot use a geometry column as a shard key, though you can shard a collection containing a geometry field using another key. Also, if you wish to do any aggregation queries (which isn't clear from your question), the geospatial stage ($geoNear) must be the first one in the pipeline.
There is also the geohaystack index, which is based on small buckets and optimized for searches over small areas, see http://docs.mongodb.org/manual/core/geohaystack/, which might be useful in your case.
As far as speed is concerned, insertion and search on a B-tree index are essentially O(log n) (see the Wikipedia B-tree article), while without an index your search will be O(n), so it will not take long before the difference in performance between having and not having an index is enormous.
If you are concerned about heavy writes slowing things down, you can tune the write concern in MongoDB so that you don't have to wait for a majority of replicas to respond to every write (the default), but at the cost of potentially inconsistent data, if you should lose your master.
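For approach 3, a minimal sketch of the write and read paths with pymongo (collection and field names are illustrative):
from pymongo import GEOSPHERE, MongoClient

client = MongoClient()
clients = client.geo_news.clients
clients.create_index([("loc", GEOSPHERE)])  # 2dsphere index over the last known positions

def update_position(client_key, lng, lat):
    # called whenever a client reports a new position; this is an ordinary
    # indexed update, so frequent position changes are just regular writes
    clients.update_one({"_id": client_key},
                       {"$set": {"loc": {"type": "Point", "coordinates": [lng, lat]}}},
                       upsert=True)

def clients_near(lng, lat, max_meters):
    # called when a piece of news arrives: the keys of all clients within the radius
    cursor = clients.find({"loc": {"$near": {
        "$geometry": {"type": "Point", "coordinates": [lng, lat]},
        "$maxDistance": max_meters}}}, {"_id": 1})
    return [doc["_id"] for doc in cursor]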

MongoDB spatial query on source and destination

I have a collection named Events.
Each document in the Events collection has a source and a destination in lat-long.
I would like to make a query on the Events collection and get only those events that are within some distance from a source and within some distance from a destination.
I read that MongoDB does not support two geospatial indexes on one collection.
I am confused about what my data model should look like and how I can make a query to achieve this.
Thanks
You'll have to work with the geo index limitation, so it leaves you with a couple of options. One is to create two collections and run two queries, and resolve the intersection at the application level. This could be expensive depending on what you're doing.
The other scenario is to work with one collection, but change your query to check $geoWithin some geometry which represents the intersection area of the areas around your source and destination. Since you're querying for events that are both within some distance of your source and destination, this implies that there is an intersection area. It is up to you to calculate the intersection geometry. If possible, you can precalculate and store these intersections.
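A rough sketch of the single-collection route under the one-geo-index limit (pymongo and shapely are assumed, and buffering in degrees is only a planar approximation): the indexed query narrows candidates by source, and the destination constraint, or a precalculated intersection geometry as described above, is applied afterwards.
from pymongo import GEOSPHERE, MongoClient
from shapely.geometry import Point, mapping

client = MongoClient()
events = client.mydb.events  # docs like {"source": GeoJSON Point, "destination": GeoJSON Point}
events.create_index([("source", GEOSPHERE)])

def events_near_both(src_lnglat, dst_lnglat, radius_deg):
    src_area = Point(src_lnglat).buffer(radius_deg)  # circle (in degrees) around the source point
    dst_area = Point(dst_lnglat).buffer(radius_deg)  # circle (in degrees) around the destination point
    # indexed narrowing by source; a precalculated intersection polygon could be
    # passed to $geoWithin here instead, e.g. mapping(src_area.intersection(dst_area))
    candidates = events.find({"source": {"$geoWithin": {"$geometry": mapping(src_area)}}})
    # destination constraint checked client-side, since only one geo index is available
    return [e for e in candidates
            if dst_area.contains(Point(e["destination"]["coordinates"]))]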
You may only have 1 geospatial index per collection, for now. While MongoDB may allow you to create multiple geospatial indexes, this behavior is unsupported. Because MongoDB can, in most cases, only use one index to support a single query, having multiple geo indexes will produce undesirable behavior.
There are a few tickets in JIRA for this that you might want to vote on:
https://jira.mongodb.org/browse/SERVER-2331
https://jira.mongodb.org/browse/SERVER-3653