Relational or MongoDB for a network-of-locations application? - mongodb

I'm trying to start a project where a user can add locations, and each location can have relations to other locations. It will be a web application and I'm planning to use Java EE 7 + JPA + WildFly.
Basically a location will have at least name, text, longitude and latitude fields. Some other entities will be linked to a location. There will also be queries like "find the nearest x locations to y".
I've encountered MongoDB several times before, but I've always been on another part of the system, so I'm not really sure whether there's a big benefit to using it in this kind of system. The application is not a store, so there's no checkout process and therefore no transactions.
If I ever decide to use MongoDB, I'm looking at:
http://hibernate.org/ogm/ and
MongoDB geospatial queries
I would like to ask for opinions. Thanks.

From your description, I think MongoDB is a pretty good choice:
Your data model is fairly simple. You might even consider embedding the other entities in the location documents, which reduces the number of queries needed to retrieve the entities near a specific point to exactly one:
db.collection.ensureIndex({ fieldHoldingGeoJSON: "2dsphere" })
db.collection.find({
    fieldHoldingGeoJSON: {
        $near: {
            $geometry: {
                type: "Point",
                coordinates: [longitude, latitude]   // GeoJSON order: longitude first
            },
            $maxDistance: distanceInMeters
        }
    }
})
Keep in mind, though, that the maximum size of a BSON document is 16 MB. So if there are a lot of related entities, it might be worth embedding the locations into the entities instead, eliminating the need for a dedicated locations collection (one can still be created on demand using the aggregation framework). With the locations embedded in the entities, you can scale more or less indefinitely, and all the properties of an entity are stored where they belong (very NoSQL-ish).
Although the second approach introduces redundancy, we are only talking about a few bytes per document; even with 100M records that would be less than 1 GB, and disk space is cheap.
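As a rough sketch of that second approach, each entity document could carry its own copy of the location, with the same kind of 2dsphere index applied to the embedded field (all field names here are assumptions, not from the question):

// Hypothetical entity document embedding its own copy of the location
{
    name: "Some restaurant",
    category: "food",
    location: {
        name: "Main Street 1",
        geometry: {
            type: "Point",
            coordinates: [longitude, latitude]   // GeoJSON order: longitude first
        }
    }
}
// The same index and $near query then work directly on the entity collection:
db.entities.ensureIndex({ "location.geometry": "2dsphere" })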
Also keep in mind that MongoDB comes with powerful features such as (relatively) easy scaling and an excellent aggregation framework, to name a few.

MongoDB supports queries where you can ask for the "nearest" documents within a radius -- very useful, if that's what you are looking for.

Actually, PostgreSQL has the PostGIS extension, which implements spatial and geographic objects that can be queried using SQL. An example taken from their site is the following:
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
AND city.name = 'Gotham';

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions on its data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: in this specific use case, which option would be better in terms of performance and speed, multiple collections or a single collection + sharding?
Any help would be appreciated.
As you mentioned, you need to experiment with your data and use case; you will then have a better picture.
Some decisions are required, as below.
Decide the number of documents you will have in the near future. If you expect 1M documents within a year, then test with at least 3M.
Decide the number of indices required.
Decide the number of writes and reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have many writes along with many indices, a single monolithic collection will be slower, as multiple indices need to be updated.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections (a sketch follows this list), but do check the performance; it purely depends on the decisions above. Note this open issue also.
If you decide to go with a monolithic collection, defer sharding; implement it once you find that queries are getting slower.
If you have many writes to the same document, those writes will be executed sequentially, which will slow down your reads as well.
Think about reclaiming disk space when a lot of data is deleted from the collections; multiple collections do well here.
The point that pushes me toward a monolithic collection is that you need to query all the categories at the same time. You may need to add more categories, and combining all of them into a single response from separate collections would not be better in terms of performance.
As you don't really have a join use case like in an RDBMS, you can go with a single monolithic collection from a modelling point of view; I doubt you would have a join key.
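As a rough sketch of how $unionWith could combine per-category collections at query time (MongoDB 4.4+; the collection and field names are assumptions, not from the question):

// Hypothetical: search one category's collection and union in a second one
db.products_electronics.aggregate([
    { $match: { name: /phone/i } },
    { $unionWith: {
        coll: "products_appliances",
        pipeline: [ { $match: { name: /phone/i } } ]
    } }
])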
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read it through the correct lens. The message you should take from the article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only OK if you have few updates or few documents; it breaks down when you have many of both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls); see the sketch after this list
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
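As an illustration of a code-side join in the mongo shell, assuming products keep a reference to a shared spec document rather than embedding it (all names here are hypothetical):

// Two round trips: fetch the product, then fetch the document it references
var product = db.products.findOne({ _id: someProductId })
var commonSpec = db.common_specs.findOne({ _id: product.specId })
// The application merges the two results itself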
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

MongoDB and one-to-many relation

I am trying to come up with a rough design for an application we're working on. What I'd like to know is if there is a way to directly map a one-to-many relation in Mongo.
My schema is like this:
There are a bunch of Devices.
Each device is known uniquely by its name/ID.
Each device, can have multiple interfaces.
These interfaces can be added by a user in the front end at any given time.
An interface is known uniquely by its ID, and can be associated with only one device.
A device can contain on the order of 100 interfaces, at least.
I was going through the MongoDB documentation where they discuss embedded documents vs. multiple collections. I don't have detailed clarity on this by any means, as I've just started with Mongo and Meteor.
The question is: what would be the better approach, having multiple small collections or having one big collection with embedded documents? I know this question is somewhat subjective; I just need some clarity from folks who have more expertise in this field.
Another question is: suppose I go with the embedded model, is there a way to update only a part of the document (specific to one interface alone), so that when an interface is added, it can be inserted into the same device document?
It depends on the purpose of the application.
Big document
A good example of where you'd want a big embedded document is when you are not (normally) going to modify the data but you are going to query it a lot. In my application I use this for storing pre-processed trips with all their information, so when someone wants to consult a trip, all the information is in a single document. However, if your query filters on a value that is embedded inside a list within a trip, it will be very slow; in that case I'd recommend creating another collection with a relation between the two. Also, updating part of a large document can be slow, since the whole document may need to be rewritten.
Small documents with relations
If you plan on modifying the data a lot, I'd recommend sticking to references to another collection. With small documents, this allows you to update each collection more quickly. If you want to model a unique relation, you may consider using a unique index in Mongo. This can be done using: db.members.createIndex( { "user_id": 1 }, { unique: true } ).
Therefore:
Big document: great for reading a whole entity at once, but slow for complex queries and partial updates.
Small related collections: Great for updating but requires several queries on distinct collections.
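To the second part of the question: with the embedded model, a single interface can still be added or changed with MongoDB's update operators, without replacing the whole document in application code. A minimal sketch, assuming a devices collection whose documents embed an interfaces array (all field names are assumptions):

// Append a new interface to one device's embedded array
db.devices.update(
    { _id: "device-42" },
    { $push: { interfaces: { ifId: "eth0", speed: "1G" } } }
)
// Modify one existing interface in place using the positional operator
db.devices.update(
    { _id: "device-42", "interfaces.ifId": "eth0" },
    { $set: { "interfaces.$.speed": "10G" } }
)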

Which (near) realtime spatial database for 1M+ entries?

I am starting an analytics project which will handle millions of geolocated data points.
The data will probably look something like this:
id {
    userId,
    long,
    lat,
    time,
    appId
}
My main operations:
get all the data included in a zone
find all the points belonging to a userId
pub/sub to show all new entries
add/remove a field on all data (or just a few)
I would like to use Meteor.js and need near-realtime performance (~0.5 s to 3 s max).
Maybe this is important: I need a precision of 3-15 m in my results.
So I looked at:
Redis: seems simple to use, and there is a Redis Geo plugin. Plus there is a driver for Meteor.
PostGIS: real-time performance with millions of entries? No driver for Meteor.
PostgreSQL: there is a driver for Meteor.
HBase: seems built for big tables. No driver for Meteor.
Which one would you use? (Any other suggestion would be appreciated.)
There is a Postgres client for Node.js; this should be usable with Meteor. It works like a charm when it comes to PostGIS (I've used it myself in some projects). You do have to take care of the output, though, using the PostGIS output functions (e.g. ST_AsGeoJSON) in combination with ARRAY when designing your queries.
PostGIS is probably the best choice when it comes to spatial queries. It is thoroughly tested, properly maintained and used in many applications.
I cannot make any assertions about your performance constraints, though. Spatial queries are inherently complex (e.g. polygon intersection has at best O(n^2) complexity). You might be able to mitigate performance issues with indices and caches; that has always worked for me, but I never had to deal with tight query-time constraints.
Regarding your operations: all but the first should cost next to nothing (database-performance wise). The first query can be a bit tricky, as you will have to use one of the following functions: ST_Intersects(), ST_Contains() or ST_Covers(), all of which have a complexity greater than O(n). Your queries can be designed so that they run quite fast, but as I said, I don't know whether your constraints will be met.

neo4j vs mongodb for spatial search

I'm getting ready to start a project where I will be building a recommendation engine for restaurants. I have been waffling between Neo4j (graph DB) and MongoDB (document DB). My nodes/documents will be things like restaurant and person. I know I will want some edges, something like person->likes->restaurant, or person->ate_at->restaurant. My main query, however, will be to find restaurants within X miles of location Y.
If I have 20 restaurants within X miles of Y, but not connected by any edges, how will Neo4j be able to handle the spatial query? I know that with MongoDB I can index on lat/long and query all restaurant types. Does Neo4j offer the same functionality in a disconnected graph?
When it comes to answering questions like "which restaurants do my friends eat at most often?", is Neo4j (graph DB) the way to go, or will MongoDB (document DB) provide me similar functionality?
Neo4j Spatial introduces a spatial R-tree index (or other means) that is part of the graph itself. That means even disconnected domain entities will be found via the spatial search if you index them (that is, relationships will connect the spatial index to the restaurants). Also, this is flexible enough that you can combine the raw bounding-box search in the R-tree with other things, like a check on the restaurants' categories, in the same go, since you can hop in and out of the different parts of the graph.
This way, Neo4j Spatial supports the full range of search capabilities that you would expect from a full topology, like combined searches and searches on polygons with holes, etc.
Be aware that Neo4j Spatial is at version 0.7, so be gentle and ask on http://groups.google.com/group/neo4j/about :)
I'm not that familiar with Neo4j Spatial, but it would seem that MongoDB is at the very least a good fit, since it's the database Foursquare uses for exactly the purpose you describe. MongoDB geo indexing is extremely fast and scales up nicely.
Another possible solution is to use Couchbase. It uses a document model as well, though you need to be much more comfortable with MapReduce for queries. It has better spatial capabilities right now than MongoDB, but that may change over time.
Suggestions aside, I agree that of the two choices you have given, Mongo will suit your needs fine and is probably more appropriate for your spatial queries.
Neo4j geospatial doesn't scale up that well. I created a geospatial layer in Neo4j and added nodes to this layer. Beyond 10,000 nodes, adding nodes to the layer becomes very slow, even when using Neo4j 2.0.
On the other hand, MongoDB's geolocation features are comparatively much faster and more scalable.

Cassandra Or MongoDB For Our Location Based Application

We are looking at using a NoSQL database system for a large project. Currently, we have read a bit about MongoDB and Cassandra, though we have absolutely no experience with either. We are very proficient with traditional relational databases like MySQL and Microsoft SQL, but the NoSQL (key/value store) is a new paradigm for us.
So basically, which NoSQL database do you guys recommend for our use?
We do both heavy writes and reads. Basically we have tens of thousands of devices that are reporting:
device_id (int), latitude (decimal), longitude (decimal), date/time (datetime), heading char(2), speed (int)
Every minute. So, at peak times we need to be able to process hundreds of writes a second.
Then, we also have users, that are querying this information in the form of, give me all messages from device_id 1234 for the last day, or last week. Also, users do other querying like, give me all messages from device_1234 where speed is greater than 50 and date is today.
So, our initial thoughts are that MongoDB or Cassandra are going to allow us to scale this much more easily than using a traditional database.
A document or value in MongoDB or Cassandra for us, might look like:
{
    device_id: 1234,
    location: [-118.12719739973545, 33.859012351859946],
    datetime: 1282274060,
    heading: "N",
    speed: 34
}
Which system do you guys recommend? Thanks greatly.
MongoDB has built-in support for geospatial indexes: http://www.mongodb.org/display/DOCS/Geospatial+Indexing
As an example, to find the 10 closest devices to a location you can just do:
db.devices.find({location: {$near: [-118.12719739973545, 33.859012351859946]}}).limit(10)
I have a post on a location-based app using MongoDB, just like the one you described. MongoDB, with its strong query and index support, might be the better choice for you. Just like Cassandra, MongoDB has partitioning and replication for scaling reads and writes, but their underlying architectures are very different.
Although you have not mentioned any location-based queries, if you are interested in queries like "give me all the devices within radius r of location l and between time t1 and t2", you will find MongoDB's geospatial querying and indexing extremely useful.
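For example, such a query could look roughly like this, assuming a geo index on location and Unix timestamps in datetime, as in the document shown above (the radius and time bounds are placeholders):

// Devices within ~10 miles of a point, reported between t1 and t2
// ($centerSphere takes the radius in radians: miles / 3963.2)
db.devices.find({
    location: { $geoWithin: { $centerSphere: [ [-118.1272, 33.8590], 10 / 3963.2 ] } },
    datetime: { $gte: t1, $lte: t2 }
})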
I have done some work with mongodb and geospatial data, but not on the scale mentioned above. The geospatial searches are very fast, much more so than mysql.
I suggest looking into MongoDB's sharding, replication, and clustering functionality to deal with the volume of writes. Sharding on the device identifier may be a good way to handle the write volume; if you're interested in proximity of events, then sharding on lat/lng may be more appropriate.
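As a rough sketch of sharding on the device identifier (the database and collection names are assumptions):

// Enable sharding for the database, then shard the collection on device_id
sh.enableSharding("tracking")
sh.shardCollection("tracking.messages", { device_id: 1 })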
jack
Go with MongoDB for geolocation search. Release 2.4 improves on the core geo features. Lots of big sites use it for geolocation search.
You might consider using Elasticsearch. ES keeps the JSON of the original document stored, along with all the indexed fields. The JSON can be instantiated into any modern language's variables/arguments. In Java, one could even disable that and store native Java persistence data in a field. After search retrieval, just loop through and instantiate a collection of the original object types.
Using Elasticsearch gives you trie indexes for high-speed numeric range queries; you obviously get full-text searches of every flavor, and geographic bounding-box queries, all with AND or OR filtering. Date searches are also native (although Java's handling of dates is poor, so I switched to BIGINT representations of timestamps to represent dates).
UNLIKE some past (and maybe present) NoSQL solutions, the geographic indexing and querying is PART of any query and no extra steps are required. I.e., one MongoDB solution in the recent past required a geospatial search to collect matching document IDs, and then you used those IDs inside another query to search within them for your other criteria. In reality, that's what happens in all solutions anyway, but it's much faster and cached in Elasticsearch.