Database design for time and geolocation data - orientdb

I have successfully implemented the Time Series use case as shown in the documentation. The data (Event class) pointed to by the smallest time unit is indexed with a Lucene spatial index.
I have two types of events : private or public.
How should I design my database and clusters for my use case ?
Would it be better to use an Edge or a linklist to make the link from Min to Event ?
I am worried that my lucene spatial index will get too big in the future.
From reading the documentation, it looks like having clusters for the geolocation data would be a great strategy.
It is possible to use the index only on the subquery:
select from (
  select expand(month["12"].day["25"].hour["17"].min["07"].events)
  from Year
  where year = 2015
)
where [lat,lng,$spatial] NEAR [66,66,{"maxDistance":1}]
The documentation on indexes tells me it is possible to use indexes on edge properties. The downside is that it takes more storage than a link list. I tested it and it works:
select expand(inV()) from (
  select expand(month["12"].day["25"].hour["17"].min["07"].outE('MinPublicEvent'))
  from Year
  where year = 2015
)
where [lat,lng,$spatial] NEAR [66,66,{"maxDistance":1}]
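The idea behind both queries is to narrow down by time first, so the spatial filter only has to look at the events of one minute. Here is a minimal in-memory sketch of that idea in Python. It is not OrientDB code: the `buckets` dictionary stands in for the Year/month/day/hour/min tree, the function names are made up for illustration, and the square box test is a deliberate simplification of a real NEAR query.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical stand-in for the Year -> month -> day -> hour -> min tree.
# Each minute bucket holds the events that happened in that minute.
buckets = defaultdict(list)

def bucket_key(ts):
    """Path through the time tree for a timestamp, down to the minute."""
    return (ts.year, ts.month, ts.day, ts.hour, ts.minute)

def add_event(ts, lat, lng, name):
    buckets[bucket_key(ts)].append({"name": name, "lat": lat, "lng": lng})

def events_near(ts, lat, lng, max_delta_deg):
    """Events in one minute bucket whose coordinates fall inside a square box.

    The box test is a simplification, not a true distance computation.
    """
    return [
        e for e in buckets.get(bucket_key(ts), [])
        if abs(e["lat"] - lat) <= max_delta_deg
        and abs(e["lng"] - lng) <= max_delta_deg
    ]

add_event(datetime(2015, 12, 25, 17, 7), 66.0, 66.0, "close")
add_event(datetime(2015, 12, 25, 17, 7), 10.0, 10.0, "far")
add_event(datetime(2015, 12, 25, 17, 8), 66.0, 66.0, "other minute")

# Seconds are ignored: 17:07:30 falls in the 17:07 bucket.
nearby = events_near(datetime(2015, 12, 25, 17, 7, 30), 66.0, 66.0, 0.5)
print([e["name"] for e in nearby])
```

Only the two events in the 17:07 bucket are ever compared against the coordinates; the event at 17:08 is never touched by the spatial test.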

Regarding edge vs. link: according to the OrientDB documentation on lightweight edges, the first difference is that edges can store properties and links can't.
These are the PROS and CONS of Lightweight Edges vs Regular Edges:
PROS:
faster in creation and traversing, because don't need an additional document to keep the relationships between 2 vertices
CONS:
cannot store properties
harder working with Lightweight edges from SQL, because there is no a regular document under the edge
Since you already mentioned using properties on edges, which makes sense to me because you can use those properties to traverse the graph, you can't use a link to store that relationship.
If you want to embed these properties on the Event vertex instead, that is also fine and you would be able to use links, losing the ability to use edge properties to traverse the graph in favour of improved performance.
The edge approach is more expressive. When performance really matters and there is a risk of a bottleneck, monitor the metrics and refactor to the embed + link approach if a performance issue shows up.
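To make the trade-off concrete, here is a rough Python model of the two options. It is only a sketch: the `Minute`, `Event` and `Edge` classes and the `visibility` property are illustrative names, not OrientDB types. With links the minute points straight at the events and any filter reads a property embedded on the event. With edges there is one extra record per relationship, and the filter can run on that record.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    visibility: str  # "public" or "private", embedded on the event itself

@dataclass
class Edge:
    target: Event
    properties: dict  # an edge is its own record, so it can carry properties

@dataclass
class Minute:
    links: list = field(default_factory=list)  # link approach: direct references
    edges: list = field(default_factory=list)  # edge approach: one record per relationship

minute = Minute()
party = Event("party", "public")
meeting = Event("meeting", "private")

# Link approach: no extra record; filtering reads a property of the event.
minute.links.extend([party, meeting])
public_via_links = [e.name for e in minute.links if e.visibility == "public"]

# Edge approach: the filter runs on the edge record, then follows it to the event.
minute.edges.append(Edge(party, {"visibility": "public"}))
minute.edges.append(Edge(meeting, {"visibility": "private"}))
public_via_edges = [
    ed.target.name for ed in minute.edges
    if ed.properties["visibility"] == "public"
]

print(public_via_links, public_via_edges)
```

Both approaches return the same events; the difference is where the property lives and how many records exist per relationship.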
Update:
Clusters are basically a mechanism to split data in OrientDB (see the Clusters tutorial), and they work for both edges and vertices.
You may also find it beneficial to locate different clusters on different servers, physically separating where you store records in your database. The advantages of this include:
Optimization: Faster query execution against clusters, given that you need only search a subset of the clusters in a class.
Indexes: With good partitioning, you can reduce or remove the use of indexes.
Parallel Queries: Queries can be run in parallel when made to data on multiple disks.
Sharding: You can shard large data-sets across multiple instances.
Unless you can clearly identify a good way to partition your data and can distribute your database across different servers, I suggest you start with the default, as OrientDB already creates one cluster for each class in the schema, and add more clusters as your database grows.
When to add more clusters? Metrics, metrics and metrics. Keep track of how your application accesses your database: what kinds of queries, how many queries, etc.
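As a rough illustration of the "search only a subset of the clusters" point, here is a toy Python model. It is not the OrientDB API: `ClusteredClass` and the cluster names `event_private` and `event_public` are assumptions made up for this sketch. It counts how many records a scan touches, which is the kind of metric worth tracking before adding clusters.

```python
from collections import defaultdict

class ClusteredClass:
    """Toy model of a class whose records are split across named clusters."""

    def __init__(self):
        self.clusters = defaultdict(list)

    def insert(self, cluster, record):
        self.clusters[cluster].append(record)

    def scan(self, predicate, clusters=None):
        """Return (matches, records_scanned).

        Scans every cluster unless a subset of cluster names is given.
        """
        names = list(self.clusters) if clusters is None else clusters
        scanned, matches = 0, []
        for name in names:
            for record in self.clusters.get(name, []):
                scanned += 1
                if predicate(record):
                    matches.append(record)
        return matches, scanned

events = ClusteredClass()
for i in range(1000):
    events.insert("event_private", {"id": i, "city": "Oslo" if i % 10 == 0 else "Rome"})
for i in range(100):
    events.insert("event_public", {"id": i, "city": "Oslo" if i % 10 == 0 else "Rome"})

def in_oslo(record):
    return record["city"] == "Oslo"

all_matches, all_scanned = events.scan(in_oslo)
pub_matches, pub_scanned = events.scan(in_oslo, clusters=["event_public"])
print(all_scanned, pub_scanned, len(pub_matches))
```

If most queries only ever need public events, the scan over the single cluster touches far fewer records than the scan over the whole class.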

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: In this specific use case which option would be most optimal, multiple collections vs single collection + sharding, in terms of performance and speed ?
Any help would be appreciated.
As you mentioned, you need to experiment with your data and use case. That will give you a better picture.
Some decisions to make first:
Estimate the number of documents you will have in the near future. If you expect 1M documents in a year, test with at least 3M.
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have many writes and many indices, a single monolithic collection will be slower, because multiple indices need to be updated on each write.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections, but do check its performance, which depends entirely on the decisions above. Note this open issue as well.
If you decide to go with a monolithic collection, defer the sharding. Implement it once you find that queries are getting slower.
If you have many writes to the same document, they will be executed sequentially, which will also slow down your reads.
Think about reclaiming disk space when a lot of data is removed from the collections. Multiple collections do well here.
The point that pushes me toward suggesting a monolithic collection is your statement that "I'd need to query all the categories at the same time". You may need to add more categories later, and combining results from all of them into a single response would not be good for performance.
As you don't really have a join use case like in an RDBMS, you can go with a single monolithic collection from a modelling point of view. I doubt you would have a join key.
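To see how the two layouts change the shape of the queries, here is a small in-memory Python sketch. Plain lists and dicts stand in for collections, and the product data and function names are invented for illustration. It says nothing about actual database performance; it only shows that a cross-category search becomes a merge over every collection in the multi-collection layout.

```python
from itertools import chain

products = [
    {"category": "books", "name": "Graph Databases", "pages": 250},
    {"category": "games", "name": "Graph Puzzle", "players": 2},
    {"category": "books", "name": "Cooking Basics", "pages": 120},
]

# Layout A: one collection, category stored as a field.
single = list(products)

# Layout B: one collection per category, category implied by the collection name.
multi = {}
for p in products:
    doc = {k: v for k, v in p.items() if k != "category"}
    multi.setdefault(p["category"], []).append(doc)

def search_single(term, category=None):
    return [
        p["name"] for p in single
        if term in p["name"] and (category is None or p["category"] == category)
    ]

def search_multi(term, category=None):
    # A cross-category search has to visit every collection and merge the results.
    collections = multi.values() if category is None else [multi.get(category, [])]
    return [p["name"] for p in chain.from_iterable(collections) if term in p["name"]]

print(search_single("Graph"), search_multi("Graph"))
print(search_single("Graph", "books"), search_multi("Graph", "books"))
```

Both layouts answer both kinds of query; the single collection pays with a category filter on every scoped query, the multiple collections pay with a merge on every global one.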
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read this through the correct lens. The message you should get from this article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only ok if you have few updates or few documents, but not both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
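The two strategies can be sketched in a few lines of Python. This is a minimal illustration with made-up data: the "brand" sub-document plays the role of the common fields, and plain dicts stand in for collections. The first strategy rewrites every document that carries a copy; the second writes once and resolves the reference with an application-side join.

```python
# Strategy 1: common data copied into every product document.
embedded = [
    {"sku": "a1", "brand": {"name": "Acme", "country": "US"}},
    {"sku": "a2", "brand": {"name": "Acme", "country": "US"}},
]

def rename_brand_embedded(docs, old, new):
    """Rewrite every document carrying the copy; return how many were touched."""
    touched = 0
    for doc in docs:
        if doc["brand"]["name"] == old:
            doc["brand"]["name"] = new
            touched += 1
    return touched

# Strategy 2: products reference a shared brand document by key.
brands = {"b1": {"name": "Acme", "country": "US"}}
referenced = [{"sku": "a1", "brand_id": "b1"}, {"sku": "a2", "brand_id": "b1"}]

def join_products(products, brand_store):
    """Application-side join: a second lookup per product resolves the reference."""
    return [{**p, "brand": brand_store[p["brand_id"]]} for p in products]

writes_embedded = rename_brand_embedded(embedded, "Acme", "Acme Corp")
brands["b1"]["name"] = "Acme Corp"  # one write, regardless of product count
joined = join_products(referenced, brands)
print(writes_embedded, [p["brand"]["name"] for p in joined])
```

The number of writes in the first strategy grows with the number of products, which is why it only holds up when updates or documents are few.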
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

How to compute connected components in OrientDB

Does OrientDB support efficient computation of connected components?
I am not experienced with graph databases. My naive intuition is that this operation should be quite efficient.
If it is efficiently supported, what would a query to find all connected components look like?
I had the same issue and ended up writing an OSQL query to compute connected components in a graph; here is my solution
Below is an excerpt from the OrientDB website. I've highlighted a few relevant portions.
OrientDB can embed documents like any other document database, but
also supports relationships. It doesn’t use the costly JOIN. Instead,
OrientDB uses super-fast, persistent pointers between records, taken
from the graph database world. You can traverse parts of or entire
trees and graphs of records in just a few milliseconds.
This illustration shows how the original document has been
split into two documents linked using the Customer’s Record ID #8:124
to connect the Order to the Customer document. Links can be thought of
as in-memory pointers, but persistent on disk.
[snip]
With document and relational DBMS, the more data you
have, the slower the database will be. Joins have a heavy runtime
cost. Instead, OrientDB handles relationships as physical links to the
records, assigned only once, when the edge is created O(1). Compare
this to an RDBMS that “computes“ the relationship every single time
you query a database O(LogN). With OrientDB, traversing speed is not
affected by the database size. It is always constant, whether for one
record or 100 billion records. This is critical in the age of Big
Data!
And here is an example query taken from the tutorial document, which will get all the friends of the person called Luca.
SELECT EXPAND( BOTH( 'Friend' ) ) FROM Person WHERE name = 'Luca'
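Note that the query above only returns the direct neighbours of one vertex. For reference, this is what computing all connected components looks like as a plain breadth-first search in Python, run client-side over an edge list. It is a sketch of the standard algorithm, not an OrientDB query, and the names and friendships are invented sample data.

```python
from collections import defaultdict, deque

def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search from every vertex not reached yet.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

people = ["Luca", "Anna", "Marco", "Sara", "Ivan"]
friendships = [("Luca", "Anna"), ("Anna", "Marco"), ("Sara", "Ivan")]
components = connected_components(people, friendships)
print(components)
```

Each vertex and each edge is visited a constant number of times, so the running time is linear in the size of the graph.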

Is there a graph-document database that supports geospatial queries?

Here's the rundown of what I need:
A graph database
Each node is a document; there will be hundreds of types of nodes; each of these several hundred types will have its own consistent schema.
Can scale to billions of nodes
Each node also has a (lat,lng) coordinate in addition to the edges between nodes
I want to use (lat,lng) as a shard key so this can be scaled to a large sharded, replicated cluster. Edge traversals will occur ~95% within nearby (lat,lng) locations.
I want to be able to issue geo+document queries. For example "Show me all the graph nodes/documents matching this query { ... } ordered by distance from (lat_0, lng_0)"
I want something that's well-documented, has an active developer community, is recommended for production use, and likely to be around for years.
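To pin down what a geo+document query means here, this is a minimal Python sketch of its semantics: filter nodes with a document predicate, then order the matches by great-circle distance from a point. It runs over an in-memory list with no index, so it only illustrates the expected result, not how a database would execute it. The node data and the `geo_query` name are made up for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points in kilometres."""
    dlat, dlng = radians(lat2 - lat1), radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def geo_query(nodes, predicate, lat0, lng0):
    """Nodes matching the document predicate, ordered by distance from (lat0, lng0)."""
    matching = [n for n in nodes if predicate(n)]
    return sorted(matching, key=lambda n: haversine_km(lat0, lng0, n["lat"], n["lng"]))

nodes = [
    {"id": 1, "type": "cafe", "lat": 48.8566, "lng": 2.3522},    # Paris
    {"id": 2, "type": "cafe", "lat": 51.5074, "lng": -0.1278},   # London
    {"id": 3, "type": "museum", "lat": 50.8503, "lng": 4.3517},  # Brussels
    {"id": 4, "type": "cafe", "lat": 52.5200, "lng": 13.4050},   # Berlin
]

# Cafes ordered by distance from Brussels.
result = geo_query(nodes, lambda n: n["type"] == "cafe", 50.8503, 4.3517)
print([n["id"] for n in result])
```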
Here are problems with existing databases:
MongoDB: no graph support, no joins
Neo4j: no sharding
OrientDB: no geospatial indexing
ArangoDB: can do WITHIN queries but cannot have additional query clauses (e.g. MongoDB's geoNear has a query parameter)
Is there anything that fits my use case?
Would you like a unicorn and a machine that prints an unlimited number of $100 bills to go along with that? Har har har....
OK but seriously, you've got a tall order there. You're going to need a custom system that blends a few of those things together. For one, as you observe, there's really no such thing as a "graph/document" database.
As a general area of systems research, many people are looking into hybrid systems. An example would be that you maintain your graph structure in neo4j, and that the IDs of nodes in neo4j point to identifiers for documents in MongoDB. In this way, you'd have a graph/document database, but it would really be two databases. Such hybrid systems are rife with tradeoffs. For one, writing a query across both systems will be extremely difficult. For two, you'll introduce data dependencies across them, such that it might not be easy to update your graph structure without changing your documents, or vice versa.
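A toy version of such a hybrid setup, in Python, makes the tradeoffs visible. Two plain dicts stand in for the two databases, and the IDs and documents are invented; this is a sketch of the pattern, not of any real driver API. One logical query becomes two round trips, and nothing stops the stores from drifting apart.

```python
# Store 1: graph structure only (stand-in for the graph database).
graph = {"n1": ["n2", "n3"], "n2": ["n3"], "n3": []}

# Store 2: documents keyed by the same IDs (stand-in for the document database).
documents = {
    "n1": {"title": "Depot", "lat": 40.0, "lng": -74.0},
    "n2": {"title": "Warehouse", "lat": 40.1, "lng": -74.1},
    # "n3" is deliberately missing to show the cross-store dependency problem.
}

def neighbours_with_documents(node_id):
    """Two round trips: traverse in the graph store, then fetch from the document store."""
    neighbour_ids = graph.get(node_id, [])
    found, dangling = [], []
    for nid in neighbour_ids:
        doc = documents.get(nid)
        if doc is None:
            dangling.append(nid)  # the graph references a document that does not exist
        else:
            found.append(doc)
    return found, dangling

found, dangling = neighbours_with_documents("n1")
print([d["title"] for d in found], dangling)
```

The dangling ID is exactly the kind of data dependency mentioned above: the application has to keep the two stores consistent, because neither database knows about the other.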
For really intense performance requirements, hybrid systems are sometimes the only way to go. But just as a rule of thumb, for every 100 times you see someone say they need such a solution, probably 80 times they're better off with picking just one database and then living with the pros and the cons that it provides to them. Technology is ultimately about choices, pros, and cons, and learning to live with what you've picked. :)
To give you a succinct answer to the question you've asked, no there's nothing that does all of that. I'd recommend you work with an architect or consultant who can explore your requirements in depth, and make a recommendation on what architecture best suits most of your needs, balancing simplicity and cost. That's as much an art as a science.

120 mongodb collections vs single collection - which one is more efficient?

I'm new to mongodb and I'm facing a dilemma regarding my DB Schema design:
Should I create one single collection or put my data into several collections (we could call these categories I suppose).
Now I know many such questions have been asked, but I believe my case is different for 2 reasons:
If I go for many collections, I'll have to create about 120 and that's it. This won't grow in the future.
I know I'll never need to query or insert into multiple collections. I will always have to query only one, since a document in collection X is not related to any document stored in the other collections. Documents may hold references to other parts of the DB though (like userId etc).
So my question is: could the 120 collections improve query performance? Is this a useful optimization in my case?
Or should I just go for single collection + sharding?
Each collection is expected to hold millions of documents. If I use only one, it will store billions of docs.
Thanks in advance!
------- Edit:
Thanks for the great answers.
In fact the 120 collections is only a self made limit, it's not really optimal:
The data in the collections is related to web publishers. There could be millions of these (any web site can join).
I guess the ideal situation would be if I could create a collection for each publisher (to hold their data only). But obviously, this is not possible due to mongo limitations.
So I came up with the idea of a fixed number of collections to at least distribute the data somehow. Like: collection "A_XX" would hold XX Platform related data for publishers whose names start with "A".. etc. We'll only support a few of these platforms, so 120 collections should be more than enough.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
What do you think about this? Is there a better solution?
Sorry for not being specific enough in my original question.
Thanks in advance
Single Sharded Collection
The edited version of the question makes the actual requirement clearer: you have a collection that can potentially grow very large and you want an approach to partition the data. The artificial collection limit is your own planned partitioning scheme.
In that case, I think you would be best off using a single collection and taking advantage of MongoDB's auto-sharding feature to distribute the data and workload to multiple servers as required. Multiple collections is still a valid approach, but unnecessarily complicates your application code & deployment versus leveraging core MongoDB features. Assuming you choose a good shard key, your data will be automatically balanced across your shards.
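Why the choice of key matters can be shown with a small Python experiment comparing the first-letter scheme from the question with a hashed key. This is only a sketch: the publisher names are synthetic, and md5 is used purely to get a deterministic hash for the illustration, not as a statement about how MongoDB hashes shard keys.

```python
import hashlib
from collections import Counter

def prefix_partition(name):
    """Partition by first letter, like an 'A_XX' collection naming scheme."""
    return name[0].upper()

def hashed_partition(name, num_partitions):
    """Partition by a stable hash of the whole key."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Hypothetical publisher names: real names cluster on a few popular first letters.
publishers = (
    [f"shop{i}.example" for i in range(600)]
    + [f"blog{i}.example" for i in range(300)]
    + [f"zine{i}.example" for i in range(100)]
)

by_prefix = Counter(prefix_partition(p) for p in publishers)
by_hash = Counter(hashed_partition(p, 4) for p in publishers)
print(by_prefix)
print(by_hash)
```

The first-letter scheme inherits whatever skew the names have, while a key that spreads values evenly keeps the partitions close in size.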
You do not have to shard immediately; you can defer the decision until you see your workload actually requiring more write scale (but knowing the option is there when you need it). You have other options before deciding to shard as well, such as upgrading your servers (disks and memory in particular) to better support your workload. Conversely, you don't want to wait until your system is crushed by workload before sharding so you definitely need to monitor the growth. I would suggest using the free MongoDB Monitoring Service (MMS) provided by 10gen.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
Multiple databases will add significantly more administrative overhead, and would likely be overkill and possibly detrimental for your use case. Storage is allocated at the database level, so 120 databases would be consuming much more space than a single database with 120 collections.
Fixed number of collections (original answer)
If you can plan for a fixed number of collections (120 as per your original question description), I think it makes more sense to take this approach rather than using a monolithic collection.
NOTE: the design considerations below still apply, but since the question was updated to clarify that multiple collections are an attempted partitioning scheme, sharding a single collection would be a much more straightforward approach.
The motivations for using separate collections would be:
Your documents for a single large collection will likely have to include some indication of the collection subtype, which may need to be added to multiple indexes and could significantly increase index sizes. With separate collections the subtype is already implicit in the collection namespace.
Sharding is enabled at the collection level. A single large collection only gives you an "all or nothing" approach, whereas individual collections allow you to control which subset(s) of data need to be sharded and choose more appropriate shard keys.
You can use the compact command to defragment individual collections. Note: compact is a blocking operation, so the normal recommendation for an HA production environment would be to deploy a replica set and use rolling maintenance (i.e. compact the secondaries first, then step down and compact the primary).
MongoDB 2.4 (and 2.2) currently have database-level write lock granularity. In practice this has not proven a problem for the vast majority of use cases, however multiple collections would allow you to more easily move high activity collections into separate databases if needed.
Further to the previous point .. if you have your data in separate collections, these will be able to take advantage of future improvements in collection-level locking (see SERVER-1240 in the MongoDB Jira issue tracker).
The main problem here is that you will gain very little performance in the current MongoDB versions if you separate out collections into the same database. To get any sort of extra performance over a single collection setup you would need to move the collections out into separate databases, then you will have operational overhead for judging what database you should query etc.
So yes, you could go for 120 collections easily; however, you won't really gain anything currently because https://jira.mongodb.org/browse/SERVER-1240 is not implemented (and won't be anytime soon).
Housing billions of documents in a single collection isn't too bad. I presume that even if you were to house this in separate collections it probably would not be on a single server either, just like sharding a single collection, so any speed reduction due to a multi-server setup will also not matter in this case.
In my personal opinion, using a single collection is easier on everything.

Tag aware sharding in Mongodb

I have been reading about tag-aware sharding. These are the links I referred to:
http://www.mongodb.org/display/DOCS/Tag+Aware+Sharding
http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
Kristina has explained the concept in a very lucid manner and one thing is for sure: this enhancement is going to make MongoDB more developer-friendly.
But my question is: it looks like tagging/retagging is meant to make it easy to migrate chunks around, get all writes onto a preferred data center, and so on. But how does this fit into the old system of range partitioning and the way Mongo learns key distributions for balancing? It is said that the shard key cannot be changed, because the data is assumed to be distributed across shards and changing the shard key would disturb this. Isn't applying a tag essentially doing the same? So is tag-aware sharding meant to handle this problem?
EDIT:
And any idea how indexes are affected by such huge migrations?
Aafreen,
You are correct. At this stage shard-tagging performs many of the same functions as balancing with the shard key. The one thing it does not do is perform any level of distribution beyond that of tagging. So it is probably more correct to say that the tagging architecture lives on top of the existing sharding architecture.
You must keep in mind that tagging only governs:
a) where tagged data will go, untagged data will use the shard-key
b) that tagged data shared amongst a number of tagged servers will still need to be distributed
You can most certainly use the tag aware sharding to manually control data distribution in the same manner that the balancer does now, by making granular enough tags so that data is put where you want it and distributed evenly.
The intended use case, however, is more like the one in the documentation you linked, where you have a large number of shards broken up into smaller subsets. In this example you would tag each object, the tag would push it to the correct geographic location (for lower-latency retrieval), and once within the correct geography the original sharding architecture would take over and distribute the data amongst the tagged shards.
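The two stages can be modelled in a few lines of Python. This is a toy model under loud assumptions: the shard names, tags and key ranges are invented, and the hash in the second stage is only a stand-in for the balancer, which in MongoDB distributes chunks rather than hashing individual keys.

```python
import hashlib

# Hypothetical cluster: each shard carries one tag.
shard_tags = {"shard0": "EU", "shard1": "EU", "shard2": "US", "shard3": "US"}

# Hypothetical tag ranges on the shard key: [low, high) -> tag.
# ";" is the character right after ":", so each range covers one key prefix.
tag_ranges = [("eu:", "eu;", "EU"), ("us:", "us;", "US")]

def tag_for(key):
    for low, high, tag in tag_ranges:
        if low <= key < high:
            return tag
    return None  # untagged data may live on any shard

def place(key):
    """Stage 1: the tag narrows the candidate shards. Stage 2: spread within them."""
    tag = tag_for(key)
    candidates = sorted(s for s, t in shard_tags.items() if tag is None or t == tag)
    # Stand-in for the balancer: pick a candidate deterministically from the key.
    slot = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % len(candidates)
    return candidates[slot]

placements = {k: place(k) for k in ["eu:alice", "eu:bob", "us:carol", "misc:dave"]}
print(placements)
```

Tagged keys can only land on shards carrying the matching tag, and the ordinary distribution step still decides which of those shards gets the data.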
As for indexes, they are heavily affected by migrations, as they need to be repointed. But the level of load is the same as that of a large number of chunk migrations, like adding a new shard to a cluster.