MongoDB: How can I store the following relation? - mongodb

I am an IT student, and we are only taught RDBMSs at university. I want to get to know MongoDB by developing a small video game collection manager, for educational purposes only.
I want to have an overview of the platforms that are stored, e.g.:
PC
Xbox
Xbox 360
Playstation
and so on
When one of the platforms is selected, I want to query for all entries matching the selected platform.
So my first idea is to have two collections: platform and game
platform:
_id : "PLATFORMID"
name : platform_one
game:
_id : "SOMEID"
name : game_One
publisher : somePublisher
platform : PLATFORMID
I know it is considered best practice to put as much as possible into a single document, but I think in this case it is not ideal: to get all available platforms, I would have to query for all games, iterate over the whole collection, and pick out the platforms.
With my approach it would be possible to load only the platforms on startup, and then query for the games.
Am I right, or is there a much better solution with MongoDB?
Since this is my first project using a NoSQL DB, any help or tips are appreciated.

Before you design your schema, I would recommend reading this: http://docs.mongodb.org/manual/core/data-modeling/ because it will give you a good idea of the different data models in Mongo. It will give you a firm foundation for how to do things in Mongo (which is very different from a traditional RDB).
Mongo schema design is highly use-case dependent, and therefore the data structure is tightly integrated with the program. Don't think that it is necessary to completely normalize your data; that usually isn't the right approach when using Mongo. Also be aware that, once you accept that, there are many different schema structures to pick from, and the choice is often driven by the performance you need. Therefore two applications with the exact same data may have entirely different data structures, each correct for its own use case, and it would be incorrect for either to use the other's (if you care about performance, anyway).
For your specific case I would suggest putting the platform field into the game document. To get all the platforms there are a couple of approaches.
First option: have a collection that holds all platforms. The platform can be stored in the _id field as the document's only content, or you can add further fields about that platform. These platforms will be unique, and the program logic will have to keep the collection updated. Perhaps the application only offers platforms as a pick list from this collection when creating documents for the games collection. This lets many clients cheaply stay in sync on the platforms offered. Be aware that there are no foreign keys, so the program itself must keep these collections in sync.
Second option: Use distinct: http://docs.mongodb.org/manual/reference/command/distinct/. This returns the distinct values of a field across the documents matching a query. In this specific case:
db.runCommand({distinct: "game", key: "platform", query: {} })
However, on large collections this can be overly expensive just to get a fairly static set of data. So if the set of keys is relatively fixed, it is likely better to use the first option. If the keys can change between queries, then this option is best.
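To make the trade-off between the two options concrete, here is a minimal in-memory sketch in plain JavaScript. The arrays stand in for the collections, the sample documents are invented, and the helper names (distinctPlatforms, insertGame, gamesOnPlatform) are hypothetical; the shell calls named in the comments are the usual counterparts, not something this snippet executes.

```javascript
// In-memory stand-in for the "game" collection (sample data is made up).
const games = [
  { _id: 1, name: "game_One", publisher: "somePublisher", platform: "PC" },
  { _id: 2, name: "game_Two", publisher: "somePublisher", platform: "Xbox 360" },
  { _id: 3, name: "game_Three", publisher: "otherPublisher", platform: "PC" },
];

// Second option: scan every game and collect the distinct platform values.
// In the mongo shell this corresponds to db.game.distinct("platform").
function distinctPlatforms(docs) {
  return [...new Set(docs.map((g) => g.platform))].sort();
}

// First option: a small "platform" collection that the application updates
// whenever a game is inserted, so listing platforms never scans the games.
const platforms = new Set();
function insertGame(docs, game) {
  docs.push(game);
  platforms.add(game.platform); // no foreign keys: the program keeps both in sync
}

// Games for the selected platform; in the shell: db.game.find({ platform: "PC" }).
function gamesOnPlatform(docs, platform) {
  return docs.filter((g) => g.platform === platform);
}
```

The second function shows why the first option needs discipline: every code path that writes a game also has to touch the platform list.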
Best,
Charlie

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: in this specific use case, which option would be best in terms of performance and speed, multiple collections or a single collection plus sharding?
Any help would be appreciated.
As you mentioned, you need to play with your data and use case; that will give you a better picture.
Some decisions are required, as below.
Decide the number of documents you will have in the near future. If you will have 1m documents in a year, then test with at least 3m documents.
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements:
If you have many writes and many indices, then a single monolithic collection will be slower, as multiple indices need to be updated.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections. But do check the performance; it depends entirely on the above decisions. Note this open issue also.
If you decide to go with a monolithic collection, defer the sharding. Implement it once you find that queries are slow.
If you have many writes to the same document, the writes will be executed sequentially. That will slow down your reads as well.
Think about reclaiming disk space when a lot of data is cleared from the collections. Multiple collections do well here.
The point that pushes me to suggest a monolithic collection is that you need to query all the categories at the same time. With separate collections you may need to add more categories over time, and combining all of them into a single response would not be better in terms of performance.
As you don't really have a join use case like in an RDBMS, you can go with a single monolithic collection from a modeling point of view. I doubt you would have a join key.
If any of my points are incorrect, please let me know.
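As a rough illustration of the monolithic option, the following plain JavaScript sketch keeps all products in one array with a category field. The categories, field names and helper functions are invented for the example; it only shows the shape of the two query patterns from the question, not how any particular database executes them.

```javascript
// One "Product" collection: shared fields plus category-specific ones.
// Categories and field names here are invented for illustration.
const products = [
  { _id: 1, category: "laptop", name: "Alpha 13", price: 900, ramGb: 16 },
  { _id: 2, category: "phone", name: "Alpha Mini", price: 400, screenInches: 5.4 },
  { _id: 3, category: "laptop", name: "Beta 15", price: 1200, ramGb: 32 },
];

// Query one category: a plain filter on the category field
// (in a real database this is the field you would index).
function byCategory(docs, category) {
  return docs.filter((p) => p.category === category);
}

// Search across all categories: no union step is needed,
// because every product lives in the same collection.
function searchByName(docs, term) {
  const needle = term.toLowerCase();
  return docs.filter((p) => p.name.toLowerCase().includes(needle));
}
```

With one collection per category, searchByName would instead have to run once per collection and merge the results, which is the cost the answer above is weighing.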
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read it through the correct lens. The message you should take from this article is that MongoDB is not a good fit for every kind of data.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only ok if you have few updates or few documents, but not both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
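The second strategy (references plus a code-side join) can be sketched as follows. This is plain JavaScript over in-memory data, with a Map standing in for the collection of shared data; all names and values are hypothetical.

```javascript
// Shared data is stored once and referenced by id (names are hypothetical).
const categories = new Map([
  ["c1", { _id: "c1", label: "Laptops", warrantyMonths: 24 }],
  ["c2", { _id: "c2", label: "Phones", warrantyMonths: 12 }],
]);
const products = [
  { _id: "p1", name: "Alpha 13", categoryId: "c1" },
  { _id: "p2", name: "Alpha Mini", categoryId: "c2" },
];

// Code-side join: one "call" fetches the products, a second one resolves
// each reference against the shared documents.
function productsWithCategory(productDocs, categoryDocs) {
  return productDocs.map((p) => ({ ...p, category: categoryDocs.get(p.categoryId) }));
}

// Updating the shared field touches a single document;
// every product joined afterwards sees the new value.
categories.get("c1").warrantyMonths = 36;
const joined = productsWithCategory(products, categories);
```

Under the first strategy the same change would mean rewriting every product document that carries a copy of warrantyMonths.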
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

MongoDB and one-to-many relation

I am trying to come up with a rough design for an application we're working on. What I'd like to know is, if there is a way to directly map a one to many relation in mongo.
My schema is like this:
There are a bunch of Devices.
Each device is known uniquely by its name/ID.
Each device can have multiple interfaces.
These interfaces can be added by a user in the front end at any given time.
An interface is known uniquely by its ID, and can be associated with only one Device.
A device can contain on the order of 100 interfaces or more.
I was going through the MongoDB documentation, which discusses embedded documents vs. multiple collections. I don't have a clear understanding of this yet, as I've just started with Mongo and Meteor.
The question is: which would be the better approach, having multiple small collections or having one big embedded collection? I know this question is somewhat subjective; I just need some clarity from folks who have more expertise in this field.
Another question: if I go with the embedded model, is there a way to update only a part of the document (the part specific to one interface), so that when an interface is added, it can be inserted into the same device document?
It depends on the purpose of the application.
Big document
A good example of where you'd want a big embedded document is when you are not (normally) going to modify the data but are going to query it a lot. In my application I use this for storing pre-processed trips with all their information, so when someone wants to consult a trip, everything is located in a single document. However, if your query is based on a value embedded in a trip, inside a list, it can be very slow. If that's the case, I'd recommend creating another collection with a relation between both collections. Updating part of a big document can also be slow if the application fetches the whole document, modifies it, and writes it back; update operators such as $push and $set let you change one part without fetching it first.
Small documents with relations
If you plan on modifying the data a lot, I'd recommend sticking to references to another collection. With small documents, updates to any collection will be quicker. If you want to model a unique relation, you may consider using a unique index in Mongo. This can be done using: db.members.createIndex( { "user_id": 1 }, { unique: true } ).
Therefore:
Big object: Great for reading everything at once, but slow for complex queries on embedded values.
Small related collections: Great for updating, but requires several queries on distinct collections.
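For the embedded model, the partial update asked about in the question can be sketched like this. The code is plain JavaScript over an in-memory array; the collection name devices and the field names are assumptions. The comments name the mongo shell update operators that play the same role.

```javascript
// A device document with its interfaces embedded (field names are assumed).
const devices = [
  { _id: "router-1", interfaces: [{ id: "eth0", status: "up" }] },
];

// Add one interface to one device without rewriting its other fields.
// Shell counterpart:
//   db.devices.updateOne({ _id: deviceId }, { $push: { interfaces: itf } })
function addInterface(docs, deviceId, itf) {
  const device = docs.find((d) => d._id === deviceId);
  if (!device) return false;
  device.interfaces.push(itf);
  return true;
}

// Change a single embedded interface in place.
// Shell counterpart (positional operator):
//   db.devices.updateOne(
//     { _id: deviceId, "interfaces.id": itfId },
//     { $set: { "interfaces.$.status": status } })
function setInterfaceStatus(docs, deviceId, itfId, status) {
  const device = docs.find((d) => d._id === deviceId);
  const itf = device && device.interfaces.find((i) => i.id === itfId);
  if (!itf) return false;
  itf.status = status;
  return true;
}
```

With roughly 100 small interfaces per device, the embedded document should stay far below the document size limit, so the choice mostly comes down to how often interfaces are queried on their own.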

Designing MongoDB collections vs relational approach

I have difficulties in designing my MongoDB collections to fit my requirements. I only used SQL databases in my previous projects and am quite new to the NoSQL concept of MongoDB. My current learning project to get that concept is to store and retrieve statistics of games played (enhanced leaderboard example). In a relational database I would create the following tables:
Matches
:_id
:game_id (reference to the type of game played)
:startedAt
:endedAt
Results
:match_id
:player_id (reference to the users collection)
:field_id
:value
A match can have n players and each can have n results. Depending on the type of game, multiple result values indicated by field_id need to be entered for every player (e.g. number of points and whether the user won or not -> two fields = two rows in the results table).
As I understood that in MongoDB the concept is to store related information in one collection I tried to ignore what I did with relational DBs in the past year and created the following collection structure:
Matches
:_id
:game_id
:startedAt
:endedAt
:players [{
:player_id
:results [{
:field_id
:value
}]
}]
However, I now have difficulty calculating the overall results for a specific player. Queries such as "calculate the total sum of points player A had in game B" are quite complex, and I fear that the performance will be very bad. Therefore I would still prefer the relational model for this case. But as I wanted to learn the concepts of a NoSQL database, I still wonder whether I have just misunderstood the DB and there is a good way to structure the data in a single collection for these queries.
Any recommendations are highly appreciated.
I am new to MongoDB and only recently started learning it, but here is what I know so far:
NoSQL databases, such as MongoDB, are mainly used for their scalability and flexibility. For simple and small projects, I don't see a benefit.
The case you described is a classic case where SQL should be used.
If I were asked to create that database and HAD to use MongoDB, I would do it this way:
1) Keep the collection you created for the matches and
2) add a new collection based on the players.
The 2nd collection would be used for the leaderboards and for everything player-based. This means that there would be duplicate data but there is no other way to deal with the player searches, the ones you will need to do for the leaderboards.
Maybe it could work if only the latest matches were saved but still I see no benefit.
As I mentioned before I am in the learning process too, so I am not 100% sure.
Good luck with your project.
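For reference, the query from the question ("total sum of points player A had in game B") can be sketched against the embedded structure as below. This is plain JavaScript over invented sample data, and totalFor is a hypothetical helper; each step mirrors the aggregation stage named in the comment, so it shows the shape of the pipeline, not its performance.

```javascript
// Matches in the embedded shape from the question (sample values invented).
const matches = [
  { _id: 1, game_id: "B", players: [
    { player_id: "A", results: [{ field_id: "points", value: 10 }, { field_id: "won", value: 1 }] },
    { player_id: "C", results: [{ field_id: "points", value: 7 }] },
  ] },
  { _id: 2, game_id: "B", players: [
    { player_id: "A", results: [{ field_id: "points", value: 5 }] },
  ] },
  { _id: 3, game_id: "Z", players: [
    { player_id: "A", results: [{ field_id: "points", value: 99 }] },
  ] },
];

// Sum one result field for one player in one game.
function totalFor(docs, gameId, playerId, fieldId) {
  return docs
    .filter((m) => m.game_id === gameId)        // $match on game_id
    .flatMap((m) => m.players)                  // $unwind players
    .filter((p) => p.player_id === playerId)    // $match on player_id
    .flatMap((p) => p.results)                  // $unwind results
    .filter((r) => r.field_id === fieldId)      // $match on field_id
    .reduce((sum, r) => sum + r.value, 0);      // $group with $sum
}

const pointsOfAInB = totalFor(matches, "B", "A", "points"); // 10 + 5
```

A separate player-based collection, as suggested above, would store such totals directly and trade this computation for duplicated data.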

Best NoSQL for filtering on multiple indexes/fields

Because of the size of the data that needs to be queried and ability to scale as needed on multiple nodes, I am considering using some type of NoSQL db.
I have been researching numerous NoSQL offerings but can't yet decide on what would be the best option which would provide best performance, scalability and features for our data structure.
The data model is a product catalog where each document/set contains certain properties and descriptions for that individual product. Properties vary from product to product, which is why a schema-less offering would work best.
Sample structure would be like
[
{"name": "item name",
"cost": 563.34,
"category": "computer",
"manufacturer": "sony",
.
.
.
}
]
So requirement is that I need to be able to filter/query on many different data set fields/indexes in the record set, where I could filter on and exclude multiple indexes/fields in the same query. Queries will be mostly reads and there would not be much of a need for any joins or relationship type of linking.
I have looked into: Elastic Search, mongodb, OrientDB, Couchbase and Aerospike.
Elastic Search seems like an obvious choice, but I was wondering about its performance and stability.
Aerospike seems like it would be really fast since it does mostly everything in memory, but its filtering and searching capabilities didn't seem that strong.
What do you think the best option would be for my use case? Or are there any other recommended DBs that I should look into?
I know that best way is to test the performance with the actual real life use case, but I am hoping to first narrow it down little bit.
Thank you
This is a variant on the popular question "what is the best product" :)
As always: this depends on your specific use case and goals. Database products (like all products) are always the result of trade-offs. So there does NOT exist a single product offering best performance, scalability and features. However there are many very good products for your use case.
Because your question is about Product Data, and I have been working with Product Data for more than 15 years, I will try to answer your question.
A document model is a perfect fit for Product Data. So for all use cases other than simple lookup I would recommend a Document Store
If your use case concerns a single application and you are using the Java platform, I would recommend using an embedded database. This makes things simpler and has a big performance advantage
If you need faceted search or other advanced product search, I recommend using SOLR or Elastic Search
If you need a distributed system, I recommend Elastic Search over SOLR
If you need product recommendations based on reviews or other graph-oriented algorithms, I recommend using OrientDB or ArangoDB (or Neo4J, but in this case this would be my second choice)
Products we are using in Production or evaluated in depth for the use case you describe are
SOLR and ES. Both extremely well engineered products. Both (also ES) mature and stable products
Neo4J. Most mature graph database. Big advantage IMO is the awesome query language they use. Integrated Lucene engine. Very mature and well engineered product. Disadvantage is the fact that it is not a Document Graph but Property (key-value) Graph. Also it can be expensive
MongoDB. Our first experience with Document store. Very good product. Big advantage: excellent documentation, (by far) most popular NoSQL database
OrientDB and ArangoDB. Both support the Graph/Document paradigm. These are lesser-known products, but very powerful. Because we are a Java-based shop, our preference goes to OrientDB. OrientDB has a Lucene engine integrated (although the implementation is quite simple). ArangoDB, on the other hand, has very good documentation and a very smart and efficient storage format, and finally AQL is also very nice!
Performance: (tested with 11.43 million articles and 2.3 million products). All products are very fast, especially SOLR and ES in this use case. Embedded OrientDB is also mind-blowingly fast for import and simple queries. For faceted search, only the search servers provide really fast performance!
Bottom line: I would go for a Graph/Document store and/or a Search Server (SOLR or ES). Because you mentioned "filtering" (I assume faceted search), the Search Server is the obvious first choice
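To pin down what "filter on and exclude multiple fields" plus faceted search means in practice, here is a small plain JavaScript sketch over an in-memory catalog. The sample products and the helpers filterProducts and facetCounts are invented for illustration; a search server computes the same kind of result from its indexes instead of scanning every document.

```javascript
// Sample catalog; properties vary per product, as in the question.
const catalog = [
  { name: "tv one", cost: 563.34, category: "tv", manufacturer: "sony" },
  { name: "laptop one", cost: 899.0, category: "computer", manufacturer: "sony" },
  { name: "laptop two", cost: 1200.0, category: "computer", manufacturer: "acme" },
];

// Keep documents that match every "include" pair and no "exclude" pair.
function filterProducts(docs, include = {}, exclude = {}) {
  return docs.filter((d) =>
    Object.entries(include).every(([k, v]) => d[k] === v) &&
    !Object.entries(exclude).some(([k, v]) => d[k] === v));
}

// Facet counts: how many of the given documents carry each value of a field.
function facetCounts(docs, field) {
  const counts = {};
  for (const d of docs) {
    if (field in d) counts[d[field]] = (counts[d[field]] || 0) + 1;
  }
  return counts;
}

// Computers that are not made by sony, and the manufacturer facet for all products.
const nonSonyComputers = filterProducts(catalog, { category: "computer" }, { manufacturer: "sony" });
const manufacturerFacet = facetCounts(catalog, "manufacturer");
```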
OrientDB supports composite indexes on multiple fields. Example:
CREATE INDEX Product_idx ON Product (name, category, manufacturer) unique
SELECT FROM INDEX:Product_idx WHERE key = ["Donald Knuth", "computer"]
You could also create a FULL-TEXT index by using all the power of Lucene as engine.
Aerospike is a key-value store, not a document database. A document database would do such field-level indexing and deeper searching into a nested object better. The secondary indexes in Aerospike currently (version 3.4.x) work on string and integer 'bins' (a concept similar to a document's field or a SQL table's column).
That said, the list and map complex types of Aerospike are being augmented with those capabilities, in work being done in this quarter. Keep an eye out for those changes in the upcoming releases. You'll be able to index and query on bins of type list and map.

How to decide whether to store deep documents or thin related documents in a NoSQL database

New To NoSQL
In my 8 years of web development I've always used a relational database. Recently I started using MongoDB for a simple, multi-user web app where users can create their own photo galleries.
My Domain
My Domain is quite simple, there are "users" > "sites" > "photo sets" > "photos".
I've been struggling on how to decide how to store these documents. In the application sometimes I only need a small collection of "photos", and sometimes only the "sets", but always I need some information about the "user", and possibly the "site".
Thin Versus Deep
Currently I'm storing multiple thin documents, using my own implementation of foreign keys. The problem of course is that I sometimes have to make multiple calls to Mongo to render a single page.
Questions
Of course I'm sure there are ways to get around these inefficiencies, caches etc, but how do NoSQLers approach these problems:
Is it normal to relate your documents like this?
Is it better to just store potentially massive deep documents?
Am I getting it wrong, and actually I should be storing multiple documents specifically for different views?
If you're storing multiple documents for different views, how do you manage updates?
Is the answer to use the "embed" features of Mongo? Is that how most solve this issue?
Things to think about when using a NoSQL database, especially MongoDB:
How do you manipulate the data?
Dynamic Queries
Secondary Indexes
Atomic Updates
Map Reduce
What about your Access Patterns (per Collection)?
Read / Write Ratio
Types of updates
Types of queries
Data life-cycle
Basic Knowledge:
Document writes are atomic
Maximum document size is 16 MB (with GridFS you can store larger files too)
Watch out for:
Careless Indexing
Large, deeply nested documents
Here's an older talk about Schema Design: Schema Design Basics