Storing word frequency data - mongodb

I am trying to store word frequency data using Mongo. Each word needs to be associated to a user so I can calculate how often an individual uses each word. Currently my words collection looks like this:
{'Hello':3, 'user_id':1}
Which obviously only works on a 'One To One' basis and is no good.
I am trying to work out how best to make this a 'One To Many' relationship between the user and the words. Would I store the user relationship in my words collection like so:
{'word':"Hello", 'users':[{'id':1, 'count':4},{'id':2, 'count':10}]}
Or would I attach the word counts to the user collection instead?
{'id':1, 'username':'SomeUser', 'words':[{'Hello':4}]}
The obvious disadvantage to the second approach is that the same words will be used across different users, so having a single words collection would help keep the data size down.
Can anyone advise me as to what I should do here? Is there a method I have perhaps overlooked in the documentation?

The obvious disadvantage to the second approach is that the same words
will be used across different users, so having a single words
collection would help to keeping the data size down.
Nope, that's the nature of using a document DB. Data size is really not the main concern in NoSQL solutions; the important thing is how easily and how quickly you can access your data.
Your first approach is a typical textbook relational model. There is no advantage to using it in Mongo (though you can model data in a relational way in Mongo). The second approach instead gives you:
Faster reads/writes, since every word is stored inside the user. You don't need to perform multiple queries for this.
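A minimal sketch of the second (embedded) layout, using plain Python dicts in place of a real collection. The `users` dict, the `record_words` helper, and the shape of `words` as a single map (`{'hello': 3}` rather than a list of one-key objects) are assumptions made for illustration. With a driver, the same increment would typically be one update using MongoDB's `$inc` operator on a `words.<word>` field.

```python
from collections import Counter

# Stand-in for the users collection, keyed by user id (hypothetical data).
users = {
    1: {"id": 1, "username": "SomeUser", "words": {}},
}

def record_words(user_doc, text):
    """Add the word counts from `text` to the counts embedded in the user document."""
    counts = Counter(text.lower().split())
    for word, n in counts.items():
        user_doc["words"][word] = user_doc["words"].get(word, 0) + n
    return user_doc

record_words(users[1], "Hello world hello")
record_words(users[1], "hello again")
print(users[1]["words"])
```

Everything needed to answer "how often does this user say each word" sits in one document, so a single read is enough.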

Related

NoSQL how to lookup id in a collection

NoSQL noob here. I'm building an app using Firestore NoSQL. I'm looping through items where every item has an owner id (creator user id).
I want to display the owner's name on the listing page. In traditional SQL, I have a foreign key so I can just make a reference to, say, Item.Owner.FirstName
What's the best practice in NoSQL? Should I be saving the owner name as a field at the time of saving the item? Or do a lookup of each owner id to get the user object whilst I'm looping through items?
The second option sounds expensive so I'm assuming the first way is the way to go. Unless there's a better, more accepted way?
Both will work. You either reference the data in the other document in whatever way you see fit, or you duplicate information into the document that you intend to query to build the display. You just have to decide which problem you want to deal with:
If you duplicate data among documents (known as "denormalization"), then you'll have to put effort into keeping them all up to date with each other, if that's what you require. So, writing one document might actually turn into writing multiple documents.
If you normalize your data with no duplication, then each of your queries will require more queries to get the related data from other documents. This could result in a drop in performance and an increase in cost for apps with heavy read loads.
Since we don't know the performance requirements and usage behavior of your app, there is no way to give specific advice. You will have to think carefully about which problem you want to have, perhaps based on complexity, performance, and overall cost.
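To make the read-side trade-off concrete, here is a small sketch that builds the listing both ways over in-memory stand-ins for the two collections. The data, the field names (`owner_id`, `owner_name`, `first_name`), and both helper functions are hypothetical; this is not Firestore API code, only a count of the extra lookups each layout would need.

```python
# In-memory stand-ins for two collections (hypothetical data and field names).
users = {"u1": {"first_name": "Ann"}, "u2": {"first_name": "Bob"}}
items = [
    {"title": "Bike", "owner_id": "u1", "owner_name": "Ann"},
    {"title": "Lamp", "owner_id": "u1", "owner_name": "Ann"},
    {"title": "Desk", "owner_id": "u2", "owner_name": "Bob"},
]

def list_denormalized(items):
    """Build the listing from the items alone: no extra reads."""
    return [(item["title"], item["owner_name"]) for item in items]

def list_normalized(items, users):
    """Build the listing by looking up each distinct owner once."""
    owner_ids = {item["owner_id"] for item in items}
    owners = {oid: users[oid] for oid in owner_ids}  # one extra read per distinct owner
    listing = [(item["title"], owners[item["owner_id"]]["first_name"]) for item in items]
    return listing, len(owner_ids)

listing, extra_reads = list_normalized(items, users)
print(list_denormalized(items) == listing, extra_reads)
```

Looking up each distinct owner once, rather than once per item, keeps the normalized version cheaper than a naive per-row lookup, though it still costs more reads than the duplicated field.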

When should I create a new collections in MongoDB?

So just a quick best practice question here. How do I know when I should create new collections in MongoDB?
I have an app that queries TV show data. Should each show have its own collection, or should they all be stored within one collection with relevant data in the same document? Please explain why you chose the approach you did. (I'm still very new to MongoDB. I'm used to MySQL.)
The Two Most Popular Approaches to Schema Design in MongoDB
Embed data into documents and store them in a single collection.
Normalize data across multiple collections.
Embedding Data
There are several reasons why MongoDB doesn't support joins across collections, and I won't get into all of them here. But the main reason why we don't need joins is because we can embed relevant data into a single hierarchical JSON document. We can think of it as pre-joining the data before we store it. In the relational database world, this amounts to denormalizing our data. In MongoDB, this is about the most routine thing we can do.
Normalizing Data
Even though MongoDB doesn't support joins, we can still store related data across multiple collections and still get to it all, albeit in a roundabout way. This requires us to store a reference to a key from one collection inside another collection. It sounds similar to relational databases, but MongoDB doesn't enforce any key constraints for us like most relational databases do. Enforcing key constraints is left entirely up to us. We're good enough to manage it though, right?
Accessing all related data in this way means we're required to make at least one query for every collection the data is stored across. It's up to each of us to decide if we can live with that.
When to Embed Data
Embed data when that embedded data will be accessed at the same time as the rest of the document. Pre-joining data that is frequently used together reduces the amount of code we have to write to query across multiple collections. It also reduces the number of round trips to the server.
Embed data when that embedded data only pertains to that single document. Like most rules, we need to give this some thought before blindly following it. If we're storing an address for a user, we don't need to create a separate collection to store addresses just because the user might have a roommate with the same address. Remember, we're not normalizing here, so duplicating data to some degree is ok.
Embed data when you need "transaction-like" writes. Prior to v4.0, MongoDB did not support transactions, though it does guarantee that a single document write is atomic. It'll write the document or it won't. Writes across multiple collections could not be made atomic, and update anomalies could occur in any number of scenarios we can imagine. This is no longer the case since v4.0; however, it is still more typical to denormalize data to avoid the need for transactions.
When to Normalize Data
Normalize data when data that applies to many documents changes frequently. So here we're talking about "one to many" relationships. If we have a large number of documents that have a city field with the value "New York" and all of a sudden the city of New York decides to change its name to "New-New York", well then we have to update a lot of documents. Got anomalies? In cases like this where we suspect other cities will follow suit and change their name, then we'd be better off creating a cities collection containing a single document for each city.
Normalize data when data grows frequently. When documents grow, they have to be moved on disk. If we're embedding data that frequently grows beyond its allotted space, that document will have to be moved often. Since these documents are bigger each time they're moved, the process only grows more complex and won't get any better over time. By normalizing those embedded parts that grow frequently, we eliminate the need for the entire document to be moved.
Normalize data when the document is expected to grow larger than 16MB. Documents have a 16MB limit in MongoDB. That's just the way things are. We should start breaking them up into multiple collections if we ever approach that limit.
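The "New-New York" case above can be sketched by counting how many documents a rename has to touch under each layout. The collections below are plain Python lists and dicts, and the names (`people_embedded`, `cities`, `city_id`, both rename helpers) are hypothetical.

```python
# Embedded (denormalized): every person document carries its own copy of the city name.
people_embedded = [
    {"name": "Ann", "city": "New York"},
    {"name": "Bob", "city": "New York"},
    {"name": "Cy", "city": "Boston"},
]

# Normalized: people reference a document in a separate cities collection.
cities = {"nyc": {"name": "New York"}, "bos": {"name": "Boston"}}
people_normalized = [
    {"name": "Ann", "city_id": "nyc"},
    {"name": "Bob", "city_id": "nyc"},
    {"name": "Cy", "city_id": "bos"},
]

def rename_embedded(people, old, new):
    """Rewrite every document that holds the old name; returns documents touched."""
    touched = 0
    for person in people:
        if person["city"] == old:
            person["city"] = new
            touched += 1
    return touched

def rename_normalized(cities, city_id, new):
    """Rewrite the single city document; returns documents touched."""
    cities[city_id]["name"] = new
    return 1

embedded_writes = rename_embedded(people_embedded, "New York", "New-New York")
normalized_writes = rename_normalized(cities, "nyc", "New-New York")
print(embedded_writes, normalized_writes)
```

The embedded write count grows with the number of residents, while the normalized one stays at a single document, at the price of an extra lookup whenever the city name is displayed.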
The Most Important Consideration to Schema Design in MongoDB is...
How our applications access and use data. This requires us to think? Ugh! What data is used together? What data is used mostly as read-only? What data is written to frequently? Let your application's data access patterns drive your schema, not the other way around.
The scope you've described is definitely not too much for "one collection". In fact, being able to store everything in a single place is the whole point of a MongoDB collection.
For the most part, you don't want to be thinking about querying across combined tables as you would in SQL. Unlike in SQL, MongoDB lets you avoid thinking in terms of "JOINs"--in fact MongoDB doesn't even support them natively.
See this slideshare:
http://www.slideshare.net/mongodb/migrating-from-rdbms-to-mongodb?related=1
Specifically look at slides 24 onward. Note how a MongoDB schema is meant to replace the multi-table schemas customary to SQL and RDBMS.
In MongoDB a single document holds all information regarding a record. All records are stored in a single collection.
Also see this question:
MongoDB query multiple collections at once

MongoDB - One Collection Using Indexes

Ok so the more and more I develop in Mongodb i start to wonder about the need for multiple collections vs having one large collection with indexes (since columns and fields can be different for each document unlike tabular data). If i am trying to develop in the most efficient way possible (meaning less code and reusable code) then can I use one collection for all documents and just index on a field. By having all documents in one collection with indexes then i can reuse all my form processing code and other code since it will all be inserting into the same collection.
For Example:
Let's say I am developing a contact manager and I have two types of contacts, "individuals" and "businesses". My original thought was to create a collection called individuals and a second collection called businesses. But that was because I'm used to developing in SQL, where yes, this would be appropriate since columns would be different for each table. The more I started to think about the flexibility of document DBs the more I started to think, "do I really need two collections for this?" If I just add a field to each document called "contact type" and index on that, do I really need two collections? Since the fields/columns in each document do not have to be the same for all (like in SQL), each document can have its own fields as long as I have a "document type" field and an index on that field.
So then I took that concept and started to think, if I only need one collection for "individuals" and "businesses" then do I even need a separate collection for "Users" or "Contact History" or any other data? In theory couldn't I build the entire solution in one collection and just have a field in each document that specified the "type" and index on it, such as "Users", "Individual Contact", "Business Contacts", "Contact History", etc., and if it is a document related to another document I can index on the "parent key/foreign" Id field...
This would allow me to code the front end dynamically since the form processing code would all be the same (inserting into the same collection). This would save a lot of coding but i want to make sure by using indexes and secondary indexes that the db would still run fast and not cause future problems as the collection grew. As you can imagine, if everything was in one collection there might be hundreds of thousands even millions of documents in this collection as the user base grows but it would have indexes and secondary indexes to optimize performance.
My question is: Is this a common method mongodb developers use? Why or why not? What are the downfalls, if any? If this is a commonly used method, please also give any positives to using this method. thank you.
This is a really big point in Mongo and the answer is a little bit more of an art than science. Having one collection full of gigantic documents is definitely an anti-pattern because it works against many of Mongo's features.
For instance, when retrieving documents, you can only retrieve a whole document out of a collection (not entirely true, but mostly). So if you have huge documents, you're retrieving huge documents each time. Also, having huge documents makes sharding less flexible since only the top level documents are indexed (and hence, sharded) in each collection. You can index values deep into a document, but the index value is associated with the top level document.
At the same time, going purely relational is also an anti-pattern because you've lost a lot of the referential integrity by going to Mongo in the first place. Also, all joins are done in application memory, so each one requires a full round-trip (slow).
So the answer is to do something in between. I'm thinking you'll probably want a collection for individuals and a different collection for businesses in this case. I say this because it seems like businesses have enough meta-data associated that it could bulk up a lot. (Also, the individual-business relationship seems like a many-to-many.) However, an individual might have a Name object (with first and last properties). It would be a bad idea to make Name into a separate collection.
Some info from 10gen about schema design: http://www.mongodb.org/display/DOCS/Schema+Design
EDIT
Also, Mongo has limited support for transactions, in the form of atomic aggregates. When you insert an object into Mongo, the entire object is either inserted or not inserted. So if your application domain requires consistency between certain objects, you probably want to keep them in the same document/collection.
For example, consider an application that requires that a User always has a Name object (containing FirstName, LastName, and MiddleInitial). If a User was somehow inserted with no corresponding Name, the data would be considered to be corrupted. In an RDBMS you would wrap a transaction around the operations to insert User and Name. In Mongo, we make sure Name is in the same document (aggregate) as User to achieve the same effect.
Your example is a little less clear, since I don't understand the business cases. One thing that does come to mind is that Mongo has excellent support for inheritance. It might make sense to put all users, individuals, and potentially businesses into the same collection (depending on how the application is modeled). If one individual has many contacts, you probably want individuals to have an array of IDs. If your application requires that you get a quick preview of contacts, you might consider duplicating part of an individual and storing an array of contact objects.
If you're used to RDBMS thinking, you probably think all your data always has to be consistent. The truth is, that's probably not entirely true. This concept of applying atomic aggregates to the domain has been preached heavily by the DDD community recently. When you look at your domain in depth, like your business users do, the consistency boundaries should become distinct.
MongoDB, and NoSQL in general, is about de-normalising data and about reducing joins. It goes against normal SQL thinking.
In your case, I don't see any reason why you would want to have separate collections because it introduces unnecessary complexity and performance overhead. Consider, for example, if you wanted to have a screen that displayed all contacts, in alphabetical order. If you have one single collection for contacts, then it's really easy, but if you have two collections it becomes a more complicated proposition.
Where I would have multiple collections is if your application had multiple users storing contacts. I would then have one collection for each user. This makes it easy to extract that user's contacts.
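The all-contacts screen mentioned above can be sketched with in-memory lists to show why one collection with a type field is simpler than two. The `type` and `name` fields, the sample data, and both helpers are hypothetical; in a real database the single-collection case would be one sorted query on an indexed field, while the split case leaves the merge to the application.

```python
import heapq
from operator import itemgetter

# One collection with a discriminator field (hypothetical field names).
contacts = [
    {"type": "individual", "name": "Dana"},
    {"type": "business", "name": "Acme"},
    {"type": "individual", "name": "Bo"},
    {"type": "business", "name": "Corex"},
]

def all_sorted_single(contacts):
    """One sort over one collection."""
    return [c["name"] for c in sorted(contacts, key=itemgetter("name"))]

def all_sorted_split(individuals, businesses):
    """Two sorted result sets that the application has to merge itself."""
    by_name = itemgetter("name")
    merged = heapq.merge(sorted(individuals, key=by_name),
                         sorted(businesses, key=by_name),
                         key=by_name)
    return [c["name"] for c in merged]

# The same data as it would look if split into two collections.
individuals = [c for c in contacts if c["type"] == "individual"]
businesses = [c for c in contacts if c["type"] == "business"]

print(all_sorted_single(contacts))
print(all_sorted_split(individuals, businesses))
```

Both calls return the same ordering; the difference is that the split version needs two queries and merge logic (and paging across the two becomes harder still).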

Basic set-based operations using a document database (noSQL)

As with most, I come from an RDBMS world and am trying to get my head around NoSQL databases, specifically document stores (as I find them the most interesting).
I am trying to understand how to perform some set-based operations using a document database (I'm playing with RavenDB).
So as per my understanding:
Union (as in SQL UNION) is a very straightforward append. Additionally, unions between different sets (SQL JOIN) can be achieved with map/reduce. The example given in the RavenDB mythology book with Comment counts on Blog entries is a good start.
Intersection can be performed using a number of techniques, from de-normalization right through to creating a "mapping" or "link" document as described here (and the aggregator example below). In an RDBMS this would be performed using a simple "INNER JOIN" or "WHERE x IN".
Subtract (Relative Complement) is where I am getting stuck. In an RDBMS this operation is simply a "WHERE x NOT IN" or a "LEFT JOIN" where the joined set is NULL.
Using a real world example let’s say we have an RSS aggregator (such as Google Reader) which has millions if not billions of RSS entries with thousands of users, each tagging favourite, etc.
In this example we focus on entry, user and tag; where tag acts as a link between user and entry.
user {string id, string name /*etc.*/}
entry {string id, string title, string url /*etc.*/}
tag {string userId, string entryId, string[] tags} /* (favourite, read, etc.)*/
With the above approach it is easy to perform the intersection between entry and user using tag. But I cannot get my head around how one would perform a subtract. For instance “Return all items that do not have any tags” or even more daunting “return the latest 1000 items without any tag”.
So my question:
Can you point me to some reading material on the matter?
Can you share some ideas on how one can accomplish the task efficiently?
Note: I know that you lose query flexibility with document databases, but surely there must be a way to do this?
Amok,
What you want cannot really be done easily in non relational databases.
Mostly because they don't think in sets and have strong ties to distributed computing.
You can't really do efficient set operations without having access to all the data, and that pretty much means any set-based operation is going to need access to the whole data set.
Since NoSQL dbs are usually used in distributed scenarios, they can't really support that.
RavenDB, specifically, allows some operations on a specified set, but it is built strongly on the assumption of independent documents, that don't have strong relations to other documents, or documents that need to be manipulated all together in the same fashion.
Transition from RDBMS to a document database isn't completely smooth, and some refactoring to your Model may be necessary to make it optimal. This is due to the different natures of those technologies.
Re. set-based operations in RavenDB, see:
http://ayende.com/blog/4535/set-based-operations-with-ravendb
http://ravendb.net/documentation/set-based
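For reference, the subtraction asked about in the question ("the latest items without any tag") looks like this when done on the application side over in-memory data. This is only a sketch of the set logic, not RavenDB API code; the `published` field, the sample documents, and the `untagged_latest` helper are assumptions. Note that it needs the full set of tagged entry ids up front, which is exactly the cost described above.

```python
# Hypothetical in-memory data shaped like the entry and tag documents in the question.
entries = [
    {"id": "e1", "title": "A", "published": 1},
    {"id": "e2", "title": "B", "published": 2},
    {"id": "e3", "title": "C", "published": 3},
    {"id": "e4", "title": "D", "published": 4},
]
tags = [
    {"userId": "u1", "entryId": "e2", "tags": ["favourite"]},
    {"userId": "u2", "entryId": "e4", "tags": []},
]

def untagged_latest(entries, tags, limit):
    """Return up to `limit` entries with no non-empty tag document, newest first."""
    tagged_ids = {t["entryId"] for t in tags if t["tags"]}
    untagged = [e for e in entries if e["id"] not in tagged_ids]
    untagged.sort(key=lambda e: e["published"], reverse=True)
    return untagged[:limit]

latest_ids = [e["id"] for e in untagged_latest(entries, tags, 2)]
print(latest_ids)
```

One way to avoid the full scan, in line with the refactoring advice above, would be to keep a tag count on each entry document so that "untagged" becomes an ordinary filter on a single field.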

MongoDB schema design - reference vs embedding

I am writing a simulation which requires a backing database to store the results. The simulation writes a huge amount of data. For obvious performance reasons, I chose to try out a NoSQL database, specifically MongoDB. However, I'm a bit puzzled over my data model.
In relational world, the schema would translate to this:
Simulation holds simulation configuration, status, etc.
Scenario describes a specific simulation case.
Realization groups TestResults.
The simulation works as follows. First we create a configuration (maps to the Simulation table) and specify scenarios and how many Realizations to calculate. Then we start the simulation. The simulation creates realizations in a scenario (in parallel, so many realizations are calculated at the same time and inserted into the scenario the simulation is currently working on).
However, in NoSQL, specifically MongoDB, relations are bad and slow, so I should make use of embedded documents as much as possible. So I came up with this:
This model should give me the best performance when first calculating all realizations and THEN saving it to the database as a single insert (of Scenario).
However, for performance reasons, I want to insert a Realization into Scenario as soon as it is computed, which would require updating the Scenario every time a realization is completed. Is this a bad idea? It says in the MongoDB reference that when adding an embedded document into a parent document, the parent document is updated but there is a performance loss anyway.
Would it be faster not to embed Realization into Scenario but reference it ? How much performance would be lost when reading and aggregating the data later ? Any other pitfalls I should know ?
Thanks.
It depends how you will use the data - embedding can involve updating multiple documents, so is slow to write but reading is always one document only so will be fast. Referencing is the opposite - writing to a single document (fast) but reading multiple documents (slow).
Aside from potential limitations like reaching a maximum size for embedded documents, it just comes down to which type of performance is more important for your scenario.
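The two layouts from the question can be compared with a small in-memory sketch. The field names (`realizations`, `scenario_id`, `results`) and all four helpers are hypothetical, and plain Python lists stand in for collections. With a driver, the embedded append would typically be an update using the `$push` operator, and the referenced variant a plain insert.

```python
# Embedded: realizations live inside the scenario document (hypothetical field names).
scenario_embedded = {"_id": "s1", "name": "base case", "realizations": []}

# Referenced: realizations are separate documents pointing back at the scenario.
scenario_referenced = {"_id": "s1", "name": "base case"}
realizations = []

def add_embedded(scenario, results):
    """Append to the parent: each finished realization updates a growing scenario document."""
    scenario["realizations"].append({"results": results})

def add_referenced(collection, scenario_id, results):
    """Insert a small independent document: the scenario itself is never touched."""
    collection.append({"scenario_id": scenario_id, "results": results})

def mean_embedded(scenario):
    """Aggregate from one document."""
    values = [v for r in scenario["realizations"] for v in r["results"]]
    return sum(values) / len(values)

def mean_referenced(collection, scenario_id):
    """Aggregate by filtering realizations on scenario_id (a field you would index)."""
    values = [v for r in collection if r["scenario_id"] == scenario_id for v in r["results"]]
    return sum(values) / len(values)

for results in ([1.0, 2.0], [3.0, 4.0]):
    add_embedded(scenario_embedded, results)
    add_referenced(realizations, "s1", results)

print(mean_embedded(scenario_embedded), mean_referenced(realizations, "s1"))
```

Both layouts give the same aggregate. The referenced one keeps parallel writers from contending on a single parent document, at the cost of a filtered read over many documents later.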
Another thing you should consider is whether you are going to update your records.
For example, if you have an embedded list of users (let's say friends) and you change the first name of one of the users in the users collection, you must iterate over every friends list that embeds that user and manually update the first name there as well.