Storing multidimensional data in MongoDB - mongodb

I am attempting to convert a large MySQL database of product information (dimensions, specifications, etc) to MongoDB for flexibility and to move away from the restriction of a column based table. I like the freedom of being able to add a new key-value, without making an update to table structure.
An example of my data is here: http://textuploader.com/9nwo. I envisioned my collection being "Products", with a nested collection for each product type (ex. Hand Chain Hoists), and a nested collected manufacturer (ex. Coffing & Harrington). Basically one large multidimensional array. I am learning that nested collections are not allowed in MongoDB, so I'm at a dead end.
How would you store this kind of dataset? Is NoSQL the right choice for this?

If the nested structure is not necessary, you can "flatten" your data by having each document in the collection be a specific product and each product can have manufacturer product type as fields. You can index these fields so you can still query them quickly without a nested structure.
If you need to preserve the hierarchy, mongodb actually has a tutorial on designing a product hierarchy model here: http://docs.mongodb.org/ecosystem/use-cases/category-hierarchy/. Essentially, each product has an ancestors array. It is a bit trickier keeping the hierarchy up to date on inserts and updates

Related

Dynamic Data inside MongoDB

I'm making dynamic data storage, which it was in relational data base as:
form -> field
form -> form_data
field -> field_data
form_data -> filed_data
and field data contains field_id, form_data_id, and value
but for scalability and performance I'm planing to move form_data and field_data to MongoDB and my problem now is how to design MongoDB collections, using one collection for all form_data and move field_data to map inside it and key is field_id and value is the value of this field_data, or make collection for each form record and store data direct in form_data without map as the all data in this case will be consistent.
In document-oriented databases like MongoDB you should always favor aggregation over references, because they don't support JOIN-operations on the database to link multiple documents.
That means that a "A has many B" relationship between two entities should not be modeled as two tables A and B, but rather as one collection of A where each A has an embedded array of B objects.
In the context of MongoDB, there is however an additional limitation: MongoDB doesn't like objects which grow. When an object accumulates more and more data over its lifetime, the object will grow. That means that the hard drive space MongoDB allocated for it will run out again and again, which will require space reallocation. This takes performance and fragments the database. Also, MongoDB has an artificial size limit for documents, mainly to discourage developers from designing growing objects.
Therefor:
When the data exists at creation, embed it directly.
When more and more data is added after creation, put it into a different collection.
When a form has X fields, the number of fields will likely not change much over its lifetime. So you should embed the fields and their descriptions directly into the form object.
But the number of answers entered into these forms will grow over time, which means that these should be treated as separate objects in a separate collection.
So I would recommend you to have two collections, forms and form_data.
Each document in forms embeds a sub-object of fields with the static field properties.
Each document in form_data has a field with the _id of the corresponding form and embeds a sub-object of field_data which uses the same keys as the fields sub-object of forms and stores the entries the user made in that form.
When your use-case requires frequent access to the aggregated data (like when you want to publish the up-to-date statistics on a public website), you could also store this information in the fields of forms to avoid an expensive aggregation query over many form_data documents. MongoDB in general recommends to orient your database schema rather on your performance requirements than on the semantics of your data.
Regarding your remark "all data in this case will be consistent": keep in mind that MongoDB does not enforce referential integrity. When an application deletes or changes a document, it's the responsibility of the application to fix any outdated references to it in other documents.

Mongodb model to store user/item specific data

The case:
There are users in system, and there are static documents (like books) Each user may work with some documents and have specific state/settings (like current position/page in document, bookmarks/notes) for each of his docs.
What is a better way to store that user and document specific information in flat collection with two keys userId and documentId or collection that have _id equal to userId and nested array of subdocuments that have _id equal to documentId (in that scenario collection is also used for storing non-document specific user data)?
1st scenaroio: find({userId: ..., documentId:...})
2nd scenaroio: findBy({_id:...}), then find sub doc with _id equal to documentId
PROS of 1st scenario:
1) I believe quicker find and save operations.
CONS of 1st scenario:
1) greater amount of documents
2) no way to store some non-doc related user-specific data in collection
PROS of 2nd scenario:
1) better representation of data relations (subjective though)
2) makes possible to use the same collection to store some other non particular document related user data.
CONS of 2nd:
1) more difficult search and more difficult save operations (I'm using using Mongoose ODM and code would not be complex), and I think the operations is less speedy then in 1st scenario.
Some things to consider:
1) In general in read operations I would to select only one document specific data
2) I would need OFTEN to save one document specific data (for example periodical saving of position in document that user is working with).
3) User/document state may have some nested arrays (bookmarks, notes) that have to be changed (docs inserted/removed)
Taking this considerations I would say that 1st scenario is more suitable for the task, but I would like to hear some pro opinions, whether two scenarios differ greatly.
What are your actual access paths? Do you start with a user id, and the look for the documents the user reads? Or do you start with a document and search for the users, that read it?
Is the document object lightweight (just title and author and suchlike information) or is it heavyweight (includes the contents)?
If documents are heavyweight, I'd keep them in a separate collection and go for scenario 2.
Basically scenario 1 mimics a relational solution and scenario looks like an object model.
I believe object models describe the reality better and are more efficient.
So I'd go for scenario 2, unless you frequently search the readers for a book.

Mongodb : multiple specific collections or one "store-it-all" collection for performance / indexing

I'm logging different actions users make on our website. Each action can be of different type : a comment, a search query, a page view, a vote etc... Each of these types has its own schema and common infos. For instance :
comment : {"_id":(mongoId), "type":"comment", "date":4/7/2012,
"user":"Franck", "text":"This is a sample comment"}
search : {"_id":(mongoId), "type":"search", "date":4/6/2012,
"user":"Franck", "query":"mongodb"} etc...
Basically, in OOP or RDBMS, I would design an Action class / table and a set of inherited classes / tables (Comment, Search, Vote).
As MongoDb is schema less, I'm inclined to set up a unique collection ("Actions") where I would store these objects instead of multiple collections (collection Actions + collection Comments with a link key to its parent Action etc...).
My question is : what about performance / response time if I try to search by specific columns ?
As I understand indexing best practices, if I want "every users searching for mongodb", I would index columns "type" + "query". But it will not concern the whole set of data, only those of type "search".
Will MongoDb engine scan the whole table or merely focus on data having this specific schema ?
If you create sparse indexes mongo will ignore any rows that don't have the key. Though there is the specific limitation of sparse indexes that they can only index one field.
However, if you are only going to query using common fields there's absolutely no reason not to use a single collection.
I.e. if an index on user+type (or date+user+type) will satisfy all your querying needs - there's no reason to create multiple collections
Tip: use date objects for dates, use object ids not names where appropriate.
Here is some useful information from MongoDB's Best Practices
Store all data for a record in a single document.
MongoDB provides atomic operations at the document level. When data
for a record is stored in a single document the entire record can be
retrieved in a single seek operation, which is very efficient. In some
cases it may not be practical to store all data in a single document,
or it may negatively impact other operations. Make the trade-offs that
are best for your application.
Avoid Large Documents.
The maximum size for documents in MongoDB is 16MB. In practice most
documents are a few kilobytes or less. Consider documents more like
rows in a table than the tables themselves. Rather than maintaining
lists of records in a single document, instead make each record a
document. For large media documents, such as video, consider using
GridFS, a convention implemented by all the drivers that stores the
binary data across many smaller documents.

MongoDB - One Collection Using Indexes

Ok so the more and more I develop in Mongodb i start to wonder about the need for multiple collections vs having one large collection with indexes (since columns and fields can be different for each document unlike tabular data). If i am trying to develop in the most efficient way possible (meaning less code and reusable code) then can I use one collection for all documents and just index on a field. By having all documents in one collection with indexes then i can reuse all my form processing code and other code since it will all be inserting into the same collection.
For Example:
Lets say i am developing a contact manager and I have two types of contacts "individuals" and "businesses". My original thought was to create a collection called individuals and a second collection called businesses. But that was because im used to developing in sql where yes this would be appropriate since columns would be different for each table. The more i started to think about the flexibility of document dbs the more I started to think, "do I really need two collections for this?" If i just add a field to each document called "contact type" and index on that, do i really need two collections? Since the fields/columns in each document do not have to be the same for all (like in sql) then each document can have their own fields as long as i have a "document type" field and an index on that field.
So then i took that concept and started to think, if i only need one collection for "individuals" and "businesses" then do i even need a separate collection for "Users" or "Contact History" or any other data. In theory couldn't i build the entire solution in once collection and just have a field in each document that specifield the "type" and index on it such as "Users", "Individual Contact", "Business Contacts", "Contact History", etc, and if it is a document related to another document i can index on the "parent key/foreign" Id field...
This would allow me to code the front end dynamically since the form processing code would all be the same (inserting into the same collection). This would save a lot of coding but i want to make sure by using indexes and secondary indexes that the db would still run fast and not cause future problems as the collection grew. As you can imagine, if everything was in one collection there might be hundreds of thousands even millions of documents in this collection as the user base grows but it would have indexes and secondary indexes to optimize performance.
My question is: Is this a common method mongodb developers use? Why or why not? What are the downfalls, if any? If this is a commonly used method, please also give any positives to using this method. thank you.
This is a really big point in Mongo and the answer is a little bit more of an art than science. Having one collection full of gigantic documents is definitely an anti-pattern because it works against many of Mongo's features.
For instance, when retrieving documents, you can only retrieve a whole document out of a collection (not entirely true, but mostly). So if you have huge documents, you're retrieving huge documents each time. Also, having huge documents makes sharding less flexible since only the top level documents are indexed (and hence, sharded) in each collection. You can index values deep into a document, but the index value is associated with the top level document.
At the same time, going purely relational is also an anti-pattern because you've lost a lot of the referential integrity by going to Mongo in the first place. Also, all joins are done in application memory, so each one requires a full round-trip (slow).
So the answer is to do something in between. I'm thinking you'll probably want a collection for individuals and a different collection for businesses in this case. I say this because it seem like businesses have enough meta-data associated that it could bulk up a lot. (Also, I individual-business relationship seems like a many-to-many). However, an individual might have a Name object (with first and last properties). That would be a bad idea to make Name into a separate collection.
Some info from 10gen about schema design: http://www.mongodb.org/display/DOCS/Schema+Design
EDIT
Also, Mongo has limited support for transactions - in the form of atomic aggregates. When you insert an object into mongo, the entire object is either inserted or not inserted. So you're application domain requires consistency between certain objects, you probably want to keep them in the same document/collection.
For example, consider an application that requires that a User always has a Name object (containing FirstName, LastName, and MiddleInitial). If a User was somehow inserted with no corresponding Name, the data would be considered to be corrupted. In an RDBMS you would wrap a transaction around the operations to insert User and Name. In Mongo, we make sure Name is in the same document (aggregate) as User to achieve the same effect.
Your example is a little less clear, since I don't understand the business cases. One thing that does come to mind is that Mongo has excellent support for inheritance. It might make sense to put all users, individuals, and potentially businesses into the same collection (depending on how the application is modeled). If one individual has many contacts, you probably want individuals to have an array of IDs. If your application requires that you get a quick preview of contacts, you might consider duplicating part of an individual and storing an array of contact objects.
If you're used to RDBMS thinking, you probably think all your data always has to be consistent. The truth is, that's probably not entirely true. This concept of applying atomic aggregates to the domain has been preached heavily by the DDD community recently. When you look at your domain in depth, like your business users do, the consistency boundaries should become distinct.
MongoDB, and NoSQL in general, is about de-normalising data and about reducing joins. It goes against normal SQL thinking.
In your case, I don't see any reason why you would want to have separate collections because it introduces unnecessary complexity and performance overhead. Consider, for example, if you wanted to have a screen that displayed all contacts, in alphabetical order. If you have one single collection for contacts, then its really easy, but if you have two collections it becomes a more complicated proposition.
Where I would have multiple collections is if your application had multiple users storing contacts. I would then have one collection for each user. This makes it so easy to extract out that users contacts.

Zend: index generation and the pros and cons of Zend_Search_Lucene

I've never came across an app/class like Zend Search Lucene before, as I've always queried my database.
Zend_Search_Lucene operates with
documents as atomic objects for
indexing. A document is divided into
named fields, and fields have content
that can be searched.
A document is represented by the
Zend_Search_Lucene_Document class, and
this objects of this class contain
instances of Zend_Search_Lucene_Field
that represent the fields on the
document.
It is important to note that any
information can be added to the index.
Application-specific information or
metadata can be stored in the document
fields, and later retrieved with the
document during search.
So this is basically saying that I can apply this to anything including databases, the key thing here is making indexes for searching.
What I'm trying to grasp is where exactly should I store the indexes in my application, let's take for example we have phones stored in a database, a manufacturers, models - how should I categorize the indexes?
If I'm making indexes of users with say, addresses I obviously wouldn't want them to be publically viewable, I'm just confused on how it all works out together, if there are known disadvantages, any gotchas I should know while using it.
A Lucene index is stored outside the database. I'd store it in a "data" directory as a sister to your controllers, models, and views. But you can store it anywhere; you just need to specify the path when you open the index for querying.
It's basically a redundant copy of the documents stored in your database, and you have to keep them in sync yourself. That's one of the disadvantages: you have to write code to populate the Lucene index based on results of a query against your database. As you add data to the database, you have to update your Lucene index as well.
An advantage of using an external full-text index solution is that you can reduce the workload on your RDBMS. To find a document, you execute a search using the Lucene API. The result should include a field containing the primary key value (as part of the document but no need to make it analyzed for FT search). You get this field back when you do a Lucene search, so you can look up the respective row in the database.
Does that help answer your question?
I gave a presentation recently for MySQL University comparing full-text search solutions:
http://forge.mysql.com/wiki/Practical_Full-Text_Search_in_MySQL
I also publish my slides at http://www.SlideShare.net/billkarwin.