Analytical Queries with MongoDB - mongodb

I am new to MongoDB and am having difficulty implementing a solution in it.
Consider a case where I have two collections, Client and Sales, designed as follows:
Client
==========
id
full name
mobile
gender
region
emp_status
occupation
religion
Sales
===========
id
client_id //this would be a DBRef
trans_date //date time value
products //an array of the products sold, each in the form {product_code, description, units, unit price, amount}
total sales
Now there is a requirement to develop another collection for analytical queries that can answer the following questions:
What is the distribution of sales by gender, region and emp_status?
What are the most purchased products for clients in a particular region?
I considered building a heavily denormalized collection, flat and wide, that combines the properties of the Sales and Client collections, so that I can use map-reduce on it to answer these questions.
In an RDBMS, an aggregation backed by a join would answer these questions, but I am at a loss as to how to make Map-Reduce or Aggregation help here.
Questions:
How do I implement Map-Reduce to map across 2 collections?
Is it possible to chain MapReduce operations?
Regards.

MongoDB does not do JOINs - period!
MapReduce always runs on a single collection. You cannot have a single MapReduce job which selects from more than one collection. The same applies to aggregation.
When you want to do some data mining (not MongoDB's strongest suit), you could create a denormalized collection of all Sales with the corresponding Client object embedded. You will have to write a little program or script which iterates over all clients and
finds all Sales documents for the client
merges the relevant fields from Client into each document
inserts the resulting document into the new collection
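The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: in-memory lists stand in for the Client and Sales collections, the sample documents are made up, and the helper names `denormalize` and `total_by` are hypothetical. A real script would read from and write to MongoDB through a driver instead.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the two collections.
clients = [
    {"_id": 1, "full_name": "Ann", "gender": "F", "region": "North", "emp_status": "employed"},
    {"_id": 2, "full_name": "Bob", "gender": "M", "region": "South", "emp_status": "student"},
]
sales = [
    {"_id": 10, "client_id": 1, "total_sales": 120.0},
    {"_id": 11, "client_id": 1, "total_sales": 80.0},
    {"_id": 12, "client_id": 2, "total_sales": 50.0},
]

# Client fields copied into every flat sales document (an assumption:
# only the fields needed by the analytical questions).
CLIENT_FIELDS = ("gender", "region", "emp_status")


def denormalize(clients, sales):
    """Return one flat document per sale with the client's fields copied in."""
    sales_by_client = defaultdict(list)
    for sale in sales:
        sales_by_client[sale["client_id"]].append(sale)

    flat = []
    for client in clients:                            # iterate over all clients
        for sale in sales_by_client[client["_id"]]:   # find the client's sales
            doc = dict(sale)                          # copy, leave the source untouched
            for field in CLIENT_FIELDS:               # merge the relevant client fields
                doc[field] = client[field]
            flat.append(doc)                          # "insert" into the new collection
    return flat


def total_by(flat, key):
    """Sum total_sales per distinct value of `key` in the flat collection."""
    totals = defaultdict(float)
    for doc in flat:
        totals[doc[key]] += doc["total_sales"]
    return dict(totals)


sales_flat = denormalize(clients, sales)
sales_by_region = total_by(sales_flat, "region")
```

Once the flat collection exists, each "distribution by X" question becomes a group-and-sum over a single collection, which is the shape that both map-reduce and the aggregation framework expect.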
When your Client document is small and doesn't change often, you might consider always embedding it in each Sales document. This means that you will have redundant data, which looks very evil from the viewpoint of a seasoned RDB veteran. But remember that MongoDB is not a relational database, so you should not apply all RDBMS dogmas without reflection. The "no redundancy" rule of database normalization is only practicable when JOINs are relatively inexpensive and painless, which isn't the case with MongoDB. Besides, sometimes you want redundancy in order to preserve historical data. When you want to know your historical development of sales by region, you want to know the region where the customer resided when they bought the product, not where they reside now. When each Sale only references the current Client document, that information is lost. Sure, you can solve this with separate Address documents which have date ranges, but that would make it even more complicated.
Another option would be to embed an array of Sales in each Client. However, MongoDB doesn't like documents which grow over time, so when your clients tend to return often, this might result in sub-par write performance.

Related

How to avoid MongoDB Aggregation in a specific query?

I'm using MongoDB and need some analytics queries to produce reports on my data. I know that MongoDB is not a good choice for OLAP applications, but if I want to use MongoDB, one solution could be pre-computing the required data. For instance, we can create a new collection for each specific OLAP query and update that collection whenever a related event happens in the system. Consider this scenario:
In my app I'm storing the sales information for some vendors in a sales collection. Each document in sales consists of a sale value, a vendor ID and a date. I want a report that finds the vendors that have sold the most within a specified time period. To avoid aggregating over the raw data, I've created an intermediate collection that stores the total amount of sales for each vendor on each day. When I want to prepare the report, I find the documents in the intermediate collection whose dates fall within the specified time period, group the results by vendor ID, and then sort them. I think this solution takes less aggregation time because the intermediate collection has fewer documents than the original collection. It would also be of O(n) time complexity.
I want to know whether there is any mechanism in MongoDB that makes it possible to avoid this aggregation too and make the query simpler.
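The pre-computed approach described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: plain Python lists stand in for the sales and intermediate collections, and the field names (`vendor_id`, `date`, `value`, `total`) and the helpers `build_daily_totals` and `top_vendors` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw sales documents: one per sale.
sales = [
    {"vendor_id": "v1", "date": date(2024, 1, 1), "value": 100},
    {"vendor_id": "v1", "date": date(2024, 1, 1), "value": 50},
    {"vendor_id": "v1", "date": date(2024, 1, 2), "value": 30},
    {"vendor_id": "v2", "date": date(2024, 1, 1), "value": 200},
    {"vendor_id": "v2", "date": date(2024, 1, 3), "value": 500},
]


def build_daily_totals(sales):
    """Pre-compute one document per (vendor, day) holding that day's total."""
    totals = defaultdict(int)
    for sale in sales:
        totals[(sale["vendor_id"], sale["date"])] += sale["value"]
    return [
        {"vendor_id": vendor, "date": day, "total": total}
        for (vendor, day), total in totals.items()
    ]


def top_vendors(daily_totals, start, end):
    """Filter by date range, group by vendor, sort by total descending."""
    per_vendor = defaultdict(int)
    for row in daily_totals:
        if start <= row["date"] <= end:
            per_vendor[row["vendor_id"]] += row["total"]
    return sorted(per_vendor.items(), key=lambda item: item[1], reverse=True)


daily_totals = build_daily_totals(sales)
report = top_vendors(daily_totals, date(2024, 1, 1), date(2024, 1, 2))
```

In this sketch the report still needs a group step, because the requested period can span several days. The intermediate collection only reduces how many documents that step has to touch.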

MongoDb about 2 million entries in about 100 collections

I want to move customer surveys for different products and survey types into mongodb.
A product can have multiple survey types.
The existing data consists of about 2 million surveys and growing.
The data will need to be queried for stats and reports, and the structure of the surveys and their questions can change over time, which means that the documents won't always have the same shape.
Which of these would suit best?
One big collection with product_id and type overhead within one db
Multiple collections per product and type within one db
Or a mix of multiple dbs and collections for product and type
I read about advantages and disadvantages and also that every case has its own solution that suits the usage and purpose.
Unfortunately, I'm not sure what applies the best for my case.
It all depends on how you will access your data: is it by customer, survey or product?
You can make a product collection and put the surveys as an array of subdocuments or you can make a customer collection and do the same thing.
It is not something we can help you with here without knowing the details of the business requirement.
Just keep in mind, MongoDB is schemaless and how you will design your documents and collections depends on how you will access your data.

What would be the best practice with multiple collections in mongodb

I need to build a SaaS application. One part is in MySQL (containing customer information, data, etc.); the other part is in MongoDB, as it uses great amounts of quite raw, unrelated data which is subject to MapReduce and Aggregation. It would be highly inefficient to use just one of the two, so I have to use them both. The scenario, in summary, is this: I have a table in the MySQL db with customer accounts, under which I have the actual users. The relevant key is the customer_id. Each customer has particularly formatted data on the MongoDB side, so I would need a set of collections for each customer, for example 31filters, 31data, 31logs, where 31 is the customer id. All these collections would reside in the same MongoDB database. Is this an acceptable approach, or would it be better to have a separate MongoDB database for each customer? What would be the best approach in terms of scalability?
Thank you

MongoDB Schema Design ordering service

I have the following objects: Company, User and Order (which contains order lines). Users place orders with 1 or more order lines, and these relate to a Company. The time period for which orders can be placed for this Company is only a week.
What I'm not sure about is where to place the orders array: should it be a collection of its own containing a link to the User and a link to the Company, should it sit under the Company, or should the orders sit under the User?
Numbers wise I need to plan for 50k+ in orders.
Queries wise, I'll probably be looking at Orders by Company mainly, but I would also need to find the Orders for a Company placed by a specific user.
1) For folks coming from the SQL world (such as myself), one of the hardest things to learn about MongoDB is the new style of schema design. In the SQL world, everything goes into third normal form. Folks come to think that there is a single right way to design their schema, because there typically is one.
In the MongoDB world, there is no one best schema design. More accurately, in MongoDB schema design depends on how the application is going to access the data.
2) Here are the key questions that you need to have answered in order to design a good schema for MongoDB:
How much data do you have?
What are your most common operations? Will you be mostly inserting new data, updating existing data, or doing queries?
What are your most common queries?
How many I/O operations do you expect per second?
What you're talking about here is modeling Many-to-One relationships:
Company -> User
User -> Order
Order -> Order Lines
Company -> Order
Using SQL you would create a pair of master/detail tables with a primary key/foreign key relationship. In MongoDB, you have a number of choices: you can embed the data, you can create a linked relationship, you can duplicate and denormalize the data, or you can use a hybrid approach.
The correct approach would depend on a lot of details about the use case of your application, many of which you haven't provided.
3) This is my best guess - and it's only a guess - as to a good schema for you.
a) Have separate collections for Users, Companies, and Orders
If you're looking at 50k+ orders, there are too many to embed in a single document. Having them as a separate collection will allow you to reference them from both the Company and the User documents.
b) Have an array of references to the Order documents in both the Company and the User documents. This makes the query "Find all Orders for this Company" a single-document query
c) If your query pattern supports it, you might also have a duplicate link from Orders back to the owning Company and/or User.
d) Assuming that the order lines are unique to the individual Order, you would embed the Order Lines in an array within the Order documents.
e) If your order lines refer back to individual Products, you might want to have a separate Product collection, and include a reference to the Product document in the order line sub-document
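To make points a) through e) concrete, here is a rough sketch of what such documents might look like, written as Python dictionaries. All field names (`order_ids`, `company_id`, `user_id`, `lines`, `product_id`, `qty`), the sample values, and the two helper functions are assumptions chosen for illustration, and a dictionary keyed by `_id` stands in for the Orders collection.

```python
# a) Separate collections; b) Company and User each hold an array of Order references.
company = {"_id": "c1", "name": "Acme", "order_ids": ["o1", "o2"]}
users = {
    "u1": {"_id": "u1", "name": "Ann", "company_id": "c1", "order_ids": ["o1"]},
    "u2": {"_id": "u2", "name": "Bob", "company_id": "c1", "order_ids": ["o2"]},
}

# c) Each Order links back to its Company and User;
# d) order lines are embedded; e) each line references a Product.
orders = {
    "o1": {
        "_id": "o1",
        "company_id": "c1",
        "user_id": "u1",
        "lines": [{"product_id": "p1", "qty": 2}, {"product_id": "p2", "qty": 1}],
    },
    "o2": {
        "_id": "o2",
        "company_id": "c1",
        "user_id": "u2",
        "lines": [{"product_id": "p1", "qty": 5}],
    },
}


def orders_for_company(company, orders):
    """Resolve the Company's order references into Order documents."""
    return [orders[order_id] for order_id in company["order_ids"]]


def orders_for_company_and_user(company, user_id, orders):
    """Orders of one Company placed by one User, using the back-link from c)."""
    return [o for o in orders_for_company(company, orders) if o["user_id"] == user_id]


company_orders = orders_for_company(company, orders)
ann_orders = orders_for_company_and_user(company, "u1", orders)
```

The back-links on each Order are what let the "orders for a Company placed by a specific user" query avoid loading the User document at all.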
4) Here are some good general references on MongoDB schema design.
MongoDB presentations:
http://www.10gen.com/presentations/mongosf2011/schemabasics
http://www.10gen.com/presentations/mongosv-2011/schema-design-by-example
http://www.10gen.com/presentations/mongosf2011/schemascale
Here are a couple of books about MongoDB schema design that I think you would find useful:
http://www.manning.com/banker/ (MongoDB in Action)
http://shop.oreilly.com/product/0636920018391.do
Here are some sample schema designs:
http://docs.mongodb.org/manual/use-cases/
Note that the "MongoDB in Action" book includes a sample schema for an e-commerce application, which is very similar to what you're trying to build -- I recommend you check it out.

should I create one or many collections in mongodb in order to insert and search faster?

I am fairly new to MongoDB. I am creating a web app which allows inserting and searching for multiple products such as laptops, hard drives, webcams... My question is: should I place all of them in the same collection, such as "computer", or should I place each product in its own collection, like "laptop", "hard drive", "webcam", so that inserting and searching for a product will be faster?
Thanks a lot
Generally speaking you should use one collection per "type" of thing you're storing. It sounds like all the examples you've given above would fall under a product "type" and should be in the same collection. Documents in the same collection need not all have the same fields, though for products you will probably have several fields in common across all documents: name, price, manufacturer, etc; each document "sub-type" might have several fields in common, like hard drives might all have RPM, storage capacity, form factor, interface (SATA2/3, IDE, etc).
The rationale for this advice is that MongoDB queries are performed on a single collection at a time. If you want to show search results that cover the different categories of products you have, then this is simple with one collection, but more difficult with several (and less performant).
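As a small illustration of the single-collection layout, the sketch below keeps laptops, hard drives and webcams in one list that stands in for a products collection. The documents, field names and the `find` helper are all hypothetical. The helper only mimics an equality match on fields and is not a MongoDB API.

```python
# One "products" collection: shared fields (type, name, manufacturer, price)
# plus sub-type specific fields that only some documents carry.
products = [
    {"type": "laptop", "name": "Book 14", "manufacturer": "Acme",
     "price": 900, "screen_inches": 14},
    {"type": "hard_drive", "name": "Disk 1T", "manufacturer": "Acme",
     "price": 80, "rpm": 7200, "capacity_gb": 1000, "interface": "SATA3"},
    {"type": "webcam", "name": "Cam HD", "manufacturer": "Other",
     "price": 40, "resolution": "1080p"},
]


def find(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


acme_products = find(products, manufacturer="Acme")   # spans product types
hard_drives = find(products, type="hard_drive")       # narrows to one type
```

A search across categories is one pass over one collection here. With a collection per product type, the same search would need one query per collection plus a merge of the results.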
As far as query performance is concerned, be sure to create indexes on the fields that you are searching on. If you allow search by product name or manufacturer, you would have an index on name, and another index on manufacturer. Your exact indexes will vary depending on the fields you have in your documents.
Insert speed, on the other hand, will be faster the fewer indexes you have (since each index potentially needs to be updated each time you save or update a document), so it is important not to create more indexes than you'll actually need.
For more on these topics, see the MongoDB docs on schema design and indexes, as well as the presentations on 10gen.com from 10gen speakers and MongoDB users.
I suggest starting with one collection. It is much simpler to search through one collection than through a collection per product. And if, in the future, your queries against the collection become slow, you can start thinking about how to speed them up.
MongoDB has fairly fast indexes and was designed to be scalable, so once you need to scale your database, replica sets and auto-sharding are in place.