How to avoid MongoDB Aggregation in a specific query? - mongodb

I'm using MongoDB and need some analytics queries to produce reports of my data. I know that MongoDB is not a good choice for OLAP applications, but if I want to use MongoDB, one solution could be pre-computing the required data. For instance, we can create a new collection for any specific OLAP query and just update that collection when related events happen in the system. Consider this scenario:
In my app I'm storing the sales information for some vendors in a sales collection. Each document in sales consists of a sale value, a vendor ID and a date. I want a report that, within a specified time period, finds the vendors that have sold the most. To avoid aggregations I've created a middle collection that stores the total amount of sales for each vendor on each day. Then, when I want to prepare that report, I just find the documents in the middle collection whose dates fall in the specified time period, group the results by vendor ID and sort them. I think this solution would have less aggregation time because there are far fewer documents in the middle collection than in the original collection. It would also be of O(n) time complexity.
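A minimal sketch of what I have in mind, in Python with PyMongo, assuming the middle collection is called daily_sales and holds documents of the form {vendor_id, date, total} (these names are just placeholders):

    from datetime import datetime
    from pymongo import MongoClient

    db = MongoClient()["shop"]  # placeholder database name

    def top_vendors(start, end, limit=10):
        """Rank vendors by total sales within [start, end), reading only the
        pre-computed per-vendor-per-day documents."""
        pipeline = [
            {"$match": {"date": {"$gte": start, "$lt": end}}},
            {"$group": {"_id": "$vendor_id", "total": {"$sum": "$total"}}},
            {"$sort": {"total": -1}},
            {"$limit": limit},
        ]
        return list(db.daily_sales.aggregate(pipeline))

    print(top_vendors(datetime(2014, 1, 1), datetime(2014, 2, 1)))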
I want to know: is there any mechanism in MongoDB that makes it possible to avoid this aggregation as well and make the query simpler?

Related

What is the best way to archive history data in mongo

I have a collection in mongo that stores every user action of my application, and it's very large (3 million documents per day). On the UI I have a requirement to show the user actions for a period of at most 6 months.
The queries on this collection are becoming very slow with all the historic data, even though there are indexes in place. So, I want to move the documents that are older than 6 months to a separate collection.
Is it the right way to handle my issue?
Following are some of the techniques you can use to manage data growth in MongoDB:
Using capped collections
Using TTL indexes (see the sketch after this list)
Using multiple collections for months
Using different databases on the same host
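For example, a TTL index that lets MongoDB automatically delete documents older than roughly 6 months could look like this (a sketch in PyMongo; the collection and field names are assumptions, not taken from your schema):

    from pymongo import MongoClient, ASCENDING

    db = MongoClient()["myapp"]  # placeholder database name

    # The TTL monitor removes a document once its "created_at" value is older
    # than expireAfterSeconds; here roughly 183 days.
    db.user_actions.create_index(
        [("created_at", ASCENDING)],
        expireAfterSeconds=183 * 24 * 60 * 60,
    )

Note that TTL deletion removes the data outright, so it only fits if you never need the documents older than 6 months; otherwise monthly collections or a separate archive database is the safer choice.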

All vs All comparisons on MongoDB

We are planning to use MongoDB for a general purpose system and it seems well suited to the particular data and use cases we have.
However, we have one use case where we will need to compare every document (of which there could be tens of millions) with every other document. The 'distance measure' could be pre-computed offline by another system, but we are concerned about the online performance of MongoDB when we want to query, e.g. when we want to see the top 10 closest documents in the entire collection to a list of specific documents ...
Is this likely to be slow? Also, can this be done across collections (e.g. query for the top 10 closest documents in one collection to a document in another collection)?
Thanks in advance,
FK
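As a hedged sketch only: if the distance measure really is pre-computed offline, one possible layout is a distances collection with one document per pre-computed pair plus an index that serves the top-10 query directly. The collection and field names below are assumptions, and storing all pairs for tens of millions of documents is usually infeasible, so typically only each document's nearest neighbours would be kept:

    from pymongo import MongoClient, ASCENDING

    db = MongoClient()["similarity"]  # placeholder database name

    # One document per pre-computed pair:
    # {"doc_a": ..., "doc_b": ..., "distance": ...}
    db.distances.create_index([("doc_a", ASCENDING), ("distance", ASCENDING)])

    def closest(doc_id, n=10):
        """Top-n closest documents to doc_id, served straight from the index."""
        return list(
            db.distances.find({"doc_a": doc_id}, {"doc_b": 1, "distance": 1, "_id": 0})
            .sort("distance", ASCENDING)
            .limit(n)
        )

Cross-collection comparisons work the same way as long as the pre-computed pair documents record which collection each side came from.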

Analytical Queries with MongoDB

I am new to MongoDB and I have difficulties implementing a solution in it.
Consider a case where I have two collections, client and sales, with the following designs:
Client
==========
id
full name
mobile
gender
region
emp_status
occupation
religion
Sales
===========
id
client_id //this would be a DBRef
trans_date //date time value
products //an array of subdocuments for the products sold, in the form {product_code, description, units, unit price, amount}
total sales
Now there is a requirement to develop another collection for analytical queries where the following questions can be answered
What is the distribution of sales by gender, region and emp_status?
What are the most purchased products for clients in a particular region?
I considered implementing a heavily denormalized collection, i.e. a flat and wide collection combining the properties of the sales and client collections, so that I can use map-reduce to answer the questions.
In an RDBMS, an aggregation backed by a join would answer these questions, but I am at a loss as to how to make Map-Reduce or Aggregation help out.
Questions:
How do I implement Map-Reduce to map across 2 collections?
Is it possible to chain MapReduce operations?
Regards.
MongoDB does not do JOINs - period!
MapReduce always runs on a single collection. You cannot have a single MapReduce job which selects from more than one collection. The same applies to aggregation.
When you want to do some data-mining (not MongoDB's strongest suit), you could create a denormalized collection of all Sales with the corresponding Client object embedded. You will have to write a little program or script (sketched after this list) which iterates over all clients and
finds all Sales documents for the client
merges the relevant fields from Client into each document
inserts the resulting document into the new collection
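A rough sketch of such a script in Python with PyMongo, following the design above (the target collection name sales_flat is an assumption, and it assumes client_id holds the client's _id directly rather than a full DBRef):

    from pymongo import MongoClient

    db = MongoClient()["store"]  # placeholder database name

    # Build a flat, denormalized collection combining each Sale with the
    # Client fields needed for the analytical queries.
    for client in db.client.find():
        for sale in db.sales.find({"client_id": client["_id"]}):
            flat = dict(sale)
            flat.pop("_id", None)  # let MongoDB assign a new _id
            for field in ("gender", "region", "emp_status", "occupation"):
                flat[field] = client.get(field)
            db.sales_flat.insert_one(flat)

After that, the gender/region/emp_status distributions and the per-region product counts can be answered by grouping over sales_flat alone.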
When your Client document is small and doesn't change often, you might consider always embedding it into each Sales document. This means that you will have redundant data, which looks very evil from the viewpoint of a seasoned RDB veteran. But remember that MongoDB is not a relational database, so you should not apply all RDBMS dogmas unreflected. The "no redundancy" rule of database normalization is only practicable when JOINs are relatively inexpensive and painless, which isn't the case with MongoDB.
Besides, sometimes you might want redundancy to preserve historical information. When you want to know your historical development of sales by region, you want to know the region where the customer resided when they bought the product, not where they reside now. When each Sale only references the current Client document, that information is lost. Sure, you can solve this with separate Address documents which have date-ranges, but that would make it even more complicated.
Another option would be to embed an array of Sales in each Client. However, MongoDB doesn't like documents which grow over time, so when your clients tend to return often, this might result in sub-par write-performance.

Using NoSQL on data with few relations

I'm currently planning the development of a service which should handle a fair number of requests and, for each request, do some logging.
Each log will have the following form
{event: "EVENTTYPE", userid: "UID", itemid: "ITEMID", timestamp: DATETIME}
I expect that a lot of writing will be done, while reading and analysis will only be done once per hour.
A requirement in the data analysis is that I have to be able to do the following query:
Are both events, A and B, on item (ITEMID) logged for user (UID)? (Maybe even tell if event A came before event B based on their timestamps)
I have thought about MongoDB as my storage solution.
Can the above query be (properly) carried out by the MongoDB aggregation framework?
In the future I might add to the analysis step a relation from ITEMID to ITEM.Categories (I have a collection of items, and each item has a series of categories). Possibly it would be interesting to know how many times event A occurred on items, grouped by the individual item's category, during the last 30 days. Will MongoDB then be a good fit for my requirements?
Some information about the data I'll be working with:
I expect to be logging on the order of 10,000 events a day on average.
I haven't decided yet whether the data should be stored indefinitely.
Is MongoDB a proper fit for my requirements? Is there another NoSQL database that will handle my requirements better? Is NoSQL even usable in this case or am I better off sticking with relational databases?
If my requirement for the frequency of analysis changes, say from once an hour to real time, I believe Redis would serve my purpose better than MongoDB. Is this correctly understood?
Are both events, A and B, on item (ITEMID) logged for user (UID)? (Maybe even tell if event A came before event B based on their timestamps)
Can the above query be (properly) carried out by the MongoDB aggregation framework?
Yes, absolutely. You can use the $group operator to aggregate events by ITEMID and UID, you can filter results before the grouping via $match to limit them to a specific time period (or with any other filter), and you can push the times (first, last) of each type of event into the document that the $group operator creates. Then you can use $project to create a field indicating what came before what, if you wish.
All of the capabilities of the aggregation framework are well outlined here:
http://docs.mongodb.org/manual/core/aggregation-pipeline/
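A minimal sketch of such a pipeline in PyMongo, assuming the log documents live in a collection called events (the collection name and the concrete date range are assumptions):

    from datetime import datetime
    from pymongo import MongoClient

    db = MongoClient()["logs"]  # placeholder database name

    pipeline = [
        # Limit to the time period and the two event types of interest.
        {"$match": {
            "timestamp": {"$gte": datetime(2014, 1, 1), "$lt": datetime(2014, 2, 1)},
            "event": {"$in": ["A", "B"]},
        }},
        # One result per (item, user) pair, keeping the earliest timestamp of
        # each event type ($min ignores the nulls produced by $cond).
        {"$group": {
            "_id": {"itemid": "$itemid", "userid": "$userid"},
            "first_a": {"$min": {"$cond": [{"$eq": ["$event", "A"]}, "$timestamp", None]}},
            "first_b": {"$min": {"$cond": [{"$eq": ["$event", "B"]}, "$timestamp", None]}},
        }},
        # Keep only pairs for which both events were logged.
        {"$match": {"first_a": {"$ne": None}, "first_b": {"$ne": None}}},
        # Flag whether A preceded B.
        {"$project": {"a_before_b": {"$lt": ["$first_a", "$first_b"]}}},
    ]

    for doc in db.events.aggregate(pipeline):
        print(doc)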
In the future I might add to the analysis step a relation from ITEMID to ITEM.Categories (I have a collection of items, and each item has a series of categories). Possibly it would be interesting to know how many times event A occurred on items, grouped by the individual item's category, during the last 30 days. Will MongoDB then be a good fit for my requirements?
Yes. Aggregation in MongoDB allows you to $unwind arrays so that you can group things by categories, if you wish. All of the things you've described are easy to accomplish with the aggregation framework.
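For instance, if the category list were denormalized onto each log entry (an assumption, since your documents only carry ITEMID), counting occurrences of event A per category over the last 30 days might look like:

    from datetime import datetime, timedelta
    from pymongo import MongoClient

    db = MongoClient()["logs"]  # placeholder database name

    pipeline = [
        {"$match": {
            "event": "A",
            "timestamp": {"$gte": datetime.utcnow() - timedelta(days=30)},
        }},
        {"$unwind": "$categories"},   # one document per category entry
        {"$group": {"_id": "$categories", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]

    print(list(db.events.aggregate(pipeline)))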
Whether or not MongoDB is the right choice for your application is outside the scope of this site, but the requirements you've listed in this question can be implemented in MongoDB.

Create aggregated user stats with MongoDB

I am building a MongoDB database that will work with an Android app. I have a user collection and a records collection. The records documents consist of GPS tracks such as start and end coordinates, total time, top speed and distance. The user document has a user id, first name, last name and so forth.
I want to have aggregate stats for each user that summarizes total distance, total time, total average speed and top speed to date.
I am unsure whether I should do a map-reduce and create an aggregate collection for users, or whether I should add these stats to the user document with some kind of cron-job-type solution. I have read many guides about map-reduce and aggregation for MongoDB but can't figure this out.
Thanks!
It sounds like your aggregate indicator values are per-user, in which case I would simply calculate them and push them directly into the user object at the same time as you update the current coordinates, speed etc. They would be nice and easy (and fast) to query, and you could aggregate them further if you wished.
When I say pre-calculate, I don't mean MapReduce, which you would use as a batch process, I simply mean calculate on update of the user object.
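As a sketch of what calculating on update could look like in PyMongo (collection and field names are assumptions), every saved record is folded into running totals on the user document, and average speed is derived on read from those totals:

    from pymongo import MongoClient

    db = MongoClient()["tracker"]  # placeholder database name

    def save_record(user_id, record):
        """Insert a GPS record and fold it into the user's running stats."""
        record["user_id"] = user_id
        db.records.insert_one(record)

        db.users.update_one(
            {"_id": user_id},
            {
                "$inc": {
                    "stats.total_distance": record["distance"],
                    "stats.total_time": record["total_time"],
                },
                # Keep the best top speed seen so far ($max needs MongoDB 2.6+;
                # on older versions, read, compare and $set instead).
                "$max": {"stats.top_speed": record["top_speed"]},
            },
        )
        # Average speed on read: stats.total_distance / stats.total_time.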
If your aggregate stats are compiled across users, then you could still pre-calculate them on update, but if you also need to be able to query those aggregate stats against some other condition or filter, such as, "tell me what the total distance travelled for all users within x region", then depending on the number of combinations you may not be able to cover all those with pre-calculation.
So, if your aggregate stats ARE across users, AND need some sort of filter applying, then they'll need to be calculated from some snapshot of data. The two approaches here are;
the aggregation framework in 2.2
MapReduce
You would use MapReduce, say, if you have a LOT of historical data that you want to crunch and you can pre-calculate the results for fast reading later. By that definition, the data isn't changing frequently, but even if it did, you can also use incremental MR to add new results to an existing calculation.
The aggregation framework in 2.2 will allow you to do a lot of this on demand. It won't, of course, be as quick as pre-calculated values, but it is way quicker than MR when executed on demand. It can't cope with the high-volume result sets that you can produce with MR, but it's better suited to queries where you don't know the parameter values in advance.
By way of example, if you wanted to calculate the aggregate sums of user stats within a particular lat/long, you couldn't use MR because there are just too many combinations of that filter, so you'd need to do that on the fly.
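A hedged sketch of that on-the-fly case, summing distance for records whose start point falls inside an arbitrary bounding box (field names are assumptions):

    from pymongo import MongoClient

    db = MongoClient()["tracker"]  # placeholder database name

    def total_distance_in_box(min_lat, max_lat, min_lng, max_lng):
        pipeline = [
            {"$match": {
                "start_lat": {"$gte": min_lat, "$lte": max_lat},
                "start_lng": {"$gte": min_lng, "$lte": max_lng},
            }},
            {"$group": {"_id": None, "total_distance": {"$sum": "$distance"}}},
        ]
        result = list(db.records.aggregate(pipeline))
        return result[0]["total_distance"] if result else 0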
If however, you wanted it by city, well you could conceivably use MR there because you could stick to a finite set of cities and just pre-calculate them all.
But to wrap up, if your aggregate indicator values are per-user alone, then I'd start by calculating and storing the values inside the user object when I update the user object as I said in the first paragraph. Yes, you're storing the value as well as the inputs, but that's the model that saves you having to calculate on the fly.