Find query inside MongoDB reduce

I have two collections: one with the raw data from devices and one with the device configuration.
I use map/reduce to process the raw data from the devices, and I need to access the configuration parameters inside the reduce step.
Is that possible?
Thanks in advance
UPDATE:
I have a lot of data to process: about 400,000 documents per day and around 4,000 configuration documents.
Does that mean I have to join the two collections and feed the result into map/reduce?
Is this the best way to do it?

No. Map/reduce should always use the data of the collection it is invoked on. If you need the configuration data, you will have to make sure it is embedded in your raw device data before invoking the map/reduce. Since m/r is just JavaScript executed server-side, it is technically possible to query other collections from inside it, but it can break (sharding comes to mind).
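A minimal sketch of that embedding step in the mongo shell, assuming the raw documents carry a deviceId that matches _id in a config collection (the collection and field names are assumptions, not from the question):

// one-off denormalization pass: copy the config parameters onto each raw document
db.raw_data.find({ config: { $exists: false } }).forEach(function (doc) {
  var cfg = db.config.findOne({ _id: doc.deviceId });  // assumed lookup key
  if (cfg) {
    db.raw_data.update({ _id: doc._id }, { $set: { config: cfg.parameters } });
  }
});
// the map/reduce job can now read this.config without touching the config collection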

Related

Handling DB query "IN" with a list of values exceeding DB capacity

I am querying Cosmos DB with a huge list of ids, and I get an exception saying I have exceeded the permissible limit of 256 characters.
What is the best way to handle such huge queries?
The only way I can think of is to split the list and execute the query in batches.
Any other suggestions?
If you're querying data this way, then your model is likely not optimal. I would look to remodel your data so that you can query on another property shared by the items you are looking for (ideally one that also keeps them within a single partition).
Note that this could also be achieved by using Change Feed to copy the data into another container with a different partition key and a new property that groups the data together. Whether you do this will depend on how often you run this query and whether it is cheaper than running the query in multiple batches.
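For completeness, here is a minimal sketch of the batching fallback mentioned in the question, using the @azure/cosmos Node.js SDK; the database, container, id field and batch size are assumptions:

const { CosmosClient } = require("@azure/cosmos");

const container = new CosmosClient(process.env.COSMOS_CONNECTION_STRING)
  .database("mydb")        // assumed database name
  .container("items");     // assumed container name

// split the big id list into chunks small enough for a single query
async function queryInBatches(ids, batchSize = 100) {
  const results = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    const batch = ids.slice(i, i + batchSize);
    const { resources } = await container.items
      .query({
        query: "SELECT * FROM c WHERE ARRAY_CONTAINS(@ids, c.id)",
        parameters: [{ name: "@ids", value: batch }],
      })
      .fetchAll();
    results.push(...resources);
  }
  return results;
}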

Firestore pagination of multiple queries

In my case there are 10 fields and all of them need to be searched with "or", which is why I'm running multiple queries and filtering the common items on the client side using Promise.all().
The problem is that I would like to implement pagination. I don't want to fetch all the results of each query, which costs too many reads. But I can't use .limit() on each query, because what I want to limit is the final result.
For example, I would like to get the first 50 common results across the 10 queries; if I apply limit(50) to each query, the final result might contain fewer than 50.
Does anyone have ideas about pagination for multiple queries?
I believe the best way to achieve that is using query cursors, so you can better manage the data that you retrieve from your searches.
I would recommend taking a look at the links below to find more information, including a question answered by the community that seems similar to your case.
Paginate data with query cursors
Multi query and pagination with Firestore
Let me know if the information helped you!
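A minimal sketch of cursor-based pagination with the Firestore Web SDK (namespaced API); the collection, filter and ordering field are assumptions, and each of the multiple queries would be paginated this way before merging on the client:

// db is an initialized firebase.firestore() instance
const first = await db.collection("products")
  .where("category", "==", "books")   // assumed filter
  .orderBy("name")
  .limit(50)
  .get();

// remember the last document of the page...
const lastVisible = first.docs[first.docs.length - 1];

// ...and start the next page right after it
const next = await db.collection("products")
  .where("category", "==", "books")
  .orderBy("name")
  .startAfter(lastVisible)
  .limit(50)
  .get();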
Not sure if it's relevant, but I think I'm having a similar problem and have come up with 4 approaches that might work around it.
1. Instead of making 10 queries, fetch all the products matching a single selection filter, e.g. category (in my case a customer can only set a single category field), and do all the filtering on the client side. With this approach the app still reads lots of documents at once, but it can at least reuse them for the duration of the session and filter with more flexibility than Firestore's strict rules allow.
2. Run the multiple queries in a server environment, such as Cloud Functions with Node.js, and return only the first 50 documents that match all the filters. With this approach the client receives only the data it wants, but the server still reads a lot.
3. This is essentially your approach combined with the accepted answer.
4. Maintain index-like documents in Firebase with the help of Cloud Functions, e.g. Colors: {red: [product1ID, product2ID, ...], ...}, storing only the document IDs. Depending on the filters, fetch the corresponding index documents server-side with Cloud Functions, intersect the matching arrays (AND logic) and push the first 50 elements of the result to the client (see the sketch below). Knowing which products to display, the client then fetches them with the client-side library.
Hope these help. Here is my original post: Firestore multiple `in` queries with `and` logic, query structure.
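A minimal sketch of the intersection step from approach 4, in plain JavaScript; the array names are assumptions:

// each array holds the document IDs matching one filter (read from the index documents)
function intersectAndTake(idArrays, pageSize = 50) {
  const [first, ...rest] = idArrays;
  const common = first.filter(id => rest.every(arr => arr.includes(id))); // AND logic
  return common.slice(0, pageSize);  // only the first page is pushed to the client
}

// usage: intersectAndTake([redIds, cottonIds, inStockIds]) -> up to 50 product IDs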

How to overcome the limitations with mongoDB aggregation framework

The aggregation framework in MongoDB has certain limitations, as per this link.
I want to remove restrictions 2 and 3.
I really do not care how large the resulting set is; I have plenty of RAM and resources.
And I do not care if it takes more than 10% of system resources.
I expect both 2 and 3 to be violated in my application, mostly 2.
But I really need the aggregation framework. Is there anything I can do to remove these limitations?
The reason:
The application I have been working on has these characteristics:
- The user can upload a large dataset.
- We have a menu that lets him sort, aggregate, etc.
- The aggregation currently has no restrictions, and the user can choose to do whatever he wants. Since the data is not known to the developer, and since it is possible to group by any number of columns, the application can error out.
Choosing something other than MongoDB is a no-go; we have already sunk too much into development with MongoDB.
Is it advisable to change the source code of Mongo?
1) Saving aggregated values directly to a collection (as MapReduce does) will be released in future versions, so the first option is just to wait for a while :) (see the note after this list).
2) If you hit the 2nd or 3rd limitation, maybe you should redesign your data schema and/or your aggregation pipeline. If you are working with large time series, you can reduce the number of aggregated docs and do the aggregation in several steps (as MapReduce does). I can't say anything more concrete, because I don't know your data/use cases (leave me a comment).
3) You can choose a different framework. If you are familiar with the MapReduce concept, you can try Hadoop (it can use MongoDB as a data source). I don't have experience with the MongoDB-Hadoop integration, but I must warn you NOT to use Mongo's own MapReduce -- it performs very poorly on large datasets.
4) You can do the aggregation inside your own code, but you should use some "low-level" language or library. For example, pymongo (http://api.mongodb.org/python/current/) is not well suited to this, but you can try something like Monary (https://bitbucket.org/djcbeach/monary/wiki/Home) to efficiently extract the data, and NumPy or Pandas to aggregate it the way you want.
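For reference, the feature mentioned in 1) did ship in later MongoDB releases: from 2.6 onwards the aggregate command accepts an allowDiskUse option (which lifts the in-memory limit on pipeline stages) and an $out stage that writes the result to a collection instead of returning it inline. A minimal mongo shell sketch, with the collection, grouping key and field names assumed:

db.rawdata.aggregate(
  [
    { $group: { _id: "$userId", total: { $sum: "$amount" } } },  // assumed grouping
    { $out: "aggregated_results" }       // write the full result set to a collection
  ],
  { allowDiskUse: true }                 // let large pipeline stages spill to disk
);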

Easiest way to scale Mongo with limited resources?

I have a web server (40 GB HD, 1 GB RAM) that runs Mongo and a Rails application.
The Mongo DB is a document store of Twitter tweets and users, which has several million records. I perform map-reduce queries on the data to extract things like the most common hashtags, words, mentions, etc. (very standard stuff). The metadata of each tweet is already stored, so the map-reduce is really about as efficient as a single collect.
However, since it is run on a (fairly) large dataset, it can't be done in real time anymore - for example, I have a report generator that works through a whole bunch of these map-reduces in a row and takes about 2 minutes for 20 thousand tweets.
What is the quickest, cheapest way to scale Mongo, especially for map-reduce performance? I can set up an additional server and split the load, but wonder if I should use sharding, replication or both? Sharding may be overkill for this situation.
I would also love some input on my MySQL-Mongo connection. MySQL contains Twitter profiles that store the tweet ids for each profile. Each time a map-reduce is done, it collects all the ids to be fed as options into the map-reduce, i.e.:
@profile_tweet_ids = current_profile_tweet_ids  # array of ids
@daily_trend = TwitterTweet.daily_trend :query => {:twitter_id => {"$in" => @profile_tweet_ids}}
The mapreduce function in TwitterTweet looks like:
def daily_trend(options={})
  options[:out] = "daily_trend"
  map = %Q( function() {
    if (this.created_at != null) {
      emit(this.created_at.toDateString(), 1);
    }
  })
  result = collection.map_reduce(map, standard_reduce, options)
  normalize_results(result)
end
Any advice is appreciated!
If you are doing simple counts, sums, uniques, etc., you may be able to avoid using map-reduce completely. You can use the $inc operator to get most of the stuff that you need in real time.
I have explained this in detail in my blog post on real-time analytics with MongoDB.
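A minimal sketch of that pre-aggregation pattern in the mongo shell, keeping a per-day hashtag counter up to date as tweets arrive; the collection name and bucket key are assumptions:

// bump counters as each tweet comes in, creating the document on first use
db.hashtag_counts_daily.update(
  { _id: { day: "2012-04-01", tag: "#mongodb" } },   // assumed bucket key
  { $inc: { count: 1 } },
  { upsert: true }
);

// "most common hashtags today" then becomes an indexed read instead of a map-reduce
db.hashtag_counts_daily.find({ "_id.day": "2012-04-01" }).sort({ count: -1 }).limit(10);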
Sounds like your use case is more along the lines of online stream / event processing.
You can use Mongo or another database / caching product to store reference data, and an event-processing framework for receiving and processing the events. There are a few tools that can help you with that - off the top of my head, here are a few: Twitter Storm, Apache S4, GigaSpaces XAP (disclaimer - I work for GigaSpaces) and GridGain.
Use one of the cloud services like MongoLab. Depends on your definition of cheap, though.
The answer about using operators rather than MapReduce has merit, and may be far more beneficial to your efforts to get real-time responses. Map-reduce on MongoDB does not lend itself to real-time responses.
Further to that, you may also benefit from the new aggregation framework (http://www.mongodb.org/display/DOCS/Aggregation+Framework), once it is available in the next release.
To answer the more general question about how to scale out MapReduce: adding a new server may not help if you simply add it as a secondary, because a secondary cannot store your M/R results in a collection, so inline output is your only option there (a short sketch of that follows below). If you do not need to store results in a collection, then this is your easiest way forward. For more information, see the in-depth discussion here: http://groups.google.com/group/mongodb-user/browse_thread/thread/bd8f5734dc64117a
Sharding can help with scaling out, but bear in mind that you will need to run everything through a mongos process and have config servers, and that the mongos will need to finalize the result sets returned from each shard, so you add a new potential bottleneck depending on your data, and you will need more than just one extra machine to have it all working reliably.
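A minimal sketch of the inline-output option mentioned above, in the mongo shell; the map and reduce bodies are placeholders based on the question's daily-trend example:

db.tweets.mapReduce(
  function () { emit(this.created_at.toDateString(), 1); },                                // map
  function (key, values) { return values.reduce(function (a, b) { return a + b; }, 0); },  // reduce
  { out: { inline: 1 } }   // return results directly instead of writing to a collection
);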

How can I reduce Mongo db by averaging out old data

I have a MongoDB for measurements, with one document per measurement. Each doc looks like:
{
  timestamp: 123,
  value: 123,
  meta1: something,
  meta2: something
}
I get measurements from a number of sources every second, so the db grows large quickly. I'm interested in keeping recent information at the frequency it was read in, but I would like to periodically average out older data to save space and make the db a bit quicker.
1. What's the best approach in Mongo?
2. Is there a better db for this, considering that the schema differs between measurements and a fixed format wouldn't work very well? RRD is also not an option, as I need the dynamic query capabilities.
1. What's the best approach in Mongo?
Use capped collections for use cases such as logging. Another approach is to create a 'background process' that moves old data out of the collection.
2. Is there a better db for this, considering that the schema differs between measurements and a fixed format wouldn't work very well? RRD is also not an option, as I need the dynamic query capabilities.
MongoDB is a good fit here.
Update:
Another approach is to store each data item twice: first in a capped collection (and use this collection for querying), and also in another collection (or even another log db) just for logging your events.
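A minimal sketch of the capped-collection side of that suggestion in the mongo shell; the collection name and size are assumptions:

// fixed-size collection: once full, the oldest documents are overwritten automatically
db.createCollection("measurements_recent", { capped: true, size: 100 * 1024 * 1024 });

// queries over recent data go against the capped collection, newest first
db.measurements_recent.find({ meta1: "sensor42" }).sort({ $natural: -1 }).limit(100);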
Thanks for the input.
I think I'm going to try using buckets for different timeframes. So I'll create 3 stores corresponding to, say, 1 sec, 1 min and 15 min, and then manage the aggregation through a manual job that runs every so often and compacts/averages out the values, deletes data that's no longer needed, etc.
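A minimal sketch of such a rollup job in the mongo shell (2.6+, where aggregate returns a cursor), averaging raw 1-second readings older than an hour into 1-minute buckets and then deleting the raw documents; all collection and field names are assumptions:

var cutoff = Math.floor(Date.now() / 1000) - 3600;   // roll up anything older than 1 hour

db.measurements_1sec.aggregate([
  { $match: { timestamp: { $lt: cutoff } } },
  { $group: {
      _id: {
        minute: { $subtract: ["$timestamp", { $mod: ["$timestamp", 60] }] },  // floor to the minute
        meta1:  "$meta1"
      },
      value:   { $avg: "$value" },
      samples: { $sum: 1 }
  } }
]).forEach(function (doc) {
  db.measurements_1min.insert({
    timestamp: doc._id.minute,
    meta1: doc._id.meta1,
    value: doc.value,
    samples: doc.samples
  });
});

// once the averages are stored, drop the raw documents
db.measurements_1sec.remove({ timestamp: { $lt: cutoff } });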
I'm not sure about the best approach, but a simple one would be to have a cron job that removes all documents older than a given timestamp (your_time = now - some_time):
db.docs.remove({ timestamp : {'$lte' : your_time}})
Given that you need a schemaless database that allows you to perform dynamic queries, MongoDB seems to be a good fit.