How to store query output in temp db? - mongodb

I am really new to programming, but I am studying it, and I have one problem which I don't know how to solve.
I have a collection of docs in MongoDB and I'm using Elasticsearch to query the fields. The problem is that I want to store the output of the search back in MongoDB, but in a different DB. I know that I have to create a temporary DB which gets updated with every search result, but how do I do this? Or point me to documentation to read so I can learn it. I will really appreciate your help!

Mongo does not natively support "temp" collections.
A typical thing to do here is not to write the entire results output to another DB at all: that would be largely pointless, since Elasticsearch does its own caching, so you don't need any extra layer on top of it.
Also, due to IO concerns it is normally a bad idea to write, say, a result set of 10k records into Mongo or another DB.
There is a feature request for what you are talking about: https://jira.mongodb.org/browse/SERVER-3215 but it has not been planned yet.
Example
You could have a collection of results.
Within this collection you would have docs that look like:
{keywords: ['bok', 'mongodb']}
Each time you search, you scroll through the result items and write a document to this collection, populating the keywords field with keywords from that search result: one document per result, per result list, per search. It is probably best to just stream each search result into MongoDB as it comes in. I have never programmed Python (though I wish to learn), so here is an example in pseudocode:
var elastic_results = [ /* hits returned by your Elasticsearch query */ ];
elastic_results.forEach(function (result) {
    // split the phrases in this result down into a keywords array
    // (assumes each hit has a `text` field; adjust to your document shape)
    var keywords = result.text.split(/\s+/);
    // lazy single-document inserts are fine here; no need to batch, just stream it in
    db.results_collection.insert({ keywords: keywords });
});
So as you go through your results you basically just mass insert as fast as possible, creating a sort of "stream" of input into MongoDB. It can handle this quite well.
This should then give you a shardable list of words and language verbs that you can run things like MapReduce jobs on, and aggregate statistics about.
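For example, a minimal sketch of aggregating keyword frequencies over that collection (the pipeline stages are illustrative):
// Count how often each keyword appears across all stored search results.
db.results_collection.aggregate([
    { $unwind: "$keywords" },
    { $group: { _id: "$keywords", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])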
Without knowing more about your scenario this is pretty much my best answer.
This does not use the temp table concept, but instead makes your data permanent, which sounds fine, since you wish to use Mongo as a storage engine for further tasks.

Actually, there is a MongoDB river plugin that works with Elasticsearch...

// Copy every document from one collection into another (runs in the mongo shell):
db.your_table.find().forEach(function (doc) { db.another_table.insert(doc); });

Related

How do I efficiently page a large collection of query results with Sails.js / Waterline?

I'm working with a large dataset behind the Waterline ORM. In several use cases I need to do some processing on many or most of the records, on the order of tens of thousands.
So far I've been working with .find(), but that executes and returns the entire result set. Is there a Sails/Waterline approach to iterating over a query result which preserves the storage-agnostic aspect of the ORM?
You can use paginate, something like -> Model.find().paginate({page: xx, limit: xx});
More info here: http://sailsjs.org/documentation/concepts/models-and-orm/query-language
Search for pagination :)
If you want to keep the storage-agnostic Waterline trait, you will have to take a look at your actual schema implementation (even if you're coding storage-agnostic).
You can:
Use pagination as #holzanic answers; however, this might run into critical performance issues in some storage technologies.
Use streams.
If you will be listing whole objects from a Model, you can paginate by id instead: take the first n elements in a query, then fetch the next page by asking for records whose id attribute is greater than the last id received in the previous page (see the sketch just below).
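A minimal sketch of that id-based approach, assuming a model named SomeModel with a monotonically increasing numeric id attribute:
// Fetch the next page of records whose id is greater than the last one seen.
function fetchNextPage(lastId, pageSize, cb) {
    SomeModel.find({ id: { '>': lastId } })  // keyset condition instead of skip/offset
        .sort('id ASC')
        .limit(pageSize)
        .exec(cb);
}
// Usage: start with lastId = 0 and keep calling until fewer than pageSize records come back.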

MongoDB. Use cursor as value for $in in next query

Is there a way to use the cursor returned by the previous query as a value for $in in the next query? For example, something like this:
var users = db.user.find({state:1})
var offers = db.offer.find({user:{$in:users}})
I think this can reduce the traffic between MongoDB and the client in cases where the client doesn't need the user information at all, just the offers. Am I wrong?
Basically you want to do a join between two collections which Mongo doesn't support. You can reduce the amount of data being transferred from the server by limiting the fields returned from the first query to only the unique user information (i.e. the _id) that you need to get data from the offers collection.
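A mongo shell sketch of that, projecting only _id from the first query and feeding the ids into $in:
// Pull only the _id field from matching users, then use the ids in the second query.
var userIds = db.user.find({ state: 1 }, { _id: 1 }).map(function (u) { return u._id; });
var offers = db.offer.find({ user: { $in: userIds } });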
If you really just want to make one query then you should store more information in the offers collection. For example, if you're trying to find offers for active users then you would store the active state of the user in the offers collection.
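For the single-query route, a sketch of the denormalised shape (the userState field name, userId and newState are illustrative placeholders):
// Each offer document carries the owning user's state, so one query can filter on it.
db.offer.find({ userState: 1 })
// When a user's state changes, update their offers as well to keep the copy in sync.
db.offer.update({ user: userId }, { $set: { userState: newState } }, { multi: true })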
To work from your comment:
Yes, that's why I used the tag 'join' in the question. The idea is that I can make the first query more complex, using a bunch of fields and regexes, without storing user data in other collections except references. In these cases I always have to perform two consecutive queries, but transferring the results of the first query is not necessary, either for me or for MongoDB itself. I just want to understand whether it can be done now, whether it will be possible in the future, or whether it cannot be implemented for some technical reason.
As far as I understand it, there is no immediate hurry to make this possible. Also, the way cursors are coded at the moment would make this quite a big change to how they work and are defined, a change big enough to possibly break implementations for other people. It is a bit like the question of whether to make safe writes the default for inserts and updates in all future drivers: it is recognised that safe should be the default, but changing it would break things for people who expect it the other way around.
It is rather inefficient if you don't require the results of the first query at all; however, since most networks are provisioned with high traffic in mind and the traffic is cheap, there hasn't been much demand for the ability to chain queries server-side inside the cursor.
That said, subselects (which this basically is: selecting a set of rows based upon a sub-selection of previous rows) have come up on the mongodb-user list a couple of times, and there might even be a JIRA ticket for it somewhere; if not, it might be useful to create one.
As for doing it right now: there is no way.

Storing word frequency data

I am trying to store word frequency data using Mongo. Each word needs to be associated to a user so I can calculate how often an individual uses each word. Currently my words collection looks like this:
{'Hello':3, 'user_id':1}
Which obviously only works on a 'One To One' basis and is no good.
I am trying to work out how best to make this a 'One To Many' relationship between the user and the words. Would I store the user relationship in my words collection like so:
{'word':"Hello", 'users':[{'id':1, 'count':4},{'id':2, 'count':10}]}
Or would I attach the word counts to the user collection instead?
{'id':1, 'username':'SomeUser', 'words':[{'Hello':4}]}
The obvious disadvantage of the second approach is that the same words will be used across different users, so having a single words collection would help to keep the data size down.
Can anyone advise me as to what I should do here? Is there a method I have perhaps overlooked in the documentation?
The obvious disadvantage of the second approach is that the same words will be used across different users, so having a single words collection would help to keep the data size down.
Nope, that's the nature of using a document DB. Data size is really not the main concern in NoSQL solutions; the important thing is how easily and how quickly you can access your data.
Your first approach is a typical textbook relational model. There is no advantage to using it in Mongo (though you can model things relationally in Mongo). The second approach instead gives you:
Faster reads/writes, since every word is stored inside the user document, so you don't need to perform multiple queries to get at the counts.
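A minimal sketch of updating a count with that embedded shape (collection and field names are illustrative; word keys must not contain dots):
// Bump the per-user count for the word "hello"; $inc creates the field if it is missing.
db.users.update(
    { _id: 1 },
    { $inc: { 'words.hello': 1 } }
)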

How can I reduce Mongo db by averaging out old data

I have a mongodb for measurements which has a document per measurement. Each doc looks like:
{
    timestamp: 123,
    value: 123,
    meta1: "something",
    meta2: "something"
}
I get measurements from a number of sources every second, so the db gets quite large, quickly. I'm interested in keeping the recent information at the frequency it was read in, but I would like to periodically average out older data to save space and make the db a bit quicker.
1. What's the best approach in Mongo?
2. Is there a better db for this, considering that the schema is different for different measurements and a fixed format wouldn't work very well? RRD is also not an option, as I need the dynamic query abilities.
1. What's the best approach in Mongo?
Use capped collections for use cases such as logging (a sketch of creating one follows at the end of this answer). Another approach is to create a 'background process' that will move old data out of the collection.
2. Is there a better db for this, considering that the schema is different for different measurements and a fixed format wouldn't work very well? RRD is also not an option, as I need the dynamic query abilities.
Mongodb is a good fit here.
Update:
Another approach is to store each data item twice: first in a capped collection (and use this collection for querying), and then in another collection (or even another log db) just for logging your events.
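A minimal sketch of creating such a capped collection (the name and size cap are illustrative):
// Documents are kept in insertion order and the oldest are discarded automatically
// once the collection reaches the size cap.
db.createCollection("measurements_recent", { capped: true, size: 512 * 1024 * 1024 })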
Thanks for the input.
I think I'm going to try out using buckets for different timeframes. I'll create 3 stores corresponding to, say, 1 sec, 1 min and 15 min, and then manage the aggregation through a manual job running every so often which will compact/average out the values, delete stuff that's not needed, etc...
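A minimal sketch of what such a roll-up job could look like in the mongo shell, assuming illustrative collection names (measurements_1sec, measurements_1min), the document shape from the question, and a shell where aggregate() returns a cursor (MongoDB 2.6+):
// Average second-level measurements older than an hour into per-minute buckets,
// write them to the coarser collection, then delete the raw documents.
var cutoff = Math.floor(Date.now() / 1000) - 3600;
db.measurements_1sec.aggregate([
    { $match: { timestamp: { $lt: cutoff } } },
    { $group: {
        _id: { $subtract: ["$timestamp", { $mod: ["$timestamp", 60] }] },  // start of the minute
        value: { $avg: "$value" }
    } }
]).forEach(function (doc) {
    db.measurements_1min.insert({ timestamp: doc._id, value: doc.value });
});
db.measurements_1sec.remove({ timestamp: { $lt: cutoff } });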
I'm not sure about the best approach but a simple one would be to have a cron job that would remove all the documents older than a given timestamp (your_time = now - some_time).
db.docs.remove({ timestamp : {'$lte' : your_time}})
Given that you need a schemaless database that allows you to perform dynamic queries, mongodb seems to be a good fit.

Hadoop Map/Reduce - simple use example to do the following

I have a MySQL database where I store the following: a BLOB (which contains a JSON object) and an ID (for this JSON object). The JSON object contains a lot of different information. Say, "city: Los Angeles" and "state: California".
There are about 500k such records for now, but they are growing. And each JSON object is quite big.
My goal is to do searches (real-time) in MySQL database.
Say, I want to search for all JSON objects which have "state" to "California" and "city" to "San Francisco".
I want to utilize Hadoop for the task.
My idea is that there will be a "job" which takes chunks of, say, 100 records (rows) from MySQL, verifies them according to the given search criteria, and returns the IDs of those which qualify.
Pros/cons? I understand that one might think I should utilize plain SQL for this, but the thing is that the JSON object structure is pretty "heavy". If I map it to SQL schemas, there will be at least 3-5 table joins, which (I tried, really) creates quite a headache, and building all the right indexes eats RAM faster than one can think. ;-) And even then, every SQL query has to be analyzed to make sure it uses the indexes, otherwise a full scan is literally painful. With such a structure the only way "up" is vertical scaling, and I am not sure that's the best option for me, as I can see how the JSON objects will grow (in structure), and I can see that their number will grow too. :-)
Help? Can somebody point me to simple examples of how this can be done? Does it make sense at all? Am I missing something important?
Thank you.
A few pointers to consider:
Hadoop (HDFS specifically) distributes data around a cluster of machines. Using MapReduce to analyze/process this data requires that the data is stored on the HDFS to make use of the parallel processing power Hadoop offers.
Hadoop/MapReduce is nowhere near real-time. Even when running on small amounts of data, the time Hadoop takes to set up a job can be 30+ seconds. This is something that can't be avoided.
Maybe something to look into would be using Lucene to index your JSON objects as documents. You could store the index in Solr and easily query on anything you want.
In fact you are missing something: searching for text in a single huge field will take much more time than indexing the database and searching the proper SQL way. The database was built to be used with SQL and indexes; it does not have the capability to parse and index JSON, so whatever way you find to search the JSON (probably just hacky string matching) will be much slower. 500k rows is not that much for MySQL to handle; you don't really need Hadoop, just a good normalized schema, the right indices and optimized queries.
Sounds like you are trying to recreate CouchDB. CouchDB is built with a map-reduce framework and is made to work specifically with JSON objects.
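For a feel of what that looks like, a minimal sketch of a CouchDB map function for a hypothetical view, assuming each document has state and city fields as in the example above:
// Emit each document's id under a [state, city] key so the view can be
// queried with key=["California", "San Francisco"].
function (doc) {
    if (doc.state && doc.city) {
        emit([doc.state, doc.city], doc._id);
    }
}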