Is there some way that, if I want to read some data from the database with certain constraints, instead of waiting to get all the results at once, the database can start "streaming" its results to me?
Think of a large list.
Instead of making users wait for the entire list, I want to start filling in data quickly, even if I only get one row at a time.
I only know of MongoDB with limit(x) and skip(y).
Is there any way to get streaming results from any database? I'm asking out of curiosity, and for a project I'm currently thinking about.
Here's an example of a Python connection to MongoDB, reading data one document at a time:
from pymongo import MongoClient

client = MongoClient()
db = client.blog
col = db.posts
for r in col.find():
    print(r)
    input("press any key to continue...")
All standard MongoDB drivers return a cursor on queries (find() command), which allows your application to stream the documents by using the cursor to pull back the results on demand. I would check out the documentation on cursors for the specific driver that you're planning to use as syntax varies between different programming languages.
There's also a special type of cursor specific for certain streaming use cases. MongoDB has a concept of a "Tailable Cursor," which will stream documents to the client as documents are inserted into a collection (also see AWAIT_DATA option). Note that Tailable cursors only work on "capped collections" as they've been optimized for this special usage. Documentation can be found on the www.mongodb.org site. Below is a link to some code samples for tailable cursors:
http://docs.mongodb.org/manual/tutorial/create-tailable-cursor/
I'm currently experimenting with a test collection on a LAN-accessible MongoDB server, feeding data into a Meteor (v1.6) application. My view layer of choice is React, and right now I'm using createContainer to bind the subscriptions to props.
The data that gets put into MongoDB storage is updated on a daily basis and consists of a big set of data from several SQL databases, netting up to about 60,000 lines of JSON per day. The data has been ever-so-slightly reshaped to turn it into a usable format whilst remaining as RAW as I'd like it to be.
The working solution right now is fetching all this data and doing further manipulations client-side to prepare the data for visualization. The issue should seem obvious: each client is fetching a set of documents that grows every day and repeats a lot of work on earlier entries before being ready to display. I want to do this manipulation on the server, through MongoDB's Aggregation Framework.
My initial idea is to do the aggregations on the server and to create new Collections containing smaller, more specific datasets without compromising the RAWness of the original Collection. That would mean the "reduced" Collections can still be reactive, as I've been able to confirm through testing in a Remote Desktop, subscribing to an aggregated Collection which I can update through Robo3T.
I don't know if this would be ideal. As far as storage goes, there's plenty of room for the extra Collections. But I have no idea how to set up an automated aggregation script on said server. And regarding Meteor, I've tried using meteorhacks:aggregate and jcbernack:reactive-aggregate but couldn't figure out how to deal with either of them. If anyone is dealing with, or has dealt with, something similar, I'd love to hear ideas/suggestions.
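For what it's worth, the "reduced Collections" idea can be driven by an aggregation pipeline ending in a $out stage, which writes the pipeline's results into a named collection instead of returning them. A minimal sketch with PyMongo (all collection and field names below are invented, not taken from the question):

```python
# Pipeline ending in $out: results land in the "daily_summary" collection
# rather than being returned to the client. Names here are hypothetical.
DAILY_SUMMARY_PIPELINE = [
    {"$match": {"day": {"$exists": True}}},            # keep only dated documents
    {"$group": {"_id": "$day", "rows": {"$sum": 1}}},  # one summary doc per day
    {"$out": "daily_summary"},                         # replace the reduced collection
]

def refresh_daily_summary(db):
    """Run after each daily import (e.g. from a cron job) to rebuild the
    reduced collection without touching the raw one."""
    db.raw_data.aggregate(DAILY_SUMMARY_PIPELINE)
```

Subscribing to daily_summary from Meteor would then keep the reduced data reactive, as described above.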
I will be working on a large data set that changes slowly, so I want to optimize query result time by using a caching mechanism. For example, if I want to see some metrics about the data from the last 360 days, I don't need to query the database again because I can reuse the last query result.
Does MongoDB natively support caching, or do I have to use another database, for example Redis as mentioned here?
EDIT: my question is different from "Caching repeating query results in MongoDB" because I asked about external caching systems, and the answer to the linked question was specific to working with MongoDB and Tornado.
The author of the Motor (MOngo + TORnado) package gives an example of caching his list of categories here: http://emptysquare.net/blog/refactoring-tornado-code-with-gen-engine/
Basically, he defines a global list of categories and queries the database to fill it in; then, whenever he needs the categories in his pages, he checks the list: if it exists, he uses it; if not, he queries again and fills it in. He has it set up to invalidate the list whenever he inserts into the database, but depending on your usage you could create a global timeout variable to keep track of when you need to re-query next. If you're doing something complicated, this could get out of hand, but if it's just a list of the most recent posts or something, I think it would be fine.
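As a rough illustration of that pattern (not Motor-specific; fetch_from_db stands in for whatever query you run, and the 300-second timeout is an arbitrary choice):

```python
import time

# All names here are invented for illustration; CACHE_TTL is an arbitrary choice.
_cache = {"value": None, "fetched_at": 0.0}
CACHE_TTL = 300  # seconds before a cached result is considered stale

def get_categories(fetch_from_db):
    """Return the cached list, re-querying only when the cache is empty or stale."""
    now = time.time()
    if _cache["value"] is None or now - _cache["fetched_at"] > CACHE_TTL:
        _cache["value"] = fetch_from_db()  # the actual database query
        _cache["fetched_at"] = now
    return _cache["value"]

def invalidate_categories():
    """Call after inserting into the database so the next read re-queries."""
    _cache["value"] = None
```

The invalidate-on-insert variant is what the blog post does; the TTL variant trades a little staleness for not having to hook every write path.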
What I would like to do is to make a query against my Cassandra "table" and get not only the current matching data but any future data that's added.
I have an application where data is constantly added to the "table" and I have many "clients" that are interested in getting this data.
So the initial result of the query would be the current data that matches the client's query and then I would like ongoing data to be received as they are added. Each client may be making a different query.
I would prefer to have a callback registered with a query so that I receive the data w/o having to poll.
Is this even possible with Cassandra?
Thank you.
P.S. From my reading, it seems MongoDB does support this feature.
You can't do this in Cassandra at present, but the new triggers feature coming in Cassandra 2.0 may do what you need. It's only going to be experimental when 2.0 comes out (soon).
MongoDB does indeed have a feature that might fit the bill. It's called a "tailable cursor" and can only be used on a capped collection, i.e. a collection that works like a ring buffer and "forgets" old data. After the tailable cursor has exhausted the entire collection the next read attempt will block until new data becomes available.
You can convert this into a callback pattern easily by implementing a reader thread with which the rest of the application can register its callbacks.
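A minimal sketch of that reader-thread pattern in Python (read_next stands in for a blocking read from the tailable cursor; returning None signals shutdown, which is a convention invented for this example):

```python
import threading

class DocumentFeed:
    """Reader thread that pulls documents from a blocking source and fans
    them out to registered callbacks."""

    def __init__(self, read_next):
        self._read_next = read_next  # blocking call: next doc, or None to stop
        self._callbacks = []
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def register(self, callback):
        with self._lock:
            self._callbacks.append(callback)

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()

    def _run(self):
        while True:
            doc = self._read_next()
            if doc is None:
                break
            with self._lock:
                callbacks = list(self._callbacks)
            for cb in callbacks:
                cb(doc)  # deliver the document to every registered client
```

In a real application, read_next would wrap the tailable cursor's blocking next() call.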
When storing events in an event store, the order in which the events are stored is very important, especially when projecting the events later to restore an entity's current state.
MongoDB seems to be a good choice for persisting the event store, given its speed and flexible schema (and it is often recommended as such), but there is no such thing as a transaction in MongoDB, meaning the correct event order cannot be guaranteed.
Given that fact, should you avoid MongoDB if you are looking for a consistent event store and rather stick with a conventional RDBMS, or is there a way around this problem?
I'm not familiar with the term "event store" as you are using it, but I can address some of the issues in your question. I believe it is probably reasonable to use MongoDB for what you want, with a little bit of care.
In MongoDB, each document has an _id field which is by default in ObjectId format: a timestamp in the leading bytes, followed by a machine/process identifier and then a sequence counter. So you can sort on that field and you'll get your documents roughly in creation order; the ordering is exact as long as the ObjectIds are all created on the same machine.
Most MongoDB client drivers create the _id field locally before sending the insert command to the database. So if you have multiple clients connecting, sorting by _id reflects each client's local clock: clock skew between the client machines, plus arbitrary ordering within the same second, means the sort may not match the order in which the server actually received the writes.
But if you can convince your MongoDB client driver to not include the _id in the insert command, then the server will generate the ObjectId for each document and they will have the properties you want. Doing this will depend on what language you're working in since each language has its own client driver. Read the driver docs carefully or dive into their source code -- they're all open source. Most drivers also include a way to send a raw command to the server. So if you construct an insert command by hand this will certainly allow you to do what you want.
This will break down if your system is so massive that a single database server can't handle all of your write traffic. The MongoDB solution to needing to write thousands of records per second is to set up a sharded database. In this case the ObjectIds will again be created by different machines and won't have the nice sorting property you want. If you're concerned about outgrowing a single server for writes, you should look to another technology that provides distributed sequence numbers.
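To make the ordering property concrete: per the documented ObjectId format (a big-endian Unix timestamp in the first four bytes of the 12-byte id), the creation time can be recovered from the hex string without any driver. A small sketch:

```python
import struct
from datetime import datetime, timezone

def objectid_creation_time(hex_id: str) -> datetime:
    """Decode the leading 4-byte big-endian timestamp of a 12-byte ObjectId."""
    seconds, = struct.unpack(">I", bytes.fromhex(hex_id)[:4])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Because the timestamp leads, sorting documents by _id orders them by
# creation second first; ties within a second fall back to the other bytes.
```

This is also why clock skew between clients matters: the timestamp is whatever clock generated the id.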
I am really new to programming, but I am studying it. I have one problem that I don't know how to solve.
I have a collection of docs in MongoDB and I'm using Elasticsearch to query the fields. The problem is that I want to store the output of a search back in MongoDB, but in a different DB. I know that I have to create a temporary DB which has to be updated with every search result. But how do I do this? Or point me to documentation to read so I can learn it. I will really appreciate your help!
Mongo does not natively support "temp" collections.
A typical thing to do here is to not write the entire results output to another DB at all, since that would be pointless: Elasticsearch does its own caching, so you don't need any layer over the top.
As well, due to I/O concerns, it is normally a bad idea to write, say, a result set of 10k records to Mongo or another DB.
There is a feature request for what you talk of: https://jira.mongodb.org/browse/SERVER-3215 but no planning as of yet.
Example
You could have a table of results.
Within this table you would have a doc that looks like:
{keywords: ['bok', 'mongodb']}
Each time you search, as you scroll through the result items, you would write a row to this table, populating the keywords field with the keywords from that search result: one row per result, per result list, per search. It would probably be best to just stream each search result into MongoDB as it comes in. I have never programmed Python (though I wish to learn), so here is an example in pseudo-code:
var elastic_results = [/* documents returned by Elasticsearch */];
elastic_results.forEach(function(result) {
    // split the phrases in this result down into a keywords array
    var keywords = splitIntoKeywords(result); // placeholder for your own tokenizer
    // lazy single inserts are fine here; no need to batch or shrink the data,
    // just stream it in
    db.results_collection.insert({keywords: keywords});
});
So as you go through your results you basically just mass-insert as fast as possible, creating a sort of "stream" of input into MongoDB. It can handle this quite well.
This should then give you a shardable list of words and language verbs to run things like map/reduce jobs on and to aggregate statistics about.
Without knowing more about your scenario, this is pretty much my best answer.
This does not use the temp table concept, but instead makes your data permanent, which sounds fine, since you want to use Mongo as a storage engine for further tasks.
Actually, there is a MongoDB river plugin to work with Elasticsearch...
db.your_table.find().forEach(function(doc) { db.another_table.insert(doc); });