What I would like to do is to make a query against my Cassandra "table" and get not only the current matching data but any future data that's added.
I have an application where data is constantly added to the "table" and I have many "clients" that are interested in getting this data.
So the initial result of the query would be the current data that matches the client's query, and then I would like to receive new data as it is added. Each client may be making a different query.
I would prefer to have a callback registered with a query so that I receive the data without having to poll.
Is this even possible with Cassandra?
Thank you.
P.S. From my reading, it seems MongoDB does support this feature.
You can't do this in Cassandra at present, but the new triggers feature coming in Cassandra 2.0 may do what you need. It's only going to be experimental when 2.0 comes out (soon).
MongoDB does indeed have a feature that might fit the bill. It's called a "tailable cursor" and can only be used on a capped collection, i.e. a collection that works like a ring buffer and "forgets" old data. After the tailable cursor has exhausted the entire collection the next read attempt will block until new data becomes available.
You can convert this into a callback pattern easily by implementing a reader thread with which the rest of the application can register its callbacks.
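A minimal sketch of that reader-thread pattern in Python, assuming pymongo, a capped collection named events, and whatever callbacks the application wants to register:

import threading
from pymongo import MongoClient, CursorType

callbacks = []                     # functions to call for every new document

def register(callback):
    callbacks.append(callback)

def reader():
    col = MongoClient().mydb.events            # assumed capped collection
    # TAILABLE_AWAIT blocks on the server for a while waiting for new documents
    cursor = col.find(cursor_type=CursorType.TAILABLE_AWAIT)
    while cursor.alive:
        for doc in cursor:
            for callback in callbacks:
                callback(doc)

register(lambda doc: print("new document:", doc))
threading.Thread(target=reader, daemon=True).start()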
I am attempting to figure out a solid strategy for handling schema changes in Firestore. My thinking is that schema changes would often require reading and then writing to every document in a collection (or possibly documents in a different collection).
Here are my concerns:
I don't know how large the collection will be in the future. Will I hit any limitations on how many documents can be read in a single query?
My current plan is to run the schema change script from Cloud Build. Is it possible this will timeout?
What is the most efficient way to do the actual update? (e.g. read document, write update to document, repeat...)
Should I be using batched writes?
Also, feel free to tell me if you think this is the complete wrong approach to implementing schema changes, and suggest a better solution.
I don't know how large the collection will be in the future. Will I hit any limitations on how many documents can be read in a single query?
If the number of documents gets too large to handle in a single query, you can start paginating the results.
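For example, with the Python Admin SDK (google-cloud-firestore), a paginated read over a hypothetical items collection could look roughly like this; the collection name, page size and ordering by document ID are assumptions:

from google.cloud import firestore

db = firestore.Client()
page_size = 500
query = db.collection("items").order_by("__name__").limit(page_size)

last_doc = None
while True:
    q = query.start_after(last_doc) if last_doc else query
    docs = list(q.stream())
    if not docs:
        break
    for snap in docs:
        data = snap.to_dict()
        # ... apply the schema change to `data` here ...
    last_doc = docs[-1]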
My current plan is to run the schema change script from Cloud Build. Is it possible this will timeout?
That's impossible to say at this moment.
What is the most efficient way to do the actual update? (e.g. read document, write update to document, repeat...)
If you need the existing contents of a document to determine its new contents, then you'll indeed need to read it. If you don't need the existing contents, all you need is the path, and you can consider using the Node.js API to only retrieve the document IDs.
Should I be using batched writes?
Batched writes have no performance advantages. In fact, they're often slower than sending the individual update calls in parallel from your code.
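As a rough sketch of what "individual update calls in parallel" might look like with the same Python SDK (the collection, field and worker count are illustrative only):

from concurrent.futures import ThreadPoolExecutor
from google.cloud import firestore

db = firestore.Client()

def migrate(snap):
    # write only the fields that change; update() leaves the other fields alone
    snap.reference.update({"schema_version": 2})

docs = list(db.collection("items").limit(500).stream())
with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(migrate, docs))    # consume results so exceptions surface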
I'm currently experimenting with a test collection on a LAN-accessible MongoDB server and data in a Meteor (v1.6) application. The view layer of choice is React, and right now I'm using createContainer to bind the subscriptions to props.
The data that gets put in the MongoDB storage is updated on a daily basis and consists of a big set of data from several SQL databases, netting up to about 60000 lines of JSON per day. The data has been ever-so-slightly reshaped to be turned into a usable format whilst remaining as RAW as I'd like it to be.
The working solution right now is fetching all this data and doing further manipulations client-side to prepare the data for visualization. The issue should seem obvious: each client is fetching a set of documents that grows every day and repeats a lot of work on earlier entries before being ready to display. I want to do this manipulation on the server, through MongoDB's Aggregation Framework.
My initial idea is to do the aggregations on the server and to create new Collections containing smaller, more specific datasets without compromising the RAWness of the original Collection. That would mean the "reduced" Collections can still be reactive, as I've been able to confirm through testing in a Remote Desktop, subscribing to an aggregated Collection which I can update through Robo3T.
I don't know if this would be ideal. As far as storage goes, there's plenty of room for the extra Collections. But I have no idea how to set up an automated aggregation script on said server. And regarding Meteor, I've tried using meteorhacks:aggregate and jcbernack:reactive-aggregate but couldn't figure out how to deal with either one of them. If anyone is dealing with, or has dealt with, something similar, I'd love to hear ideas / suggestions.
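For reference, the server-side half of this idea, a scheduled script (run from cron, say) that aggregates the raw collection into a smaller one with $out, might look roughly like the following pymongo sketch; the collection names and fields are assumptions:

from pymongo import MongoClient

db = MongoClient().meteor              # assumed database name

pipeline = [
    # collapse the raw daily documents into one summary per day/category
    {"$group": {
        "_id": {"day": "$day", "category": "$category"},
        "total": {"$sum": "$value"},
    }},
    # replace the contents of the reduced collection on each run
    {"$out": "daily_summary"},
]

db.raw_data.aggregate(pipeline)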
Is there some way that, if I want to read data from the database with certain constraints, instead of waiting to get all the results at once, the database can start "streaming" its results to me?
Think of a large list.
Instead of making users wait for the entire list, I want to start filling in data quickly, even if I only get one row at a time.
I only know of MongoDB with limit(x) and skip(y).
Any way to get the streaming result from any database? I want to know for curiosity, and for a project I'm currently thinking about.
Here's an example of connecting to MongoDB from Python and reading the data one document at a time:
from pymongo import MongoClient

# connect to the local MongoDB instance and pick the blog.posts collection
client = MongoClient()
db = client.blog
col = db.posts

# find() returns a cursor, so documents are fetched lazily as we iterate
for r in col.find():
    print(r)
    input("press any key to continue...")
All standard MongoDB drivers return a cursor on queries (find() command), which allows your application to stream the documents by using the cursor to pull back the results on demand. I would check out the documentation on cursors for the specific driver that you're planning to use as syntax varies between different programming languages.
There's also a special type of cursor specific for certain streaming use cases. MongoDB has a concept of a "Tailable Cursor," which will stream documents to the client as documents are inserted into a collection (also see AWAIT_DATA option). Note that Tailable cursors only work on "capped collections" as they've been optimized for this special usage. Documentation can be found on the www.mongodb.org site. Below is a link to some code samples for tailable cursors:
http://docs.mongodb.org/manual/tutorial/create-tailable-cursor/
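For reference, a compact pymongo version of what that tutorial shows, assuming we create a capped collection for the purpose:

import time
from pymongo import MongoClient, CursorType

db = MongoClient().test
if "events" not in db.list_collection_names():
    db.create_collection("events", capped=True, size=1024 * 1024)

cursor = db.events.find(cursor_type=CursorType.TAILABLE_AWAIT)
while cursor.alive:
    for doc in cursor:
        print(doc)                 # new documents stream in as they are inserted
    time.sleep(1)                  # cursor exhausted for now; retry shortly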
Currently I use MongoDB for recording statistics and ad serving. I log raw impressions to a log collection, and processes do findAndModify to pull items off the log and aggregate them into a precomputed collection using upsert (similar to how Rainbird works with Twitter).
http://techcrunch.com/2011/02/04/twitter-rainbird/
I aggregate on the parent, child, childs child etc, which makes querying for statistics fast and painless.
I use (in Mongo) a key consisting of {Item_id, Hour} and upsert to that (a lot).
I was wondering if Riak had a strong way to solve the same problem, and how I would implement it.
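(For context, the upsert pattern described above is roughly the following in pymongo; the database, collection and field names are just placeholders.)

from pymongo import MongoClient

stats = MongoClient().adstats.hourly_stats    # assumed database/collection

# one document per (item_id, hour); $inc creates the counter if it is missing
stats.update_one(
    {"_id": {"item_id": "item42", "hour": "2013-05-01T13"}},
    {"$inc": {"impressions": 1}},
    upsert=True,
)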
Short answer: I don't think Riak supports upsert-like operations.
Long answer: Riak is a Key-Value store which treats stored values as opaque data. But in the future Riak could consider adding support for HTTP PATCH which might allow one to support operations similar to upsert. There is another category of operations (compare-and-set) which would also be interesting, but supporting these is definitely much more complicated.
The way this works with Riak depends on the storage backend that you're using for Riak.
Bitcask, the current default storage backend, uses a log-structured hash tree for the internal storage mechanism. When you write a new record to Riak, an entirely new copy of your data is stored on disk. Eventually, compaction of the bitcask will occur and the old copies of your data will be removed from the bitcask file.
Any put into Riak is effectively an upsert - if the data doesn't exist, a new record will be inserted. Otherwise, the existing value will be updated by expiring the old value and making the newest value the current value.
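So emulating the counter upsert from the question means doing a read-modify-write in the client. A rough sketch with the official Python client (the riak package); the bucket and key names are assumptions, and note this is not atomic under concurrent writers:

import riak

client = riak.RiakClient()
bucket = client.bucket("hourly_stats")

key = "item42:2013-05-01T13"
obj = bucket.get(key)
if obj.exists:
    # increment the counter stored in the existing JSON value
    obj.data["impressions"] = obj.data.get("impressions", 0) + 1
else:
    # no value yet for this key, so create one
    obj = bucket.new(key, data={"impressions": 1})
obj.store()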
I have this schema for support of in-site messaging:
When I send a message to another member, the message is saved to the Message table, a record is added to the MessageSent table, and a record per recipient is added to the MessageInbox table. MessageCount is used to keep track of the number of messages in the inbox/sent folders and is filled using insert/delete triggers on MessageInbox/MessageSent - this way I can always know how many messages a member has without making an expensive "select count(*)" query.
Also, when I query a member's messages, I join to the Member table to get the member's FirstName/LastName.
Now, I will be moving the application to MongoDB, and I'm not quite sure what the collection schema should be. Because there are no joins available in MongoDB, I have to completely denormalize it, so I would have MessageInbox, MessageDraft and MessageSent collections with full message information, right?
Then I'm not sure about following:
What if a user changes his First/LastName? It will be stored denormalized as the sender in some messages and as part of Recipients in other messages - how do I update it in an optimal way?
How do I get message counts? There will be tons of requests at the same time, so it has to be performing well.
Any ideas, comments and suggestions are highly appreciated!
I can offer you some insight as to what I have done to simulate JOINs in MongoDB.
In cases like this, I store the ID of a corresponding user (or multiple users) in a given object, such as your message object in the messages collection.
(I'm not suggesting this be your schema; I'm just using it as an example of my approach.)
{
    _id: "msg1234",
    from: "user1234",
    to: "user5678",
    subject: "This is the subject",
    body: "This is the body"
}
I would query the database to get all the messages I need then in my application I would iterate the results and build an array of user IDs. I would filter this array to be unique and then query the database a second time using the $in operator to find any user in the given array.
Then in my application, I would join the results back to the object.
It requires two queries to the database (or potentially more if you want to join other collections) but this illustrates something that many people have been advocating for a long time: Do your JOINs in your application layer. Let the database spend its time querying data, not processing it. You can probably scale your application servers quicker and cheaper than your database anyway.
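In pymongo terms, the two-query pattern described above looks roughly like this (the database name and fields are just for illustration):

from pymongo import MongoClient

db = MongoClient().mysite

# query 1: the messages for the current user
messages = list(db.messages.find({"to": "user5678"}))

# collect the unique user ids referenced by those messages
user_ids = {uid for m in messages for uid in (m["from"], m["to"])}

# query 2: fetch all of those users at once with $in
users = {u["_id"]: u for u in db.users.find({"_id": {"$in": list(user_ids)}})}

# "join" in the application layer
for m in messages:
    m["from_user"] = users.get(m["from"])
    m["to_user"] = users.get(m["to"])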
I am using this pattern to create real-time activity feeds in my application and it works flawlessly and fast. I prefer this to denormalizing things that could change, like user information, because when writing to the database, MongoDB may need to re-write the entire object if the new data doesn't fit in the old data's place. If I needed to rewrite hundreds (or thousands) of activity items in my database, then it would be a disaster.
Additionally, writes in MongoDB are blocking, so if a scenario like I've just described were to happen, all reads and writes would be blocked until the write operation is complete. I believe this is scheduled to be addressed in some capacity for the 2.x series, but it's still not going to be perfect.
Indexed queries, on the other hand, are super fast, even if you need to do two of them to get the data.