NoSQL recommendation for a project (text streaming) - nosql

I'm looking for a NoSQL DB recommendation... here's what I'm working on:
I'm writing a web-based client for delivering text streams (basically, real-time captions) to a significant number of consumers. Once things are fully ramped up, there might be 100+ events happening at any given moment. Many will be small (< 10 consumers) but some of them could be quite large (10,000+ simultaneous consumers, maybe more?).
During the course of each event, text will be accumulating at a rate of anywhere from a few words per minute up to 200+ words per minute. Each consumer will be running a web client (a browser on a desktop/laptop/tablet/smartphone) which will poll periodically for any text that it hasn't already received. It will also be possible for a given user to ask for the full text of the event up to the time that they make the request. Completed events have to stick around for a while, but will be removed within about 24-36 hours of their completion.
My first thought is to use Redis, which has commands for appending to a string value in the datastore (APPEND) as well as built-in support for getting a substring from the end of a string value (GETRANGE) - i.e. a client could just hold the offset of the last character it received, pass that to the client API, and have it used to pull a substring of the event text. I am concerned, though, that growing a single string containing the whole event text might be an unusual use of Redis and could cause me some issues.
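Concretely, I'm picturing something like this (a rough sketch using redis-py; key names are just illustrative):
import redis

r = redis.Redis()  # raw bytes, so the offsets below are byte offsets

def append_caption(event_id, text):
    """Producer: append new caption text to the event's string value."""
    return r.append(f"event:{event_id}:text", text)  # returns the new total length

def poll_captions(event_id, last_offset):
    """Consumer: fetch everything after the byte offset already received."""
    chunk = r.getrange(f"event:{event_id}:text", last_offset, -1)
    return chunk, last_offset + len(chunk)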
So... is there a NoSQL DB that seems particularly well suited to this sort of application? Is there any significant reason NOT to use Redis for something like this?

An underlying open question is what to do about new clients. For example, say an event has started and someone connects a few minutes into it. Do they need everything from the beginning or just from when they connected?
If the latter, I'd recommend a messaging approach instead of appending strings to strings. One way would be to use Redis Pub/Sub. That seems a better fit overall, especially if new connections do not need everything from the beginning. For longer-term storage, run an archiving client that subscribes like any other consumer and writes the archive entry - preferably caching locally and uploading the transcript when the event completes, or periodically while it is in progress. I'd keep the real-time path and its code separate from requests for history and archives.
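A minimal sketch of the Pub/Sub route with redis-py (channel naming is made up; the archiver would just be one more subscriber):
import redis

r = redis.Redis(decode_responses=True)

def publish_caption(event_id, text):
    """Producer: push each new piece of text to the event's channel."""
    r.publish(f"captions:{event_id}", text)

def listen_for_captions(event_id):
    """Consumer (or archiver): receive text as it arrives; no history is kept."""
    pubsub = r.pubsub()
    pubsub.subscribe(f"captions:{event_id}")
    for message in pubsub.listen():
        if message["type"] == "message":
            yield message["data"]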
Another route would be to use a sorted set, with the timestamp of each entry as its score. The client then only keeps track of its last update time and retrieves anything from that time on; see the Redis Sorted Sets documentation. This method also gives you the ability to select a region of time from the transcript, and with a bit of math you could even replay the event from the transcript as if it were live. That matters if you've got tens of thousands of clients: each of them should be pulling only what it hasn't seen yet, not the entire transcript on every poll.
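Something along these lines with redis-py, using a millisecond timestamp as the score (the key and member formats are just one option):
import time
import redis

r = redis.Redis(decode_responses=True)

def add_entry(event_id, text):
    """Producer: store each piece of text, scored by its timestamp in ms."""
    now_ms = int(time.time() * 1000)
    # Prefix the member with the timestamp so identical text lines
    # don't collapse into a single sorted-set member.
    r.zadd(f"event:{event_id}:entries", {f"{now_ms}:{text}": now_ms})

def poll_entries(event_id, last_ms):
    """Consumer: fetch only entries strictly newer than the last score seen."""
    items = r.zrangebyscore(f"event:{event_id}:entries", f"({last_ms}", "+inf",
                            withscores=True)
    new_last = max((score for _, score in items), default=last_ms)
    return [member.split(":", 1)[1] for member, _ in items], new_last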
Another advantage of the timestamped sorted set concerns string encoding. With Redis strings and GETRANGE, the range is in byte offsets, not character offsets, so you effectively have to use a fixed-width encoding. If you need to support, say, UTF-8, this might be a problem for you.
A third option is to append each piece of text to a list. This is similar to the sorted set except that your client stores the number of entries it has already seen (the length of the list at its last poll) and on each poll asks for anything from that index to the end.
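And a sketch of the list variant with redis-py (LRANGE indexes are zero-based and inclusive, so the next unseen entry sits at index seen_count):
import redis

r = redis.Redis(decode_responses=True)

def append_entry(event_id, text):
    """Producer: push each piece of text onto the end of the event's list."""
    r.rpush(f"event:{event_id}:lines", text)

def poll_entries(event_id, seen_count):
    """Consumer: fetch everything after the first seen_count entries."""
    new_lines = r.lrange(f"event:{event_id}:lines", seen_count, -1)
    return new_lines, seen_count + len(new_lines)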

Related

MarkLogic REST interface to send data to Qlik Sense

I need to present ~10 million XML documents to Qlik Sense using the MarkLogic REST interface, with the intention of analyzing the raw data in Qlik.
I'm unable to send that much data using a simple cts:search.
A template view with a SQL call like the one below does not help, as it is not recognized by Qlik Sense.
xdmp:to-json(xdmp:sql('select * from SC1.V1'))
Is there a better way to achieve this?
I understand it is unusual to load such a huge amount of data into Qlik, but what limitations should I consider?
You are unlikely to be able to transfer that volume of data into or out of ANY system in a single 'transaction' (or request). And if you could, you wouldn't want to, because when it fails you have to start all over again.
You should 'batch' up the documents into manageable chunks. 100 MB or '1 minute' per batch is a reasonable upper bound -- as size and time increase, the probability of problems goes up (way up) due to timeouts, memory, temp space, and transient internet and network problems, etc.
A simple strategy that often works well is to first produce a 'list' of what to extract (document URIs, primary keys, ...), save that, and then work your way through the list in batches, retrying as needed. Depending on the destination, local storage, etc., you can either combine the lot to send on to the recipient or, generally better, send the target data in batches as well.
This approach has good transactional characteristics: you effectively 'freeze' the set of data when you make the list, but can take your time collecting and sending it. Depending on the setup, you may be able to do so in parallel.
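As a rough illustration of the list-then-batch pattern in Python (the REST endpoint, port, auth, and batch size are assumptions; adapt them to your MarkLogic setup):
import requests
from requests.auth import HTTPDigestAuth

BASE = "http://marklogic-host:8000"          # assumption: default REST app server
AUTH = HTTPDigestAuth("user", "password")    # assumption: digest auth

def fetch_doc(uri, retries=3):
    """Fetch a single document by URI, retrying transient failures."""
    for attempt in range(retries):
        try:
            resp = requests.get(f"{BASE}/v1/documents",
                                params={"uri": uri}, auth=AUTH, timeout=60)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException:
            if attempt == retries - 1:
                raise

def export_in_batches(uri_list, batch_size=500):
    """Work through the saved URI list in manageable batches."""
    for i in range(0, len(uri_list), batch_size):
        batch = [fetch_doc(u) for u in uri_list[i:i + batch_size]]
        yield batch  # hand each batch to the recipient (Qlik loader, file, ...)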

Idempotent streams or preventing duplicate rows using PipelineDB

My application produces rotating log files containing multiple application metrics. The log file is rotated once a minute, but each file is still relatively large (over 30 MB, with hundreds of thousands of rows).
I'd like to feed the logs into PipelineDB (running on the same single machine), whose continuous views can produce exactly the aggregations I need over the metrics.
I can easily ship the logs to PipelineDB using copy from stdin, which works great.
However, the machine might occasionally power off unexpectedly (e.g. due to a power outage) in the middle of copying a log file, which means that once it is back online there is uncertainty about how much of the file was actually inserted into PipelineDB.
How could I ensure that each row in my logs is inserted exactly once in such cases? (It's very important that I get complete and accurate aggregations)
Note that each row in the log file has a unique identifier (a serial number created by my application), but I can't find an option in the docs to define a unique field on a stream. I assume PipelineDB's design is simply not meant to handle uniqueness of stream rows.
Nonetheless, are there any alternative solutions to this issue?
Exactly once semantics in a streaming (infinite rows) context is a very complex problem. Most large PipelineDB deployments use some kind of message bus infrastructure (e.g. Kafka) in front of PipelineDB for delivery semantics and reliability, as that's not PipelineDB's core focus.
That being said, there are a couple of approaches you could use here that may be worth thinking about.
First, you could maintain a regular table in PipelineDB that keeps track of each logfile and the line number up to which it has been successfully written to PipelineDB. Before shipping a logfile, check it against this table to determine which line number to start at.
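A sketch of that bookkeeping with psycopg2 (PipelineDB speaks the PostgreSQL protocol, so a regular Postgres driver works; the shipping_progress table, the logs_stream stream, and the CSV format are assumptions):
import io
import psycopg2

conn = psycopg2.connect("dbname=pipeline user=pipeline")  # assumption: local instance

def ship_logfile(path):
    """Resume COPY of a logfile from the last line recorded as written."""
    with conn, conn.cursor() as cur:
        # Hypothetical bookkeeping table: (filename text primary key, lines_done int)
        cur.execute("SELECT lines_done FROM shipping_progress WHERE filename = %s",
                    (path,))
        row = cur.fetchone()
        start = row[0] if row else 0

        with open(path) as f:
            remaining = "".join(line for i, line in enumerate(f) if i >= start)

        # 'logs_stream' is a placeholder for your PipelineDB stream.
        cur.copy_expert("COPY logs_stream FROM STDIN WITH CSV",
                        io.StringIO(remaining))

        lines_sent = start + remaining.count("\n")
        cur.execute("""INSERT INTO shipping_progress (filename, lines_done)
                       VALUES (%s, %s)
                       ON CONFLICT (filename)
                       DO UPDATE SET lines_done = EXCLUDED.lines_done""",
                    (path, lines_sent))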
Secondly, you could separate your aggregations by logfile (by including a path or something in the grouping) and simply DELETE any existing rows for that logfile before sending it. Then use combine to aggregate over all logfiles at read time, possibly with a VIEW.

Simulating an Oracle sequence with MongoDB

Our domain model deals with sales invoices, each of which has a unique, automatically generated number. When creating an invoice, our SalesInvoiceService retrieves a number from a SalesInvoiceNumberGenerator, creates a SalesInvoice using this number and a few other objects (seller, buyer, issue date, etc.) and stores it through the SalesInvoiceRepository. Since we are using MongoDB as our database, our MongoDbSalesInvoiceNumberGenerator uses a findAndModify command with $inc 1 on a given InvoicePolicies.nextSalesInvoiceNumber to generate this unique number, similar to what we would do with an Oracle sequence.
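Concretely, the generator boils down to something like this (shown as a pymongo sketch; the _id filter is illustrative):
from pymongo import MongoClient, ReturnDocument

db = MongoClient().billing  # assumption: database name

def next_sales_invoice_number():
    """Atomically increment and return the next invoice number."""
    doc = db.InvoicePolicies.find_one_and_update(
        {"_id": "salesInvoice"},                 # assumption: one policy document
        {"$inc": {"nextSalesInvoiceNumber": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return doc["nextSalesInvoiceNumber"]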
This works in normal situations. However, when invoice creation fails because of a broken business rule (e.g. an invalid issue date), an exception is thrown and our InvoicePolicies.nextSalesInvoiceNumber has already been incremented. Obviously, since there is no transaction managing this unit of work, the increment is not rolled back, so we end up with lost invoice numbers. We do offer a manual compensation mechanism to the user, but we would like to avoid this sort of situation in the first place.
How would you deal with this situation? And no, switching to another database is not an option :)
Thanks!
TL;DR: What you want is strict serializability, but you probably won't get it unless you give up concurrency completely (then you even get linearizability, theoretically). Gap-free is easy, but making sure that today's invoice doesn't get a lower number than yesterday's is practically impossible.
This is tricky, or at least, very expensive. That is also true for any other data store, because you'll have to limit the concurrency of the application to guarantee it. Think of an auto-increasing stamp that is passed around in an office, but some office workers lose letters. Tricky... But you can reduce the likelihood.
Generating sequences without gaps is hard when contention is high, and very hard in a distributed system. Holding a lock for the entire time the invoice is being generated would make it easy, but it is usually not an option. Still, let's start with the closest thing to that:
Easiest way out: Use a singleton background worker, i.e. a single-threaded process that runs on a single machine. Have it explicitly check whether the current number is really present in the invoice collection. Because it's single-threaded on a single machine, it can't have race conditions. Done, via limiting concurrency.
When allowing concurrency, things get messy:
It might be best to use something like a two-phase commit protocol. Essentially, make the entire invoice creation process a long-running transaction, and store the pending transactions explicitly, i.e. store all numbers that have been reserved but not yet used.
Then track the completion status of each and every transaction. If a transaction hasn't finished after some timeout, consider its number available again. It's hard enough to add that to the counter code, but it's possible (check whether a timed-out transaction is present, otherwise get a new counter value).
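A rough pymongo sketch of that reservation scheme (the collection names, the 5-minute timeout, and the status field are assumptions, and real error handling is omitted):
import datetime
from pymongo import MongoClient, ReturnDocument

db = MongoClient().billing
TIMEOUT = datetime.timedelta(minutes=5)   # assumption: how long invoice creation may take

def reserve_invoice_number():
    """Reuse a timed-out reservation if one exists, otherwise take a new number."""
    now = datetime.datetime.utcnow()
    stale = db.reservations.find_one_and_update(
        {"status": "pending", "reserved_at": {"$lt": now - TIMEOUT}},
        {"$set": {"reserved_at": now}},
        return_document=ReturnDocument.AFTER,
    )
    if stale:
        return stale["number"]

    counter = db.counters.find_one_and_update(
        {"_id": "salesInvoice"},
        {"$inc": {"next": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    db.reservations.insert_one(
        {"number": counter["next"], "status": "pending", "reserved_at": now})
    return counter["next"]

def commit_invoice_number(number):
    """Mark the reservation as used once the invoice has actually been stored."""
    db.reservations.update_one({"number": number}, {"$set": {"status": "used"}})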
There are several possible errors, but they can all be resolved. This is better explained in the link and on the net. Generally, getting the implementation right is hard though.
The timeout poses a problem, however, because you need to hard-code an assumption about the time it takes for invoices to be generated. That can be awkward close to day/month/year barriers, since you'll want to avoid creating invoice 12345 in 2015 and 12344 in 2014.
Even this won't guarantee gap-free numbers within limited time intervals: if no further request arrives that could reuse the gap number before the current year ends, you're facing a problem.
I wonder whether combining findAndModify with the newer Transactions API could achieve something like this, with the increment rolled back along with the rest of the work if it is run inside a transaction. I haven't personally tried it, and my project isn't far enough along yet to worry about the billing system, but I would love to be able to use the same database for everything to make things a bit easier to operate.
One problem, I would think, is a write bottleneck on the counter, but the increment should only take a few milliseconds, and you could use a different counter for every jurisdiction or store, like real-life stores do. The cash register number could be part of it too; in the digital world the "register" could be the transaction-processing server the request went to (if, say, you used microservices), so you could round-robin load balance between them. That's assuming MongoDB uses a per-document lock, which as far as I understand it does.
The main time I'd worry about this bottleneck is with a very popular store, around Black Friday when there's a huge spike, or when generating recurring invoices.

Incrementing hundreds of counters at once, redis or mongodb?

Background/Intent:
So I'm going to create an event tracker from scratch and have a couple of ideas on how to do this but I'm unsure of the best way to proceed with the database side of things. One thing I am interested in doing is allowing these events to be completely dynamic, but at the same time to allow for reporting on relational event counters.
For example, all countries broken down by operating systems. The desired effect would be:
US - # of events
    iOS - # of events that occurred in US
    Android - # of events that occurred in US
CA - # of events
    iOS - # of events that occurred in CA
    Android - # of events that occurred in CA
etc.
My intent is to be able to accept these event names like so:
/?country=US&os=iOS&device=iPhone&color=blue&carrier=Sprint&city=orlando&state=FL&randomParam=123&randomParam2=456&randomParam3=789
Which means in order to do the relational counters for something like the above I would potentially be incrementing 100+ counters per request.
Assume there will be 10+ million of the above requests per day.
I want to keep things completely dynamic in terms of the event names being tracked and I also want to do it in such a manner that the lookups on the data remains super quick. As such I have been looking into using redis or mongodb for this.
Questions:
Is there a better way to do this than counters while keeping the fields dynamic?
Provided this was all in one document (structured like a tree), would using the $inc operator in mongodb to increment 100+ counters at the same time in one operation be viable and not slow? The upside here being I can retrieve all of the statistics for one 'campaign' quickly in a single query.
Would this be better suited to Redis, doing a ZINCRBY for all of the applicable counters for the event?
Thanks
Depending on how your key structure is laid out, I would recommend pipelining the ZINCRBY commands. You have an easy "commit" trigger - the request. If you iterate over your parameters and ZINCRBY each key, then issue the execute at the end of the request, it will be very fast. I've implemented a system like you describe both as a CGI and as a Django app. I set up a key structure along the lines of this:
YYYY-MM-DD:HH:MM -> sorted set
With that I was able to process something like 150,000-200,000 increments per second on the Redis side with a single process, which should be plenty for your described scenario. This key structure lets me grab data based on windows of time. I also added an expire to the keys to avoid writing a DB cleanup process. I then had a cron job doing set operations to "roll up" stats into hourly, daily, and weekly variants of the aforementioned key pattern. I bring these ideas up because they are ways you can take advantage of the built-in capabilities of Redis to make the reporting side simpler. There are other ways of doing it, but this pattern seems to work well.
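A condensed sketch of that per-request pattern with redis-py (the key layout mirrors the time-bucket scheme above; parameter handling is simplified):
import time
import redis

r = redis.Redis()

def track_event(params, ttl=7 * 24 * 3600):
    """Queue one ZINCRBY per request parameter, then send them in one round trip."""
    key = "stats:" + time.strftime("%Y-%m-%d:%H:%M")   # one sorted set per minute
    pipe = r.pipeline()                                # wraps commands in MULTI/EXEC
    for name, value in params.items():
        pipe.zincrby(key, 1, f"{name}:{value}")        # member looks like "country:US"
    pipe.expire(key, ttl)                              # let Redis expire old buckets
    pipe.execute()                                     # the "commit" at request end

# e.g. track_event({"country": "US", "os": "iOS", "device": "iPhone"})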
As noted by eyossi, the global lock can be a real problem for systems that do concurrent writes and reads. If you are writing this as a real-time system, the concurrency may well be an issue. If it is an "end of day" log-parsing system, it is less likely to trigger contention unless you run multiple instances of the parser or run reports while data is being loaded. With regard to keeping reads fast in Redis, I would consider setting up a read-only Redis instance slaved off the main one. If you put it on the server running the reports and point the reporting process at it, it should be very quick to generate the reports.
Depending on your available memory, data set size, and whether you store any other type of data in the Redis instance, you might consider running a 32-bit Redis server to keep the memory usage down. A 32-bit instance should be able to keep a lot of this type of data in a small chunk of memory, but if the normal 64-bit Redis isn't taking too much memory, feel free to use it. As always, test your own usage patterns to validate.
In Redis you could use MULTI to increment multiple keys in one atomic block.
I had some bad experiences with MongoDB; I have found that it can be really tricky when you have a lot of writes going to it...
You can look at this link for more info, and don't forget to read the part that says "MongoDB uses 1 BFGL (big f***ing global lock)" (which may already be improved in version 2.x - I didn't check).
On the other hand, I had a good experience with Redis; I am using it for a lot of reads/writes and it works great.
You can find more information about how I am using Redis (to get a feeling for the amount of concurrent reads/writes) here: http://engineering.picscout.com/2011/11/redis-as-messaging-framework.html
I would rather use pipeline than MULTI if you don't need the atomicity.
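In redis-py that just means building the pipeline without the MULTI/EXEC wrapper, for example:
import time
import redis

r = redis.Redis()
key = "stats:" + time.strftime("%Y-%m-%d:%H:%M")
pipe = r.pipeline(transaction=False)   # plain pipelining, no MULTI/EXEC
pipe.zincrby(key, 1, "country:US")
pipe.zincrby(key, 1, "os:iOS")
pipe.execute()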

How should I implement "get objects changed since" pattern with MongoDB?

I have a collection of objects, let's say they are "posts," and those objects can be modified. I'd like to display a list on the client side that updates dynamically. So on the client side, if doing this via polling, the client would invoke an API like:
getPostsChangedSince(serial)
where serial could be a monotonically increasing number, probably a timestamp. The client gets back a list of posts that have changed since that time, stores a new latest-serial, and next time the client polls it requests changes since that latest serial.
I think the basic idea is the same in this question (which is about ASP.NET): How to implement "get latests changed items" with ADO.NET Data Services?
I'm trying to find the best way to implement this in MongoDB.
I like the idea of using the time for the serial, since it automatically works at least mostly correctly even if there are multiple app servers. The serial would be stored in each post object, and updated whenever the object is modified.
The timestamp-based serial could be implemented as:
a Date (I think this is stored as a 64-bit milliseconds since epoch?)
a Timestamp http://www.mongodb.org/display/DOCS/Timestamp+Data+Type
something "by hand" e.g. store milliseconds as a number
Some nice features to have in a solution would include:
ensure that creating then immediately updating an object within the OS timer resolution will still increment the serial despite it being the same time
even better would be a guaranteed monotonic increase globally for all objects, not just a guarantee that changing a given object will bump the serial on that object (absent this, getPostsChangedSince() calls probably need to fuzz backward in time to avoid missing changes - at the price of getting some changes twice)
mongodb-side timestamps might be nice because getting the time in the app creates a gap between when you get the time, and when the new object is saved and available in queries
update using findAndModify() with a query including the old serial, so "conflicts" (two changes at once) will throw an error allowing the app to retry
I realize some of the corner cases here are a little bit "academic" and can likely be fudged around in real life.
My approach so far is as follows (a code sketch follows the list):
use the Date type for the serial
when modifying an object, get the current time, and if it matches the object's old serial, add 1 millisecond (yes this breaks if you make two modifications quickly without re-fetching from mongodb, but that seems OK)
use findAndModify(), but based on https://jira.mongodb.org/browse/JAVA-276 there may not be a way to detect if it ends up not finding anything to modify (i.e. second change is ignored, in case of conflict)
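Roughly, in pymongo terms (the db/collection names are illustrative):
import datetime
from pymongo import MongoClient, ReturnDocument

posts = MongoClient().app.posts   # assumption: db/collection names

def update_post(post_id, old_serial, changes):
    """Bump the serial on modification; a None result signals a conflict."""
    now = datetime.datetime.utcnow()
    if now <= old_serial:
        now = old_serial + datetime.timedelta(milliseconds=1)  # same-millisecond bump
    return posts.find_one_and_update(
        {"_id": post_id, "serial": old_serial},  # only update if nobody changed it meanwhile
        {"$set": dict(changes, serial=now)},
        return_document=ReturnDocument.AFTER,    # returns None if nothing matched (conflict)
    )

def get_posts_changed_since(serial):
    return list(posts.find({"serial": {"$gt": serial}}))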
Questions:
I feel like I should use Timestamp instead; true? Any downsides?
if you had a mongo cluster, might time in milliseconds be more unique and correct than Timestamp's time in seconds plus a number, while with one mongod Timestamp is more unique?
is there a way to detect whether findAndModify() updated anything?
any general advice / experiences with this problem? how would you do it?
Have you considered "externalizing" the serial number generator? Time at MongoDB's precision is good, but can become difficult to synchronize when multiple machines are involved. One choice is to use memcached or something similar that is memory-based, extremely fast, and can serialize access (memcached has a CAS operation).
So what you would do is store a "seed" in memcached under a key, say counter.
Every time an app needs to do an insert, it gets the next number from memcached and increments the counter.
On second thought, you can even do away with memcached and just use a single-row (sorry, single-document) collection that holds the counter. Getting the counter and incrementing it is an extremely fast operation, mimicking memcached.
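For example, with pymongo the counter document could look like this (collection and field names are made up):
from pymongo import MongoClient, ReturnDocument

db = MongoClient().app

def next_serial():
    """Atomically increment and return the global change counter."""
    doc = db.counters.find_one_and_update(
        {"_id": "post_serial"},
        {"$inc": {"value": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return doc["value"]

def touch_post(post_id, changes):
    """Stamp the post with a fresh, monotonically increasing serial."""
    db.posts.update_one({"_id": post_id},
                        {"$set": dict(changes, serial=next_serial())})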
And then naturally, you can index the data appropriately. However, I wonder whether this would result in the index becoming very unbalanced (right-side heavy). Depending on the situation, it might be worthwhile exploring the use of a capped collection. So when you insert data into your main collection, also insert it into the capped collection and read data from that collection.
You could continue to use your regular collection, as you do now, and after each update additionally insert the ID of the post into a special TTL collection. See http://docs.mongodb.org/manual/tutorial/expire-data/ for more info on using such a collection. Mongo will take care of all timing issues, you don't need to worry about serial numbers, and you can very quickly access time based lists of objects by their IDs.
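A sketch of that TTL-collection setup in pymongo (the 24-hour window and the collection names are placeholders):
import datetime
from pymongo import MongoClient

db = MongoClient().app

# TTL index: Mongo removes entries once createdAt is older than 24 hours.
db.recent_changes.create_index("createdAt", expireAfterSeconds=24 * 3600)

def record_change(post_id):
    """Call this after each post update."""
    db.recent_changes.insert_one(
        {"post": post_id, "createdAt": datetime.datetime.utcnow()})

def recently_changed_post_ids(since):
    """IDs of posts touched since `since` (a datetime); fetch the posts themselves separately."""
    return {doc["post"] for doc in
            db.recent_changes.find({"createdAt": {"$gt": since}})}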
Caveat:
use the blocking form of findAndModify, to ensure the changes have really been processed:
Blocking/Safe Writes
Unless you specify the "new" parameter as true, the write operation will not block and will not return an error (if there is one). If you do want the "new" document returned, then the operation will wait until the write is done before returning the new document or an error.
For a "safe" (blocking) write operation you must call getLastError (if not using "new").