Does MongoDB fail silently if I don't check error codes? - mongodb

I'm wondering if any persistence failure will go undetected if I don't check error codes? If so, what's the right way to write fast (asynchronously) while still detecting errors?

If you don't check for errors, your update is only fireAndForget. You'll indeed miss all errors which could arise. Please see MongoDB WriteConcerns for the available write modes in MongoDB (sorry I always fail to find the official, non driver related documentation, I really should bookmark it).
So with NORMAL you'll get at least connectivity errors, with NONE no exceptions at all. If you want to be informed of exceptions you have to use one of the other modes, which differ only in the persistence guarantee they give you.
You can't detect errors when running asynchronous, as this is against the intention. Your connection which sent the write operation, may be already closed or reused, so you can't sent it through that connection. Further more only your actual code knows what to do if it fails. As mongoDB doesn't offer some remote procedure call to asynchronous inform you of updates you'll have to wait until the write finished to a given stage.
So the fastest, but most unrelieable is SAFE, where the write only happened to memory. JOURNAL gives you the security that it was written at least to disk. With FSYNC you'll have those changes persisted on your db on disk. REPLICA that a least two replicas have written it, and MAJORITY that more than half of your replicas have written it(by three replicas which should be the default this doesn't differ).
The only chance I see to have something like asynchronous, is to have a separate Thread who is performing all write operations synchronous. This thread you could handle the actual update as well as a class which is called in case of a failure to perform the needed operations to handle this failure. But I don't think that this is good application design.

Yes, depending on the error, it can fail silently if you don't check the returned error code. It's necessary to wait for error checking. Your only other option would be for your app to occasionally tell the user "oops, remember when I acted like I saved your data a moment ago? Well, not really."

Related

Discrepancy between Redshift data api DescribeStatement status and console status

I'm loading data into redshift which usually takes about an hour when successful but seems to timeout randomly sometimes. I continue to get a "STARTED" status from DescribeStatement calls for my query but when I look in the console it says the query was ABORTED and rolled back via "Undoing 1 transactions on table ..." statement. But I'm not finding any errors in STL_LOAD_ERRORS related to the query or anything useful in STL_UTILITYTEXT for that transaction; though STL_UNDONE view does show the rollback.
I would've expected DescribeStatement to update with "FAILED" or "ABORTED" status when this occurred but that doesn't seem to be the case. Any idea what is causing the load to fail without any errors? Is there a way to catch/handle this via redshift data api? I'm currently thinking of checking STL_UNDONE after a specified time but was hoping there's a better solution.
Statement timeout seems like a likely cause. What you are describing sounds like the connection closed out from under the executing statement. There are a number of places where this timeout can come from but a common one is in the cluster configuration and the WLM configuration.
Another possibility is a network timeout. Database connections stay open for the entirety of the session but when a statement is in flight there is no activity on the connection. Some network equipment see this an assume that something is wrong and close the connection which closes the session which aborts the transaction in flight.
If your issue is caused by the connection closing you may be able to line things up in stl_sessions. There is info in there about timeouts but also you can see if the time the session closes is right when the query commands abort.
Just one area that could be causing your issue but is more common than people think.
So after escalating to AWS support, it was confirmed there was a bug on their end. Related to data API autoscaling protocols that were sometimes scaling down without waiting for outstanding tasks to complete. There's a temporary fix in place to avoid this happening while they implement a long term solution. Should hopefully be rolled out end of this month, June 2022.

UpdateOne fails on client due to timeout, but MongoDB processes it anyway

One of my tests for a function that performs increments using the MongoDB driver for Go is randomly breaking in an unexpected way. Here's what the test does:
Create a proxy (with toxiproxy) to a local MongoDB instance.
Disable the proxy, so the database looks like it's down.
Run a function that does an update that increments a field, timing out after 100ms. If it fails, it keeps retrying every 100ms until the command succeeds.
Sleep 1 second.
Enable the proxy.
Wait for the function to complete and assert that the field has been incremented correctly - only once.
This test is randomly breaking because sometimes that field gets incremented twice. I noticed that it happens when an update is retried just as the proxy gets enabled: the client code receives an incomplete read of message header: context deadline exceeded error, which makes it retry the command, but the previous one indeed succeeded because the field ends up being incremented twice.
I took a look at the driver code and I guess it's timing out while reading the server response - perhaps the proxy is enabled just after the update has started and there isn't much timeout left for both write and read operations to complete.
Is there anything that I can do on my side to prevent this from happening? I tried to find a specific error to catch, but I couldn’t find any. Or is this something the driver itself is supposed to handle?
Any help is appreciated.
UPDATE: I looked closely at the error messages and noticed that, while the MongoDB instance was down, all errors were handshake failures. So I made sure the test ping the database before disabling the proxy to get the handshake out of the way and the test stopped randomly breaking; it ran 1000 times flawlessly, at least. I assume the handshake itself takes time to complete and that contributes to the command timeout.
In general, if you know the command went through (to the server), if you can't read the response, you can't assume anything about its success.
In some cases when it only matters if the server got the command, or you only care about the command reaching the server, then read on.
Unfortunately the current state of the driver (v1.7.1) is not "sophisticated" enough to easily tell if the error is from reading the response.
I was able to reproduce your issue locally. Here is the error when a timeout happens reading the response:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-30]) incomplete read of message header: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-30]", Wrapped:context.deadlineExceededError{}, init:false, message:"incomplete read of message header"}}
And there is the error when the timeout happens writing the command:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-31]) unable to write wire message to network: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-31]", Wrapped:context.deadlineExceededError{}, init:false, message:"unable to write wire message to network"}}
As you can see, in both cases mongo.CommandError is returned, with identical Code and Labels fields. Which leaves you having to analyze the error string (which is ugly and may "break" with future changes).
So the best you can do is check if the error string contains "incomplete read of message header", and if so, you don't have to retry. Hopefully this (error support and analysis) improves in the future.
If you are using the retryable writes as implemented by MongoDB 3.6+ and the respective drivers, this shouldn't happen. Each write is accompanied by a transaction number (not to be confused with client-side transactions as implemented by MongoDB 4.0+), and if the same transaction number is used in two consecutive writes there is only one write being done by the server.
This functionality has been around for years so unless you are using an ancient driver version you should already have it.
If you are performing write retries in your application manually rather than using the driver's retryable write functionality, you can write twice as you found out. The solution is to use the driver's retryable writes.
I had the same problem (running on go.mongodb.org/mongo-driver v1.8.1 on a MongoDB 4.4) and will leave my experiences with this problem here.
To add to #icza solution:
You can also get the error context deadline exceeded so check also for that.
A check for a context abortion would look something like this:
if strings.Contains(err.Error(), "context") && (strings.Contains(err.Error(), " canceled") || strings.Contains(err.Error(), " deadline exceeded")) {
...
}
My solution to the problem was instead of first checking if there was an error you'd first check if there was a result from the transaction.
Example:
result, err := database.collection.InsertOne(context, item)
if result != nil {
return result.InsertedID, err
}
return nil, err
If the transaction did process it despite the error, you could add some compensation logic to undo the transaction.

How to handle client response during Transient Exception retrying?

Context
I'm developing a REST API that, as you might expect, is backed by multiple external cross-network services, APIs, and databases. It's very possible that a transient failure is encountered at any point and for which the operation should be retried. My question is, during that retry operation, how should my API respond to the client?
Suppose a client is POSTing a resource, and my server encounters a transient exception when attempting to write to the database. Using a combination of the Retry Pattern perhaps with the Circuit Breaker Pattern, my server-side code should attempt to retry the operation, following randomized linear/exponential back-off implementations. The client would obviously be left waiting during that time, which is not something we want.
Questions
Where does the client fit into the retry operation?
Should I perhaps provide an isTransient: true indicator in the JSON response and leave the client to retry?
Should I leave retrying to the server and respond with a message and status code indicative that the server is actively retrying the request and then have the client poll for updates? How would you determine the polling interval in that case without overloading the server? Or, should the server respond via a web socket instead so the client need not poll?
What happens if there is an unexpected server crash during the retry operation? Obviously, when the server recovers, it won't "remember" the fact that it was retrying an operation unless that fact was persisted somewhere. I suppose that's a non-critical issue that would just cause further unnecessary complexity if I attempted to solve it.
I'm probably over-thinking the issue, but while there is a lot of documentation about implementing transient exception retry logic, seldom have I come across resources that discuss how to leave the client "pending" during that time.
Note: I realize that similar questions have been asked, but my queries are more specific, for I'm specifically interested in the different options for where the client fits into a given retry operation, how the client should react in those cases, and what happens should a crash occur that interrupts a retry sequence.
Thank you very much.
There are some rules for retry:
always create an idempotency key to understand that there is retry operation.
if your operation a complex and you want to wrap rest call with retry, you must ensure that for duplicate requests no side effects will be done(start from failure point and don't execute success code).
Personally, I think the client should not know that you retry something, and of course, isTransient: true should not be as a part of the resource.
Warning: Before add retry policy to something you must check side effects, put retry policy everywhere is bad practice

When to use libmemcached in non-blocking mode?

I'm starting to integrate libmemcached into my application and reading the documentation, there is a non-blocking mode flag. After a quick google, there seems to be a performance advantage to non blocking mode, but are there any disadvantages to running libmemcached in non blocking mode?
Of course there are. The disadvantage would only arise if you needed to ENSURE that the written value actually was written to memcached and did not fail. For example - you're using memcached to store a counter variable which has a sentinel that checks to see if the counter has reached a certain value before performing an operation.
In blocking mode - the memcached client will wait to get a write success response from memcached before proceeding and produce an error if it fails. This way you know the counter was updated. If you tell it to write in non-blocking mode, the client sends the request to increment the counter, but never waits to ensure that it really occurred. Because it doesn't wait, you code execution after the call resumes more quickly, but with the uncertainty of not knowing for sure the counter was ever incremented.
However, since memcached values are destroyed on a service restart (think system crash) you can't ever really be sure a value will be there. Also, with low-memory pruning you also cannot ever be sure the value is 100% correct as it may get pruned by the LRU algorithm - you'd need persistent storage to alleviate this uncertainty.
Given this inherent uncertainty, many people use non-blocking mode to get the performance gain because they can't ever be totally certain the counter value in memcached isn't reset/innacurate anyway, so why not get some performance for the tradeoff.
Hope this clarifies the issue. As a side note - MongoDB has non-blocking writes in persistent storage - which while awesome in its flexibility, gives people using non-blocking mode more of a false sense of security that the write will always succeed...
R

Memcache-based message queue?

I'm working on a multiplayer game and it needs a message queue (i.e., messages in, messages out, no duplicates or deleted messages assuming there are no unexpected cache evictions). Here are the memcache-based queues I'm aware of:
MemcacheQ: http://memcachedb.org/memcacheq/
Starling: http://rubyforge.org/projects/starling/
Depcached: http://www.marcworrell.com/article-2287-en.html
Sparrow: http://code.google.com/p/sparrow/
I learned the concept of the memcache queue from this blog post:
All messages are saved with an integer as key. There is one key that has the next key and one that has the key of the oldest message in the queue. To access these the increment/decrement method is used as its atomic, so there are two keys that act as locks. They get incremented, and if the return value is 1 the process has the lock, otherwise it keeps incrementing. Once the process is finished it sets the value back to 0. Simple but effective. One caveat is that the integer will overflow, so there is some logic in place that sets the used keys to 1 once we are close to that limit. As the increment operation is atomic, the lock is only needed if two or more memcaches are used (for redundancy), to keep those in sync.
My question is, is there a memcache-based message queue service that can run on App Engine?
I would be very careful using the Google App Engine Memcache in this way. You are right to be worrying about "unexpected cache evictions".
Google expect you to use the memcache for caching data and not storing it. They don't guarantee to keep data in the cache. From the GAE Documentation:
By default, items never expire, though
items may be evicted due to memory
pressure.
Edit: There's always Amazon's Simple Queueing Service. However, this may not meet price/performance levels either as:
There would be the latency of calling from the Google to Amazon servers.
You'd end up paying twice for all the data traffic - paying for it to leave Google and then paying again for it to go in to Amazon.
I have started a Simple Python Memcached Queue, it might be useful:
http://bitbucket.org/epoz/python-memcache-queue/
If you're happy with the possibility of losing data, by all means go ahead. Bear in mind, though, that although memcache generally has lower latency than the datastore, like anything else, it will suffer if you have a high rate of atomic operations you want to execute on a single element. This isn't a datastore problem - it's simply a problem of having to serialize access.
Failing that, Amazon's SQS seems like a viable option.
Why not use Task Queue:
https://developers.google.com/appengine/docs/python/taskqueue/
https://developers.google.com/appengine/docs/java/taskqueue/
It seems to solve the issue without the likely loss of messages in Memcached-based queue.
Until Google impliment a proper job-queue, why not use the data-store? As others have said, memcache is just a cache and could lose queue items (which would be.. bad)
The data-store should be more than fast enough for what you need - you would just have a simple Job model, which would be more flexible than memcache as you're not limited to key/value pairs