Unexpected behaviour after memcached server restarts. How to configure/rectify it? - memcached

I have a pool of persistent connections (Memcached clients). Data is being cached in the memcached server. If, after restarting the memcached server, I try to get the cached data using a client from the pool, I'm getting the exception below:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cancelled
at net.spy.memcached.MemcachedClient$OperationFuture.get(MemcachedClient.java:1662)
at net.spy.memcached.MemcachedClient$GetFuture.get(MemcachedClient.java:1708)
at com.eos.gds.cache.CacheClient.get(CacheClient.java:49)
I get this exception only the first time after the restart when I try to get the cached data. I did a lot of searching but was unable to find the exact reason for this.

Spymemcached has a bunch of internal queues that operations are placed in before they are actually sent out to memcached. What is happening here is that you do an operation and then before that operation is sent over the wire or before a response is received from memcached, Spymemcached realizes that the connection has been lost. As a result Spymemcached cancels all operations in flight and then reestablishes the connection.
When you call get() on the Future, an exception is thrown because the operation was cancelled by Spymemcached. What I recommend doing here is catching all exceptions on every individual operation you do with Spymemcached and then, depending on the error, either retrying the operation or just forgetting about it. If it's a get, for example, and your cluster of memcached servers goes down, you can probably forget about it since the cache will be empty anyway, but you will probably want to retry a set.
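For example, a minimal sketch of that approach (the helper class, expiry value and retry count are mine, not part of the answer):

import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import net.spy.memcached.MemcachedClient;

public class CacheRetryHelper {
    // Retry a set() a few times if Spymemcached cancelled it while re-establishing the connection.
    boolean setWithRetry(MemcachedClient client, String key, Object value, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return client.set(key, 3600, value).get(); // wait for the queued operation to complete
            } catch (ExecutionException | CancellationException e) {
                // Operation was cancelled because the connection dropped; loop and try again.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // give up; for a get() you would simply treat the cancellation as a cache miss
    }
}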

I ran into the exact same problem and fixed it by handling the exception and retrying until success:
while (true) {
    try {
        memcacheclient.get(key);
        break;
    } catch (java.util.concurrent.CancellationException e) {
        log.info("cache cancelled");
    }
}

Run MemcachedClient.getStats() once for each new client and that will fix the cancellations issue.
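A rough sketch of that warm-up (host and port are placeholders):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class WarmUp {
    public static void main(String[] args) throws java.io.IOException {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));
        // One synchronous round trip per node right after construction, as suggested above,
        // so the first real get()/set() is not the call that runs into a cancelled operation.
        client.getStats();
    }
}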

I had the same issue. I am using the Spymemcached client to connect to the Memcached server.
Looking at the OperationFuture source, there must be a connection issue.
Ref: https://github.com/couchbase/spymemcached/blob/master/src/main/java/net/spy/memcached/internal/OperationFuture.java

Been searching for days for a solution. Posting in case it helps someone else.
Our implementation of ServletContextListener was getting a new MemcachedClient(...) on contextInitialized, and would then call the MemcachedClient method shutdown() on contextDestroyed. I would always get a CancellationException or ExecutionException on the first request I would send. (The error messaging alluded to both, but an ExecutionException is what I was able to catch.)
Solution: switched from shutdown() to shutdown(1, TimeUnit.SECONDS)
Now the get call succeeds the very first time that it is run.
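For reference, the listener now looks roughly like this (class and attribute names are illustrative, not our real code):

import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import net.spy.memcached.MemcachedClient;

public class CacheContextListener implements ServletContextListener {
    private MemcachedClient client;

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        try {
            client = new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
            sce.getServletContext().setAttribute("memcachedClient", client);
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        if (client != null) {
            // Graceful shutdown: let queued operations drain for up to a second
            // instead of cancelling them immediately as the no-argument shutdown() does.
            client.shutdown(1, TimeUnit.SECONDS);
        }
    }
}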
I cannot explain for sure how the contextDestroyed call was interfering with the regular handling of the request. My best guess is that spymemcached's single thread somehow gets shared between servlets, and so when a servlet was created to handle a request sent by a verification step of our build process, it would get destroyed prior to the first request I would send, and the MemcachedClient my request's servlet was using would then try to use that same thread and get hit with the exceptions from the shutdown.
(Our team had established the need to call shutdown a while back when we learned our web app had too many open connections to our memcached server.)

Related

How to properly handle race condition caused by retry worker

In one of our services we had some connection issues and were getting random timeouts (we think it is because of the client library; it is one of the caching services). We decided to handle it by putting the failed operation in a queue and retrying it on a separate worker until we solve the underlying issue.
However, there is a problem case. Let's say we want to put the value "A" into the cache, but it fails, so we put it in the queue to retry later. During this time the user fires a delete request to remove that data, and that call succeeds without any timeout (no error, but no record to delete either). Then our retry strategy writes the data to the cache, even though it was supposed to be deleted and should not be there.
How would we handle this scenario? I first thought we could raise an error if the delete doesn't actually delete anything, but I see that this also has many complications and could even end in an endless retry loop.
It appears the issue comes from performing the actual action on the main thread and only falling back to the retry queue (handled by the worker thread) when it fails.
If you route the actual action through the queue and worker thread as well, the ordering issue goes away.
A second solution is to track every key that currently has a retry pending in the queue. If any new action arrives for a key that is already queued, queue that action too, so it runs after the pending retry. In your example, the delete of "A" would be queued behind the pending retry of the write to "A" (a sketch follows below).
The second solution is a little less efficient.
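A minimal sketch of the second option (all class and method names are made up; the real cache client and retry worker are not shown). Operations on the same key are funnelled through one FIFO queue, so a pending retry of the write to "A" always runs before a later delete of "A":

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class KeyOrderedRetryQueue {
    private final Map<String, Queue<Runnable>> pendingByKey = new HashMap<>();

    // Called by request threads: run immediately unless the key already has queued work.
    synchronized void submit(String key, Runnable action) {
        Queue<Runnable> pending = pendingByKey.get(key);
        if (pending != null) {
            pending.add(action); // key already has a pending retry: preserve ordering behind it
        } else {
            action.run();        // in the real service this call may fail and end up in enqueueRetry()
        }
    }

    // Called when an action fails and must be retried later by the worker.
    synchronized void enqueueRetry(String key, Runnable action) {
        pendingByKey.computeIfAbsent(key, k -> new ArrayDeque<>()).add(action);
    }

    // Called by the retry worker once the underlying issue is resolved: drain in order.
    synchronized void drain(String key) {
        Queue<Runnable> pending = pendingByKey.remove(key);
        if (pending != null) {
            pending.forEach(Runnable::run);
        }
    }
}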

UpdateOne fails on client due to timeout, but MongoDB processes it anyway

One of my tests for a function that performs increments using the MongoDB driver for Go is randomly breaking in an unexpected way. Here's what the test does:
Create a proxy (with toxiproxy) to a local MongoDB instance.
Disable the proxy, so the database looks like it's down.
Run a function that does an update that increments a field, timing out after 100ms. If it fails, it keeps retrying every 100ms until the command succeeds.
Sleep 1 second.
Enable the proxy.
Wait for the function to complete and assert that the field has been incremented correctly - only once.
This test is randomly breaking because sometimes that field gets incremented twice. I noticed that it happens when an update is retried just as the proxy gets enabled: the client code receives an incomplete read of message header: context deadline exceeded error, which makes it retry the command, but the previous one indeed succeeded because the field ends up being incremented twice.
I took a look at the driver code and I guess it's timing out while reading the server response - perhaps the proxy is enabled just after the update has started and there isn't much timeout left for both write and read operations to complete.
Is there anything that I can do on my side to prevent this from happening? I tried to find a specific error to catch, but I couldn’t find any. Or is this something the driver itself is supposed to handle?
Any help is appreciated.
UPDATE: I looked closely at the error messages and noticed that, while the MongoDB instance was down, all errors were handshake failures. So I made sure the test pings the database before disabling the proxy, to get the handshake out of the way, and the test stopped randomly breaking; it ran 1000 times flawlessly, at least. I assume the handshake itself takes time to complete and that contributes to the command timeout.
In general, if the command went through to the server but you can't read the response, you can't assume anything about its success.
If in your case it only matters whether the command reached the server, read on.
Unfortunately the current state of the driver (v1.7.1) is not "sophisticated" enough to easily tell if the error is from reading the response.
I was able to reproduce your issue locally. Here is the error when a timeout happens reading the response:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-30]) incomplete read of message header: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-30]", Wrapped:context.deadlineExceededError{}, init:false, message:"incomplete read of message header"}}
And here is the error when the timeout happens writing the command:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-31]) unable to write wire message to network: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-31]", Wrapped:context.deadlineExceededError{}, init:false, message:"unable to write wire message to network"}}
As you can see, in both cases mongo.CommandError is returned, with identical Code and Labels fields. Which leaves you having to analyze the error string (which is ugly and may "break" with future changes).
So the best you can do is check if the error string contains "incomplete read of message header", and if so, you don't have to retry. Hopefully this (error support and analysis) improves in the future.
If you are using the retryable writes as implemented by MongoDB 3.6+ and the respective drivers, this shouldn't happen. Each write is accompanied by a transaction number (not to be confused with client-side transactions as implemented by MongoDB 4.0+), and if the same transaction number is used in two consecutive writes there is only one write being done by the server.
This functionality has been around for years so unless you are using an ancient driver version you should already have it.
If you are performing write retries in your application manually rather than using the driver's retryable write functionality, you can write twice as you found out. The solution is to use the driver's retryable writes.
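For reference (this is not from the answer above): retryable writes are controlled by the retryWrites option on the connection string or the driver's client options, and recent drivers typically enable them by default, e.g.

mongodb://localhost:27017/?retryWrites=true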
I had the same problem (running on go.mongodb.org/mongo-driver v1.8.1 on a MongoDB 4.4) and will leave my experiences with this problem here.
To add to icza's solution:
You can also get the error context deadline exceeded, so check for that as well.
A check for a cancelled or expired context would look something like this:
if strings.Contains(err.Error(), "context") &&
    (strings.Contains(err.Error(), " canceled") || strings.Contains(err.Error(), " deadline exceeded")) {
    // ...
}
My solution to the problem was, instead of first checking whether there was an error, to first check whether the operation returned a result.
Example:
result, err := database.collection.InsertOne(context, item)
if result != nil {
    return result.InsertedID, err
}
return nil, err
If the transaction did process it despite the error, you could add some compensation logic to undo the transaction.

How to handle client response during Transient Exception retrying?

Context
I'm developing a REST API that, as you might expect, is backed by multiple external cross-network services, APIs, and databases. It's very possible that a transient failure is encountered at any point and for which the operation should be retried. My question is, during that retry operation, how should my API respond to the client?
Suppose a client is POSTing a resource, and my server encounters a transient exception when attempting to write to the database. Using a combination of the Retry Pattern perhaps with the Circuit Breaker Pattern, my server-side code should attempt to retry the operation, following randomized linear/exponential back-off implementations. The client would obviously be left waiting during that time, which is not something we want.
Questions
Where does the client fit into the retry operation?
Should I perhaps provide an isTransient: true indicator in the JSON response and leave the client to retry?
Should I leave retrying to the server and respond with a message and status code indicative that the server is actively retrying the request and then have the client poll for updates? How would you determine the polling interval in that case without overloading the server? Or, should the server respond via a web socket instead so the client need not poll?
What happens if there is an unexpected server crash during the retry operation? Obviously, when the server recovers, it won't "remember" the fact that it was retrying an operation unless that fact was persisted somewhere. I suppose that's a non-critical issue that would just cause further unnecessary complexity if I attempted to solve it.
I'm probably over-thinking the issue, but while there is a lot of documentation about implementing transient exception retry logic, seldom have I come across resources that discuss how to leave the client "pending" during that time.
Note: I realize that similar questions have been asked, but my queries are more specific, for I'm specifically interested in the different options for where the client fits into a given retry operation, how the client should react in those cases, and what happens should a crash occur that interrupts a retry sequence.
Thank you very much.
There are some rules for retries:
Always create an idempotency key so that the server can recognize that a request is a retry of an earlier operation.
If your operation is complex and you want to wrap the REST call with a retry, you must ensure that duplicate requests produce no side effects (resume from the failure point and don't re-execute the code that already succeeded).
Personally, I think the client should not know that you are retrying something, and of course isTransient: true should not be part of the resource (a sketch of the idempotency-key idea follows below).
Warning: before adding a retry policy to anything, check its side effects; putting a retry policy everywhere is bad practice.
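A hedged, framework-free sketch of the idempotency-key rule (all names are invented): the client sends an Idempotency-Key header, the server stores the first response for that key and replays it for any retry instead of executing the operation twice.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IdempotentPostHandler {
    private final Map<String, String> responsesByKey = new ConcurrentHashMap<>();

    String handlePost(String idempotencyKey, String requestBody) {
        // computeIfAbsent runs the mapping function at most once per key, so a request the
        // client retried because it never saw our response is not executed a second time.
        return responsesByKey.computeIfAbsent(idempotencyKey, key -> createResource(requestBody));
    }

    private String createResource(String requestBody) {
        // ... write to the database here, behind the server-side retry/circuit-breaker logic ...
        return "{\"id\": \"123\"}"; // placeholder response body
    }
}

In a real system the key-to-response map would live in shared storage with an expiry, which can also help with the server-crash case: after a restart, the stored response (or its absence) tells you whether the operation completed.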

Netty 4.0 SO_KEEPALIVE: one connection sends many requests to the server. How to process the requests concurrently?

One connection sends many requests to the server.
How can I process the requests concurrently?
Please use a simple example, like the time server or echo server from netty.io, to illustrate the operation.
One way I can see is to create a separately threaded handler that is used in a producer/consumer way.
The producer is your "network" handler: it hands messages to the consumers and therefore does not wait for any answer, so it can proceed with the next request.
The consumer is your "business" handler, one per connection but possibly multi-threaded, consuming the messages with multiple instances and answering through the Netty context of the connection it is attached to.
Another option for the consumer is to have only one handler, still multi-threaded, where each message arrives together with its original Netty context so the handler can answer the client, whatever the attached connection.
But the difficulties come quickly:
How to deal with answers arriving out of order on the client side: say the client sends three requests A, B and C, and the answers come back, due to the speed of the business handler, as C, A, B. You have to deal with that and know which request each answer belongs to.
You also have to ensure, everywhere it is used, that the context passed along is still valid (channel active), if you don't want too many errors.
Perhaps the best way is still to handle your requests in order (as Netty does) and keep the work behind each answer as quick as possible. A sketch of the offloading idea follows.
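A minimal sketch (not from the question) of that producer/consumer split using Netty's own facilities: the business handler is registered on a separate EventExecutorGroup, so slow work does not block the I/O thread that keeps reading further requests from the same connection. Handler and pipeline names are made up.

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.socket.SocketChannel;
import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutorGroup;

public class ServerInitializer extends ChannelInitializer<SocketChannel> {
    // Threads shared by all connections for the business work.
    private static final EventExecutorGroup BUSINESS_GROUP = new DefaultEventExecutorGroup(16);

    @Override
    protected void initChannel(SocketChannel ch) {
        // ... codec handlers (decoder/encoder) go here, on the I/O event loop ...
        // The business handler runs on its own executor group instead of the event loop:
        ch.pipeline().addLast(BUSINESS_GROUP, "business", new BusinessHandler());
    }

    static class BusinessHandler extends SimpleChannelInboundHandler<String> {
        @Override
        protected void channelRead0(ChannelHandlerContext ctx, String request) {
            // Runs on BUSINESS_GROUP; check the channel is still active before answering,
            // and include something identifying the request so the client can match answers.
            if (ctx.channel().isActive()) {
                ctx.writeAndFlush("answer for " + request);
            }
        }
    }
}

Note that Netty still delivers events to a given handler in order per channel, even on a separate EventExecutorGroup; for truly parallel handling of one connection's requests you would hand each message to your own worker pool inside channelRead0 and correlate answers with a request id, which brings back the out-of-order problem described above.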

hornetq - view the available queues

I'm working with an application that requires the use of HornetQ queues.
It's kind of hit or miss for some reason. When I create a queue, the first message to that queue works, but a second does not, so I've tried using a new queue for each connection to the REST API running on JBoss. Sometimes this is okay, sometimes I get 412 Precondition Failed (when the same name is used more than once) or just plain 500 internal errors.
The application has a /api/hornet-queue/queues/ path, but it doesn't allow GET requests.
Is there another way to tell what queues are open?
You are leaking a consumer, and the message is being held by that consumer.
Either reuse the same consumer, or close the consumer.
If you do need to open and close consumers like this, set consumer-window-size to 0 so the consumer won't buffer messages and waste resources.
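A hedged sketch with the HornetQ core client API (queue name, transport configuration and timeout are placeholders; the REST interface fronts the same broker), just to show the two points above, a consumer window size of 0 and closing the consumer:

import org.hornetq.api.core.TransportConfiguration;
import org.hornetq.api.core.client.ClientConsumer;
import org.hornetq.api.core.client.ClientMessage;
import org.hornetq.api.core.client.ClientSession;
import org.hornetq.api.core.client.ClientSessionFactory;
import org.hornetq.api.core.client.HornetQClient;
import org.hornetq.api.core.client.ServerLocator;
import org.hornetq.core.remoting.impl.netty.NettyConnectorFactory;

public class ConsumeOnce {
    public static void main(String[] args) throws Exception {
        ServerLocator locator = HornetQClient.createServerLocatorWithoutHA(
                new TransportConfiguration(NettyConnectorFactory.class.getName()));
        locator.setConsumerWindowSize(0);               // don't buffer messages on the consumer

        ClientSessionFactory factory = locator.createSessionFactory();
        ClientSession session = factory.createSession();
        session.start();

        ClientConsumer consumer = session.createConsumer("jms.queue.example");
        ClientMessage message = consumer.receive(5000); // null if nothing arrives within 5 seconds
        if (message != null) {
            message.acknowledge();
        }

        consumer.close();                               // closing releases anything held by this consumer
        session.close();
        factory.close();
        locator.close();
    }
}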