I'm starting to integrate libmemcached into my application and, reading the documentation, I noticed there is a non-blocking mode flag. After a quick google there seems to be a performance advantage to non-blocking mode, but are there any disadvantages to running libmemcached in non-blocking mode?
Of course there are. The disadvantage only arises if you need to ENSURE that the written value actually made it to memcached and did not fail. For example: you're using memcached to store a counter variable, with a sentinel that checks whether the counter has reached a certain value before performing an operation.
In blocking mode, the memcached client waits for a write-success response from memcached before proceeding, and produces an error if the write fails. This way you know the counter was updated. If you tell it to write in non-blocking mode, the client sends the request to increment the counter but never waits to ensure that it really happened. Because it doesn't wait, your code resumes execution after the call more quickly, but with the uncertainty of not knowing for sure that the counter was ever incremented.
However, since memcached values are destroyed on a service restart (think system crash), you can't ever really be sure a value will be there. Also, under memory pressure you cannot be sure the value is 100% correct, as it may get pruned by the LRU algorithm; you'd need persistent storage to remove this uncertainty.
Given this inherent uncertainty, many people use non-blocking mode to get the performance gain: they can't ever be totally certain the counter value in memcached isn't reset/inaccurate anyway, so why not take some performance in the tradeoff.
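For reference, here is roughly how that flag gets flipped from Python via pylibmc, which wraps libmemcached. Treat it as a sketch: the behavior name "no_block" is pylibmc's spelling of MEMCACHED_BEHAVIOR_NO_BLOCK, and the host and key names are placeholders, so check your binding's docs.

    import pylibmc

    # "no_block" maps to MEMCACHED_BEHAVIOR_NO_BLOCK; host/key names are made up.
    mc = pylibmc.Client(["127.0.0.1"], binary=True,
                        behaviors={"no_block": True})

    mc.set("counter", 0)
    mc.incr("counter")  # fire-and-forget: a failed increment can go unnoticed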
Hope this clarifies the issue. As a side note - MongoDB has non-blocking writes in persistent storage - which while awesome in its flexibility, gives people using non-blocking mode more of a false sense of security that the write will always succeed...
Goal: There are X backend servers and Y tasks. Each task must be done by only one server; the same task being run by two different servers should never happen.
There are tasks which include continuous work for an indefinite amount of time, such as polling for data. The same server can keep doing such a task as long as the server stays alive.
Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?
Well, the way you define your problem makes it sloppy to reason about. What you are actually looking for is called a "distributed lock".
Let's start with a simpler problem: assume you have only two concurrent servers S1, S2 and a single task T. The safety property you stated remains as is: at no point in time may both S1 and S2 process task T. How could that be achieved? The following strategies come to mind:
Implement an algorithm that deterministically maps a task to a responsible server. For example, it could be as stupid as: if task.name.contains('foo') then server1.process(task) else server2.process(task). That works, and indeed might fit some real-world requirements out there, yet such an approach is a dead end: a) you have to know upfront, statically, how many servers you will have, and b) - the most dangerous - you cannot tolerate either server being down: if, say, S1 is taken off, then there is nothing you can do with T right now except wait for S1 to come back online. These drawbacks could be softened and optimized, yet there is no way to get rid of them; escaping these deficiencies requires a more dynamic approach.
Implement an algorithm that would allow S1 and S2 to agree upon who is responsible for T. Basically, you want both S1 and S2 to come to a consensus about an (assumed, not necessarily actually existing) T.is_processed_by = "S1" or T.is_processed_by = "S2" property's value. Then your requirement translates to "at any point in time is_processed_by is seen by both servers in the same way". Hence "consensus": an agreement (between the servers) about the is_processed_by value. Having that eliminates all the "too static" issues of the previous strategy: you are no longer bound to 2 servers, you could have n servers, n > 1 (provided that your distributed consensus works for the chosen n). However, it is not prepared for accidents like an unexpected power outage. It could be that S1 won the competition, is_processed_by became equal to "S1", S2 agreed with that and... S1 went down and did nothing useful....
...so you're missing the last bit: the "liveness" property. In simple words, you'd like your system to continuously make progress whenever possible. To achieve that property - among many other things I am not mentioning - you have to make sure that a server's spontaneous death is monitored and, once it has happened, no task T stays stuck for an indefinitely long time. How do you achieve that? That's another story; a typical practical solution is to copy-paste the good old TCP way of doing essentially the same thing: meet the keepalive approach.
OK, let's conclude what we have by now:
Take any implementation of "distributed locking", which is equivalent to "distributed consensus". It could be ZooKeeper used correctly, PostgreSQL running a serializable transaction, or anything alike.
Per each unprocessed or stuck task T in your system, make all the free servers S race for that lock. Only one of them is guaranteed to win; all the rest will surely lose.
Frequently enough, push a sort of TCP keepalive notification per each task being processed or, at least, per each alive server. Missing, let's say, three notifications in a row should be taken as the server's death, and all of its tasks should be re-marked as "stuck" and (eventually) reprocessed via the previous step.
And that's it.
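To make that concrete, here is a rough lease-based sketch in Python against PostgreSQL. It uses row locking (FOR UPDATE SKIP LOCKED, PostgreSQL 9.5+) rather than a serializable transaction, and the table/column names (tasks, owner, lease_expires) are made up for illustration:

    import psycopg2

    LEASE_SECONDS = 30  # roughly "three missed keepalives"

    def claim_task(conn, server_id):
        """Atomically claim one open or expired task; return its id or None."""
        with conn, conn.cursor() as cur:
            cur.execute("""
                UPDATE tasks
                   SET owner = %s,
                       lease_expires = now() + %s * interval '1 second'
                 WHERE id = (SELECT id FROM tasks
                              WHERE owner IS NULL OR lease_expires < now()
                              ORDER BY id
                              LIMIT 1
                              FOR UPDATE SKIP LOCKED)
             RETURNING id
            """, (server_id, LEASE_SECONDS))
            row = cur.fetchone()
            return row[0] if row else None

    def heartbeat(conn, server_id, task_id):
        """The keepalive: extend the lease while the task is still being worked on."""
        with conn, conn.cursor() as cur:
            cur.execute("""
                UPDATE tasks
                   SET lease_expires = now() + %s * interval '1 second'
                 WHERE id = %s AND owner = %s
            """, (LEASE_SECONDS, task_id, server_id))

A server that stops heartbeating simply lets its lease expire, and the task becomes claimable again by whoever wins the next race.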
P.S. Safety & liveness properties are something you'd definitely want to be aware of when it comes to distributed computing.
Try RabbitMQ worker queues:
https://www.rabbitmq.com/tutorials/tutorial-two-python.html
It has an acknowledgement feature, so if a task fails or a server crashes it will automatically redeliver your task. Based on your specific use case you can set up retries, etc.
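For example, a minimal worker along the lines of that tutorial, using pika (1.x API) with manual acks; the queue name and the do_work body are placeholders:

    import pika

    def do_work(body):
        print("processing %r" % body)  # your task logic goes here

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="tasks", durable=True)

    def handle(ch, method, properties, body):
        do_work(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after success

    channel.basic_qos(prefetch_count=1)  # at most one unacked task per worker
    channel.basic_consume(queue="tasks", on_message_callback=handle)
    channel.start_consuming()

If the worker dies before acking, RabbitMQ requeues the message and delivers it to another worker.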
"Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?"
You are getting into a known problem in distributed systems: how does a system make decisions when the system is partitioned? Let me elaborate on this.
A simple statement like "server dies" requires quite a deep dive into what it actually means. Did the server lose power? Is the network between your control plane and the server down (while the task keeps running)? Or maybe the task completed successfully, but the failure happened just before the task server was about to report it? If you want to be 100% correct in deciding the current state of the system, that is the same as saying the system has to be 100% consistent.
This is where the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem) comes into play. Since your system may be partitioned at any time (a worker server may get disconnected or die, which from the outside is the same state) and you want to be 100% correct/consistent, this means that the system won't be 100% available.
To reiterate the previous paragraph: if the system suspects a task server is down, the system as a whole has to come to a stop until it can determine what happened to that particular task server.
The trade-off between consistency and availability is the core of distributed systems. Since you want to be 100% correct, you won't have 100% availability.
While availability is not 100%, you still can improve the system to make it as available as possible. Several approaches may help with that.
The simplest one is to alert a human when the system suspects a task server is down. The human will get a notification (24/7), wake up, log in and do a manual check on what is going on. Whether this approach works for your case depends on how much availability you need, but it is completely legit and widely used in the industry (those engineers carrying pagers).
A more complicated approach is to let the system fail over to another task server automatically, if that is possible. A few options are available here, depending on the type of task.
The first type of task is re-runnable, but it has to exist as a single instance. In this case, the system uses the "STONITH" (shoot the other node in the head) technique to make sure the previous node is dead for good. For example, in a cloud the system would actually kill the whole container of the task server and then start a new container as a failover.
The second type of task is not re-runnable. For example, a task transferring money from account A to account B is not (automatically) re-runnable. The system does not know whether the task failed before or after the money was moved. Hence, the failover needs to do extra steps to work out the outcome, which may also be impossible if the network is not working correctly. In these cases the system usually comes to a halt until it can make a 100% correct decision.
None of these options will give 100% availability, but they can do as well as possible given the nature of distributed systems.
What is the difference between the issue queue and the LSQ for memory instructions? Do memory instructions pass through both queues, or do they only pass through the LSQ?
If they pass through both queues, what is their order?
I'm assuming you use the Arm-like nomenclature here, so the issue queue is what Intel calls the RS (reservation station), and by "issue" you mean sending a uop ready for execution.
The answer is that memory instructions need to pass through both. All instructions need to be issued (except the ones that can be eliminated without execution, for example register moves, zero idioms, nops, etc.). Let's rephrase: all instructions that need to go through an ALU need to go through the issue process first. Memory instructions will simply use that step to calculate their addresses.
This is true for loads; for stores there is usually an internal split into store-address and store-data uops, so the store-address part will behave like a load in that sense and calculate its address during that step.
There is usually a dedicated execution port and dedicated execution units for that, because the address calculation usually follows one of a few specific addressing modes (each architecture has a different set of these). Aside from that, the execution needs to follow the same rules as any other operation in the CPU: it needs to have its sources ready and read from the register file or bypassed from an in-flight operation, and it needs to be arbitrated when the execution port is free and prioritized by the same aging rules, so it makes sense that it uses the common path.
Once the memory operation has finished execution, it will be sent to the LSU (load-store unit, or the DCU, data-cache unit on Intel) and perform the actual memory access using the generated address. The LSU pipe will take care of the address translation, TLB lookups, the page walk if needed (though this is sometimes done in a dedicated unit), the address range and property checks, the cache lookup (if cacheable) and sending a miss to the next cache level or memory if needed. It may also trigger prefetches as part of the process.
For a load, when the LSU pipe has completed (which may require multiple passes and wakeups if the data was not available in the L1), the LSU will signal the issue queue again in order to wake up anything that was dependent on the result.
For a store, the store-address uop may fetch the line into the cache in advance as an optimization, but the actual next step is usually a wakeup after retirement (since stores may not be dispatched to memory while speculative, unless you have some tricks to handle that).
It's also worth mentioning that some CPUs try to optimize loads that can get their data directly from prior stores instead of fetching it from the cache/memory. This can include store-to-load forwarding (very common) or memory renaming (less common). The former is usually handled by the LSU internally, but the latter can be done much earlier and without the LSU (though the LSU pipe is usually still activated to validate the result).
I'm wondering if any persistence failure will go undetected if I don't check error codes? If so, what's the right way to write fast (asynchronously) while still detecting errors?
If you don't check for errors, your update is only fire-and-forget, and you'll indeed miss any errors that could arise. Please see MongoDB WriteConcerns for the available write modes in MongoDB (sorry, I always fail to find the official, non-driver-related documentation; I really should bookmark it).
So with NORMAL you'll get at least connectivity errors; with NONE, no exceptions at all. If you want to be informed of exceptions you have to use one of the other modes, which differ only in the persistence guarantee they give you.
You can't detect errors when writing asynchronously, as this goes against the intention. The connection that sent the write operation may already be closed or reused, so the error can't be sent back through it. Furthermore, only your actual code knows what to do if a write fails. As MongoDB doesn't offer some remote procedure call to asynchronously inform you about updates, you'll have to wait until the write has finished to a given stage.
So the fastest, but most unreliable, of these is SAFE, where the write has only happened in memory. JOURNAL gives you the guarantee that it was at least written to the journal on disk. With FSYNC you'll have those changes persisted in your db on disk. REPLICA means that at least two replicas have written it, and MAJORITY that more than half of your replicas have written it (with three replicas, which should be the default, these two don't differ).
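In current drivers these modes are expressed as w/j options on a WriteConcern rather than named constants. A small pymongo sketch of the acknowledged vs. fire-and-forget distinction (database and collection names are made up):

    from pymongo import MongoClient
    from pymongo.errors import PyMongoError
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017")
    db = client.mydb

    # w=0: fire-and-forget; server-side errors are not reported back.
    unacked = db.get_collection("counters", write_concern=WriteConcern(w=0))

    # w=1, j=True: acknowledged and journaled; failures surface as exceptions.
    acked = db.get_collection("counters", write_concern=WriteConcern(w=1, j=True))

    try:
        acked.insert_one({"name": "visits", "value": 0})
    except PyMongoError as exc:
        print("write failed:", exc)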
The only chance I see to get something like asynchronous behaviour is to have a separate thread that performs all write operations synchronously. That thread could handle the actual update as well as call a class that, in case of a failure, performs whatever is needed to handle it. But I don't think that this is good application design.
Yes, depending on the error, it can fail silently if you don't check the returned error code. It's necessary to wait for error checking. Your only other option would be for your app to occasionally tell the user "oops, remember when I acted like I saved your data a moment ago? Well, not really."
In order not to flood the remote endpoint my server app will have to implement a "to-send" queue of packets I wish to send.
I use Windows Winsock, I/O Completion Ports.
So, I know that when my code calls "socket->send(.....)", my custom "send()" function will check to see if data is already "on the wire" (towards that socket).
If data is indeed on the wire it will simply queue the new data to be sent later.
If no data is on the wire it will call WSASend() to really send the data.
So far everything is nice.
Now, the size of the data I'm going to send is unpredictable, so I break it into smaller chunks (say 64 bytes) in order not to waste memory for small packets, and queue/send these small chunks.
When a "write-done" completion status is given by IOCP regarding the packet I've sent, I send the next packet in the queue.
That's the problem: the speed is awfully low.
I'm actually getting speeds of around 200 KB/s, and that's on a local connection (127.0.0.1).
So, I know I'll have to call WSASend() with several chunks (an array of WSABUF structures), and that will give much better performance, but how much should I send at once?
Is there a recommended size of bytes? I'm sure the answer is specific to my needs, yet I'm also sure there is some "general" point to start with.
Is there any other, better, way to do this?
Of course you only need to resort to providing your own queue if you are trying to send data faster than the peer can process it (either due to link speed or the speed that the peer can read and process the data). Then you only need to resort to your own data queue if you want to control the amount of system resources being used. If you only have a few connections then it is likely that this is all unnecessary, if you have 1000s then it's something that you need to be concerned about. The main thing to realise here is that if you use ANY of the asynchronous network send APIs on Windows, managed or unmanaged, then you are handing control over the lifetime of your send buffers to the receiving application and the network. See here for more details.
And once you have decided that you DO need to bother with this, you don't always need to bother: if the peer can process the data faster than you can produce it then there's no need to slow things down by queuing on the sender. You'll see that you need to queue data when your write completions begin to take longer, as the overlapped writes that you issue cannot complete because the TCP stack is unable to send any more data due to flow control issues (see http://www.tcpipguide.com/free/t_TCPWindowSizeAdjustmentandFlowControl.htm). At this point you are potentially using an unconstrained amount of limited system resources (both non-paged pool memory and the number of memory pages that can be locked are limited and, as far as I know, both are used by pending socket writes)...
Anyway, enough of that... I assume you already have achieved good throughput before you added your send queue? To achieve maximum performance you probably need to set the TCP window size to something larger than the default (see http://msdn.microsoft.com/en-us/library/ms819736.aspx) and post multiple overlapped writes on the connection.
Assuming you already HAVE good throughput then you need to allow a number of pending overlapped writes before you start queuing, this maximises the amount of data that is ready to be sent. Once you have your magic number of pending writes outstanding you can start to queue the data and then send it based on subsequent completions. Of course, as soon as you have ANY data queued all further data must be queued. Make the number configurable and profile to see what works best as a trade off between speed and resources used (i.e. number of concurrent connections that you can maintain).
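As a platform-neutral sketch of that policy (not real Winsock code): issue_write below stands in for whatever actually posts the overlapped WSASend, and on_write_completed would be called from your IOCP completion handler.

    from collections import deque

    class SendQueue(object):
        """Allow up to max_pending overlapped writes in flight per connection;
        queue everything else and drain the queue as completions arrive."""

        def __init__(self, issue_write, max_pending=4):
            self.issue_write = issue_write  # hypothetical callback that posts a send
            self.max_pending = max_pending
            self.pending = 0
            self.queue = deque()

        def send(self, buf):
            # Once anything is queued, all later data must also be queued,
            # otherwise bytes could be sent out of order.
            if self.queue or self.pending >= self.max_pending:
                self.queue.append(buf)
            else:
                self.pending += 1
                self.issue_write(buf)

        def on_write_completed(self):
            self.pending -= 1
            while self.queue and self.pending < self.max_pending:
                self.pending += 1
                self.issue_write(self.queue.popleft())

Tune max_pending (and the buffer sizes) by profiling, as described above.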
I tend to queue the whole data buffer that is due to be sent as a single entry in a queue of data buffers. Since you're using IOCP, it's likely that these data buffers are already reference counted to make it easy to release them when the completions occur and not before, so the queuing process is made simpler: you simply hold a reference to the send buffer whilst the data is in the queue and release it once you've issued a send.
Personally I wouldn't optimise by using scatter/gather writes with multiple WSABUFs until you have the base working and you know that doing so actually improves performance, I doubt that it will if you have enough data already pending; but as always, measure and you will know.
64 bytes is too small.
You may have already seen this but I wrote about the subject here: http://www.lenholgate.com/blog/2008/03/bug-in-timer-queue-code.html though it's possibly too vague for you.
I'm working on a multiplayer game and it needs a message queue (i.e., messages in, messages out, no duplicates or deleted messages assuming there are no unexpected cache evictions). Here are the memcache-based queues I'm aware of:
MemcacheQ: http://memcachedb.org/memcacheq/
Starling: http://rubyforge.org/projects/starling/
Depcached: http://www.marcworrell.com/article-2287-en.html
Sparrow: http://code.google.com/p/sparrow/
I learned the concept of the memcache queue from this blog post:
All messages are saved with an integer as key. There is one key that holds the next key to use and one that holds the key of the oldest message in the queue. To access these, the increment/decrement method is used, as it's atomic, so there are two keys that act as locks. They get incremented, and if the return value is 1 the process has the lock; otherwise it keeps incrementing. Once the process is finished it sets the value back to 0. Simple but effective. One caveat is that the integer will overflow, so there is some logic in place that sets the used keys to 1 once we are close to that limit. As the increment operation is atomic, the lock is only needed if two or more memcaches are used (for redundancy), to keep those in sync.
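A rough single-producer/single-consumer sketch of that scheme against App Engine's memcache API (Python runtime); the key names are invented and the lock/overflow handling from the blog post is left out:

    from google.appengine.api import memcache

    def push(message):
        tail = memcache.incr("q:tail", initial_value=0)  # atomic: claims a slot number
        memcache.set("q:%d" % tail, message)

    def pop():
        head = memcache.get("q:head") or 0
        tail = memcache.get("q:tail") or 0
        if head >= tail:
            return None                                  # queue looks empty
        head = memcache.incr("q:head", initial_value=0)  # atomic: claims the next slot
        return memcache.get("q:%d" % head)               # None if the item was evicted

That last line is exactly where the "unexpected cache evictions" caveat bites: a slot can come back empty.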
My question is, is there a memcache-based message queue service that can run on App Engine?
I would be very careful using the Google App Engine Memcache in this way. You are right to be worrying about "unexpected cache evictions".
Google expects you to use memcache for caching data, not storing it. They don't guarantee to keep data in the cache. From the GAE documentation:
By default, items never expire, though items may be evicted due to memory pressure.
Edit: There's always Amazon's Simple Queueing Service. However, this may not meet price/performance levels either as:
There would be the latency of calling from Google's servers to Amazon's.
You'd end up paying twice for all the data traffic: paying for it to leave Google and then paying again for it to go into Amazon.
I have started a Simple Python Memcached Queue, it might be useful:
http://bitbucket.org/epoz/python-memcache-queue/
If you're happy with the possibility of losing data, by all means go ahead. Bear in mind, though, that although memcache generally has lower latency than the datastore, like anything else, it will suffer if you have a high rate of atomic operations you want to execute on a single element. This isn't a datastore problem - it's simply a problem of having to serialize access.
Failing that, Amazon's SQS seems like a viable option.
Why not use Task Queue:
https://developers.google.com/appengine/docs/python/taskqueue/
https://developers.google.com/appengine/docs/java/taskqueue/
It seems to solve the issue without the likely loss of messages of a memcache-based queue.
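For example (Python runtime; the URL and params are placeholders), enqueuing work is a one-liner and App Engine retries the task automatically if the handler fails:

    from google.appengine.api import taskqueue

    # POSTs to your own /worker handler; retried automatically on failure.
    taskqueue.add(url="/worker/process_message",
                  params={"player": "p1", "payload": "hello"})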
Until Google implements a proper job queue, why not use the datastore? As others have said, memcache is just a cache and could lose queue items (which would be... bad).
The datastore should be more than fast enough for what you need. You would just have a simple Job model, which would be more flexible than memcache as you're not limited to key/value pairs.
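For illustration, one way such a Job model might look with the google.appengine.ext.db API; the field names are just placeholders:

    from google.appengine.ext import db

    class Job(db.Model):
        payload    = db.TextProperty()
        claimed_by = db.StringProperty()          # which server picked it up
        done       = db.BooleanProperty(default=False)
        created    = db.DateTimeProperty(auto_now_add=True)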