Atomic get and delete in memcached? - memcached

Is there a way to do atomic get-and-delete in memcached?
In other words, I want to get the value for a key if it exists and delete it immediately, so this value can be read once and only once.
I think this pseudocode might work, but note the caveat postscript:
# When setting:
SET key-0 value
SET key-ns 0
# When getting:
ns = INCR key-ns
GET key-{ns - 1}
Constraint: I have millions of keys that could be accessed millions of times, and only a small percentage will have a value set at any given time. I don't want to have to update an atomic counter for every key with every get access request as above.

The canonical, but yet generic, answer to your question is : lock free hash table with a relaxed memory model.
The more relaxed is your memory model the more you get with a good lock free design, it's a way to get more performance out of the same chipset.
Here is a talk about that, I don't think that it's possible to answer to your question with a single post on hash tables and lock free programming, I'm not even trying to do that.

You cannot do this with memcached in a single command since there is no api that supports exactly what your asking for. What I would do to get the behavior your looking for is to implement some sort of marking behavior to signify that another client has or hasn't read the data. For example, you could create a JSON document as follows:
{
"data": "value",
"used": false
}
When you get the item check to see if it has already been used by another client by examining the used field. If it hasn't been used then set the value using the cas you got from the GET command and make sure that the document is updated to reflect the fact that a client has already accessed this key.
If the set operation fails because the cas is invalid then this means that another client has obtained this item and already updated it in memcached to signify that it has been used. In this case you just cancel whatever you were doing with the item and move on.
If the set operation succeeds then this means you client is the sole owner of this data. You can now delete it from memcached and do whatever processing on it you like.
Note that when doing the set I would also add an expiration time of about 5 seconds. This way if you application crashes your documents will clean themselves up if you don't finish with the entire process of deleting them.

To put some code to the answer from #mikewied, I think the basic gist is... (using Node.js):
var Memcached = require('memcached');
var memcache = new Memcached('localhost:11211');
var getOnce = function(key, callback) {
// gets is the check-and-set get (vs regular get)
memcache.gets(key, function(err, data) {
if (!data) {
// Cache miss, nothing to see here.
callback(null);
} else {
var yourData = data[key];
// Do a check-and-set to remove the data from the cache.
// This sets the value to null *only* if no one else already did.
memcache.cas(key, null /* new data */, data.cas, 10, function(err) {
if (err) {
// Check-and-set failed! (Here we'll treat it like a cache miss)
yourData = null;
}
callback(yourData);
});
}
});
};

I'm not an expert on Memcached and so I may be wrong. My answer is from reading the documentation and my experience using Memcached.
IMO this is not possible with memcached's current implementation.
to demonstrate why this is not possible currently here is a simple example to demonstrate the race condition:
two processes start at the same time
both execute a get/delete at the same time
memcached replies to both get commands at the same time
done (the desired result was to have get/delete execute atomically then the second get/delete to fail. instead memcached did get, get, delete, fails to delete)
to get an atomic get/delete would require:
a new command for memcached that is atomic let's call it get_delete
some sort of synchronization lock method of all the memcached clients to ensure both the get and delete commands are executed while the lock is held
so all clients would grab the synchronization lock whenever they need to enter the critcal section (i.e. get, delete) then release the lock after the critical section

Related

RTK Query manual cache update without known previous query args

How to make optimistic/pessimistic cache update (probably with updateQueryData), but without knowing the arguments of previous queries?
updateQueryData(
"getItems",
HERE_ARE_THE_ARGUMENTS_I_DONT_HAVE,
(data) => {
...
}
)
For example:
getPosts query with args search: number
updatePosts mutation
Example actions:
Go to the table, first request getPosts with search = "" is cached.
Write in search input abc.
Second request getPosts with search = "abc" is cached.
I update one element within the table with pessimistic update - successfully modifying cache of second request (previous step)
Clear search input
Table shows the same state from first step, even if one element should be modified
But I need universal solution. I don't know how many different cached entries will be there. And also my case is more complex, because I have other args than "search" to worry about (pagination).
Possible solutions??
It would be perfect if there's a method to access all previous cached queries, so I could go with updateQueryData for each of them, but I cannot find simple way to do that.
I thought about accessing getState() within onQueryStarted and retrieving query parameters from there (in order to do above), but it's not elegant way
I thought about looking for a way to invalidate previous requests without invalidating the last request, but also cannot find it
Okay, I found solution here
Is it possible to optimistically update all caches of a endpoint?
Using
api.util.selectInvalidatedBy(getState(), { type, id }) gives you in return list of cached queries with endpointNames and queryArgs. Then you can easily mutate these caches by updateQueryData.
There's also new problem created here, which is situation where modifying some particular properties may change response from API in unpredictable way. For example in a list you may sort by status, which you can modify. Then changing the cache would be bad idea and instead of that I've created new tag which is invalidated with each update { type: "getItems", id: "status" } and for getItems it would be like { type: "getItems", id: sortBy }

MongoDB - select document for update - without another operation modifying the same document after the select

I've got a document that needs to be read and updated. Meanwhile, it's quite likely that another process is doing the same which would break the document update.
For example, if Process A reads document d and adds field 'a' to it and writes the document, and Process B reads document d before Process A writes it, and adds field b and writes the document, then whichever process writes the changes out will get their change because it clobbers the change by the one that wrote first.
I've read this article and some other very complicated transaction articles around mongo. Can someone describe a simple solution to this - I have not come across something that makes me comfortable with this yet.
https://www.mongodb.com/blog/post/how-to-select--for-update-inside-mongodb-transactions
[UPDATE]- In addition, I'm trying to augment a document that might not yet exist. I need to create the document if it doesn't exist. I also need to read it to analyze it. One key is "relatedIds" (an array). I push to that array if the id is not found in it. Another method I have that needs to create the document if it doesn't exist adds to a separate collection of objects.
[ANOTHER UPDATE x2] --> From what I've been reading and getting from various sources - is that the only way to properly create a transaction for this - is to "findOneAndModify" the document to mark it as dirty with some field that will definitely update, such as "lock" with an objectId (since that will never result in a NO-OP - ie, it definitely causes a change).
If another operation tries to write to it, Mongo can now detect that this record is already part of a transaction.
Thus anything that writes to it will cause a writeError on that other operation. My transaction can then slowly work on that record and have a lock on it. When it writes it out and commits, that record is definitely not touched by anything else. If there's no way to do this without a transaction for some reason, then am I creating the transaction in the easiest way here?
Using Mongo's transactions is the "proper" way to go but i'll offer a simple solution that is sufficient ( with some caveats ).
The simplest solution would be to use findOneAndUpdate to read the document and update a new field, let's call it status, since it is atomic this is possible.
the query would look like so:
const doc = await db.collection.findOneAndUpdate(
{
_id: docId,
status: { $ne: 'processing' }
},
{
$set: {
status: 'processing'
}
}
);
so if dov.value is null then it means (assuming the document exists) that another process is processing it. When you finish processing you just need to reset status to be any other value.
Now because you are inherently locking this document from being read until the process finishes you have to make sure that you handle cases like an error thrown throughout the process, update failure, db connection issue's, etc .
Overall I would be cautious about using this approach as it will only "lock" the document for the "proper" queries ( every single process needs to be updated to use the status field ), which is a little problematic, depending on your usecase.

Contention-friendly database architecture for large documents and inner arrays

Context
I have a database with a collection of documents using this schema (shortened schema because some data is irrelevant to my problem):
{
title: string;
order: number;
...
...
...
modificationsHistory: HistoryEntry[];
items: ListRow[];
finalItems: ListRow[];
...
...
...
}
These documents can easily reach 100 or 200 kB, depending on the amount of items and finalItems that they hold. It's also very important that they are updated as fast as possible, with the smallest bandwidth usage possible.
This is inside a web application context, using Angular 9 and #angular/fire 6.0.0.
Problems
When the end user edits one item inside the object's item array, like editing just a property, reflecting that inside the database requires me to send the entire object, because firestore's update method doesn't support array indexes inside the field path, the only operations that can be done on arrays are adding or deleting an element as described inside documentation.
However, updating an element of the items array by sending the entire document creates poor performances for anyone without a good connection, which is the case for a lot of my users.
Second issue is that having everything in realtime inside one document makes collaboration hard in my case, because some of these elements can be edited by multiple users at the same time, which creates two issues:
Some write operations may fail due to too much contention on the document if two updates are made in the same second.
The updates are not atomic as we're sending the entire document at once, as it doesn't use transactions to avoid using bandwidth even more.
Solutions I already tried
Subcollections
Description
This was a very simple solution: create a subcollection for items, finalItems and modificationsHistory arrays, making them easy to edit as they now have their own ID so it's easy to reach them to update them.
Why it didn't work
Having a list with 10 finalItems, 30 items and 50 entries inside modificationsHistory means that I need to have a total of 4 listeners opened for one element to be listened entirely. Considering the fact that a user can have many of these elements opened at once, having several dozens of documents being listened creates an equally bad performance situation, probably even worse in a full user case.
It also means that if I want to update a big element with 100 items and I want to update half of them, it'll cost me one write operation per item, not to mention the amount of read operations needed to check permissions, etc, probably 3 per write so 150 read + 50 write just to update 50 items in an array.
Cloud Function to update the document
const {
applyPatch
} = require('fast-json-patch');
function applyOffsets(data, entries) {
entries.forEach(customEntry => {
const explodedPath = customEntry.path.split('/');
explodedPath.shift();
let pointer = data;
for (let fragment of explodedPath.slice(0, -1)) {
pointer = pointer[fragment];
}
pointer[explodedPath[explodedPath.length - 1]] += customEntry.offset;
});
return data;
}
exports.updateList = functions.runWith(runtimeOpts).https.onCall((data, context) => {
const listRef = firestore.collection('lists').doc(data.uid);
return firestore.runTransaction(transaction => {
return transaction.get(listRef).then(listDoc => {
const list = listDoc.data();
try {
const [standard, custom] = JSON.parse(data.diff).reduce((acc, entry) => {
if (entry.custom) {
acc[1].push(entry);
} else {
acc[0].push(entry);
}
return acc;
}, [
[],
[]
]);
applyPatch(list, standard);
applyOffsets(list, custom);
transaction.set(listRef, list);
} catch (e) {
console.log(data.diff);
}
});
});
});
Description
Using a diff library, I was making a diff between previous document and the new updated one, and sending this diff to a GCF that was operating the update using the transaction API.
Benefits of this approach being that since transaction happens inside GCF, it's super fast and doesn't consume too much bandwidth, plus the update only requires a diff to be sent, not the entire document anymore.
Why it didn't work
In reality, the cloud function was really slow and some updates were taking over 2 seconds to be made, they could also fail due to contention, without firestore connector knowing it, so no possibility to ensure data integrity in this case.
I will be edited accordingly to add more solutions if I find other stuff to try
Question
I feel like I'm missing something, like if firestore had something I just didn't know at all that could solve my use case, but I can't figure out what it is, maybe my previously tested solutions were badly implemented or I missed something important. What did I miss? Is it even possible to achieve what I want to do? I am open to data remodeling, query changes, anything, as it's mostly for learning purpose.
You should be able to reduce the bandwidth required to update your documents by using Maps instead of Arrays to store your data. This would allow you to send only the item that is being updated using its key.
I don't know how involved this would be for you to change, but it sounds like less work than the other options.
You said that it's not impossible for your documents to reach 200kb individually. It would be good to keep in mind that Firestore limits document size to 1mb. If you plan on supporting documents beyond that, you will need to find a way to fragment the data.
Regarding your contention issues... You might consider a system that "locks" the document and prevents it from receiving updates while another user is attempting to save. You could use a simple message system built with websockets or Firebase FCM to do this. A client would subscribe to the document's channel, and publish when they are attempting an update. Other clients would then receive a notice that the document is being updated and have to wait before they can save their own changes.
Also, I don't know what the contents of modificationsHistory look like, but that sounds to me like the type of data that you might keep in a subcollection instead.
Of the solutions you tried, the subcollection seems like the most scalable to me. You could look into the possibility of not using onSnapshot listeners and instead create your own event system to notify clients of changes. I suppose it could work similar to the "locking" system I mentioned above. A client sends an event when it updates an item belonging to a document. Other clients subscribed to that document's channel will know to check the database for the newest version.
Your diff-approach appeared mostly sensible, details aside.
You should store items inline, but defer modificationsHistory into a sub collection. For the entire root document, record which elements of modificationsHistory have been merged yet (by timestamp should suffice), and all elements not merged yet, you have to re-apply individually on each client, querying with aforementioned timestamp.
Each entry in modificationsHistory should not describe a single diff, but whenever possible a set of diffs.
Apply changes from modificationsHistory collections onto items in batch, deferred via GCF. You may defer this arbitrarily far, and you may want to exclude modifications performed only in the last few seconds, to account for not established consistency in Firestore. There is no risk of contention, that way.
Cleanup from the modificationsHistory collection has to be deferred even further, until you can be sure that no client has still access to an older revision of the root document. Especially if you consider that the client is not strictly required to update the root document when the listener is triggered.
You may need to reconstruct the patch stack on the client side if modificationsHistory changes in unexpected ways due to eventual consistency constraints. E.g. if you have a total order in the set of patches, you need to re-apply the patch stack from base image if the collection unexpectedly suddenly contains "older" patches unknown to the client before.
All in all, you should be able avoid frequent updates all together, and limit this solely to inserts into to modificationsHistory sub-collection. With bandwidth requirements not exceeding the cost of fetching the entire document once, plus streaming the collection of not-yet-applied patches. No contention expected.
You can tweak for how long clients may ignore hard updates to the root document, and how many changes they may batch client-side before submitting a new diff. Latter is also a tradeof with regard to how many documents another client has to fetch initially, with regard to max-documents-per-query limits.
If you require other information which are likely to suffer from contention, like list of users currently having a specific document open, that should go into sub-collections as well.
Should the latency for seeing changes by other users eventually turn out to be unacceptable, you may opt for an additional, real-time capable data channel for distribution of patches on a specific document. ActiveMQ or some other message broker operated on dedicated resources, running independently from FireStore.

Making POST requests idempotent

I have been looking for a way to design my API so it will be idempotent, meaning that some of that is to make my POST request routes idempotent, and I stumbled upon this article.
(If I have understood something not the way it is, please correct me!)
In it, there is a good explanation of the general idea. but what is lacking are some examples of the way that he implemented it by himself.
Someone asked the writer of the article, how would he guarantee atomicity? so the writer added a code example.
Essentially, in his code example there are two cases,
the flow if everything goes well:
Open a transaction on the db that holds the data that needs to change by the POST request
Inside this transaction, execute the needed change
Set the Idempotency-key key and the value, which is the response to the client, inside the Redis store
Set expire time to that key
Commit the transaction
the flow if something inside the code goes wrong:
and exception inside the flow of the function occurs.
a rollback to the transaction is performed
Notice that the transaction that is opened is for a certain DB, lets call him A.
However, it is not relevant for the redis store that he also uses, meaning that the rollback of the transaction will only affect DB A.
So it covers the case when something happends inside the code that make it impossible to complete the transaction.
But what will happend if the machine, which the code runs on, will crash, while it is in a state when it has already executed the Set expire time to that key and it is now about to run the committing of the transaction?
In that case, the key will be available in the redis store, but the transaction has not been committed.
This will result in a situation where the service is sure that the needed changes have already happen, but they didn't, the machine failed before it could finish it.
I need to design the API in such a way that if the change to the data or setting of the key and value in redis fail, that they will both roll back.
What is the solution to this problem?
How can I guarantee the atomicity of a changing the needed data in one database, and in the same time setting the key and the needed response in redis, and if any of them fails, rollback them both? (Including in a case that a machine crashes in the middle of the actions)
Please add a code example when answering! I'm using the same technologies as in the article (nodejs, redis, mongo - for the data itself)
Thanks :)
Per the code example you shared in your question, the behavior you want is to make sure there was no crash on the server between the moment where the idempotency key was set into the Redis saying this transaction already happened and the moment when the transaction is, in fact, persisted in your database.
However, when using Redis and another database together you have two independent points of failure, and two actions being executed sequentially in different moments (and even if they are executed asynchronously at the same time there is no guarantee the server won’t crash before any of them completed).
What you can do instead is include in your transaction an insert statement to a table holding relevant information on this request, including the idempotent key. As the ACID properties ensure atomicity, it guarantees either all the statements on the transaction to be executed successfully or none of them, which means your idempotency key will be available in your database if the transaction succeeded.
You can still use Redis as it’s gonna provide faster results than your database.
A code example is provided below, but it might be good to think about how relevant is the failure between insert to Redis and database to your business (could it be treated with another strategy?) to avoid over-engineering.
async function execute(idempotentKey) {
try {
// append to the query statement an insert into executions table.
// this will be persisted with the transaction
query = ```
UPDATE firsttable SET ...;
UPDATE secondtable SET ...;
INSERT INTO executions (idempotent_key, success) VALUES (:idempotent_key, true);
```;
const db = await dbConnection();
await db.beginTransaction();
await db.execute(query);
// we're setting a key on redis with a value: "false".
await redisClient.setAsync(idempotentKey, false, 'EX', process.env.KEY_EXPIRE_TIME);
/*
if server crashes exactly here, idempotent key will be on redis with false as value.
in this case, there are two possibilities: commit to database suceeded or not.
if on next request redis provides a false value, query database to verify if transaction was executed.
*/
await db.commit();
// you can now set key value to true, meaning commit suceeded and you won't need to query database to verify that.
await redis.setAsync(idempotentKey, true);
} catch (err) {
await db.rollback();
throw err;
}
}

Saving the same document twice concurrently will override the other

Saving the same document twice concurrently will only save one.
I have this flow in my app:
doc.money = 0
get doc (flow 1)
get doc (flow 2)
change doc.money += 5 (flow 1)
change doc.money += 10 (flow 2)
save doc (flow 1)
save doc (flow 2)
Now my doc.money is equal to 10 instead of 15.
How to fix this problem? Not even an error is thrown..
Update with inc: 5 can't be used in my app because of this:
Logic.js (shared both on client and on server):
var logic = function(doc, options){
doc.a = options.x;
// Some very complex logic here...
}
Server.js
// incoming ajax request
// query database and get a doc
logic(doc, options)
doc.save(...)
Client.js
// I have my doc
logic(doc, options);
// Now I have my logic applied
Benefits?
I only write once the logic.js of my app.
No bugs by forgetting to update some part of the logic.
Classic way
Server.js
// incoming ajax request
// query database and get a doc
// Some very complex logic here...
var update = {/*insert here the complex part*/}
Doc.update(cond, update, ...)
Client.js
// I have my doc
// Some very complex logic here...
// Now I have my logic applied
Conclusions
As you can see, in the classical way, you have your logic twice, in my way only once, and changes reflects both the client and the server side logic.
This is actually nothing to do with with 2 phase commits but rather versioning.
Two separate threads in your application are sending two different versions of the same document down.
The best way to to fix this in any database, including ACID ones, is to use versioning: http://askasya.com/post/trackversions
It's called Race Condition. And it's tricky to solve it in MongoDB as opposed to typical SQL databases. They have a solution (or rather a hack) on cookbook.
Basically, within document you have a state key. For every transaction, you keep tab of it. For example, If state is ready, you can perform the work on it. But first you change the state to pending. Once done, you set it back to ready again. So whichever process first gets to it, changes the state, saves it and then next process works on it. You can extend the idea and make it more fail-safe. Have a look at the cookbook link.