Contention-friendly database architecture for large documents and inner arrays

Contention-friendly database architecture for large documents and inner arrays - google-cloud-firestore

Context
I have a database with a collection of documents using this schema (shortened schema because some data is irrelevant to my problem):
{
title: string;
order: number;
...
...
...
modificationsHistory: HistoryEntry[];
items: ListRow[];
finalItems: ListRow[];
...
...
...
}
These documents can easily reach 100 or 200 kB, depending on the amount of items and finalItems that they hold. It's also very important that they are updated as fast as possible, with the smallest bandwidth usage possible.
This is inside a web application context, using Angular 9 and #angular/fire 6.0.0.
Problems
When the end user edits one item inside the object's item array, like editing just a property, reflecting that inside the database requires me to send the entire object, because firestore's update method doesn't support array indexes inside the field path, the only operations that can be done on arrays are adding or deleting an element as described inside documentation.
However, updating an element of the items array by sending the entire document creates poor performances for anyone without a good connection, which is the case for a lot of my users.
Second issue is that having everything in realtime inside one document makes collaboration hard in my case, because some of these elements can be edited by multiple users at the same time, which creates two issues:
Some write operations may fail due to too much contention on the document if two updates are made in the same second.
The updates are not atomic as we're sending the entire document at once, as it doesn't use transactions to avoid using bandwidth even more.
Solutions I already tried
Subcollections
Description
This was a very simple solution: create a subcollection for items, finalItems and modificationsHistory arrays, making them easy to edit as they now have their own ID so it's easy to reach them to update them.
Why it didn't work
Having a list with 10 finalItems, 30 items and 50 entries inside modificationsHistory means that I need to have a total of 4 listeners opened for one element to be listened entirely. Considering the fact that a user can have many of these elements opened at once, having several dozens of documents being listened creates an equally bad performance situation, probably even worse in a full user case.
It also means that if I want to update a big element with 100 items and I want to update half of them, it'll cost me one write operation per item, not to mention the amount of read operations needed to check permissions, etc, probably 3 per write so 150 read + 50 write just to update 50 items in an array.
Cloud Function to update the document
const {
applyPatch
} = require('fast-json-patch');
function applyOffsets(data, entries) {
entries.forEach(customEntry => {
const explodedPath = customEntry.path.split('/');
explodedPath.shift();
let pointer = data;
for (let fragment of explodedPath.slice(0, -1)) {
pointer = pointer[fragment];
}
pointer[explodedPath[explodedPath.length - 1]] += customEntry.offset;
});
return data;
}
exports.updateList = functions.runWith(runtimeOpts).https.onCall((data, context) => {
const listRef = firestore.collection('lists').doc(data.uid);
return firestore.runTransaction(transaction => {
return transaction.get(listRef).then(listDoc => {
const list = listDoc.data();
try {
const [standard, custom] = JSON.parse(data.diff).reduce((acc, entry) => {
if (entry.custom) {
acc[1].push(entry);
} else {
acc[0].push(entry);
}
return acc;
}, [
[],
[]
]);
applyPatch(list, standard);
applyOffsets(list, custom);
transaction.set(listRef, list);
} catch (e) {
console.log(data.diff);
}
});
});
});
Description
Using a diff library, I was making a diff between previous document and the new updated one, and sending this diff to a GCF that was operating the update using the transaction API.
Benefits of this approach being that since transaction happens inside GCF, it's super fast and doesn't consume too much bandwidth, plus the update only requires a diff to be sent, not the entire document anymore.
Why it didn't work
In reality, the cloud function was really slow and some updates were taking over 2 seconds to be made, they could also fail due to contention, without firestore connector knowing it, so no possibility to ensure data integrity in this case.
I will be edited accordingly to add more solutions if I find other stuff to try
Question
I feel like I'm missing something, like if firestore had something I just didn't know at all that could solve my use case, but I can't figure out what it is, maybe my previously tested solutions were badly implemented or I missed something important. What did I miss? Is it even possible to achieve what I want to do? I am open to data remodeling, query changes, anything, as it's mostly for learning purpose.

You should be able to reduce the bandwidth required to update your documents by using Maps instead of Arrays to store your data. This would allow you to send only the item that is being updated using its key.
I don't know how involved this would be for you to change, but it sounds like less work than the other options.
You said that it's not impossible for your documents to reach 200kb individually. It would be good to keep in mind that Firestore limits document size to 1mb. If you plan on supporting documents beyond that, you will need to find a way to fragment the data.
Regarding your contention issues... You might consider a system that "locks" the document and prevents it from receiving updates while another user is attempting to save. You could use a simple message system built with websockets or Firebase FCM to do this. A client would subscribe to the document's channel, and publish when they are attempting an update. Other clients would then receive a notice that the document is being updated and have to wait before they can save their own changes.
Also, I don't know what the contents of modificationsHistory look like, but that sounds to me like the type of data that you might keep in a subcollection instead.
Of the solutions you tried, the subcollection seems like the most scalable to me. You could look into the possibility of not using onSnapshot listeners and instead create your own event system to notify clients of changes. I suppose it could work similar to the "locking" system I mentioned above. A client sends an event when it updates an item belonging to a document. Other clients subscribed to that document's channel will know to check the database for the newest version.

Your diff-approach appeared mostly sensible, details aside.
You should store items inline, but defer modificationsHistory into a sub collection. For the entire root document, record which elements of modificationsHistory have been merged yet (by timestamp should suffice), and all elements not merged yet, you have to re-apply individually on each client, querying with aforementioned timestamp.
Each entry in modificationsHistory should not describe a single diff, but whenever possible a set of diffs.
Apply changes from modificationsHistory collections onto items in batch, deferred via GCF. You may defer this arbitrarily far, and you may want to exclude modifications performed only in the last few seconds, to account for not established consistency in Firestore. There is no risk of contention, that way.
Cleanup from the modificationsHistory collection has to be deferred even further, until you can be sure that no client has still access to an older revision of the root document. Especially if you consider that the client is not strictly required to update the root document when the listener is triggered.
You may need to reconstruct the patch stack on the client side if modificationsHistory changes in unexpected ways due to eventual consistency constraints. E.g. if you have a total order in the set of patches, you need to re-apply the patch stack from base image if the collection unexpectedly suddenly contains "older" patches unknown to the client before.
All in all, you should be able avoid frequent updates all together, and limit this solely to inserts into to modificationsHistory sub-collection. With bandwidth requirements not exceeding the cost of fetching the entire document once, plus streaming the collection of not-yet-applied patches. No contention expected.
You can tweak for how long clients may ignore hard updates to the root document, and how many changes they may batch client-side before submitting a new diff. Latter is also a tradeof with regard to how many documents another client has to fetch initially, with regard to max-documents-per-query limits.
If you require other information which are likely to suffer from contention, like list of users currently having a specific document open, that should go into sub-collections as well.
Should the latency for seeing changes by other users eventually turn out to be unacceptable, you may opt for an additional, real-time capable data channel for distribution of patches on a specific document. ActiveMQ or some other message broker operated on dedicated resources, running independently from FireStore.

Related

MongoDB - select document for update - without another operation modifying the same document after the select

I've got a document that needs to be read and updated. Meanwhile, it's quite likely that another process is doing the same which would break the document update.
For example, if Process A reads document d and adds field 'a' to it and writes the document, and Process B reads document d before Process A writes it, and adds field b and writes the document, then whichever process writes the changes out will get their change because it clobbers the change by the one that wrote first.
I've read this article and some other very complicated transaction articles around mongo. Can someone describe a simple solution to this - I have not come across something that makes me comfortable with this yet.
https://www.mongodb.com/blog/post/how-to-select--for-update-inside-mongodb-transactions
[UPDATE]- In addition, I'm trying to augment a document that might not yet exist. I need to create the document if it doesn't exist. I also need to read it to analyze it. One key is "relatedIds" (an array). I push to that array if the id is not found in it. Another method I have that needs to create the document if it doesn't exist adds to a separate collection of objects.
[ANOTHER UPDATE x2] --> From what I've been reading and getting from various sources - is that the only way to properly create a transaction for this - is to "findOneAndModify" the document to mark it as dirty with some field that will definitely update, such as "lock" with an objectId (since that will never result in a NO-OP - ie, it definitely causes a change).
If another operation tries to write to it, Mongo can now detect that this record is already part of a transaction.
Thus anything that writes to it will cause a writeError on that other operation. My transaction can then slowly work on that record and have a lock on it. When it writes it out and commits, that record is definitely not touched by anything else. If there's no way to do this without a transaction for some reason, then am I creating the transaction in the easiest way here?

Using Mongo's transactions is the "proper" way to go but i'll offer a simple solution that is sufficient ( with some caveats ).
The simplest solution would be to use findOneAndUpdate to read the document and update a new field, let's call it status, since it is atomic this is possible.
the query would look like so:
const doc = await db.collection.findOneAndUpdate(
{
_id: docId,
status: { $ne: 'processing' }
},
{
$set: {
status: 'processing'
}
}
);
so if dov.value is null then it means (assuming the document exists) that another process is processing it. When you finish processing you just need to reset status to be any other value.
Now because you are inherently locking this document from being read until the process finishes you have to make sure that you handle cases like an error thrown throughout the process, update failure, db connection issue's, etc .
Overall I would be cautious about using this approach as it will only "lock" the document for the "proper" queries ( every single process needs to be updated to use the status field ), which is a little problematic, depending on your usecase.

Determining the type and structure of the database for nested data without references

I have a task: to store user messages from 3 messengers in the database. It is also necessary to exclude the possibility of re-sending the same message.
Accordingly, constant requests to verify the existence of a similar message and add new messages are assumed. I was going to to use a nested structure like:
messenger_name:
sender_id:
recipient_id:
message_hashes
It seems to me, that a document-oriented database like Mongo should be suitable for this. But I do not know how to correctly divide everything into levels.
If I make a collection for each messenger, with a file for each sender, then the files will quickly become large.
Perhaps you advise a more correct approach, or even a different storage system.

I would make 3 different collections for 3 different messengers (as message is assumed to be same if sent from same messenger again).
Inside each collection, each document will represent 1 sender of this messenger and will have structure like,
{
senderId,
[
{
receipentId1,
[messageHash1,messageHash2...]
},
{
receipentId2,
[messageHash1,messageHash2...]
}
}
Then I will create index on the given fields(for faster retrieval to check case of message already exist).
Number of documents in this collection will be not more than the number of users of the app.

How to optimize collection subscription in Meteor?

I'm working on a filtered live search module with Meteor.js.
Usecase & problem:
A user wants to do a search through all the users to find friends. But I cannot afford for each user to ask the complete users collection. The user filter the search using checkboxes. I'd like to subscribe to the matched users. What is the best way to do it ?
I guess it would be better to create the query client-side, then send it the the method to get back the desired set of users. But, I wonder : when the filtering criteria changes, does the new subscription erase all of the old one ? Because, if I do a first search which return me [usr1, usr3, usr5], and after that a search that return me [usr2, usr4], the best would be to keep the first set and simply add the new one to it on the client-side suscribed collection.
And, in addition, if then I do a third research wich should return me [usr1, usr3, usr2, usr4], the autorunned subscription would not send me anything as I already have the whole result set in my collection.
The goal is to spare processing and data transfer from the server.
I have some ideas, but I haven't coded enough of it yet to share it in a easily comprehensive way.
How would you advice me to do to be the more relevant possible in term of time and performance saving ?
Thanks you all.
David

It depends on your application, but you'll probably send a non-empty string to a publisher which uses that string to search the users collection for matching names. For example:
Meteor.publish('usersByName', function(search) {
check(search, String);
// make sure the user is logged in and that search is sufficiently long
if (!(this.userId && search.length > 2))
return [];
// search by case insensitive regular expression
var selector = {username: new RegExp(search, 'i')};
// only publish the necessary fields
var options = {fields: {username: 1}};
return Meteor.users.find(selector, options);
});
Also see common mistakes for why we limit the fields.
performance
Meteor is clever enough to keep track of the current document set that each client has for each publisher. When the publisher reruns, it knows to only send the difference between the sets. So the situation you described above is already taken care of for you.
If you were subscribed for users: 1,2,3
Then you restarted the subscription for users 2,3,4
The server would send a removed message for 1 and an added message for 4.
Note this will not happen if you stopped the subscription prior to rerunning it.
To my knowledge, there isn't a way to avoid removed messages when modifying the parameters for a single subscription. I can think of two possible (but tricky) alternatives:
Accumulate the intersection of all prior search queries and use that when subscribing. For example, if a user searched for {height: 5} and then searched for {eyes: 'blue'} you could subscribe with {height: 5, eyes: 'blue'}. This may be hard to implement on the client, but it should accomplish what you want with the minimum network traffic.
Accumulate active subscriptions. Rather than modifying the existing subscription each time the user modifies the search, start a new subscription for the new set of documents, and push the subscription handle to an array. When the template is destroyed, you'll need to iterate through all of the handles and call stop() on them. This should work, but it will consume more resources (both network and server memory + CPU).
Before attempting either of these solutions, I'd recommend benchmarking the worst case scenario without using them. My main concern is that without fairly tight controls, you could end up publishing the entire users collection after successive searches.

If you want to go easy on your server, you'll want to send as little data to the client as possible. That means every document you send to the client that is NOT a friend is waste. So let's eliminate all that waste.
Collect your filters (eg filters = {sex: 'Male', state: 'Oregon'}). Then call a method to search based on your filter (eg Users.find(filters). Additionally, you can run your own proprietary ranking algorithm to determine the % chance that a person is a friend. Maybe base it off of distance from ip address (or from phone GPS history), mutual friends, etc. This will pay dividends in efficiency in a bit. Index things like GPS coords or other highly unique attributes, maybe try out composite indexes. But remember more indexes means slower writes.
Now you've got a cursor with all possible friends, ranked from most likely to least likely.
Next, change your subscription to match those friends, but put a limit:20 on there. Also, only send over the fields you need. That way, if a user wants to skip this step, you only wasted sending 20 partial docs over the wire. Then, have an infinite scroll or 'load more' button the user can click. When they load more, it's an additive subscription, so it's not resending duplicate info. Discover Meteor describes this pattern in great detail, so I won't.
After a few clicks/scrolls, the user won't find any more friends (because you were smart & sorted them) so they will stop trying & move on to the next step. If you returned 200 possible friends & they stop trying after 60, you just saved 140 docs from going through the pipeline. There's your efficiency.

Atomic get and delete in memcached?

Is there a way to do atomic get-and-delete in memcached?
In other words, I want to get the value for a key if it exists and delete it immediately, so this value can be read once and only once.
I think this pseudocode might work, but note the caveat postscript:
# When setting:
SET key-0 value
SET key-ns 0
# When getting:
ns = INCR key-ns
GET key-{ns - 1}
Constraint: I have millions of keys that could be accessed millions of times, and only a small percentage will have a value set at any given time. I don't want to have to update an atomic counter for every key with every get access request as above.

The canonical, but yet generic, answer to your question is : lock free hash table with a relaxed memory model.
The more relaxed is your memory model the more you get with a good lock free design, it's a way to get more performance out of the same chipset.
Here is a talk about that, I don't think that it's possible to answer to your question with a single post on hash tables and lock free programming, I'm not even trying to do that.

You cannot do this with memcached in a single command since there is no api that supports exactly what your asking for. What I would do to get the behavior your looking for is to implement some sort of marking behavior to signify that another client has or hasn't read the data. For example, you could create a JSON document as follows:
{
"data": "value",
"used": false
}
When you get the item check to see if it has already been used by another client by examining the used field. If it hasn't been used then set the value using the cas you got from the GET command and make sure that the document is updated to reflect the fact that a client has already accessed this key.
If the set operation fails because the cas is invalid then this means that another client has obtained this item and already updated it in memcached to signify that it has been used. In this case you just cancel whatever you were doing with the item and move on.
If the set operation succeeds then this means you client is the sole owner of this data. You can now delete it from memcached and do whatever processing on it you like.
Note that when doing the set I would also add an expiration time of about 5 seconds. This way if you application crashes your documents will clean themselves up if you don't finish with the entire process of deleting them.

To put some code to the answer from #mikewied, I think the basic gist is... (using Node.js):
var Memcached = require('memcached');
var memcache = new Memcached('localhost:11211');
var getOnce = function(key, callback) {
// gets is the check-and-set get (vs regular get)
memcache.gets(key, function(err, data) {
if (!data) {
// Cache miss, nothing to see here.
callback(null);
} else {
var yourData = data[key];
// Do a check-and-set to remove the data from the cache.
// This sets the value to null *only* if no one else already did.
memcache.cas(key, null /* new data */, data.cas, 10, function(err) {
if (err) {
// Check-and-set failed! (Here we'll treat it like a cache miss)
yourData = null;
}
callback(yourData);
});
}
});
};

I'm not an expert on Memcached and so I may be wrong. My answer is from reading the documentation and my experience using Memcached.
IMO this is not possible with memcached's current implementation.
to demonstrate why this is not possible currently here is a simple example to demonstrate the race condition:
two processes start at the same time
both execute a get/delete at the same time
memcached replies to both get commands at the same time
done (the desired result was to have get/delete execute atomically then the second get/delete to fail. instead memcached did get, get, delete, fails to delete)
to get an atomic get/delete would require:
a new command for memcached that is atomic let's call it get_delete
some sort of synchronization lock method of all the memcached clients to ensure both the get and delete commands are executed while the lock is held
so all clients would grab the synchronization lock whenever they need to enter the critcal section (i.e. get, delete) then release the lock after the critical section

How to guard against repeated request?

we have a button in a web game for the users to collect reward. That should only be clicked once, and upon receiving the request, we'll mark it collected in DB.
we've already blocked the buttons in the client from repeated clicking. But that won't help if people resend the package multiple times to our server in short period of time.
what I want is a method to block this from server side.
we're using Playframework 2 (2.0.3-RC2) for server side and so far it's stateless, I'm tempted to use a Set to guard like this:
if processingSet has userId then BadRequest
else put userId in processingSet and handle request
after that remove userId from that Set
but then I'd have to face problem like Updating Scala collections thread-safely and still fail to block the user once we have more than one server behind load balancing.
one possibility I'm thinking about is to have a table in DB in place of the processingSet above, but that would incur 1+ DB operation per request, are there any better solution~?
thanks~

Additional DB operation is relatively 'cheap' solution in that case. You should use it if you'e planning to save the buttons state permanently.
If the button is disabled only for some period of time (for an example until the game is over) you can also consider using the cache API however keep in mind that's not dedicated for solutions which should be stored for long time (it should not be considered as DB alternative).

Given that you're using Mongo and so don't have transactions spanning separate collections, I think you can probably implement this guard using an atomic operation - namely "Update if current", which is effectively CompareAndSwap.
Assuming you've got a collection like "rewards" which has a "collected" attribute, you can update the collected flag to true only if it is currently false and if that operation doesn't fail you can proceed to apply the reward knowing that for any other requests the same operation will fail.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse