Determining the type and structure of the database for nested data without references - mongodb

I have a task: store user messages from 3 messengers in a database, and make it impossible for the same message to be sent twice.
Accordingly, the workload consists of constant requests to check whether a given message already exists and to add new messages. I was going to use a nested structure like:
messenger_name:
    sender_id:
        recipient_id:
            message_hashes
It seems to me that a document-oriented database like MongoDB should be suitable for this, but I don't know how to divide everything into levels correctly.
If I make a collection for each messenger, with a document for each sender, those documents will quickly become large.
Perhaps you can advise a better approach, or even a different storage system.

I would make 3 different collections for the 3 different messengers (since a message is only considered a duplicate if it is sent again from the same messenger).
Inside each collection, each document will represent one sender on that messenger and will have a structure like:
{
    senderId,
    recipients: [
        {
            recipientId1,
            messageHashes: [messageHash1, messageHash2, ...]
        },
        {
            recipientId2,
            messageHashes: [messageHash1, messageHash2, ...]
        }
    ]
}
Then I will create an index on those fields (for faster retrieval when checking whether a message already exists).
The number of documents in each collection will be no more than the number of users of the app.
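For illustration, a duplicate check against that structure might look roughly like this with the Node.js MongoDB driver (the collection name 'telegram' is an assumption; the field names follow the structure sketched above):
const { MongoClient } = require('mongodb');

// Index supporting the duplicate check (create once, e.g. at startup).
async function ensureIndexes(db) {
  await db.collection('telegram').createIndex({ senderId: 1, 'recipients.hashes': 1 });
}

// Returns true if this sender already sent a message with this hash to this recipient.
async function messageAlreadySent(db, senderId, recipientId, messageHash) {
  const match = await db.collection('telegram').findOne({
    senderId,
    recipients: { $elemMatch: { recipientId, hashes: messageHash } },
  });
  return match !== null;
}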


Contention-friendly database architecture for large documents and inner arrays

Context
I have a database with a collection of documents using this schema (shortened schema because some data is irrelevant to my problem):
{
title: string;
order: number;
...
...
...
modificationsHistory: HistoryEntry[];
items: ListRow[];
finalItems: ListRow[];
...
...
...
}
These documents can easily reach 100 or 200 kB, depending on the number of items and finalItems they hold. It's also very important that they are updated as fast as possible, with the smallest possible bandwidth usage.
This is inside a web application context, using Angular 9 and #angular/fire 6.0.0.
Problems
When the end user edits one item inside the object's items array, such as editing a single property, reflecting that in the database requires me to send the entire object, because Firestore's update method doesn't support array indexes inside the field path; the only operations that can be done on arrays are adding or removing an element, as described in the documentation.
However, updating an element of the items array by sending the entire document results in poor performance for anyone without a good connection, which is the case for a lot of my users.
Second issue is that having everything in realtime inside one document makes collaboration hard in my case, because some of these elements can be edited by multiple users at the same time, which creates two issues:
Some write operations may fail due to too much contention on the document if two updates are made in the same second.
The updates are not atomic, since we're sending the entire document at once and not using transactions (to avoid consuming even more bandwidth).
Solutions I already tried
Subcollections
Description
This was a very simple solution: create a subcollection for the items, finalItems and modificationsHistory arrays, making them easy to edit since each entry now has its own ID, so it's easy to reach and update.
Why it didn't work
Having a list with 10 finalItems, 30 items and 50 entries inside modificationsHistory means that I need a total of 4 listeners open for one element to be listened to entirely. Considering that a user can have many of these elements open at once, having several dozen documents being listened to creates an equally bad performance situation, probably even worse at full scale.
It also means that if I want to update a big element with 100 items and I want to update half of them, it'll cost me one write operation per item, not to mention the read operations needed to check permissions, etc., probably 3 per write, so 150 reads + 50 writes just to update 50 items in an array.
Cloud Function to update the document
const functions = require('firebase-functions');
const admin = require('firebase-admin');
const { applyPatch } = require('fast-json-patch');

admin.initializeApp();
const firestore = admin.firestore();
// runtimeOpts (memory/timeout options for the function) is assumed to be defined elsewhere.

// Custom entries carry numeric offsets rather than absolute values, so they are
// applied on top of the standard JSON patch.
function applyOffsets(data, entries) {
  entries.forEach(customEntry => {
    const explodedPath = customEntry.path.split('/');
    explodedPath.shift();
    // walk down to the parent of the targeted property
    let pointer = data;
    for (let fragment of explodedPath.slice(0, -1)) {
      pointer = pointer[fragment];
    }
    pointer[explodedPath[explodedPath.length - 1]] += customEntry.offset;
  });
  return data;
}

exports.updateList = functions.runWith(runtimeOpts).https.onCall((data, context) => {
  const listRef = firestore.collection('lists').doc(data.uid);
  return firestore.runTransaction(transaction => {
    return transaction.get(listRef).then(listDoc => {
      const list = listDoc.data();
      try {
        // split the incoming diff into standard JSON-patch entries and custom offset entries
        const [standard, custom] = JSON.parse(data.diff).reduce((acc, entry) => {
          if (entry.custom) {
            acc[1].push(entry);
          } else {
            acc[0].push(entry);
          }
          return acc;
        }, [[], []]);
        applyPatch(list, standard);
        applyOffsets(list, custom);
        transaction.set(listRef, list);
      } catch (e) {
        console.log(data.diff);
      }
    });
  });
});
Description
Using a diff library, I was computing a diff between the previous document and the new updated one, and sending that diff to a GCF that applied the update using the transaction API.
The benefit of this approach is that, since the transaction happens inside the GCF, it's fast and doesn't consume too much bandwidth; the update only requires a diff to be sent, not the entire document anymore.
Why it didn't work
In reality, the cloud function was really slow and some updates took over 2 seconds to complete. They could also fail due to contention without the Firestore connector knowing it, so there was no way to ensure data integrity in that case.
This will be edited accordingly if I find other solutions to try.
Question
I feel like I'm missing something, as if Firestore had a feature I simply don't know about that could solve my use case, but I can't figure out what it is; maybe my previously tested solutions were badly implemented or I missed something important. What did I miss? Is it even possible to achieve what I want to do? I am open to data remodeling, query changes, anything, as it's mostly for learning purposes.
You should be able to reduce the bandwidth required to update your documents by using Maps instead of Arrays to store your data. This would allow you to send only the item that is being updated using its key.
I don't know how involved this would be for you to change, but it sounds like less work than the other options.
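As a rough sketch of what that could look like with the namespaced Firestore web SDK, assuming items is re-modeled as a map keyed by item id (the 'lists' collection comes from the Cloud Function in the question; itemId and title are illustrative names):
const listRef = firebase.firestore().collection('lists').doc(listId);

// With items stored as a map, a single property of a single item can be
// addressed via a field path, so only this value is sent over the wire.
await listRef.update({
  [`items.${itemId}.title`]: newTitle,
});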
You said that it's not impossible for your documents to reach 200kb individually. It would be good to keep in mind that Firestore limits document size to 1mb. If you plan on supporting documents beyond that, you will need to find a way to fragment the data.
Regarding your contention issues... You might consider a system that "locks" the document and prevents it from receiving updates while another user is attempting to save. You could use a simple message system built with websockets or Firebase FCM to do this. A client would subscribe to the document's channel, and publish when they are attempting an update. Other clients would then receive a notice that the document is being updated and have to wait before they can save their own changes.
Also, I don't know what the contents of modificationsHistory look like, but that sounds to me like the type of data that you might keep in a subcollection instead.
Of the solutions you tried, the subcollection seems like the most scalable to me. You could look into the possibility of not using onSnapshot listeners and instead create your own event system to notify clients of changes. I suppose it could work similar to the "locking" system I mentioned above. A client sends an event when it updates an item belonging to a document. Other clients subscribed to that document's channel will know to check the database for the newest version.
Your diff-approach appeared mostly sensible, details aside.
You should store items inline, but defer modificationsHistory into a subcollection. On the root document, record which elements of modificationsHistory have already been merged (a timestamp should suffice); all elements not yet merged have to be re-applied individually on each client, queried using that timestamp.
Each entry in modificationsHistory should not describe a single diff, but whenever possible a set of diffs.
Apply changes from the modificationsHistory collection onto items in batches, deferred via a GCF. You may defer this arbitrarily far, and you may want to exclude modifications performed in the last few seconds, to account for consistency not yet being established in Firestore. There is no risk of contention that way.
Cleanup of the modificationsHistory collection has to be deferred even further, until you can be sure that no client still has access to an older revision of the root document. Especially if you consider that the client is not strictly required to update the root document when the listener is triggered.
You may need to reconstruct the patch stack on the client side if modificationsHistory changes in unexpected ways due to eventual-consistency constraints. E.g. if you have a total order on the set of patches, you need to re-apply the patch stack from the base image if the collection suddenly contains "older" patches previously unknown to the client.
All in all, you should be able to avoid frequent updates altogether and limit writes solely to inserts into the modificationsHistory subcollection, with bandwidth requirements not exceeding the cost of fetching the entire document once plus streaming the collection of not-yet-applied patches. No contention is expected.
You can tweak how long clients may ignore hard updates to the root document, and how many changes they may batch client-side before submitting a new diff. The latter is also a trade-off with regard to how many documents another client has to fetch initially, given max-documents-per-query limits.
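To make that concrete, here is a rough client-side sketch of the pattern under those assumptions (the 'lists' collection comes from the Cloud Function above; the modificationsHistory subcollection layout, the createdAt/lastMergedAt fields and the render helper are illustrative, not prescribed):
const { applyPatch } = require('fast-json-patch');
const listRef = firebase.firestore().collection('lists').doc(listId);

// 1. A client submits a set of diffs as one entry; the root document is untouched,
//    so there is no contention on it.
await listRef.collection('modificationsHistory').add({
  patch: JSON.stringify(diffs), // a batch of JSON-patch operations
  createdAt: firebase.firestore.FieldValue.serverTimestamp(),
});

// 2. Other clients stream only the patches not yet merged into the root document
//    (lastMergedAt would be recorded on the root document by the deferred GCF).
listRef.collection('modificationsHistory')
  .where('createdAt', '>', lastMergedAt)
  .orderBy('createdAt')
  .onSnapshot(snapshot => {
    // Rebuild from the last fetched base image and re-apply the whole patch
    // stack in order, which also covers "older" patches appearing late.
    const view = JSON.parse(JSON.stringify(baseList));
    snapshot.docs.forEach(d => applyPatch(view, JSON.parse(d.data().patch)));
    render(view);
  });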
If you require other information that is likely to suffer from contention, like the list of users currently having a specific document open, that should go into subcollections as well.
Should the latency for seeing changes by other users eventually turn out to be unacceptable, you may opt for an additional, real-time-capable data channel for distributing patches on a specific document, e.g. ActiveMQ or some other message broker operated on dedicated resources, running independently from Firestore.

How to model data in NoSQL Firebase datastore?

I want to store the following data:
Users,
Events,
Attendees
(similar to Firebase's example given here:
https://www.youtube.com/watch?v=ran_Ylug7AE)
My Firebase store is like the following:
Users - Collection
{
"9582940055" :
{
"name" : "test"
}
}
Every user is a different document. Am I doing it correctly?
If yes: I have used each user's mobile number as the document ID instead of an auto ID, since the mobile number is going to be unique and will help me in querying. Is this right?
Events - Collection
{
    "MkyzuARd8Uelh0qD1WMa" : // auto id for every event
    {
        "name" : "test",
        "attendees" : {
            "user": 'Lakshay'
        }
    }
}
Here, I have kept attendees as a map inside the Event document. Is that right, or should I make Attendees a subcollection inside the Event document?
Also, "user": 'Lakshay' inside "attendees" is just a string. Is it advisable to use Firestore's reference data type instead?
Every user is a different document. Am I doing it correctly?
Yes, this is quite common. Initially it may seem a bit weird to have documents with so little data, but over time you'll get used to it (and likely add more data to each user).
I have used each user's mobile number as the document ID instead of an auto ID, since the mobile number is going to be unique and will help me in querying. Is this right?
If the number is unique for each user in the context of your app, then it can be used to identify users, and thus also as the ID of the user profile documents. It is slightly more idiomatic to use the user's UID for this purpose, since that is the most common way to look up a user. But if you use phone numbers for that and they are unique for each user, you can also use that.
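To illustrate the lookup this enables (values taken from the question's example):
// With the phone number as the document ID, loading a profile is a direct
// document get rather than a query.
const snapshot = await firebase.firestore().collection('Users').doc('9582940055').get();
console.log(snapshot.data()); // { name: "test" }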
Here, I have kept attendees as a map inside the Event document. Is that right, or should I make Attendees a subcollection inside the Event document?
That depends...
Storing the events for a user in a single document means you have a limit to how many events you can store for a user, as a document can be no bigger than 1MB.
Storing the events for a user in their document means you always read the data for a user's events, even when you maybe only need to have the user's name. So you'll be reading more data than needed, wasting bandwidth for both you and your users.
Storing the events inside a subcollection allows you to query them, and read a subset of the events of a user.
On the other hand, using a subcollection means you end up reading more, smaller documents. So you'd be paying for more document reads from a subcollection, while paying less for bandwidth.
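A small sketch of the subcollection option, just to make the trade-off concrete (the 'events' subcollection name and the 'date' field are assumptions):
const eventsRef = firebase.firestore()
  .collection('Users').doc('9582940055')
  .collection('events');

// Read only a subset of the user's events instead of the whole user document;
// each returned event is a separate (billed) document read.
const upcoming = await eventsRef
  .where('date', '>=', new Date())
  .orderBy('date')
  .limit(10)
  .get();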
Also, "user": 'Lakshay' inside "attendees" is just a string. Is it advisable to use reference data type of Firebase?
This makes fairly little difference, as there's not a lot of extra functionality that Firestore's DocumentReference field type gives.

Links vs References in Document databases

I am confused by the term 'link' for connecting documents.
The OrientDB page http://www.orientechnologies.com/orientdb-vs-mongodb/ states that OrientDB uses links to connect documents, while in MongoDB documents are embedded.
Since in MongoDB (http://docs.mongodb.org/manual/core/data-modeling-introduction/) documents can be referenced as well, I cannot see the difference between linking documents and referencing them.
The goal of document-oriented databases is to reduce "impedance mismatch", which is the degree to which data has to be split up to match some sort of database schema, compared with the actual objects residing in memory at runtime. By using a document, the entire object is serialized to disk without the need to split it up across multiple tables and join it back together when retrieved.
That being said, a linked document is the same as a referenced document; they are simply two ways of saying the same thing. How those links are resolved at query time varies from one database implementation to another.
An embedded document, on the other hand, is simply an object type that somehow relates to a parent type being stored inside the parent. For example, I have a class as follows:
class User
{
    string Name
    List<Achievement> Achievements
}
Where Achievement is an arbitrary class (its contents don't matter for this example).
If I were to save this using linked documents, I would save User in a Users collection and Achievement in an Achievements collection with the List of Achievements for the user being links to the Achievement objects in the Achievements collection. This requires some sort of joining procedure to happen in the database engine itself. However, if you use embedded documents, you would simply save User in a Users collection where Achievements is inside the User document.
A JSON representation of the data for an embedded document would look (roughly) like this:
{
    "name": "John Q Taxpayer",
    "achievements":
    [
        {
            "name": "High Score",
            "point": 10000
        },
        {
            "name": "Low Score",
            "point": -10000
        }
    ]
}
Whereas a linked document might look something like this:
{
    "name": "John Q Taxpayer",
    "achievements":
    [
        "somelink1", "somelink2"
    ]
}
Inside an Achievements Collection
{
    "somelink1":
    {
        "name": "High Score",
        "point": 10000
    },
    "somelink2":
    {
        "name": "Low Score",
        "point": -10000
    }
}
Keep in mind these are just approximate representations.
So to summarize, linked documents function much like RDBMS PK/FK relationships. This allows multiple documents in one collection to reference a single document in another collection, which can help deduplicate stored data. However, it adds a layer of complexity, requiring the database engine to make multiple disk I/O calls to form the final document returned to user code. An embedded document more closely matches the object in memory; this reduces impedance mismatch and (in theory) reduces the number of disk I/O calls.
You can read up on Impedance Mismatch here: http://en.wikipedia.org/wiki/Object-relational_impedance_mismatch
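To make the difference concrete in MongoDB terms, here is a rough sketch with the Node.js driver (collection names 'users' and 'achievements' are assumptions for illustration):
// Embedded: one query returns the user together with their achievements.
const embeddedUser = await db.collection('users').findOne({ name: 'John Q Taxpayer' });

// Linked/referenced: the user document only holds achievement ids, so they must
// be resolved with a second query ...
const linkedUser = await db.collection('users').findOne({ name: 'John Q Taxpayer' });
const achievements = await db.collection('achievements')
  .find({ _id: { $in: linkedUser.achievements } })
  .toArray();

// ... or with a server-side aggregation "join" ($lookup).
const joined = await db.collection('users').aggregate([
  { $match: { name: 'John Q Taxpayer' } },
  { $lookup: {
      from: 'achievements',
      localField: 'achievements',
      foreignField: '_id',
      as: 'achievements',
  } },
]).toArray();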
UPDATE
I should add that choosing the right database for your needs is very important from the start. If you have a lot of questions about each database, it might make sense to contact each vendor and get some of their training material. MongoDB offers 2 free courses you can take to learn more about their product and its best uses at MongoDB University. OrientDB does offer training, but it is not free. It might be best to contact them directly and get some sort of pre-sales training (if you are looking to license the DB); usually they will put you in touch with a pre-sales consultant to help you evaluate their product.
MongoDB references work like in an RDBMS, where the object id acts like a foreign key. This means a "JOIN" that is expensive at run time. OrientDB, instead, has direct links that are created only once and have a very low run-time cost.

Structuring cassandra database

I don't understand one thing about Cassandra. Say, I have similar website to Facebook, where people can share, like, comment, upload images and so on.
Now, let's say, I want to get all of the things my friends did:
Username1 liked your comment
username 2 updated his profile picture
And so on.
So after a lot of reading, I guess what I would need to do is create a new column family for each single thing, for example: user_likes, user_comments, user_shares. Basically anything you can think of. And even after I do that, I would still need to create secondary indexes for most of the columns just so I could search the data? And even then, how would I know which users are my friends? Would I need to first get all of my friends' ids and then search through all of those column families for each user id?
EDIT
OK, so I did some more reading and now I understand things a little bit better, but I still can't really figure out how to structure my tables, so I will set a bounty; I want a clear example of what my tables should look like if I want to store and retrieve data in this kind of order:
All
Likes
Comments
Favourites
Downloads
Shares
Messages
So let's say I want to retrieve the ten last uploaded files of all my friends or the people I follow; this is how it would look:
John uploaded song AC/DC - Back in Black 10 mins ago
And everything like comments and shares would be similar to that...
Now, probably the biggest challenge would be to retrieve the 10 last things from all categories together, so the list would be a mix of all of them...
Now, I don't need an answer with fully detailed tables; I just need a really clear example of how I would structure and retrieve data the way I would do in MySQL with joins.
With SQL, you structure your tables to normalize your data and use indexes and joins to query. With Cassandra, you can't do that, so you structure your tables to serve your queries, which requires denormalization.
You want to query items that your friends uploaded; one way to do this is to have a single table per user and write to this table whenever a friend of that user uploads something.
friendUploads { #column family
    userid { #column
        timestamp-upload-id : null #key : no value
    }
}
as an example,
friendUploads {
    userA {
        12313-upload5 : null
        12512-upload6 : null
        13512-upload8 : null
    }
}
friendUploads {
    userB {
        11313-upload3 : null
        12512-upload6 : null
    }
}
Note that upload 6 is duplicated to two different columns, as whoever did upload6 is a friend of both User A and user B.
Now, to query the friend-uploads feed for a user, do a getSlice with a limit of 10 on that userid. This will return the first 10 items, sorted by key.
To put newest items first, use a reverse comparator that sorts larger timestamps before smaller timestamps.
The drawback to this approach is that when user A uploads a song, you have to do N writes to update friendUploads, where N is the number of people who are friends of user A.
For the value associated with each timestamp-upload-id key, you can store enough information to display the results (probably in a json blob), or you can store nothing, and fetch the upload information using the uploadid.
To avoid duplicating writes, you can use a structure like:
userUploads { #column family
    userid { #column
        timestamp-upload-id : null #key : no value
    }
}
This stores the uploads of a particular user. Now, when you want to display the uploads of user B's friends, you have to do N queries, one for each friend of user B, and merge the results in your application. This is slower to query, but faster to write.
Most likely, if users can have thousands of friends, you would use the first scheme and do more writes rather than more queries, as you can do the writes in the background after the user uploads, while the queries have to happen while the user is waiting.
As an example of denormalization, look at how many writes Twitter's Rainbird does when a single click occurs. Each write is used to support a single query.
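For reference, here is a rough modern-CQL translation of the first (fan-out-on-write) scheme, driven from Node.js with the DataStax cassandra-driver (keyspace, table and column names are illustrative):
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'social',
});

// One wide partition per user; newest uploads first thanks to the clustering order.
await client.execute(`
  CREATE TABLE IF NOT EXISTS friend_uploads (
    user_id     text,
    upload_time timeuuid,
    upload_id   text,
    PRIMARY KEY (user_id, upload_time)
  ) WITH CLUSTERING ORDER BY (upload_time DESC)`);

// On upload: N background writes, one per friend of the uploader.
for (const friendId of friendsOfUploader) {
  await client.execute(
    'INSERT INTO friend_uploads (user_id, upload_time, upload_id) VALUES (?, now(), ?)',
    [friendId, uploadId],
    { prepare: true });
}

// On read: the equivalent of the getSlice above, a single-partition query with LIMIT 10.
const feed = await client.execute(
  'SELECT upload_id FROM friend_uploads WHERE user_id = ? LIMIT 10',
  [userId],
  { prepare: true });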
In some regards, you "can" treat NoSQL as a relational store. In others, you can denormalize to make things faster. For instance, PlayOrm's #OneToMany stores the "many" side like so:
user1 -> friend.user23, friend.user25, friend.user56, friend.user87
This is the wide-row approach, so when you find your user, you have all the foreign keys to his friends. Each row can be a different length. You may also store a reverse reference, so the user might have references to the people that marked him as a friend but whom he did not mark back (let's call them buddies), so you might have
user1 -> friend.user23, friend.user25, buddy.user29, buddy.user37
Notice that if designed right, you may NOT need to "search" for the data. That said, with PlayOrm, you can still do Scalable SQL and do joins (you just have to figure out how to partition your tables so it can scale to trillions of rows).
A row can have millions of columns in it, or it could have just 10. We are actually in the process of updating a lot of the documentation in PlayOrm and the NoSQL patterns this month, so if you keep an eye on that, you can also learn more about general NoSQL there as well.
Dean
Think of each DB query as a request to a service running on another machine. Your goal is to minimize the number of these requests (because each request requires a network round trip).
Here is the main difference from the RDBMS paradigm: in SQL you would typically use joins and secondary indexes. In Cassandra, joins aren't possible, since related data would reside on different servers. Things like materialized views are used in Cassandra for the same purpose (to fetch all related data with a single query).
I'd recommend reading this article:
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
Also look at the twissandra sample project https://github.com/twissandra/twissandra
It is a nice collection of optimization techniques for the kind of project you described.

What's the best way to store references to a gridFS file in a document?

I'm using MongoDB to store user profiles, and now I want to use GridFS to store a picture for each profile.
The two ways I'm comparing for linking the two documents are:
A) Store a reference to the file ID in the user's image field:
User:
{
"_id": ObjectId('[user_id here]'),
"username": 'myusername',
"image": ObjectId('[file_id here]')
}
B) Store a reference to the user in the file's metadata:
File metadata:
{
"user_id": ObjectId('[user_id here]')
}
I know in a lot of ways it's up to me and dependent on the particulars of the app (it'll be mobile, if that helps), but I'm just wondering if there's any universal benefit to doing it one way or the other?
The answer here really depends on your application's usage pattern. My assumption (feel free to correct me) is that the most likely pattern is something like this:
Look Up User --> Find User --> Display Profile (Fetch Picture)
With this generalized use case, with method A you find the user document (to display the profile), which contains the image object ID, and you subsequently fetch the file using that ID (2 basic operations and you are done).
Note: I am treating the actual fetching of the file from GridFS as a single logical operation; in reality there are multiple operations involved, but most of the drivers/APIs obscure this anyway.
With method B, you are going to have to find the user document, then do another query to find the relevant user_id in the file metadata collection, then go fetch the file. By my count, that is three operations (an extra find you do not have with method A).
Does that make sense?
Of course, if my assumption is incorrect and your application is (for example) image driven, then your query pattern may come up with a different answer.
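As a rough sketch of method A's read path with the Node.js driver (the 'users' collection name and the default GridFS bucket are assumptions):
const { GridFSBucket } = require('mongodb');

async function loadProfilePicture(db, userId) {
  // Operation 1: find the user document (needed anyway to display the profile).
  const user = await db.collection('users').findOne({ _id: userId });

  // Operation 2: fetch the picture directly, using the file id stored in the
  // user's "image" field; no extra lookup is needed.
  const bucket = new GridFSBucket(db); // default bucket name "fs"
  return bucket.openDownloadStream(user.image); // readable stream of the file
}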