I have a group chat feature in my app that has its messages node structured like this. Currently, it doesn't use the fan-out technique. It just lists all of the messages under the group name e.g. "group1"
groups: {
  group1: {
    -MEt4K5xhsYL33anhXpP: {
      fromUid: "diidssm......."
      userImage: "https://firebasestorage..."
      text: "hello"
      date: 1617919946
      emojis: {
        "heart": 2
        "like": 1
      }
    }
    -MEt8BLP2yMEUMPbG2zV: {
      ...
    }
    -MF-Grpl8Jchxpbn2mxH: {
      ...
    }
    -MF-OUjWXsFh7lBPosMf: {
      ...
    }
  }
}
I first observe the most recent 40 messages and listen for newly added children like this:
ref = Database.database().reference().child("groups").child("group1")
ref.queryLimited(toLast: 40).observe(.childAdded, with: { (snapshot) in
    ...
    // add to messages array to load collection view
    // for each message, observe its emojis node and update emojis to reflect changes e.g. +1 like
    snapshot.ref.child("emojis").observe(.value, with: { (snapshot) in
        ...
    })
})
Every time the user scrolls up I load another 40 messages (and observe the emojis child under each of those message nodes) using the last date (with an index on date in the security rules), like this:
ref.queryOrdered(byChild: "date").queryEnding(beforeValue: prevdate, childKey: messageId).queryLimited(toLast: 40).observeSingleEvent(of: .value, with: { (snapshot) in
I understand the fan-out technique is used to sync less information per listener. If I attach a listener to groups/groupname/ to get a list of all messages for that group, I will also pull down all of the data for each and every message under that node. With the fan-out approach I could instead ask for the message data of only the 40 most recent messages (and the next 40 per scroll-up), using the message keys stored in another node like this:
allGroups: {
  group1: {
    -MEt4K5xhsYL33anhXpP: 1
    -MEt8BLP2yMEUMPbG2zV: 1
    -MF-Grpl8Jchxpbn2mxH: 1
    -MF-OUjWXsFh7lBPosMf: 1
  }
}
However, if I am using queryLimited(toLast: 40) is the fan-out approach beneficial or even necessary? Wouldn't this fix the problem of "I will also ask for all the info of each and every message under that node"?
In terms of checking for new messages, I just check using .childAdded in the first code above (ref.queryLimited(toLast: 40).observe(.childAdded)). According to the post below, queryLimited(toLast: 40) will sync the last 40 child nodes, and keep synchronizing those (removing previous ones as new ones are added).
Some questions about keepSynced(true) on a limited query reference
I'm assuming if group1 had 1000 messages, with this approach I am just reading the 40 most recent messages I need and the next 40 per scroll, thus ignoring the other several hundred. Why would I use the fan-out technique then? Maybe I'm not understanding something fundamental about limited queries.
Side question: Should I be including references to profile images under each message node? Is it bad to do this in terms of Cloud Storage and Realtime Database storage? Ideally there would be hundreds of group chats.
There are a lot of comments on the question, so I thought I would condense all of that into an answer.
The intention of the 'fan out technique' in the question was to maximize query performance.
In this use case the query only returns the last 40 results
ref.queryLimited(toLast: 40)
The assumption in the question was that Firebase had to 'go through' all of the nodes before those 40 to get to the 40, therefore affecting performance. That's not the case with Firebase so whether it be the first 40 or the last 40, the performance is 'the same'.
Because of that, no 'fan-out' is really needed in this situation. For clarity
Fan-out is the process of duplicating data in the database. When
data is duplicated it eliminates slow joins and increases read
performance.
I am going to steal a fan-out example from an old Firebase blog. Here's a fan-out that updates multiple nodes at once; since it's an atomic operation, it either all passes or all fails.
let updatedUser = ["name": "Shannon", "username": "shannonrules"]
let ref = Firebase(url: "https://<YOUR-FIREBASE-APP>.firebaseio.com")
let fanoutObject = ["/users/1": updatedUser,
                    "/usersWhoAreCool/1": updatedUser,
                    "/usersToGiveFreeStuffTo/1": updatedUser]
ref.updateChildValues(fanoutObject) // atomic updating goodness
I will also include a link to Introducing multi-location updates and more as well as suggesting a read on the topic of denormalization.
In the question, there isn't really any data to 'fan out' so it would not be applicable as there isn't an attempt to join (pull data from multiple nodes) or to update multiple nodes.
The one change I would suggest would be to remove the emojis node from the message node.
As is, every one of those messages has its own emoji observer, which results in thousands of observers that can be difficult to manage. I would create a separate high-level node just for those emojis:
emojis
  -MEt4K5xhsYL33anhXpP:   //the message id
    "heart": 2            //or however you want to store them
    "like": 1
Then add a single observer (much easier to manage!) to the emoji node. When an emoji changes, that one observer will notify the app of which message it was for, and what the change was. It will also cut down on reads and overall cost.
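For illustration, here is a minimal sketch of that single observer, written with the Firebase Admin/JavaScript SDK rather than the iOS SDK used in the question; the node path and handler body are assumptions:

import * as admin from "firebase-admin";

admin.initializeApp();

// One listener on the top-level emojis node replaces the per-message observers.
// "child_changed" fires with the message id as the snapshot key.
admin.database().ref("emojis").on("child_changed", (snapshot) => {
  const messageId = snapshot.key;                            // e.g. the push id of the message
  const counts = snapshot.val() as Record<string, number>;   // e.g. { heart: 2, like: 1 }
  // Look up that message in the local messages array and refresh its cell.
  console.log(`Emoji counts changed for ${messageId}`, counts);
});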
Related
Context
I have a database with a collection of documents using this schema (shortened schema because some data is irrelevant to my problem):
{
  title: string;
  order: number;
  ...
  ...
  ...
  modificationsHistory: HistoryEntry[];
  items: ListRow[];
  finalItems: ListRow[];
  ...
  ...
  ...
}
These documents can easily reach 100 or 200 kB, depending on the amount of items and finalItems that they hold. It's also very important that they are updated as fast as possible, with the smallest bandwidth usage possible.
This is inside a web application context, using Angular 9 and @angular/fire 6.0.0.
Problems
When the end user edits one item inside the object's items array, such as editing just a single property, reflecting that inside the database requires me to send the entire object, because Firestore's update method doesn't support array indexes inside the field path; the only operations that can be done on arrays are adding or removing an element, as described in the documentation.
However, updating an element of the items array by sending the entire document performs poorly for anyone without a good connection, which is the case for a lot of my users.
Second issue is that having everything in realtime inside one document makes collaboration hard in my case, because some of these elements can be edited by multiple users at the same time, which creates two issues:
Some write operations may fail due to too much contention on the document if two updates are made in the same second.
The updates are not atomic, since we're sending the entire document at once and not using transactions (to avoid using even more bandwidth).
Solutions I already tried
Subcollections
Description
This was a very simple solution: create a subcollection for items, finalItems and modificationsHistory arrays, making them easy to edit as they now have their own ID so it's easy to reach them to update them.
Why it didn't work
Having a list with 10 finalItems, 30 items and 50 entries inside modificationsHistory means that I need a total of 4 listeners open for one element to be listened to entirely. Considering that a user can have many of these elements open at once, listening to several dozen documents creates an equally bad performance situation, probably even worse under a full user load.
It also means that if I want to update a big element with 100 items and I want to update half of them, it'll cost me one write operation per item, not to mention the read operations needed to check permissions, etc.; probably 3 per write, so 150 reads + 50 writes just to update 50 items in an array.
Cloud Function to update the document
const functions = require('firebase-functions');
const admin = require('firebase-admin');
const { applyPatch } = require('fast-json-patch');

admin.initializeApp();
const firestore = admin.firestore();
const runtimeOpts = { /* runtime options (memory, timeout) configured elsewhere */ };

function applyOffsets(data, entries) {
  entries.forEach(customEntry => {
    const explodedPath = customEntry.path.split('/');
    explodedPath.shift();
    let pointer = data;
    for (let fragment of explodedPath.slice(0, -1)) {
      pointer = pointer[fragment];
    }
    pointer[explodedPath[explodedPath.length - 1]] += customEntry.offset;
  });
  return data;
}

exports.updateList = functions.runWith(runtimeOpts).https.onCall((data, context) => {
  const listRef = firestore.collection('lists').doc(data.uid);
  return firestore.runTransaction(transaction => {
    return transaction.get(listRef).then(listDoc => {
      const list = listDoc.data();
      try {
        // Split the incoming JSON-Patch entries into standard patches and
        // custom "offset" entries that increment numeric fields.
        const [standard, custom] = JSON.parse(data.diff).reduce((acc, entry) => {
          if (entry.custom) {
            acc[1].push(entry);
          } else {
            acc[0].push(entry);
          }
          return acc;
        }, [[], []]);
        applyPatch(list, standard);
        applyOffsets(list, custom);
        transaction.set(listRef, list);
      } catch (e) {
        console.log(data.diff);
      }
    });
  });
});
Description
Using a diff library, I was computing a diff between the previous document and the new, updated one, and sending this diff to a GCF that applied the update using the transaction API.
The benefits of this approach are that, since the transaction happens inside the GCF, it's very fast and doesn't consume much bandwidth, and the update only requires a diff to be sent, not the entire document anymore.
Why it didn't work
In reality, the cloud function was really slow and some updates were taking over 2 seconds to complete. They could also fail due to contention without the Firestore client knowing it, so there was no way to ensure data integrity in this case.
I will edit this accordingly to add more solutions if I find other things to try.
Question
I feel like I'm missing something, like if firestore had something I just didn't know at all that could solve my use case, but I can't figure out what it is, maybe my previously tested solutions were badly implemented or I missed something important. What did I miss? Is it even possible to achieve what I want to do? I am open to data remodeling, query changes, anything, as it's mostly for learning purpose.
You should be able to reduce the bandwidth required to update your documents by using Maps instead of Arrays to store your data. This would allow you to send only the item that is being updated using its key.
I don't know how involved this would be for you to change, but it sounds like less work than the other options.
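As a rough sketch of the map idea (the lists collection and the field names are assumptions; shown with the Admin SDK for brevity, though the same dot-notation update exists in the client SDKs): with items stored as a map keyed by item id, a dot-notation update sends only the changed entry, not the whole document.

import * as admin from "firebase-admin";

admin.initializeApp();
const db = admin.firestore();

// Update a single property of one item inside the "items" map.
async function updateItemTitle(listId: string, itemId: string, title: string) {
  await db.collection("lists").doc(listId).update({
    [`items.${itemId}.title`]: title,   // dot notation targets just this nested field
  });
}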
You said that it's not impossible for your documents to reach 200 kB individually. It would be good to keep in mind that Firestore limits document size to 1 MB. If you plan on supporting documents beyond that, you will need to find a way to fragment the data.
Regarding your contention issues... You might consider a system that "locks" the document and prevents it from receiving updates while another user is attempting to save. You could use a simple message system built with websockets or Firebase FCM to do this. A client would subscribe to the document's channel, and publish when they are attempting an update. Other clients would then receive a notice that the document is being updated and have to wait before they can save their own changes.
Also, I don't know what the contents of modificationsHistory look like, but that sounds to me like the type of data that you might keep in a subcollection instead.
Of the solutions you tried, the subcollection seems like the most scalable to me. You could look into the possibility of not using onSnapshot listeners and instead create your own event system to notify clients of changes. I suppose it could work similar to the "locking" system I mentioned above. A client sends an event when it updates an item belonging to a document. Other clients subscribed to that document's channel will know to check the database for the newest version.
Your diff-approach appeared mostly sensible, details aside.
You should store items inline, but move modificationsHistory into a subcollection. On the root document, record which elements of modificationsHistory have already been merged (a timestamp should suffice); all elements not merged yet have to be re-applied individually on each client, queried using that timestamp.
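A minimal sketch of the client-side part of that idea, under assumed names (a mergedUpTo timestamp on the root document, a diff field on each history entry, and fast-json-patch as in the question's code):

import * as admin from "firebase-admin";
import { applyPatch } from "fast-json-patch";

admin.initializeApp();
const db = admin.firestore();

// Load the root document once, then fetch only the patches not merged into it yet.
async function loadListWithPendingPatches(listId: string) {
  const listRef = db.collection("lists").doc(listId);
  const listSnap = await listRef.get();
  const list = listSnap.data()!;

  const pending = await listRef
    .collection("modificationsHistory")
    .where("timestamp", ">", list.mergedUpTo)
    .orderBy("timestamp")
    .get();

  // Re-apply each not-yet-merged patch on top of the base image (mutates in place).
  pending.forEach((doc) => applyPatch(list, doc.get("diff")));
  return list;
}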
Each entry in modificationsHistory should not describe a single diff, but whenever possible a set of diffs.
Apply changes from the modificationsHistory collection onto items in batch, deferred via a GCF. You may defer this arbitrarily far, and you may want to exclude modifications performed in the last few seconds, to account for consistency not yet being established in Firestore. That way there is no risk of contention.
Cleanup of the modificationsHistory collection has to be deferred even further, until you can be sure that no client still has access to an older revision of the root document. Especially if you consider that the client is not strictly required to update the root document when the listener is triggered.
You may need to reconstruct the patch stack on the client side if modificationsHistory changes in unexpected ways due to eventual consistency constraints. E.g. if you have a total order on the set of patches, you need to re-apply the patch stack from the base image if the collection suddenly contains "older" patches unknown to the client before.
All in all, you should be able to avoid frequent updates altogether, and limit this solely to inserts into the modificationsHistory sub-collection, with bandwidth requirements not exceeding the cost of fetching the entire document once, plus streaming the collection of not-yet-applied patches. No contention expected.
You can tweak how long clients may ignore hard updates to the root document, and how many changes they may batch client-side before submitting a new diff. The latter is also a tradeoff with regard to how many documents another client has to fetch initially, given max-documents-per-query limits.
If you require other information which is likely to suffer from contention, like the list of users currently having a specific document open, that should go into sub-collections as well.
Should the latency for seeing changes by other users eventually turn out to be unacceptable, you may opt for an additional, real-time capable data channel for distribution of patches on a specific document. ActiveMQ or some other message broker operated on dedicated resources, running independently from FireStore.
I'm trying to implement Event Sourcing and CQRS for the first time, but I got stuck when it came to persisting the aggregates.
This is where I'm at now
I've set up "EventStore" and a stream, "foos"
Connected to it from node-eventstore-client
I subscribe to events with catchup
This is all working fine.
With the help of the eventAppeared event handler function I can build the aggregate, whenever events occur. This is great, but what do I do with it?
Let's say I build an aggregate that is a list of Foos:
[
  {
    id: 'some aggregate uuidv5 made from barId and bazId',
    barId: 'qwe',
    bazId: 'rty',
    isActive: true,
    history: [
      {
        id: 'some event uuid',
        data: {
          isActive: true,
        },
        timestamp: 123456788,
        eventType: 'IsActiveUpdated'
      },
      {
        id: 'some event uuid',
        data: {
          barId: 'qwe',
          bazId: 'rty',
        },
        timestamp: 123456789,
        eventType: 'FooCreated'
      }
    ]
  }
]
To follow CQRS I will build the above aggregate within a Read Model, right? But how do I store this aggregate in a database?
I guess just a NoSQL database should be fine for this, but I definitely need a db since I will put a gRPC API in front of this and other read models / aggregates.
But how do I actually go from having built the aggregate to persisting it in the db?
I once tried following this tutorial https://blog.insiderattack.net/implementing-event-sourcing-and-cqrs-pattern-with-mongodb-66991e7b72be which was super simple, since you'd use MongoDB both as the event store and just create a view for the aggregate and update that one when new events are incoming. It had its flaws and limitations (the aggregation pipeline), which is why I've now turned to "EventStore" for the event store part.
But how to persist the aggregate, which is currently just built and stored in code/memory from events in "EventStore"...?
I feel this may be a silly question but do I have to loop over each item in the array and insert each item in the db table/collection or do you somehow have a way to dump the whole array/aggregate there at once?
What happens after? Do you create a materialized view per aggregate and query against that?
I'm open to picking the best db for this, whether that is postgres/other rdbms, mongodb, cassandra, redis, table storage etc.
Last question. For now I'm just using a single stream "foos", but at this level I expect new events to happen quite frequently (every couple of seconds or so) but as I understand it you'd still persist it and update it using materialized views right?
So given that barId and bazId in combination can be used for grouping events, instead of a single stream I'd think more specialized streams such as foos-barId-bazId would be the way to go, to try and reduce the frequency of incoming new events to a point where recreating materialized views will make sense.
Is there a general rule of thumb saying not to recreate/update/refresh materialized views if the update frequency exceeds a certain threshold? Then the only other alternative would be querying from a normal table/collection?
Edit:
In the end I'm trying to make a gRPC api that has just 2 rpcs - one for getting a single foo by id and one for getting all foos (with optional field for filtering by status - but that is not so important). The simplified proto would look something like this:
rpc GetFoo(FooRequest) returns (Foo);
rpc GetFoos(FoosRequest) returns (FoosResponse);

message FooRequest {
  string id = 1; // uuid
}

// If the optional status field is not specified, return all foos
message FoosRequest {
  // If this field is specified only return the Foos that have isActive true or false
  FooStatus status = 1;
  enum FooStatus {
    UNKNOWN = 0;
    ACTIVE = 1;
    INACTIVE = 2;
  }
}

message FoosResponse {
  repeated Foo foos = 1;
}

message Foo {
  string id = 1; // uuid
  string bar_id = 2; // uuid
  string baz_id = 3; // uuid
  bool is_active = 4;
  repeated Event history = 5;
  google.protobuf.Timestamp last_updated = 6;
}

message Event {
  string id = 1; // uuid
  google.protobuf.Any data = 2;
  google.protobuf.Timestamp timestamp = 3;
  string eventType = 4;
}
The incoming events would look something like this:
{
  id: 'some event uuid',
  barId: 'qwe',
  bazId: 'rty',
  timestamp: 123456789,
  eventType: 'FooCreated'
}

{
  id: 'some event uuid',
  isActive: true,
  timestamp: 123456788,
  eventType: 'IsActiveUpdated'
}
As you can see there is no uuid to make it possible to GetFoo(uuid) in the gRPC API, which is why I'll generate a uuidv5 from the barId and bazId, which combined form a valid uuid. I'm doing that in the projection / aggregate you see above.
Also the GetFoos rpc will either return all foos (if status field is left undefined), or alternatively it'll return the foo's that has isActive that matches the status field (if specified).
Yet I can't figure out how to continue from the catchup subscription handler.
I have the events stored in "EventStore" (https://eventstore.com/). Using a subscription with catchup, I have built an aggregate/projection with an array of Foos in the form that I want them. But to be able to get a single Foo by id from a gRPC API of mine, I guess I'll need to store this entire aggregate/projection in a database of some sort, so I can connect and fetch the data from the gRPC API? And every time a new event comes in I'll need to add that event to the database as well, or how does this work?
I think I've read every resource I can possibly find on the internet, but still I'm missing some key pieces of information to figure this out.
The gRPC part is not so important; it could be REST I guess. My big question is how to make the aggregated/projected data available to the API service (possibly more APIs will need it as well)? I guess I will need to store the aggregated/projected data, with the generated uuid and history fields, in a database to be able to fetch it by uuid from the API service, but what database should I use, and how is this storing process done from the catchup event handler where I build the aggregate?
I know exactly how you feel! This is basically what happened to me when I first tried to do CQRS and ES.
I think you have a couple of gaps in your knowledge which I'm sure you will rapidly plug. You hydrate an aggregate from the event stream as you are doing. That IS your aggregate persisted. The read model is something different. Let me explain...
Your read model is the thing you use to run queries against and to provide data for display to a UI, for example. Your aggregates are not (directly) involved in that. In fact they should be encapsulated, meaning that you can't 'see' their state from the outside, i.e. no getters and setters, with the exception of the aggregate ID, which would have a getter.
This article gives you a helpful overview of how it all fits together: CQRS + Event Sourcing – Step by Step
The idea is that when an aggregate changes state it can only do so via an event it generates. You store that event in the event store. That event is also published so that read models can be updated.
Also looking at your aggregate it looks more like a typical read model object or DTO. An aggregate is interested in functionality, not properties. So you would expect to see void public functions for issuing commands to the aggregate. But not public properties like isActive or history.
I hope that makes sense.
EDIT:
Here are some more practical suggestions.
"To follow CQRS I will build the above aggregate within a Read Model, right? "
You do not build aggregates in the read model. They are separate things on separate sides of the CQRS equation. Aggregates are on the command side. Queries are done against read models, which are different from aggregates.
Aggregates have public void functions and no getter or setters (with the exception of the aggregate id). They are encapsulated. They generate events when their state changes as a result of a command being issued. These events are stored in an event store and are used to recover the state of an aggregate. In other words, that is how an aggregate is stored.
The events go on to be published so the event handlers and other processes can react to them and update the read model and or trigger new cascading commands.
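To make that concrete, here is a minimal sketch of what such an encapsulated aggregate might look like; the names (Foo, IsActiveUpdated, activate) are taken from or modelled on the question, and the exact shape is an assumption, not a prescribed API:

interface FooEvent {
  type: string;
  data: Record<string, unknown>;
}

class Foo {
  private isActive = false;
  private uncommitted: FooEvent[] = [];

  constructor(public readonly id: string, history: FooEvent[] = []) {
    // Rehydrate: fold every past event from the event store onto the aggregate.
    history.forEach((e) => this.apply(e));
  }

  // Command: public, returns void, enforces the business rule, emits an event.
  activate(): void {
    if (this.isActive) return;
    this.raise({ type: "IsActiveUpdated", data: { isActive: true } });
  }

  private raise(event: FooEvent): void {
    this.apply(event);
    this.uncommitted.push(event); // to be appended to the "foos" stream afterwards
  }

  private apply(event: FooEvent): void {
    if (event.type === "IsActiveUpdated") {
      this.isActive = event.data.isActive as boolean;
    }
  }

  // The only state exposed besides the id: the events that still need persisting.
  pullUncommittedEvents(): FooEvent[] {
    const events = this.uncommitted;
    this.uncommitted = [];
    return events;
  }
}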
"Last question. For now I'm just using a single stream "foos", but at this level I expect new events to happen quite frequently (every couple of seconds or so) but as I understand it you'd still persist it and update it using materialized views right?"
Every couple of seconds is very likely to be fine. I'm more concerned about the "persist and update using materialised views" part. I don't know exactly what you mean by that, but it doesn't sound like you have the right idea. Views should be very simple read models, with no need for the complex relations you find in an RDBMS, and are therefore highly optimised and fast for reading.
There can be a lot of confusion around all the terminology and jargon used in DDD, CQRS and ES. I think in this case the confusion lies in what you think an aggregate is. You mention that you would like to persist your aggregate as a read model. As @Codescribler mentioned, at the sink end of your event stream there isn't a concept of an aggregate. Concretely, in ES, commands are applied to aggregates in your domain by loading previous events pertaining to that aggregate, rehydrating the aggregate by folding each previous event onto it, and then applying the command, which generates more events to be persisted in the event store.
Downstream, a subscribing process receives all the events in order and builds a read model based on the events and the data contained within. The confusion here is that this read model, at this end, is not an aggregate per se. It might very well look exactly like your aggregate at the domain end, or it could be a read model that doesn't use all the events and/or all the event data.
For example, you may choose to use every bit of information and build a read model that looks exactly like the aggregate hydrated up to the newest event (likely your source of confusion). You may instead have another process that builds a read model that only tallies a specific type of event. You might even subscribe to multiple streams and "join" them into a big read model.
As for how to store it, this is really up to you. It seems to me like you are taking the events and rebuilding your aggregate plus a history of events in an in-memory structure. This, of course, doesn't scale, which is why you want to store it at rest in a database. I wouldn't use the in-memory structure, since you would need to do a lot of state diffing when you flush to the database. You should modify the database directly in response to each individual event. Ideally, you also transactionally store the stream position with said modification so you don't process the same event again in the case of a failure.
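A rough sketch of such a projection handler, using MongoDB purely for illustration; the collection name, the derived fooId and the lastPosition checkpoint field are assumptions:

import { MongoClient } from "mongodb";

type FooDoc = {
  _id: string;
  barId?: string;
  bazId?: string;
  isActive?: boolean;
  lastPosition?: number;
};

const client = new MongoClient("mongodb://localhost:27017");
const foos = client.db("readmodels").collection<FooDoc>("foos");

// Called from the catch-up subscription for every event, in stream order.
async function onEvent(event: any, streamPosition: number) {
  const fooId = event.fooId; // e.g. uuidv5 derived from barId and bazId, as in the question

  if (event.eventType === "FooCreated") {
    await foos.updateOne(
      { _id: fooId },
      {
        $setOnInsert: { barId: event.barId, bazId: event.bazId, isActive: false },
        $set: { lastPosition: streamPosition },
      },
      { upsert: true }
    );
  } else if (event.eventType === "IsActiveUpdated") {
    // The lastPosition guard makes replays after a crash idempotent.
    await foos.updateOne(
      { _id: fooId, lastPosition: { $lt: streamPosition } },
      { $set: { isActive: event.isActive, lastPosition: streamPosition } }
    );
  }
}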
Hope this helps a bit.
I am trying to build an app where I just have these 3 models:
topic (has just a title (max 100 chars.))
comment (has text (may be very long), author_id, topic_id, createdDate)
author (has just a username)
Actually a very simple db structure. A Topic may have many comments, which are created by authors. And an author may have many comments.
I am still trying to figure out the best way of designing the database structure (documents). First I thought of giving each model its own schema/collection as above: 3 documents. But since this is a NoSQL db, I should actually try to eliminate the need for joins. And now I am really thinking of putting everything into a single document, which also sounds crazy.
These are my actually queries from ui:
Homepage query: Listing all the topics, which have received the most comments today (will run very often)
Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
Main page of a topic query: Listing all the comments of a topic, with their authors' username.
Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this:
Comment (text, username, topic_title, createdDate)
This way I will not need any joins, but I will also store, e.g., the title of a topic multiple times: once in every comment.
I just could not decide.
I appreciate any help.
You can do the second design you suggested but it all comes down to how you want to use the data. I assume you’re going to be using it for a website.
If you want the comments to be clickable, such that clicking on the topic name redirects to the topic's page or clicking the username redirects to the user's page where you can see all their comments, I suggest you keep them as IDs, since you can later use .populate("field1 field2") and select the fields you would like to get from each referenced document.
Alternatively you can store both the topic_name and username and their IDs in the same document to reduce queries, but you would end up storing more redundant data.
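As a small illustration of the reference-plus-populate approach, sketched with Mongoose (the schema and field names are assumptions):

import mongoose, { Schema } from "mongoose";

// Comments only hold references; populate() resolves them on read.
const commentSchema = new Schema({
  text: String,
  createdDate: { type: Date, default: Date.now },
  author: { type: Schema.Types.ObjectId, ref: "Author" },
  topic: { type: Schema.Types.ObjectId, ref: "Topic" },
});
const Comment = mongoose.model("Comment", commentSchema);

// All comments of a topic, with the author's username and the topic title pulled in.
async function commentsForTopic(topicId: string) {
  return Comment.find({ topic: topicId })
    .populate("author", "username")
    .populate("topic", "title");
}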
Revised design:
The three queries (in the question post) are likely to be like this (pseudo-code):
select all topics from comments, where date is today, group by topic and count comments, order by count (desc)
select topics from comments, where topic matches search, group by topic.
select all from comments, where topic matches topic_param, order by comment_date (desc).
So, as you had intended (in your question post) it is likely there will be one main collection, comments.
comments:
date
author
text
topic
The user and topic collections with one field each, are optional, to maintain uniqueness.
Note the group-by queries will be aggregation queries, for example, the main query will be like this:
db.comments.aggregate( [
  { $match: { date: ISODate("2019-11-15") } },
  { $group: { _id: "$topic", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
] )
This will give you all the topic names for today, with the highest-counted topics first.
You could also take a slightly different approach. Storing information redundantly is not a bad thing in all cases.
1. Homepage query: Listing all the topics, which have received the most comments today (will run very often)
You could implement this as two extra fields in your Topic entity: one storing the last date a comment was added and a second counting the number of comments added that day. By doing so you do not need a join and can write a query that only looks at the Topic collection.
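For example, a sketch of that bookkeeping with the MongoDB Node driver (the lastCommentDay and commentsToday field names are assumptions):

import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const topics = client.db("forum").collection("topics");

async function recordComment(topicId: ObjectId) {
  const today = new Date().toISOString().slice(0, 10); // day bucket, e.g. "2019-11-15"

  // Reset the counter if the last comment was on an earlier day...
  await topics.updateOne(
    { _id: topicId, lastCommentDay: { $ne: today } },
    { $set: { lastCommentDay: today, commentsToday: 0 } }
  );
  // ...then count the new comment.
  await topics.updateOne({ _id: topicId }, { $inc: { commentsToday: 1 } });
}

The homepage query then becomes a find on topics where lastCommentDay equals today, sorted by commentsToday descending.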
You could also store these statistics independently of the other data and update them when required. Think of this as having a document that describes the current state of your database (at least those parts relevant to you).
This might give you a time penalty on storing information but it improves reading times.
2. Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
As far as I understand this one, you only need the topic titles, meaning you can query the database once and retrieve all titles. If the collection grows so big that this becomes slow, you could change the retrieval query so that it only returns a subset (a user is not likely to go through 100 possible topics).
3. Main page of a topic query: Listing all the comments of a topic, with their authors' username.
This is actually the tricky one. If this is really what you want to do, then you are most likely best off storing all the data in one document. However, I would ask you: what is the problem with making more than one query? I doubt you will be showing all comments at once when there are thousands (as you say). Instead of storing each comment in a separate document or throwing them all into one document, you could also bucket them and retrieve only the 20 most recent ones (if you create buckets of size 20). Read more about the bucket pattern here, and update the ones shown when required.
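A minimal sketch of that bucket pattern with the MongoDB Node driver (the commentBuckets collection, the bucket size of 20 and the field names are assumptions):

import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const buckets = client.db("forum").collection("commentBuckets");

async function addComment(topicId: ObjectId, comment: { author: string; text: string }) {
  // Push into a bucket for this topic that still has room;
  // the upsert starts a new bucket once all existing ones are full.
  await buckets.updateOne(
    { topicId, count: { $lt: 20 } },
    {
      $push: { comments: { ...comment, createdDate: new Date() } },
      $inc: { count: 1 },
    },
    { upsert: true }
  );
}

Showing a topic page is then a query for that topic's most recent bucket(s) rather than for thousands of individual comment documents.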
You said:
"Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this..."
I"ll make an argument from a 'domain driven design' point of view.
Given that all your data exists within the same bounded context (business domain). Then it is acceptable to encapsulate it all within the same document!
Transactions in Cloud Firestore support getting a document using transaction.getDocument, but even though there is a .getDocuments method on queries, there doesn't seem to be a way to get multiple documents that works with a transaction.
I have a Yelp-like app using a Cloud Firestore database with the following structure:
- Places to rate are called spots.
- Each spot has a document in the spots collection (identified by a unique documentID).
- Each spot can have a reviews collection containing all reviews for that spot.
- Each review is identified by its own unique documentID, and each review document contains a rating of the spot.
Below is an image of my Cloud Firestore setup with some data.
I've tried to create a transaction getting the data for all of the reviews of a spot, with the hope that I could then make an updated calculation of the average review and save this back out to a property of the spot document. I've tried using:
let db = Firestore.firestore()
db.runTransaction({ (transaction, errorPointer) -> Any? in
    let ref = db.collection("spots").document(self.documentID).collection("reviews")
    guard let document = try? transaction.getDocuments(ref) else {
        print("*** ERROR trying to get document for ref = \(ref)")
        return nil
    }
…
Xcode states:
Value of type ‘Transaction’ has no member ‘getDocuments’.
There is a getDocument, which that one can use to get a single document (see https://firebase.google.com/docs/firestore/manage-data/transactions).
Is it possible to get a collection of documents in a transaction? I wanted to do this because each place I'm rating (spot) has an averageRating, and whenever there's a change to one of the ratings, I want to call a function that:
- starts a transaction (done)
- reads in all of the current reviews for that spot (can't get to work)
- calculates the new averageRating
- updates the spot with the new averageRating value.
I know Google's FriendlyEats uses a technique where each change is applied to the current average rating value, but I'd prefer to make a precise re-calculation with each change to keep numerical precision (even if it's a bit more expensive w/an additional query).
Thanks for advice.
No. Client libraries do not allow you to make queries inside of transactions. You can only request specific individual documents inside a transaction. You could do something really hacky, like run the query outside of the transaction, then request every individual document inside the transaction, but I would not recommend that.
What might be better is to run this on the server side. Like, say, with a Cloud Function, which does allow you to run queries inside transactions. More importantly, you no longer have to trust the client to update the average review score for a restaurant, which is a Bad Thing.
That said, I still might recommend using a Cloud Function that does some of the same logic that Friendly Eats does, where you say something along the lines of: new average = (old average × old number of reviews + new review) / (old number of reviews + 1). It'll make sure you're not performing excessive reads if your app gets really popular.
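For illustration, a sketch of that incremental calculation as a Cloud Function, written in TypeScript with the Admin SDK; the numRatings and avgRating field names are assumptions:

import * as functions from "firebase-functions";
import * as admin from "firebase-admin";

admin.initializeApp();
const db = admin.firestore();

export const onReviewCreated = functions.firestore
  .document("spots/{spotId}/reviews/{reviewId}")
  .onCreate((snapshot, context) => {
    const rating = snapshot.get("rating") as number;
    const spotRef = db.collection("spots").doc(context.params.spotId);

    return db.runTransaction(async (tx) => {
      const spot = await tx.get(spotRef);
      const numRatings = (spot.get("numRatings") ?? 0) as number;
      const avgRating = (spot.get("avgRating") ?? 0) as number;
      // Incremental update: no need to re-read every review document.
      const newAvg = (avgRating * numRatings + rating) / (numRatings + 1);
      tx.update(spotRef, { numRatings: numRatings + 1, avgRating: newAvg });
    });
  });

If you prefer the exact re-calculation described in the question, the same function could instead query the reviews subcollection inside the transaction, since server-side transactions do allow queries.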
I have a website with 500k users (running on SQL Server 2008). I want to now include activity streams of users and their friends. After testing a few things on SQL Server it became apparent that an RDBMS is not a good choice for this kind of feature; it's slow (even when I heavily de-normalized my data). So after looking at other NoSQL solutions, I've figured out that I can use MongoDB for this. I'll be following a data structure based on activitystrea.ms
json specifications for activity stream
So my question is: what would be the best schema design for an activity stream in MongoDB? (With this many users you can pretty much predict that it will be very heavy on writes, hence my choice of MongoDB with its great write performance.) I've thought about 3 types of structures; please tell me if these make sense or whether I should use other schema patterns.
1 - Store each activity with all friends/followers in this pattern:
{
  _id: 'activ123',
  actor: {
    id: person1
  },
  verb: 'follow',
  object: {
    objecttype: 'person',
    id: 'person2'
  },
  updatedon: Date(),
  consumers: [
    person3, person4, person5, person6, ... so on
  ]
}
2 - Second design: collection name activity_stream_fanout
{
  _id: 'activ_fanout_123',
  personId: person3,
  activities: [
    {
      _id: 'activ123',
      actor: {
        id: person1
      },
      verb: 'follow',
      object: {
        objecttype: 'person',
        id: 'person2'
      },
      updatedon: Date(),
    },
    {
      //activity feed 2
    }
  ]
}
3 - This approach would be to store the activity items in one collection, and the consumers in another. In activities, you might have a document like:
{ _id: "123",
actor: { person: "UserABC" },
verb: "follow",
object: { person: "someone_else" },
updatedOn: Date(...)
}
And then, for followers, I would have the following "notifications" documents:
{ activityId: "123", consumer: "someguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "otherguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "thirdguy", updatedOn: Date(...) }
Your answers are greatly appreciated.
I'd go with the following structure:
Use one collection for all actions that happend, Actions
Use another collection for who follows whom, Subscribers
Use a third collection, Newsfeed, for a certain user's news feed; items are fanned out from the Actions collection.
The Newsfeed collection will be populated by a worker process that asynchronously processes new Actions. Therefore, news feeds won't populate in real-time. I disagree with Geert-Jan in that real-time is important; I believe most users don't care for even a minute of delay in most (not all) applications (for real time, I'd choose a completely different architecture).
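As a rough sketch of what that worker's fan-out step could look like (the subscribers and newsfeed collections and the followee/follower field names are assumptions):

import { MongoClient, ObjectId } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const db = client.db("activity");

// For each new action, copy one newsfeed item per subscriber of the actor.
async function fanOut(action: { _id: ObjectId; actor: string; verb: string; object: unknown; updatedOn: Date }) {
  const followers = await db.collection("subscribers")
    .find({ followee: action.actor })
    .project({ follower: 1 })
    .toArray();

  if (followers.length === 0) return;

  await db.collection("newsfeed").insertMany(
    followers.map((f) => ({
      consumer: f.follower,
      activityId: action._id,
      verb: action.verb,
      object: action.object,
      updatedOn: action.updatedOn,
    }))
  );
}

Each consumer's feed then becomes a simple indexed query on newsfeed by consumer, sorted by updatedOn.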
If you have a very large number of consumers, the fan-out can take a while, true. On the other hand, putting the consumers right into the object won't work with very large follower counts either, and it will create overly large objects that take up a lot of index space.
Most importantly, however, the fan-out design is much more flexible and allows relevancy scoring, filtering, etc. I have just recently written a blog post about news feed schema design with MongoDB where I explain some of that flexibility in greater detail.
Speaking of flexibility, I'd be careful about that activitystrea.ms spec. It seems to make sense as a specification for interop between different providers, but I wouldn't store all that verbose information in my database as long as you don't intend to aggregate activities from various applications.
I believe you should look at your access patterns: what queries are you likely to perform most on this data, etc.
To me, the use case that needs to be fastest is being able to push a certain activity to the 'wall' (in FB terms) of each of the 'activity consumers', and to do it immediately when the activity comes in.
From this standpoint (I haven't given it much thought) I'd go with 1, since 2 seems to batch activities for a certain user before processing them? Thereby it fails the 'immediate' need of updates. Moreover, I don't see the advantage of 3 over 1 for this use case.
Some enhancements on 1? Ask yourself if you really need the flexibility of defining an array of consumers for every activity. Is there really a need to specify this at such a fine-grained scale? Instead, wouldn't a reference to the 'friends' of the 'actor' suffice? This would save a lot of space in the long run, since I see the consumers array being the bulk of the entire message for each activity, when consumers typically range in the hundreds.
on a somewhat related note: depending on how you might want to implement realtime notifications for these activity streams, it might be worth looking at Pusher - http://pusher.com/ and similar solutions.
hth