How to handle circular documents in MongoDB/DynamoDB?

Currently the site is using a relational database (MySQL); however, joining all the data takes too long and has required caching, which has led to other issues.
The issue is how the two tables would nest into each other creating a circular reference. A simple example would be two tables, one for an ACTOR and a second for a MOVIE. The movie would have the actor and the actor would have a movie. Obviously this is easy in a relational database.
So for example, an ACTOR schema:
ACTOR1
- AGE
- BIO
- MOVIES
- FILM1 (ties to the FILM1 document)
- FILM2
Then the MOVIE schema:
FILM1
- RELEASE DATE
- ACTORS
- ACTOR1 (ties back to the ACTOR document)
- ACTOR2
Speed is the most important thing to me. I can easily add IDs to an ACTOR document in place of the full MOVIE document. However, I'm then back to multiple calls. Are there any features in a NoSQL database like MongoDB or DynamoDB that could solve this in a single call? Or is NoSQL just not the right choice?

While NoSQL generally recommends denormalization of data models, it is best not to have an unbounded list in a single database entry. To model this data in DynamoDB, you should use an adjacency list for modeling the many-to-many relationship. There's no cost-effective way of modeling the data, that I know of, to allow you to get all the data you want in a single call. However, you have said that speed is most important (without giving a latency requirement), so I will try to give you an idea as to how fast you can get the data if stored in DynamoDB.
Your schemas would become something like this:
Actor {
ActorId, <-- This is the application/database id, not the actor's actual ID
Name,
Age,
Bio
}
Film {
FilmId, <-- This is the application/database id for the film
Title,
Description,
ReleaseDate
}
ActedIn {
ActorId,
FilmId
}
To indicate that an actor acted in a movie, you only need to perform one write (which is consistently single-digit milliseconds using DynamoDB in my experience) to add an ActedIn item to your table.
To get all the movies for an actor, you would need to query once to get all the ActedIn relationships, and then do a batch read to get all the movies. Typical latency for a query (in my experience) is under 10 ms, depending on network speed and the amount of data being sent over the network. Since the ActedIn relationship is such a small object, I think you could expect an average case of 5 ms for a query, if your query originates from something that is also running in an AWS datacenter (EC2, Lambda, etc.).
Getting a single item is going to be under 5 ms, and you can do that in parallel. There's also a BatchGetItem API, but I don't have any statistics for you on that.
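The two-step read described above (one query over the relationship items, then one batch read) can be sketched with plain in-memory data. This is only an illustration under assumptions: the dictionaries stand in for DynamoDB items, and the function names and sample values are hypothetical, not part of any AWS SDK.

```python
# In-memory stand-ins for the item types; a real table would live in DynamoDB.
films = {
    "film1": {"Title": "Film One", "ReleaseDate": "2001-01-01"},
    "film2": {"Title": "Film Two", "ReleaseDate": "2005-06-01"},
}
# One small item per ActedIn relationship: (ActorId, FilmId).
acted_in = [("actor1", "film1"), ("actor1", "film2"), ("actor2", "film1")]


def query_film_ids(actor_id):
    """Step 1: a single query over the ActedIn relationships for one actor."""
    return [film_id for a_id, film_id in acted_in if a_id == actor_id]


def batch_get_films(film_ids):
    """Step 2: a single batch read for the film items found in step 1."""
    return [films[film_id] for film_id in film_ids if film_id in films]


def films_for_actor(actor_id):
    """Two round trips in total, regardless of how many films the actor has."""
    return batch_get_films(query_film_ids(actor_id))
```

Calling films_for_actor("actor1") returns both film items, and the same pair of steps works in the other direction (actors for a film) if the relationship items can also be queried by FilmId.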
So, is ~10ms fast enough for you?
If not, you can use DAX, which adds a caching layer to DynamoDB and promises request latency of <1ms.
What's the unmaintainable, not-cost-effective way to do this in a single call?
For every ActedIn relationship, store your data like this:
ActedIn {
ActorId,
ActorName,
ActorAge,
ActorBio,
FilmId,
FilmTitle,
FilmDescription,
FilmReleaseDate
}
You only need to make one query for any given Actor to get all of their film details, and only one query to get all the Actor details for a given film. Don't actually do this. The duplicated data means that every time you have to update the details for an Actor, you need to update it for every Film they were in, and similarly for Film details. This will be an operational nightmare.
I'm not convinced; it seems like NoSQL is terrible for this.
You should remember that NoSQL comes in many varieties (NoSQL = Not Only SQL), and so even if one NoSQL solution doesn't work for you, you shouldn't rule it out entirely. If you absolutely need this in a single call, you should consider using a Graph database (which is another type of NoSQL database).

Related

Do we perform queries to the event store? When and how?

I am new to event sourcing, but as far as I have understood, when we have a command use case, we instantiate an aggregate in memory, apply events to it from the event store so that it is in the correct state, make the proper changes, and then store those changes back to the event store. We also have a read model store that will eventually be updated by these changes.
In my case I have a CreateUserUseCase (which is a command use case) and I want to first check if the user already exists and if the username is already taken. For example something like this:
const userAlreadyExists = await this.userRepo.exists(email);
if (userAlreadyExists) {
  return new EmailAlreadyExistsError(email);
}
const alreadyCreatedUserByUserName = await this.userRepo
  .getUserByUserName(username);
if (alreadyCreatedUserByUserName) {
  return new UsernameTakenError(username);
}
const user = new User(username, password, email);
await this.userRepo.save(user);
So, for the save method I would use the event store and append the uncommitted events to it. What about the exists and getUserByUserName methods though? On the one hand, I want to make a specific query, so I could use my read model store to get the data that I need; on the other hand, this seems to contradict CQRS. So what do we do in these cases? Do we, in some way, perform queries against the event store? And in what way do we do this?
Thank you in advance!
CQRS shouldn't be interpreted as "don't query the write model" (because the process of determining state from the write model for the purpose of command processing entails a query, this restriction is untenable). Instead, interpret it as "it's perfectly acceptable to have a different data model for a query than the one you use for handling intentions to update". This formulation implies that if the write model is a good fit for a given query, it's OK to execute the query against the write model.
Event sourcing in turn is arguably (especially in conjunction with certain usage styles) the ultimate in data models that optimize for write vs. read and accordingly the event-sourced model makes nearly all queries outside of a fairly small set so inefficient that some form of CQRS is needed.
The query facilities an event store includes are typically limited, but the one query that any suitable event store will have (because it's needed for replaying the events) is a compound query that amounts to "give me the latest snapshot for that entity and either (if the snapshot exists) the first n events after that snapshot or (if no snapshot exists) the first n events for that entity". The result of that query is dispositive (modulo things like retention) of the question "has this entity published events?"
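As a rough sketch of that existence check, the snippet below reads at most one event from an entity's stream and treats a non-empty result as "this entity has published events". InMemoryEventStore, read_stream and entity_exists are hypothetical names for illustration, not the API of any real event store, snapshots are left out, and the check only helps if the stream id is derived from the value being checked (for example the email).

```python
from collections import defaultdict


class InMemoryEventStore:
    """Toy event store: one append-only list of events per stream id."""

    def __init__(self):
        self._streams = defaultdict(list)

    def append(self, stream_id, event):
        self._streams[stream_id].append(event)

    def read_stream(self, stream_id, max_count=100):
        # .get avoids creating an empty stream as a side effect of reading.
        return self._streams.get(stream_id, [])[:max_count]


def entity_exists(store, stream_id):
    """An entity exists if its stream contains at least one event."""
    return len(store.read_stream(stream_id, max_count=1)) > 0


store = InMemoryEventStore()
store.append("user-alice@example.com", {"type": "UserCreated"})
```

With this data, entity_exists(store, "user-alice@example.com") is true and any other stream id gives false. A second uniqueness rule, such as the username, would need its own stream or a different kind of lookup.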

Should I use one projection per entity or category?

I am coding a new application using a CQRS+ES architecture with Event Store DB. In my app, I have the following streams:
user-1
user-2
user-3
...
Each stream contains all events regarding a given user.
I am now creating a projection called user-account, which consists of basic data regarding my user's account (like first name, email, and others).
What is the optimal way to design that projection?
Should I have a separate projection for each user, creating projections called:
user-account-1
user-account-2
user-account-3
...
Or a single projection for all user accounts, as a key-value record (which may store millions of keys in the future)?
You can go with one stream per user. Projections are like dimensions. A user can exist in different "dimensions" (CDC naming) and have a different shape in each.
Read https://www.eventstore.com/blog/the-cost-of-creating-a-stream
First, subscribing to individual streams (aggregate or entity streams) won't ever work. You will end up with thousands of subscriptions, which sit there doing nothing (how often do the user details change?).
The category stream is one way to go: you project all the events for all the users. Not only do you need just one subscription for all your users, but you also get more interesting possibilities, like "users pending activation" or "blocked users" projections.
I prefer subscribing to $all and applying server-side filtering if necessary. It might have a bit of overhead as you receive more events than you need, but you get so much more power by combining events from different aggregates.
I wrote a little about it in the Eventuous documentation.
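To make the "one projection for all users" idea concrete, here is a minimal sketch that folds events from every user stream into a single key-value read model. The event shape and the event type names (UserRegistered, EmailChanged) are assumptions made up for this example, and the list of events stands in for whatever a category or $all subscription would deliver.

```python
def project_user_accounts(events):
    """Fold events from all user streams into one key-value read model."""
    accounts = {}
    for event in events:
        user_id = event["stream"]  # e.g. "user-1"
        if event["type"] == "UserRegistered":
            accounts[user_id] = dict(event["data"])
        elif event["type"] == "EmailChanged" and user_id in accounts:
            accounts[user_id]["email"] = event["data"]["email"]
        # Any other event type is not relevant to this projection and is skipped.
    return accounts


sample_events = [
    {"stream": "user-1", "type": "UserRegistered",
     "data": {"first_name": "Ann", "email": "ann@example.com"}},
    {"stream": "user-2", "type": "UserRegistered",
     "data": {"first_name": "Bob", "email": "bob@example.com"}},
    {"stream": "user-1", "type": "EmailChanged",
     "data": {"email": "ann@new.example.com"}},
    {"stream": "order-7", "type": "OrderPlaced", "data": {}},
]
user_accounts = project_user_accounts(sample_events)
```

The same loop could feed a second dictionary, such as blocked users, from the same subscription, which is the kind of extra projection mentioned above.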

GraphQL, Cassandra and denormalization strategy

Would a database like Cassandra and scheme like GraphQL work well together?
Cassandra ideology is based on the idea of optimizing your queries and denormalizing data. This doesn't seem to really mesh well with a GraphQL ideology where data seems to be accessible in every level of a query.
Example:
Suppose I architect my Cassandra table like so:
User:
name
address
etc... (many properties)
Group:
id
name
user_name (denormalized user, where we generally just need the name of a user)
But with GraphQL, one wouldn't exactly expect a denormalized User.
query getGroup {
  group(id: 1) {
    name
    users {
      name
    }
  }
}
So a couple of things:
1.) This GraphQL query could end up hitting our Cassandra database multiple times (assuming no caching): once to get the group name, and then possibly once more for each user. But let's say our resolver creates multiple User objects with one Cassandra call.
2.) We can't really build an idiomatic Cassandra database with both denormalization and GraphQL in mind, can we? Otherwise we should expect that certain properties of a User aren't returned to us by the query.
To sum up the question, what's the graphql strategy for working with denormalized data? Is it acceptable to omit certain properties that the client thinks are accessible? E.g the client tries to access address of user but we don't have that at the moment because our data is denormalized. Or should one not even worry about denormalization and just let graphQL make calls with a caching mechanism in between the db and graphql. E.g graphql first gets the group, then gets the user data for the group id.
This is a side effect of GraphQL, where a query can get quite complex in how it retrieves the data. But as long as the user is actually requesting only the data they need, and you are smart about your resolvers, the end result will actually be faster.
Consider tools like dataloader to batch and cache lookups when resolving a query.
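The batching-plus-caching idea can be sketched in a few lines. This is a simplified, synchronous illustration only: BatchLoader, load_many and fetch_users are hypothetical names, not the dataloader library's API (the real library is asynchronous and collects keys automatically).

```python
class BatchLoader:
    """Fetches all missing keys in one batched call and caches the results."""

    def __init__(self, batch_fetch):
        self._batch_fetch = batch_fetch  # function: list of keys -> dict of results
        self._cache = {}
        self.calls = 0  # number of round trips made to the backing store

    def load_many(self, keys):
        # dict.fromkeys removes duplicate keys while keeping their order.
        missing = [k for k in dict.fromkeys(keys) if k not in self._cache]
        if missing:
            self.calls += 1
            self._cache.update(self._batch_fetch(missing))
        return [self._cache.get(k) for k in keys]


USERS = {"u1": {"name": "Ann"}, "u2": {"name": "Bob"}}  # stand-in for the database


def fetch_users(user_ids):
    """Stand-in for one database call that returns many users at once."""
    return {uid: USERS[uid] for uid in user_ids if uid in USERS}


loader = BatchLoader(fetch_users)
```

A resolver for the users field would call loader.load_many with all the ids of the group at once, so the group query costs one lookup for the users instead of one per user, and repeated ids within the same request are served from the cache.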
As for omitting certain properties, GraphQL validates the response and will report an error, although it will also return the data you gave. It would probably be better to implement some sort of timeout and throw a more descriptive error if there is an issue retrieving the data.

Is querying MongoDB faster than Redis?

I have some data stored in a database (MongoDB) and in a distributed cache (Redis).
When querying the repository, I use a lazy-loading approach: it first looks for the data in the cache; if it is not there, it finds it in the database and updates the cache, so that the next time the data is needed it is found in the cache.
Sample Model Used:
Person ( id, name, age, address (Reference))
Address (id, place)
PersonCacheModel extends Person with addressId.
I am not storing the parent object together with its child object in the cache. That is why I've created PersonCacheModel with an addressId and store that object in the cache. When reading, the PersonCacheModel is converted to a Person, and a call goes through the address repository to the address cache to fill in the person's address details.
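The lazy-loading (cache-aside) lookup described above can be sketched roughly as follows. Plain dictionaries stand in for Redis and MongoDB, and the class and method names are hypothetical; the address lookup is left out to keep the sketch short.

```python
class PersonRepository:
    """Cache-aside lookup: try the cache first, fall back to the database."""

    def __init__(self, cache, database):
        self.cache = cache        # stand-in for Redis
        self.database = database  # stand-in for MongoDB
        self.db_reads = 0         # counts how often the database was hit

    def find_person_by_name(self, name):
        person = self.cache.get(name)
        if person is None:
            # Cache miss: read from the database and remember the result.
            self.db_reads += 1
            person = self.database.get(name)
            if person is not None:
                self.cache[name] = person
        return person


repo = PersonRepository(cache={}, database={"NAME1": {"name": "NAME1", "addressId": 7}})
```

The first call for "NAME1" reads the database and fills the cache; later calls for the same name are served from the cache only. Note that in a test using randomly chosen names, many lookups may be first-time misses that pay for both a cache read and a database read.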
As far as I understand:
personRepository.findPersonByName(NAME + randomNumber);
Access Data from Cache = network time + cache access time + deserialize time
Access Data from database = network time + database query time + object mapping time
When I ran the above approach for 1000 rows, accessing data from the database was faster than accessing data from the cache. I believed cache access time should be smaller than MongoDB access time.
Please let me know if there's an issue with the approach or if this is the expected result.
To have a valid benchmark, we need to consider both the hardware side and the data-processing side:
hardware - do both setups have the same configuration (RAM, CPU count, OS, etc.)?
process - how is the data transformed (single thread, multiple threads, per object, per request)?
Performing a load test on your data set will give you a good overview of which process is faster in a particular use case.
It is hard to judge what the result should be until the points mentioned above are known.
The other thing is to have more than one test scenario and to run each under load for, let's say, 10 seconds, a minute, 5 minutes, an hour... so that you get numbers that tell you the truth.
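A small timing harness along these lines might look like the sketch below. It is an assumption-laden illustration rather than a full load test: measure is a hypothetical helper, it runs on a single thread, and the function passed in would be the cache lookup or the database lookup being compared.

```python
import statistics
import time


def measure(fn, runs=1000, warmup=100):
    """Time fn repeatedly and report median and 95th-percentile latency in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not measured (connections, caches, JIT, etc.)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "median": statistics.median(samples),
        "p95": samples[int(0.95 * (len(samples) - 1))],
    }
```

Running measure once with the cache lookup and once with the database lookup, on the same machine and with the same keys, gives two comparable sets of numbers; looking at the 95th percentile as well as the median helps show whether a few slow calls are skewing the picture.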

Structuring cassandra database

I don't understand one thing about Cassandra. Say I have a website similar to Facebook, where people can share, like, comment, upload images and so on.
Now, let's say, I want to get all of the things my friends did:
Username1 liked your comment
Username2 updated his profile picture
And so on.
So after a lot of reading, I guess what I would need to do is create a new Column Family for each single thing, for example: user_likes, user_comments, user_shares. Basically, anything you can think of, and even after I do that, I would still need to create secondary indexes for most of the columns just so I could search for data? And even so, how would I know which users are my friends? Would I need to first get all of my friends' IDs and then search through all of those Column Families for each user ID?
EDIT
OK, so I did some more reading and now I understand things a little better, but I still can't really figure out how to structure my tables, so I will set a bounty. I want a clear example of what my tables should look like if I want to store and retrieve data in this kind of order:
All
Likes
Comments
Favourites
Downloads
Shares
Messages
So let's say I want to retrieve the last ten files uploaded by all my friends or the people I follow; this is what it would look like:
John uploaded song AC/DC - Back in Black 10 mins ago
And everything like comments and shares would be similar to that...
Now probably the biggest challenge would be to retrieve the last 10 things of all categories together, so the list would be a mix of all the things...
Now, I don't need an answer with fully detailed tables; I just need a really clear example of how I would structure and retrieve data, like I would do in MySQL with joins.
With SQL, you structure your tables to normalize your data, and use indexes and joins to query. With Cassandra, you can't do that, so you structure your tables to serve your queries, which requires denormalization.
You want to query items which your friends uploaded. One way to do this is to have a single table per user, and write to this table whenever a friend of that user uploads something.
friendUploads { #column family
    userid { #column
        timestamp-upload-id : null #key : no value
    }
}
as an example,
friendUploads {
    userA {
        12313-upload5 : null
        12512-upload6 : null
        13512-upload8 : null
    }
}
friendUploads {
    userB {
        11313-upload3 : null
        12512-upload6 : null
    }
}
Note that upload 6 is duplicated to two different columns, as whoever did upload6 is a friend of both User A and user B.
Now, to query the friends' uploads display for a user, do a getSlice with a limit of 10 on that user's userid column. This will return the first 10 items, sorted by key.
To put the newest items first, use a reverse comparator that sorts larger timestamps before smaller timestamps.
The drawback of this scheme is that when User A uploads a song, you have to do N writes to update the friendUploads columns, where N is the number of people who are friends of User A.
For the value associated with each timestamp-upload-id key, you can store enough information to display the results (probably in a json blob), or you can store nothing, and fetch the upload information using the uploadid.
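The first scheme (write once per friend, read with a single slice) can be sketched with in-memory dictionaries. Everything here is a stand-in for illustration: the friends mapping, the function names and the use of (timestamp, upload id) tuples as keys are assumptions, and sorting in Python plays the role of the reverse comparator.

```python
from collections import defaultdict

friends = {"userC": ["userA", "userB"]}  # uploader -> users who follow them
# user id -> {(timestamp, upload id): value}; None mirrors the "no value" columns.
friend_uploads = defaultdict(dict)


def record_upload(uploader, timestamp, upload_id):
    """Fan-out on write: one write per friend of the uploader (N writes)."""
    for friend in friends.get(uploader, []):
        friend_uploads[friend][(timestamp, upload_id)] = None


def latest_friend_uploads(user_id, limit=10):
    """One read per user: newest keys first, like a reversed slice with a limit."""
    keys = sorted(friend_uploads.get(user_id, {}), reverse=True)
    return keys[:limit]


record_upload("userC", 12512, "upload6")
record_upload("userC", 13512, "upload8")
```

After these two uploads, both userA and userB have upload8 and upload6 in their own rows, newest first, and showing either user's feed needs only the single read in latest_friend_uploads.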
To avoid duplicating writes, you can use a structure like,
userUploads { #column family
    userid { #column
        timestamp-upload-id : null #key : no value
    }
}
This stores the uploads for a particular user. Now, when you want to display the uploads of User B's friends, you have to do N queries, one for each friend of User B, and merge the results in your application. This is slower to query, but faster to write.
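The merge step of this second scheme can be sketched as follows. The per-user lists stand in for the results of the N queries (each already sorted newest first, as a reverse comparator would return them); user_uploads and latest_from_friends are hypothetical names for this example.

```python
import heapq
from itertools import islice

# Per-user upload lists as (timestamp, upload id), each sorted newest first.
user_uploads = {
    "userA": [(13512, "upload8"), (12313, "upload5")],
    "userC": [(12512, "upload6"), (11313, "upload3")],
}


def latest_from_friends(friend_ids, limit=10):
    """Fan-out on read: one query per friend, then merge in the application."""
    per_friend = [user_uploads.get(friend_id, []) for friend_id in friend_ids]
    # heapq.merge with reverse=True expects inputs sorted largest to smallest.
    merged = heapq.merge(*per_friend, reverse=True)
    return list(islice(merged, limit))
```

Because each input is already sorted, the merge only has to look at the head of each list, but the N queries that produce those lists still happen while the user is waiting.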
Most likely, if users can have thousands of friends, you would use the first scheme, and do more writes rather than more queries, as you can do the writes in the background after the user uploads, but the queries have to happen while the user is waiting.
As an example of denormalization, look at how many writes Twitter's Rainbird does when a single click occurs. Each write is used to support a single query.
In some regards, you "can" treat noSQL as a relational store. In others, you can denormalize to make things faster. For instance, PlayOrm's #OneToMany stored the many like so
user1 -> friend.user23, friend.user25, friend.user56, friend.user87
This is the wide row approach, so when you find your user, you have all the foreign keys to his friends. Each row can be a different length. You may also have a reverse reference stored as well, so the user might have references to the people who marked him as a friend but whom he did not mark back (let's call that a buddy), so you might have
user1 -> friend.user23, friend.user25, buddy.user29, buddy.user37
Notice that if designed right, you may NOT need to "search" for the data. That said, with PlayOrm, you can still do Scalable SQL and do joins (you just have to figure out how to partition your tables so they can scale to trillions of rows).
A row can have millions of columns in it, or it could have just 10. We are actually in the process of updating a lot of the documentation in PlayOrm and the noSQL patterns this month, so if you keep an eye on that, you can also learn more about general noSQL there as well.
Dean
Think of each DB query as a request to a service running on another machine. Your goal is to minimize the number of these requests (because each request requires a network round trip).
Here comes the main difference from the RDBMS paradigm: in SQL you would typically use joins and secondary indexes. In Cassandra, joins aren't possible, since related data would reside on different servers. Things like materialized views are used in Cassandra for the same purpose (to fetch all related data with a single query).
I'd recommend reading this article:
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
And looking into the Twissandra sample project: https://github.com/twissandra/twissandra
This is a nice collection of optimization techniques for the kind of project you described.