For somewhat vague reasons we are using a replicated OrientDB for our application.
It looks likely that we will have cases where a single record could be created twice. Here is how it happens:
we have user_document entities that carry a user ID and a document ID; both users and documents are managed by another application and stored in another database;
that other application sends a broadcast event (via a RabbitMQ topic) when a new document is created;
several instances of our application receive this message, and each creates a user_document with the same pair of user_id and document_id.
If I understand correctly, we should use a UNIQUE index on this pair of IDs and rely on distributed transactions.
However, for other reasons (another team writes the layer between the application and the database) we probably cannot use UNIQUE, though that may sound stupid :)
What are our chances then?
Could we, for example, allow all instances to create redundant records and, immediately after creation, select by user_id and document_id and, if more than one record is found, delete the ones with the lexicographically higher ID?
Sure, you can do it that way.
You can try something like
DELETE FROM (SELECT FROM user_document WHERE user_id = ? AND document_id = ? SKIP 1)
However, note that without an index this approach may consume additional resources on the server, and you might see a significant slowdown if user_document has a large number of records.
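If you go the insert-then-clean-up route from the question, a minimal sketch of the clean-up step over plain JDBC could look like the following (the OrientDB connection setup is assumed to exist already, e.g. via the OrientDB JDBC driver, and the statement simply reuses the query above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class UserDocumentCleaner {

    // Deletes all but one user_document record for the given pair.
    // 'conn' is assumed to be an open java.sql.Connection to the OrientDB database.
    // Add an ORDER BY on the record id inside the subquery if you need the
    // surviving record to be deterministic (e.g. keep the lowest id).
    public void removeDuplicates(Connection conn, String userId, String documentId) throws SQLException {
        String sql = "DELETE FROM (SELECT FROM user_document WHERE user_id = ? AND document_id = ? SKIP 1)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, userId);
            ps.setString(2, documentId);
            ps.executeUpdate();
        }
    }
}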
When implementing a system which creates tasks that need to be resolved by some workers, my idea would be to create a table which would have some task definition along with a status, e.g. for document review we'd have something like reviewId, documentId, reviewerId, reviewTime.
When documents are uploaded to the system we'd just store the documentId along with a generated reviewId and leave the reviewerId and reviewTime empty. When the next reviewer comes along and starts the review we'd just set his id and the current time to mark the job as "in progress" (I deliberately skip the case where the reviewer takes a long time, or dies during the review).
When implementing such a use case in e.g. PostgreSQL we could use UPDATE review SET reviewerId = :reviewerId, reviewTime = :reviewTime WHERE reviewId = (SELECT reviewId FROM review WHERE reviewerId IS NULL AND reviewTime IS NULL LIMIT 1 FOR UPDATE SKIP LOCKED) RETURNING reviewId, documentId, reviewerId, reviewTime (so basically update the first non-taken row, using SKIP LOCKED to skip any rows that are already in progress).
But when moving from native SQL to JDBC and beyond, I'm having trouble implementing this:
Spring Data JPA and Spring Data JDBC don't allow a @Modifying query to return anything other than void/boolean/int, which forces us to perform 2 queries in a single transaction - one for the first pending row, and a second one with the update
one alternative would be to use a stored procedure, but I really hate the idea of keeping such logic so far away from the code
another alternative would be to use a persistent queue and skip the database altogether, but this introduces additional infrastructure components that need to be maintained and learned. Any suggestions are welcome though.
Am I missing something? Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
Why Spring Data doesn't support returning entity for modifying queries?
Because it seems like a rather special thing to do and Spring Data JDBC tries to focus on the essential stuff.
Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
It is certainly possible to do this.
You can implement a custom method using an injected JdbcTemplate.
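For example, a rough sketch of such a custom fragment (the class and method names are made up; the SQL is the corrected statement from the question, with now() used for the review time for brevity):

import java.util.Optional;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;

@Repository
public class ReviewClaimRepository {  // hypothetical custom fragment, not generated by Spring Data

    private final JdbcTemplate jdbcTemplate;

    public ReviewClaimRepository(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Claims the next unassigned review for the given reviewer in a single round trip
    // and returns the claimed row, or empty if nothing is pending.
    public Optional<ClaimedReview> claimNext(long reviewerId) {
        String sql =
            "UPDATE review SET reviewerId = ?, reviewTime = now() " +
            "WHERE reviewId = (SELECT reviewId FROM review " +
            "                   WHERE reviewerId IS NULL " +
            "                   LIMIT 1 FOR UPDATE SKIP LOCKED) " +
            "RETURNING reviewId, documentId, reviewerId, reviewTime";
        // RETURNING makes the UPDATE produce a result set, so query(...) can map it.
        return jdbcTemplate.query(sql,
                (rs, rowNum) -> new ClaimedReview(
                        rs.getLong("reviewId"),
                        rs.getLong("documentId"),
                        rs.getLong("reviewerId"),
                        rs.getTimestamp("reviewTime").toInstant()),
                reviewerId)
            .stream()
            .findFirst();
    }

    public record ClaimedReview(long reviewId, long documentId, long reviewerId, java.time.Instant reviewTime) {}
}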
Let's say we have two models like this:
User:
- _id
- name
- email
Company:
- _id
- name
- slug
Now let's say I need to connect a user to a company. A user can have one company assigned. To do this, I can add a new field called companyID to the user model. But I'm not sending the _id field to the front end; all the requests that come to the API will carry the slug only. There are two ways I can do this:
1) Add slug to relate the company: If I do this, I can take the slug sent from a request and directly query for the company.
2) Add the _id of the company: If I do this, I need to first use the slug to query for the company and then use the _id returned to query for the required data.
May I please know which way is the best? Is there any extra benefit when using the _id of a record for the relationship?
Agree with the 2nd approach. There are several issues to consider when deciding on which field to use as a join key (this is true of all DBs, not just Mongo):
The field must be unique. I'm not sure exactly what the 'slug' field in your schema represents, but if there is any chance this could be duplicated, then don't use it.
The field must not change. Strictly speaking, you can change a key field, but the only way to safely do so is to simultaneously change it in all the child tables atomically. This is a difficult thing to do reliably because a) you have to know which tables are using the field (maybe some other developer added another table that you're not aware of); b) if you do it one at a time, you'll introduce race conditions; c) if any of the updates fail, you'll have inconsistent data and corrupted parent-child links. Some SQL DBs have a cascading-update feature to solve this problem, but Mongo does not. It's a hard enough problem that you really, really don't want to change a key field if you don't have to.
The field must be indexed. Strictly speaking this isn't true, but if you're going to join on it, then you will be running a lot of queries on it, so you'll need to index it.
For these reasons, it's almost always recommended to use a key field that serves solely as a key field, with no actual information stored in it. Plenty of people have been burned using things like Social Security Numbers, driver's licenses, etc. as key fields, either because there can be duplicates (e.g. SSNs can be duplicated if people are using fake numbers, or if they don't have one), or the numbers can change (e.g. driver's licenses).
Plus, by doing so, you can format the key field to optimize for speed of unique generation and indexing. For example, if you use SSNs, you need to check each SSN against the rest of the DB to ensure it's unique. That takes time if you have millions of records. Similarly for slugs, which are text fields that need to be hashed and checked against an index. OTOH, MongoDB essentially uses ObjectIds as keys, which are generated so that collisions are statistically negligible, meaning it doesn't have to check for uniqueness.
The bottomline is that there are very good reasons not to use a "real" field as your key if you can help it. Fortunately for you, mongoDB already gives you a great key field which satisfies all the above criteria, the _id field. Therefore, you should use it. Even if slug is not a "real" field and you generate it the exact same way as an _id field, why bother? Why does a record have to have 2 unique identifiers?
The second issue in your situation is that you don't expose the company's _id field to the user. Intuitively, it seems like that should be a valuable piece of information that shouldn't be given out willy-nilly. But the truth is, it has no informational value by itself, because, as stated above, a key should have no actual information. The place to implement security is in the query, ensuring that the user doing the query has permission to access the record / specific fields that she's asking for. Hiding the key is a classic security-by-obscurity that doesn't actually improve security.
The only time to hide your primary key is if you're using a poorly thought-out key that does contain useful information. For example, an invoice ID that increments by 1 for each invoice can be used by someone to figure out how many orders you get in a day. Auto-increment IDs can also be easily guessed (if my invoice is #5, can I snoop on invoice #6?). Fortunately, Mongo uses ObjectIds, so there's really no useful information leaking out (except perhaps the creation timestamp embedded in the ObjectId, and if you're worried about that, you need far more in-depth security considerations than this post :-).
Look at it another way: if a slug reliably points to a specific company and user, then how is it more secure than just using the _id?
That said, there are some instances where exposing a secondary key (like slugs) is helpful, none of which have to do with security. For example, if in the future you need to migrate DB platforms and need to re-generate keys because the new platform can't use your old ones; or if users will be manually typing in identifiers, then it's helpful to give them something easier to remember like slugs. But even in those situations, you can use the slug as a handy identifier for users to use, but in your DB, you should still use the company ID to do the actual join (like in your option #2). Check out this discussion about the pros/cons of exposing _ids to users:
https://softwareengineering.stackexchange.com/questions/218306/why-not-expose-a-primary-key
So my recommendation would be to go ahead and give the user the company Id (along with the slug if you want a human-readable format e.g. for URLs, although mongo _ids can be used in a URL). They can send it back to you to get the user, and you can (after appropriate permission checks) do the join and send back the user data. If you don't want to expose the company Id, then I'd recommend your option #2, which is essentially the same thing except you're adding an additional query to first get the company Id. IMHO, that's a waste of cycles for no real improvement in security, but if there are other considerations, then it's still acceptable. And both of those options are better than using the slug as a primary key.
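For what it's worth, here is a minimal sketch of option #2 with the MongoDB Java driver; the database, collection and field names (app, companies, users, companyID) are assumptions based on the models above:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.types.ObjectId;

public class CompanyLookup {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> companies = client.getDatabase("app").getCollection("companies");
            MongoCollection<Document> users = client.getDatabase("app").getCollection("users");

            // Step 1: resolve the slug sent by the front end to the company's _id
            // (error handling for an unknown slug is omitted for brevity).
            Document company = companies.find(Filters.eq("slug", "acme-inc")).first();
            ObjectId companyId = company.getObjectId("_id");

            // Step 2: query the related data by the stored companyID reference.
            for (Document user : users.find(Filters.eq("companyID", companyId))) {
                System.out.println(user.toJson());
            }
        }
    }
}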
The second approach is the best, that is, adding the _id of the company.
Using _id is the best practice for querying any kind of information; even complex queries can be solved using _id, as it is a unique ObjectId created by MongoDB. Population is the process of automatically replacing the specified paths in a document with document(s) from other collection(s). We may populate a single document, multiple documents, a plain object, multiple plain objects, or all objects returned from a query.
I don't understand one thing about Cassandra. Say I have a website similar to Facebook, where people can share, like, comment, upload images and so on.
Now, let's say, I want to get all of the things my friends did:
Username1 liked your comment
Username2 updated his profile picture
And so on.
So after a lot of reading, I guess what I would need to do is create a new column family for each single thing, for example: user_likes, user_comments, user_shares. Basically, anything you can think of. And even after I do that, I would still need to create secondary indexes for most of the columns just so I could search for the data? And even so, how would I know which users are my friends? Would I need to first get all of my friends' IDs and then search through all of those column families for each user ID?
EDIT
Ok, so I did some more reading and now I understand things a little bit better, but I still can't really figure out how to structure my tables, so I will set a bounty. I want a clear example of how my tables should look if I want to store and retrieve data in this kind of order:
All
Likes
Comments
Favourites
Downloads
Shares
Messages
So let's say I want to retrieve the ten last uploaded files of all my friends or the people I follow; this is how it would look:
John uploaded song AC/DC - Back in Black 10 mins ago
And everything like comments and shares would be similar to that...
Now probably the biggest challenge would be to retrieve the 10 last things across all categories together, so the list would be a mix of all the things...
Now I don't need an answer with fully detailed tables, I just need a really clear example of how I would structure and retrieve data like I would do in MySQL with joins.
With SQL, you structure your tables to normalize your data and use indexes and joins to query. With Cassandra, you can't do that, so you structure your tables to serve your queries, which requires denormalization.
You want to query the items which your friends uploaded. One way to do this is to have a single table per user, and write to this table whenever a friend of that user uploads something.
friendUploads { # column family
userid { # row key
timestamp-upload-id : null # column name : no value
}
}
as an example,
friendUploads {
userA {
12313-upload5 : null
12512-upload6 : null
13512-upload8 : null
}
}
friendUploads {
userB {
11313-upload3 : null
12512-upload6 : null
}
}
Note that upload6 is duplicated into two different rows, as whoever did upload6 is a friend of both user A and user B.
Now, to build the friend-uploads display for a given user, do a getSlice with a limit of 10 on that user's row. This will return the first 10 entries, sorted by key.
To put newest items first, use a reverse comparator that sorts larger timestamps before smaller timestamps.
The drawback to this approach is that when user A uploads a song, you have to do N writes to update the friendUploads rows, where N is the number of people who are friends of user A.
For the value associated with each timestamp-upload-id key, you can store enough information to display the results (probably in a json blob), or you can store nothing, and fetch the upload information using the uploadid.
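If it helps, here is roughly the same design expressed in today's CQL terms instead of the Thrift column-family API (the table and column names are made up for this sketch, and the DataStax Java driver is assumed):

import com.datastax.oss.driver.api.core.CqlSession;
import java.time.Instant;
import java.util.List;

public class FriendUploadsFeed {

    private final CqlSession session;

    public FriendUploadsFeed(CqlSession session) {
        this.session = session;
        // One partition per user; the descending clustering order plays the role
        // of the "reverse comparator", so the newest uploads come back first.
        session.execute("CREATE TABLE IF NOT EXISTS friend_uploads (" +
                " user_id text, upload_time timestamp, upload_id text," +
                " PRIMARY KEY (user_id, upload_time, upload_id)" +
                ") WITH CLUSTERING ORDER BY (upload_time DESC, upload_id ASC)");
    }

    // Fan-out on write: one insert per friend of the uploader (the N writes mentioned above).
    public void recordUpload(String uploadId, Instant when, List<String> friendsOfUploader) {
        for (String friendId : friendsOfUploader) {
            session.execute(
                "INSERT INTO friend_uploads (user_id, upload_time, upload_id) VALUES (?, ?, ?)",
                friendId, when, uploadId);
        }
    }

    // The getSlice equivalent: read a single partition with a limit of 10, newest first.
    public List<String> latestTen(String userId) {
        return session.execute(
                "SELECT upload_id FROM friend_uploads WHERE user_id = ? LIMIT 10", userId)
            .all().stream().map(row -> row.getString("upload_id")).toList();
    }
}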
To avoid duplicating writes, you can use a structure like,
userUploads { # column family
userid { # row key
timestamp-upload-id : null # column name : no value
}
}
This stores the uploads of a particular user. Now when you want to display the uploads of user B's friends, you have to do N queries, one for each friend of user B, and merge the results in your application. This is slower to query, but faster to write.
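A sketch of that read-time merge (same assumptions as the sketch above, with a user_uploads table keyed by the uploader and clustered newest-first):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FriendsFeedReader {

    private final CqlSession session;

    public FriendsFeedReader(CqlSession session) {
        this.session = session;
    }

    // N queries, one per friend, merged and re-sorted in the application (newest first).
    public List<Row> latestTenAcrossFriends(List<String> friendIds) {
        List<Row> merged = new ArrayList<>();
        for (String friendId : friendIds) {
            merged.addAll(session.execute(
                "SELECT upload_id, upload_time FROM user_uploads WHERE user_id = ? LIMIT 10",
                friendId).all());
        }
        merged.sort(Comparator.comparing((Row row) -> row.getInstant("upload_time")).reversed());
        return merged.subList(0, Math.min(10, merged.size()));
    }
}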
Most likely, if users can have thousands of friends, you would use the first scheme, and do more writes rather than more queries, as you can do the writes in the background after the user uploads, but the queries have to happen while the user is waiting.
As an example of denormalization, look at how many writes Twitter's Rainbird does when a single click occurs. Each write is used to support a single query.
In some regards, you "can" treat NoSQL as a relational store. In others, you can denormalize to make things faster. For instance, PlayOrm's @OneToMany stores the many side like so:
user1 -> friend.user23, friend.user25, friend.user56, friend.user87
This is the wide-row approach, so when you find your user, you have all the foreign keys to his friends. Each row can be a different length. You may also have a reverse reference stored as well, so the user might have references to the people that marked him as a friend but whom he did not mark back (let's call them buddies), so you might have
user1 -> friend.user23, friend.user25, buddy.user29, buddy.user37
Notice that if designed right, you may NOT need to "search" for the data. That said, with PlayOrm you can still do Scalable SQL and do joins (you just have to figure out how to partition your tables so it can scale to trillions of rows).
A row can have millions of columns in it, or it could have just 10. We are actually in the process of updating a lot of the documentation in PlayOrm and the noSQL patterns this month, so if you keep an eye on that, you can also learn more about general noSQL there as well.
Dean
Think of each DB query as a request to a service running on another machine. Your goal is to minimize the number of these requests (because each request requires a network round trip).
Here comes the main difference from the RDBMS paradigm: in SQL you would typically use joins and secondary indexes. In Cassandra, joins aren't possible, since related data would reside on different servers. Things like materialized views are used in Cassandra for the same purpose (to fetch all related data with a single query).
I'd recommend reading this article:
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
And looking into the twissandra sample project https://github.com/twissandra/twissandra
It is a nice collection of optimization techniques for the kind of project you described.
The repository in the CommonDomain only exposes GetById(). So what do I do if my handler needs a list of Customers, for example?
Taking your question at face value: if you needed to perform operations on multiple aggregates, you would just provide the IDs of each aggregate in your command (which the client would obtain from the query side), and then get each aggregate from the repository.
However, looking at one of your comments in response to another answer I see what you are actually referring to is set based validation.
This very question has raised quite a lot of debate about how to do it, and Greg Young has written a blog post on it.
The classic question is 'how do I check that the username hasn't already been used when processing my CreateUserCommand?'. I believe the suggested approach is to assume that the client has already done this check by asking the query side before issuing the command. When the user aggregate is created, the UserCreatedEvent will be raised and handled by the query side. There, the insert will fail (either because of a check or a unique constraint in the DB), and a compensating command would be issued, which would delete the newly created aggregate and perhaps email the user telling them the username is already taken.
The main point is, you assume that the client has done the check. I know this approach is difficult to grasp at first - but it's the nature of eventual consistency.
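A rough sketch of that query-side handler (all class names here are hypothetical and the command-sending infrastructure is assumed): the projection attempts the insert, and on a duplicate it issues the compensating command:

import org.springframework.dao.DuplicateKeyException;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical read-side projection illustrating the compensating-command approach.
public class UserProjection {

    private final JdbcTemplate readDb;
    private final CommandBus commandBus; // hypothetical abstraction for sending commands

    public UserProjection(JdbcTemplate readDb, CommandBus commandBus) {
        this.readDb = readDb;
        this.commandBus = commandBus;
    }

    public void handle(UserCreatedEvent event) {
        try {
            // read_user.username carries a UNIQUE constraint in the read database.
            readDb.update("INSERT INTO read_user (id, username, email) VALUES (?, ?, ?)",
                    event.userId(), event.username(), event.email());
        } catch (DuplicateKeyException duplicate) {
            // Username already taken: compensate by removing the newly created aggregate.
            commandBus.send(new DeleteUserCommand(event.userId(), "username already taken"));
        }
    }

    // Hypothetical message types, included only to keep the sketch self-contained.
    public record UserCreatedEvent(String userId, String username, String email) {}
    public record DeleteUserCommand(String userId, String reason) {}
    public interface CommandBus { void send(Object command); }
}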
Also you might want to read this other question which is similar, and contains some wise words from Udi Dahan.
In the classic event sourcing model, queries like get all customers would be carried out by a separate query handler which listens to all events in the domain and builds a query model to satisfy the relevant questions.
If you need to query customers by last name, for instance, you could listen to all customer created and customer name change events and just update one table of last-name to customer-id pairs. You could hold other information relevant to the UI that is showing the data, or you could simply hold IDs and go to the repository for the relevant customers in order to work further with them.
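As an illustration, a minimal sketch of such a query-model builder (the event types and table name are hypothetical):

import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical projection that maintains a last-name -> customer-id lookup table.
public class CustomerLastNameProjection {

    private final JdbcTemplate readDb;

    public CustomerLastNameProjection(JdbcTemplate readDb) {
        this.readDb = readDb;
    }

    public void on(CustomerCreated event) {
        readDb.update("INSERT INTO customer_by_last_name (customer_id, last_name) VALUES (?, ?)",
                event.customerId(), event.lastName());
    }

    public void on(CustomerNameChanged event) {
        readDb.update("UPDATE customer_by_last_name SET last_name = ? WHERE customer_id = ?",
                event.newLastName(), event.customerId());
    }

    // Hypothetical event shapes, included only to keep the sketch self-contained.
    public record CustomerCreated(String customerId, String lastName) {}
    public record CustomerNameChanged(String customerId, String newLastName) {}
}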
You don't need a list of customers in your handler. Each aggregate MUST be processed in its own transaction. If you want to show this list to the user - just build an appropriate view.
Your command needs to contain the id of the aggregate root it should operate on.
This id will be looked up by the client sending the command, using a view in your read model. This view will be populated with data from the events that your AR emits.
Say that I have a User table in my read database (I use SQL Server). In a regular read/write database I can put an index on the table to make sure that two users aren't added to the table with the same email address.
So if I try to add a user with an email address that already exists in my table for a different user, SQL Server will throw an exception back.
In CQRS I can't do that, since if I decouple the write to my read database from the domain model by putting it on an asynchronous queue, I won't get the exception thrown back to me; I will return "OK" to the UI and the user will think that he has been added to the database, when in fact he will never be added to the read database.
I can do a search in the read database checking if there is already a user with the email address, and if there is one, throw an exception back to the UI. But if two users press the save button at the same time, I will do two checks against the database, see that there isn't any user with the email address, and send back that it's okay. Both go on my queue, and later one will fail (by hitting the unique constraint).
Am I supposed to load all users from my event source (it's a SQL Server) and then do the check on that collection, to see if I already have a user with this email address? That sounds a bit crazy to me...
How have you people solved it?
The only way I can see is to not use an asynchronous queue but a synchronous one, but that will affect performance really badly, especially when you have many "read storages" to write to...
Need some help here...
Searching for CQRS Set Based Validation will give you solutions to this issue.
Greg Young posted about the business impact of embracing eventual consistency: http://codebetter.com/gregyoung/2010/08/12/eventual-consistency-and-set-validation/
Jérémie Chassaing posted about discovering missing aggregate roots in the domain: http://thinkbeforecoding.com/post/2009/10/28/Uniqueness-validation-in-CQRS-Architecture
Related stack overflow questions:
How to handle set based consistency validation in CQRS?
CQRS Validation & uniqueness