GraphQL, Cassandra and denormalization strategy

Would a database like Cassandra and a query language like GraphQL work well together?
Cassandra's ideology is based on the idea of optimizing your queries and denormalizing data. This doesn't seem to mesh well with the GraphQL ideology, where data seems to be accessible at every level of a query.
Example:
Suppose I architect my Cassandra tables like so:

User:
  name
  address
  etc... (many properties)

Group:
  id
  name
  user_name (denormalized user, where we generally just need the name of a user)
But with GraphQL, one wouldn't exactly expect a denormalized User:
query getGroup {
  group(id: 1) {
    name
    users {
      name
    }
  }
}
So a couple of things:
1.) This GraphQL query could end up hitting our Cassandra database multiple times (assuming no caching): once for the group name, and then potentially once more for each user. But let's say our resolver creates multiple User objects with one Cassandra call.
2.) We can't really build a Cassandra-idiomatic database with denormalization and GraphQL in mind, can we? Otherwise we should expect that certain properties of a User aren't returned to us by the query.
To sum up the question: what's the GraphQL strategy for working with denormalized data? Is it acceptable to omit certain properties that the client thinks are accessible? E.g. the client tries to access the address of a user, but we don't have it at the moment because our data is denormalized. Or should one not worry about denormalization at all and just let GraphQL make the calls, with a caching mechanism between the DB and GraphQL? E.g. GraphQL first gets the group, then gets the user data for that group ID.

This is a side effect of GraphQL, where a query can get quite complex to resolve. But as long as the client is only requesting the data it actually needs, and you are smart about your resolvers, the end result will actually be faster.
Consider tools like dataloader to cache when resolving a query.
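For instance, here is a minimal sketch of that idea, assuming a hypothetical cassandraClient (the real Node.js cassandra-driver exposes a similar execute()) and a denormalized user_names list on the group row; none of these names come from the question itself:

import DataLoader from 'dataloader';

// Hypothetical Cassandra client; the Node.js cassandra-driver exposes a similar execute().
declare const cassandraClient: {
  execute(query: string, params?: unknown[], options?: { prepare?: boolean }): Promise<{ rows: any[] }>;
};

// Batch function: receives every user name requested while resolving the query
// and fetches them all with a single Cassandra call instead of one call per user.
const userLoader = new DataLoader<string, any>(async (names) => {
  const result = await cassandraClient.execute(
    'SELECT name, address FROM users WHERE name IN ?',
    [Array.from(names)],
    { prepare: true }
  );
  const byName = new Map(result.rows.map((row) => [row.name, row]));
  // DataLoader expects results in the same order as the requested keys.
  return names.map((name) => byName.get(name) ?? null);
});

const resolvers = {
  Group: {
    // group.user_names is the denormalized list assumed to be stored on the group row.
    users: (group: { user_names: string[] }) =>
      Promise.all(group.user_names.map((name) => userLoader.load(name))),
  },
};

In a real server you would usually construct the loader per request so its cache is not shared between users.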
As far as omitting certain properties goes, GraphQL validates the response and will return an error for fields it cannot resolve, although it will also return the data you did provide. It would probably be better to implement some sort of timeout and throw a more descriptive error if there is an issue retrieving the data.

Related

How to handle circular documents in MongoDB/DynamoDB?

Currently the site is using a relational database (MySQL); however, joining all the data is too slow and has required caching, which has led to other issues.
The issue is how the two tables would nest into each other, creating a circular reference. A simple example would be two tables, one for an ACTOR and a second for a MOVIE. The movie would have the actor and the actor would have a movie. Obviously this is easy in a relational database.
So for example, an ACTOR schema:

ACTOR1
  - AGE
  - BIO
  - MOVIES
    - FILM1 (ties to the FILM1 document)
    - FILM2

Then the MOVIE schema:

FILM1
  - RELEASE DATE
  - ACTORS
    - ACTOR1 (ties back to the ACTOR document)
    - ACTOR2
Speed is the most important thing to me. I can easily add IDs to an ACTOR document in place of the full MOVIE document; however, I'm back to multiple calls. Are there any features in a NoSQL database like MongoDB or DynamoDB that could solve this in a single call? Or is NoSQL just not the right choice?
While NoSQL generally recommends denormalization of data models, it is best not to have an unbounded list in a single database entry. To model this data in DynamoDB, you should use an adjacency list for modeling the many-to-many relationship. There's no cost-effective way of modeling the data, that I know of, to allow you to get all the data you want in a single call. However, you have said that speed is most important (without giving a latency requirement), so I will try to give you an idea as to how fast you can get the data if stored in DynamoDB.
Your schemas would become something like this:

Actor {
  ActorId,  <-- This is the application/database id, not the actor's actual ID
  Name,
  Age,
  Bio
}

Film {
  FilmId,  <-- This is the application/database id for the film
  Title,
  Description,
  ReleaseDate
}

ActedIn {
  ActorId,
  FilmId
}
To indicate that an actor acted in a movie, you only need to perform one write (which is consistently single-digit milliseconds using DynamoDB in my experience) to add an ActedIn item to your table.
To get all the movies for an actor, you would need to query once to get all the ActedIn relationships, and then do a batch read to get all the movies. Typical latencies for a query (in my experience) are under 10ms, depending on network speed and the amount of data being sent over the network. Since the ActedIn relationship is such a small object, I think you could expect an average case of 5ms for the query, if it originates from something that is also running in an AWS datacenter (EC2, Lambda, etc.).
Getting a single item is going to be under 5 ms, and you can do that in parallel. There's also a BatchGetItems API, but I don't have any statistics for you on that.
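As a rough illustration of that read path, here is a sketch using the AWS SDK v2 DocumentClient; the table and attribute names are taken from the schemas above, and everything else is an assumption:

import { DynamoDB } from 'aws-sdk';

const db = new DynamoDB.DocumentClient();

async function getFilmsForActor(actorId: string) {
  // 1) One query for all ActedIn relationships belonging to the actor.
  const rels = await db
    .query({
      TableName: 'ActedIn',
      KeyConditionExpression: 'ActorId = :a',
      ExpressionAttributeValues: { ':a': actorId },
    })
    .promise();

  const filmIds = (rels.Items ?? []).map((item) => item.FilmId);
  if (filmIds.length === 0) return [];

  // 2) One batch read for the film details (BatchGetItem accepts up to 100 keys per call).
  const films = await db
    .batchGet({
      RequestItems: {
        Film: { Keys: filmIds.map((FilmId) => ({ FilmId })) },
      },
    })
    .promise();

  return films.Responses?.Film ?? [];
}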
So, is ~10ms fast enough for you?
If not, you can use DAX, which adds a caching layer to DynamoDB and promises request latency of <1ms.
What's the unmaintainable, not-cost-effective way to do this in a single call?
For every ActedIn relationship, store your data like this:
ActedIn {
  ActorId,
  ActorName,
  ActorAge,
  ActorBio,
  FilmId,
  FilmTitle,
  FilmDescription,
  FilmReleaseDate
}
You only need to make one query for any given Actor to get all of their film details, and only one query to get all the Actor details for a given film. Don't actually do this. The duplicated data means that every time you have to update the details for an Actor, you need to update it for every Film they were in, and similarly for Film details. This will be an operational nightmare.
I'm not convinced; it seems like NoSQL is terrible for this.
You should remember that NoSQL comes in many varieties (NoSQL = Not Only SQL), and so even if one NoSQL solution doesn't work for you, you shouldn't rule it out entirely. If you absolutely need this in a single call, you should consider using a Graph database (which is another type of NoSQL database).

Row-level security using Prisma and Postgres

I am using Prisma and graphql-yoga servers with a Postgres DB.
I want to implement authorization for my GraphQL queries. I saw solutions like graphql-shield that solve column-level security nicely, meaning I can define a permission and, according to it, block or allow a specific table or column of data (or, in GraphQL terms, block a whole entity or a specific field).
The part I am stuck on is row-level security: filtering rows by the data they contain. Say I want to allow a logged-in user to view only the data that is related to him; depending on the value in a user_id column I would allow or block access to that row (the logged-in user is one example, but there are other use cases in this genre).
This type of security requires running a query to check which rows the current user has access to, and I can't find a way (that is not horrible) to implement this with Prisma.
If I were working without Prisma, I would implement this at the level of each resolver, but since I am forwarding my queries to Prisma I do not control the internal resolvers on a nested query.
But I do want to work with Prisma, so one idea we had was handling this at the DB level using a Postgres policy. This could work as follows:
Every query we run will be surrounded with "begin transaction" and "commit transaction".
Before the query I want to run "set local context.user_id to 5".
Then I want to run the query (and the policy will filter results according to current_setting('context.user_id')).
For this to work I would need Prisma to allow me to either add pre/post queries to each query that runs, or let me set a context for the DB.
But these options are not available in Prisma.
Any ideas?
You can use prisma-client instead of prisma-binding.
With prisma-binding, you define the top-level resolver and then delegate to Prisma for all the nesting.
On the other hand, prisma-client only returns the scalar values of a type, and you need to define the resolvers for the relations yourself, which means you have complete control over what you return, even for nested queries. (See the documentation for an example.)
I would suggest you use prisma-client to apply your security filters on the fields.
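A rough sketch of what that can look like (the datamodel, and therefore the generated prisma-client method names such as prisma.posts and prisma.post(...).author(), are assumptions; only the filtering pattern is the point):

// Hypothetical Prisma 1 generated client; method names depend on your datamodel.
import { prisma } from './generated/prisma-client';

const resolvers = {
  Query: {
    // Only return rows whose user_id matches the logged-in user from the context.
    posts: (_parent: unknown, _args: unknown, ctx: { userId: string }) =>
      prisma.posts({ where: { user_id: ctx.userId } }),
  },
  Post: {
    // Relations are resolved here because prisma-client only returns scalars,
    // so the same row-level checks can be enforced on nested data.
    author: (parent: { id: string }) => prisma.post({ id: parent.id }).author(),
  },
};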
With the approach you're looking to take, I'd definitely recommend a look at Graphile. It approaches row-level security essentially the same way that you're thinking of. Unfortunately, it seems like Prisma doesn't help you move away from writing traditional REST-style controller methods in this regard.
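For reference, here is a minimal sketch of the Postgres side of that idea using the plain pg driver; the table, column and setting names are made up, and this policy/set_config pairing is the mechanism Graphile automates for you:

import { Pool } from 'pg';

const pool = new Pool();

// One-time setup, e.g. in a migration:
//   ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY documents_owner ON documents
//     USING (user_id = current_setting('context.user_id')::int);

async function queryAsUser(userId: number, sql: string, params: unknown[] = []) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // set_config(..., true) scopes the setting to this transaction only,
    // so the policy sees the current user and nothing leaks between requests.
    await client.query("SELECT set_config('context.user_id', $1, true)", [String(userId)]);
    const result = await client.query(sql, params);
    await client.query('COMMIT');
    return result.rows;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}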

How to expose URL-friendly UUIDs?

Hello Internet Denizens,
I was reading through a nice database design article and the final determination on how to properly generate DB primary keys was ...
So, in reality, the right solution is probably: use UUIDs for keys, and don't ever expose them. The external/internal thing is probably best left to things like friendly-url treatments, and then (as Medium does) with a hashed value tacked on the end.
That is, use UUIDs for internal purposes like db joins, but use a friendly-url for external purposes (like a REST API).
My question is ... how do you make uniquely identifiable (and friendly) keys for external purposes?
I've used several APIs: Stripe, QuickBooks, Amazon, etc., and it seems like they use straight-up sequential IDs for things like customers, report IDs, etc. for retrieving information. It makes me wonder if treating exposed UUIDs as a security risk is a little overblown, because in theory you should be able to append a WHERE clause to your queries:
SELECT * FROM products where UUID = <supplied uuid> AND owner/role/group/etc = <logged in user>
The follow-up question is: If you expose a primary key, how do people efficiently restrict access to that resource in a database environment? Assign an owner to a db row?
Interested in the design responses.
Potential Relevant Posts for Further Reading:
Should I use UUIDs for resources in my public API?
It is not a good idea to expose your internal IDs to the outside. You should either encode them (with some algorithm) or have a lookup table.
Also, do not append parameters provided by the user (or the URL) to your SQL query (UUIDs or not); this is prone to SQL injection. Use parameterized SQL queries instead.
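As a small illustration of both points, here is a hedged sketch using node-postgres: the identifier supplied by the client is passed as a bind parameter rather than concatenated into the SQL, and the ownership check comes from the authenticated session (table and column names are assumptions):

import { Pool } from 'pg';

const pool = new Pool();

// externalId comes from the URL, ownerId comes from the authenticated session.
async function getProduct(externalId: string, ownerId: string) {
  const { rows } = await pool.query(
    // Both values are bind parameters, so nothing from the request is spliced into the SQL text.
    'SELECT * FROM products WHERE uuid = $1 AND owner_id = $2',
    [externalId, ownerId]
  );
  return rows[0] ?? null;
}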

Update embedded data on referenced data update

I am building a Meteor application and am currently creating the publications, and I am coming up against what seems like a common design quandary around related vs. embedded documents. My data model (simplified) has Bookings, each of which has a related Client and a related Service. In order to optimise the speed of retrieving a collection, I am embedding the key fields of a Client and Service in the Booking, and also linking to the IDs. My Booking model has the following structure:
export interface Booking extends CollectionObject {
  client_name: string;
  service_name: string;
  client_id: string;
  service_id: string;
  bookingDate: Date;
  duration: number;
  price: number;
}
In this model, client_id and service_id are references to the linked documents and client_name / service_name are embedded as they are used when displaying a list of bookings.
This all seems fine to me; however, the missing part of the puzzle is keeping this embedded data up to date. If a user in a separate part of the system updates a Service (which would be a reactive collection), then I need this to trigger an update of the service_name on any bookings with the corresponding service ID. Is there an event I should subscribe to for this, or am I able to? Client-side, I have a form which allows the user to add / edit a Service, which simply uses the insert or update method on the MongoObservable collection. The OOP part of me feels like this needs to be overridden in the server code to also update the related data, or am I completely going about this the wrong way?
Or is this all irrelevant, and should I actually just use https://atmospherejs.com/reywood/publish-composite and return collections of related documents? (It just feels like it would harm performance in a production environment when returning several hundred bookings at once.)
I use a lot of the "foreign key" concept as you're describing, and I do denormalize data across collections as you're doing with the service name. I do this explicitly to avoid extra lookups / publishes.
I use two strategies to keep things up to date. The first is done when the source data is saved, say in a Meteor method call: I'll update the denormalized data on the spot, touching the other collection(s). I would do all this in a "high read, low write" scenario.
The other strategy is to use collection hooks to fire when the source collection is updated. I use this package: matb33:collection-hooks
Conceptually it's similar to the first, but the hook into knowing when to do it is different.
An example we're using in the current app I'm working on: we have a news feed with comments. News items and comments are in separate collections, and each record in the comments collection has the id of the associated news item.
We keep a running comment count associated with the news item itself. Whenever a comment is added or removed, we increment/decrement the count and update the news item right away.
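Applied to the Booking example from the question, the hook strategy might look roughly like this (collection and field names are assumptions, and matb33:collection-hooks patches the collections at runtime, hence the cast):

import { Mongo } from 'meteor/mongo';

// In a real app these collections would already exist elsewhere.
const Services = new Mongo.Collection<any>('services');
const Bookings = new Mongo.Collection<any>('bookings');

// Server-side: whenever a service document changes, push the (possibly new) name
// to every booking that references it, keeping the denormalized field in sync.
(Services as any).after.update(function (userId: string, doc: any) {
  Bookings.update(
    { service_id: doc._id },
    { $set: { service_name: doc.name } },
    { multi: true }
  );
});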

Simple model when requesting a collection and extended model when requesting a single resource - how?

I have the following URI: /articles/:id, where article is a resource on the web service with an associated model/class. Now I need to return only partial data for each resource (to save bandwidth and improve speed) when the collection is requested, but when a single item is requested from the collection I need to send the full data. My question is: should I use two models/classes for the same resource on the server and instantiate a different one depending on whether the collection or a single resource is requested? Or should there be only one model/class, with not all fields filled with data when a collection is requested? Or is there another approach?
I suggest the approach described here, with a fields query parameter.
If the API is going to be open to everyone to use and client usage is going to be unpredictable, then by default you probably need to limit the fields that you return. Just make sure you document in some way all the possible fields that could be used, in case a client actually needs them.
If the API is going to be consumed only by an app or apps you made, then by default you could return all of the fields and then your app can pass that fields parameter to speed things up.
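A small sketch of that fields-parameter idea with Express follows; the route, the allowed field list, and the loadArticles / loadArticle helpers are all made up for illustration:

import express from 'express';

const app = express();

// Fields returned on the collection endpoint unless the client asks for more.
const DEFAULT_FIELDS = ['id', 'title'];
const ALLOWED_FIELDS = ['id', 'title', 'summary', 'body', 'author', 'createdAt'];

function pick(article: Record<string, unknown>, fields: string[]) {
  return Object.fromEntries(fields.map((f) => [f, article[f]]));
}

app.get('/articles', async (req, res) => {
  // e.g. GET /articles?fields=title,summary,author
  const requested = typeof req.query.fields === 'string'
    ? req.query.fields.split(',').filter((f) => ALLOWED_FIELDS.includes(f))
    : DEFAULT_FIELDS;
  const articles = await loadArticles();
  res.json(articles.map((a) => pick(a, requested)));
});

app.get('/articles/:id', async (req, res) => {
  // The single-resource endpoint returns the full representation.
  res.json(await loadArticle(req.params.id));
});

// Hypothetical data-access helpers assumed to be implemented elsewhere.
declare function loadArticles(): Promise<Record<string, unknown>[]>;
declare function loadArticle(id: string): Promise<Record<string, unknown>>;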