I am new to Cassandra and trying to migrate my app from MongoDB to Cassandra.
I have the following collections in MongoDB:
PhotoAlbums
[
{id: oid1, title:t1, auth: author1, tags: ['bob', 'fun'], photos: [pid1, pid2], views:200 }
{id: oid2, title:t2, auth: author2, tags: ['job', 'fun'], photos: [pid3, pid4], views: 300 }
{id: oid3, title:t3, auth: author3, tags: ['rob', 'fun'], photos: [pid2, pid4], views: 400 }
....
]
Photos
[
{id: pid1, cap:t1, auth: author1, path:p1, tags: ['bob','fun'], comments:40, views:2000, likes:0 }
{id: pid2, cap:t2, auth: author2, path:p2, tags: ['job','fun'], comments:50, views:50, likes:1, liker:[bob] }
{id: pid3, cap:t3, auth: author3, path:p3, tags: ['rob','fun'], comments:60, views: 6000, likes: 0 }
...
]
Comments
[
{id: oid1, photo_id: pid1, commenter: bob, text: photo is cool, likes: 1, likers: [john], replies: [{rep1}, {rep2}]}
{id: oid2, photo_id: pid1, commenter: bob, text: photo is nice, likes: 1, likers: [john], replies: [{rep1}, {rep2}]}
{id: oid3, photo_id: pid2, commenter: bob, text: photo is ok, likes: 2, likers: [john, bob], replies: [{rep1}]}
]
Queries:
Query 1: Show a list of popular albums (based on number of likes)
Query 2: Show a list of most discussed albums (based on number of comments)
Query 3: Show a list of all albums of a given author on the user's page
Query 4: Show the album with all photos and all comments (pull album details, show photo thumbnails of all photos in the album, show all comments of a selected photo)
Query 5: Show a list of related albums based on the tags of the current album
Given the above schema and requirements, how should I model this in Cassandra?
As I have experience with both Cassandra and MongoDB, I'll take a shot at this. The tricky part here is that MongoDB allows very loose restrictions around indexing and querying. Cassandra's model is more restrictive in that respect, but one that should perform fast, at scale, if designed correctly. Also, counting likes/views/comments on a photo or album can get tricky, as you'll want to use Cassandra's counter type for that (which has its own challenges).
Disclaimer: Others may solve these problems differently. And I may choose to solve them differently if my first attempt didn't perform. But this is what I would start with.
To satisfy Query 3 I would create a query table called PhotoAlbumsByAuthor and query it like this:
CREATE TABLE PhotoAlbumsByAuthor (
photoalbumid uuid,
title text,
author text,
tags set<text>,
photos set<uuid>,
PRIMARY KEY(author,title,photoalbumid)
);
> SELECT * FROM photoalbumsbyauthor WHERE author='Malcolm Reynolds';
That will return all albums that the user Malcolm Reynolds has created, sorted by title (as title is the first clustering key).
For Query 4 I would create comments as a user defined type (UDT):
CREATE TYPE yourkeyspacename.comment (
commenter text,
commenttext text
);
Then I would create a query table called PhotosByAlbum and query it like this:
CREATE TABLE PhotosByAlbum (
photoalbumid uuid,
photoid uuid,
cap text,
auth text,
path text,
tags set<text>,
comments map<uuid,frozen <comment>>,
PRIMARY KEY(photoalbumid,photoid)
);
> SELECT * FROM PhotosByAlbum WHERE photoalbumid=a50aa80a-8714-44b4-9b97-43ec4b13daa6;
When you add a comment to this table, the uuid key of the map is the commentid. This way you can quickly grab all of the keys and/or values on your application side. In any case, this will return all photos for a given photoalbumid, along with any comments.
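For example, adding a comment to a photo might look like this (a minimal sketch; the UUID literals are just placeholders for the commentid, album id, and photo id):
UPDATE PhotosByAlbum
SET comments[e4b1c9a0-1f3d-4c2a-9b6e-2a7c5d8e1f00] = { commenter: 'bob', commenttext: 'photo is cool' }
WHERE photoalbumid=a50aa80a-8714-44b4-9b97-43ec4b13daa6 AND photoid=7f1d2c3b-4a5e-6f70-8192-a3b4c5d6e7f8;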
I would solve Query 5 in a similar way, by creating a query table (you should be noticing a pattern by now) called PhotoAlbumsByTag and query it like this:
CREATE TABLE PhotoAlbumsByTag (
tag text,
photoalbumid uuid,
title text,
author text,
photos set<uuid>,
PRIMARY KEY(tag,title,photoalbumid)
);
> SELECT * FROM PhotoAlbumsByTag WHERE tag='family';
This will return all photo albums with the "family" tag. Note that this is a denormalized structure of the tags set<text> used above, which means that a photo album will have one entry in this table for each tag it contains. I thought about reusing one of the prior query tables with a secondary index on tags set<text> (as Cassandra now allows indexes on collections), but secondary indexes don't typically perform well. And you would still have to execute a query for each tag in the current album anyway (a SELECT with the IN keyword is known to not perform well, either).
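To keep this table in sync, the application writes one row per tag whenever an album is created or tagged. A rough sketch of that write, using a logged batch so the denormalized rows stay consistent (all values are placeholders):
BEGIN BATCH
INSERT INTO PhotoAlbumsByTag (tag, title, photoalbumid, author, photos)
VALUES ('bob', 't1', a50aa80a-8714-44b4-9b97-43ec4b13daa6, 'author1', {7f1d2c3b-4a5e-6f70-8192-a3b4c5d6e7f8});
INSERT INTO PhotoAlbumsByTag (tag, title, photoalbumid, author, photos)
VALUES ('fun', 't1', a50aa80a-8714-44b4-9b97-43ec4b13daa6, 'author1', {7f1d2c3b-4a5e-6f70-8192-a3b4c5d6e7f8});
APPLY BATCH;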
As for the first two queries, I would create specific tables to store the likes/views/comments counts like this:
CREATE TABLE PhotoCounters (
photoid uuid,
views counter,
comments counter,
likes counter,
PRIMARY KEY (photoid)
);
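Counter columns can only be modified with UPDATE (there is no INSERT for counters; the row is created implicitly on the first increment), so recording a view or a like would look something like this sketch (the UUID is a placeholder):
UPDATE PhotoCounters SET views = views + 1 WHERE photoid=7f1d2c3b-4a5e-6f70-8192-a3b4c5d6e7f8;
UPDATE PhotoCounters SET likes = likes + 1 WHERE photoid=7f1d2c3b-4a5e-6f70-8192-a3b4c5d6e7f8;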
When using the counter type, Cassandra requires that the primary key and counters be the only columns in that table (can't mix counters with non-counter columns). And I would also process queries/reports on those offline, in an OLAP fashion, using Hadoop or Spark. Hope this helps.
I use Hasura and I have a social-network-like situation.
In which I have a "User" object and a "Feed" object.
Every user has a feed.
I have a relationship from user.id to feed.id.
The relevant mutation is UpsertUserDetails as follows:
mutation UserDetailsUpsert(
$email: String!
$picture: String
) {
insert_users_one(
object: {
email: $email
feed: { data: {} }
picture: $picture
}
on_conflict: { constraint: users_tid_email_key, update_columns: [picture] }
) {
id
}
}
So when I create a new user it also creates a feed for it.
But when I only update user details I don't want it to create a new feed.
I would like to stop the upsert from cascading into the relationship, instead of the default behavior above.
According to the manual, I can't tell whether that is even possible: https://hasura.io/docs/latest/graphql/core/databases/postgres/mutations/upsert.html#upsert-in-nested-mutations
To allow upserting in nested cases, set update_columns: []. By doing this, in case of a conflict, the conflicted column/s will be updated with the new value (which is the same values as they had before and hence will effectively leave them unchanged) and will allow the upsert to go through.
Thanks!
I'd recommend that you design your schema such that bad data cannot be entered in the first place. You can put partial unique indices on the feed table in order to prevent duplicate feeds from ever being created. Since you have both users and groups you can implement it with 2 partial indices.
CREATE UNIQUE INDEX unique_feed_per_user ON feed (user_id)
WHERE user_id IS NOT NULL;
CREATE UNIQUE INDEX unique_feed_per_group ON feed (group_id)
WHERE group_id IS NOT NULL;
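With those indexes in place, a second feed for the same user (or group) is rejected at the database level; a quick sketch (assuming feed's other columns have defaults, and 'user-1' is a placeholder id):
INSERT INTO feed (user_id) VALUES ('user-1'); -- succeeds
INSERT INTO feed (user_id) VALUES ('user-1'); -- ERROR: duplicate key value violates unique constraint "unique_feed_per_user"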
This is the simplified structure of my tables - 2 main tables, one relation table.
What's the best way to handle an insert API for this?
If I just have a Client and Supabase:
- First API call to insert book and get ID
- Second API call to insert genre and get ID
- Third API call to insert book-genre relation
This is what I can think of, but 3 API calls seems wrong.
Is there a way where I can do insert into these 3 tables with a single API call from my client, like a single postgres function that I can call?
Please share a general example with the API, thanks!
Is there any reason you need to do this with a single call? I'm assuming from your structure that you're not going to create a new genre for every book you create, so most of the time, you're just inserting a book record and a book_genre_rel record. In the real world, you're probably going to have books that fall into multiple genres, so eventually you're going to be changing your function to handle the insert of a single book along with multiple genres in a single call.
That being said, there are two ways to approach this. First, you can make multiple API calls from the client (and there's really no problem doing this -- it's quite common). Second, you could do it all in a single call if you create a PostgreSQL function and call it with .rpc().
Example using just client calls to insert a record in each table:
const { data: genre_data, error: genre_error } = await supabase
.from('genre')
.insert([
{ name: 'Technology' }
]);
const genre_id = genre_data[0].id;
const { data: book_data, error: book_error } = await supabase
.from('book')
.insert([
{ name: 'The Joys of PostgreSQL' }
]);
const book_id = book_data[0].id;
const { data: book_genre_rel_data, error: book_genre_rel_error } = await supabase
  .from('book_genre_rel')
.insert([
{ book_id, genre_id }
]);
Here's a single SQL statement to insert into the 3 tables at once:
WITH genre AS (
insert into genre (name) values ('horror') returning id
),
book AS (
insert into book (name) values ('my scary book') returning id
)
insert into book_genre_rel (genre_id, book_id)
select genre.id, book.id from genre, book
Now here's a PostgreSQL function to do everything in a single function call:
CREATE OR REPLACE FUNCTION public.insert_book_and_genre(book_name text, genre_name text)
RETURNS void language SQL AS
$$
WITH genre AS (
insert into genre (name) values (genre_name) returning id
),
book AS (
insert into book (name) values (book_name) returning id
)
insert into book_genre_rel (genre_id, book_id)
select genre.id, book.id from genre, book
$$;
Here's an example to test it:
select insert_book_and_genre('how to win friends by writing good sql', 'self-help')
Now, if you've created that function (inside the Supabase Query Editor), then you can call it from the client like this:
const { data, error } = await supabase
.rpc('insert_book_and_genre', {book_name: 'how I became a millionaire at age 3', genre_name: 'lifestyle'})
Again, I don't recommend this approach, at least not for the genre part. You should insert your genres first (they probably won't change) and simplify this to just insert a book and a book_genre_rel record.
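If the genres are loaded ahead of time, the simplified flow is just two client calls (a sketch, assuming you have already looked up genre_id):
const { data: book_data, error: book_error } = await supabase
  .from('book')
  .insert([
    { name: 'The Joys of PostgreSQL' }
  ]);
const book_id = book_data[0].id;

const { error: rel_error } = await supabase
  .from('book_genre_rel')
  .insert([
    { book_id, genre_id }
  ]);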
I'm trying to model a cataloging system in DynamoDB. It has "Catalogs", which contain "Collections". Each "Collection" can be tagged with many "Tags".
In an RDBMS I would create a table "Catalogs" with a 1:n relationship with "Collections". "Collections" would have an n:n with "Tags" as a Collection can have multiple Tags and a Tag can belong to multiple Collections.
The queries I want to run are:
1) Get all catalogs
2) Get catalog by ID
3) Get collections by catalog ID
I read in the AWS documentation that I can use the adjacency list design pattern (because I have the n:n with "Tags"). So here is my table structure:
PK SK name
cat-1 cat-1 Sales Catalog
cat-1 col-1 Sales First Collection
cat-1 col-2 Sales Second Collection
cat-2 cat-2 Finance Catalog
tag-1 tag-1 Recently Added Tag
col-1 tag-1 (collection, tag relationship)
The problem here is that I have to use a scan, which I understand to be inefficient, in order to get all "Catalogs", because a query's PK condition has to be an '=' and not a 'begins_with'.
The only thing I can think of is creating another attribute like "GSI_PK" and setting it to "Catalog_1" when the PK is cat-1 and the SK is cat-1, "Catalog_2" when the PK is cat-2 and the SK is cat-2. I've never really seen this done, so I'm not sure if it's the way to go, and it takes some maintenance if I ever want to change IDs.
Any ideas how I would accomplish this?
In that case, you can have the PK be the type of the object and the SK be a uuid. A record would look like this: { PK: "Catalog", SK: "uuid", ...other catalog fields }. You can then get all catalogs by doing a query on PK = "Catalog".
To store the associations you can have a GSI on two fields, sourcePK and relatedPK, where you store records that associate things. To associate an object you would create a record like { PK: "Association", SK: "uuid", sourcePK: "catalog-1", relatedPK: "collection-1", ...other data on the association }. To find objects associated with the "Catalog" with id 1, you would do a query on the GSI where sourcePK = catalog-1.
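For example, with the AWS SDK DocumentClient the two queries might look like this (a sketch inside an async function; the table name 'CatalogTable' and index name 'source-pk-index' are assumptions):
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// All catalogs: query the base table where PK = "Catalog"
const catalogs = await docClient.query({
  TableName: 'CatalogTable',
  KeyConditionExpression: 'PK = :pk',
  ExpressionAttributeValues: { ':pk': 'Catalog' }
}).promise();

// Everything associated with catalog-1: query the GSI on sourcePK
const associations = await docClient.query({
  TableName: 'CatalogTable',
  IndexName: 'source-pk-index',
  KeyConditionExpression: 'sourcePK = :src',
  ExpressionAttributeValues: { ':src': 'catalog-1' }
}).promise();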
With this setup you need to be careful about hot keys and should make sure you never have more than 10GBs of data under the same partition key in a table or index.
Let's walk through it. I'll use GraphQL SDL to lay out the design of the data model & queries, but you can apply the same concepts to DynamoDB directly.
Thinking data model first, we will have something like:
type Catalog {
id: ID!
name: String
# Use a DynamoDB query on the **Collection** table
# where the **catalogId = $ctx.source.id**. Use a GSI or make catalogId the PK.
collections: [Collection]
}
type Collection {
id: ID!
name: String
# Use a DynamoDB query on the **CollectionTag** table where
# the **collectionId = $ctx.source.id**. Use a GSI or make the collectionId the PK.
tags: [CollectionTag]
}
# The "association map" idea as a GraphQL type. The underlying table has a collectionId and tagId.
# Create objects of this type to associate a collection and tag in the many to many relationship.
type CollectionTag {
# Do a GetItem on the **Collection** table where **id = $ctx.source.collectionId**
collection: Collection
# Do a GetItem on the **Tag** table where **id = $ctx.source.tagId**
tag: Tag
}
type Tag {
id: ID!
name: String
# Use a DynamoDB query on the **CollectionTag** table where
# the **tagId = $ctx.source.id**. If collectionId is the PK then make a GSI where this tagId is the PK.
collections: [CollectionTag]
}
# Root level queries
type Query {
# GetItem to **Catalog** table where **id = $ctx.args.id**
getCatalog(id: ID!): Catalog
# Scan to **Catalog** table. As long as you don't care about ordering on a field in particular then
# this will likely be okay at the top level. If you only want all catalogs where "arePublished = 1",
# for example then we would likely change this.
allCatalogs: [Catalog]
# Note: You don't really need a getCollectionsByCatalogId(catalogId: ID!) at the top level because you can
# use `query { getCatalog(id: "***") { collections { ... } } }` which is effectively the same thing.
# You could add another field here if having it at the top level was a requirement
getCollectionsByCatalogId(catalogId: ID!): [Collection]
}
Note: Everywhere I use [Collection] or [Catalog] etc. above, you should use a CollectionConnection, CatalogConnection, etc. wrapper type to enable pagination.
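A connection wrapper would look something like this (a sketch following the common AppSync-style convention):
type CatalogConnection {
  items: [Catalog]
  nextToken: String
}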
I have made the user's email address the unique key (_id) for my entire users database:
var usersSchema = new Schema({
_id: String, // Users Unique Email address
name: String, // Users name
phone: String, // Users phone number
country: String, // Country
type: String, // tenant/agent/manager/admin
username: String, // Username for login
password: String, // Password string
trello: Object, // Trello auth settings
settings: Object, // Settings for manager and other things
createDate: Number, // Date user was created
contactDate: Number, // Date user was last contacted
activityDate: Number // Date of last activity on this user (update/log/etc)
});
So what if the user changes email address?
Is my only way to delete the record and create it again?
Or is there a smarter way?
And users._id (the email) has relations in 16 other collections.
For example, the booking schema:
var bookingSchema = new Schema({
_id: String, // Unique booking ID
user: String, // User ID --> users._id
property: String, // Property ID --> property._id
checkin: Number, // Check in Date
checkout: Number // Check out Date
});
One user can have a LOT of bookings
What I would do is find all records that match the email, loop over them with for (i = 0; i < bookings.length; i++), and update the email of each record.
Is there a smarter way to update all emails that match using only one mongo call?
(the reason is there are so many relations, so my loop seems a bit like a very primitive way of doing it)
I would say it's much cleaner to create a separate field for the email and create a unique index on that.
Unfortunately, relationships like the ones in relational databases still aren't supported. According to the latest talks, there are plans to add this feature natively.
The best solution for you would be to think about how to use sub-documents to make things more consistent.
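In practice that means keeping _id as a normal ObjectId, referencing that _id from the other collections, and indexing the email separately, so a change of address touches only one document. A rough sketch (field names are assumptions):
var usersSchema = new Schema({
  email: { type: String, unique: true }, // unique index instead of using the email as _id
  name: String,
  // ...other fields as before; _id stays the default ObjectId
});
If you do keep the current email-as-_id design, the loop can at least be collapsed into one call per collection, e.g. Booking.updateMany({ user: oldEmail }, { $set: { user: newEmail } }) (or update() with { multi: true } on older driver versions).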
Consider the following situation:
I have a page with the following fields:
pageid, title, content, like, follow, field1, field2..., field100, pagecomments, images
Like and follow are counter fields that will increase on each click.
Now I am thinking of designing this in Cassandra in the following ways:
**TYPE A**
page_table {
page_id,
title,
content,
like,
follow,
posted_by,
datetime,
image1,
image2,
field1,
field2...,
field100
}
page_comments {
commentid,
page_id,
text,
comment_like,
posted_by,
datetime
}
**TYPE B**
page_table {
page_id,
title,
content,
posted_by,
datetime,
image1,
image2,
field1,
field2...,
field100
}
page_like {
page_id,
like
}
page_follow {
page_id,
follow
}
page_comments {
commentid,
page_id,
text,
comment_like,
posted_by,
datetime
}
Which one is the best way? Or can you suggest a good Cassandra database design for this, using CQL?
You may want to read up on some NoSQL patterns:
https://github.com/deanhiller/playorm/wiki/Patterns-Page
If you are going to get all the comments for a page, I don't see any FKs to the comments in page_table, which you will need. This brings to light that the page is missing a pattern: a toMany relationship in NoSQL is frequently embedded in the row itself, rather than kept in a separate index. So you see this in a lot of designs.
page_table {
page_id,
title,
content,
like,
follow,
posted_by,
datetime,
image1,
image2,
fktocomment1,
fktocomment2,
fktocomment3
}
What is typically done is that the fktocomment1 column name is prefixed with the word "comment", so you can find all the FKs by stripping off the "comment" part and using the FK at the end (the column itself holds NO value!).
This is a composite-name pattern, which you can google.
EDIT: I edited the patterns page and added that pattern; it was so common I never thought to add it before.