Can't come up with a good MongoDB schema

I am solving this problem:
I am building an IMGUR clone, where users can upload images and there is a 'latest uploads' page that shows the last 1000 uploaded images.
Users can upload pictures as soon as they sign up, but until a user verifies their email address, their uploads do not show up in 'latest uploads'.
As soon as the user verifies their email, their images start showing up.
If a user is banned, their images do not show up in 'latest uploads' either.
Originally I had Images contain a User ref: I would select the last 1000 images, populating the User, and then iterate over the returned collection, discarding images owned by banned or non-verified users. This breaks when the last 1000 images were all uploaded by unverified users, leaving nothing to show.
I am considering using an array of embedded Image documents on the User object, but that is not ideal either, because a User might own a lot of Images and I do not always want to load them all when I load the User object.
I am open to any solution

I would do the following based on what knowledge I have of your application:
There are two entities that should exist in two different collections: user and uploads.
The uploads collection will be very large, so we want to make sure we can index and shard the collection to handle the scale and performance required by your queries. With that said, some key elements in uploads are:
uploads = {
    _id: uploadId,
    user: { id: userId, emailverified: true, banned: false },
    ts: uploadTime,
    ...
}
possible indexes:
i. {ts:1, "user.emailverified":1, "user.banned":1} (this index should be multi-purpose)
ii. {"user.id":1,ts:1}
Note that I store some redundant data to optimize your latest-1000 query. The cost is that in the rare case where emailverified or banned changes, you need to run an update on your user collection as well as your uploads collection (the latter requires multi:true); see the sketch below the query.
query:
db.uploads.find({ts:{$gt:sometime},"user.emailverified":true,"user.banned":false}).sort({ts:-1}).limit(1000)
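And a minimal sketch of the flag fan-out mentioned above, assuming the collections are named users and uploads (updateMany is the modern spelling of update with multi:true):

// Hypothetical example: userId has just verified their email. Update the
// canonical user document, then fan the flag out to all of their uploads.
db.users.updateOne({ _id: userId }, { $set: { emailverified: true } });
db.uploads.updateMany(
    { "user.id": userId },
    { $set: { "user.emailverified": true } }
);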

Related

How to organize FireStore Collections and Documents based on app similar to BlaBlaCar rides

It's my first time working with FireStore. I'm working on a ridesharing app with Flutter that uses Firebase Auth, where users can create trips and offer rides similarly to BlaBlaCar, and other users can send requests to join a ride. I'm having difficulty deciding not only the potential collections and paths to use, but also how to structure them at all.
For simplicity at this stage, I want any user to be able to see all trips created, but when they go to their “My Rides” page, they will only see the rides that they’ve participated in. I would be grateful for any kind of feedback.
Here are the options I’ve considered:
Two collections, “Users” and “Trips”. The path would look something like this:
users/uid and trips/tripsId with a created_by field
One collection of “Users” with a sub-collection of “Trips”. That path seems to make more sense to me, which would be users/uid/trips/tripId, but then I don't know how other users could access all the rides on their home feed.
I'm inclined to go with the first option of two collections. Also very open to any other suggestions or help. Thanks.
I want any user to be able to see all trips created, but when they go
to their “My Rides” page, they will only see the rides that they’ve
participated in
I assume that participating in a ride means either being the author or being a passenger of the ride.
I would go for 2 collections: one for users and one for trips. In a trip document you add two fields:
createdBy with the uid of the creator
participants: an Array where you store the author's uid and the uids of all the other participants (passengers)
This way you can easily query for:
All the rides
All the rides created by a user
All the rides for which a user is a participant, using arrayContains.
(Regarding the 1 MiB maximum document size: I guess this is not a problem, because the number of passengers in a ride shouldn't be so huge that the participants Array makes the document larger than 1 MiB!)
Note that the second approach with subcollections could also be used, since you can query with collection group queries, but based on the elements in your question I don't see any technical advantage.
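As an illustration, here is roughly how those three queries could look with the Firebase web (JS) SDK; the Flutter API is analogous (arrayContains instead of "array-contains"), and the collection and field names follow the structure described above:

import { getFirestore, collection, query, where, getDocs } from "firebase/firestore";

const db = getFirestore();

// uid is the current user's id.
async function loadRides(uid) {
    // All the rides:
    const allTrips = await getDocs(collection(db, "trips"));

    // All the rides created by the user:
    const created = await getDocs(
        query(collection(db, "trips"), where("createdBy", "==", uid))
    );

    // All the rides for which the user is a participant (author or passenger):
    const participating = await getDocs(
        query(collection(db, "trips"), where("participants", "array-contains", uid))
    );

    return { allTrips, created, participating };
}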

How to update collection documents efficiently when changing a specific value in Firestore?

I have 2 collections. One of them is named "USERS", and the other one "MATCHES". Users can join matches, and the avatar of each user who has joined is shown on the match. The problem is that when a user changes their avatar image after joining a match, the avatar on the match doesn't change, because the match still stores the old avatar.
The avatar is saved as Base64 in Firestore, but I will change it to "Storage" in the near future.
I have been trying to set the reference, but that only gives me the path.
If I have to make a database API call for each match the user has joined, I might have to make 20 API calls to update the matches. That could be a solution, but not the best one.
Maybe the solution is in the Google Functions?
I'm out of ideas.
Maybe the solution is in the Google Functions?
Cloud Functions also access Firestore through an SDK, so they can't magically do things that the SDK doesn't allow.
If you're duplicating data and you update one of the duplicates, you'll have to consider updating the others. If they all need to be updated, that indeed requires a separate call for each duplicate.
If you don't want to have to do this, don't store duplicate data.
For more on the strategies for updating duplicated data, see How to write denormalized data in Firebase
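As a sketch of what that per-duplicate update could look like in a Cloud Function with the Admin SDK; the playerIds and avatars field names are assumptions about the MATCHES schema, not something from the question:

const admin = require("firebase-admin");
admin.initializeApp();
const db = admin.firestore();

// Hypothetical schema: each MATCHES document keeps a playerIds array and an
// avatars map keyed by uid. Called whenever a user saves a new avatar.
async function propagateAvatar(uid, newAvatar) {
    // One read to find every match the user has joined...
    const matches = await db.collection("MATCHES")
        .where("playerIds", "array-contains", uid)
        .get();

    // ...then one write per duplicate, grouped in a batch (max 500 writes).
    const batch = db.batch();
    matches.docs.forEach((doc) => {
        batch.update(doc.ref, { ["avatars." + uid]: newAvatar });
    });
    await batch.commit();
}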

How to get a list of documents based on an array of document ids?

What's the best way to get a list of documents based on another list of document ids?
Say I have User and Profile objects:
Users have a single Profile
Users can save other users' profiles
The same profile can't be saved twice
The documents of the savedProfiles subcollection are stored under the uid of the user each profile belongs to, and each also has a userRef attribute storing that uid again. Given that savedProfiles is a list of uids, is there a way to get a list of profiles based on the savedProfiles subcollection? Currently I am able to make a get request for a user's saved profiles, which returns a list of uids that I store in a variable. I'm just wondering how I would make the next request to get the full profiles based on that list. Saving the whole profile in a user's savedProfiles is not an option, since profiles can be updated quite frequently and it would be expensive to find and change every copy of a profile whenever it changes (say there were 100,000 users with an average of around 10 saved profiles each).
Please tell me if there's a way to make this sort of query or if there's a better way to structure my data. Thanks.
Ok, so I think I managed to find a way. From my cloud function I used admin.firestore().getAll(...refList), where refList is an array of the document references I want. This returns a promise that resolves to a list of documents matching those references. The data for each document is accessible as usual with doc.data(). I've either solved the problem, or I'm doing something very wrong. Feel free to comment and let me know if what I'm doing is okay. Thanks
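For reference, a sketch of that approach; note that getAll() takes the references as separate arguments, hence the spread. The users/savedProfiles/profiles collection names are assumptions:

const admin = require("firebase-admin");
const db = admin.firestore();

async function getSavedProfiles(uid) {
    // Read the saved-profile ids from the user's subcollection.
    const saved = await db.collection("users").doc(uid)
        .collection("savedProfiles").get();

    // Turn each id into a document reference...
    const refList = saved.docs.map((d) => db.collection("profiles").doc(d.id));
    if (refList.length === 0) return [];

    // ...and fetch them all in a single call.
    const docs = await db.getAll(...refList);
    return docs.filter((d) => d.exists).map((d) => d.data());
}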

Structuring a Cassandra database

I don't understand one thing about Cassandra. Say I have a website similar to Facebook, where people can share, like, comment, upload images and so on.
Now let's say I want to get all the things my friends did:
Username1 liked your comment
Username2 updated his profile picture
And so on.
So after a lot of reading, I guess what I would need to do is create a new Column Family for each single thing, for example: user_likes, user_comments, user_shares. Basically anything you can think of. And even after I do that, I would still need to create secondary indexes on most of the columns just so I could search for data? And even so, how would I know which users are my friends? Would I need to first get all of my friends' ids and then search through all of those Column Families for each user id?
EDIT
Ok, so I did some more reading and now I understand things a little bit better, but I still can't really figure out how to structure my tables. So I will set a bounty, and I want to get a clear example of how my tables should look if I want to store and retrieve data in this kind of order:
All
Likes
Comments
Favourites
Downloads
Shares
Messages
So let's say I want to retrieve the ten last uploaded files of all my friends or the people I follow; this is how it would look:
John uploaded song AC/DC - Back in Black 10 mins ago
And everything like comments and shares would be similar to that...
Now probably the biggest challenge would be to retrieve the 10 last things across all categories together, so the list would be a mix of all of them...
I don't need an answer with fully detailed tables; I just need a really clear example of how I would structure and retrieve data like I would do in MySQL with joins.
With SQL, you structure your tables to normalize your data, and use indexes and joins to query. With Cassandra, you can't do that, so you structure your tables to serve your queries, which requires denormalization.
You want to query items which your friends uploaded. One way to do this is to have a single row per user, and to write to that row whenever a friend of that user uploads something.
friendUploads { # column family
    userid { # row key
        timestamp-upload-id : null # column name : no value
    }
}
as an example,
friendUploads {
    userA {
        12313-upload5 : null
        12512-upload6 : null
        13512-upload8 : null
    }
    userB {
        11313-upload3 : null
        12512-upload6 : null
    }
}
Note that upload6 is duplicated into two different rows, as whoever did upload6 is a friend of both userA and userB.
Now, to build the friend-uploads display for a user, do a getSlice with a limit of 10 on that user's row. This will return the first 10 columns, sorted by column name.
To put newest items first, use a reverse comparator that sorts larger timestamps before smaller timestamps.
The drawback of this design is that when User A uploads a song, you have to do N writes to update friendUploads, where N is the number of people who are friends of User A.
For the value associated with each timestamp-upload-id column, you can store enough information to display the result (probably as a JSON blob), or you can store nothing and fetch the upload information by its uploadid.
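The description above uses the old Thrift vocabulary (getSlice, reverse comparator). In modern CQL the same wide-row design might look roughly like this, sketched with the DataStax Node.js driver; the keyspace, table, and column names are invented for illustration:

const cassandra = require("cassandra-driver");
const client = new cassandra.Client({
    contactPoints: ["127.0.0.1"],
    localDataCenter: "datacenter1",
    keyspace: "feeds",
});

// CREATE TABLE friend_uploads (
//     userid    text,
//     ts        timeuuid,
//     upload_id text,
//     PRIMARY KEY (userid, ts)
// ) WITH CLUSTERING ORDER BY (ts DESC);  -- the "reverse comparator"

// Fan-out write: one insert per friend of the uploader.
async function recordUpload(friendIds, uploadId) {
    const q = "INSERT INTO friend_uploads (userid, ts, upload_id) VALUES (?, now(), ?)";
    await Promise.all(
        friendIds.map((f) => client.execute(q, [f, uploadId], { prepare: true }))
    );
}

// The "getSlice with a limit of 10": newest first via the clustering order.
async function latestFriendUploads(userid) {
    const rs = await client.execute(
        "SELECT ts, upload_id FROM friend_uploads WHERE userid = ? LIMIT 10",
        [userid],
        { prepare: true }
    );
    return rs.rows;
}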
To avoid duplicating writes, you can use a structure like,
userUploads { # column family
    userid { # row key
        timestamp-upload-id : null # column name : no value
    }
}
This stores the uploads of a particular user. Now when you want to display the uploads of User B's friends, you have to do N queries, one for each friend of User B, and merge the results in your application. This is slower to query, but faster to write.
Most likely, if users can have thousands of friends, you would use the first scheme and do more writes rather than more queries, as the writes can happen in the background after the user uploads, while the queries have to happen while the user is waiting.
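For contrast, a sketch of the read side of the second scheme (reusing the client above; a user_uploads table mirroring userUploads is assumed): one query per friend, merged in the application:

// Scatter-gather read: query each friend's own row, then merge newest-first
// in the application and keep the top N overall.
async function friendsFeed(friendIds, limit = 10) {
    const q = "SELECT ts, upload_id FROM user_uploads WHERE userid = ? LIMIT ?";
    const slices = await Promise.all(
        friendIds.map((f) => client.execute(q, [f, limit], { prepare: true }))
    );
    return slices
        .flatMap((rs) => rs.rows)
        .sort((a, b) => b.ts.getDate() - a.ts.getDate())  // timeuuid -> Date
        .slice(0, limit);
}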
As an example of denormalization, look at how many writes Twitter's Rainbird does when a single click occurs. Each write is used to support a single query.
In some regards, you "can" treat NoSQL as a relational store. In others, you denormalize to make things faster. For instance, PlayOrm's @OneToMany stores the "many" side like so:
user1 -> friend.user23, friend.user25, friend.user56, friend.user87
This is the wide-row approach, so when you find your user, you have all the foreign keys to his friends. Each row can be a different length. You may also store a reverse reference as well, so the user might have references to the people that marked him as a friend but whom he did not mark back (let's call them buddies), so you might have
user1 -> friend.user23, friend.user25, buddy.user29, buddy.user37
Notice that if designed right, you may NOT need to "search" for the data. That said, with PlayOrm you can still do Scalable SQL and do joins (you just have to figure out how to partition your tables so it can scale to trillions of rows).
A row can have millions of columns in it, or it could have just 10. We are actually in the process of updating a lot of the documentation in PlayOrm and the noSQL patterns this month, so if you keep an eye on that, you can learn more about general noSQL there as well.
Dean
Think of each DB query as a request to a service running on another machine. Your goal is to minimize the number of these requests, because each one requires a network round trip.
Here is the main difference from the RDBMS paradigm: in SQL you would typically use joins and secondary indexes. In Cassandra joins aren't possible, since related data would reside on different servers. Things like materialized views are used in Cassandra for the same purpose: to fetch all related data with a single query.
I'd recommend reading this article:
http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/
And look into the twissandra sample project: https://github.com/twissandra/twissandra
It is a nice collection of optimization techniques for the kind of project you described.

MongoDB storing user-specific data on shared collection objects

I'm designing an application that processes RSS feeds using MongoDB. Currently my collections are as follows:
Entry
fields: content, feed_id, title, publish_date, url
Feed
fields: description, title, url
User
fields: email_address
subscriptions (embedded collection; fields: feed_id, tags)
A user can subscribe to feeds which are linked from the embedded subscription collection. From the subscriptions I can get a list of all the feeds a user should see and also the corresponding entries.
How should I store entry status information (isRead, isStarred, etc.) that is specific to a user? When a user views an entry I need to record isRead = 1. Two common queries I need to be able to perform are:
Find all entries for a specific feed where isRead = 0 or no status exists currently
For a specific user, mark all entries prior to a publish date with isRead = 1 (this could be hundreds or even thousands of records so it must be efficient)
Hmm, this is a tricky one!
It makes sense to me to store a record for entries that are unread, and delete them when they're read. I'm basing this on the assumption that there will be more read posts than unread for each individual user, so you might as well not have documents for all of those already-read entries sitting around in your DB forever. It also makes it easier to not have to worry about the 16MB document size limit if you're not having to drag around years of history with you everywhere.
For starred entries, I would simply add an array of Entry ObjectIds to User. No need to make these subscription-specific; it'll be much easier to pull a list of items a User has starred that way.
For unread entries, it's a little more complex. I'd still add them as an array, but to satisfy your requirement of quickly marking entries before a specific date as read, I would denormalize and save the publish date alongside the Entry ObjectId, in a new 'UnreadEntry' document.
User
fields: email_address, starred_entries[]
subscriptions (embedded collection; fields: feed_id, tags, unread_entries[])
UnreadEntry
fields: _id (the Entry's ObjectId), publish_date
You need to be conscious of the document limit, but 16MB is one hell of a lot of unread entries/feeds, so be realistic about whether that's a limit you really need to worry about. (If it is, it should be fairly straightforward to break out User.subscriptions to its own document.)
Both of your queries now become fairly easy to write:
All entries for a specific feed that are unread:
user.subscriptions.find(feedID).unread_entries
Mark all entries prior to a publish date read:
user.subscriptions.find(feedID).unread_entries.where(publish_date.lte => my_date).delete_all
And, of course, if you simply need to mark all entries in a feed as read, that's very easy:
user.subscriptions.find(feedID).unread_entries.delete_all
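For comparison, a minimal mongo-shell sketch of those three operations, assuming user documents embed a subscriptions array whose elements hold an unread_entries array of { entry_id, publish_date } pairs (names illustrative):

// All unread entries for one feed (the positional $ projects the matched subscription):
db.users.findOne(
    { _id: userId, "subscriptions.feed_id": feedId },
    { "subscriptions.$": 1 }
).subscriptions[0].unread_entries;

// Mark everything published on or before my_date as read, i.e. pull it from the array:
db.users.updateOne(
    { _id: userId, "subscriptions.feed_id": feedId },
    { $pull: { "subscriptions.$.unread_entries": { publish_date: { $lte: my_date } } } }
);

// Mark all entries in the feed as read:
db.users.updateOne(
    { _id: userId, "subscriptions.feed_id": feedId },
    { $set: { "subscriptions.$.unread_entries": [] } }
);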