How to structure Dynamo 'posts' table(s) to enable querying by user ID OR post timestamp? - nosql

I'm working on a "Buffer-like" (scheduled posts) type application and looking into Dynamo as the data store. I'm more familiar with SQL and am having a little trouble wrapping my head around how best to accomplish this in Dynamo (or if Dynamo is even the right choice given the requirements).
Basics:
Users create posts that are scheduled to be posted at a later date/time (Unix timestamp).
Posts are stored in a Posts table in Dynamo.
Desired functionality/query-ability:
For sake of editing/retrieving scheduled posts, want to be able to query by user ID to retrieve all posts by a given user ID.
On the other hand, when it comes to executing scheduled posts at the appropriate time, would like to have a 'regular' job that runs and sweeps through to find all posts scheduled for, say, the next 15 minutes, based on the timestamp.
Where I'm stumped:
For sake of querying posts, it makes sense to have the user ID be the partition key and a unique post ID serve as the sort key (I think). Using timestamp as the sort key isn't doable as there's no guarantee that user ID + timestamp will be unique.
On the other hand, for executing scheduled posts, user ID is somewhat irrelevant and I just need all posts scheduled in a window between two timestamps (i.e. the next 15 minutes). Even if the partition key was YYYYMMDD and the sort key was the timestamp this still wouldn't work, as again it wouldn't necessarily be unique. And I'd lose the ability to easily query for all posts by a given user ID.
My thought is that user ID = partition key and unique post ID = sort key, and that the timestamp need could be accomplished by a GSI, but then that still would necessitate querying across all partitions, no? (again, still wrapping my head around GSIs)
In summation, wondering if A) This is even feasible with Dynamo and if so how best to accomplish, and B) If I'm trying to fit a square peg in a round hole and should be looking at a different data store option entirely.
Thanks in advance for any help here!

Create a GSI. For the PK provide a constant "Timeline" or whatever so all items go under the same partition. For the SK provide the timestamp. Then you can easily and efficiently Query against the GSI for all posts within a date range (regardless of user). The base table's PK and SK will be projected in so you can pull the post-id from the GSI.
If your write activity exceeds roughly 1,000 new posts per second (the per-partition write limit), you'll want to think about adding some write sharding, since all items in this GSI share the single "Timeline" partition.
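As a rough illustration, a sweep against such a GSI might look like this in Python with boto3. The table name ("Posts"), index name ("TimelineIndex"), and attribute names ("GSI1PK", "ScheduledAt") are placeholder assumptions, not anything prescribed above:

import time

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Posts")  # table name is assumed

now = int(time.time())
window_end = now + 15 * 60  # sweep covers the next 15 minutes

response = table.query(
    IndexName="TimelineIndex",
    KeyConditionExpression=(
        Key("GSI1PK").eq("Timeline")                    # constant partition key
        & Key("ScheduledAt").between(now, window_end)   # timestamp sort key range
    ),
)
due_posts = response["Items"]  # base-table keys (user ID / post ID) are projected in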

Related

NoSQL Database Design with multiple indexes

I have a DynamoDB/NoSQL/MongoDB question. I am from an RDBMS background and struggling to get a design right for a NoSQL DB. If anyone can help!
I have the following objects (tables in my terms):
Organisation
Users
Courses
Units
I want the following access points, of which most are achievable:
Get/Create/Update and Delete Organisation
Get/Create/Update and Delete Users
Get/Create/Update and Delete Courses
Which I can achieve.
The issue is that the Users and Courses objects have many ways to retrieve data:
email
username
For example: List Users on course.
List users for Org.
List courses for Org.
List users in org
list users in unit
All of these use secondary indexes, which I semi-understand, but I also have tertiary-ish indexes, though that is probably down to my design.
Coming from a relational methodology, I am not sure about reporting, how would it work if I wanted to do a search for all users under the courses which have not completed their (call it status flag)?
From what I understand I need indexes for everything I want to search by?
AWS DynamoDB is my preference, but another NoSQL I'll happily consider. I realise that I need more education regarding NoSQL, so please if anyone can provide good documentation and examples which help the learning process, that will be awesome.
Regards
Richard
I have watched a few Udemy videos and been Googling for many weeks (oh, and checked here "obviously").
Things to keep in mind
Partitions
In DynamoDB everything is organized in partitions that give you hash-based access to elements. This is very powerful in terms of performance, but each partition has limits, so, similarly to the hash function in a hash map, the partition key should try to distribute the elements as evenly as possible.
Single Table Design
Don't split the data into multiple tables. This makes everything harder and actually limits the capabilities of the DB. Store everything in a single table.
Keys
Keys in dynamo have to be designed around your access patterns. This is the hardest part.
You have the Partition Key (Hash Key) -> this key has to be specified exactly every time. You can't perform a query without knowing the PK. This is why placing things like timestamps into the PK is a really bad idea.
Sort (Range) keys -> these are used for querying as specified in the AWS docs.
Attribute names
DB migrations are really hard in NoSQL so you have to use generic names for the attributes. They should not have any meaning.
For example "UserID" is a bad name for partition key, "PK" is a good name for partition key, same goes for all keys.
Indexes
You have two types of indexes, local and global.
Local Indexes are created once when you create the table and can't be changed (easily) afterward. You can only have a few of them. They give you an extra sort key to work with. The main benefit is that they are strongly consistent.
Global Indexes can be created at any time. They give you both a new partition key and sort key to work with, but are eventually consistent. Go with global indexes unless you have a good reason to use local.
Going back to your problem, if we focus on one of the tables as an example - Users.
The user can be inserted like this (for example):
PK                    SK                    GSI1PK                GSI1SK               Attributes
Username#john123      Email#john@gmail.com  Email#john@gmail.com  Username#john123     <User Data>
This way you can query users by email and username. Keep in mind that PK and SK have to be unique pairs. SK in this case is free and can be used for other access patterns (which you didn't provide)
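As a hedged sketch, the lookup by email through that GSI could be expressed with boto3 like this; the table name ("AppTable") and index name ("GSI1") are assumptions for illustration:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")  # single-table design

# Look a user up by email address through the global secondary index.
response = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("Email#john@gmail.com"),
)
user_items = response["Items"]  # base-table PK/SK are projected into the GSI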
Another way might be to copy the data
PK                    SK                    Attributes
Username#john123      Email#john@gmail.com  <user data>
Email#john@gmail.com  Username#john123      <user data>
This way you avoid having to deal with indexes (which might be expensive sometimes), but you have to manually keep the user data consistent.
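If you go the copy-the-data route, a transactional write can keep both copies in step. A minimal sketch, assuming a table named "AppTable" and illustrative attribute values:

import boto3

client = boto3.client("dynamodb")

user_data = {"DisplayName": {"S": "John"}, "Plan": {"S": "free"}}  # example attributes

# Write both copies atomically so readers by username or by email see the same data.
client.transact_write_items(
    TransactItems=[
        {"Put": {"TableName": "AppTable",
                 "Item": {"PK": {"S": "Username#john123"},
                          "SK": {"S": "Email#john@gmail.com"},
                          **user_data}}},
        {"Put": {"TableName": "AppTable",
                 "Item": {"PK": {"S": "Email#john@gmail.com"},
                          "SK": {"S": "Username#john123"},
                          **user_data}}},
    ]
)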
Further reading
-> https://www.alexdebrie.com/posts/dynamodb-single-table/
-> my medium post

Compound indexes in MongoDB efficiency unclear

I'm looking for a structure to save user data for a Discord bot.
The context is that I need a unique save for a user for each Discord server (a.k.a. guild) he is on.
Therefore neither userID nor guildID alone would be unique, but I could use them as a compound index to quickly find users inside the users collection.
Is my train of thought correct until now?
My actual question is:
Which ID should be the first field the index is "sorted" by?
There are multiple hundreds or thousands of users per guild, but a single user is only in about 1-5 of the guilds the bot is on.
Therefore first searching by guildID would make the amount of data to search in by userID somewhat smaller.
But first searching for userID would make the amount of data to search in by guildID even smaller.
Since the DB will search both indexes completely anyway, step 1 will be similarly quick for both, so the second idea, filtering by userID first and then by guildID, seems more efficient to me.
I'd like to know if my assumption seems viable, and if not, why not.
Or if there would be a better way that i haven't thought of.
Thanks in advance!
Compound indexes worked fine.
My data set is still not big enough to see any difference between the two orderings, so I can't say anything about that.
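For reference, a compound index like the one discussed could be declared as follows with pymongo; the database, collection, and field names are illustrative assumptions:

from pymongo import ASCENDING, MongoClient

users = MongoClient()["botdb"]["users"]

# One compound index, guild first; the unique flag enforces one save per (guild, user).
users.create_index([("guildId", ASCENDING), ("userId", ASCENDING)], unique=True)

# Point lookups supply both fields, so either field order serves them equally well;
# the order matters mainly for prefix queries such as "all saves in one guild".
save = users.find_one({"guildId": "123456789", "userId": "987654321"})
guild_saves = users.find({"guildId": "123456789"})  # can use the index prefix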

DynamoDB schema for querying date range

I'm learning to use DynamoDB table and storing some job postings with info like date posted, company, and job title.
The query I use most is: get all job postings newer than date X.
What partition key should I use so that I can do the above query without using a scan?
Partition key can only be checked for equality so using date as the partition key is no good. Date as the sort key seems best since I can query using equality on that.
However I'm a bit stuck on what is a good partition key to use then. If I put company or job title, I would have to include that as part of my query but I want ALL job postings after a certain date not just for specific company or job.
One way I thought of was using the month as the partition key and the date as the sort key. That way, to get say the last 14 days, I know I need to hit the partition key of this month and maybe the last month. Then I can use the sort key to just keep the records within the last 14 days. This seems hackish, though.
I would probably do something similar to what you mentioned in the last paragraph - keep a sub-part of the date as the partition key. Either use something like the month, or the first N digits of the unix timestamp, or something similar.
Note that, depending on how large the partitions you choose are, you may still need to perform multiple queries when querying for, say, the last 14 days of posts, due to crossing partition boundaries (when querying for the last 14 days on January 4 you would also want to query December of the previous year, etc.), but it should still be usable.
Remember that it's important to choose the partition key so that items are as evenly distributed as possible, so any hacks involving a lot of (or, as is sometimes seen in questions on SO: ALL!) items sharing the same partition key to simplify sorting is not a good idea.
Perhaps you might also want to have a look at Time-to-live to have AWS automatically delete items after a certain amount of time. This way, you could keep one table of the newest items, and "archive" all other items which are not frequently queried. Of course you could also do something similar manually by keeping separate tables for new and archived posts, but TTL is pretty neat for auto-expiring items. Querying for all new posts would then simply be a full scan of the table with the new posts.
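A rough sketch of the month-bucket idea with boto3, assuming a "JobPostings" table with partition key "Month" (e.g. "2024-01") and sort key "PostedAt" (an ISO date string); all names here are illustrative, not from the answer:

from datetime import date, timedelta

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("JobPostings")

def postings_since(start):
    # Query every month partition between `start` and today.
    items, today = [], date.today()
    current = date(start.year, start.month, 1)
    while current <= today:
        response = table.query(
            KeyConditionExpression=(
                Key("Month").eq(current.strftime("%Y-%m"))
                & Key("PostedAt").gte(start.isoformat())
            ),
        )
        items.extend(response["Items"])
        # Advance to the first day of the next month.
        current = (current.replace(day=28) + timedelta(days=4)).replace(day=1)
    return items

recent = postings_since(date.today() - timedelta(days=14))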

Is a good idea to store chat messages in a mongodb collection?

I'm developing a chat app with node.js, redis, socket.io and mongodb. MongoDB comes last and is there for persisting the messages.
My question is what would be the best approach for this last step?
I'm afraid a collection with all the messages like
{
id,
from,
to,
datetime,
message
}
can get too big too soon, and is going to get very slow for reading purposes, what do you think?
Is there a better approach you already worked with?
In MongoDB, you store your data in the format you will want to read it in later.
If what you read from the database is a list of messages filtered on the 'to' field and with a dynamic datetime filter, then this schema is the perfect fit.
Don't forget to add an index on the fields you will be querying on; then it will be reasonably fast to query them, even over millions of records.
If you would, for example, always show a full history of a full day, you would store all messages for a single day in one document. If both types of queries occur a lot, you would even store your messages in both formats.
If storage is an issue, you could also use a capped collection, which automatically removes the oldest messages once the collection hits a fixed size, or a TTL index to automatically delete messages older than e.g. 1 year.
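As a small illustration of the indexing advice, here is what it might look like with pymongo; the database and collection names are placeholders, and the TTL index assumes "datetime" is stored as a BSON date:

from pymongo import ASCENDING, DESCENDING, MongoClient

messages = MongoClient()["chatapp"]["messages"]

# Compound index matching the read pattern: equality on "to", range/sort on "datetime".
messages.create_index([("to", ASCENDING), ("datetime", DESCENDING)])

# Optional: a TTL index so MongoDB expires messages after roughly one year.
messages.create_index("datetime", expireAfterSeconds=365 * 24 * 3600)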
I think the db structure is fine, the way you mentioned in your question.
You may assign a unique id to the chat between each pair of users and keep it in each chat record. Retrieve based on that when you want to show it.
Say 12 is the unique id for the chat between A and B; retrieval should then be based on 12 when you want to show the chat for A and B.
So your db structure can be like:
{
id,
from,
to,
datetime,
message,
uid
}
Remember, you can optimize retrieval if you set a limit (say 100 at a time). If the user scrolls beyond 100, retrieve the next 100 chats. This will save a lot of retrieval work.
When using a limit, retrieve based on the date created and use sort with the find query as well.
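A hedged sketch of that paging idea with pymongo, reusing the schema above; the conversation id value and helper name are just illustrative:

from pymongo import DESCENDING, MongoClient

messages = MongoClient()["chatapp"]["messages"]

def load_page(uid, before=None, page_size=100):
    # Newest-first page of one conversation; pass the oldest datetime seen so far
    # as `before` to fetch the next (older) page while the user scrolls.
    query = {"uid": uid}
    if before is not None:
        query["datetime"] = {"$lt": before}
    return list(messages.find(query).sort("datetime", DESCENDING).limit(page_size))

first_page = load_page(12)
older = load_page(12, before=first_page[-1]["datetime"]) if first_page else []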
Just a thought here: are the messages plain text, or are you allowed to share images and videos as well?
If it's the latter, then storing all the chats for a single day in one document might not work out.
Actually, if image and video sharing is allowed then you need to take the 16 MB document size restriction into account as well.

Twitter exercise with MongoDB and lack of transactions?

I was trying to figure out if MongoDB needs transactions and why you wouldn't have everything in a single document. I also know Twitter uses HBase, which does have transactions, so I thought about a tweet and its watchers.
If I post a tweet it will be inserted with no problem. But how would I or anyone else find my tweet? I heard MongoDB has indexes, so maybe I can index the author and find my tweet, however I can't imagine that being efficient if everyone does that. Time also has to be indexed.
So from what I understand (I think I saw some slides Twitter released) Twitter has a 'timeline': every time a person tweets, Twitter inserts the tweet id into everyone's timeline, which is indexed by date, and when a given user browses, it grabs the available tweets sorted by time.
How would that be done in MongoDB? The only solution I can think of is having a column in the tweet document saying {SendOut: DateStamp} which is removed when completed. If it didn't complete on the first attempt (checking the timestamp to guess whether it should be completed by now or not) then I would need to check all the watchers to see who hasn't received it and insert it if they didn't. But also, since there are no transactions, I guess I need to index the SendOut column? Would this solution work? How would I efficiently insert a tweet and give it to everyone watching the user (if this solution would not work)?
It sounds like you're describing a model similar to pub/sub. Couldn't you instead just track, on each user object, the date of the last post that user read? Users would request tweets the same way, using various indexes including time.
I'm not sure what you need transactions for, but Mongo does support atomic operations.
[Updated]
So in other words, each user's object stores the dateTime of the last tweet read/delivered. Obviously you would also need the list of subscribed author IDs. To fetch new tweets you would ask for tweets indexed by both the author_id and time properties and then sort by time.
By using the last read date from the user object and using it as the secondary index into your tweets collection, I don't believe you need either pub/sub or transactions to do it.
I might be missing something though.
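A rough sketch of that fan-out-on-read approach with pymongo; the collection and field names ("following", "last_read", etc.) are assumptions for illustration, not from the answer:

from pymongo import ASCENDING, MongoClient

db = MongoClient()["twitterclone"]
db.tweets.create_index([("author_id", ASCENDING), ("time", ASCENDING)])

def new_tweets_for(user_id):
    # Pull tweets newer than the user's last-read time from the authors they follow.
    user = db.users.find_one({"_id": user_id})
    cursor = db.tweets.find(
        {"author_id": {"$in": user["following"]}, "time": {"$gt": user["last_read"]}}
    ).sort("time", ASCENDING)
    tweets = list(cursor)
    if tweets:
        # Advance the bookmark with a single atomic update; no transaction needed.
        db.users.update_one({"_id": user_id}, {"$set": {"last_read": tweets[-1]["time"]}})
    return tweets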