DB Design for high volume chat messages - chat

I am new to Couchbase, and I would like to understand how to model the storage of billions of chat messages from a typical IM app in Couchbase. What would be the correct way to model this? Assume 10,000 new message inserts per second and 40,000 updates per second on those messages. Assume one-to-one chat as the primary use case, although each person would have many buddies - pretty much like WhatsApp.
Thanks, I appreciate all feedback.
**Update:**
Thanks for your reply, here is my database design:
Sample data store on Couchbase (document store):
document User:
123_user => {"id" : 123, "friend_ids" : [456, 789, ...], "session": "123asdcas123123qsd"}
document History Message (channel_name = userId1 + "-to-" + userId2)
123-to-456_history => {"channel_name" : "123-to-456", "message_ids" : ["545_message", "999_message", ...]}
document Message:
545_message => {"id" : 545, "client_id" : 4143413, "from_uid" : 123, "to_uid" : 456, "body" : "Hello world", "create_time" : 1243124124, "state" : 1}
There is a problem here: when the message_ids field on a History Message document holds millions or billions of message ids, reading and writing the message history becomes a real problem.
Can anyone suggest a solution to this?

First of all, we need to put Couchbase aside. The key problem is how to model this application scenario; then we will know whether Couchbase is your best choice.
A one-to-one chat application can use each pair of chatters as a primary key.
For example, Bob-to-Jack, they chat:
1."hello!";
2."go for rest?";
3."no, i'm busy now.";
...
You will insert a new record with primary key "Bob-Jack", and value "hello; go for rest; no,....".
If the conversation stops, this record stops growing and is stored for future use.
If the two chat again the next day, your application will fetch this record by the key "Bob-Jack", display yesterday's conversation (the value), and update the value by appending the new chat content to the end.
The length of the value grows; if it exceeds some threshold, you split it into two records, since many DB systems have a size limitation for a single record.
One person has many buddies, so there are billions of pairs (keys) in the real world, each pair with a long conversation (value). NoSQL solutions are a good choice for this data volume.
Then you can judge whether Couchbase is capable of this kind of task. I think it is, but it is not the only choice.
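To make the splitting idea above concrete, here is a minimal Python sketch of a bucketing pattern, assuming a generic key-value store with get/upsert operations (a Couchbase collection exposes equivalents); the key format, helper names, and the 100-message threshold are assumptions for illustration, not a prescribed design.

# A rough sketch of "split when the value grows too large". `store` stands in
# for any key-value API with get(key) -> dict-or-None and upsert(key, doc);
# key names and the bucket size are illustrative assumptions.

BUCKET_SIZE = 100  # messages per bucket document

def chat_key(user_a, user_b):
    # Order the pair so "Bob/Jack" and "Jack/Bob" map to the same conversation.
    first, second = sorted([user_a, user_b])
    return f"chat::{first}::{second}"

def append_message(store, user_a, user_b, message):
    base = chat_key(user_a, user_b)
    meta_key = f"{base}::meta"
    meta = store.get(meta_key) or {"current_bucket": 0, "count_in_bucket": 0}

    # Start a new bucket document once the current one is full,
    # so no single document ever holds the whole history.
    if meta["count_in_bucket"] >= BUCKET_SIZE:
        meta["current_bucket"] += 1
        meta["count_in_bucket"] = 0

    bucket_key = f"{base}::{meta['current_bucket']}"
    bucket = store.get(bucket_key) or {"channel": base, "messages": []}
    bucket["messages"].append(message)
    meta["count_in_bucket"] += 1

    store.upsert(bucket_key, bucket)
    store.upsert(meta_key, meta)

def read_recent(store, user_a, user_b, buckets=1):
    # Reading the latest history touches only the newest bucket(s),
    # not a billion-entry id list.
    base = chat_key(user_a, user_b)
    meta = store.get(f"{base}::meta") or {"current_bucket": 0}
    newest = meta["current_bucket"]
    history = []
    for n in range(max(0, newest - buckets + 1), newest + 1):
        doc = store.get(f"{base}::{n}")
        if doc:
            history.extend(doc["messages"])
    return history

The same layout also addresses the message_ids concern from the question update: the history key points at fixed-size buckets instead of one ever-growing array of ids.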

Related

MongoDB Schema Design suggestion

I've used MongoDB for a while, but I've only used it for CRUD operations when somebody else had already done the nitty-gritty work of designing a schema. So this is basically the first time I'm designing a schema, and I need some suggestions.
The data I will collect from users is their regular information, their health-related information and their insurance-related information. A single user will not have multiple health or insurance records, so it is a simple one-to-one relation. But the health and insurance information will have lots of fields. So my question is: is it better to have separate collections for the health and insurance information, as in:
var userSchema = {
    name : String,
    age : Number,
    health_details : [{ type: Schema.Types.ObjectId, ref: 'Health' }], // reference to healthSchema
    insurance_details : [{ type: Schema.Types.ObjectId, ref: 'Insurance' }] // reference to insuranceSchema
}
or to have a single collection with a large number of fields, as in:
var userSchema = {
    name : String,
    age : Number,
    disease_name : String, // and many other fields related to health
    insurance_company_name : String // and many other fields related to insurance
}
Generally, some of the factors you can consider while modeling 1-to-1, 1-to-many and many-to-many data in NoSQL are:
1. Data duplication
Do you expect data to duplicate? And not in a one-word way like the hobby "gardening", which many users can share and which probably doesn't need a "hobbies" collection, but something like authors and books. That case guarantees duplication.
An author can write many books. You should not embed the author even in two books; it's hard to maintain when the author's info changes. Use 1-to-many, and the reference can go in either of the two documents: as "has many" (an array of bookIds in the author) or "belongs to" (an authorId in each book).
In the case of health and insurance, since data duplication is not expected, a single document is the better choice.
2. Read/write preference
What is the expected frequency of reads and writes of the data (not the collection)? For example, if you query a user, his health record and his insurance record much more frequently than you update them (and if points 1 and 3 are not much of a problem), then this data should preferably be contained in and queried from a single document instead of three different sources.
Also, a single document is what MongoDB guarantees atomicity for, which is an added benefit if you want to update user, health and insurance all at the same time (say in one API call).
3. Size of the document
Consider this: many users can like a post and a user can like many posts (many-to-many). And since you need to ensure no user likes a post twice, the user ids must be stored somewhere. Three options are available:
keep user ids array in post document
keep post ids array in user document
create another document that contains the ids of both (solution for many-to-many only, similar to SQL)
If a post is liked by more than a million users, the post document will overflow with user references. Similarly, a user can like thousands of posts in a short period, so the second option is not feasible either. That leaves the third option, which is the best fit for this case, as sketched below.
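To make that third option concrete, here is a hedged pymongo sketch of a separate "likes" collection with a compound unique index; the connection string, collection and field names are assumptions for the example.

from pymongo import MongoClient, ASCENDING
from pymongo.errors import DuplicateKeyError

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["app"]                                 # assumed database name

# One small document per like; a compound unique index enforces
# "a given user can like a given post only once".
db.likes.create_index([("post_id", ASCENDING), ("user_id", ASCENDING)], unique=True)

def like_post(post_id, user_id):
    try:
        db.likes.insert_one({"post_id": post_id, "user_id": user_id})
        return True
    except DuplicateKeyError:
        return False  # this user already liked this post

def like_count(post_id):
    # Neither operation ever grows the post or user documents themselves.
    return db.likes.count_documents({"post_id": post_id})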
But a post can have many comments, and a comment belongs to only one post (1-to-many). You hardly expect more than a few hundred comments, rarely a thousand. Therefore, keeping an array of commentIds (or the embedded comments themselves) in the post is a practical solution.
In your case, I don't believe a document that does not keep a huge list of references can grow anywhere near 16 MB (the MongoDB document size limit). You can therefore safely store the health and insurance data in the user document, but they should have keys of their own, like:
var userSchema = {
    name : String,
    age : Number,
    health : {
        disease_name : String,
        // more health information
    },
    insurance : {
        company_name : String,
        // further insurance data
    }
}
That's how you should think about designing your schema, in my opinion. I would recommend reading these very helpful guides by Couchbase on data modeling: document design considerations, modeling documents for retrieval, and modeling relationships. Although written for Couchbase, the rules are equally applicable to MongoDB schema design, since both are NoSQL, document-oriented databases.

Efficiency of indexing a frequently updated collection in MongoDB

I am a newbie with MongoDB, so my questions might be trivial ... I want to allow my users to upload their address book. The document has the following structure:
{
    "_id" : "56f29ecc2a00001800dbdf54",
    "contacts" : [
        {
            "name" : "John",
            "phoneNumber" : [
                "+18144040000"
            ]
        },
        {
            "name" : "Andrew",
            "phoneNumber" : [
                "+14129123456"
            ]
        }
    ]
}
I would like to run a search by phone number in order to find users with mutual contacts, i.e.
{"contacts.phoneNumber":"+14129123456"}
My question is: will it be efficient to add this index:
db.addresses.createIndex( { "contacts.phoneNumber": 1 }, { unique: false, background: true } )
considering that users will frequently update their address book from their phones, which either overwrites the current data or inserts new documents? Each upload is a single document with an array of contacts, each holding an array of phone numbers, and each upload/update will contain hundreds or thousands of records.
Your index makes sense. Regarding efficiency, there is a trade-off between reads and writes. Typically, the user interface is expected to respond quickly to any search (i.e. a read), so creating an index on the searched field is inevitable. On that basis, indexing the phone number is fine, considering that the use case requires a search or query on the phone number directly.
Indexing the documents will degrade write performance; however, this particular index shouldn't degrade it drastically. Having said that, if uploads take more time, you may want to reconsider the UI design and show a progress bar for the upload, which is a typical UI pattern for large uploads.
Also, you can look into the write concern options available in MongoDB. You can configure whether or not you expect an acknowledgement from the driver.
If you go with a write concern without acknowledgement, you get better write performance. However, most applications expect acknowledgement on writes to ensure that the write was successful.
https://docs.mongodb.com/manual/reference/write-concern/
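For illustration, a small pymongo sketch of that trade-off; the collection name and the choice of w=0 versus the acknowledged default are assumptions for the example, not a recommendation.

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["app"]

# Default handle: writes are acknowledged (w=1), safer but slightly slower.
acknowledged = db.get_collection("addresses")

# Same collection with w=0: the driver does not wait for acknowledgement,
# so a large address-book upload returns faster, but failures go unnoticed.
fire_and_forget = db.get_collection("addresses", write_concern=WriteConcern(w=0))

address_book = {
    "_id": "56f29ecc2a00001800dbdf54",
    "contacts": [{"name": "John", "phoneNumber": ["+18144040000"]}],
}

fire_and_forget.replace_one({"_id": address_book["_id"]}, address_book, upsert=True)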

API pagination best practices

I'd love some help handling a strange edge case with a paginated API I'm building.
Like many APIs, this one paginates large results. If you query /foos, you'll get 100 results (i.e. foo #1-100), and a link to /foos?page=2 which should return foo #101-200.
Unfortunately, if foo #10 is deleted from the data set before the API consumer makes the next query, /foos?page=2 will offset by 100 and return foos #102-201.
This is a problem for API consumers who are trying to pull all foos - they will not receive foo #101.
What's the best practice to handle this? We'd like to make it as lightweight as possible (i.e. avoiding handling sessions for API requests). Examples from other APIs would be greatly appreciated!
I'm not completely sure how your data is handled, so this may or may not work, but have you considered paginating with a timestamp field?
When you query /foos you get 100 results. Your API should then return something like this (assuming JSON, but if it needs XML the same principles can be followed):
{
    "data" : [
        { data item 1 with all relevant fields },
        { data item 2 },
        ...
        { data item 100 }
    ],
    "paging": {
        "previous": "http://api.example.com/foo?since=TIMESTAMP1",
        "next": "http://api.example.com/foo?since=TIMESTAMP2"
    }
}
Just a note: using only one timestamp relies on an implicit 'limit' in your results. You may want to add an explicit limit or also use an until property.
The timestamp can be dynamically determined using the last data item in the list. This seems to be more or less how Facebook paginates in its Graph API (scroll down to the bottom to see the pagination links in the format I gave above).
One problem may be if you add a data item, but based on your description it sounds like they would be added to the end (if not, let me know and I'll see if I can improve on this).
If you've got pagination, you also sort the data by some key. Why not let API clients include the key of the last element of the previously returned collection in the URL, and add a WHERE clause to your SQL query (or something equivalent, if you're not using SQL) so that it returns only those elements whose key is greater than this value? A sketch of this idea follows below.
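A minimal sketch of that idea in Python with sqlite3; the foos table, its columns, and the "after" parameter name are illustrative assumptions.

import sqlite3

PAGE_SIZE = 100

def get_page(conn, after_id=0):
    # Keyset pagination: rather than OFFSET-ing past rows (which shifts when
    # rows are deleted), continue strictly after the last key the client saw.
    rows = conn.execute(
        "SELECT id, name FROM foos WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, PAGE_SIZE),
    ).fetchall()
    next_after = rows[-1][0] if rows else None  # key to embed in the "next" link
    return rows, next_after

# Usage sketch: the client calls /foos, then /foos?after=<next_after>, and so on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foos (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO foos (name) VALUES (?)", [("foo",)] * 250)

rows, cursor = get_page(conn)                   # foos 1-100
rows, cursor = get_page(conn, after_id=cursor)  # foos 101-200, even if some earlier rows were deleted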
You have several problems.
First, you have the example that you cited.
You also have a similar problem if rows are inserted, but in this case the user gets duplicate data (arguably easier to manage than missing data, but still an issue).
If you are not snapshotting the original data set, then this is just a fact of life.
You can have the user make an explicit snapshot:
POST /createquery
filter.firstName=Bob&filter.lastName=Eubanks
Which results:
HTTP/1.1 201 Created
Location: http://www.example.org/query/12345
Then you can page through that all day long, since it's now static. This can be reasonably lightweight, since you can just capture the actual document keys rather than the entire rows.
If the use case is simply that your users want (and need) all of the data, then you can simply give it to them:
GET /query/12345?all=true
and just send the whole kit.
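As a rough sketch of that snapshot idea (under assumed names: db.find, db.get, and an in-memory snapshots dict stand in for real storage), the server can capture just the matching keys when the query is created and page over that frozen list afterwards:

import uuid

snapshots = {}   # stand-in for wherever created queries are persisted (assumption)
PAGE_SIZE = 100

def create_query(db, filters):
    # POST /createquery: freeze only the matching keys, not the whole rows.
    keys = [row["id"] for row in db.find(filters)]  # db.find is an assumed helper
    query_id = uuid.uuid4().hex
    snapshots[query_id] = keys
    return query_id  # would be returned to the client in the Location header

def get_query_page(db, query_id, page):
    # GET /query/<id>?page=N: the key list never changes, so every page is stable.
    keys = snapshots[query_id]
    chunk = keys[(page - 1) * PAGE_SIZE : page * PAGE_SIZE]
    return [db.get(key) for key in chunk]  # rows re-fetched by key; deleted ones come back empty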
There may be two approaches, depending on your server-side logic.
Approach 1: when the server is not smart enough to handle object states.
You could send all the cached records' unique ids to the server, for example ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10"], plus a boolean parameter indicating whether you are requesting new records (pull to refresh) or old records (load more).
Your server should be responsible for returning the new records (more records, or new records via pull to refresh) as well as the ids of any records deleted from ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10"].
Example:
If you are requesting load more, then your request should look something like this:
{
"isRefresh" : false,
"cached" : ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10"]
}
Now suppose you are requesting old records (load more), and suppose the "id2" record has been updated by someone and the "id5" and "id8" records have been deleted from the server; then your server response should look something like this:
{
"records" : [
{"id" :"id2","more_key":"updated_value"},
{"id" :"id11","more_key":"more_value"},
{"id" :"id12","more_key":"more_value"},
{"id" :"id13","more_key":"more_value"},
{"id" :"id14","more_key":"more_value"},
{"id" :"id15","more_key":"more_value"},
{"id" :"id16","more_key":"more_value"},
{"id" :"id17","more_key":"more_value"},
{"id" :"id18","more_key":"more_value"},
{"id" :"id19","more_key":"more_value"},
{"id" :"id20","more_key":"more_value"}],
"deleted" : ["id5","id8"]
}
But in this case, if you have a lot of locally cached records, say 500, then your request string becomes too long, like this:
{
"isRefresh" : false,
"cached" : ["id1","id2","id3","id4","id5","id6","id7","id8","id9","id10",………,"id500"]//Too long request
}
Approach 2: when the server is smart enough to handle object states by date.
You could send the id of the first record, the id of the last record, and the epoch time of the previous request. This way your request is always small, even if you have a large number of cached records.
Example:
If you are requesting load more, then your request should look something like this:
{
"isRefresh" : false,
"firstId" : "id1",
"lastId" : "id10",
"last_request_time" : 1421748005
}
Your server is responsible for returning the ids of records deleted after last_request_time, as well as the records between "id1" and "id10" that were updated after last_request_time.
{
"records" : [
{"id" :"id2","more_key":"updated_value"},
{"id" :"id11","more_key":"more_value"},
{"id" :"id12","more_key":"more_value"},
{"id" :"id13","more_key":"more_value"},
{"id" :"id14","more_key":"more_value"},
{"id" :"id15","more_key":"more_value"},
{"id" :"id16","more_key":"more_value"},
{"id" :"id17","more_key":"more_value"},
{"id" :"id18","more_key":"more_value"},
{"id" :"id19","more_key":"more_value"},
{"id" :"id20","more_key":"more_value"}],
"deleted" : ["id5","id8"]
}
It may be tough to find best practices, since most systems with APIs don't accommodate this scenario; it is an extreme edge case, or they don't typically delete records (Facebook, Twitter). Facebook actually says each "page" may not have the number of results requested, due to filtering done after pagination.
https://developers.facebook.com/blog/post/478/
If you really need to accommodate this edge case, you need to "remember" where you left off. jandjorgensen's suggestion is just about spot on, but I would use a field guaranteed to be unique, like the primary key. You may need to use more than one field.
Following Facebook's flow, you can (and should) cache the pages already requested, and if a client requests a page it has already requested, just return it with the deleted rows filtered out.
Option A: Keyset Pagination with a Timestamp
In order to avoid the drawbacks of offset pagination that you have mentioned, you can use keyset-based pagination. Usually, the entities have a timestamp that states their creation or modification time. This timestamp can be used for pagination: just pass the timestamp of the last element as the query parameter for the next request. The server, in turn, uses the timestamp as a filter criterion (e.g. WHERE modificationDate >= receivedTimestampParameter).
{
    "elements": [
        {"data": "data", "modificationDate": 1512757070},
        {"data": "data", "modificationDate": 1512757071},
        {"data": "data", "modificationDate": 1512757072}
    ],
    "pagination": {
        "lastModificationDate": 1512757072,
        "nextPage": "https://domain.de/api/elements?modifiedSince=1512757072"
    }
}
This way, you won't miss any element. This approach should be good enough for many use cases. However, keep the following in mind:
You may run into endless loops when all elements of a single page have the same timestamp.
You may deliver many elements multiple times to the client when elements with the same timestamp are overlapping two pages.
You can make those drawbacks less likely by increasing the page size and using timestamps with millisecond precision.
Option B: Extended Keyset Pagination with a Continuation Token
To handle the mentioned drawbacks of the normal keyset pagination, you can add an offset to the timestamp and use a so-called "Continuation Token" or "Cursor". The offset is the position of the element relative to the first element with the same timestamp. Usually, the token has a format like Timestamp_Offset. It's passed to the client in the response and can be submitted back to the server in order to retrieve the next page.
{
    "elements": [
        {"data": "data", "modificationDate": 1512757070},
        {"data": "data", "modificationDate": 1512757072},
        {"data": "data", "modificationDate": 1512757072}
    ],
    "pagination": {
        "continuationToken": "1512757072_2",
        "nextPage": "https://domain.de/api/elements?continuationToken=1512757072_2"
    }
}
The token "1512757072_2" points to the last element of the page and states "the client already got the second element with the timestamp 1512757072". This way, the server knows where to continue.
Please mind that you have to handle cases where the elements got changed between two requests. This is usually done by adding a checksum to the token. This checksum is calculated over the IDs of all elements with this timestamp. So we end up with a token format like this: Timestamp_Offset_Checksum.
For more information about this approach check out the blog post "Web API Pagination with Continuation Tokens". A drawback of this approach is the tricky implementation as there are many corner cases that have to be taken into account. That's why libraries like continuation-token can be handy (if you are using Java/a JVM language). Disclaimer: I'm the author of the post and a co-author of the library.
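A rough Python sketch of such a Timestamp_Offset token (the checksum refinement is omitted for brevity, and the function and variable names are assumptions, not the library's API):

PAGE_SIZE = 3  # small page size just to mirror the example above

def parse_token(token):
    # "1512757072_2" -> (1512757072, 2): timestamp of the last delivered element,
    # and how many elements with that exact timestamp were already delivered.
    ts, offset = token.split("_")
    return int(ts), int(offset)

def next_page(elements, token=None):
    # `elements` is assumed to be sorted by modificationDate.
    if token is None:
        candidates = elements
    else:
        ts, offset = parse_token(token)
        # Keep the not-yet-delivered items that share the boundary timestamp,
        # plus everything strictly newer.
        same_ts = [e for e in elements if e["modificationDate"] == ts][offset:]
        newer = [e for e in elements if e["modificationDate"] > ts]
        candidates = same_ts + newer

    page = candidates[:PAGE_SIZE]
    if not page:
        return page, None

    last_ts = page[-1]["modificationDate"]
    new_offset = sum(1 for e in page if e["modificationDate"] == last_ts)
    if token is not None and parse_token(token)[0] == last_ts:
        new_offset += parse_token(token)[1]  # keep counting within the same timestamp
    return page, f"{last_ts}_{new_offset}"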
Pagination is generally a "user" operation, and to prevent overload both on computers and on the human brain you generally return a subset. However, rather than thinking that we don't get the whole list, it may be better to ask: does it matter?
If an accurate live scrolling view is needed, REST APIs, which are request/response in nature, are not well suited for this purpose. For this you should consider WebSockets or HTML5 Server-Sent Events to let your front end know about changes.
Now, if there is a need to get a snapshot of the data, I would just provide an API call that returns all the data in one request with no pagination. Mind you, you would need something that streams the output without temporarily loading it all in memory if you have a large data set.
For my case, I implicitly designate some API calls to allow getting the whole information (primarily reference table data). You can also secure these APIs so they won't harm your system.
Just to add to this answer by Kamilk: https://www.stackoverflow.com/a/13905589
It depends a lot on how large a dataset you are working with. Small data sets work effectively with offset pagination, but large real-time datasets do require cursor pagination.
I found a wonderful article on how Slack evolved its API's pagination as its datasets increased, explaining the positives and negatives at every stage: https://slack.engineering/evolving-api-pagination-at-slack-1c1f644f8e12
I think your API is currently responding the way it should: the first 100 records of the page, in the overall order of the objects you are maintaining. Your explanation suggests that you are using some kind of ordering ids to define the order of your objects for pagination.
Now, if you want page 2 to always start at 101 and end at 200, then you must make the number of entries on the page variable, since entries are subject to deletion.
You should do something like the below pseudocode:
page_max = 100
def get_page_results(page_no):
    start = (page_no - 1) * page_max + 1
    end = page_no * page_max
    return fetch_results_by_id_between(start, end)
Another option for pagination in RESTful APIs is to use the Link header introduced here. For example, GitHub uses it as follows:
Link: <https://api.github.com/user/repos?page=3&per_page=100>; rel="next",
<https://api.github.com/user/repos?page=50&per_page=100>; rel="last"
The possible values for rel are: first, last, next, previous. But by using the Link header, it may not be possible to specify total_count (the total number of elements).
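For illustration, a small Python sketch that assembles such a Link header for a given page; the function and parameter names are assumptions:

def build_link_header(base_url, page, per_page, last_page):
    # Builds a Link header in the style GitHub uses above.
    links = []
    if page < last_page:
        links.append(f'<{base_url}?page={page + 1}&per_page={per_page}>; rel="next"')
        links.append(f'<{base_url}?page={last_page}&per_page={per_page}>; rel="last"')
    if page > 1:
        links.append(f'<{base_url}?page=1&per_page={per_page}>; rel="first"')
        links.append(f'<{base_url}?page={page - 1}&per_page={per_page}>; rel="previous"')
    return ", ".join(links)

# e.g. response.headers["Link"] = build_link_header(
#     "https://api.github.com/user/repos", page=2, per_page=100, last_page=50)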
I've thought long and hard about this and finally ended up with the solution I'll describe below. It's a pretty big step up in complexity but if you do make this step, you'll end up with what you are really after, which is deterministic results for future requests.
Your example of an item being deleted is only the tip of the iceberg. What if you are filtering by color=blue but someone changes item colors in between requests? Fetching all items in a paged manner reliably is impossible... unless... we implement revision history.
I've implemented it and it's actually less difficult than I expected. Here's what I did:
I created a single table changelogs with an auto-increment ID column
My entities have an id field, but this is not the primary key
The entities have a changeId field which is both the primary key as well as a foreign key to changelogs.
Whenever a user creates, updates or deletes a record, the system inserts a new record in changelogs, grabs the id and assigns it to a new version of the entity, which it then inserts in the DB
My queries select the maximum changeId (grouped by id) and self-join that to get the most recent versions of all records.
Filters are applied to the most recent records
A state field keeps track of whether an item is deleted
The max changeId is returned to the client and added as a query parameter in subsequent requests
Because only new changes are created, every single changeId represents a unique snapshot of the underlying data at the moment the change was created.
This means that you can cache the results of requests that have the parameter changeId in them forever. The results will never expire because they will never change.
This also opens up exciting features such as rollback / revert, syncing a client cache, etc.; any feature that benefits from change history.
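A hedged sqlite3 sketch of the self-join described above; the table and column names follow the description but are assumptions:

import sqlite3

conn = sqlite3.connect("example.db")  # assumed database file

# Latest version of every record as of a given changeId.
LATEST_VERSIONS = """
SELECT e.*
FROM entities e
JOIN (
    SELECT id, MAX(changeId) AS maxChange
    FROM entities
    WHERE changeId <= :as_of
    GROUP BY id
) latest ON e.id = latest.id AND e.changeId = latest.maxChange
WHERE e.state != 'deleted'
ORDER BY e.id
LIMIT :page_size OFFSET :skip
"""

def get_page(as_of_change_id, page, page_size=100):
    # Because rows for a given changeId never change, the same as_of_change_id
    # always yields the same pages, so responses can be cached forever.
    return conn.execute(
        LATEST_VERSIONS,
        {"as_of": as_of_change_id, "page_size": page_size,
         "skip": (page - 1) * page_size},
    ).fetchall()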
Refer to API Pagination Design; we could design a pagination API using a cursor.
They have this concept called a cursor: it's a pointer to a row. So you can say to a database "return me 100 rows after that one". And that's much easier for a database to do, since there is a good chance you'll identify the row by a field with an index. Suddenly you don't need to fetch and skip those rows; you go directly past them.
An example:
GET /api/products
{"items": [...100 products],
"cursor": "qWe"}
The API returns an (opaque) string, which you can then use to retrieve the next page:
GET /api/products?cursor=qWe
{"items": [...100 products],
"cursor": "qWr"}
Implementation-wise there are many options. Generally, you have some ordering criterion, for example the product id. In this case, you'll encode your product id with some reversible algorithm (let's say hashids). On receiving a request with the cursor, you decode it and generate a query like WHERE id > :cursor LIMIT 100; a sketch follows below.
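A minimal Python sketch of that flow, with base64 standing in for a reversible encoding such as hashids; the sqlite3 backend and the products table are assumptions:

import base64
import sqlite3

PAGE_SIZE = 100

def encode_cursor(product_id):
    # Opaque, reversible token; hashids or any similar scheme works the same way.
    return base64.urlsafe_b64encode(str(product_id).encode()).decode()

def decode_cursor(cursor):
    return int(base64.urlsafe_b64decode(cursor.encode()).decode())

def get_products(conn, cursor=None):
    last_id = decode_cursor(cursor) if cursor else 0
    rows = conn.execute(
        "SELECT id, name FROM products WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, PAGE_SIZE),
    ).fetchall()
    next_cursor = encode_cursor(rows[-1][0]) if rows else None
    return {"items": rows, "cursor": next_cursor}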
Advantage:
The query performance of the db can be improved through the cursor
It handles well the case where new content is inserted into the db while querying
Disadvantage:
It's impossible to generate a previous-page link with a stateless API

What is an efficient way to structure multiple messages in MongoDB

I am writing an application using MongoDB in which users can send messages to each other. Here are the fields I want to store:
user_to
user_from
message
sent
ip
unread
What I am asking is: if user A sends a message to user B, for example:
user_to : B,
user_from : A,
message :hello world,
sent:1334901545,
ip : XX.XX.XX.X,
unread : true
This is the format in which the data is stored the first time, but when user A sends another message to user B, how should I store that data?
I have thought of two ways; please tell me which is the most efficient:
user_to : B,
user_from : A,
message :hello world again,
sent:1334901745,
ip : XX.XX.XX.X,
unread : true
or
user_to : B,
user_from : A,
messages : [
    { message : "hello world", sent : 1334901545, ip : "XXX.X.XX.XX", unread : true },
    { message : "hello world again", sent : 1334901745, ip : "XX.XX.XX.X", unread : true }
]
The first format is very simple in structure, better for querying, and very easy to analyse, but I think it increases duplicate data and will consume more disk space as the application grows.
The second format eliminates duplicate records and uses less disk space, but it complicates data retrieval: it is harder to read, cannot be written to as efficiently, and has many other complications.
I just want to know which is the right way. Considering that MongoDB is highly scalable, should I go for simplicity, or eliminate duplicate records and save disk space?
Separate records will be better for query speed; I would go with that. Also, in your example there is not much redundant data; if there were, you could always factor it out into a new collection. A sketch of the separate-record approach is below.
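As a hedged illustration of the separate-record option, a minimal pymongo sketch; the connection string, database name, index, and query shape are assumptions:

import time
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["chat"]                                # assumed database name

# One document per message, indexed so "conversation between two users,
# newest first" is cheap to query.
db.messages.create_index(
    [("user_from", ASCENDING), ("user_to", ASCENDING), ("sent", DESCENDING)]
)

def send_message(user_from, user_to, message, ip):
    db.messages.insert_one({
        "user_to": user_to,
        "user_from": user_from,
        "message": message,
        "sent": int(time.time()),
        "ip": ip,
        "unread": True,
    })

def conversation(user_a, user_b, limit=50):
    # Messages in either direction between the two users, newest first.
    return list(db.messages.find({
        "$or": [
            {"user_from": user_a, "user_to": user_b},
            {"user_from": user_b, "user_to": user_a},
        ]
    }).sort("sent", DESCENDING).limit(limit))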

Can one make a relational database using MongoDB?

I am going to make a student management system using MongoDB. I will have one table for students and another for attendance records. Can I have a key in the attendance table to reach the students table, as pictured below? How?
The idea behind MongoDB is to eliminate (or at least minimize) relational data. Have you considered just embedding the attendance data directly into each student record? This is actually the preferred design pattern for MongoDB and can result in much better performance and scalability.
If you truly need highly relational and normalized data, you might want to reconsider using MongoDB.
The answer depends on how you intend to use the data. You really have two options: embed the attendance data, or link to it. Both approaches are detailed here: http://www.mongodb.org/display/DOCS/Schema+Design
For the common use case, you would probably embed this particular collection, so each student record would have an embedded "attendance" array. This works because attendance records are unlikely to be shared between students, and retrieving the attendance data is likely to require the student information as well. Retrieving the attendance data would be as simple as:
db.student.find( { login : "sean" } )
{
    login : "sean",
    first : "Sean",
    last : "Hodges",
    attendance : [
        { class : "Maths", when : Date("2011-09-19T04:00:10.112Z") },
        { class : "Science", when : Date("2011-09-20T14:36:06.958Z") }
    ]
}
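To show how such an embedded array is maintained and queried, a small pymongo sketch using $push (the database name and date value are assumptions; the collection and fields follow the example above):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["school"]                              # assumed database name

# Append a new attendance entry to Sean's embedded array.
db.student.update_one(
    {"login": "sean"},
    {"$push": {"attendance": {"class": "Maths", "when": "2011-09-21T09:00:00Z"}}},
)

# Querying inside the embedded array still works, e.g. every student
# who has attended a Maths class:
maths_students = list(db.student.find({"attendance.class": "Maths"}))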
Yes. There are no hard and fast rules; you have to weigh the pros and cons of embedding versus referencing data. This video will definitely help (https://www.youtube.com/watch?v=-o_VGpJP-Q0&t=21s). In your example, the phone number attribute should stay in the same document (rather than a separate table), because a person's phone number rarely changes.