Cloudant partition design recommendations - ibm-cloud

I am migrating a Cloudant database without partitions to Cloudant's new partitioned-database feature, to reduce the cost in my IBM Cloud account. The context can be summarized like so:
I am dealing with email objects which have a category name.
I might receive more than 100 new entries (emails) per day.
The UI can query the emails from date A to date B and also on categories C1, C2, ... C100, in any possible combination of categories.
The UI displays only 15 emails per page.
The question is about how to partition such a data model while avoiding, as much as possible, global (cross-partition) queries, which are far more costly than partition-based queries.
My first thought was to partition per day, but then I could end up in a situation where a query filters emails on a specific category Cn over 4 months, while that category receives only 1 email per day. That means that to display one UI page (of 15 emails) I would have to run 15 partition queries, which is not acceptable.
Before partitioning arrived, I was just doing global queries with the Lucene query engine, but that is no longer viable because of the cost.
I also considered putting all the emails in a single partition, so I could keep using the same old query within that partition; since it is a partition query, I would pay the partition-query price rather than the global-query price. That works in theory, but probably has limits, since the documentation about partitions recommends not putting too much data in a single partition.
Do you have any recommendations for this kind of problem?
Thanks.

Given your design, it doesn't seem to me like there is a partition key that will allow you to avoid global queries completely. As a rule of thumb, pick a partition key that would allow you to retrieve all data that make up a logical grouping. For example, imagine an order system where you have a set of customers with associated orders -- the obvious partition key would be a unique customer id: you then have a logical grouping of all data associated with each customer.
Over on the Cloudant blog, there is a good article series on partitions:
https://blog.cloudant.com/2019/03/05/Partition-Databases-Data-Design.html
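For completeness: in a Cloudant partitioned database the partition key is not a separate field but the prefix of the document `_id`, in the form `partitionkey:docid`. A minimal sketch of building and parsing such IDs for the customer/order grouping described above (the `customer-42` and `order-00017` names are made up for illustration):

```python
def partitioned_id(partition_key: str, doc_key: str) -> str:
    """Build a Cloudant partitioned-database _id of the form 'partition:docid'.

    The first colon separates the two halves, so the partition key
    itself must not contain a colon.
    """
    if ":" in partition_key:
        raise ValueError("partition key must not contain ':'")
    return f"{partition_key}:{doc_key}"

def partition_of(doc_id: str) -> str:
    """Recover the partition key from a partitioned _id."""
    return doc_id.split(":", 1)[0]

# All of one customer's orders share a partition, so a single
# partition query can fetch the whole logical grouping.
order_id = partitioned_id("customer-42", "order-00017")
```

Because every document in the grouping shares the same prefix, a query scoped to that partition (`_partition/customer-42/_find`, etc.) touches only that partition's data.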

Related

The point of Cosmos DB value uniqueness only per shard key (partition key)

Microsoft's documentation of Managing indexing in Azure Cosmos DB's API for MongoDB states that:
Azure Cosmos DB's API for MongoDB server version 3.6 automatically
indexes the _id field, which can't be dropped. It automatically
enforces the uniqueness of the _id field per shard key.
I'm confused about the "per shard key" part of the reasoning. I read it as "your unique field won't be globally unique at all", because if I understand correctly: if I have a Guid field _id as unique and a userId field as the partition key, then I can have 2 elements with the same ID, provided they happen to belong to 2 different users.
Is it that I fail to pick the right partition key? Because in my understanding partition key should be the field that is the most frequently used for filtering the data. But what if I need to select the data from the database only by having the ID field value? Or query the data for all users?
Is it the inherent limits in distributed systems that I need to accept and therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query your data from this collection not only by _id field but first by userId field? And not treat my _id field alone as an identifier but rather see an identifier as a compound of userId and _id?
TL;DR
Is it the inherent limits in distributed systems that I need to accept and therefore remodel my process of designing a database and programming the access to it? Which in this case would be: ALWAYS query your data from this collection not only by _id field but first by userId field? And not treat my _id field alone as an identifier but rather see an identifier as a compound of userId and _id?
Yes. Mostly.
Longer version
While the id field not being globally unique is unintuitive at first sight, it actually makes sense, considering CosmosDB seeks unlimited scale for pinpoint GET/PUT operations. This requires the partitions to act independently, and this is where a lot of the magic comes from. If uniqueness of id (or any other unique constraint) were enforced globally, then every document change would have to coordinate with all other partitions, and that would no longer be optimal or predictable at endless scale.
I also think this design decision of separating data is in alignment with the schemaless, distributed mindset of CosmosDB. If you use CosmosDB, then embrace this and avoid trying to force cross-document relational constraints onto it. Manage them in the data/API design and client logic layer instead. For example, by using a guid for id.
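The "unique per shard key" behaviour can be modelled with a toy in-memory store (the `userId` naming follows the question above; this is an illustration of the semantics, not Cosmos DB's actual implementation): each partition enforces `_id` uniqueness independently, so the same `_id` can exist under two different users.

```python
class PartitionedStore:
    """Toy model of per-partition uniqueness: _id is unique only within
    one shard-key (here: userId) partition, so the full identity of a
    document is the pair (userId, _id)."""

    def __init__(self):
        self._partitions = {}  # userId -> {_id: document}

    def put(self, user_id, doc_id, doc):
        partition = self._partitions.setdefault(user_id, {})
        if doc_id in partition:
            # Duplicate check never leaves this partition: no
            # cross-partition coordination is needed.
            raise KeyError(f"duplicate _id {doc_id!r} for user {user_id!r}")
        partition[doc_id] = doc

store = PartitionedStore()
store.put("alice", "g-123", {"total": 10})
store.put("bob", "g-123", {"total": 20})  # same _id, different partition: allowed
```

Re-inserting `"g-123"` for `"alice"` would raise, but `"bob"` is free to use the same id, which is exactly the behaviour the documentation describes.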
About partition key..
Is it that I fail to pick the right partition key? [...] partition key should be the field that is the most frequently used for filtering the data.
It depends ;). You also have to think about worst-case query performance, not only the "most frequently" used queries. Make sure MOST queries can go directly to the correct partition, meaning you MUST know the exact target partition key before making those queries, even for "get by id" queries. Measure the cost of the remaining cross-partition queries on a realistic data set.
It is difficult to say whether userId is a good key or not. It most likely is known in advance and could be included to get-by-id queries, so it's good in that sense. But you should also consider:
hot partition - all queries for a single user would go to a single partition; no scale-out there.
partition size - a single user's data most likely grows-and-grows-and-grows. Partitions have a max size limit, and working within those swelling partitions will become costlier over time.
So, if possible, I would define smaller partitions to distribute the load further. Maybe consider using a composite partition key or similar tactics to split a user partition into multiple smaller ones. Or go to the extreme of making id itself the partition key, which is good for writes and get-by-id but less optimal for everything else.
.. just always make sure to have the chosen partition key at hand.
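The "split a user partition into smaller ones" tactic can be sketched as a synthetic partition key: userId plus a hash bucket derived from the document id. The bucket count and key format here are assumptions for illustration; the point is that the key stays derivable at query time as long as you know both userId and id.

```python
import hashlib

def bucketed_partition_key(user_id: str, doc_id: str, buckets: int = 16) -> str:
    """Spread one user's data across `buckets` sub-partitions.

    The bucket is derived deterministically from the document id, so a
    get-by-id query (which knows both userId and id) can recompute the
    exact partition key and still hit a single partition.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    return f"{user_id}-{int(digest, 16) % buckets}"
```

A query for *all* of one user's documents now fans out over at most 16 partitions instead of one hot one, which is the trade-off being made.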

How do you scale postgres to billions of rows for this schema?

Consider this scenario.
You're a link shortening service, and you have two tables:
Links
Clicks - predominantly append only, but will need a full scan to produce aggregates, which should be (but probably won't be) quick.
Links is millions of rows, Clicks is billions of rows.
Should you split these onto separate hardware? What's the right approach to getting the most out of postgres for this sort of problem?
With partitioning, it should be scalable enough. Partition links on a hash of the shortened link (the key used for retrieval). Depending on your aggregation and reporting needs, you might partition clicks by date (maybe one partition per day?). When you create a new partition, the old one can be summed and moved to history (or removed, if the summed data is enough for your needs).
In addition to partitioning, I suggest pre-aggregating the data. If you never need the individual data, but only aggregates per day, then perform the aggregation and materialize it in another table after each day is over. That will reduce the amount considerably and make the data manageable.
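The daily roll-up can be sketched in Python (in practice this would likely be an `INSERT ... SELECT` into a summary table after the day's partition closes; the `(short_link, date)` event shape here is an assumption):

```python
from collections import Counter
from datetime import date

def daily_click_counts(clicks):
    """Roll raw click events up into per-(link, day) counts - the kind
    of pre-aggregate you would materialize once a day is over.

    `clicks` is an iterable of (short_link, clicked_at: date) pairs.
    """
    counts = Counter()
    for short_link, clicked_at in clicks:
        counts[(short_link, clicked_at)] += 1
    return counts

clicks = [
    ("abc", date(2024, 1, 1)),
    ("abc", date(2024, 1, 1)),
    ("xyz", date(2024, 1, 1)),
]
summary = daily_click_counts(clicks)
```

Billions of raw rows collapse into at most (links x days) summary rows, which is what makes the reporting side manageable.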

Distributing big data storage for non-relational data

The problem consists of a lot (approx. 500 million per day) of non-relational messages of relatively small size (approx. 1 KB). The messages are written once and never modified again. The messages have various structures, though there are patterns the messages must fit. This data must then be searchable. The search may be done on any field of a message; the only always-present field is the date, so the search will be done for a specific day.
The approach I have come up with so far is to use MongoDB. Each day I create a few collections (approx. 2000) and distribute messages during the day to those collections according to the pattern. I find the patterns important because of indexing, since the number of indexes per collection is limited to 64.
This strategy results in 500 GB of data + 150 GB of indexes = 650 GB per day. Of course, the question here is how to distribute that data. The obvious solution is to use Mongo sharding and spread the collections over the shards. However, I have not found any scenario close to my problem described in the Mongo manuals. Moreover, I am not even sure whether I can dynamically (not manually) add new collections to shards every day. Any knowledge/suggestions from experienced users? Should I change my design?
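For what it's worth, the per-day/per-pattern collection layout can be generated programmatically; a hypothetical naming scheme (the `msgs_` prefix and zero-padding are made up for illustration) might look like:

```python
from datetime import date

def collection_name(day: date, pattern_id: int) -> str:
    """Hypothetical naming scheme: one collection per (day, message
    pattern), e.g. 'msgs_20240101_p0007'. A fixed-width, sortable name
    makes it easy to create, shard, and later drop each day's ~2000
    collections from a script rather than by hand."""
    return f"msgs_{day:%Y%m%d}_p{pattern_id:04d}"

name = collection_name(date(2024, 1, 1), 7)
```

A daily job can then loop over the known pattern ids, create the collections, and enable sharding on each one before the day's writes begin.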

Read Model Partition Key Strategy

I have a collection of documents that looks like the following:
There is one document per VIN/SiteID and our access pattern is showing all documents
at a specific site. I see two potential partition keys we could choose from:
SiteID - We only have 75 sites, so the cardinality is not very high. Also, the documents are not very big, so the 10 GB logical-partition limit is probably OK.
SiteID/VIN - The data is now more evenly distributed, but that means each logical partition will only store one item. Is this an anti-pattern? Also, to support our access pattern we will need a cross-partition query. Again, the data set is small, so is this a problem?
Based on what I am describing, which partition key makes more sense?
Any other suggestions would be greatly appreciated!
Your first option makes a lot of sense and could be a good partition key but the words "probably OK" don't really breed confidence. Remember, the only way to change the partition key is to migrate to a new collection. If you can take that risk then SiteId (which I'm guessing you will always have) is a good partition key.
If you have both VIN and SiteId available when you are reading or querying, then this is the safer combination. There is no problem per se with having each logical partition store one item. It's only a problem when you are doing cross-partition queries. If you know both VIN and SiteId in your queries, then it's a great plan.
You also have to remember that your RUs are evenly split between your partitions inside a collection.
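The trade-off between the two candidate keys can be shown with a toy example: SiteID alone groups a site's documents into one logical partition (site queries hit one partition), while a synthetic SiteID/VIN key yields one document per logical partition (site queries fan out). The `/`-joined key format is an assumption for illustration.

```python
def partition_key(doc: dict, fine_grained: bool) -> str:
    """Compute the candidate partition key for a document.

    fine_grained=False -> SiteID alone (coarse, one partition per site)
    fine_grained=True  -> synthetic SiteID/VIN (one doc per partition)
    """
    if fine_grained:
        return f"{doc['SiteID']}/{doc['VIN']}"
    return doc["SiteID"]

docs = [
    {"SiteID": "S1", "VIN": "V1"},
    {"SiteID": "S1", "VIN": "V2"},
]
coarse = {partition_key(d, False) for d in docs}  # one partition holds the site
fine = {partition_key(d, True) for d in docs}     # site query must fan out
```

With 75 sites and small documents, the coarse key keeps the "show all documents at a site" query single-partition, which is why the answer leans that way.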

Mongo Architecture Efficiency

I am currently working on designing a local content-sharing system that depends on MongoDB. I need to make a critical architecture decision that will undoubtedly have a huge impact on query performance, scaling, and overall long-term maintainability.
Our system has a library of topics, each topic is available in specific cities/metropolitan areas. When a person creates a piece of content it needs to be stored as part of the topic in a specific city. There are three approaches I am currently considering to address these requirements (And open to other ideas as well).
Option 1 (Single Collection per Topic/City):
Example: a collection name would be TopicID123CityID456 and each entry would obviously be a document within that collection.
Option 2 (Single Topic Collection)
Example: A collection name would be Topic123 and each entry would create a document that contains an indexed cityID.
Option 3 (Single City Collection)
Example: A collection name would be City456 and each entry would create a document that contains an indexed topicID
When querying the DB I always want to build a feed in date order based on the member's selected topic(s) and city. Since members can group multiple topics together to build a custom feed, option 3 seems the best; however, I am concerned about the long-term performance of this approach. It seems option 1 would be the most performant, but it also forces multiple queries when selecting more than one topic.
Another thing that I need to consider is some topics will be far more active and grow much larger than other topics which will also vary by location.
Since I still consider myself a beginner with MongoDB, I want to make sure the general DB structure is right before coding all of the logic around writing and retrieving the data. And I don't know how well Mongo performs with hundreds of thousands, if not millions, of documents in a collection, hence my uncertainty about the approach.
From experience which is the most optimal way of tackling the storage and recall of this data? Any insight would be greatly appreciated.
UPDATE: June 22, 2016
It is important to note that we are starting in a single-DB-server environment. #profesor79 provided a great scaling solution for once we need to move to a multi-server (sharded) environment.
From your 3 proposals I will pick number 4 :-)
Have one collection sharded over multiple servers.
Instead of one collection per topic/city pair (e.g. TopicCity), we can have a single collection covering all topics and all cities.
Then the collection topicCities will hold all documents, sharded.
Sharding on the key {topic: 1, city: 1} will balance the load across the shard servers, and any time you need more power you can add another shard to the cluster.
Any comments welcome!
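As a rough illustration of how a compound shard key spreads load, here is a toy hash-based router. Real MongoDB sharding assigns chunk ranges managed by the balancer, so this only shows the load-spreading idea, not the actual mechanism; the names `Topic123`/`City456` come from the question above.

```python
import hashlib

def shard_for(topic_id: str, city_id: str, n_shards: int) -> int:
    """Map a (topic, city) compound key onto one of n_shards.

    Documents sharing a topic and city land on the same shard, so a
    single-feed query stays targeted, while distinct (topic, city)
    pairs scatter across the cluster and balance the write load.
    """
    digest = hashlib.sha256(f"{topic_id}|{city_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

shard = shard_for("Topic123", "City456", 4)
```

Adding a shard just means raising `n_shards` (in real MongoDB, adding a member and letting the balancer migrate chunks), which is the "add more power any time" property the answer describes.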