I am designing a sharded database. A common approach uses two columns: one for the logical shard and one that uniquely identifies a row within the shard. Instead, I am planning to use a single column of a long datatype as the primary key. To keep keys unique across all servers, I am planning to use a bigserial on each server that generates values from a non-overlapping range.
| Server | PK starts from | PK ends at |
|---|---|---|
| 1 | 1 | 9,999,999,999 |
| 2 | 10,000,000,000 | 19,999,999,999 |
| 3 | 20,000,000,000 | 29,999,999,999 |
| 4 | 30,000,000,000 | 39,999,999,999 |
| 5 | 40,000,000,000 | 49,999,999,999 |
and so on.
In the future, I should be able to:
Split large server to two or more small servers
Join two or more small servers to 1 big server
Move some rows from Server A to Server B for better utilization of resources.
I will also have a lookup table that maps each range to its target server.
I would like to learn about the drawbacks of this approach.
I recommend that you create the primary key column on server 1 like this:
CREATE TABLE ... (
id bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1000),
...
);
On the second server you use START WITH 2, and so on.
Adding a new shard is easy: just use a new start value. Splitting or joining shards is trivial (as far as the primary keys are concerned...), because new values can never conflict with old values, and each shard generates different values.
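To illustrate why the interleaved sequences never collide, here is a minimal Python sketch that simulates the identity columns. The names `INCREMENT`, `shard_ids`, and `originating_shard` are made up for this illustration and are not part of PostgreSQL.

```python
from itertools import islice

INCREMENT = 1000  # same INCREMENT BY on every shard; this caps the scheme at 1000 shards


def shard_ids(shard_no: int):
    """Yield the ids that shard `shard_no` would generate (START WITH shard_no)."""
    value = shard_no
    while True:
        yield value
        value += INCREMENT


def originating_shard(row_id: int) -> int:
    """Recover which shard generated an id from its residue modulo INCREMENT."""
    residue = row_id % INCREMENT
    # Shard 1000 generates the multiples of 1000, whose residue is 0.
    return residue if residue != 0 else INCREMENT


first_on_1 = list(islice(shard_ids(1), 3))  # [1, 1001, 2001]
first_on_2 = list(islice(shard_ids(2), 3))  # [2, 1002, 2002]
```

Note that `originating_shard` only tells you where a row was created. Once rows are moved or shards are merged, the id no longer says where the row currently lives.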
The two most common types of sharding keys are basically:
Based on a deterministic expression, like the modulus-based method suggested in @LaurenzAlbe's answer.
Based on a lookup table, like the method you describe in your post.
A drawback of the latter type is that your app has to check the lookup table frequently (perhaps even on every query), because the ranges could change. The ranges stored in the lookup table might be a good thing to put in cache, to avoid frequent SQL queries. Then replace the cached ranges when you change them. I assume this will be infrequent. With a cache, it's a pretty modest drawback.
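As a rough sketch of the cached lookup, assuming the lookup table holds inclusive `(start, end, server)` ranges, the app could keep a sorted copy in memory and route with a binary search. `ShardRouter`, `RANGES`, and the method names are hypothetical.

```python
import bisect

# Hypothetical contents of the lookup table: (range_start, range_end, server), inclusive.
RANGES = [
    (1, 9_999_999_999, 1),
    (10_000_000_000, 19_999_999_999, 2),
    (20_000_000_000, 29_999_999_999, 3),
]


class ShardRouter:
    """Keeps a sorted in-memory copy of the lookup table."""

    def __init__(self, ranges):
        self.reload(ranges)

    def reload(self, ranges):
        """Replace the cached ranges, e.g. after rows have been moved."""
        self._ranges = sorted(ranges)
        self._starts = [start for start, _end, _server in self._ranges]

    def server_for(self, pk: int) -> int:
        # Find the last range whose start is <= pk.
        idx = bisect.bisect_right(self._starts, pk) - 1
        if idx >= 0:
            start, end, server = self._ranges[idx]
            if start <= pk <= end:
                return server
        raise KeyError(f"no shard range covers primary key {pk}")


router = ShardRouter(RANGES)
```

Calling `router.reload(...)` with the fresh table contents is the "replace the cached ranges" step; how the app learns that the table changed is left open here.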
I worked on a system like this, where we had a schema for each of our customers, distributed over 8 shards. At the start of each session, the app queried the lookup table to find out which shard the respective customer's data was stored on. Once a year or so, we would move some customer schemas around to new shards, because naturally they tend to grow. This included updating the lookup table. It wasn't a bad solution.
I expect that you will eventually have multiple non-contiguous ranges per server, because there are hotspots of data or traffic, and splitting a range lets you move the least amount of data.
| Server | PK starts from | PK ends at |
|---|---|---|
| 1 | 1 | 9,999,999,999 |
| 2 | 10,000,000,000 | 19,999,999,999 |
| 3 | 20,000,000,000 | 29,999,999,999 |
| 4 | 30,000,000,000 | 32,999,999,999 |
| 3 | 33,000,000,000 | 39,999,999,999 |
| 5 | 40,000,000,000 | 49,999,999,999 |
If you anticipate moving subsets of data from time to time, this can be a better design than the expression-based type of sharding. If you use an expression, and need to move data between shards, you must either make a more complex expression, or else rebalance a lot of data.
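Splitting a range in the lookup table can be sketched as follows. This is only an illustration of the bookkeeping, assuming inclusive `(start, end, server)` tuples; `split_range` is a hypothetical helper, and the actual row migration is not shown.

```python
def split_range(ranges, split_at, new_server):
    """Return a new range list where keys >= split_at of the covering range go to new_server.

    `ranges` holds inclusive (start, end, server) tuples; the input list is not modified.
    """
    result = []
    for start, end, server in ranges:
        if start < split_at <= end:
            result.append((start, split_at - 1, server))  # lower part stays put
            result.append((split_at, end, new_server))    # upper part is reassigned
        else:
            result.append((start, end, server))
    return result


before = [
    (30_000_000_000, 39_999_999_999, 4),
    (40_000_000_000, 49_999_999_999, 5),
]
# Move the upper part of server 4's range to server 3.
after = split_range(before, 33_000_000_000, new_server=3)
```

Only the rows in the reassigned sub-range have to be copied; every other range and its data stay where they are.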
Suppose I have a table (customers) in Oracle with the columns customer_id (PK), customer_name, customer_email, and customer_address. I also have a collection (products) in MongoDB that stores customer_id as one of its fields. Below is a sample document from the products collection; it stores the customer_id "customer123", which is a primary key value in the customers table in the Oracle database.
{
_id : "product124",
customer_id: "customer123",
product_name: "hairdryer"
}
My question is: is it a good idea to use different types of databases when a field such as customer_id is shared between them? Is it good practice in enterprise-level development?
Please ignore the use case, as I am just trying to give a simple example to provide better understanding of the problem.
I would say it is acceptable to use different databases in distributed systems and keep references between entities, but it really depends on the use case. If you plan to perform frequent and heavy joins between these two entities, then storing them in separate databases (especially of different types) might dramatically hurt performance. However, if your use case does not require resolving these relations often, this approach could work. Bear in mind that you need to consider the future scale of your application and how this architectural decision would affect its potential growth.
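To make the cost of a cross-database "join" concrete, here is a small sketch in which the application resolves the reference itself. The dictionaries are in-memory stand-ins for the two stores and no real database drivers are used; all names and sample values are hypothetical.

```python
# In-memory stand-ins for the two stores (hypothetical data; no real drivers are used).
CUSTOMERS = {  # plays the role of the Oracle `customers` table, keyed by customer_id
    "customer123": {"customer_name": "Ann", "customer_email": "ann@example.com"},
}
PRODUCTS = [  # plays the role of the MongoDB `products` collection
    {"_id": "product124", "customer_id": "customer123", "product_name": "hairdryer"},
    {"_id": "product125", "customer_id": "customer999", "product_name": "kettle"},
]


def fetch_customers(customer_ids):
    """Stand-in for one batched query against the relational side."""
    return {cid: CUSTOMERS[cid] for cid in customer_ids if cid in CUSTOMERS}


def products_with_customers(products):
    """Join in the application: one batched lookup instead of one query per product."""
    wanted = {p["customer_id"] for p in products}
    customers = fetch_customers(wanted)
    # A missing customer becomes None: nothing enforces the reference across databases.
    return [{**p, "customer": customers.get(p["customer_id"])} for p in products]


joined = products_with_customers(PRODUCTS)
```

Every such join costs at least one extra round trip to the second database, and dangling references like `customer999` have to be handled in application code.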
I'm building several very large data tables on Amazon Redshift, that should hold data covering several frequently-queried properties with the relevant metrics.
We're using an even distribution style ("diststyle even") to have all the nodes participate in query calculations, but I am not certain about the length of the sortkey.
It definitely should be compound, since every query will first filter on date and network, but beyond that level I have about 7 additional relevant factors that can be queried on.
All the examples I've seen use a compound sort key of 2-3 fields, 4 at most.
My question is: why not use a sortkey that includes all the key fields in the table? What are the downsides of having a long sortkey?
VACUUM will also take longer if you have several sort keys.
While reading the article http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3, in the chapter "Rules of Thumb: Your Guide Through the Rainbow", I came across the words embedding and denormalizing.
One: favor embedding unless there is a compelling reason not to
Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
I know embedding is nesting documents instead of writing separate tables/collections.
But I have no clue what denormalizing means.
Denormalization is the opposite of normalization. Normalization is a good practice when designing relational databases such as MySQL: you split one logical dataset into separate tables to reduce redundancy.
Because MongoDB does not join collections the way a relational database joins tables, you prefer to duplicate an attribute into two collections if the attribute is read often and rarely updated.
For example, you want to save data about a person, including the person's gender:
In a SQL database you would create a table person and a table gender, store the gender's ID as a foreign key in the person table, and perform a join to get the full information. This way "male" and "female" each exist only once as a value.
In MongoDB you would save the gender in every person document, so the values "male" and "female" can exist multiple times, but the lookup is faster because the data is tightly coupled in one object of a collection.
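The trade-off can be sketched with plain Python dictionaries standing in for tables and documents. The sample names and helper functions are invented for this illustration.

```python
# Normalized: the gender label is stored once and referenced by id.
genders = {1: "male", 2: "female"}
persons_normalized = [
    {"name": "Alice", "gender_id": 2},
    {"name": "Bob", "gender_id": 1},
]

# Denormalized: every person document carries its own copy of the label.
persons_denormalized = [
    {"name": "Alice", "gender": "female"},
    {"name": "Bob", "gender": "male"},
]


def gender_normalized(person):
    """Reading needs a second lookup (the 'join')."""
    return genders[person["gender_id"]]


def gender_denormalized(person):
    """Reading needs only the document itself."""
    return person["gender"]


def rename_label_denormalized(persons, old, new):
    """Changing a label must touch every copy; returns how many documents changed."""
    changed = 0
    for person in persons:
        if person["gender"] == old:
            person["gender"] = new
            changed += 1
    return changed
```

Reads are cheaper in the denormalized form, while an update to the label has to visit every document that holds a copy. That is why the rule of thumb asks you to weigh the write/read ratio.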
I come from an RDBMS background and am designing an app with Cassandra as the backend, and I am unsure of the validity and scalability of my design.
I am working on a rating/feedback app for books, movies, etc. Since Cassandra has the concept of flexible column families (sparse structure), I thought of using the following schema:
user-id (row key): book-id/movie-id (dynamic column name) - rating (column value)
If I do it this way, I would end up having millions of columns (which would have been rows in RDBMS) though not essentially associated with a row-key, for instance:
user1: {book1:Rating-Ok; book1023:good; book982821:good}
user2: {book75:Ok;book1023:good;book44511:Awesome}
Since all column families are stored in a single file, I am not sure if this is a scalable design (or a design at all!). Furthermore there might be queries like "pick all 'good' reviews of 'book125'".
What approach should I use?
This design is perfectly scalable. Cassandra stores data in sparse form, so empty cells don't consume disk space.
The drawback is that Cassandra isn't very good at indexing by value. There are secondary indexes, but they should be used only to index a column or two, not each of millions of columns.
There are two options to address this issue:
Materialised views (described, for example, here: http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/). This lets you build a set of predefined queries, possibly quite complex ones.
Ad-hoc querying is possible with some sort of map/reduce job, that effectively iterates over the whole dataset. This might sound scary, but still it's pretty fast: Cassandra stores all data in SSTables, and this iterating might be implemented to scan data files sequentially.
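The materialised-view option can be sketched as a second structure that the application updates on every write. Plain Python dictionaries stand in for column families here; `rate` and both dictionary names are hypothetical, and no Cassandra client is involved.

```python
from collections import defaultdict

# Primary "column family": user id -> {item id -> rating}
ratings_by_user = defaultdict(dict)
# View "column family": (item id, rating) -> set of user ids
users_by_item_rating = defaultdict(set)


def rate(user_id, item_id, rating):
    """Write the rating and keep the view in sync in the same operation."""
    previous = ratings_by_user[user_id].get(item_id)
    if previous is not None:
        # Remove the stale view entry before adding the new one.
        users_by_item_rating[(item_id, previous)].discard(user_id)
    ratings_by_user[user_id][item_id] = rating
    users_by_item_rating[(item_id, rating)].add(user_id)


rate("user1", "book1023", "good")
rate("user2", "book1023", "good")
rate("user2", "book75", "ok")
rate("user1", "book1023", "ok")  # user1 changes their rating

good_for_book1023 = sorted(users_by_item_rating[("book1023", "good")])
```

A query such as "all 'good' reviews of a book" then becomes a single key lookup in the view, at the price of an extra write per rating.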
Start from a desired set of queries and structure your column families to support those views. Especially with so few fields involved, each CF can act cheaply as its own indexed view of your data. During a fetch, the key will partition the data ultimately to one specific Cassandra node that can rapidly stream a set of wide rows to your app server in a pre-determined order. This plays to one of Cassandra's strengths, as the fragmentation of that read on physical media (when not cached) is extremely low compared to bouncing around the various tracks and sectors on an indexed search of an RDBMS table.
One useful approach when available is to select your key to segment the data such that a full scan of all columns in that segment is a reasonable proposition, and a good rough fit for your query. Then, you filter what you don't need, even if that filtering is performed in your client (app server). All reviews for a movie is a good example. Even if you filter the positive reviews or provide only recent reviews or a summary, you might still reasonably fetch all rows for that key and then toss what you don't need.
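A minimal sketch of the "fetch the whole segment, then filter in the client" idea, assuming reviews are keyed by item and stored in time order. The data and function names are hypothetical stand-ins for a real wide-row read.

```python
# Hypothetical wide row: item id -> list of (timestamp, user id, rating) in stored order.
reviews_by_item = {
    "movie42": [
        (1001, "user1", "good"),
        (1002, "user2", "ok"),
        (1003, "user3", "good"),
    ],
}


def fetch_reviews(item_id):
    """Stand-in for reading one whole row from the store."""
    return reviews_by_item.get(item_id, [])


def good_reviews(item_id, since=0):
    """Fetch the full row, then drop what the caller does not need."""
    return [
        (ts, user)
        for ts, user, rating in fetch_reviews(item_id)
        if rating == "good" and ts >= since
    ]
```

This stays reasonable only while a single key's row is small enough to stream in full, which is the condition the paragraph above places on the choice of key.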
Another option, if you can figure out how to partition the data (by time, by category), is playOrm, which offers S-SQL queries against a single partition and is very fast. It is very much like an RDBMS, except that you partition the data to stay scalable, and you can have as many partitions as you want. Partitions can contain millions of rows (I would not exceed 10 million rows in a partition, though).
later,
Dean