SQL like querying on a DB with more than 1 key in a table

SQL like querying on a DB with more than 1 key in a table - nosql

We know that there is the concept of a primary key in traditional RDBMS systems. This primary key is basically used to index records in the table on this particular key for faster retrieval. I know that there are NOSQL stores like Cassandra which offer secondary key indexing but is there a way or an existing DB which follows exactly the same schema as in traditional RDBMS systems (i.e. a DB split into various tables to hold different kinds of data) but provides indexing on 2 or more keys.
An example of a use case for the same is:
There is a one-to-one mapping between 10 different people's names and their ages. Now if I keep this information in a table with the name of the person being the primary key, then retrieval of age given the name of a person is relatively faster than retrieving the name given the age of the person. If i could index both the columns, then the second case also would have been faster.
An alternative to doing this with traditional RDBMS would be to have 2 tables with the same data with just the difference that the primary key in one of them is the name and in the other, it is the age but that would be a wastage of a large amount of space in case of large number of records.

It is sad to see no response on this question for a very long time. In all this time of doing some research on the same , I found FastBit Index as one of the plausible solutions for doing indexing on virtually every column of the record in a table. It also provides SQL like semantics for querying data and delivers performance of the order of a few milliseconds when queried on millions of rows of data (of the order of GBs).
Please suggest if there are any other NOSQL or SQL DBs which can deliver similar kind of functionality with a good performance level.

Related

What are the drawbacks of per server long range as primary keys for sharding database?

I am designing a sharded database. Many times we use two columns, first for logical shard and second for uniquely identifying a row in the shard. Instead of that I am planning to have just a 1 column with long datatype for primary key. To have unique key throughout the servers I am planning to have bigserial that will generate non overlapping range.
server
PK starts from
PK ends at
1
1
9,999,999,999
2
10,000,000,000
19,999,999,999
3
20,000,000,000
29,999,999,999
4
30,000,000,000
39,999,999,999
5
40,000,000,000
49,999,999,999
and so on.
In future I should be able to
Split large server to two or more small servers
Join two or more small servers to 1 big server
Move some rows from Server A to Server B for better utilization of resources.
I will also have a lookup table to which will contain information on range and target server.
I would like to learn about drawbacks of this approach.

I recommend that you create the primary key column on server 1 like this:
CREATE TABLE ... (
id bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1000),
...
);
On the second server you use START WITH 2, and so on.
Adding a new shard is easy, just use a new start value. Splitting or joining shards is trivial (as far as the primary keys are concerned...), because new value can never conflict with old values, and each shard generates different values.

The two most common types of sharding keys are basically:
Based on a deterministic expression, like the modulus-based method suggested in #LaurenzAlbe's answer.
Based on a lookup table, like the method you describe in your post.
A drawback of the latter type is that your app has to check the lookup table frequently (perhaps even on every query), because the ranges could change. The ranges stored in the lookup table might be a good thing to put in cache, to avoid frequent SQL queries. Then replace the cached ranges when you change them. I assume this will be infrequent. With a cache, it's a pretty modest drawback.
I worked on a system like this, where we had a schema for each of our customers, distributed over 8 shards. At the start of each session, the app queried the lookup table to find out which shard the respective customer's data was stored on. Once a year or so, we would move some customer schemas around to new shards, because naturally they tend to grow. This included updating the lookup table. It wasn't a bad solution.
I suggest that you will eventually have multiple non-contiguous ranges per server, because there are hotspots of data or traffic, and it makes sense to split the ranges if you want to move the least amount of data.
server
PK starts from
PK ends at
1
1
9,999,999,999
2
10,000,000,000
19,999,999,999
3
20,000,000,000
29,999,999,999
4
30,000,000,000
32,999,999,999
3
33,000,000,000
29,999,999,999
5
40,000,000,000
49,999,999,999
If you anticipate moving subsets of data from time to time, this can be a better design than the expression-based type of sharding. If you use an expression, and need to move data between shards, you must either make a more complex expression, or else rebalance a lot of data.

Storing primary key of Relational database in document of MongoDB

Suppose, I have a table (customers) in Oracle with column names as customer_id(PK), customer_name, customer_email, customer address. And I have a collection (products) in MongoDB which is storing customer_id as one of its field. Below, is a sample of document in products collection, which is storing customer_id "customer123", which is primary key in customers table in Oracle database.
{
_id : "product124",
customer_id: "customer123",
product_name: "hairdryer"
}
My questions is, Is it a good idea to use different types of databases when one field like customer_id here is shared between them. Is it a good practice in enterprises level development?
Please ignore the use case, as I am just trying to give a simple example to provide better understanding of the problem.

I would say it is acceptable to use different databases in distributed systems and keep references between entities, but it really depends on the use case. If you plan to perform frequent and heavy joins between these 2 entities then storing them in separated databases (especially of different types) might dramatically affect your performance. However, if your use case does not require frequent relations resolving, this approach could work. But bear in mind that you need to consider the future scale of your application and how would this architectural decision affect the potential growth.

Redshift Composite Sortkey - how many columns should we use?

I'm building several very large data tables on Amazon Redshift, that should hold data covering several frequently-queried properties with the relevant metrics.
We're using an even distribution style ("diststyle even") to have all the nodes participate in query calculations, but I am not certain about the length of the sortkey.
It definitely should be compound - every query will use first filter on date and network - but after that level I have about 7 additional relevant factors that can be queried on.
All the examples I've seen use a compound sort key of 2-3 fields, 4 at most.
My question is -why not use a sortkey that includes all the key fields in the table? What are the downsides for having a long sortkey?

VACUUM will also take longer if you have several sort keys.

What does denormalizing mean?

While reading the article http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3 chapter: "Rules of Thumb: Your Guide Through the Rainbow" i came across the words: embedding and denormalizing.
One: favor embedding unless there is a compelling reason not to
Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
I know embedding is nesting documents, instead of writing seperate tables/collections.
But i have no clue what denormalizing means.

Denormalization is the opposite of normalization, a good practice to design relational databases like MySQL, where you split up one logical dataset into separated tables to reduce redundancy.
Because MongoDB does not support joins of tables you prefere to duplicate an attribute into two collections if the attribute is read often and less updated.
E.G.: You want to save data about a person and want to save the gender of the Person:
In a SQL database you would create a table person and a table gender, store a foreign key of the genders ID in the person table and perform a join to get the full information. This way "male" and "female" only exists once as a value.
In MongoDB you would save the gender in every Person so the value "male" and "female" can exist multiple times but the lookup is faster because the data is tightly coupled in one object of a collection.

Cassandra DB Design

I come from RDBMS background and designing an app with Cassandra as backend and I am unsure of the validity and scalability of my design.
I am working on some sort of rating/feedback app of books/movies/etc. Since Cassandra has the concept of flexible column families (sparse structure), I thought of using the following schema:
user-id (row key): book-id/movie-id (dynamic column name) - rating (column value)
If I do it this way, I would end up having millions of columns (which would have been rows in RDBMS) though not essentially associated with a row-key, for instance:
user1: {book1:Rating-Ok; book1023:good; book982821:good}
user2: {book75:Ok;book1023:good;book44511:Awesome}
Since all column families are stored in a single file, I am not sure if this is a scalable design (or a design at all!). Furthermore there might be queries like "pick all 'good' reviews of 'book125'".
What approach should I use?

This design is perfectly scalable. Cassandra stores data in sparse form, so empty cells don't consume disk space.
The drawback is that cassandra isn't very good in indexing by value. There are secondary indexes, but they should be used only to index a column or two, not each of million of columns.
There are two options to address this issue:
Materialised views (described, for example, here: http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/). This allows to build some set of predefined queries, probably quite complex ones.
Ad-hoc querying is possible with some sort of map/reduce job, that effectively iterates over the whole dataset. This might sound scary, but still it's pretty fast: Cassandra stores all data in SSTables, and this iterating might be implemented to scan data files sequentially.

Start from a desired set of queries and structure your column families to support those views. Especially with so few fields involved, each CF can act cheaply as its own indexed view of your data. During a fetch, the key will partition the data ultimately to one specific Cassandra node that can rapidly stream a set of wide rows to your app server in a pre-determined order. This plays to one of Cassandra's strengths, as the fragmentation of that read on physical media (when not cached) is extremely low compared to bouncing around the various tracks and sectors on an indexed search of an RDBMS table.
One useful approach when available is to select your key to segment the data such that a full scan of all columns in that segment is a reasonable proposition, and a good rough fit for your query. Then, you filter what you don't need, even if that filtering is performed in your client (app server). All reviews for a movie is a good example. Even if you filter the positive reviews or provide only recent reviews or a summary, you might still reasonably fetch all rows for that key and then toss what you don't need.

Another option is if you can figure out how to partition data(by time, by category), playOrm offers a solution of doing S-SQL into a partition which is very fast. It is very much like an RDBMS EXCEPT that you partition the data to stay scalable and can have as many partitions as you want. partitions can contain millions of rows(I would not exceed 10 million rows though in a partition).
later,
Dean

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse