Which is a more efficient way to implement database sharding: PostgreSQL's foreign data wrappers, or multiple unrelated Postgres instances? - postgresql

One way of implementing database sharding in PostgreSQL 11 is to partition the table and then use the foreign data wrapper to set things up so that the shards run in their own containers (read more here).
What you get with this approach is that you only deal with one database.
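For illustration, here is a minimal sketch of that first setup, driven from Python with psycopg2; the server names, connection strings, table, and hash-partitioning scheme are all made up:

    # Sketch only: declarative partitioning plus postgres_fdw, so each
    # partition's rows live on their own Postgres instance/container.
    # All names and connection strings below are hypothetical.
    import psycopg2

    DDL = """
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    -- one foreign server per shard container
    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard1', port '5432', dbname 'app');
    CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
        OPTIONS (user 'app', password 'secret');

    -- the coordinator only holds the partitioned parent table
    CREATE TABLE measurements (
        id      bigint NOT NULL,
        payload jsonb
    ) PARTITION BY HASH (id);

    -- this partition's rows physically live on shard1; a table named
    -- measurements_p0 must already exist on that instance
    CREATE FOREIGN TABLE measurements_p0
        PARTITION OF measurements FOR VALUES WITH (MODULUS 2, REMAINDER 0)
        SERVER shard1;
    -- ...repeat for shard2 / REMAINDER 1, and so on
    """

    with psycopg2.connect("dbname=app host=coordinator") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)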
Another way of implementing database sharding in PostgreSQL 11 is basically to run multiple instances of Postgres and handle all the sharding logic in application code. For example, add an extra field to the data table called sharding_id, which we can use to decide which instance to query to retrieve the data: if the sharding ID is 1, query instance 1.
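And a minimal sketch of this second, application-level approach; the connection strings, table, and routing rule are just assumptions:

    # Sketch only: route each query to one of several independent Postgres
    # instances based on the sharding_id kept in application code.
    import psycopg2

    SHARDS = {  # hypothetical DSNs, one per unrelated instance
        1: "dbname=app host=shard1",
        2: "dbname=app host=shard2",
    }

    def fetch_row(sharding_id: int, row_id: int):
        dsn = SHARDS[sharding_id]  # sharding id 1 -> instance 1, etc.
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT * FROM data WHERE sharding_id = %s AND id = %s",
                    (sharding_id, row_id),
                )
                return cur.fetchone()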
Which of these approaches is better in terms of performance?

This question would be as unanswerable as "which is better: PostgreSQL or Oracle?", if sharding with foreign data wrappers were functional.
Alas, sharding by foreign data wrapper doesn't work yet. The missing link is that currently (v13), PostgreSQL cannot scan partitions that are foreign tables in parallel.

Related

Oracle index to AWS Redshift Sortkey

I am new to Redshift and am migrating from Oracle to Redshift.
One of the Oracle tables has around 60 indexes. AWS recommends that it's good practice to have around 6 compound sort keys.
How would these 60 Oracle indexes translate to Redshift sort keys? I understand there is no automated conversion, and that I can't have all 60 of them as compound sort keys. Since I am new to Redshift, may I know how this conversion is usually approached?
In Oracle we can keep adding indexes to the same table and the queries/reports can use them, but in Redshift changing sort keys means recreating the table. How do we get the best performance for all the queries that use different filter columns and join columns on the same table?
Thanks
Redshift is a columnar database, and it doesn't have indexes in the same sense as Oracle at all.
You can think of Redshift's compound sort key (not interleaved) as an IOT (index-organized table) in Oracle, with all the data physically sorted by this compound key.
If you create an interleaved sort key on x columns, it will act, in some manner, like a separate index on each of the x columns.
Either way, being a columnar database, Redshift can outperform Oracle on many aggregation queries thanks to its compression and data structure. The main factors that affect performance in Redshift are the distribution style and key, the sort key, and column encodings.
If you can't fit all your queries to one table structure, you can duplicate the table with a different structure but the same data. This approach is widely used in big-data columnar databases (for example, projections in Vertica) and helps achieve performance, with storage being the cost.
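For example, a rough sketch of where those knobs (distribution style/key, sort key, column encodings) appear in the DDL; the table, keys, and encodings here are hypothetical, and it is executed from Python since Redshift speaks the Postgres wire protocol:

    # Sketch only: a made-up fact table showing where distribution style/key,
    # the (single) compound sort key, and column encodings are declared.
    import psycopg2

    DDL = """
    CREATE TABLE sales (
        sale_id     BIGINT        ENCODE az64,
        customer_id BIGINT        ENCODE az64,
        sale_date   DATE          ENCODE az64,
        amount      DECIMAL(12,2) ENCODE az64
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)                       -- co-locate rows by join column
    COMPOUND SORTKEY (sale_date, customer_id);  -- at most one sort key per table
    """

    conn = psycopg2.connect(
        "host=my-cluster.example.redshift.amazonaws.com port=5439 "
        "dbname=dev user=admin password=secret"
    )
    with conn:
        with conn.cursor() as cur:
            cur.execute(DDL)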
Please review this page with several useful tips about Redshift performance:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
First a few key points
Redshift <> Oracle
Redshift does not have indexes; Redshift sort keys <> Oracle indexes.
Hopefully, you are not expecting Redshift to replace Oracle for your OLTP workload. Most of those 60 indexes are likely to be there to optimise OLTP-type workloads.
Max Redshift sort keys per table = 1
You cannot sort your Redshift data in more than one way! The sort key orders your table data. It is not an index.
You can specify an interleaved or a compound sort key.
Query Tuning
Hopefully, you will be using Redshift for analytical-type queries. You should define sort and distribution keys based upon your expected queries, following the best practices here and the tutorial here.
Tuning Redshift is partly an art, you will need to use trial and error!
If you want specific guidance on this, could you please edit your question to be specific about what you are doing?

How to design the tables/entities in NoSQL DB?

I do my first steps in NoSQL databases, thus I would like to hear the best practices about implementing the following requirement.
Let's suppose I have a messages database powered by the MongoDB engine. This DB contains a collection of documents, where each document has the following fields:
time stamp;
message author/source;
message content.
Now, I want to build a list of authors/sources in order to add some metadata about each source. In a classical RDBMS, I would define a table tblSources where I would store the names of the message sources and all additional metadata (or links to the relevant tables) for each author.
What is the right approach to such task in NoSQL/MongoDB world?
It really depends on how you want to use the data. NoSQL DBs are generally not designed with fast joins in mind, but they are still capable of doing joins and storing foreign keys.
Your options here are really the following (a short sketch of both is below):
duplicate data, i.e. store the author metadata in every document. This might be better in the case where you are really trying to optimize lookups and use Mongo as a key-value store
join on a foreign key - this is pretty similar to how you would use an RDBMS
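A minimal sketch of both options with pymongo; the collection and field names are made up:

    # Sketch only: the two options above, using hypothetical collections.
    from pymongo import MongoClient

    db = MongoClient()["chat"]

    # Option 1: duplicate the author metadata inside every message document.
    db.messages.insert_one({
        "time": "2023-01-01T12:00:00Z",
        "content": "hello",
        "author": {"name": "alice", "country": "NL"},  # embedded metadata
    })
    msg = db.messages.find_one({"author.name": "alice"})  # no join needed

    # Option 2: keep a separate sources collection and join on a foreign key.
    db.sources.insert_one({"_id": "alice", "country": "NL"})
    db.messages.insert_one({"time": "2023-01-01T12:01:00Z",
                            "content": "hi again",
                            "author_id": "alice"})
    joined = db.messages.aggregate([
        {"$lookup": {"from": "sources", "localField": "author_id",
                     "foreignField": "_id", "as": "author"}},
    ])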

SQL-like querying on a DB with more than 1 key in a table

We know that there is the concept of a primary key in traditional RDBMS systems. This primary key is basically used to index records of the table on that particular key, for faster retrieval. I know that there are NoSQL stores like Cassandra which offer secondary-key indexing, but is there a way, or an existing DB, which follows exactly the same schema as traditional RDBMS systems (i.e. a DB split into various tables to hold different kinds of data) but provides indexing on 2 or more keys?
An example of a use case would be:
There is a one-to-one mapping between 10 different people's names and their ages. Now if I keep this information in a table with the name of the person as the primary key, then retrieving the age given the name of a person is relatively faster than retrieving the name given the age of the person. If I could index both columns, then the second case would also have been faster.
An alternative with a traditional RDBMS would be to have 2 tables with the same data, the only difference being that the primary key in one of them is the name and in the other it is the age, but that would waste a large amount of space for a large number of records.
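(For concreteness, here is a minimal sketch of the two lookups in that example, using SQLite purely as a stand-in and made-up names, with a secondary index added on the second column, i.e. the "index both columns" case described above.)

    # Sketch only: name->age and age->name lookups, with an index on each
    # lookup column so either direction can be served from an index.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT PRIMARY KEY, age INTEGER)")
    conn.execute("CREATE INDEX idx_people_age ON people (age)")  # secondary index
    conn.executemany("INSERT INTO people VALUES (?, ?)",
                     [("alice", 30), ("bob", 41)])

    age_by_name = conn.execute(
        "SELECT age FROM people WHERE name = ?", ("alice",)).fetchone()
    name_by_age = conn.execute(
        "SELECT name FROM people WHERE age = ?", (41,)).fetchone()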
It is sad to see no response to this question for a very long time. In all this time of doing research on the topic, I found the FastBit Index to be one plausible solution for indexing virtually every column of the records in a table. It also provides SQL-like semantics for querying data and delivers performance on the order of a few milliseconds when querying millions of rows of data (on the order of GBs).
Please suggest if there are any other NoSQL or SQL DBs which can deliver similar functionality with a good performance level.

InsertBatch in Sharding

What is actually happening behind the scenes with a big InsertBatch if one is writing to a sharded cluster? Does MongoDB actually support bulk insert, or is InsertBatch actually inserting one document at a time at the server level? How does this work with sharding then? Does this mean that a mongos will look at every item in the batch to figure out the shard key of each item and then route it to the right server? This would break bulk insert, if it exists, and does not seem efficient. What are the mechanics of InsertBatch for a sharding solution? I am using version 2.0 and am willing to upgrade if that makes any difference.
Bulk inserts are an actual MongoDB feature and are (somewhat) more performant than separate per-document inserts due to fewer round trips.
In a sharded environment, if mongos receives a bulk insert it will figure out which part of the bulk has to be sent to which shard. There are no differences between 2.0 and 2.1, and it is the most efficient way to bulk insert data into a sharded database.
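For example, from the driver's side the whole batch goes out in one call (shown here with the modern pymongo insert_many as a stand-in for InsertBatch; the collection name and shard key are assumptions), and mongos does the per-shard grouping:

    # Sketch only: one bulk insert against a collection assumed to be
    # sharded on "user_id"; mongos groups the documents by shard key and
    # forwards each group to the shard that owns its chunk.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")  # connect to mongos
    events = client["app"]["events"]

    docs = [{"user_id": i % 4, "seq": i, "payload": "x"} for i in range(1000)]
    result = events.insert_many(docs, ordered=False)  # unordered lets shards work in parallel
    print(len(result.inserted_ids))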
If you're curious about how exactly mongos works, have a look at its source code here:
https://github.com/mongodb/mongo/tree/master/src/mongo/s

MongoDB sharding/partitioning 100+ million rows in one table on one machine

Does MongoDB support something akin to ordinary database partitioning on a single machine? I’m looking for very fast queries (writes are less important) on a single large table on one machine.
I tried sharding data using only one machine, in the hopes that it would behave similarly to ordinary database partitioning, but the performance was subpar. I also did a quick custom-code partitioning implementation, which was much faster. But I would prefer to use built-in MongoDB functionality if possible.
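(For reference, a minimal sketch of the kind of custom-code partitioning meant above, assuming documents are bucketed by a numeric key into separate collections; all names are made up.)

    # Sketch only: manually partition one logical table across several
    # collections in a single mongod, routed by the document's key.
    from pymongo import MongoClient

    db = MongoClient()["bigdata"]
    N_PARTITIONS = 16

    def partition_for(key: int):
        return db[f"rows_p{key % N_PARTITIONS}"]  # rows_p0 .. rows_p15

    def insert_row(doc: dict):
        partition_for(doc["key"]).insert_one(doc)

    def find_row(key: int):
        # a query only touches one small partition instead of the full table
        return partition_for(key).find_one({"key": key})

    insert_row({"key": 12345, "value": "x"})
    print(find_row(12345))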
What are the best practices for MongoDB for this scenario?