I have an existing relational Postgresql database. A few of the tables contain very fat blobs, they would be much better of as NoSQL Documents. This would significantly lighten our relational database.
So, we thought of moving those blob-table out into a NoSQL solution like CosmosDB or MongoDB. However there are foreign key dependencies with purely relational tables and this complicates moving those tables out into their own database.
I have found that PSQL natively supports storing Documents and can be distributed. The solutions I looked at so far are CitusData and Postgres XL. For those who used those how do they compare?
Has anyone encountered similar situations before? Did you separate out into a NoSQL database? Or has anyone partitioned their PSQL into relational and NoSQL parts? How did that go? What would you recommend to look out for in hindsight?
(Citus Engineer Here)
Postgres has JSONB column type which is powerful and flexible. What you can do is to keep your structural table as is and put a jsonb column for the blob data. Test this with single node Postgres and if that works for you, great!
If you have a problem with the scale of your data, i.e. memory or storage or CPU of a single machine is not enough for your workload and you cannot go bigger, then you can try scaling out with Citus or Postgres-XL.
I have no experience with Postgres-XL but Citus is pretty easy to try. There are docker images that you can use or you can create an account on Citus Cloud to try a 1-week free dev plan (it would not be suitable for benchmarking purposes).
Every RDBMS->NoSQL migration would require one of the two:
1. embedding some of these dependent documents into the ones that are actually queried by the user
2. referencing dependent documents by id and inferring these relationships on read.
Very typical, everyone does it every day, don't be afraid. BTW, you don't have to make a choice between Cosmos DB and MongoDB - just use Cosmos DB with MongoDB API.
Related
For a project I need two types of tables.
hypertable (which is a special type of table in PostgreSQL (in PostgreSQL TimescaleDB)) for some timeseries records
my ordinary tables which are not timeseries
Can I create a PostgreSQL TimescaleDB and store my ordinary tables on it? Are all the tables a hypertable (time series) on a PostgreSQL TimescaleDB? If no, does it have some overhead if I store my ordinary tables in PostgreSQL TimescaleDB?
If I can, does it have any benefit if I store my ordinary table on a separate ordinary PostgreSQL database?
Can I create a PostgreSQL TimescaleDB and store my ordinary tables on it?
Absolutely... TimescaleDB is delivered as an extension to PostgreSQL and one of the biggest benefits is that you can use regular PostgreSQL tables alongside the specialist time-series tables. That includes using regular tables in SQL queries with hypertables. Standard SQL works, plus there are some additional functions that Timescale created using PostgreSQL's extensibility features.
Are all the tables a hypertable (time series) on a PostgreSQL TimescaleDB?
No, you have to explicitly create a table as a hypertable for it to implement TimescaleDB features. It would be worth checking out the how-to guides in the Timescale docs for full (and up to date) details.
If no, does it have some overhead if I store my ordinary tables in PostgreSQL TimescaleDB?
I don't think there's a storage overhead. You might see some performance gains e.g. for data ingest and query performance. This article may help clarify that https://docs.timescale.com/timescaledb/latest/overview/how-does-it-compare/timescaledb-vs-postgres/
Overall you can think of TimescaleDB as providing additional functionality to 'vanilla' PostgreSQL and so unless there's a reason around application design to separate non-time-series data to a separate database then you aren't obliged to do that.
One other point, shared by a very experienced member of our Slack community [thank you Chris]:
To have time-series data and “normal” data (normalized) in one or separate databases for us came down to something like “can we asynchronously replicate the time-series information”?
In our case we use two different pg systems, one replicating asynchronously (for TimescaleDB) and one with synchronous replication (for all other data).
Transparency: I work for Timescale
I have a use case to distribute data across many databases on many servers, all in postgres tables.
From any given server/db, I may need to query another server/db.
The queries are quite basic, standard selects with where clauses on standard fields.
I have currently implemented postgres_FDW, (I'm, using postgres 9.5), but I think the queries are not using indexes on the remote db.
For this use case (a random node may query N other nodes), which is likely my best performance choice based on how each underlying engine actually executes?
The Postgres foreign data wrapper (postgres_FDW) is newer to
PostgreSQL so it tends to be the recommended method. While the
functionality in the dblink extension is similar to that in the
foreign data wrapper, the Postgres foreign data wrapper is more SQL
standard compliant and can provide improved performance over dblink
connections.
Read this article for more detailed info: Cross Database queryng
My solution was simple: I upgraded to Postgres 10, and it appears to push where clauses down to the remote server.
We have a large table in our Postgres production database which we want to start "sharding" using foreign tables and inheritance.
The desired architecture will be to have 1 (empty) table that defines the schema and several foreign tables inheriting from the empty "parent" table. (possible with Postgres 9.5)
I found this well written article https://www.depesz.com/2015/04/02/waiting-for-9-5-allow-foreign-tables-to-participate-in-inheritance/ that explains everything on how to do it from scratch.
My question is how to reduce the needed migration of data to a minimum.
We have this 100+ GB table now, that should become our first "shard". And in the future we will regulary add new "shards". At some point, the older shards will be moved to another tablespace (on cheaper hardware since they become less important).
My question now:
Is there a way to "ALTER" an existing table to be a foreign table instead?
No way to use alter table to do this.
You really have to basically do it manually. This is no different (really) than doing table partitioning. You create your partitions, you load the data. You direct reads and writes to the partitions.
Now in your case, in terms of doing sharding there are a number of tools I would look at to make this less painful. First, if you make sure your tables are split the way you like them first, you can use a logical replication solution like Bucardo to replicate the writes while you are moving everything over.
There are some other approaches (parallelized readers and writers) that may save you some time at the expense of db load, but those are niche tools.
There is no native solution for shard management of standard PostgreSQL (and I don't know enough about Postgres-XL in this regard to know how well it can manage changing shard criteria). However pretty much anything is possible with a little work and knowledge.
Hi Oracle DBA Gurus, Is there an easy way to do to refresh production database schemas to stage database with smaller amount produciton data (not all production data)? Both databases are 10g r2.
For the schemas? Yes, there exists numerous solutions for this. For the data? I don't believe there is a generic way to do this. Since your data is unique for you, it's impossible to build a generic solution copying a subset of data, taking into account relationships, triggers, stored procedures etc.
Depending on what data you have, it might not be hard to do manually anyway. Once you have working solution, it can be very handy for a long time with small efforts in updating.
Does anyone have experience of using PostgreSQL for an OLAP setup, using cubes against the database etc. Having come across a number of idiosyncracies when using MySQL for OLAP, are there reasons in favour of using PostgreSQL instead (assuming that I want to go the open source route)?
There are a number of data warehousing software vendors that are based on Postgresql (and contribute OLAP related changes back to core fairly regularly). Check out https://greenplum.org/. You'll find that PG works a lot better (for nearly any workload, OLAP especially) than MySQL. Greenplum and other similar solutions should work a bit better than PG depending on your data sets and use cases.
PGSQL is much better suited for Data Warehousing compared to MySQL. We had thought initially to go with MySQL, but it performs poorly in aggregations if data grows to a few million rows. PGSQL performs almost 20 times faster in caparison with MySQL for 20 million records for a single fact table on same hardware setup. If for some reason you choose to go with MySQL, then you should use MyISAM storage engine for fact tables rather then InnoDB; you will see slightly better performance.