Spark and Postgres - How to read big tables in a partitioned way? - postgresql

I'm looking for suggestions to approach this problem:
parallel queries using JDBC driver
big (in rows) Postgres table
there is no numeric column to be used as partitionColumn
I would like to read this big table using multiple parallel queries, but there is no evident numeric column to partition the table on. I thought about partitioning by the physical location of the data using CTID, but I'm not sure whether that path is worth following.

The spark-postgres library provides several functions to read and load Postgres data. It uses the COPY statement under the hood, so it can handle large Postgres tables.
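Separately, for the plain-JDBC route: Spark's `DataFrameReader.jdbc` accepts an explicit list of `predicates` (one partition per predicate), so CTID block ranges can stand in for the missing numeric partition column. A hedged sketch — the table name, block count, and connection details are all made up, and note that Postgres only uses efficient TID range scans from version 14 on (older versions fall back to one sequential scan per predicate):

```python
def ctid_predicates(total_blocks, num_partitions):
    """Build non-overlapping CTID range predicates, one per Spark partition.

    A CTID is (block, tuple); comparing against '(b,0)'::tid boundaries
    splits the table by physical block ranges. total_blocks would come
    from pg_class.relpages for the table, sampled beforehand.
    """
    step = max(1, total_blocks // num_partitions)
    bounds = list(range(0, total_blocks, step))
    preds = []
    for i, lo in enumerate(bounds):
        if i + 1 < len(bounds):
            preds.append(
                f"ctid >= '({lo},0)'::tid AND ctid < '({bounds[i + 1]},0)'::tid"
            )
        else:
            # Leave the last range open-ended so blocks added after
            # relpages was sampled are not silently skipped.
            preds.append(f"ctid >= '({lo},0)'::tid")
    return preds


# Hypothetical usage with PySpark (requires a Spark session and the
# Postgres JDBC driver on the classpath; url/table/credentials invented):
# df = spark.read.jdbc(
#     url="jdbc:postgresql://host/db",
#     table="big_table",
#     predicates=ctid_predicates(relpages, 16),
#     properties={"user": "...", "password": "..."},
# )
```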

Related

Have an ordinary table on a PostgreSQL TimescaleDB (timeseries) database

For a project I need two types of tables:
hypertables (a special type of table provided by the TimescaleDB extension for PostgreSQL) for some time-series records
ordinary tables which are not time series
Can I create a PostgreSQL TimescaleDB database and store my ordinary tables in it? Are all tables on a TimescaleDB instance hypertables (time series)? If not, is there any overhead to storing my ordinary tables in TimescaleDB?
If I can, is there any benefit to storing my ordinary tables in a separate, ordinary PostgreSQL database instead?
Can I create a PostgreSQL TimescaleDB and store my ordinary tables on it?
Absolutely... TimescaleDB is delivered as an extension to PostgreSQL, and one of its biggest benefits is that you can use regular PostgreSQL tables alongside the specialist time-series tables, including mixing regular tables and hypertables in the same SQL queries. Standard SQL works, plus there are some additional functions that Timescale created using PostgreSQL's extensibility features.
Are all the tables a hypertable (time series) on a PostgreSQL TimescaleDB?
No, you have to explicitly create a table as a hypertable for it to get TimescaleDB features. It's worth checking out the how-to guides in the Timescale docs for full (and up-to-date) details.
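As a minimal illustration (table and column names invented), a regular table and a hypertable can live side by side in one database; only the table explicitly passed to `create_hypertable` gets time-series behaviour:

```sql
-- Regular PostgreSQL table: nothing Timescale-specific happens to it.
CREATE TABLE devices (id serial PRIMARY KEY, name text);

-- Time-series table: created as a plain table first...
CREATE TABLE conditions (
  time        timestamptz NOT NULL,
  device_id   integer REFERENCES devices (id),
  temperature double precision
);

-- ...and explicitly promoted to a hypertable, partitioned on "time".
SELECT create_hypertable('conditions', 'time');
```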
If no, does it have some overhead if I store my ordinary tables in PostgreSQL TimescaleDB?
I don't think there's a storage overhead, and you might even see performance gains, e.g. in data ingest and query performance. This article may help clarify that: https://docs.timescale.com/timescaledb/latest/overview/how-does-it-compare/timescaledb-vs-postgres/
Overall, you can think of TimescaleDB as adding functionality to 'vanilla' PostgreSQL, so unless something in your application design requires keeping non-time-series data in a separate database, you aren't obliged to do that.
One other point, shared by a very experienced member of our Slack community [thank you Chris]:
To have time-series data and “normal” data (normalized) in one or separate databases for us came down to something like “can we asynchronously replicate the time-series information”?
In our case we use two different pg systems, one replicating asynchronously (for TimescaleDB) and one with synchronous replication (for all other data).
Transparency: I work for Timescale

Postgres Cluster exceeds temp_file_limit

Recently, we have been migrating our database from SQL Server to PostgreSQL. We didn't know that, by default, tables in Postgres are not clustered. Now that our data has grown so much, we want to CLUSTER our table like so:
CLUSTER table USING idx_table;
But it seems my data is too large, so it produces:
SQL Error [53400]: ERROR: temporary file size exceeds temp_file_limit
(8663254kB)
Since this isn't caused by a query, there is nothing I can tune to make it perform better. Is there any solution for this?
If, for example, I need to increase temp_file_limit, is it possible to increase it only temporarily? I'm only running this CLUSTER once.
There are some important differences between SQL Server and PostgreSQL.
Sybase SQL Server was derived from INGRES in the early eighties, at a time when INGRES made massive use of CLUSTERED indexes, which means the table is organized as an index. The SQL engine was designed specifically to optimize the use of CLUSTERED indexes, and that is still how SQL Server works today...
By the time Postgres was designed, the use of CLUSTERED indexes had fallen out of favor, and when Postgres adopted the SQL language and was renamed PostgreSQL, nothing changed in its use of CLUSTERED indexes.
So CLUSTERed tables in PostgreSQL are rarely optimal in execution plans. You have to verify individually, for each table and for the queries involving those tables, whether there is a benefit or not...
Another thing: CLUSTERing a table in PostgreSQL is not the equivalent of MS SQL Server's CLUSTERED indexes...
More information about this can be found in my paper:
PostgreSQL vs. SQL Server (MSSQL) – part 3 – Very Extremely Detailed Comparison
and especially in § "6 – The lack of Clustered Index (AKA IOT)".
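On the practical part of the question: temp_file_limit can be raised for just the one session that runs the CLUSTER, so no permanent or server-wide change is needed. Note that raising it above the configured value requires superuser privileges, and the table/index names below are the ones from the question:

```sql
-- Applies only to the current session; other connections keep the
-- configured limit. Superuser is required to raise the limit.
SET temp_file_limit = '50GB';

CLUSTER table USING idx_table;

-- Back to the server default for the rest of the session.
RESET temp_file_limit;
```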

Build table of tables from other databases in Postgres - (Multiple-Server Parallel Query Execution?)

I am trying to find the best way to build a database relation. I need to create a table that will contain data split across other tables from different databases. All the tables have exactly the same structure (same number of columns, same names and types).
In a single database, I would create a parent table with partitions. However, the volume of data is too big for a single database, which is why I am trying to split it. From the Postgres documentation, what I think I am after is "Multiple-Server Parallel Query Execution".
At the moment, the only solution I can think of is to build an API over the databases' addresses and use it to pull data across the network into the main parent database when needed. I also found the Postgres extension Citus, which might do the job, but I don't know how to implement a unique key across multiple databases (or shards, as Citus calls them).
Is there any better way to do it?
Citus would most likely solve your problem. It lets you use unique keys across shards if the key is the distribution column, or if it is a composite key that contains the distribution column.
You can also use a distributed partitioned table in Citus: a table partitioned on some column (a timestamp?) and hash-distributed on some other column (like the one in your existing approach). Citus handles query parallelization and result collection for you.
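A hedged sketch of that distributed-partitioned layout (all names invented; note that Postgres requires the primary key of a partitioned table to include the partition column, and Citus requires it to include the distribution column):

```sql
-- Partitioned by time, to be hash-distributed by tenant_id.
CREATE TABLE events (
  tenant_id  bigint NOT NULL,
  event_id   bigint NOT NULL,
  created_at timestamptz NOT NULL,
  payload    jsonb,
  PRIMARY KEY (tenant_id, created_at, event_id)
) PARTITION BY RANGE (created_at);

-- One time partition; more are added as time advances.
CREATE TABLE events_2023_01 PARTITION OF events
  FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');

-- Spread the shards across the Citus worker nodes by hash of tenant_id.
SELECT create_distributed_table('events', 'tenant_id');
```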

Most efficient way to extract tables from Redshift?

I have a large table (~1e9 rows, ~20 columns) in an AWS Redshift instance. I would like to extract this entire table through PostgreSQL in order to pipe the data into another columnar store. Ideally, columns would be extracted one at a time while maintaining identical row ordering, as that would greatly simplify the work on the receiving (columnar) end.
How can I ensure that the series of SQL queries stays exactly aligned? Thanks!
PS: I am aware of the UNLOAD-through-S3 option, but I am looking for a PostgreSQL option.
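One hedged approach (table and column names invented): impose a single explicit total ordering and repeat it in every per-column query, since without an ORDER BY neither Redshift nor PostgreSQL guarantees any row order at all:

```sql
-- Every per-column extract repeats the same total ordering.
-- "id" stands for any column (or combination) that is unique.
SELECT col_a FROM big_table ORDER BY id;
SELECT col_b FROM big_table ORDER BY id;

-- If no natural unique key exists, materialize a row number once;
-- ties in the ROW_NUMBER ordering are decided arbitrarily, but the
-- frozen copy keeps every subsequent extract aligned.
CREATE TABLE big_table_numbered AS
SELECT ROW_NUMBER() OVER (ORDER BY col_a, col_b) AS rn, t.*
FROM big_table t;

SELECT col_a FROM big_table_numbered ORDER BY rn;
```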

Can Spring-JPA work with Postgres partitioning?

We have a Spring Boot project that uses Spring-JPA for data access. We have a couple of tables where we create/update rows once (or a few times, all within minutes), and we never update rows that are older than a day. These tables (such as an audit table) can get very large, and we want to use Postgres' table partitioning features to break the data up by month, so that the main table always holds the current calendar month's data, while queries that need earlier months would somehow read from the other partitions.
Two questions:
1) Is this a good idea for archiving older data while leaving it queryable?
2) Does Spring-JPA work with partitioned tables? Or do we have to figure out how to break up the query, run native queries, and concatenate the result sets?
Thanks.
I have been working with Postgres partitioning, Hibernate, and Spring JPA for some time, so I can try to answer your questions.
1) Is this a good idea for archiving older data but still leave it query-able?
If you apply indexes and don't re-index the table frequently, then partitioning the data may yield faster query results.
You can also use Postgres' clustered-index feature (CLUSTER) to fetch data faster.
Because tables with older data are no longer updated, clustering them can improve performance efficiently.
2) Does Spring-JPA work with partitioned tables? Or do we have to figure out how to break up the query, run native queries, and concatenate the result sets?
Spring JPA works out of the box with partitioned tables. It retrieves data from the master as well as the child tables and returns the concatenated result set.
Note: issue with partitioned tables
The only issue you will face with partitioned tables is insertion.
Let me explain: with the trigger-based (inheritance) approach to partitioning, you create a trigger on the master table, and that trigger returns NULL. This is the root of the insertion issue with Spring JPA / Hibernate.
When you try to insert a row using Spring JPA or Hibernate, you will hit the error below:
Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1
To overcome this issue you need to override the implementation of the batcher.
In Hibernate you can provide a custom batcher factory implementation with the following configuration:
hibernate.jdbc.factory_class=path.to.my.batcher.factory.implementation
In Spring JPA you can achieve the same with a custom batch builder implementation:
hibernate.jdbc.batch.builder=path.to.my.batch.builder.implementation
References :
Custom Batch Builder/Batch in Spring-JPA
Demo Application
In addition to @Anil Agrawal's answer: if you are using Spring Boot 2, you need to define the custom batcher with the property:
spring.jpa.properties.hibernate.jdbc.batch.builder=net.xyz.jdbc.CustomBatchBuilder
You do not have to break down the JDBC query with Postgres 11+.
If you execute a SELECT on the main table with plain JDBC, the DB returns the aggregated results from the partitioned tables.
In other words, the work is done by the Postgres DB, so Spring JPA simply gets the result and maps it to objects as if there were no partitioning.
For inserts to work on a partitioned table, you need to make sure your partitions are already created; I don't think Spring Data will create them for you.
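A minimal declarative-partitioning sketch for Postgres 11+ (all names invented) showing both points: the application only ever queries the parent table, and each monthly partition must exist before rows for that month are inserted:

```sql
CREATE TABLE audit_log (
  id         bigserial,
  created_at timestamptz NOT NULL,
  details    jsonb
) PARTITION BY RANGE (created_at);

-- Partitions are not created automatically: create each month's
-- partition ahead of time (e.g. from a scheduled job).
CREATE TABLE audit_log_2023_01 PARTITION OF audit_log
  FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');

-- JPA/Hibernate only sees the parent table; Postgres routes both
-- reads and writes to the matching partition.
INSERT INTO audit_log (created_at, details) VALUES (now(), '{}');
```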