I have to upsert a large number of rows into multiple tables in Postgres 9.6 daily. Some of these tables receive around 6 million rows of roughly 1 kB each, so the volume is large and the load needs to finish quickly.
I have some transformation needs, so I cannot use COPY directly from the source; COPY also doesn't update existing rows. I could use a foreign data wrapper after the transformation, but that would also require SQL. So I decided to write a Java program using the producer-consumer concurrency pattern, with producers reading from files and consumers writing to Postgres using multi-row upserts.
Since I can recreate the data from the ultimate source, I figure I can skip WAL and use unlogged tables: make them unlogged before the load and logged again afterwards. I am on 9.6.
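For concreteness, here is roughly what each consumer does. This is only a sketch; the table, columns and connection details are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Sketch of one consumer: flip the target table to UNLOGGED, load with
// multi-row INSERT ... ON CONFLICT, then flip it back to LOGGED.
// Table name, columns and connection details are placeholders.
public class UnloggedUpsertConsumer {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "loader", "secret")) {

            try (Statement ddl = conn.createStatement()) {
                ddl.execute("ALTER TABLE staging_events SET UNLOGGED"); // skip WAL during the load
            }

            // Real code would build a VALUES list as wide as the batch size;
            // two rows are enough to show the shape of the statement.
            String upsert = "INSERT INTO staging_events (id, payload) VALUES (?, ?), (?, ?) "
                    + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload";
            try (PreparedStatement ps = conn.prepareStatement(upsert)) {
                ps.setLong(1, 1L);
                ps.setString(2, "transformed row one");
                ps.setLong(3, 2L);
                ps.setString(4, "transformed row two");
                ps.executeUpdate();
            }

            try (Statement ddl = conn.createStatement()) {
                // Rewrites the table and writes it to WAL, so this step is not free.
                ddl.execute("ALTER TABLE staging_events SET LOGGED");
            }
        }
    }
}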
I have done all this and now want to run performance tests. Before I do that, I want to understand what a commit and a checkpoint mean for an unlogged table. I suspect the data and index files (I have dropped the index) are going to be written at checkpoint. Here are my questions:
1) What happens at commit, given that commit applies to WAL and there is no WAL data for unlogged tables? Does commit write my data and index files to disk, or is it an unnecessary operation and the data will only be written at checkpoint?
2) Ultimately, what should I be tuning for unlogged tables?
Thanks.
Related
I need to migrate from an old PostgreSQL database with an old schema (58 tables) to a new database with a new schema (40 tables). The two schemas are completely different.
It is not a simple migration (copy and paste), but rather copy-transform-paste.
I decided to write a batch job using Spring Batch, Spring Data and JPA. So I have two DataSources and a chained transaction manager. My Spring configuration mainly consists of a chunk-oriented step with a JpaPagingItemReader and an ItemWriterAdapter.
For performance, I also configured a Partitioner, which allows me to partition my source tables into several sub-tables, and a chunkSize of 500000.
Everything works smoothly, but given the size of my old tables it takes me a week to migrate all the data.
I would like to run a test that consists of running my batch without committing: Hibernate would generate all the SQL statements (for example into a ".sql" file) but not actually commit the data to the database.
This would let me see whether the commit is costly in execution time.
Is it possible to configure Hibernate to flush only but never commit? A kind of commit simulation?
Thanks.
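One way to simulate this with plain JPA is to flush the persistence context so that Hibernate issues its SQL, then roll the transaction back instead of committing; together with Hibernate's SQL logging this captures the generated statements without persisting anything. A minimal sketch, with a made-up persistence unit and a placeholder entity:

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.EntityTransaction;
import javax.persistence.Id;
import javax.persistence.Persistence;

public class FlushWithoutCommit {

    // Minimal placeholder entity; your real mapped entities go here.
    @Entity
    public static class TargetRow {
        @Id
        Long id;
        String name;

        TargetRow() { }
        TargetRow(Long id, String name) { this.id = id; this.name = name; }
    }

    public static void main(String[] args) {
        // "new-db" is a placeholder persistence unit name.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("new-db");
        EntityManager em = emf.createEntityManager();
        EntityTransaction tx = em.getTransaction();
        tx.begin();
        try {
            em.persist(new TargetRow(1L, "test"));
            em.flush();        // Hibernate sends its INSERT statements here...
        } finally {
            tx.rollback();     // ...but they are never committed
            em.close();
            emf.close();
        }
    }
}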
Usually, the costly part is foreign key and unique key checks as well as index maintenance, but since you don't describe how you fetch the data, it could very well be that you are accessing your data in an inefficient manner.
In general, I would recommend that you create a dump with pg_dump, restore it, and then try to do the migration in an SQL-only way. That way, no data has to flow around; it can stay on the machine, which is generally much more efficient.
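For example, once the old schema has been restored next to the new one, each copy-transform-paste step can often be expressed as a single INSERT ... SELECT that runs entirely on the server. A minimal sketch, with made-up schema, table and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of an SQL-only migration step: both schemas live in the same
// database, so the transformation happens entirely on the server.
// All schema, table and column names are made up.
public class SqlOnlyMigrationStep {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/newdb";
        try (Connection conn = DriverManager.getConnection(url, "migrator", "secret");
             Statement st = conn.createStatement()) {
            int rows = st.executeUpdate(
                    "INSERT INTO new_schema.customer (id, full_name, created_at) " +
                    "SELECT p.id, p.first_name || ' ' || p.last_name, p.creation_date " +
                    "FROM old_schema.person p");
            System.out.println("migrated " + rows + " rows");
        }
    }
}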
I'm migrating from a proprietary DBMS to PostgreSQL. In the proprietary DBMS, "offlining" and "onlining" data partitions is a very lightweight operation. I'm looking to implement similar functionality with PostgreSQL by backing up and restoring individual tables (partitions). Obviously, I need to avoid a performance regression. So my question is: what is the fastest way of:
Backing up a table (partition), both data and indexes
Taking the table offline (meaning that the data is now gone from the database)
Restoring the table (partition), both data and indexes
Once I have some advice I can design more targeted performance comparisons. Thanks in advance for any pointers.
What is fast and needs to be fast is adding or removing a partition (ALTER TABLE ... ATTACH/DETACH PARTITION).
After you have detached the partition, you are in no great hurry to back up or export the data. This can be done comfortably with pg_dump.
Similarly, importing the data for a table that is to become a new partition is normally not time critical.
If you need this to happen faster (for example, you want the old partition to be visible in another database as soon as it is detached in the old one), you could use logical replication to replicate the partition to another PostgreSQL database before you detach it. As soon as replication has caught up, you detach or drop the original partition and attach the copy in the other database.
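For reference, a minimal sketch of the detach and attach steps over JDBC, assuming declarative partitioning; the table name, partition name and range bounds are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of the "offline"/"online" steps as plain DDL over JDBC.
// measurements, measurements_2023_01 and the range bounds are made-up names.
public class PartitionOfflineOnline {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "admin", "secret");
             Statement st = conn.createStatement()) {

            // Take the partition "offline": it becomes a standalone table;
            // its data and indexes stay on disk but are no longer part of
            // the partitioned table.
            st.execute("ALTER TABLE measurements DETACH PARTITION measurements_2023_01");

            // ...dump it at leisure (pg_dump -t measurements_2023_01), drop it, etc.

            // Bring a previously restored table back "online" as a partition.
            st.execute("ALTER TABLE measurements ATTACH PARTITION measurements_2023_01 " +
                       "FOR VALUES FROM ('2023-01-01') TO ('2023-02-01')");
        }
    }
}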
As I understand it, pg_repack creates a temporary 'mirror' table (table B), copies the rows from the original table (table A), re-indexes them, and then replaces the original with the mirror. The mirroring step creates a lot of noise with logical replication (a lot of inserts at once), so I'd like to exclude the mirror table from replication.
I'm a bit confused about what happens during the switch-over, though. Is there a risk of losing some changes? I don't think there is, since all actual writes still go to the original table before and after the switch, so it should be safe, right?
We're running Postgres 10.7 on AWS Aurora, using wal2json as the output plugin for replication.
I have used neither pg_repack nor logical replication, but according to the pg_repack GitHub repository there is a possible issue when using pg_repack with logical replication; see
https://github.com/reorg/pg_repack/issues/135
To perform a repack, pg_repack will:
create a log table to record changes made to the original table.
add a trigger onto the original table, logging INSERTs, UPDATEs, and DELETEs into our log table.
create a new table containing all the rows in the old table.
build indexes on this new table.
apply all changes which have occurred in the log table to the new table.
swap the tables, including indexes and toast tables, using the system catalogs.
drop the original table.
In my experience, the log table keeps all the changes and applies them after the indexes are built; it is also used if pg_repack needs to roll back changes that were applied to the original table.
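If you do want to keep the repack work tables out of the change stream, and assuming your wal2json build supports the filter-tables option, a sketch of peeking at a slot while filtering out the repack schema (pg_repack creates its work tables there; the slot name is made up) could look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: peek at a logical replication slot while filtering out pg_repack's
// work tables (pg_repack creates them in the "repack" schema). The slot name
// is made up, and filter-tables is an assumption about your wal2json version;
// check its documentation.
public class PeekWithoutRepackTables {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "replicator", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT lsn, xid, data FROM pg_logical_slot_peek_changes(" +
                     "?, NULL, NULL, 'filter-tables', 'repack.*')")) {
            ps.setString(1, "my_wal2json_slot");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("data"));  // JSON change records
                }
            }
        }
    }
}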
I need some very simple storage to keep data before it is added to PostgreSQL. Currently, I have a web application that collects data from clients. This data is not very important. The application collects about 50 kB of data (a simple string) per client, which adds up to about 2 GB of data per hour.
This data is not needed right away, and it is no problem if some of it is lost.
Is there an existing solution to store it in memory for a while (~1 hour) and then write it all to PostgreSQL? I don't need to query it in any way.
I could probably use Redis, but Redis is too complex for this task.
I could write something myself, but the tool would have to handle many requests to store data (maybe about 100 per second), and an existing solution may be better.
Thanks,
Dmitry
If you do not plan to work with this data right away, why do you want to store it in memory? You can create an UNLOGGED table and store the data there.
Look at the documentation for details:
UNLOGGED
If specified, the table is created as an unlogged table. Data written
to unlogged tables is not written to the write-ahead log, which makes
them considerably faster than ordinary tables.
However, they are not crash-safe: an unlogged table is automatically
truncated after a crash or unclean shutdown. The contents of an
unlogged table are also not replicated to standby servers. Any indexes
created on an unlogged table are automatically unlogged as well;
however, unlogged GiST indexes are currently not supported and cannot
be created on an unlogged table.
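A minimal sketch of that approach, with made-up table and column names: buffer the incoming strings in an UNLOGGED table, then move them into the permanent table once an hour.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: buffer incoming strings in an UNLOGGED table and move them into
// the permanent table once an hour. Table and column names are made up.
public class UnloggedBufferFlush {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {
            try (Statement st = conn.createStatement()) {
                // Cheap to write to: no WAL, but truncated after a crash.
                st.execute("CREATE UNLOGGED TABLE IF NOT EXISTS incoming_buffer (" +
                        "received_at timestamptz NOT NULL DEFAULT now(), payload text)");
            }

            // ... the web application INSERTs into incoming_buffer here ...

            // Hourly job: copy the buffered rows over, then empty the buffer.
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.execute("INSERT INTO client_data (received_at, payload) " +
                        "SELECT received_at, payload FROM incoming_buffer");
                // Rows written between the copy and the truncate could be lost,
                // which the question says is acceptable.
                st.execute("TRUNCATE incoming_buffer");
            }
            conn.commit();
        }
    }
}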
Storing data in memory sounds like caching to me. So, if you are using Java, I would recommend Guava Cache!
It seems to fit all your requirements, e.g. setting an expiry delay, handling the data once it is evicted from the cache:
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

private Cache<String, Object> yourCache = CacheBuilder.newBuilder()
        .expireAfterWrite(2, TimeUnit.HOURS)      // entries expire after the chosen delay
        .removalListener(notification -> {
            // called when an entry is removed; write notification.getValue() to PostgreSQL here
        })
        .build();
In Postgres 9.3, the COPY ... FROM STDIN command is by far the quickest way to insert bulk data. Does this come at the cost of not writing these inserted rows to the transaction log? We're using write-ahead logging to update secondary servers, so it is important that it does.
COPY most certainly does write to WAL (unless you're COPYing to an UNLOGGED or TEMPORARY table, of course).
Data loaded with COPY gets replicated normally.
About the only thing you can do that isn't properly replicated is write to a hash index, and the documentation for those is full of warnings. Personally, I'd just like to remove that feature.
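For completeness, a sketch of driving COPY ... FROM STDIN from Java through the PostgreSQL JDBC driver's CopyManager; the table, columns and CSV payload are made up:

import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// Sketch: bulk-load rows with COPY ... FROM STDIN through the JDBC driver.
// Table name, columns and the CSV payload are made up.
public class CopyFromStdin {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "loader", "secret")) {
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            String csv = "1,alpha\n2,beta\n";
            long rows = copy.copyIn(
                    "COPY measurements (id, label) FROM STDIN (FORMAT csv)",
                    new StringReader(csv));
            // The load is WAL-logged (and therefore replicated) unless the
            // target table is UNLOGGED or TEMPORARY.
            System.out.println("loaded " + rows + " rows");
        }
    }
}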