How to replicate a Postgres DB with only a sample of the data - postgresql

I'm attempting to mock a database for testing purposes. What I'd like to do is, given a connection to an existing Postgres DB, retrieve the schema, limit the data pulled to 1000 rows from each table, and persist both of these components as a file which can later be imported into a local database.
pg_dump doesn't seem to fulfill my requirements, as there's no way to tell it to retrieve only a limited number of rows from each table; it's all or nothing.
COPY/\copy commands can help fill this gap; however, there doesn't seem to be a way to copy data from multiple tables into a single file. I'd rather avoid having to create one file per table. Is there a way to work around this?
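One possible workaround, sketched here with placeholder names (sourcedb, testdb, users, orders): dump the schema alone with pg_dump --schema-only -d sourcedb -f sample_dump.sql, then append a hand-built COPY block per table to the same file by running plain SQL through psql in quiet, unaligned, tuples-only mode (psql -Atq -d sourcedb -f sample_data.sql >> sample_dump.sql):
-- sample_data.sql: each table gets a COPY header, up to 1000 rows, and the \. end-of-data marker
SELECT 'COPY public.users FROM stdin;';
COPY (SELECT * FROM public.users LIMIT 1000) TO STDOUT;
SELECT E'\\.';
SELECT 'COPY public.orders FROM stdin;';
COPY (SELECT * FROM public.orders LIMIT 1000) TO STDOUT;
SELECT E'\\.';
-- ...repeat the three statements for every table...
The combined file then restores with psql -d testdb -f sample_dump.sql. Bear in mind that LIMIT without ORDER BY returns arbitrary rows, and sampling can break foreign keys, so you may have to drop or defer constraints in the test copy.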

Related

How to save a database with all tables limited to 1000 rows only

I have a database with 20 tables. Currently, we are using the pg_dump command daily to archive our database.
A few tables in this database are very big. We are working to make a light version of this database for testing purposes and small tickets.
So, I need a way to use the pg_dump command and save all tables with only 1000 rows in each table. I tried to find something like that on Google, but without success.
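If you would rather stay entirely within pg_dump, a rough alternative (the schema name sample and the table names below are placeholders) is to copy the first 1000 rows of every table into a scratch schema and dump only that schema:
CREATE SCHEMA IF NOT EXISTS sample;
-- one CREATE TABLE ... AS per table; LIMIT without ORDER BY picks arbitrary rows
CREATE TABLE sample.customers AS SELECT * FROM public.customers LIMIT 1000;
CREATE TABLE sample.orders    AS SELECT * FROM public.orders    LIMIT 1000;
-- then, from the shell:  pg_dump -n sample -d mydb -f light_dump.sql
Note that CREATE TABLE ... AS copies column types and data only (no indexes, constraints or defaults), and the dumped tables live in the sample schema, so you would rename the schema or adjust search_path after restoring into the test database.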

Amazon RDS Postgresql snapshot preserves schema but loses all data

Using the AWS RDS console I created a snapshot backup of a Postgresql v11 database containing multiple schemas. I then created a new instance from the backup. The process seemed to work fine without error. However, upon inspection of the data in the new instance, I noticed that the data in one of my schemas was not preserved. The schema structure, tables, indexes, constraints, etc. looked fine, but every table was empty (select count(*) from schema.table was 0 for every table in the schema). All other schemas looked fine and contained the expected data. I looked everywhere (could not find help for this online) and tried many tests myself (changing roles, rebuilding the schema, privileges, much more) while attempting to solve this issue. What would cause my snapshots to preserve the entire schema structure, but lose all of the data itself?
I finally realized that the only difference between the problem schema and the others was that all tables in the problem schema had been created with the UNLOGGED keyword. This was done to increase write speed for the millions of rows inserted when the schema was first built. However, when a snapshot is created/restored as described above, the process depends on the WAL files, which are only written for normal (logged) tables, to restore the data. To fix the problem I simply altered all of the tables and set them to logged (alter table schema.table set logged). After this, snapshots worked fine.

For anyone else doing something similar in the future: if unlogged tables are needed for the initial mass population of data to get better write speed, it is a good idea to change them to logged after the initial data load (if you plan on using snapshots, replication or similar). Side note: pg_dump/pg_restore does still work for unlogged tables.
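If a schema contains many unlogged tables, a query along these lines (myschema is a placeholder) can generate the ALTER statements for you:
-- list every unlogged ordinary table in the schema and build the fix-up statement
SELECT format('ALTER TABLE %I.%I SET LOGGED;', n.nspname, c.relname)
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'            -- ordinary tables
  AND c.relpersistence = 'u'     -- 'u' = unlogged
  AND n.nspname = 'myschema';
Run the generated statements afterwards, or append \gexec in psql to execute them immediately.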

SSIS or TSQL for SQL/MySQL table comparison

I am new to SSIS and am after some assistance in creating an SSIS package to do a specific task. My data is stored remotely in a MySQL database and is downloaded to a SQL Server 2014 database. What I want to do is create a package where I can enter two dates that are compared against the create date/date modified of each record on a number of tables, giving me a snapshot, and then compare the MySQL data to the SQL data so that I can see if any rows are missing from my local SQL database or need to be updated. Some tables have no dates, so for those I just want a record count of what, if anything, is missing between the two. If this is better achieved through TSQL I am happy to hear other suggestions, or about sites to look at where something similar has been done.
In relation to your query, Tab:
"Hi Tab, our master data is stored in a MySQL database, and the data was downloaded to a SQL Server database as a one-off. At the moment I have an SSIS package that uses the MAX ID, which can be found on most of the tables, to work out which records are new, and it just downloads or updates them. What I want to do is run separate checks on the tables to make sure that nothing has been missed during the download and everything is in sync. In an ideal world I would like to pass a date range (say, a calendar week) to an SSIS package or TSQL stored procedure, which would then check for any differences between the remote MySQL database tables and the local SQL tables. It does not currently have to do anything but identify issues; correcting them may come later, or changes would need to be made to the existing sync package. Hope this makes more sense."
Thanks P
To do this, you need to implement a Type 1 Slowly Changing Dimension style data flow in SSIS. There are a number of ways to do this, including a built-in transformation aptly called the Slowly Changing Dimension transformation. Whilst this is easy to set up, it is a pain to maintain and it runs horrendously slowly.
There are numerous ways to set this up using other transformations or even SQL MERGE statements, which are detailed here: https://bennyaustin.wordpress.com/2010/05/29/alternatives-to-ssis-scd-wizard-component/
I would recommend that you use Lookup transformations, as they perform better than the Slowly Changing Dimension transformation while offering better diagnostics and error handling than the faster SQL MERGE statement.
Before you do this, you will need to add a Checksum or Hashbytes column to your SQL data for ease of comparison with the incoming MySQL data.
In short, calculate some sort of repeatable checksum as the data is downloaded into your SQL Server, then use this in an SSIS Lookup, matching on the row key, to check for changes. Where the checksum value is different for the same row, the row needs updating; where there is no matching row key in your SQL data, you need to insert the new row.
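A rough T-SQL sketch of that idea (the table, key and column names are placeholders, not from the original post):
-- store a repeatable hash alongside each downloaded row
ALTER TABLE dbo.LocalCustomer
    ADD RowHash AS HASHBYTES('SHA2_256',
        CONCAT(CustomerName, '|', Email, '|', CONVERT(varchar(30), ModifiedDate, 126)));
-- rows whose key matches but whose hash differs need an update;
-- keys with no match in dbo.LocalCustomer need an insert
SELECT s.CustomerID
FROM   staging.MySqlCustomer AS s
JOIN   dbo.LocalCustomer     AS l ON l.CustomerID = s.CustomerID
WHERE  l.RowHash <> HASHBYTES('SHA2_256',
           CONCAT(s.CustomerName, '|', s.Email, '|', CONVERT(varchar(30), s.ModifiedDate, 126)));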

How can I combine multiple Postgres database dumps into a single database?

I have ~100 Postgres .dump files from different sources. They all have the same schema, just a single table, and a few hundred to a few hundred thousand rows each. However, the data was collected at different locations and now needs to all be combined.
So I'd like to merge all the rows from all the databases into one single database, ignoring the ID key. What would be a decent way to do this? I may collect more data in the future from more sources, so it's likely to be a process I need to repeat.
If needed, use pg_restore to convert the dumps into SQL.
Run each SQL dump through
sed '/^COPY .* FROM stdin;$/,/^\\.$/ p;d'
As there is only one table in your data, that will give you the COPY command (and its data) needed to load the rows; send that to your database to load the data.
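If the original id values collide between dumps, one variation (the table and column names are placeholders) is to load each dump into a staging table and let the combined table assign fresh ids:
-- staging_readings mirrors the dumped table; combined_readings has its own serial/identity id
INSERT INTO combined_readings (sensor, reading, recorded_at)
SELECT sensor, reading, recorded_at
FROM   staging_readings;
TRUNCATE staging_readings;  -- ready for the next dump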

Replicating data between Postgres DBs

I have a Postgres DB that is used by a chat application. The chat system often truncates its tables when they grow too big, but I need this data copied to another Postgres database, where I will not be truncating the tables.
How can I configure a few tables in the chat system's database to replicate their data to another Postgres database? Is there a quick way to accomplish this?
Slony can replicate only selected tables, but I'm not sure how it handles truncates, and it can be a pain to configure.
You might also use something like pgpool to send copies of the insert statements to a second database.
You might modify the source of your chat application to do two writes (one to each db) when a new record is created.
You could just write a script in Perl/PHP/Python to read from one and write to another, then fire it by cron so that you're sure it gets run before truncation.
If you only copy a batch of rows every other day, you may be better off with a plain INSERT into a different schema in the same database, or into a different database in the same database cluster (you need something like dblink for that; see the sketch after the CTE example below).
The safest / fastest solution in the same database would be a data-modifying CTE. Something along these lines:
WITH del AS (
   DELETE FROM tbl
   WHERE  <some condition>
   RETURNING *
   )
INSERT INTO backup.tbl
SELECT * FROM del;
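For the cross-database case mentioned above, a rough dblink sketch (the connection string, table and column definitions are placeholders):
CREATE EXTENSION IF NOT EXISTS dblink;
-- run in the archive database: pull rows from the chat database and append them
INSERT INTO backup_messages (id, body, created_at)
SELECT id, body, created_at
FROM   dblink('dbname=chatdb',
              'SELECT id, body, created_at FROM messages')
       AS t(id int, body text, created_at timestamptz);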
For true replication consider these official sources:
https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
https://www.postgresql.org/docs/current/runtime-config-replication.html