How to bulk-refresh a Postgres database - postgresql

I've got a Postgres 9.1 database that contains weather information. The dataset consists of approximately 3.1 million rows.
It takes about 2 minutes to load the data from a CSV file, and a little less to create a multicolumn index.
Every 6 hours I need to completely refresh the dataset. My current thinking is that I would import the new dataset into a database with a different name, such as "weather_imported", and once the import and index creation are finished, I would drop the original database and rename the imported database.
In theory, clients would continue to query the database during this operation, though if that has ill effects, I could probably arrange to have the clients silently ignore a few errors.
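In other words, the swap itself would be something like this, assuming the live database is simply named "weather" (run from a maintenance connection that is not attached to either database):

    -- after the fresh import into weather_imported has finished
    DROP DATABASE weather;
    ALTER DATABASE weather_imported RENAME TO weather;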
Questions:
Will that strategy work?
If a client happened to be in the process of running a query at the time of the DB drop, my assumption is the database would not complete the drop until the query finished - true?
What if a query happened between the time the DB was dropped and the rename? I assume a "database not found" error.
Is there a better strategy?

Consider the following strategy as an alternative:
Include a "dataset version" field in the primary table.
Store the "current dataset version" in some central location, and write your selects to only search for rows which have the current dataset version.
To update the dataset:
Insert all the data with a new dataset version. (You could just use the start time of the update job as a version.)
Update the "current dataset version" atomically to the value you just inserted.
Delete all data with an older version than the version number you just inserted.
Presto -- no need to shuffle databases around.
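A rough sketch of what that could look like, assuming the data sits in a single table called weather with made-up columns (station_id, observed_at, temperature) and is first bulk-loaded into a made-up weather_staging table:

    -- one-row table holding the dataset version clients should read
    CREATE TABLE dataset_version (current_version bigint NOT NULL);
    INSERT INTO dataset_version VALUES (0);

    -- each weather row is tagged with the version it belongs to
    ALTER TABLE weather ADD COLUMN dataset_version bigint NOT NULL DEFAULT 0;
    CREATE INDEX ON weather (dataset_version);

    -- refresh job, every 6 hours; 1367820000 stands in for the job's start time
    BEGIN;
    -- 1. insert the freshly loaded rows tagged with the new version
    INSERT INTO weather (station_id, observed_at, temperature, dataset_version)
    SELECT station_id, observed_at, temperature, 1367820000
    FROM   weather_staging;
    -- 2. flip readers over to the new version
    UPDATE dataset_version SET current_version = 1367820000;
    -- 3. drop the superseded rows
    DELETE FROM weather WHERE dataset_version < 1367820000;
    COMMIT;

    -- clients always filter on the current version
    SELECT w.*
    FROM   weather w
    JOIN   dataset_version v ON v.current_version = w.dataset_version
    WHERE  w.station_id = 42;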

Related

How can I automatically maintain a dump of modified rows in PostgreSQL

So, I have a PostgreSQL DB. For some chosen tables in that DB I want to maintain a plain dump of the rows when they are modified. Note this dump is not a recovery or backup dump; it is just a file which will have the incremental rows. That is, whenever a row is inserted or updated, I want it appended to this file, or to a file in a folder. The idea is to load that folder periodically into something like Hive so that I can run queries to check previous states of certain rows and columns.
Now, these are very high-transaction tables and the dump does not need to be real time; it can be in batches, every hour. I want to avoid a trigger firing hundreds of times every minute. I am looking for something off the shelf - already available in PostgreSQL. I did some research, but everything is related to PostgreSQL backup, which is not the exact use case.
I have read some links like https://clarkdave.net/2015/02/historical-records-with-postgresql-and-temporal-tables-and-sql-2011/ and Implementing history of PostgreSQL table, etc., but these are based on insert/update triggers and create the history table in PostgreSQL itself. I want to avoid both: I cannot keep the history in PostgreSQL as it will soon become huge, and I do not want to keep writing to files through a trigger firing constantly.

Bulk data insertion and updating from one db server to another db server

I have a set of tables with 20 million records on a Postgres server. At the moment I am migrating some table data from one server to another using insert and update queries on dependent tables inside functions. It takes around 2 hours even after optimizing the queries. I need a way to migrate the data faster, possibly by using MongoDB or Cassandra. How?
Try putting your updates and inserts into a file and then load the file. I understand PostgreSQL will optimise loading the file contents. It has always worked for me, although I haven't used that quantity of data.
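For instance, a sketch of that file-based approach using COPY (table, column and file names here are made up; \copy in psql works the same way if the file lives on the client):

    -- on the source server: write the rows to migrate into a flat file
    COPY (SELECT * FROM big_table WHERE updated_at >= '2016-01-01')
    TO '/tmp/big_table.csv' WITH (FORMAT csv);

    -- on the target server: bulk-load the file into a staging table
    COPY big_table_staging FROM '/tmp/big_table.csv' WITH (FORMAT csv);

    -- then apply updates and inserts as two set-based statements
    -- instead of per-row queries inside functions
    UPDATE big_table t
    SET    col1 = s.col1, col2 = s.col2
    FROM   big_table_staging s
    WHERE  t.id = s.id;

    INSERT INTO big_table
    SELECT s.*
    FROM   big_table_staging s
    WHERE  NOT EXISTS (SELECT 1 FROM big_table t WHERE t.id = s.id);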

DB2 updated rows since last check

I want to periodically export data from db2 and load it in another database for analysis.
In order to do this, I would need to know which rows have been inserted/updated since the last time I've exported things from a given table.
A simple solution would probably be to add a timestamp to every table and use that as a reference, but I don't have such a TS at the moment, and I would like to avoid adding it if possible.
Is there any other solution for finding the rows which have been added/updated after a given time (or something else that would solve my issue)?
There is an easy option for a timestamp in Db2 (for LUW) called
ROW CHANGE TIMESTAMP
This is managed by Db2, and the column can be defined as IMPLICITLY HIDDEN so that existing SELECT * FROM queries will not retrieve the new column, which would otherwise cause extra costs.
Check out the Db2 CREATE TABLE documentation
This functionality was originally added for optimistic locking but can be used for such situations as well.
There is a similar concept for Db2 for z/OS - you will have to check that out yourself, as I have not tried it.
Of course there are other ways to solve this, like replication etc.
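As a sketch, on Db2 for LUW that could look roughly like this (schema, table and column names are placeholders):

    -- add a Db2-maintained change timestamp, hidden from SELECT *
    ALTER TABLE myschema.mytable
      ADD COLUMN row_changed TIMESTAMP NOT NULL
          GENERATED ALWAYS FOR EACH ROW ON UPDATE AS ROW CHANGE TIMESTAMP
          IMPLICITLY HIDDEN;

    -- export only the rows inserted or updated since the last run
    SELECT *
    FROM   myschema.mytable
    WHERE  row_changed > ?;   -- timestamp saved from the previous export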
That is not possible if you do not have a timestamp column. With a timestamp, you can know which rows are new or modified.
You can also use the Time Travel Query feature in order to get the new values, but that also implies a timestamp column.
Another option is to put the tables in append mode and then get the rows after a given one. However, this option is not reliable after a reorg, and it affects performance and space utilisation.
One possible option is to use SQL replication, but that needs extra tables for staging.
Finally, another option is to read the logs with the db2ReadLog API, but that requires custom development. Just applying the archived logs to the new database is also possible; however, the database will remain in roll-forward pending state.

SSIS or TSQL for SQL/MySQL table comparison

I am new to SSIS and am after some assistance in creating an SSIS package to do a specific task. My data is stored remotely in a MySQL database and is downloaded to a SQL Server 2014 database. What I want to do is the following: create a package where I can enter 2 dates that are compared against the create date/date modified per record on a number of tables, to give me a snapshot and compare the MySQL data to the SQL data, so that I can see if there are any rows missing from my local SQL database or if any need to be updated. Some tables have no dates, so for those I just want a record count of what, if anything, is missing between the 2. If this is better achieved through TSQL I am happy to hear other suggestions, or sites to look at where similar things have been done.
In relation to your query Tab :
"Hi Tab, What happens at the moment is our master data is stored in a MySQL Database, the data was then downloaded to a SQL Server Database as a one off. What happens at the moment is I have a SSIS package that uses the MAX ID which can be found on most of the tables to work out which records are new and just downloads them or updates them. What I want to do is run separate checks on the tables to make sure that during the download nothing has been missed and everything is within sync. In an ideal world I would like to pass in to a SSIS package or tsql stored procedure a date range, shall we say calender week, this would then check for any differences between the remote MySQL database tables and the local SQL tables. It does not currently have to do anything but identify issues, correcting them may come later or changes would need to be made to the existing sync package. Hope his makes more sense."
Thanks P
To do this, you need to implement a Type 1 Slowly Changing Dimension type data flow in SSIS. There are a number of ways to do this, including a built in transformation aptly called the Slowly Changing Dimension transformation. Whilst this is easy to set up, it is a pain to maintain and it runs horrendously slowly.
There are numerous ways to set this up using other transformations or even SQL merge statements which are detailed here: https://bennyaustin.wordpress.com/2010/05/29/alternatives-to-ssis-scd-wizard-component/
I would recommend that you use Lookup transformations: they perform better than the Slowly Changing Dimension transformation, while offering better diagnostics and error handling than the faster SQL merge statement.
Before you do this you will need to add a Checksum or Hashbytes column to your SQL data for ease of comparison with the incoming MySQL data.
In short, calculate some sort of repeatable checksum as the data is downloaded into your SQL Server, then use this in an SSIS Lookup, matching on the row key, to check for changes. Where the checksum value is different for the same row it needs updating and where there is no matching row key in your SQL Data you need to insert the new row.
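A hedged T-SQL sketch of the checksum idea (table and column names are invented, and the MySQL rows are assumed to have been staged into a dbo.Customer_Staging table with the same checksum expression applied):

    -- repeatable checksum over the comparable columns of the local table
    ALTER TABLE dbo.Customer
    ADD RowChecksum AS
        HASHBYTES('SHA2_256',
            CONCAT(CustomerName, '|', Email, '|',
                   CONVERT(varchar(30), ModifiedDate, 126)))
        PERSISTED;

    -- rows present in MySQL (staging) but missing locally
    SELECT s.CustomerID
    FROM   dbo.Customer_Staging AS s
    LEFT JOIN dbo.Customer      AS c ON c.CustomerID = s.CustomerID
    WHERE  c.CustomerID IS NULL;

    -- rows whose checksums differ and therefore need updating
    SELECT s.CustomerID
    FROM   dbo.Customer_Staging AS s
    JOIN   dbo.Customer         AS c ON c.CustomerID = s.CustomerID
    WHERE  c.RowChecksum <> s.RowChecksum;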

SQlite synchronization scheme

I know it is Xmas Eve, so it is a perfect time to find hardcore programmers online :).
I have an SQLite db file that contains over 10K records. I generate the db from a MySQL database, and I have built the SQLite db within my iPhone application the usual way.
The records contain information about products and their prices, shops and the like. This info, of course, is not static; I use an automatic scheme to populate and keep updating my MySQL db.
Now, how can I update the iPhone app's SQLite database with the new information available in the MySQL db? The db structure is still the same, but the records contain new information.
Thanks.
Ahed
info:
libsqlite3.0,
iphone OS 3.1,
mysql 2005,
Mac OS X 10.6.2
There is a question you need to answer first; How do you determine the set of changed records in your MySQL database?
Or, more specifically, given that the MySQL database is in state A, some transactions occur and now it is in state B, how do you know what changed between A and B?
Bottom line: you need a schema in MySQL that enables this. Once you have answered that question, you can answer the "how do I sync?" problem.
I have a similar application.
I am using Push Notification to let my users know there is new or updated data available.
Each time a record on the server is updated, I store a sequential record-number alongside the record.
Each UDID that's registered has a "last updated" number associated with it that contains the highest record-number it has ever downloaded.
When any given device comes to get its updates, all database records greater than the UDID's last-updated record-number as stored on the server are sent to the device. If everything goes OK, the last-updated record-number for the UDID is set to the last record number sent.
The user has the option to fetch all records and refresh his database if he feels any need to sync his device to the entire database.
Seems to be working well.
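To illustrate, the per-device query on the server boils down to something like this (table and column names are just placeholders):

    -- everything this device has not downloaded yet, oldest change first
    SELECT *
    FROM   records
    WHERE  record_number > ?   -- the "last updated" number stored for this UDID
    ORDER  BY record_number;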
-t
You can find many other similar questions by searching for "iphone synchronization":
https://stackoverflow.com/search?q=iphone+synchronization
I'm going to assume that the data is going only from MySQL to SQLite, and not in the reverse direction.
There are a few ways that I could imagine doing this. The first is to just redownload the entire database during every update. Another way, which I'm describing below, would be to create a "log" table to record the modifications to your master table, and then download just the new logs when doing the update.
I would create a new "log" table in your SQL database to log changes to the table needing synchronization. The log could contain a "revision" column to track the order in which changes were made, a "type" column to specify whether it was an insert, update, or delete, the row-id of the affected row, and finally the entire set of columns from your master table.
You could automate the creation of log entries by using stored procedures as wrappers to modify your master table.
With only 10k records, I wouldn't expect this log table to grow to be that huge.
You would then keep track in SQLite of the latest revision downloaded from MySQL. To update the table, you would download all log entries after the latest update and then apply them to your SQLite table.
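A loose sketch of that log table and a stored-procedure wrapper in MySQL (table, column and procedure names are all made up):

    -- log of changes to a hypothetical master table "products"
    CREATE TABLE products_log (
        revision    INT AUTO_INCREMENT PRIMARY KEY,           -- order of changes
        change_type ENUM('insert', 'update', 'delete') NOT NULL,
        product_id  INT NOT NULL,                             -- id of the affected row
        name        VARCHAR(255),                             -- copies of the master columns
        price       DECIMAL(10,2)
    );

    -- wrapper that modifies the master table and records the change
    DELIMITER //
    CREATE PROCEDURE update_product(IN p_id INT, IN p_name VARCHAR(255), IN p_price DECIMAL(10,2))
    BEGIN
        UPDATE products SET name = p_name, price = p_price WHERE id = p_id;
        INSERT INTO products_log (change_type, product_id, name, price)
        VALUES ('update', p_id, p_name, p_price);
    END //
    DELIMITER ;

    -- the app then pulls everything newer than its last known revision
    SELECT * FROM products_log WHERE revision > ? ORDER BY revision;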