Fivetran connector with a PostgreSQL database restored every day

I have set up a Fivetran connector between a PostgreSQL database on an EC2 server and Snowflake. The connection seems to work (no errors), but the data is not actually updated.
Every day a script on the EC2 server pulls down the latest dump of our app's production database and restores it, and the Fivetran connector is then expected to sync the database to Snowflake. But no data after the initial setup date is synced to Snowflake. Could Fivetran be used in such a setup? If so, do you know what might cause the sync to fail?

Could Fivetran be used in such a setup?
Yes, but it's not ideal.
If so, do you know what may be the issue of the sync failing?
It's hard to answer this question without more context. However: Fivetran uses log-based replication for your DB (WAL in the case of PostgreSQL), so if you restore the DB every single day Fivetran will lose track of the changes and will need to re-sync the whole database.
The point made by NickW is completely valid: why not replicate directly from the production DB? I assume the answer has to do with data you need to modify. You can use column blocking and/or hashing to prevent sensitive data from being transferred, or to obfuscate it before it's loaded into Snowflake.
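If you want to see what the daily restore does to the log-based replication state, one quick check is to look at the replication slots on the EC2 Postgres before and after the restore. This is only a minimal sketch, assuming the connector is configured for WAL-based logical replication; the hostname, database, and credentials are placeholders.

```python
# Sketch: check whether the logical replication slot used for WAL-based
# replication still exists and is advancing after the nightly restore.
# Connection details below are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="ec2-xx-xx-xx-xx.compute.amazonaws.com",
    dbname="app_db",
    user="replication_monitor",
    password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT slot_name, plugin, active, restart_lsn,
               pg_current_wal_lsn() AS current_lsn
        FROM pg_replication_slots
        WHERE slot_type = 'logical';
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
# If the slot disappears (or restart_lsn never advances) after the restore,
# the connector has lost its WAL position and must re-sync the tables in full.
```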

Related

Rundeck MariaDB hot backup

The Rundeck backup guide notes that it is mandatory to stop Rundeck to take a full backup when using the data file. However, that guide doesn't show any safe method to back up a full Rundeck instance (Rundeck server + database) when MariaDB, PostgreSQL, or any other supported database is used as the backend.
In a real production scenario it doesn't seem feasible to stop Rundeck on a daily basis.
Can anybody share best practices for taking a full hot backup of a Rundeck installation without stopping it?
Is there any safe and supported way to achieve a fully consistent backup of Rundeck projects, job definitions, and the database on a daily basis?
In this post the answer is not clear, because the question doesn't describe what kind of backend is used.
The documentation suggests shutting down the instance because an execution could be active, which would mean an open transaction in the middle of the "hot backup" process and therefore potential data corruption in your backup. Shutting down is the safest way to back up your database.
If you want to do a "hot" backup you can export your projects (with all content, including jobs) and keys.
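If your Rundeck version exposes the project-archive export endpoint, that export can be scripted against the API. This is only a sketch: the host, API token, API version, and project name are placeholders, and key-storage entries would still need to be exported separately.

```python
# Sketch of a "hot" backup via the Rundeck API project-archive export.
# Host, token, API version and project name are placeholders.
import requests

RUNDECK = "https://rundeck.example.com"
TOKEN = "REPLACE_WITH_API_TOKEN"
API_VERSION = "41"          # adjust to your Rundeck version
PROJECT = "my-project"

resp = requests.get(
    f"{RUNDECK}/api/{API_VERSION}/project/{PROJECT}/export",
    headers={"X-Rundeck-Auth-Token": TOKEN},
    stream=True,
    timeout=300,
)
resp.raise_for_status()

# Stream the archive (job definitions, configuration, optionally executions)
# to a local zip file that can be versioned or shipped off-host.
with open(f"{PROJECT}-archive.zip", "wb") as fh:
    for chunk in resp.iter_content(chunk_size=8192):
        fh.write(chunk)
```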

SQL Server Always on configuration without backup restore

The secondary server is very far from the primary server. The database is too large to copy over the internet. Physically copying the file to an external device, taking it to the secondary site, copying it onto a drive on the new server, and then restoring it is also time-consuming.
Is there a way to add the secondary server to the Always On configuration without having to restore the database on the secondary server first, e.g. by creating a blank database on the secondary server to start the sync?
PS: We need the secondary server to be read-only.
Is there a way to add the secondary server to the Always On configuration without having to restore the database on the secondary server first, e.g. by creating a blank database on the secondary server to start the sync? PS: We need the secondary server to be read-only.
It's not clear what you're expecting as an answer.
Firstly, a secondary AG replica is always read only.
You can choose to add a database to an AG using Automatic Seeding, or you can add an existing database by backing up the database and its transaction log from the primary and manually restoring on the secondary.
You can only join a database to an availability group when its last committed LSN is within the range of the current active log.
Either way, the database(s) you want to add to the AG will have the data copied to the secondary somehow, whether that's over the internet by using automatic seeding, manually copying backup files (the most reliable option in my experience) or by physical media.
Last time I checked, by magic was not an option! :-)
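For reference, here is a minimal sketch of the automatic-seeding route, assuming SQL Server 2016 or later and placeholder server, availability group, and database names. Even with seeding, the full data still travels over the network link between the sites.

```python
# Sketch: enabling automatic seeding for an AG, driving the T-SQL from Python
# with pyodbc. Server, AG, replica and database names are placeholders.
import pyodbc

def run(server, sql):
    conn = pyodbc.connect(
        f"DRIVER={{ODBC Driver 18 for SQL Server}};SERVER={server};"
        "Trusted_Connection=yes;Encrypt=yes;TrustServerCertificate=yes",
        autocommit=True,
    )
    conn.cursor().execute(sql)
    conn.close()

# On the primary: switch the secondary replica to automatic seeding.
run("PRIMARY-SRV",
    "ALTER AVAILABILITY GROUP [MyAG] "
    "MODIFY REPLICA ON N'SECONDARY-SRV' WITH (SEEDING_MODE = AUTOMATIC);")

# On the secondary: allow the AG to create the seeded database.
run("SECONDARY-SRV",
    "ALTER AVAILABILITY GROUP [MyAG] GRANT CREATE ANY DATABASE;")

# Back on the primary: add the database; seeding streams it to the secondary.
run("PRIMARY-SRV",
    "ALTER AVAILABILITY GROUP [MyAG] ADD DATABASE [MyBigDb];")
```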

Best way to set up a Jupyter notebook project in AWS

My current project has the following structure:
It starts with a script in a Jupyter notebook which downloads data from a CRM API and loads it into a local PostgreSQL database I run with pgAdmin. After that it runs a cluster analysis, returns some scoring values, creates a table in the database with the results, and updates these values in the CRM with another API call. This process takes between 10 and 20 hours (the API only allows 400 requests per minute).
The second notebook reads the database, detects the last update, makes API calls to update the database since the last call, runs a k-means analysis to cluster the data, compares the results with the previous run, and updates the new values in the database and in the CRM via the API. This second process takes less than 2 hours by my estimation, and I want this script to run every 24 hours.
After testing, this works fine. Now I'm evaluating how to put it in production on AWS. I understand that for the notebooks I need SageMaker, and from what I have seen it is not that complicated; my only doubt here is whether I can call the API without additional code or whether some configuration is needed. My second problem is the database. I don't understand the difference between RDS, which is the one I think I have to use for this, and Aurora or S3. My goal is to write as little code as possible. I have tried some RDS tutorials like this one: https://www.youtube.com/watch?v=6fDTre5gikg&t=10s, and I understand it connects my local Postgres to AWS, but I can't find the data in the AWS console; it only creates an instance?? And how do I connect to it to analyze this data from SageMaker? My final goal is to run the notebooks in the cloud and connect to my Postgres in the cloud. Any orientation on how to use these tools would be appreciated.
I don't understand the difference between RDS, which is the one I think I have to use for this, and Aurora or S3
RDS and Aurora are relational database services fully managed by AWS. "Regular" RDS lets you launch existing popular database engines such as MySQL, PostgreSQL, and others that you could also run at home or at work.
Aurora is AWS's in-house, cloud-native database implementation, compatible with MySQL and PostgreSQL. It can store the same data as RDS MySQL or PostgreSQL, but provides a number of features not available in regular RDS, such as more read replicas, distributed storage, global databases, and more.
S3 is not a database but an object store, where you can keep files such as images, CSVs, and Excel spreadsheets, similar to how you would store them on your computer.
I understand this connects my local Postgres to AWS but I can't find the data in the AWS console, it only creates an instance??
You can migrate your data from your local Postgres to RDS or Aurora if you wish. But neither RDS nor Aurora will connect to your existing local database, as they are databases themselves.
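As a rough illustration of that migration path (not an official AWS procedure), you can dump the local database and restore it into the RDS instance with the standard Postgres tools. The endpoint and names below are placeholders you would take from your RDS instance's details page.

```python
# Sketch: one-off migration of a local Postgres database into RDS using
# pg_dump / pg_restore, driven from Python. Names and endpoint are placeholders.
import subprocess

LOCAL_DB = "crm_scoring"
RDS_HOST = "mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com"  # the RDS "Endpoint"
RDS_DB = "crm_scoring"
RDS_USER = "postgres"

# Dump the local database in custom format...
subprocess.run(
    ["pg_dump", "-Fc", "-d", LOCAL_DB, "-f", "crm_scoring.dump"],
    check=True,
)
# ...and restore it into the RDS instance (set PGPASSWORD or use ~/.pgpass
# to avoid the interactive password prompt).
subprocess.run(
    ["pg_restore", "--no-owner", "-h", RDS_HOST, "-U", RDS_USER,
     "-d", RDS_DB, "crm_scoring.dump"],
    check=True,
)
```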
My final goal is to run the notebooks in the cloud and connect to my Postgres in the cloud.
I don't see a reason why you wouldn't be able to connect to the database. You can try to make it work, and if you run into difficulties you can ask a new question on SO with your RDS/Aurora setup details.
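Once the RDS instance is reachable from the notebook (same VPC, or a security group rule that opens port 5432 to the notebook instance), reading it from SageMaker is just a normal Postgres connection. A minimal sketch, with a hypothetical table name and placeholder endpoint and credentials:

```python
# Sketch: querying the RDS Postgres instance from a SageMaker notebook.
import psycopg2

conn = psycopg2.connect(
    host="mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com",  # your RDS endpoint
    port=5432,
    dbname="crm_scoring",
    user="postgres",
    password="...",
)
with conn, conn.cursor() as cur:
    # hypothetical results table written by the first notebook
    cur.execute("SELECT * FROM cluster_scores ORDER BY updated_at DESC LIMIT 10;")
    for row in cur.fetchall():
        print(row)
conn.close()
```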

Will "PostgreSQL Streaming Replication" fit this use case?

I am designing an application for public organizations.
The purpose is to record data (text and video streams) produced in "local" offices, where connectivity is not guaranteed and power is available only while meetings are taking place.
One requirement of the project is the "locality" of the data storage, since the data is considered "sensitive" and "important".
A second requirement is to publish a portion of the data produced during the meetings to a web server.
The database server shall be PostgreSQL.
I plan to set up a second PostgreSQL database server on the web infrastructure hosting the web server, and synchronize it with the "local" database.
The "public" database will be accessed only by *selection queryes" (no writes).
I see PostgreSQL does implement "Streaming Replication" PostgreSQL Streaming Replication since version 9.0.
The question(s):
Is PostgreSQL Streaming Replication ready for primetime?
Does it fit the use case I describe above?
Should I expect any major problem?
Could you suggest alternative, better solutions?
Yes, it is the best solution for your case. You should know that:
the master database and the standby database will be 100% identical
the standby database will not allow writes (it is read-only)
If you have a master-standby configuration you will not have problems, but a master-master configuration may cause some problems.
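As a rough way to check that the setup is healthy once it is running, you can query both servers from a small script. A minimal sketch with placeholder hostnames and credentials, assuming PostgreSQL 10+ for the pg_stat_replication column names:

```python
# Sketch: health check of streaming replication, assuming the local office
# server is the primary and the web-hosting server is the read-only standby.
import psycopg2

def query(host, sql):
    with psycopg2.connect(host=host, dbname="meetings",
                          user="monitor", password="...") as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()

# On the primary: one row per connected standby, with its replay position.
print(query("local-office-db",
            "SELECT client_addr, state, replay_lsn FROM pg_stat_replication;"))

# On the standby: should return True; any write attempt will be rejected.
print(query("web-standby-db", "SELECT pg_is_in_recovery();"))
```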

Local MongoDB instance with index in remote server

One of our clients has a server running a MongoDB instance, and we have to build an analytical application using the data stored in their MongoDB database, which changes frequently.
The client's requirements are:
That we do not connect to their MongoDB instance directly or run another instance of MongoDB on their server, but instead somehow run our own MongoDB instance on a machine in our office that uses their MongoDB database directory remotely, with read-only access.
We've suggested deploying a REST application or getting a copy of their database dump, but they did not want that. They just want us to run our own MongoDB instance hooked up to their MongoDB data directory. Is this even possible?
I've been searching for a solution for the past two days and we have to submit a solution by Monday. I really need some help.
I think this is a normal request, because analytical queries could put too much load on the production server. It is quite common to separate production and analytical databases.
The easiest option is to use MongoDB replication. Set up a MongoDB replica set with the production database instance as primary and the analytical database instance as secondary, and configure the analytical instance so that it can never become primary.
If it is not possible to use replication (for example, the client doesn't want it, or the servers cannot connect directly to each other), there is another option. You can read the oplog from the remote database and apply the operations to your own database instance. This is exactly the low-level mechanism a replica set uses, but you can do it manually too. For example, MMS (MongoDB Monitoring Service) Backup reads the oplog to take online backups of MongoDB.
Update: mongooplog could be the right tool for applying, in near real time, the replication oplog pulled from a remote server to a local server.
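To give an idea of what reading the oplog manually looks like, here is a minimal PyMongo sketch that tails local.oplog.rs on the remote server and prints each operation; applying the operations to your local instance is left out, and the hostname is a placeholder. The remote deployment must be running as a replica set, otherwise there is no oplog.

```python
# Sketch: tail the remote oplog with a tailable cursor.
import time
from pymongo import CursorType, MongoClient

remote = MongoClient("mongodb://client-server.example.com:27017/?replicaSet=rs0")
oplog = remote.local["oplog.rs"]

# Start from the newest entry and follow everything that comes after it.
last = oplog.find_one(sort=[("$natural", -1)])
ts = last["ts"]

while True:
    cursor = oplog.find({"ts": {"$gt": ts}},
                        cursor_type=CursorType.TAILABLE_AWAIT)
    for op in cursor:
        ts = op["ts"]
        # op["op"] is the operation type (i/u/d), op["ns"] the namespace,
        # op["o"] the document or update you would re-apply locally.
        print(op["op"], op["ns"])
    time.sleep(1)  # the cursor died (e.g. it was idle); recreate it and continue
```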
I don't think that running two databases that point to the same database files is possible, or even recommended.
You could use mongorestore to restore from their data files directly, but this will only work if their mongod instance is not running (because mongorestore will need to lock the directory).
Another solution would be to take file system snapshots and then restore them to your local database.
The downside of these backup/restore solutions is that your data will not be synced all the time.
Probably the best solution would be to use replica sets with hidden members.
You can create a replica set with just two members:
Primary - this will be the client server.
Secondary - hidden, with votes and priority set to 0. This will be your local instance.
Their server will always be primary (because hidden members cannot become primaries). Clients cannot see hidden members so for all intents and purposes your server will be read only.
Another upside to this is that the MongoDB replication will do all the "heavy" work of syncing the data between servers and your instance will always have the latest data.
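If it helps, here is a sketch of what that reconfiguration could look like from PyMongo, run once against the primary after your office node has been added to the replica set; the hostnames are placeholders. To actually query your instance you connect to it directly (hidden members are not advertised to clients), for example with directConnection=True.

```python
# Sketch: mark the office member as a hidden, non-voting, priority-0 secondary.
from pymongo import MongoClient

# Connect to the replica set (the client's server will be the primary).
client = MongoClient("mongodb://client-server.example.com:27017/?replicaSet=rs0")
admin = client.admin

# Fetch the current replica set configuration...
cfg = admin.command("replSetGetConfig")["config"]

# ...flag the office member as hidden with no votes and priority 0...
for member in cfg["members"]:
    if member["host"].startswith("office-analytics.example.com"):
        member["hidden"] = True
        member["priority"] = 0
        member["votes"] = 0

# ...and apply it (a reconfig requires bumping the version number).
cfg["version"] += 1
admin.command("replSetReconfig", cfg)
```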