We have a use case where we need to populate fresh data in our DB. Old data from a successful DAG run is already present in the DB. Now we need to delete the old data and rerun the task.
Airflow already provides a command to clear selected task instances:
airflow clear -dx occupancy_reports.* -t building -s 2022-04-01 -e 2022-04-30
Will running this also delete the data from the database and then populate fresh data?
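I guess you meant: airflow **tasks** clear ...
It only clears the selected task instances, as if they had never run, so that they are executed again. It is not a rollback: data the tasks already wrote to your database is not deleted.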
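Because clearing only resets task state, any deletion of old rows has to happen inside the task itself. Below is a minimal sketch of such an idempotent load. The table and column names (occupancy, report_date, building, occupied) are hypothetical, and sqlite3 is used purely as a stand-in for the real database:

```python
import sqlite3


def reload_range(conn, start, end, rows):
    """Replace all rows with report_date in [start, end] by `rows`.

    `occupancy` and its columns are hypothetical names. The delete and the
    insert share one transaction, so rerunning the task (for example after
    clearing it) does not leave duplicates behind.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "DELETE FROM occupancy WHERE report_date BETWEEN ? AND ?",
            (start, end),
        )
        conn.executemany(
            "INSERT INTO occupancy (report_date, building, occupied) VALUES (?, ?, ?)",
            rows,
        )


def demo():
    """Load the same range twice and return the resulting row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE occupancy (report_date TEXT, building TEXT, occupied INTEGER)"
    )
    rows = [("2022-04-01", "A", 10), ("2022-04-02", "A", 12)]
    reload_range(conn, "2022-04-01", "2022-04-30", rows)
    reload_range(conn, "2022-04-01", "2022-04-30", rows)  # simulated rerun
    return conn.execute("SELECT COUNT(*) FROM occupancy").fetchone()[0]
```

With a task body written this way, clearing the task instances for a date range and letting them rerun ends up replacing the data for that range.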
I am running airflow using postgres.
The web server became slow during operation.
The cause was data continuing to accumulate in the dag_run and log tables of the DB (it became faster after I connected to Postgres and deleted the data directly).
Are there any airflow options to clean the db periodically?
If there is no such option, we will try to delete the data directly using the dag script.
I also think it's strange that the web server slows down just because there is a lot of data. Does the web server fetch all the data whenever another window is opened?
You can purge old records by running:
airflow db clean [-h] --clean-before-timestamp CLEAN_BEFORE_TIMESTAMP [--dry-run] [--skip-archive] [-t TABLES] [-v] [-y]
(cli reference)
It is quite common to include this command in a DAG that runs periodically.
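As a rough sketch of what such a periodic cleanup could run, the helper below builds the command line from a retention period. The function name and the 30-day retention are assumptions made for illustration, and only flags from the usage line above are used:

```python
from datetime import datetime, timedelta, timezone


def db_clean_command(retention_days, now=None, dry_run=False):
    """Build the `airflow db clean` argv that purges rows older than `retention_days`.

    Hypothetical helper: the name and parameters are not part of Airflow.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    cmd = [
        "airflow", "db", "clean",
        "--clean-before-timestamp", cutoff.isoformat(),
        "--skip-archive",  # do not keep archived copies of the purged rows
        "-y",              # do not prompt for confirmation
    ]
    if dry_run:
        cmd.append("--dry-run")  # only report what would be deleted
    return cmd
```

The resulting list can be passed to subprocess.run, or joined into a string and used as the command of a bash task in a scheduled DAG. Trying it with dry_run=True first seems prudent.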
I have a NodeJS Express App that depends on MongoDB change streams. For them to be available, MongoDB has to be configured to run as a replica set (even if there is only one node in that set).
I'm working on Windows 10 pro.
I'm trying to dockerize this App, basing the MongoDB container off the official mongo:5 image.
For this to work, I want an automated way of initializing the DB as a replica set. The tutorials I've found either rely on exec-ing into the container and running rs.initiate() from mongosh (or similar approaches), which is manual work I want to avoid, or they use hacks like wait-for-it.sh, as here.
I feel there must be a better solution, based somehow on the paragraph "Initializing a fresh instance", from the docs.
It describes that
When a container is started for the first time it will execute files with extensions .sh and .js that are found in /docker-entrypoint-initdb.d.
When exactly in the container lifecycle does that happen? After the container is initialized? Or after the DB is ready? Because this seems to be the perfect place for this initialization logic, which runs flawlessly when executed manually, from within the container.
However, placing the following script in /docker-entrypoint-initdb.d
// initReplSet.js
print('Script running');
config={"_id":"rs0", "members":[{"_id":0,"host":"app-db:27017"}]};
print(JSON.stringify(rs.initiate(config)));
print('Script end');
fails with the error {"ok":0,"errmsg":"No host described in new configuration with {version: 1, term: 0} for replica set rs0 maps to this node","code":93,"codeName":"InvalidReplicaSetConfig"}, yet the database is available under the hostname app-db from other containers. This makes me feel that this code runs too early, before all other initialization logic (networking) is done.
Another approach is to place a bash script that executes code via mongosh. Here's what I've tried:
#!/bin/bash
mongosh "mongodb://app-db:27017/app_db" "initiateReplSet"
where initiateReplSet is
config={"_id":"rs0", "members":[{"_id":0,"host":"app-db:27017"}]}
rs.initiate(config)
exit
but this crashes the container with the error
/usr/local/bin/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/initiateReplSetWrapper.sh
{"t":{"$date":"2022-02-15T11:31:23.353+00:00"},"s":"I", "c":"-", "id":4939300, "ctx":"monitoring-keys-for-HMAC","msg":"Failed to refresh key cache","attr":{"error":"NotYetInitialized: Cannot use non-local read concern until replica set is finished initializing.","nextWakeupMillis":600}}
Warning: Could not access file: EACCES: permission denied, mkdir '/home/mongodb'
Current Mongosh Log ID: 620b8f0b04b7ad69b446768d
Connecting to: mongodb://app-db:27017/app_db?directConnection=true&appName=mongosh+1.1.9
Only the first and the last three lines seem to really belong to the bash script; the second line is repeated constantly.
I'm not sure whether the error originates at the permission denied issue, or whether the DB really can't be accessed. However, specifying
RUN mkdir -p /home/mongodb/.mongodb
RUN chown -R 777 /home/mongodb
in the Dockerfile did not improve the situation (same error nevertheless).
Could you please explain either why this approach cannot work, or how to make it work? Is there another, better, automated way to initialize the replica set? Could the Docker image be improved to allow such initialization logic?
I just made it work with a wild experiment: I simply left out the config in my call to rs.initiate() in the JS script. For some reason, the script then runs successfully and change streams become available to my NodeJS backend.
I will post everything that's needed to run a MongoDB Docker container with change streams enabled:
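# Dockerfile
FROM mongo
# Scripts in this directory are executed when the container is started for the first time
COPY initiateReplSet.js /docker-entrypoint-initdb.d/
# Start mongod as a member of the replica set rs0
CMD ["--replSet", "rs0"]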
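// initiateReplSet.js
// Without an argument, rs.initiate() uses a default configuration that contains only the current node
rs.initiate()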
I accidentally deleted the Docker volume mongo-data:/data/db. I have a copy of that folder, but now when I run docker-compose up, the mongodb container doesn't start and fails with the error "mongo_1 exited with code 14". Below are more details of the error and of the mongo-data folder. Can someone help me, please?
in docker-compose.yml
volumes:
- ./mongo-data:/data/db
Restore from backup files
A step-by-step process to repair the corrupted files of a failed MongoDB in a Docker container:
Before you start, make a copy of the files!
Make sure you know which version of the image was running in the container.
Spawn a new container to run the repair process as follows:
docker run -it -v <data folder>:/data/db <image-name>:<image-version> mongod --repair
Once the files are repaired, you can start the containers with docker-compose.
If the repair fails, it usually means that the files are corrupted beyond repair. There is still a chance to recover the data by exporting it, as described here.
How to secure proper backup files
The database is constantly working with its files, so the files on disk are constantly changing. In addition, the database keeps some changes in internal memory buffers before they are flushed to the filesystem. Although database engines do a very good job of ensuring that the database can recover from an abrupt failure by using write-ahead logging (first update the transaction log, then the data file), copying the files while they are in use can introduce corruption that prevents the database from recovering.
The reason for such corruption is that the copy process is not aware of how far the database has progressed with its writes, which creates a race condition. In very simple terms, while the database is in the middle of writing, the copy process creates a copy of file(s) that are half-updated and therefore corrupted.
When the database writer is in the middle of writing to the files, we call them hot files. "Hot files" is a term from the OS perspective; MongoDB also uses the term "hot backup", which is a term from the MongoDB perspective. A hot backup is a backup taken while the database was running.
To take a proper snapshot (ensuring the files are cold) you need to follow the procedure explained here. In short, the command db.fsyncLock() issued during this process tells the database engine to flush all buffers and stop writing to the files. This makes the files cold while the database remains hot, hence the difference between the terms hot files and hot backup. Once the copy is done, the database is told to resume writing to the filesystem by issuing db.fsyncUnlock().
Note that the process is more complex and can change between versions of the database. What I give here is a simplification, in order to illustrate the problems with file snapshots. To secure a proper and consistent backup, always follow the documented procedure for the database version that you use.
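The lock, copy, unlock sequence described above can be sketched roughly as follows. This is an illustration under assumptions, not the documented procedure: cold_copy is a hypothetical helper, client is expected to behave like a pymongo MongoClient, and the script is assumed to run on the machine that holds the data directory.

```python
import shutil


def cold_copy(client, data_dir, backup_dir):
    """Copy `data_dir` to `backup_dir` while MongoDB writes are blocked.

    Hypothetical helper; `client` is assumed to behave like pymongo.MongoClient.
    """
    # Counterpart of db.fsyncLock(): flush pending writes and block new ones.
    client.admin.command("fsync", lock=True)
    try:
        # backup_dir must not exist yet; copytree creates it.
        shutil.copytree(data_dir, backup_dir)
    finally:
        # Counterpart of db.fsyncUnlock(): always release the lock,
        # even when the copy fails.
        client.admin.command("fsyncUnlock")
```

The try/finally matters here: if the copy raised while the lock was still held, the database would stay blocked for writes.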
Suggested backup method
The preferred backup should always be the data dump method, since it ensures that you can restore even after the database engine has been upgraded or downgraded. MongoDB provides a very useful tool called mongodump that creates database backups by dumping the data instead of copying the files.
For more details on how to use the backup tools, as well as on the other backup methods, read the MongoDB Backup Methods chapter of the MongoDB documentation.
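As a small illustration of the dump method, a script could assemble and run the mongodump call like this. The helper names, the URI and the output directory are placeholders chosen for the example:

```python
import subprocess


def mongodump_command(uri, out_dir, gzip=True):
    """Build the argv for a mongodump run (hypothetical helper)."""
    cmd = ["mongodump", f"--uri={uri}", f"--out={out_dir}"]
    if gzip:
        cmd.append("--gzip")  # compress the dumped files
    return cmd


def dump(uri, out_dir):
    """Run mongodump and raise CalledProcessError if it fails."""
    subprocess.run(mongodump_command(uri, out_dir), check=True)
```

For example, dump("mongodb://localhost:27017", "/backup/2022-02-15") would write the dump below that directory; the counterpart tool for restoring it is mongorestore.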
I've created a Docker image with PostgreSQL running inside and exposing port 5432.
This image doesn't contain any database. The container is an empty PostgreSQL database server.
In (or during) the "docker run" command, I'd like to:
attach a db file
create a db by executing an SQL query
restore a db from a dump
I don't want to keep the data after the container is stopped. It's just a temporary development server.
I suspect it's possible to keep my "docker run" command string quite short and simple.
It is probably possible to mount some external folder with the db/sql/dump in the run command and then create the db during container initialization.
What is the recommended way, and what are the best practices, to accomplish this task? Perhaps somebody can point me to corresponding Docker examples.
This is a good question and probably something other folks asked themselves more than once.
According to the Docker guide, you would not do this in a RUN command. Instead, you would define an ENTRYPOINT or CMD in your Dockerfile that calls a custom shell script instead of calling the postgres process directly. In this scenario the DB would be created in a "real" filesystem, but then cleaned up during shutdown of the container.
How would this work? The container would start, call the ENTRYPOINT or CMD as usual and run the init script to fill the DB. Then, at the moment the container is stopped, the same script is notified with a signal and manually drops the database content.
CMD ["cleanAndRun.sh"]
Here is a sketch of the script "cleanAndRun.sh", taken from the Docker documentation and modified for your needs. Please remember that it is only a sketch and needs modifications:
#!/bin/sh
# The command called in the trap must also stop the DB, so the call to dropdb
# below is not enough; it just demonstrates how to run anything in the
# stop-container scenario!
trap "dropdb <params>" HUP INT QUIT TERM
# init your DB -every- time the container starts
<init script to clean and import dump>
# start your postgres DB in the background and wait for it, so that the shell
# can run the trap as soon as a signal arrives
postgres &
wait $!
echo "exited $0"
I have dozens of unlogged tables, and the documentation says that an unlogged table is automatically truncated after a crash or unclean shutdown.
Based on that, I need to check some tables after the database starts to see if they are "empty" and do something about it.
In short, I need to execute a procedure right after the database is started.
What is the best way to do it?
PS: I'm running Postgres 9.1 on Ubuntu 12.04 server.
There is no such feature available (at the time of writing, the latest version is PostgreSQL 9.2). Your only options are:
Start a script from the PostgreSQL init script that polls the database and, when the DB is ready, locks the tables and populates them;
Modify the startup script to use pg_ctl start -w and invoke your script as soon as pg_ctl returns; this has the same race condition but avoids the need to poll;
Teach your application to run a test whenever it opens a new pooled connection to detect this condition, lock the tables, and populate them; or
Don't use unlogged tables for this task if your application can't cope with them being empty when it opens a new connection.
There's been discussion of connect-time hooks on pgsql-hackers but no viable implementation has been posted and merged.
It's possible you could do something like this with PostgreSQL bgworkers, but it'd be a LOT harder than simply polling the DB from a script.
Postgres now has pg_isready for determining if the database is ready.
https://www.postgresql.org/docs/11/app-pg-isready.html
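A polling script built on pg_isready might look roughly like the sketch below. The function names, host, port, timeout and interval are assumptions made for the example. Note that pg_isready was added in PostgreSQL 9.3, so it is not available on the 9.1 installation from the question.

```python
import subprocess
import time


def pg_is_ready(host="localhost", port=5432):
    """Return True when pg_isready exits with 0 (server accepts connections)."""
    result = subprocess.run(
        ["pg_isready", "-h", host, "-p", str(port)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_ready(check=pg_is_ready, timeout=60.0, interval=1.0):
    """Poll `check` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Once wait_until_ready() returns True, the script can go on to check the unlogged tables and repopulate them. The race condition mentioned above still applies, since clients may connect before the script has finished.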