I want to consolidate a couple of historically grown scripts (Python, Bash and Powershell) which purpose is to sync data between a lot of different database backends (mostly postgres, but also oracle and sqlserver) and on different sites. There isn't really a master, its more like a loose couple of partner companies working on the same domain specific use cases, everyone with its own data silo and its my job to hold all this together as good as I can.
Currently those scripts I mentioned are cron scheduled and need to run on the origin server where a dataset gets initially written, to sync it to every partner over night.
I am also familiar with and use Apache Airflow in another project. So my idea was to use an workflow management tool like Airflow to streamline the sync process and get it more centralized. But also with Airflow there is only a time interval scheduler available to trigger a DAG.
As most writes come in over postgres databases, I'd like to make use of the NOTIFY/LISTEN feature and already have a python daemon based on this listening to any database change (via triggers) and calling an event handler then.
The last missing piece is how its probably best done to trigger an airflow DAG with this handler and how to keep all this running reliably?
Perhaps there is a better solution?
Related
I have a reporting application that uses Celery to process thousands of jobs per day. There is a python module per each report type that encapsulates all job steps. Jobs take customer-specific parameters and typically complete within a few minutes. Currently, jobs are triggered by customers on-demand when they create a new report or request a refresh of an existing one.
Now, I would like to add scheduling, so the jobs run daily, and reports get refreshed automatically. I understand that Airflow shines at task orchestration and scheduling. I also like the idea of expressing my jobs as DAGs and getting the benefit of task retries. I can see how I can use Airflow to run scheduled batch-processing jobs, but I am unsure about my use case.
If I express my jobs as Airflow DAGs, I will still need to run them parametrized for each customer. It means, if the customer creates a new report, I will need to have a way to trigger a DAG with the customer-specific configuration. And with a scheduled execution, I will need to enumerate all customers and create a parametrized (sub-)DAG for each of them. My understanding this should be possible since Airflow supports DAGs created dynamically, however, I am not sure if this is an efficient and correct way to use Airflow.
I wonder if anyway considered using Airflow for a scenario similar to mine.
Celery workflows do literally the same, and you can create and run them at any point of time. Also, Celery has a pretty good scheduler (I have never seen it failing in 5 years of using Celery) - Celery Beat.
Sure, Airflow can be used to do what you need without any problems.
You can use Airflow to create DAGs dynamically, I am not sure if this will work with a scale of 1000 of DAGs though. There are some good examples on astronomer.io on Dynamically Generating DAGs in Airflow.
I have some DAGs and task that are dynamically generated by a yaml configuration with different schedules and configurations. It all works without any issue.
Only thing that might be challenging is the "jobs are triggered by customers on-demand" - I guess you could trigger any DAG with Airflow's REST API, but it's still in a experimental state.
Build servers are generally detached from the VPC running the instance. Be it Cloud Build on GCP, or utilising one of the many CI tools out there (CircleCI, Codeship etc), thus running DB schema updates is particularly challenging.
So, it makes me wonder.... When's the best place to run database schema migrations?
From my perspective, there are four opportunities to automatically run schema migrations or seeds within a CD pipeline:
Within the build phase
On instance startup
Via a warm-up script (synchronously or asynchronously)
Via an endpoint, either automatically or manually called post deployment
The primary issue with option 1 is security. With Google Cloud Sql/Google Cloud Build, it's been possible for me to run (with much struggle), schema migrations/seeds via a build step and a SQL proxy. To be honest, it was a total ball-ache to set up...but it works.
My latest project is utilising MongoDb, for which I've connected in migrate-mongo if I ever need to move some data around/seed some data. Unfortunately there is no such SQL proxy to securely connect MongoDb (atlas) to Cloud Build (or any other CI tools) as it doesn't run in the instance's VPC. Thus, it's a dead-end in my eyes.
I'm therefore warming (no pun intended) to the warm-up script concept.
With App Engine, the warm-up script is called prior to traffic being served, and on the host which would already have access via the VPC. The warmup script is meant to be used for opening up database connections to speed up connectivity, but assuming there are no outstanding migrations, it'd be doing exactly that - a very light-weight select statement.
Can anyone think of any issues with this approach?
Option 4 is also suitable (it's essentially the same thing). There may be a bit more protection required on these endpoints though - especially if a "down" migration script exists(!)
It's hard to answer you because it's an opinion based question!
Here my thoughts about your propositions
It's the best solution for me. Of course you have to take care to only add field and not to delete or remove existing schema field. Like this, you can update your schema during the Build phase, then deploy. The new deployment will take the new schema and the obsolete field will no longer be used. On the next schema update, you will be able to delete these obsolete field and clean your schema.
This solution will decrease your cold start performance. It's not a suitable solution
Same remark as before, in addition to be sticky to App Engine infrastructure and way of working.
No real advantage compare to the solution 1.
About security, Cloud Build will be able to work with worker pool soon. Still in alpha but I expect in the next month an alpha release of it.
I have experience in writing custom PowerShell prompts and autocompletes. I've mainly used it to build a custom git PowerShell prompt/autocomplete, and it works fine.
I'm now trying to build one for kubernetes. The big problem is the latency of querying the k8s server. For my git prompt, almost all the operations are local, so the latency hit of querying git repeatedly and frequently isn't an issue. Hitting a remote k8s cluster, not so fast.
What is a good approach for dealing with this? I was thinking that I could write a background process that queries k8s and stores the data in a local file that would be used for the prompt/autocomplete, but it seems like there should be a better way.
I'm about to prepare a deployment specification for the Websphere MQ production environment. As always I hate reinventing the wheel hence the question:
Is there an article, specififaction of best practices when it comes to deploying and maintaining the Webshpere MQ production environment?
Here are more specific doubts of mine:
Configuration versioning (MQSC, dmpmqcfg, etc).
Deploying new objects (MQSC or manual instructions?)
Deployment automation (maybe basing on the diff of dmpmqcfg?).
Deploying and versioning configuration alterations.
Currently I am simply creating MQ objects manually and versioning the output of dmpmqcfg. However, in a while there are going to be too many deployments to handle it like this.
That's an extremely broad question so I'll try to respond before a moderator deletes it. :-)
The answer depends on many things such as whether MQ clusters are in use, the approaches to high availability and disaster recovery, the security requirements, whether the QMgrs are configured as dedicated or shared infrastructure, etc. However, there are a few patterns that I follow in almost all cases, including non-Production. This is because things like monitoring and security tend to get dropped at deployment time if not tested in Dev and don't work as expected in Prod.
I use a script to create my QMgrs in Production to insure that basics like generating the X.509 certificate (or CSR) is always done according to standards, that any exits or exit parm files are present, that certain SupportPac executables (like q) are present in /opt/mqm/bin, circular queues, etc. It also checks for negative factors such as GSKit not being installed.
I have a baseline script that is run against all QMgrs. This script sets up the DLQ, any queues for monitoring agents, enables events as required, sets up system services, trigger monitors, listeners, etc. The exception is B2B gateway QMgrs which are handled in a class all their own and have very specific configurations not used on the internal network.
cluster.
I have several classes of QMgr with specific configuration requirements. These include cluster repositories (where primary and secondary are distinct sub-types), service-provider QMgrs, and service consumer QMgrs. These all have secondary scripts run against them.
I have scripts per-cluster to join or suspend a QMgr in cases where clustering is used (which for me is almost 100% since v7.1).
These set up a QMgr's infrastructure. Then I maintain scripts for each application. So for example, if there's a Payroll app, I'll have queues and possibly topics with names containing a PAY node such as PAY.EMPLOYEE.UPDT.REQ.V032.PRD. Corresponding to that will be a single script for all PAY.** queues. Used to be one for setmqaut commands too, but these are now in the same script as the objects. I only ever have one version of the script and keep a history of changes in the script. This way when I need to recreate a QMgr, I just run all the scripts for it. Similarly, if I need to deploy the PAY objects on another QMgr, I just copy the script to that server.
When defining objects for clusters, I always do a DEFINE NOREPLACE that contains all the run-time attributes such as whether the queue is enabled in the cluster. The queue is always defined as disabled in the cluster and for triggering but because I use NOREPLACE re-running the script doesn't change whatever state it has in, say, a month. Those things that are configuration and not run-time, such as the description, are handled in an ALTER immediately after the DEFINE and these are updated each time the script is run. There's an article on this here.
Finally, the scripts I use are of the self-executing, self-documenting variety. For example, many people put all the MQSC commands into a script then do something like:
runmqsc < payroll.mqsc > payroll.out
TONS of problems here. The main one is that it relies on the operator to know a lot and execute the script right all the time. For example, suppose (s)he forgets to capture the output? Or overwrites a previous output? Or doesn't get STDERR because (s)he needs to do the 2>&1 at the end and doesn't know redirection that well?
So my scripts are all written in ksh handle all the capturing of output, complete with time and date stamping and STDERR, can freely mix MQSC with OS commands, etc. All you do is go to the scripts directory for that QMgr and . ./*ksh to build/rebuild a QMgr.
I do of course also take regular configuration dumps, but these are more for running queries and reports like "how many QMgrs have this channel defined and where are they?" kind of thing.
Also, when taking backups there is almost NEVER a good reason to back up a QMgr at a point in time. However, if it is required be sure to stop the QMgr first. Also, think long and hard about capturing certificates in a backup. Many people are good about locking the certificate directory so only mqm can read it but often the backups are unprotected. As long as you aren't trying to restore on top of Production, many shops let you restore the Production /var/mqm/* files to your own sandbox. If the QMgr's KDB files are included, you just lost them. An alternative is to put the certificates in /etc or some other directory that is protected but not backed up with the QMgr's directories.
I have a few jobs in Jenkins that use Selenium to modify a database through a website's front end. If some of these jobs run at the same time, errors due to dirty reads can result. Is there a way to force certain jobs in Jenkins to be unable to run at the same time? I would prefer not to have to place or pick up a lock on the database, which could be read or modified by any number of users who are also testing.
You want the Throttle Concurrent Builds plugin which lets you define global and per-node semaphores.
Locks and latches is being deprecated in favor of Throttle Concurrent builds.
I've tried both the locks & latches plugin and the port allocator plugin as ways to achieve what you're trying to do. Neither worked reliably for me. Locks & latches worked some of the time, but I'd occasionally get hung jobs. Using port allocator as a hack will work unless you have multiple jenkins nodes, but the config overhead is kind of high. What I've ultimately settled upon is another hack, but it works reliably and uses core Jenkins stuff (no plugins):
create a slave node running on the same box as the master (or not, if you have lots of boxes)
give this slave a single executor (key)
tie the 2 (or n) jobs that must not run together to this new slave node
optionally set the slave's usage to 'tied jobs only' if it'll screw up your other jobs if they happen to run on the new slave
Since the slave has only one executor, the jobs tied to it can never run together.
If you regard the database as a shared resource that can only be used exclusively then this fits the usecase of the Lockable resources plugin.
It is being actively developed and improved and is very versatile.