Airflow Scheduler not picking up DAG Runs

I'm setting up Airflow so that the webserver runs on one machine and the scheduler runs on another. Both share the same MySQL metastore database. Both instances come up without any errors in the logs, but the scheduler is not picking up any DAG Runs that are created by manually triggering the DAGs via the Web UI.
The dag_run table in MySQL shows a few entries, all in the running state:
mysql> select * from dag_run;
+----+--------------------------------+----------------------------+---------+------------------------------------+------------------+----------------+----------+----------------------------+
| id | dag_id                         | execution_date             | state   | run_id                             | external_trigger | conf           | end_date | start_date                 |
+----+--------------------------------+----------------------------+---------+------------------------------------+------------------+----------------+----------+----------------------------+
|  1 | example_bash_operator          | 2017-12-14 11:33:08.479040 | running | manual__2017-12-14T11:33:08.479040 |                1 | �� }�.         | NULL     | 2017-12-14 11:33:09.000000 |
|  2 | example_bash_operator          | 2017-12-14 11:38:27.888317 | running | manual__2017-12-14T11:38:27.888317 |                1 | �� }�.         | NULL     | 2017-12-14 11:38:27.000000 |
|  3 | example_branch_dop_operator_v3 | 2017-12-14 13:47:05.170752 | running | manual__2017-12-14T13:47:05.170752 |                1 | �� }�.         | NULL     | 2017-12-14 13:47:05.000000 |
|  4 | example_branch_dop_operator_v3 | 2017-12-15 04:26:07.208501 | running | manual__2017-12-15T04:26:07.208501 |                1 | �� }�.         | NULL     | 2017-12-15 04:26:07.000000 |
|  5 | example_branch_dop_operator_v3 | 2017-12-15 06:12:10.965543 | running | manual__2017-12-15T06:12:10.965543 |                1 | �� }�.         | NULL     | 2017-12-15 06:12:11.000000 |
|  6 | example_branch_dop_operator_v3 | 2017-12-15 06:28:43.282447 | running | manual__2017-12-15T06:28:43.282447 |                1 | �� }�.         | NULL     | 2017-12-15 06:28:43.000000 |
+----+--------------------------------+----------------------------+---------+------------------------------------+------------------+----------------+----------+----------------------------+
6 rows in set (0.21 sec)
But the scheduler started on the other machine and connected to the same MySQL DB never seems to query this DB, actually run these DAG Runs, and convert them into task instances.
Not sure what I'm missing in the setup here. So, a few questions:
When and how is the DAGs folder located at $AIRFLOW_HOME/dags populated? I think it's when the webserver is started. But then, if I just start the scheduler on another machine, how will the DAGs folder on that machine be filled up?
Currently, I'm running airflow initdb only on the machine hosting the webserver and not on the scheduler. I hope that is correct.
Can I enable debug logs for the scheduler to get more output that could indicate what's missing? From the current logs it looks like it just scans the DAGs folder on the local system and finds no DAGs there (not even the example ones), despite the config to load examples being set to True.
I don't think it matters, but I'm currently using a LocalExecutor.
Any help is appreciated.
Edit: I know that I need to sync the DAGs folder across machines, as the Airflow docs suggest, but I'm not sure whether that is the reason why the scheduler is not picking up the tasks in the above case.

OK, I found the answer: it looks like the scheduler does not query the DB until there are DAGs in the local DAG folder. The code in job.py looks like:
ti_query = (
    session
    .query(TI)
    .filter(TI.dag_id.in_(simple_dag_bag.dag_ids))
    .outerjoin(DR,
        and_(DR.dag_id == TI.dag_id,
             DR.execution_date == TI.execution_date))
    .filter(or_(DR.run_id == None,
        not_(DR.run_id.like(BackfillJob.ID_PREFIX + '%'))))
    .outerjoin(DM, DM.dag_id == TI.dag_id)
    .filter(or_(DM.dag_id == None,
        not_(DM.is_paused)))
)
I added a simple DAG to the local DAG folder on the machine hosting the scheduler, and it started picking up the other DAG runs as well.
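For reference, a minimal placeholder DAG like the sketch below (the dag_id, dates and operator choice are only illustrative, not the exact file I used) is enough to make the DAG folder non-empty so the scheduler starts querying the DB:

# minimal placeholder DAG; it only needs to exist in $AIRFLOW_HOME/dags
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id='placeholder_dag',
    start_date=datetime(2017, 12, 1),
    schedule_interval=None,  # never scheduled on its own
)

noop = DummyOperator(task_id='noop', dag=dag)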
We raised an issue for this - https://issues.apache.org/jira/browse/AIRFLOW-1934

Related

flyway sessions locking itself

I was trying to run the following migration through Flyway:
create index concurrently if not exists api_client_system_role_idx2 on profile.api_client_system_role (api_client_id);
create index concurrently if not exists api_client_system_role_idx3 on profile.api_client_system_role (role_type_id);
create index concurrently if not exists api_key_idx2 on profile.api_key (api_client_id);
However, the Flyway sessions were blocking each other and the script is stuck in the "Pending" state.
| Versioned | 20.1    | add email verification table                 | SQL    | 2021-11-01 21:55:52 | Success |
| Versioned | 21.1    | create role for doc api                      | SQL    | 2021-11-01 21:55:52 | Success |
| Versioned | 22      | create indexes for profile                   | SQL    | 2022-10-21 10:23:41 | Success |
| Versioned | 23      | test flyway                                  | SQL    |                     | Pending |
+-----------+---------+----------------------------------------------+--------+---------------------+---------+
Flyway: Flyway Community Edition 9.3.1 by Redgate
Database: PostgreSQL 14.4
Can you please advise how to properly create indexes concurrently in PostgreSQL with Flyway?
I've tried simply killing the blocking session and letting the script continue, but then the migration failed and the script stayed in the "Pending" status.

(Solved) How to read from a MongoDB collection with pymongo while densely writing into it?

Update:
It is solved. Please check my own answer below if you are interested. Thanks to everyone all the same!
My original post:
MongoDB server version: 3.6.8 (WSL Ubuntu 20.04)
pymongo 4.1.0
I am learning machine learning. Because I find TensorBoard hard to use, I am trying to implement a simple "traceable and visible training system" ("tvts") that has part of TensorBoard's features, using MongoDB and pymongo. I chose MongoDB because it is document-based, NoSQL, and more suitable for recording arbitrary properties of model training.
Below is how I use it to record training conditions:
import tvts

# before training the model
ts = tvts.tvts(NAME, '172.26.41.157', init_params={
    'ver': VER,
    'batch_size': N_BATCH_SIZE,
    'lr': LR,
    'n_epoch': N_EPOCH,
}, save_dir=SAVE_DIR, save_freq=SAVE_FREQ)

# after an epoch is done
ts.save(epoch, {
    'cost': cost_avg,
    'acc': metrics_avg[0][0],
    'cost_val': cost_avg_val,
    'acc_val': metrics_avg_val[0][0],
}, save_path)
I write all such data into a collection of my MongoDB, and then I can get statistics and charts like the ones below:
Name: mnist_multi_clf_by_numpynet_mlp_softmax_ok from 172.26.41.157:27017 tvts.train_log
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| train_id | parent    | cost(min:last)  | LR(b-e:max-min) | epoch_count | existed save/save | from                       | to                         | duration       |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 1        | None-None | 1.01055:1.01055 | 0.1:0.1         | 100         | 10/10             | 2022-04-14 11:56:17.618000 | 2022-04-14 11:56:21.273000 | 0:00:03.655000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 2        | 1-100     | 0.56357:0.56357 | 0.1:0.1         | 100         | 10/10             | 2022-04-14 12:00:53.170000 | 2022-04-14 12:00:56.705000 | 0:00:03.535000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 3        | 2-100     | 0.15667:0.15667 | 0.1:0.1         | 300         | 15/15             | 2022-04-14 12:01:35.233000 | 2022-04-14 12:01:45.795000 | 0:00:10.562000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 4        | 3-300     | 0.06820:0.06839 | 0.1:0.1         | 300         | 15/15             | 2022-04-14 18:16:08.720000 | 2022-04-14 18:16:19.606000 | 0:00:10.886000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 5        | 2-100     | 0.03418:0.03418 | 0.5:0.5         | 200         | 10/10             | 2022-04-14 18:18:27.665000 | 2022-04-14 18:18:34.644000 | 0:00:06.979000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 6        | None-None | 1.68796:1.68858 | 0.001:0.001     | 3000        | 30/30             | 2022-04-16 09:15:56.085000 | 2022-04-16 09:18:01.608000 | 0:02:05.523000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
I found that it gets stuck if I try to get the list of statistics while I am densely writing into the collection at the same time, i.e. I try to get the statistics on the fly during training, and each epoch of the training is very short (about 0.03 seconds).
But I found that I can still read the records with Studio 3T (a MongoDB GUI) while densely writing into the collection.
I googled a lot, but I still cannot solve it. Someone said the write lock is exclusive (such as this link: mongodb write is occuring then a read must wait or not wait?), but then why can Studio 3T do it?
Actually I am new to MongoDB; I can use it because I have a little experience with MySQL, and in this "tvts" there are only insertions and queries, i.e. it is a rather simple usage of MongoDB. Is there an equivalent in MongoDB of the "concurrent inserts" concept in MySQL? (such as this link: concurrent read and write in MySQL) I guess reading from a collection while writing into it should not be a very hard task for MongoDB.
Although it is a simple simulation of part of TensorBoard's features, I have already written almost 600 lines of code. So, I am sorry that changing the database is not preferred.
Please help me. Thanks a lot!
Unbelievable! I accidentally solved it just a few minutes after I posted this question. It seems that a MongoDB collection can be read even while there are dense insertions, and I guess that is its normal behavior. I guess I could not google an answer because it is not a real issue. My issue seems to be caused by the PyCharm IDE that I am using: I have the issue if I run my script inside PyCharm, but it is OK when I run it in a system shell window.
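For what it's worth, a minimal sketch like the one below (the host, database and collection names are just placeholders for my setup) reads fine from a plain shell while a writer thread is densely inserting:

# sketch: one thread densely inserts while the main thread keeps reading
import threading
import time

from pymongo import MongoClient

client = MongoClient('mongodb://172.26.41.157:27017')  # placeholder host/port
coll = client['tvts']['train_log']

def writer():
    # simulate one very short "epoch" every ~0.03 s
    for epoch in range(1000):
        coll.insert_one({'epoch': epoch, 'cost': 1.0 / (epoch + 1)})
        time.sleep(0.03)

threading.Thread(target=writer, daemon=True).start()

# reads keep returning while the writer thread is inserting
for _ in range(10):
    print(coll.count_documents({}), coll.find_one(sort=[('epoch', -1)]))
    time.sleep(1)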

Run Kafka with podman on multiple hosts

Suppose I have three VMs/hosts which have podman installed:
each of them runs 1 ZooKeeper instance as a container
each of the VMs/hosts runs 2 Kafka instances as containers (= 6 in total)
my notebook will use the cluster as a producer/consumer
A network/service layout I am trying to create:
VM/host 1      VM/host 2      VM/host 3      My Dev Comp
------------   ------------   ------------   ------------
| zoo 1    |   | zoo 2    |   | zoo 3    |   | my       |
| kafka 1-2|   | kafka 3-4|   | kafka 5-6|   | fancy    |
|          |   |          |   |          |   | services |
------------   ------------   ------------   ------------
|--------same-------|-------local-----|------network-----|
I am trying to run each of the containers as a rootless one, due to security restrictions. With Docker it is of course easy, though you run a daemon as root. I tried to understand slirp4netns and containernetworking, but I could not find an answer to my question above, because I do not understand how they work in great depth.
What is the solution to create the setup above, e.g. which commands are required to get this blueprint running?

Google Cloud SQL VERY SLOW

I am thinking of migrating my website to Google Cloud SQL, and I signed up for a free account (D32).
Upon testing on a table with 23k records the performance was very poor, so I read that if I moved from the free account to a fully paid account I would have access to faster CPU and HDD... so I did.
Performance is still VERY POOR.
I have been running my own MySQL server for years now, upgrading as needed to handle more and more connections and to gain raw speed (needed because of a legacy application). I highly optimize the tables and the configuration, and make heavy use of the query cache, etc.
A few pages of our legacy system run over 1.5k queries per page. Currently I am able to push the total MySQL query time (execution and pulling of the data) down to 3.6 seconds for all those queries, meaning that MySQL takes about 0.0024 seconds to execute each query and return the values: not the greatest, but acceptable for those pages.
I uploaded a table involved in those many queries to Google Cloud SQL. I noticed that an INSERT already takes SECONDS to execute instead of milliseconds, but I thought it might be the sync vs. async setting. I changed it to async and the execution time for the insert does not seem to change. Not a big problem for now; I am only testing queries.
I ran a simple select * FROM <table> and noticed that it takes over 6 seconds. I thought that maybe the query cache needed to warm up, so I tried again and this time it took 4 seconds (excluding network traffic). I ran the same query on my backup server after a restart and with no connections at all, and it took less than 1 second; running it again, 0.06 seconds.
Maybe the problem is the result set being too big... let's try a smaller subset:
select * from <table> limit 5;
to my server: 0.00 seconds
GCS: 0.04
so I decide to try a dumb select on an empty table, no records at all, just created with only 1 field
to my server: 0.00 seconds
GCS: 0.03
Profiling does not give any insight, except that the query cache is not running on Google Cloud SQL and that the query execution seems faster, but... it is not...
My Server:
mysql> show profile;
+--------------------------------+----------+
| Status                         | Duration |
+--------------------------------+----------+
| starting                       | 0.000225 |
| Waiting for query cache lock   | 0.000116 |
| init                           | 0.000115 |
| checking query cache for query | 0.000131 |
| checking permissions           | 0.000117 |
| Opening tables                 | 0.000124 |
| init                           | 0.000129 |
| System lock                    | 0.000124 |
| Waiting for query cache lock   | 0.000114 |
| System lock                    | 0.000126 |
| optimizing                     | 0.000117 |
| statistics                     | 0.000127 |
| executing                      | 0.000129 |
| end                            | 0.000117 |
| query end                      | 0.000116 |
| closing tables                 | 0.000120 |
| freeing items                  | 0.000120 |
| Waiting for query cache lock   | 0.000140 |
| freeing items                  | 0.000228 |
| Waiting for query cache lock   | 0.000120 |
| freeing items                  | 0.000121 |
| storing result in query cache  | 0.000116 |
| cleaning up                    | 0.000124 |
+--------------------------------+----------+
23 rows in set, 1 warning (0.00 sec)
Google Cloud SQL:
mysql> show profile;
+----------------------+----------+
| Status               | Duration |
+----------------------+----------+
| starting             | 0.000061 |
| checking permissions | 0.000012 |
| Opening tables       | 0.000115 |
| System lock          | 0.000019 |
| init                 | 0.000023 |
| optimizing           | 0.000008 |
| statistics           | 0.000012 |
| preparing            | 0.000005 |
| executing            | 0.000021 |
| end                  | 0.000024 |
| query end            | 0.000007 |
| closing tables       | 0.000030 |
| freeing items        | 0.000018 |
| logging slow query   | 0.000006 |
| cleaning up          | 0.000005 |
+----------------------+----------+
15 rows in set (0.03 sec)
Keep in mind that I connect to both servers remotely from a server located in VA, and my server is located in Texas (even if that should not matter much).
What am I doing wrong? Why do simple queries take this long? Am I missing or not understanding something here?
As of right now I won't be able to use Google Cloud SQL, because a page with 1500 queries will take way too long (circa 45 seconds).
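For what it's worth, a quick sketch like the one below (host and credentials are placeholders) makes the arithmetic visible: with sequential queries it is the per-query network round trip, not the server-side execution time, that gets multiplied by 1500.

# rough per-query round-trip measurement against a remote MySQL/Cloud SQL instance
import time

import pymysql  # any MySQL client library would do

conn = pymysql.connect(host='CLOUD_SQL_HOST', user='user', password='secret', database='test')
cur = conn.cursor()

n = 100
start = time.time()
for _ in range(n):
    cur.execute('SELECT 1')  # trivial query: measured time is almost all round trip
    cur.fetchall()
per_query = (time.time() - start) / n

print('avg round trip per query: %.4f s' % per_query)
print('estimated time for a 1500-query page: %.1f s' % (per_query * 1500))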
I know this question is old, but...
Cloud SQL has poor support for MyISAM tables; it's recommended to use InnoDB.
We had poor performance when migrating a legacy app; after reading through the docs and contacting the paid support, we had to migrate the tables to InnoDB. The lack of a query cache was also a killer.
You may also find later on that you need to tweak the MySQL configuration via the 'flags' in the Google console. One example: 'wait_timeout' is set too high by default (IMO).
Hope this helps someone :)
Query cache is not as yet a feature of Cloud SQL. This may explain the results. However, I recommend closing this question as it is quite broad and doesn't fit the format of a neat and tidy Q&A. There are just too many variables not mentioned in the Q&A and it doesn't appear clear what a decisive "answer" would look like to the very general question of optimization when there are so many variables at play.

Best Practices for Project Feature Sub-Modules with Mercurial and Eclipse?

I have a couple of ANT projects for several different clients; the directory structure I have for my projects looks like this:
L___standard_workspace
L___.hg
L___validation_commons-sub-proj <- JS Library/Module
| L___java
| | L___jar
| L___old_stuff
| L___src
| | L___css
| | L___js
| | L___validation_commons
| L___src-test
| L___js
L___v_file_attachment-sub-proj <- JS Library/Module
| L___java
| | L___jar
| L___src
| | L___css
| | L___js
| L___src-test
| L___js
L___z_business_logic-sub-proj <- JS Library/Module
| L___java
| | L___jar
| L___src
| L___css
| L___js
L____master-proj <- Master web-deployment module where js libraries are compiled to.
L___docs
L___java
| L___jar
| L___src
| L___AntTasks
| L___build
| | L___classes
| | L___com
| | L___company
| L___dist
| L___nbproject
| | L___private
| L___src
| L___com
| L___company
L___remoteConfig
L___src
| L___css
| | L___blueprint
| | | L___plugins
| | | | L___buttons
| | | | | L___icons
| | | | L___fancy-type
| | | | L___link-icons
| | | | | L___icons
| | | | L___rtl
| | | L___src
| | L___jsmvc
| L___img
| | L___background-shadows
| | L___banners
| | L___menu
| L___js
| | L___approve
| | L___cart
| | L___confirm
| | L___history
| | L___jsmvc
| | L___mixed
| | L___office
| L___stylesheets
| L___swf
L___src-standard
Within the working copy, each module compiles its sub-project into a single JavaScript file that is placed in the JavaScript directory of the master project.
For example, the directories:
validation_commons-sub-proj
v_file_attachment-sub-proj
z_business_logic-sub-proj
...all are combined and minified (sort of like compiled) into a different JavaScript file in the _master-proj/js directory; and in the final step the _master-proj is compiled to be deployed to the server.
Now in regards to the way I'd like to set this up with hg, what I'd like to be able to do is clone the master project and its sub-projects from their own base-line repositories into a client's working-copy, so that modules can be added (using hg) to a particular customer's working copy.
Additionally however, when I do make some changes to/fix bugs in one customer's working copy, I would like to be able to optionally push the changes/bug fixes back to the master project/sub-project's base-line repository, for purposes of eventually pulling the changes/fixes into other customer's working copies that might contain the same bugs that need to be fixed.
In this way I will be able to utilize the same bug fixes across different clients.
However...I am uncertain of the best way to do this using hg and Eclipse.
I read here that you can use hg's Convert Extension to split a sub-directory into a separate project using the --filemap option.
However, I'm still a little bit confused as to if it would be better to use the Convert Extension or if it would be better to just house each of the modules in their own repository and check them out into a single workspace for each client.
Yep, it looks like subrepos are what you are looking for, but I think that may be the right answer to the wrong question, and I strongly suspect that you'll run into issues similar to those that occur when using svn:externals.
Instead I would recommend that you "publish" your combined and minified JS files to an artefact repository and use a dependency manager such as Ivy to pull specific versions of your artefacts into your master project. This approach gives you far greater control over the sub-project versions your master project uses.
If you need to make bug fixes to a sub-project for a particular client, you can just make the fixes on the mainline for that sub-project, publish a new version (ideally via an automated build pipeline) and update their master project to use the new version. Oh, you wanted to test the new version with their master project before publishing? In that case, before you push your fix, combine and minify your sub-project locally, publish it to a local repository and have the client's master project pick up that version for your testing.