Suppose I have three VMs/hosts with podman installed:
each of them runs one ZooKeeper instance as a container,
each of them runs two Kafka instances as containers (six in total),
and my notebook will use the cluster as a producer/consumer.
The network/service layout I am trying to create:
VM/host 1      VM/host 2      VM/host 3      My Dev Comp
------------   ------------   ------------   ------------
| zoo 1    |   | zoo 2    |   | zoo 3    |   | my       |
| kafka 1-2|   | kafka 3-4|   | kafka 5-6|   | fancy    |
|          |   |          |   |          |   | services |
------------   ------------   ------------   ------------
|--------same-------|-------local-----|------network-----|
I am trying to run each container rootless because of security restrictions. With Docker this is of course easy, though you run a daemon as root. I tried to understand slirp4netns and containernetworking, but I could not find an answer to my question above, because I do not understand how they work in great depth.
What is the solution to create the setup above, i.e. which commands are required to get this blueprint running?
Update:
It is solved. Please check my own answer below if you are interested. Thanks to everyone all the same!
My original post:
MongoDB server version: 3.6.8 (WSL Ubuntu 20.04)
pymongo 4.1.0
I am learning machine learning. Because I find TensorBoard hard to use, I am trying to implement a simple "traceable and visible training system" ("tvts") that offers some of TensorBoard's features, using MongoDB and pymongo. I chose MongoDB because it is document-based, NoSQL, and better suited to recording arbitrary properties of model training.
Below is how I use it to record training conditions:
import tvts

# before training the model
ts = tvts.tvts(NAME, '172.26.41.157', init_params={
    'ver': VER,
    'batch_size': N_BATCH_SIZE,
    'lr': LR,
    'n_epoch': N_EPOCH,
}, save_dir=SAVE_DIR, save_freq=SAVE_FREQ)

# after an epoch is done
ts.save(epoch, {
    'cost': cost_avg,
    'acc': metrics_avg[0][0],
    'cost_val': cost_avg_val,
    'acc_val': metrics_avg_val[0][0],
}, save_path)
I write all such data into a collection of my MongoDB, and then I can get statistics and charts like the ones below:
Name: mnist_multi_clf_by_numpynet_mlp_softmax_ok from 172.26.41.157:27017 tvts.train_log
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| train_id | parent | cost(min:last) | LR(b-e:max-min) | epoch_count | existed save/save | from | to | duration |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 1 | None-None | 1.01055:1.01055 | 0.1:0.1 | 100 | 10/10 | 2022-04-14 11:56:17.618000 | 2022-04-14 11:56:21.273000 | 0:00:03.655000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 2 | 1-100 | 0.56357:0.56357 | 0.1:0.1 | 100 | 10/10 | 2022-04-14 12:00:53.170000 | 2022-04-14 12:00:56.705000 | 0:00:03.535000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 3 | 2-100 | 0.15667:0.15667 | 0.1:0.1 | 300 | 15/15 | 2022-04-14 12:01:35.233000 | 2022-04-14 12:01:45.795000 | 0:00:10.562000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 4 | 3-300 | 0.06820:0.06839 | 0.1:0.1 | 300 | 15/15 | 2022-04-14 18:16:08.720000 | 2022-04-14 18:16:19.606000 | 0:00:10.886000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 5 | 2-100 | 0.03418:0.03418 | 0.5:0.5 | 200 | 10/10 | 2022-04-14 18:18:27.665000 | 2022-04-14 18:18:34.644000 | 0:00:06.979000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 6 | None-None | 1.68796:1.68858 | 0.001:0.001 | 3000 | 30/30 | 2022-04-16 09:15:56.085000 | 2022-04-16 09:18:01.608000 | 0:02:05.523000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
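For reference, the kind of query behind such a listing could be a pymongo aggregation roughly like the one below. This is only a hypothetical sketch: the collection tvts.train_log and the server address come from the header above, while the field names (name, train_id, epoch, cost, save_time) are assumptions about tvts's schema, not code from the actual project.

# Hypothetical per-training-run summary query; field names beyond those
# visible in the table above are assumed, not taken from tvts itself.
from pymongo import MongoClient

coll = MongoClient('mongodb://172.26.41.157:27017')['tvts']['train_log']

pipeline = [
    {'$match': {'name': 'mnist_multi_clf_by_numpynet_mlp_softmax_ok'}},
    {'$group': {
        '_id': '$train_id',
        'cost_min': {'$min': '$cost'},
        'epoch_count': {'$max': '$epoch'},
        'from': {'$min': '$save_time'},
        'to': {'$max': '$save_time'},
    }},
    {'$sort': {'_id': 1}},
]

for row in coll.aggregate(pipeline):
    print(row)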
I found that it gets stuck if I try to get the list of statistics while I am densely writing into the collection at the same time, i.e. when I try to get the statistics on the fly during training and each epoch is very short (about 0.03 seconds).
But I found that I can still read the records with Studio 3T (a MongoDB GUI) while I am densely writing into the collection.
I googled a lot, but I still cannot solve it. Someone said the write lock is exclusive (such as this link: mongodb write is occuring then a read must wait or not wait?), but then why can Studio 3T do it?
Actually I am new to MongoDB; I can use it only because I have a little experience with MySQL, and in "tvts" there is only insertion and querying, i.e. it is a rather simple usage of MongoDB. Is there an equivalent in MongoDB of MySQL's "concurrent inserts"? (such as this link: concurrent read and write in MySQL) I guess it should not be a very hard task for MongoDB to read from a collection while writing into it.
Although it is only a simple simulation of some of TensorBoard's features, I have already written almost 600 lines of code, so I am sorry that changing the database is not preferred.
Please help me. Thanks a lot!
Unbelievable! I accidentally solved it just a few minutes after I posted this question. It seems that a MongoDB collection can be read even while there are dense insertions, and I guess that is its normal behavior. I guess I could not google an answer because it is not a real issue. My issue seems to be caused by the PyCharm IDE I am using: I have the problem when I run my script inside PyCharm, but everything is fine when I run it in a system shell window.
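For anyone who wants to verify this outside an IDE, below is a minimal standalone sketch (not the tvts code; the connection string, database, and collection names are placeholders). A background thread inserts a document roughly every 0.03 seconds, the way a short training epoch would, while the main thread keeps querying; the reads return promptly.

# Minimal sketch: dense inserts from one thread do not block reads from another.
import threading
import time

from pymongo import MongoClient

coll = MongoClient('mongodb://localhost:27017')['tvts_demo']['train_log']
stop = threading.Event()

def writer():
    epoch = 0
    while not stop.is_set():
        coll.insert_one({'epoch': epoch, 'cost': 1.0 / (epoch + 1)})
        epoch += 1
        time.sleep(0.03)  # roughly one "epoch" every 0.03 s, as in the question

threading.Thread(target=writer, daemon=True).start()

for _ in range(5):
    # These reads keep returning promptly while the writer thread is inserting.
    print('documents so far:', coll.count_documents({}))
    time.sleep(1)

stop.set()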
I'm setting up Airflow such that the webserver runs on one machine and the scheduler runs on another. Both share the same MySQL metastore database. Both instances come up without any errors in the logs, but the scheduler is not picking up any DAG runs that are created by manually triggering the DAGs via the web UI.
The dag_run table in MySQL shows a few entries, all in the running state:
mysql> select * from dag_run;
+----+--------------------------------+----------------------------+---------+------------------------------------+------------------+----------------+----------+----------------------------+
| id | dag_id | execution_date | state | run_id | external_trigger | conf | end_date | start_date |
+----+--------------------------------+----------------------------+---------+------------------------------------+------------------+----------------+----------+----------------------------+
| 1 | example_bash_operator | 2017-12-14 11:33:08.479040 | running | manual__2017-12-14T11:33:08.479040 | 1 | �� }�. | NULL | 2017-12-14 11:33:09.000000 |
| 2 | example_bash_operator | 2017-12-14 11:38:27.888317 | running | manual__2017-12-14T11:38:27.888317 | 1 | �� }�. | NULL | 2017-12-14 11:38:27.000000 |
| 3 | example_branch_dop_operator_v3 | 2017-12-14 13:47:05.170752 | running | manual__2017-12-14T13:47:05.170752 | 1 | �� }�. | NULL | 2017-12-14 13:47:05.000000 |
| 4 | example_branch_dop_operator_v3 | 2017-12-15 04:26:07.208501 | running | manual__2017-12-15T04:26:07.208501 | 1 | �� }�. | NULL | 2017-12-15 04:26:07.000000 |
| 5 | example_branch_dop_operator_v3 | 2017-12-15 06:12:10.965543 | running | manual__2017-12-15T06:12:10.965543 | 1 | �� }�. | NULL | 2017-12-15 06:12:11.000000 |
| 6 | example_branch_dop_operator_v3 | 2017-12-15 06:28:43.282447 | running | manual__2017-12-15T06:28:43.282447 | 1 | �� }�. | NULL | 2017-12-15 06:28:43.000000 |
+----+--------------------------------+----------------------------+---------+------------------------------------+------------------+----------------+----------+----------------------------+
6 rows in set (0.21 sec)
But the scheduler that is started on the other machine and connected to the same MySQL DB simply does not pick up these DAG runs and turn them into task instances.
Not sure what I'm missing in the setup here. So a few questions:
When and how is the DAGS folder located at $AIRFLOW_HOME/dags populated? I think it's when the webserver is started. But then, if I just start the scheduler on another machine, how will the DAGS folder on that machine be filled?
Currently, I'm running airflow initdb only on the machine hosting the webserver and not on the scheduler. I hope that is correct.
Can I enable debug logs for the scheduler to get more output that could indicate what's missing? From the current logs it looks like it just looks in the DAGS folder on the local system and finds no DAGs there (not even the example ones), in spite of the config option to load examples being set to True.
I don't think it matters, but I'm currently using a LocalExecutor.
Any help is appreciated.
Edit: I know that I need to sync the DAGS folder across machines, as the Airflow docs suggest, but I'm not sure whether this is the reason why the scheduler is not picking up the tasks in the case above.
OK, I got the answer: it looks like the scheduler does not query the DB until there are DAGs in the local DAG folder. The code in job.py looks like:
ti_query = (
    session
    .query(TI)
    .filter(TI.dag_id.in_(simple_dag_bag.dag_ids))
    .outerjoin(DR,
        and_(DR.dag_id == TI.dag_id,
             DR.execution_date == TI.execution_date))
    .filter(or_(DR.run_id == None,
            not_(DR.run_id.like(BackfillJob.ID_PREFIX + '%'))))
    .outerjoin(DM, DM.dag_id == TI.dag_id)
    .filter(or_(DM.dag_id == None,
            not_(DM.is_paused)))
)
I added a simple DAG to the local DAG folder on the machine hosting the scheduler, and it started picking up the other DAG runs as well.
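The answer above does not show the DAG that was added; for an Airflow 1.x setup like this one, a minimal placeholder DAG dropped into $AIRFLOW_HOME/dags on the scheduler machine could look roughly like the sketch below (the file name, dag_id, and dates are arbitrary examples).

# dags/placeholder_dag.py -- a minimal DAG whose only purpose is to make the
# scheduler's local DagBag non-empty (names here are arbitrary examples).
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id='placeholder_dag',
    start_date=datetime(2017, 12, 1),
    schedule_interval=None,  # never scheduled by itself; it only needs to exist
)

DummyOperator(task_id='noop', dag=dag)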
We raised an issue for this - https://issues.apache.org/jira/browse/AIRFLOW-1934
I'm attempting to start OrientDB in distributed mode on AWS.
I have an auto scaling group that creates new nodes as needed. When the nodes are created, they start with a default config without a node name. The idea is that the node name is generated randomly.
My problem is that the server starts up and asks for user input.
+---------------------------------------------------------------+
| WARNING: FIRST DISTRIBUTED RUN CONFIGURATION |
+---------------------------------------------------------------+
| This is the first time that the server is running as |
| distributed. Please type the name you want to assign to the |
| current server node. |
| |
| To avoid this message set the environment variable or JVM |
| setting ORIENTDB_NODE_NAME to the server node name to use. |
+---------------------------------------------------------------+
Node name [BLANK=auto generate it]:
I don't want to hard-code a node name because I need a random one, and the server never finishes starting because it is waiting for user input.
Is there a parameter I can pass to dserver.sh that will skip this prompt and generate a random node name?
You could generate a random string and pass it to OrientDB as the node name via the ORIENTDB_NODE_NAME environment variable (set it in the environment that launches dserver.sh). Example:
ORIENTDB_NODE_NAME=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1)
For more information about this, look at: https://gist.github.com/earthgecko/3089509
I am thinking of migrating my website to Google Cloud SQL, and I signed up for a free account (D32).
Upon testing on a table with 23k records, performance was very poor, so I read that moving from the free account to a fully paid account would give me access to a faster CPU and HDD... so I did.
Performance is still VERY POOR.
I have been running my own MySQL server for years now, upgrading as needed to handle more and more connections and to gain raw speed (needed because of a legacy application). I heavily optimize tables and configuration and make heavy use of the query cache, etc.
A few pages of our legacy system issue over 1.5k queries per page. Currently I have managed to push the MySQL query time (execution and fetching of the data) down to 3.6 seconds for all of those queries, meaning MySQL takes about 0.0024 seconds per query to execute and return the values: not the greatest, but acceptable for those pages.
I uploaded a table involved in many of those queries to Google Cloud SQL. I noticed that an INSERT already takes SECONDS to execute instead of milliseconds, but I figured it might be the sync vs. async setting. I changed it to async and the execution time for the insert does not seem to change. Not a big problem for now; I am only testing queries at the moment.
I ran a simple SELECT * FROM <table> and noticed that it takes over 6 seconds. I thought maybe the query cache needed to warm up; I tried again and this time it took 4 seconds (excluding network traffic). I ran the same query on my backup server after a restart, with no connections at all, and it took less than 1 second; running it again, 0.06 seconds.
Maybe the problem is the cache, or the result set being too big... let's try a smaller subset:
select * from <table> limit 5;
on my server: 0.00 seconds
on Google Cloud SQL: 0.04 seconds
So I decided to try a dumb SELECT on an empty table with no records at all, just created with a single field:
on my server: 0.00 seconds
on Google Cloud SQL: 0.03 seconds
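For anyone who wants to reproduce this kind of comparison, a small script along the following lines measures the per-query round trip from the client side. It is only a sketch: the hostnames, credentials, table name, and the choice of pymysql as the driver are placeholders, not details from my setup.

# Rough per-query round-trip timing against two MySQL endpoints.
import time

import pymysql

def time_query(host, user, password, database, sql, runs=20):
    conn = pymysql.connect(host=host, user=user, password=password, database=database)
    try:
        timings = []
        with conn.cursor() as cur:
            for _ in range(runs):
                start = time.perf_counter()
                cur.execute(sql)
                cur.fetchall()
                timings.append(time.perf_counter() - start)
        return min(timings), sum(timings) / len(timings)
    finally:
        conn.close()

for label, host in [('own server', 'mysql.example.com'), ('Cloud SQL', '1.2.3.4')]:
    best, avg = time_query(host, 'user', 'secret', 'testdb',
                           'SELECT * FROM some_table LIMIT 5')
    print('%s: best %.4fs, avg %.4fs over 20 runs' % (label, best, avg))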
Profiling does not give any insight, except that the query cache is not active on Google Cloud SQL and that query execution there seems faster, but overall it is not...
My Server:
mysql> show profile;
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000225 |
| Waiting for query cache lock | 0.000116 |
| init | 0.000115 |
| checking query cache for query | 0.000131 |
| checking permissions | 0.000117 |
| Opening tables | 0.000124 |
| init | 0.000129 |
| System lock | 0.000124 |
| Waiting for query cache lock | 0.000114 |
| System lock | 0.000126 |
| optimizing | 0.000117 |
| statistics | 0.000127 |
| executing | 0.000129 |
| end | 0.000117 |
| query end | 0.000116 |
| closing tables | 0.000120 |
| freeing items | 0.000120 |
| Waiting for query cache lock | 0.000140 |
| freeing items | 0.000228 |
| Waiting for query cache lock | 0.000120 |
| freeing items | 0.000121 |
| storing result in query cache | 0.000116 |
| cleaning up | 0.000124 |
+--------------------------------+----------+
23 rows in set, 1 warning (0.00 sec)
Google Cloud SQL:
mysql> show profile;
+----------------------+----------+
| Status | Duration |
+----------------------+----------+
| starting | 0.000061 |
| checking permissions | 0.000012 |
| Opening tables | 0.000115 |
| System lock | 0.000019 |
| init | 0.000023 |
| optimizing | 0.000008 |
| statistics | 0.000012 |
| preparing | 0.000005 |
| executing | 0.000021 |
| end | 0.000024 |
| query end | 0.000007 |
| closing tables | 0.000030 |
| freeing items | 0.000018 |
| logging slow query | 0.000006 |
| cleaning up | 0.000005 |
+----------------------+----------+
15 rows in set (0.03 sec)
Keep in mind that I connect to both servers remotely from a machine located in Virginia, while my own server is located in Texas (even if that should not matter much).
What am I doing wrong? Why do simple queries take this long? Am I missing or misunderstanding something here?
As of right now I will not be able to use Google Cloud SQL, because a page with 1,500 queries would take way too long (1,500 × ~0.03 s ≈ 45 seconds).
I know this question is old but....
Cloud SQL has poor support for MyISAM tables; it's recommended to use InnoDB.
We had poor performance when migrating a legacy app; after reading through the docs and contacting the paid support, we had to migrate the tables to InnoDB. The lack of a query cache was also a killer.
You may also find later on that you need to tweak the MySQL configuration via the 'flags' in the Google console. One example: 'wait_timeout' is set too high by default (IMO).
Hope this helps someone :)
The query cache is not yet a feature of Cloud SQL, which may explain the results. However, I recommend closing this question, as it is quite broad and doesn't fit the format of a neat and tidy Q&A. There are just too many variables not mentioned here, and it is not clear what a decisive "answer" would look like to such a general optimization question when there are so many variables at play.
I have multiple CouchDB servers I want to keep in sync with each other, and I use these servers to share large files (e.g. >100 MB). To keep them synchronized, I have each CouchDB instance do a continuous pull replication from each of the other instances.
Here's an example: I have three CouchDB servers A, B, and C, all of which have continuous pull replications from each other, like so:
-------  <-------------  -------
|  A  |  ------------->  |  B  |
-------                  -------
  ^ |                      | ^
  | |                      | |
  | V                      | |
-------  <------------------ |
|  C  |  ---------------------
-------
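(For context, each of these continuous pull replications is defined on the node doing the pulling. A sketch of one such definition, e.g. C pulling from A, is below; the hostnames, credentials, and the database name shared_files are placeholders, not details from my actual setup.)

# Sketch: create a persistent continuous pull replication on server C that
# pulls the "shared_files" database from server A. All names and credentials
# here are placeholders.
import requests

LOCAL = 'http://admin:password@c.example.com:5984'   # server C (the target)
REMOTE = 'http://admin:password@a.example.com:5984'  # server A (the source)

replication_doc = {
    'source': REMOTE + '/shared_files',
    'target': LOCAL + '/shared_files',
    'continuous': True,
}

# A document in the _replicator database makes CouchDB keep the replication
# running, and restart it after the server restarts.
resp = requests.put(LOCAL + '/_replicator/pull-shared_files-from-a',
                    json=replication_doc)
resp.raise_for_status()
print(resp.json())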
Someone uploads a document to server A with a 500MB attachment. B and C both start replicating the document from A, and B finishes the replication before C does:
-------       doc        -------
|  A  |----------------->|  B  |
-------                  -------
   |
   | doc
   V
-------
|  C  |
-------
My question is, will C then start replicating the same document from B (since C also has a continuous pull replication from B), while it is still transferring the document from A?
-------                  -------
|  A  |                  |  B  |
-------                  -------
   |          doc           |
doc| |-----------------------
   | |
   V V
-------
|  C  |
-------
I would guess this would happen, since AFAIK, CouchDB replication doesn't actually store the replicated documents to the target (using the _bulk_docs API) until the documents (including attachments) have been fully fetched from the source[1]. I am worried about this happening since it would be redundant and a big waste of bandwidth.
[1] https://github.com/couchbaselabs/TouchDB-iOS/wiki/Replication-Algorithm
According to a recent discussion on the CouchDB users mailing list, and to this document describing the replication algorithm, the replication process knows which attachments are already present on the target. If, however, the attachments are very large and both replications start before either of them has finished, the attachment will be transferred multiple times.