Airflow tries to access Celery workers using the worker ID instead of URL

I have Airflow running with CeleryExecutor and 2 workers. When my DAG runs, the tasks generate a log on the filesystem of the worker that ran them. But when I go to the Web UI and click on the task logs, I get:
*** Log file does not exist: /usr/local/airflow/logs/test_dag/task2/2019-11-01T18:12:16.309655+00:00/1.log
*** Fetching from: http://70953abf1c10:8793/log/test_dag/task2/2019-11-01T18:12:16.309655+00:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='70953abf1c10', port=8793): Max retries exceeded with url: /log/test_dag/task2/2019-11-01T18:12:16.309655+00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f329c3a2650>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
http://70953abf1c10:8793/ is obviously not a reachable address for the worker. However, celery@70953abf1c10 is the name of this worker in Celery. It seems like Airflow is trying to learn the worker's URL from Celery, but Celery is giving the worker's name instead. How can I solve this?

DejanLekic's solution put me on the right track, but it wasn't entirely obvious, so I'm adding this answer to clarify.
In my case I was running Airflow on Docker containers. By default, Docker containers use a bridge network called bridge. This is a special network that does not automatically resolve hostnames. I created a new bridge network in Docker called airflow-net and had all my Airflow containers join this one (leaving the default bridge was not necessary). Then everything just worked.
By default, Docker sets the hostname to the hex ID of the container. In my case the container ID began with 70953abf1c10 and the hostname was also 70953abf1c10. There is a Docker parameter for specifying hostname, but it turned out to not be necessary. After I connected the containers to a new bridge network, 70953abf1c10 began to resolve to that container.
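A quick way to confirm whether the problem is name resolution (as the "Temporary failure in name resolution" error suggests) is to try resolving the worker's hostname from the container that runs the webserver. The following is only a minimal sketch; the helper name resolves is made up for illustration, and the hostname you pass in would be whatever appears in your own error message.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if this machine can resolve `hostname` to an IP address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when the lookup fails.
        return False


# Example usage, run inside the webserver container (hostname is from the error above):
#   resolves("70953abf1c10")
# On the default bridge network this would be expected to return False;
# after joining a user-defined network such as airflow-net it should return True.
```

If the lookup fails from the webserver but succeeds once both containers share a user-defined network, the log fetch on port 8793 should start working as well.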

The simplest solution is either to use the default node name, which includes the hostname, or to explicitly set a node name that contains a valid hostname (example: celery1@hostname.domain.tld).
If you use the default settings, then the machine running the Airflow worker has its hostname incorrectly set to 70953abf1c10. You should fix this by running something like: hostname -B hostname.domain.tld

Related

service XXX was unable to place a task because no container inst met all of its reqmnts. instance XXX is already using a port required by your task

service crm was unable to place a task because no container instance met all of its requirements. The closest matching container-instance e45856e4821149XXXXXXXXX is already using a port required by your task.
Is there any way to resolve this? I am currently trying to run 4 task definitions. I have referred to the AWS documents below, but I am not sure which solution is ideal for this issue. How do I set up dynamic port mapping?
registered ports : ["22","4000","2376","2375","51678","51679"]
https://aws.amazon.com/premiumsupport/knowledge-center/dynamic-port-mapping-ecs/
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-container-instance-requirement-error/
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-event-messages.html#service-event-messages-1
I tried referring to the AWS docs for this issue, but I am not sure how to resolve the port conflict.
If you specify a host port in the port mappings of your task definition, each task occupies that port on the host, so a second task that needs the same port cannot be placed on the same instance. If you specify only the container port and leave the host port out (in bridge network mode), each task receives a dynamically allocated port on the host automatically.
So: don't specify the host port in the task definition.
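As a rough illustration of what that looks like, here is a sketch of a task definition built as a plain Python dict. The family name, image, memory value, and container port 4000 are placeholders chosen for the example, not values taken from the question's actual task definition; the helper uses_dynamic_host_port is likewise hypothetical.

```python
import json

# Hypothetical task definition for bridge network mode.
# Only containerPort is given; hostPort is deliberately left out so that
# ECS picks a free host port for each task.
task_definition = {
    "family": "crm",  # placeholder name
    "networkMode": "bridge",
    "containerDefinitions": [
        {
            "name": "crm",
            "image": "example/crm:latest",  # placeholder image
            "memory": 256,  # placeholder value
            "portMappings": [
                {"containerPort": 4000, "protocol": "tcp"},
            ],
        }
    ],
}


def uses_dynamic_host_port(mapping: dict) -> bool:
    """A mapping is dynamic when hostPort is omitted or explicitly 0."""
    return mapping.get("hostPort", 0) == 0


mappings = task_definition["containerDefinitions"][0]["portMappings"]
print(json.dumps(task_definition, indent=2))
print("all mappings dynamic:", all(uses_dynamic_host_port(m) for m in mappings))
```

A mapping such as {"containerPort": 4000, "hostPort": 4000} would pin the host port instead, which is the situation that produces the "already using a port required by your task" message.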
The target group associated with your task can be used to dynamically target the tasks from, say, a load balancer or other resources supporting target groups.
Or you can create more instances in your autoscaling group so that your task can be placed on an instance where the port is not in use. You can use capacity providers to automatically create new instances when needed. Though, this is likely far less efficient than dynamic port mapping, depending on the performance characteristics of your workloads.

Dynamic port mapping for ECS tasks

I want to run a socket program in AWS ECS with the client and server in one task definition. It works when I use awsvpc network mode and have the client connect to the server on localhost, which is good because I don't need to know the server's IP address. The issue is that the server has to listen on some port, and if I run 10 of these tasks, only 3 (the number of running instances) run at a time, clearly because 10 tasks cannot open the same port. I could manually check for open ports before starting the server and write the chosen port to a shared Docker volume where the client can read it and connect, but this seems complicated and adds unnecessary code to my server. For services there is dynamic port mapping via an Application Load Balancer, but there isn't anything for simply running tasks.
How can I run multiple socket programs without having to manage the port number in AWS ECS?
If you're using awsvpc mode, each task gets its own ENI, so there shouldn't be any port conflict. However, each instance type has a limited number of ENIs available. You can increase that limit by enabling ENI trunking, which is supported by only a handful of instance types:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-instance-eni.html#eni-trunking-supported-instance-types

mongodb mms monitoring agent does not find group members

I have installed the latest mongodb mms agent (6.5.0.456) on ubuntu 16.04 and initialised the replicaset. Hence I am running a single node replicaset with the monitoring agent enabled. The agent works fine, however it does not seem to actually find the replicaset member:
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Iterate:170] Received new configuration: Primary agent, Assigned 0 out of 0 plus 0 chunk monitor(s)
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Iterate:182] Nothing to do. Either the server detected the possibility of another monitoring agent running, or no Hosts are configured on the Group.
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Run:199] Done. Sleeping for 55s...
[2018/05/26 18:30:30.222] [discovery.monitor.info] [components/discovery.go:discover:746] Performing discovery with 0 hosts
[2018/05/26 18:30:30.222] [discovery.monitor.info] [components/discovery.go:discover:803] Received discovery responses from 0/0 requests after 891ns
I can see two processes for monitor agents:
/bin/sh -c /usr/bin/mongodb-mms-monitoring-agent -conf /etc/mongodb-mms/monitoring-agent.config >> /var/log/mongodb-mms/monitoring-agent.log 2>&1
/usr/bin/mongodb-mms-monitoring-agent -conf /etc/mongodb-mms/monitoring-agent.config
However if I terminate one, it also tears down the other, so I do not think that is the problem.
So the question is: what is the Group that the agent is referring to? Where is that configured? Or how do I find out which Group the agent refers to, and how do I check whether the group is configured correctly?
The rs.config() looks fine, with one replicaset member, which has a host field, which looks just fine. I can use that value to connect to the instance using the mongo command. no auth is configured.
EDIT
It looks like Cloud Manager now needs to be configured with the seed host; it then discovers all the other nodes in the replica set. This seems to differ from the pre-Cloud-Manager days, when the agent was able to track the replica set on its own, if I remember correctly. There is probably still an easier way to do this, so I am leaving this question open for now.
"So the question is: what is the Group that the agent is referring to? Where is that configured? Or how do I find out which Group the agent refers to, and how do I check whether the group is configured correctly?"
Configuration values for the Cloud Manager agent (such as mmsGroupId and mmsApiKey) are set in the config file, which is /etc/mongodb-mms/monitoring-agent.config by default. The agent needs this information in order to communicate with the Cloud Manager servers.
For more details, see Install or Update the Monitoring Agent and Monitoring Agent Configuration in the Cloud Manager documentation.
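To see which group an installed agent points at, you can read those values out of the config file. The sketch below assumes the file uses simple key=value lines with # comments, which is an assumption about the format rather than something confirmed here; the function name read_agent_settings and the sample values are made up for illustration.

```python
def read_agent_settings(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Placeholder content standing in for /etc/mongodb-mms/monitoring-agent.config.
sample = """
# placeholder values, not real credentials
mmsGroupId=0123456789abcdef01234567
mmsApiKey=REDACTED
"""

settings = read_agent_settings(sample)
print("Group ID:", settings.get("mmsGroupId"))
```

On a real host you would pass in the contents of the config file instead of the sample string, then compare the group ID with the one shown for your project in the Cloud Manager UI.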
"It looks like Cloud Manager now needs to be configured with the seed host; it then discovers all the other nodes in the replica set."
Unless a MongoDB process is already managed by Cloud Manager automation, I believe it has always been the case that you need to add an existing MongoDB process to monitoring to start the process of initial topology discovery. Once a deployment is monitored, any changes in deployment membership should automatically be discovered by the Cloud Manager agent.
Production deployments should have authentication and access control enabled, so in addition to adding a seed hostname and port via the Cloud Manager UI you usually need to provide appropriate credentials.

Can I tie Celery workers to a particular instance given a shared database?

I have a number of machines each with a Django instance, sharing a single Postgres database.
I want to run Celery, preferably using the Django broker and the Postgres database for simplicity. I do not have a high volume of tasks to run, so there is no need to use a different broker for that reason.
I want to run Celery tasks that operate on local file storage. This means that each Celery worker should only run tasks triggered on its own machine.
Is this possible with the current setup? If not, how do I do it? A local Redis instance for each machine?
I worked out how to make this work. No need for fancy routing or brokers.
I run each celeryd instance with a special queue named after the host. This can be done automatically, like:
./manage.py celeryd -Q celery,`hostname`
I then add a setting in settings.py that stores the hostname:
import socket
CELERY_HOSTNAME = socket.gethostname()
In each Django instance this will have a different value.
I can then specify this queue when I asynchronously call my task:
my_task.apply_async(args=[one, two], queue=settings.CELERY_HOSTNAME)
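The pieces above can be gathered into a few small helpers so that the worker's queue list and the queue used at call time are always derived from the same hostname. This is only a sketch: the helper names (host_queue, worker_queues, routing_options) are hypothetical, and it assumes the default queue is called celery, as in the command above.

```python
import socket

DEFAULT_QUEUE = "celery"  # assumed name of the shared default queue


def host_queue() -> str:
    """Name of the queue that only this machine's worker consumes."""
    return socket.gethostname()


def worker_queues() -> str:
    """Value for the worker's -Q option: the default queue plus the host queue."""
    return f"{DEFAULT_QUEUE},{host_queue()}"


def routing_options() -> dict:
    """Keyword arguments that route a task to this machine's own queue."""
    return {"queue": host_queue()}


print("start the worker with: -Q", worker_queues())
print("call tasks with:", routing_options())
```

A call would then look like my_task.apply_async(args=[one, two], **routing_options()). Tasks sent without a queue argument still go to the shared celery queue and may be picked up by any machine.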

Condor central manager could not see the other computing nodes

I connected three servers to form an HPC cluster using Condor as middleware. When I run condor_status from the central manager, it does not show the other nodes. I can run jobs on the central manager and connect to the other nodes via SSH, but something seems to be missing in the Condor configuration files, where I set the central manager as the Condor host and allowed read and write access for everyone. I kept the MASTER and STARTD daemons in the daemon list for the worker nodes.
When I run condor_status on the central manager, it shows only the central manager, and when I run it on a compute node it gives me the error "CEDAR:6001:Failed to connect to" followed by the central manager's IP and port number.
I managed to solve it. The problem was the central manager's firewall (iptables in my case), which was running.
When I stopped the firewall (su -c "service iptables stop"), all nodes appeared normally in the output of condor_status.
The firewall status can be checked using "service iptables status".
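Before turning the firewall off entirely, it can help to confirm from a compute node that the central manager's port really is unreachable. The sketch below is a plain TCP connection test; the function name can_connect is made up for illustration, and the host and port are whatever the "Failed to connect to" message reports (9618 is, as far as I know, the default collector port, but treat that as an assumption).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage from a compute node (hypothetical address and port):
#   can_connect("192.0.2.10", 9618)
# False while SSH to the same host works would point at a firewall rule
# or at the daemon not running, rather than at general network trouble.
```

If the check fails only while iptables is running, a narrower fix than stopping the firewall is to allow the Condor ports through it.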
There are a number of things that could be going on here. I'd suggest you follow this tutorial and see if it resolves your problems -
http://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/
In my case the service "condor.exe" was not running on the server because I had stopped it manually. I started it again and everything worked fine.