Need help setting BROKER_URL in Airflow's config for the Celery Executor

Summary
I'm using Apache Airflow for the first time. I've gotten the webserver, SequentialExecutor and LocalExecutor to work, but I'm running into issues when using the CeleryExecutor with rabbitmq-server. I currently have two AWS EC2 instances.
Error
To summarize: My worker cannot connect to the rabbitmq-server on my scheduler node. Whenever I run airflow worker on the worker instance, it gives:
- ** ---------- [config]
- ** ---------- .> app: airflow.executors.celery_executor:0x7f53a8dce400
- ** ---------- .> transport: amqp://guest:**@localhost:5672//
- ** ---------- .> results: disabled://
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> default exchange=default(direct) key=default
[2019-02-15 02:26:23,742: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
Configuration
I followed all of the directions I could find online. Both instances have the same airflow.cfg file, with
[core]
executor = CeleryExecutor
[celery]
broker_url = pyamqp://username:password@hostname:port/virtual_host
and result_backend pointing at the same MySQL database on RDS that Airflow is working off of.
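For concreteness, that backend line has the following shape (host and credentials here are hypothetical placeholders):
result_backend = db+mysql://airflow_user:airflow_pass@my-rds-host:3306/airflow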
From what I could tell, no matter what, the worker node always tried connecting to a local rabbitmq-server and completely ignored that broker_url in my airflow.cfg file.
What I've Tried
I went spelunking in the source code and noticed that in celery/app/base.py, if I log the configuration it gets from _get_config() when it goes to create a connection, there are actually TWO broker values in the dictionary returned:
BROKER_URL = None
broker_url = pyamqp://username:password@hostname:port/virtual_host
and all of the connection logic seems to point at the BROKER_URL key.
I tried setting BROKER_URL and CELERY_BROKER_URL in airflow.cfg, but it seems to be case-insensitive and ignores the latter. Just to see if it would work, I modified the _get_config() method and hacked in:
s['BROKER_URL'] = s['broker_url']
return s
And, like I expected, everything started working.
Am I doing something wrong? I'd really rather not use this hack, but I can't understand why it's ignoring the configuration values.
Thanks!

From the error message, it seems like the hostname being passed in the URI is wrong:
If rabbitmq-server and the worker are on different machines: instead of localhost/127.0.0.1, the hostname should be the IP address of the rabbitmq machine
If rabbitmq-server and the worker are on the same machine as part of a Docker Compose application (e.g. if you took inspiration from here): the hostname should be the service name associated with the RabbitMQ image in docker-compose.yml, e.g. amqp://guest:guest@rabbitmq:5672/
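If you want to confirm which broker URL the worker's Celery app actually loaded, rather than guessing from the banner, a quick check along these lines can help. This is a sketch that assumes an Airflow 1.10-era layout, where airflow.executors.celery_executor exposes its Celery instance as app (which the banner's "app:" line suggests):
# Run on the worker node, with the same AIRFLOW_HOME and airflow.cfg.
from airflow.executors.celery_executor import app

# Should print the [celery] broker_url from airflow.cfg, not localhost.
print(app.conf.broker_url)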

Related

Can't talk to HBase from a different kubernetes namespace: java.net.UnknownHostException: hregion-0.hregion

I am using Kubernetes, where I have a Hadoop cluster running in the 'platform' namespace and an example application running in the 'example' namespace.
The example application needs to talk to HBase. When it does so, we see the following error:
java.net.UnknownHostException: hregion-0.hregion
at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
at org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1192)
at org.apache.hadoop.hbase.client.ClientServiceCallable.setStubByServiceName(ClientServiceCallable.java:44)
at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:229)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:386)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:360)
at org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1078)
at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:403)
at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:445)
at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:442)
at org.apache.hadoop.hbase.client.RpcRetryingCallable.call(RpcRetryingCallable.java:58)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3084)
at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3076)
at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:442)
The command
> nslookup hregion-0.hregion
on the client machine fails, because the hregion service is in the platform namespace (where that command will succeed).
We suspected that the HBase region server has registered itself with zookeeper using an incomplete name, and verified by connecting to the zookeeper server:
[zk: localhost:2181(CONNECTED) 8] ls /hbase/rs
[hregion-0.hregion,16020,1560851357442]
The ConnectionUtils.getStubKey method simply uses java.net.InetAddress.getByName(hostname) and it is this method which fails.
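The namespace dependence is easy to reproduce outside the JVM; for example (Python used here for brevity, assuming it runs in a pod in the 'example' namespace):
import socket

# Fails outside 'platform': the short name only resolves where
# platform.svc.cluster.local is on the resolver search path.
socket.gethostbyname("hregion-0.hregion")  # raises socket.gaierror here

# Succeeds from any namespace: the fully qualified service DNS name.
socket.gethostbyname("hregion-0.hregion.platform.svc.cluster.local")  # 10.1.14.53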
Here is some zookeeper debugging info (this from the HBase master):
hbase(main):001:0> zk_dump
HBase is rooted at /hbase
Active master address: hmaster-0.hmaster.platform.svc.cluster.local,16000,1560851357485
Backup master addresses:
Region server holding hbase:meta: hregion-0.hregion,16020,1560851357442
Region servers:
hregion-0.hregion,16020,1560851357442
On the hregion-0 server, we have the following entries in /etc/hosts:
# Kubernetes-managed hosts file.
127.0.0.1 localhost
10.1.14.53 hregion-0.hregion.platform.svc.cluster.local hregion-0
And the /etc/resolv.conf file looks like this:
nameserver 10.96.0.10
search platform.svc.cluster.local svc.cluster.local cluster.local mycompany.com
options ndots:5
How do I fix this? I assume I need to tell HBase to register its nodes in zookeeper using their fully qualified domain name - how?
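One avenue worth investigating, offered as an assumption rather than a verified fix for this cluster: HBase has an expert-level hbase-site.xml property, hbase.regionserver.hostname, that overrides the hostname a region server registers, and it could be pointed at the pod's FQDN:
<!-- hbase-site.xml on each region server; the value shown is for hregion-0 -->
<property>
  <name>hbase.regionserver.hostname</name>
  <value>hregion-0.hregion.platform.svc.cluster.local</value>
</property>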

How to call a Celery shared_task?

I'm trying to use stream_framework in my application (NOT Django) but I'm having a problem calling the stream_framework shared tasks. Celery seems to find the tasks:
-------------- celery#M3800 v3.1.25 (Cipater)
---- **** -----
--- * *** * -- Linux-4.15.0-34-generic-x86_64-with-Ubuntu-18.04-bionic
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: task:0x7f8d22176dd8
- ** ---------- .> transport: redis://localhost:6379/0
- ** ---------- .> results: redis://localhost:6379/0
- *** --- * --- .> concurrency: 8 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> celery exchange=celery(direct) key=celery
[tasks]
. formshare.processes.feeds.tasks.test_shared_task
. stream_framework.tasks.fanout_operation
. stream_framework.tasks.fanout_operation_hi_priority
. stream_framework.tasks.fanout_operation_low_priority
. stream_framework.tasks.follow_many
. stream_framework.tasks.unfollow_many
[2018-09-17 10:06:28,240: INFO/MainProcess] Connected to redis://localhost:6379/0
[2018-09-17 10:06:28,246: INFO/MainProcess] mingle: searching for neighbors
[2018-09-17 10:06:29,251: INFO/MainProcess] mingle: all alone
I run celery with:
celery -A formshare.processes.feeds.celery_app worker --loglevel=info
My celery_app has:
from celery import Celery
celeryApp = Celery('task', broker='redis://localhost:6379/0', backend='redis://localhost:6379/0', include='formshare.processes.feeds.tasks')
The problem is that delay() does not run the shared task. I also created a shared task within my own application, but when I call delay() that task is not executed either. I guess I need to register them as callable from my application? I can't seem to find any information online.
I also tried to auto-discover the tasks, but I had the same problem:
celeryApp.autodiscover_tasks(['stream_framework', 'formshare.processes.feeds'],force=True)
Any ideas are highly appreciated.
Shared tasks are a specific mechanism for sharing tasks between different applications (mainly Django apps, I think, but I have used them in Flask, for example).
We had the same issue, and to get it to work we set
 celery_app.set_default()
on the Celery instance at instantiation.
Otherwise, another way of getting things right is to call the task via the app itself, along these lines:
from celery import current_app
.
.
.
current_app.tasks['my.tasks.to.exec'].delay(something)
This always works because a shared task is not bound to any app when you import it; called this way, it belongs to whatever app is configured as the current_app.
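Putting the first suggestion together with this question's setup, a minimal sketch (it reuses the celeryApp from the question; the set_default() call is the only addition):
from celery import Celery

celeryApp = Celery('task', broker='redis://localhost:6379/0',
                   backend='redis://localhost:6379/0',
                   include='formshare.processes.feeds.tasks')
# Make this the app that unbound shared_tasks attach to at call time.
celeryApp.set_default()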

mesos-master crashes with zookeeper cluster

I am deploying a ZooKeeper cluster with 3 nodes, which I use to keep my Mesos masters highly available. I downloaded the zookeeper-3.4.6.tar.gz tarball, uncompressed it to /opt, renamed it to /opt/zookeeper, edited conf/zoo.cfg (pasted below), created a myid file in dataDir (which is set to /var/lib/zookeeper in zoo.cfg), and started ZooKeeper with ./bin/zkServer.sh start, and it went well. I started all 3 nodes one by one and they all seem fine. I can connect to each server with ./bin/zkCli.sh, no problem.
But when I start Mesos (3 masters and 3 slaves; each node runs a master and a slave), the masters soon crash, one by one, and on the web page http://mesos_master:5050 the Slaves tab shows no slaves. When I run only one ZooKeeper node, everything is fine, so I think it's the ZooKeeper cluster's problem.
I have 3 PV hosts on my Ubuntu server, all running Ubuntu 14.04 LTS:
node-01, node-02, node-03
I have /etc/hosts on all three nodes like this:
172.16.2.70 node-01
172.16.2.81 node-02
172.16.2.80 node-03
I installed ZooKeeper and Mesos on all three nodes. The ZooKeeper config file looks like this (on all three nodes):
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=node-01:2888:3888
server.2=node-02:2888:3888
server.3=node-03:2888:3888
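For this configuration, each node's myid file in dataDir must contain the number from its own server.N line, e.g.:
echo 1 > /var/lib/zookeeper/myid   # on node-01; use 2 on node-02 and 3 on node-03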
They can be started normally and run well. Then I start the mesos-master service with ./bin/mesos-master.sh --zk=zk://172.16.2.70:2181,172.16.2.81:2181,172.16.2.80:2181/mesos --work_dir=/var/lib/mesos --quorum=2, and after a few seconds it gives me errors like this:
F0817 15:09:19.995256 2250 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
# 0x7fa2b8be71a2 google::LogMessage::Fail()
# 0x7fa2b8be70ee google::LogMessage::SendToLog()
# 0x7fa2b8be6af0 google::LogMessage::Flush()
# 0x7fa2b8be9a04 google::LogMessageFatal::~LogMessageFatal()
# 0x7fa2b81a899a mesos::internal::master::fail()
# 0x7fa2b8262f8f _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
# 0x7fa2b823fba7 _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
# 0x7fa2b820f9f3 _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
# 0x7fa2b826305c _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
# 0x4a44e7 std::function<>::operator()()
# 0x49f3a7 _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
# 0x499480 process::Future<>::fail()
# 0x7fa2b806b4b4 process::Promise<>::fail()
# 0x7fa2b826011b process::internal::thenf<>()
# 0x7fa2b82a0757 _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
# 0x7fa2b82962d9 std::_Bind<>::operator()<>()
# 0x7fa2b827ee89 std::_Function_handler<>::_M_invoke()
I0817 15:09:20.098639 2248 http.cpp:283] HTTP GET for /master/state.json from 172.16.2.84:54542 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'
# 0x7fa2b8296507 std::function<>::operator()()
# 0x7fa2b827efaf _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
# 0x7fa2b82a07fe _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
# 0x7fa2b8296507 std::function<>::operator()()
# 0x7fa2b82e4419 process::internal::run<>()
# 0x7fa2b82da22a process::Future<>::fail()
# 0x7fa2b83136b5 std::_Mem_fn<>::operator()<>()
# 0x7fa2b830efdf _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
# 0x7fa2b8307d7f _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
# 0x7fa2b82fe431 _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
# 0x7fa2b830f065 _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
# 0x4a44e7 std::function<>::operator()()
# 0x49f3a7 _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
# 0x7fa2b82da202 process::Future<>::fail()
# 0x7fa2b82d2d82 process::Promise<>::fail()
Aborted
Sometimes the warning looks like this instead, and then it crashes with the same output as above:
0817 15:09:49.745750 2104 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
I want to know whether ZooKeeper is deployed and running well in my case, and how I can locate the problem. Any answers and suggestions are welcome. Thanks.
Actually, in my case it was because I hadn't opened firewall port 5050 to allow the three servers to communicate with each other. After updating the firewall rules, it started working as expected.
I ran into the same issue. I tried different approaches and different options, and finally the --ip option worked for me. Initially I had used the --hostname option:
mesos-master --ip=192.168.0.13 --quorum=2 --zk=zk://m1:2181,m2:2181,m3:2181/mesos --work_dir=/opt/mm1 --log_dir=/opt/mm1/logs
You need to check that all mesos/zookeeper master nodes can communicate correctly. For that, you need:
Zookeeper ports open: TCP 2181, 2888, 3888
Mesos port open: TCP 5050
ping available (ICMP messages 0 and 8)
If you use FQDN instead of IP in your config, check that the DNS resolution is working correctly as well.
Split your Mesos masters' work_dir into different directories; do not use a shared work_dir for all masters, because of zk.
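As a quick way to run through the port checklist above, a minimal sketch (assuming Python 3 on each master, with the hostnames from the /etc/hosts above; note it only probes TCP, not ICMP ping):
import socket

hosts = ["node-01", "node-02", "node-03"]
ports = [2181, 2888, 3888, 5050]  # ZooKeeper client/peer/election + Mesos

for host in hosts:
    for port in ports:
        try:
            # Attempt a TCP connect with a short timeout, then close.
            socket.create_connection((host, port), timeout=2).close()
            print("{0}:{1} reachable".format(host, port))
        except OSError as err:
            print("{0}:{1} blocked: {2}".format(host, port, err))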

celeryd insists on checking amqp when it's configured to use redis

I am trying to run celeryd + redis in my setup.
CELERYD_NODES="worker1"
CELERYD_NODES="worker1 worker2 worker3"
CELERY_BIN="/home/snijsure/.virtualenvs/mtest/bin/celery"
CELERYD_CHDIR="/home/snijsure/work/mytest/"
CELERYD_OPTS="--time-limit=300 --concurrency=8"
CELERYD_LOG_FILE="/var/log/celery/%N.log"
CELERYD_PID_FILE="/var/run/celery/%N.pid"
CELERYD_USER="celery"
CELERYD_GROUP="celery"
CELERY_CREATE_DIRS=1
export DJANGO_SETTINGS_MODULE="analytics.settings.local"
I have following in my base.py
BROKER_URL = 'redis://localhost:6379/0'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'
BROKER_HOST = "localhost"
BROKER_BACKEND="redis"
REDIS_PORT=6379
REDIS_HOST = "localhost"
BROKER_USER = ""
BROKER_PASSWORD =""
BROKER_VHOST = "0"
REDIS_DB = 0
REDIS_CONNECT_RETRY = True
CELERY_SEND_EVENTS=True
CELERY_RESULT_BACKEND='redis'
CELERY_TASK_RESULT_EXPIRES = 10
CELERYBEAT_SCHEDULER="djcelery.schedulers.DatabaseScheduler"
CELERY_ALWAYS_EAGER = False
import djcelery
djcelery.setup_loader()
However, when I start celeryd using /etc/init.d/celeryd start,
I see the following messages in my log files:
[2014-08-14 23:16:41,430: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
Trying again in 32.00 seconds...
It seems like it's trying to connect to amqp. Any ideas on why that is? I have followed the procedure outlined here:
http://celery.readthedocs.org/en/latest/getting-started/brokers/redis.html
I am running version 3.1.13 (Cipater)
What am I doing wrong?
-Subodh
How do you start your celery worker? I encountered this error once because I didn't start it right. You should add the -A option when executing "celery worker" so that celery connects to the broker you configured in your Celery object. Otherwise celery will try to connect to the default broker.
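With the generic init script, that means setting CELERY_APP in /etc/default/celeryd so the workers are started with -A (the module name below is a guess based on this question's project; point it at wherever your Celery object lives):
CELERY_APP="analytics"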
Your /etc/default/celeryd file looks ok.
You are using djcelery, however. I'd recommend you drop that. If you look at the Django setup guide and example project you will notice that there are no longer any INSTALLED_APPS required for celery. It appears that djcelery is now only recommended if you want to use the Django SQL database as a backend.
https://github.com/celery/celery/tree/3.1/examples/django/
http://celery.readthedocs.org/en/latest/django/first-steps-with-django.html#using-celery-with-django
I've just rebuilt against that pattern and I can confirm that it works ok, at least in terms of connecting to Redis rather than trying to use RabbitMQ (amqp).
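For reference, the pattern from that example project boils down to a module like the following. This is a sketch against the Celery 3.1 Django guide; the 'analytics' names are taken from this question's DJANGO_SETTINGS_MODULE and may not match your layout:
# analytics/celery.py
from __future__ import absolute_import

import os

from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'analytics.settings.local')

from django.conf import settings

app = Celery('analytics')

# Reads BROKER_URL, CELERY_RESULT_BACKEND, etc. from the Django settings.
app.config_from_object('django.conf:settings')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)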

Hector test example not working on Cassandra 0.7.4

I have set up my single-node Cassandra 0.7.4 and started the service with bin/cassandra -f. Now I am trying to use the Hector API (v0.7.0) to manage the DB.
The Cassandra CLI works fine and I can create keyspaces and so on.
I tried to run the test example and create a single keyspace:
Cluster cluster = HFactory.getOrCreateCluster("TestCluster", new CassandraHostConfigurator("localhost:9160"));
Keyspace keyspace = HFactory.createKeyspace("Keyspace1", cluster);
But all I get is this:
2011-04-14 22:20:27,469 [main ] INFO me.prettyprint.cassandra.connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
2011-04-14 22:20:27,492 [main ] DEBUG me.prettyprint.cassandra.connection.HThriftClient - Transport open status false for client CassandraClient<localhost:9160-1>
....this again about 20 times
me.prettyprint.cassandra.service.JmxMonitor - Registering JMX me.prettyprint.cassandra.service_TestCluster:ServiceType=hector, MonitorType=hector
2011-04-14 22:20:27,636 [Thread-0 ] INFO me.prettyprint.cassandra.connection.CassandraHostRetryService - Downed Host retry shutdown hook called
2011-04-14 22:20:27,646 [Thread-0 ] INFO me.prettyprint.cassandra.connection.CassandraHostRetryService - Downed Host retry shutdown complete
Can you please tell me what I'm doing wrong?
Thanks
When you connect via the CLI, do you specify "-h localhost -p 9160"?
Can you actually do stuff on the command line with the above?
The error from HThriftClient indicates it could not connect to the Cassandra Daemon.
FTR, you would get responses much faster via hector-users@googlegroups.com
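For reference, that explicit CLI connection check looks like this, run from the Cassandra install directory:
bin/cassandra-cli -h localhost -p 9160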
If you are on a Linux machine, try starting up your Cassandra server with this command (run from the bin directory; -f keeps it in the foreground):
./cassandra -f
Then for the CLI, use this command:
./cassandra-cli -h {hostname} -p 9160
Then make sure that the configuration is OK.