How to call a REST endpoint using an Airflow DAG

I'm new to Apache Airflow. I want to call a REST endpoint from a DAG.
The REST endpoint, for example:
@PostMapping(path = "/api/employees", consumes = "application/json")
Now I want to call this REST endpoint using an Airflow DAG and schedule it. What I'm doing is using SimpleHttpOperator to call the endpoint:
t1 = SimpleHttpOperator(
    task_id='post_op',
    endpoint='http://localhost:8084/api/employees',
    data=json.dumps({"department": "Digital", "id": 102, "name": "Rakesh", "salary": 80000}),
    headers={"Content-Type": "application/json"},
    dag=dag,
)
When I trigger the DAG, the task fails:
[2019-12-30 09:09:06,330] {{taskinstance.py:862}} INFO - Executing <Task(SimpleHttpOperator):
post_op> on 2019-12-30T08:57:00.674386+00:00
[2019-12-30 09:09:06,331] {{base_task_runner.py:133}} INFO - Running: ['airflow', 'run',
'example_http_operator', 'post_op', '2019-12-30T08:57:00.674386+00:00', '--job_id', '6', '--pool',
'default_pool', '--raw', '-sd', 'DAGS_FOLDER/ExampleHttpOperator.py', '--cfg_path',
'/tmp/tmpf9t6kzxb']
[2019-12-30 09:09:07,446] {{base_task_runner.py:115}} INFO - Job 6: Subtask post_op [2019-12-30
09:09:07,445] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2019-12-30 09:09:07,446] {{base_task_runner.py:115}} INFO - Job 6: Subtask post_op [2019-12-30
09:09:07,446] {{dagbag.py:92}} INFO - Filling up the DagBag from
/usr/local/airflow/dags/ExampleHttpOperator.py
[2019-12-30 09:09:07,473] {{base_task_runner.py:115}} INFO - Job 6: Subtask post_op [2019-12-30
09:09:07,472] {{cli.py:545}} INFO - Running <TaskInstance: example_http_operator.post_op 2019-12-
30T08:57:00.674386+00:00 [running]> on host 855dbc2ce3a3
[2019-12-30 09:09:07,480] {{http_operator.py:87}} INFO - Calling HTTP method
[2019-12-30 09:09:07,483] {{logging_mixin.py:112}} INFO - [2019-12-30 09:09:07,483]
{{base_hook.py:84}} INFO - Using connection to: id: http_default. Host: https://www.google.com/,
Port: None, Schema: None, Login: None, Password: None, extra: {}
[2019-12-30 09:09:07,484] {{logging_mixin.py:112}} INFO - [2019-12-30 09:09:07,484]
{{http_hook.py:131}} INFO - Sending 'POST' to url:
https://www.google.com/http://localhost:8084/api/employees
[2019-12-30 09:09:07,501] {{logging_mixin.py:112}} INFO - [2019-12-30 09:09:07,501]
{{http_hook.py:181}} WARNING - HTTPSConnectionPool(host='www.google.com', port=443): Max retries
exceeded with url: /http://localhost:8084/api/employees (Caused by SSLError(SSLError("bad handshake:
SysCallError(-1, 'Unexpected EOF')"))) Tenacity will retry to execute the operation
[2019-12-30 09:09:07,501] {{taskinstance.py:1058}} ERROR -
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url:
/http://localhost:8084/api/employees (Caused by SSLError(SSLError("bad handshake: SysCallError(-1,
'Unexpected EOF')")))
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 485, in wrap_socket
cnx.do_handshake()
File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1934, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1664, in _raise_ssl_error
raise SysCallError(-1, "Unexpected EOF")
OpenSSL.SSL.SysCallError: (-1, 'Unexpected EOF')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 376, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 994, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 394, in connect
ssl_context=context,
File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 370, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 491, in wrap_socket
raise ssl.SSLError("bad handshake: %r" % e)
ssl.SSLError: ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
Airflow is running in Docker, using the puckel/docker-airflow image.
Why is it calling the http_default connection's host, https://www.google.com/?

You need to consider both the Operator you are using and the underlying Hook which it uses to connect.
The Hook fetches connection information from an Airflow Connection, which is just a container used to store credentials and other connection details. You can configure Connections in the Airflow UI (Admin -> Connections).
So in this case, you need to first configure your HTTP Connection.
From the http_hook documentation:
http_conn_id (str) – connection that has the base API url i.e https://www.google.com/
For the HttpHook, you should configure the Connection by setting its Host field to the base URL of your endpoint: http://localhost:8084/.
Since your operator uses the default http_conn_id, the hook will use the Airflow Connection called "http_default" in the Airflow UI.
If you don't want to change the default one, you can create another Airflow Connection in the UI and pass its id to your operator via the http_conn_id argument.
See the source code to get a better idea how the Connection object is used.
Lastly, according to the http_operator documentation:
endpoint (str) – The relative part of the full url. (templated)
You should pass only the relative part of your URL to the operator; the rest it gets from the underlying HttpHook.
In this case, the value of endpoint for your operator should be api/employees (not the full URL).
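Putting it together, a minimal sketch (assuming a Connection named employee_api configured with Host http://localhost:8084/, and Airflow 1.10.x import paths):

import json
from datetime import datetime

from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator

dag = DAG(
    dag_id='call_employee_api',   # hypothetical DAG id
    start_date=datetime(2019, 12, 1),
    schedule_interval='@hourly',
)

t1 = SimpleHttpOperator(
    task_id='post_op',
    http_conn_id='employee_api',  # hypothetical Connection whose Host is http://localhost:8084/
    endpoint='api/employees',     # relative part only; the base URL comes from the Connection
    method='POST',
    data=json.dumps({"department": "Digital", "id": 102, "name": "Rakesh", "salary": 80000}),
    headers={"Content-Type": "application/json"},
    dag=dag,
)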
The Airflow project documentation is unfortunately not very clear in this case. Please consider contributing an improvement; they are always welcome :)

I think you need to set the connection string as an ENV variable in your Dockerfile or docker run command:
ENV AIRFLOW__CORE__SQL_ALCHEMY_CONN my_conn_string
Connections
The connection information to external systems is stored in the
Airflow metadata database and managed in the UI (Menu -> Admin ->
Connections) A conn_id is defined there and hostname / login /
password / schema information attached to it. Airflow pipelines can
simply refer to the centrally managed conn_id without having to hard
code any of this information anywhere.
Many connections with the same conn_id can be defined and when that is
the case, and when the hooks use the get_connection method from
BaseHook, Airflow will choose one connection randomly, allowing for
some basic load balancing and fault tolerance when used in conjunction
with retries.
Airflow also has the ability to reference connections via environment
variables from the operating system. The environment variable needs to
be prefixed with AIRFLOW_CONN_ to be considered a connection. When
referencing the connection in the Airflow pipeline, the conn_id should
be the name of the variable without the prefix. For example, if the
conn_id is named POSTGRES_MASTER the environment variable should be
named AIRFLOW_CONN_POSTGRES_MASTER. Airflow assumes the value returned
from the environment variable to be in a URI format
(e.g. postgres://user:password@localhost:5432/master).
Therefore you are currently using the default:
Using connection to: id: http_default. Host: https://www.google.com/
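For the HTTP connection in the question, a sketch of that approach (keeping the default http_default connection id) would be to define the connection in the Dockerfile, mirroring the ENV line above:

ENV AIRFLOW_CONN_HTTP_DEFAULT http://localhost:8084/

or at run time:

docker run -e AIRFLOW_CONN_HTTP_DEFAULT=http://localhost:8084/ puckel/docker-airflow

Note that localhost inside the container refers to the container itself, so if the API runs on the Docker host you may need the host's address instead.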

Related

kafka-python: Connection reset during recv when using SASL_SSL + SCRAM-SHA-512

I am using kafka-python to connect to a Kafka cluster using SASL:
consumer = KafkaConsumer(
    bootstrap_servers=['fooserver1:9092', 'fooserver2:9092'],
    client_id='foo',
    api_version=(2, 2, 1),
    security_protocol='SASL_SSL',
    sasl_mechanism='SCRAM-SHA-512',
    sasl_plain_username='myusername',
    sasl_plain_password='password123',
)
However I am getting the following error while connecting:
<BrokerConnection node_id=bootstrap-0 host=fooserver1:9092 <authenticating> [IPv4 ('my.ip.ad.dress', 9092)]>: Error receiving reply from server
Traceback (most recent call last):
File "/opt/python/kafka/conn.py", line 692, in _try_authenticate_scram(data_len,) = struct.unpack('>i', self._recv_bytes_blocking(4))
File "/opt/python/kafka/conn.py", line 616, in _recv_bytes_blocking raise ConnectionError('Connection reset during recv')
ConnectionError: Connection reset during recv
I have made sure that appropriate ports are open for establishing connections.
I need help in resolving this error.
This error may appear if you enter an incorrect username/password combination.
You could try verifying whether the username/password used when configuring the Kafka cluster is the same username/password you are using in kafka-python.
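As a quick sanity check, here is a minimal sketch (reusing the hypothetical broker names and credentials from the question) that forces a metadata round trip, so a wrong username/password surfaces immediately as an error:

from kafka import KafkaConsumer
from kafka.errors import KafkaError

try:
    consumer = KafkaConsumer(
        bootstrap_servers=['fooserver1:9092', 'fooserver2:9092'],
        client_id='foo',
        api_version=(2, 2, 1),
        security_protocol='SASL_SSL',
        sasl_mechanism='SCRAM-SHA-512',
        sasl_plain_username='myusername',
        sasl_plain_password='password123',
    )
    print(consumer.topics())  # forces a metadata request, i.e. a real broker round trip
except KafkaError as exc:
    print('Connection/authentication failed:', exc)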

Need help setting BROKER_URL in Airflow's config and Celery Executor

Summary
I'm using Apache-Airflow for the first time. I've gotten the webserver, SequentialExecutor and LocalExecutor to work, but I'm running into issues when using the CeleryExecutor with rabbitmq-server. I currently have two AWS EC2 instances.
Error
To summarize: My worker cannot connect to the rabbitmq-server on my scheduler node. Whenever I run airflow worker on the worker instance, it gives:
- ** ---------- [config]
- ** ---------- .> app: airflow.executors.celery_executor:0x7f53a8dce400
- ** ---------- .> transport: amqp://guest:**@localhost:5672//
- ** ---------- .> results: disabled://
- *** --- * --- .> concurrency: 16 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> default exchange=default(direct) key=default
[2019-02-15 02:26:23,742: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
Configuration
I followed all of the directions I could find online. Both instances have the same airflow.cfg file, with
[core]
executor = CeleryExecutor
[celery]
broker_url = pyamqp://username:password@hostname:port/virtual_host
and result_backend pointing at the same MySQL database on RDS that airflow is working off of.
From what I could tell, no matter what, the worker node always tried connecting to a local rabbitmq-server and completely ignored that broker_url in my airflow.cfg file.
What I've Tried
I went spelunking in the source code, and noticed in celery/app/base.py, if I error log out the configurations it gets in _get_config() when it goes to create a connection, there are actually TWO values in the dictionary returned.
BROKER_URL = None
broker_url = pyamqp://username:password@hostname:port/virtual_host
and all of the connection logic seems to point at the BROKER_URL key.
I tried setting BROKER_URL and CELERY_BROKER_URL in airflow.cfg, but it seems to be case insensitive, and ignores the latter. Just to see if it would work, I modified the _get_config() method and hacked in:
s['BROKER_URL'] = s['broker_url']
return s
And, like I expected, everything started working.
Am I doing something wrong? I'd really rather not use this hack, but I can't understand why it's ignoring the configuration values.
Thanks!
From the error message, it seems like the hostname being passed in the URI is wrong:
If rabbitmq-server and the worker are on different machines: instead of localhost/127.0.0.1, the hostname should be the IP address of the rabbitmq machine (see the sample config after this list)
If rabbitmq-server and the worker are on the same machine as part of a Docker Compose application (e.g. if you took inspiration from here): the hostname should be the service name associated to the RabbitMQ image in docker-compose.yml, e.g. amqp://guest:guest@rabbitmq:5672/
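For the first case, a sketch of the [celery] section in the worker's airflow.cfg, where 10.0.0.12 is a placeholder for the address of the machine running rabbitmq-server:

[celery]
broker_url = pyamqp://username:password@10.0.0.12:5672/virtual_host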

Apache Airflow celery executor is not getting result backend

I am running Apache Airflow version 1.9.0 and when I try to run a task from UI, I get the following error in airflow scheduler console:
[2018-05-08 12:09:06,737] {jobs.py:1077} INFO - No tasks to consider for execution.
[2018-05-08 12:09:06,738] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-08 12:09:06,738] {celery_executor.py:101} ERROR - Error syncing the celery executor, ignoring it:
[2018-05-08 12:09:06,738] {celery_executor.py:102} ERROR - No result backend configured. Please see the documentation for more information.
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/airflow/executors/celery_executor.py", line 83, in sync
state = async.state
File "/usr/local/lib/python2.7/dist-packages/celery/result.py", line 329, in state
return self.backend.get_status(self.id)
File "/usr/local/lib/python2.7/dist-packages/celery/backends/base.py", line 547, in _is_disabled
'No result backend configured. '
NotImplementedError: No result backend configured. Please see the documentation for more information.
In my airflow.cfg, I have the following variables in [celery] section:
celery_app_name = airflow.executors.celery_executor
celeryd_concurrency = 16
worker_log_server_port = 8795
broker_url = amqp://guest:guest@localhost:5672//
celery_result_backend = amqp://guest:guest@localhost:5672//
flower_host = 0.0.0.0
flower_port = 5555
default_queue = default
What am I doing wrong here?
You should not point celery_result_backend at a RabbitMQ instance, since the purpose of this backend is to store information about the status of tasks, and RabbitMQ is not the right tool for that (please correct me if I'm mistaken).
You can use Redis if you want to keep using the same instance as both broker and backend, or alternatively use Postgres as the backend, which I recommend. A sample configuration for Postgres would be the following:
celery_result_backend = db+postgresql://airflow:****@postgres/airflow
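If you prefer Redis instead (a sketch, assuming a Redis instance on localhost with the default port and the redis Python client installed), the setting would look like:

celery_result_backend = redis://localhost:6379/0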
More info in the official docs.

Datastax Cassandra Driver always attempts to connect to localhost, even though it's not configured to do so

So I have the following Client code:
def getCluster: Session = {
  import collection.JavaConversions._

  val endpoints = config.getStringList("cassandra.server")
  val keyspace = config.getString("cassandra.keyspace")

  val clusterBuilder = Cluster.builder
  endpoints.toTraversable.map { x =>
    clusterBuilder.addContactPoint(x)
  }

  val cluster = clusterBuilder.build
  cluster
    .getConfiguration
    .getProtocolOptions
    .setCompression(ProtocolOptions.Compression.LZ4)

  cluster.connect(keyspace)
}
which is shamelessly borrowed from the examples in datastax's driver documentation.
When I attempt to execute code with it, it always tries to connect to localhost, even though it's not configured for that...
In some cases, it will connect (basic reads) but for writes I get the following log message:
2016-07-07 11:34:31 DEBUG Connection:157 - Connection[/127.0.0.1:9042-10, inFlight=0, closed=false] Error connecting to /127.0.0.1:9042 (Connection refused: /127.0.0.1:9042)
2016-07-07 11:34:31 DEBUG STATES:404 - Defuncting Connection[/127.0.0.1:9042-10, inFlight=0, closed=false] because: [/127.0.0.1] Cannot connect
2016-07-07 11:34:31 DEBUG STATES:108 - [/127.0.0.1:9042] Connection[/127.0.0.1:9042-10, inFlight=0, closed=false] failed, remaining = 0
2016-07-07 11:34:31 DEBUG Connection:629 - Connection[/127.0.0.1:9042-10, inFlight=0, closed=true] closing connection
2016-07-07 11:34:31 DEBUG Cluster:1802 - Aborting onDown because a reconnection is running on DOWN host /127.0.0.1:9042
2016-07-07 11:34:31 DEBUG Cluster:1872 - Failed reconnection to /127.0.0.1:9042 ([/127.0.0.1] Cannot connect), scheduling retry in 512000 milliseconds
2016-07-07 11:34:31 DEBUG STATES:196 - [/127.0.0.1:9042] next reconnection attempt in 512000 ms
I can't figure out where/what I need to configure on the driver side (no local client, it's just the driver) to correct this issue
My guess is that this is caused by configuration of the cassandra.yaml file on your cassandra node(s). The two main settings that would impact this are broadcast_rpc_address and rpc_address, from The cassandra.yaml configuration reference:
broadcast_rpc_address
(Default: unset) RPC address to broadcast to drivers and other Cassandra nodes. This cannot be set to 0.0.0.0. If blank, it is set to the value of the rpc_address or rpc_interface. If rpc_address or rpc_interface is set to 0.0.0.0, this property must be set.
rpc_address
(Default: localhost) The listen address for client connections (Thrift RPC service and native transport).
If you leave both of these to the defaults, localhost will be the default address cassandra will communicate to connect on.
After the driver is able to connect to a contact point, it queries the system.local and system.peers tables of that contact point to determine which hosts to connect to; the addresses those tables advertise come from rpc_address/broadcast_rpc_address.
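For example, a sketch of the relevant cassandra.yaml settings on each node, where 10.0.0.21 is a placeholder for that node's routable IP:

# cassandra.yaml (per node)
rpc_address: 0.0.0.0               # listen for client connections on all interfaces
broadcast_rpc_address: 10.0.0.21   # address advertised to drivers; required when rpc_address is 0.0.0.0

After changing these you would restart the node so drivers stop being told to connect to 127.0.0.1.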

Cherrypy Daemon shutdown fails

I've followed the CherryPy daemon webapp skeleton from Deploying CherryPy (daemon), which is great, but I've got a shutdown problem.
The server in question is using port 8082. When the shutdown comes from the init.d script, it hits the webapp-cherryd equivalent and then throws errors:
XXX@mgmtdebian7:/etc/init.d# ./XXX stop
[11/Jul/2014:09:39:25] ENGINE Listening for SIGHUP.
[11/Jul/2014:09:39:25] ENGINE Listening for SIGTERM.
[11/Jul/2014:09:39:25] ENGINE Listening for SIGUSR1.
[11/Jul/2014:09:39:25] ENGINE Bus STARTING
[11/Jul/2014:09:39:25] ENGINE Started monitor thread 'Autoreloader'.
[11/Jul/2014:09:39:25] ENGINE Started monitor thread '_TimeoutMonitor'.
[11/Jul/2014:09:39:30] ENGINE Error in 'start' listener <bound method Server.start of <cherrypy._cpserver.Server object at 0xe60e10>>
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/CherryPy-3.2.4-py2.7.egg/cherrypy/process/wspbus.py", line 197, in publish
output.append(listener(*args, **kwargs))
File "/usr/local/lib/python2.7/dist-packages/CherryPy-3.2.4-py2.7.egg/cherrypy/_cpserver.py", line 151, in start
ServerAdapter.start(self)
File "/usr/local/lib/python2.7/dist-packages/CherryPy-3.2.4-py2.7.egg/cherrypy/process/servers.py", line 168, in start
wait_for_free_port(*self.bind_addr)
File "/usr/local/lib/python2.7/dist-packages/CherryPy-3.2.4-py2.7.egg/cherrypy/process/servers.py", line 412, in wait_for_free_port
raise IOError("Port %r not free on %r" % (port, host))
IOError: Port 8080 not free on '127.0.0.1'
[11/Jul/2014:09:39:30] ENGINE Shutting down due to error in start listener:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/CherryPy-3.2.4-py2.7.egg/cherrypy/process/wspbus.py", line 235, in start
self.publish('start')
File "/usr/local/lib/python2.7/dist-packages/CherryPy-3.2.4-py2.7.egg/cherrypy/process/wspbus.py", line 215, in publish
raise exc
ChannelFailures: IOError("Port 8080 not free on '127.0.0.1'",)
[11/Jul/2014:09:39:30] ENGINE Bus STOPPING
[11/Jul/2014:09:39:30] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('127.0.0.1', 8080)) already shut down
[11/Jul/2014:09:39:30] ENGINE Stopped thread '_TimeoutMonitor'.
[11/Jul/2014:09:39:30] ENGINE Stopped thread 'Autoreloader'.
[11/Jul/2014:09:39:30] ENGINE Bus STOPPED
[11/Jul/2014:09:39:30] ENGINE Bus EXITING
[11/Jul/2014:09:39:30] ENGINE Bus EXITED
XXX@mgmtdebian7:/etc/init.d#
From the surfing I've done thus far, I believe that the service is trying to restart in response to the SIGHUP signal and that it's picking up a default port of 8080 (which isn't, and shouldn't be, free), and therefore failing.
This leaves the service running, which is not what I want.
By the way, my config that sets the port to 8082 is inside the module I load, rather than in a config file.
Thanks in advance for any pointers.
As the log clearly states, CherryPy is attempting to start on 127.0.0.1:8080 and fails after 5 seconds. So you actually have a start-up failure, and it is likely because, as you say, your config that sets the port to 8082 is inside the module you load rather than in a config file: the port isn't being set correctly, so CherryPy falls back to the default 8080.
Also, I'd like to note that you shouldn't use the Autoreloader in production. The cherrypy-webapp-skeleton init.d script sets the production CherryPy environment, which has the Autoreloader off.
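For example, a minimal sketch (adjust to however your module builds the app) of setting the port explicitly in code before the engine starts:

import cherrypy

# Must run before cherrypy.engine.start(); the 'production' environment also turns the Autoreloader off.
cherrypy.config.update({
    'server.socket_host': '127.0.0.1',
    'server.socket_port': 8082,
    'environment': 'production',
})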
So when you do
cherrypy.tree.graft(app, "/")
you are by default creating a server instance on localhost:8080, but it doesn't actually get started until you call cherrypy.engine.start().
You are probably doing something like this:
cherrypy.tree.graft(app, "/") # registers a server on localhost:8080
server = cherrypy._cpserver.Server() # registers a second server...
server.socket_host="0.0.0.0" # ..on 0.0.0.0 ...
server.socket_port = 5002 # ..with port 5002
server.thread_pool = 10
server.subscribe()
cherrypy.engine.start() #starts two server instances
cherrypy.engine.block()
will cause CherryPy to start two server instances, one on localhost:8080 and another on 0.0.0.0:5002.
The answer is to instead do:
cherrypy.tree.graft(app, "/")
cherrypy.server.unsubscribe() # very important gets rid of default.
server = cherrypy._cpserver.Server()
server.socket_host="0.0.0.0"
server.socket_port = 5002
server.thread_pool = 10
server.subscribe()
cherrypy.engine.start() #now starts only one server instance
cherrypy.engine.block()
Your problem above is that you are starting two servers, and only one of them is falling over/erroring/closing, so port 8080 is still bound on localhost and prevents a restart.