Argo Events Kafka triggers cannot parse message headers to enable distributed tracing - apache-kafka

TL;DR - Argo Events Kafka eventsource triggers do not currently parse the headers of consumed Kafka messages, which is needed to enable distributed tracing. I submitted a feature request (here) - if you face the same problem, please upvote, and I'm curious whether anyone has figured out a workaround.
====================================
Context
A common pattern in the Argo Workflows we deploy is Kafka event-driven, asynchronous distributed workloads, e.g.:
a service "A" Kafka producer emits a message to a topic
an Argo Events Kafka eventsource trigger listens to that topic
an Argo Workflow gets triggered and does its processing...
... a service "B" Kafka producer at the end of the workflow emits a message that the work is done.
To monitor the entire system for user-centric metrics ("how long did it take & where are the bottlenecks"), I'm looking to instrument distributed tracing from service "A" to service "B". We use Datadog as the aggregator, with dd-trace.
The pattern I've seen is manual propagation of trace context via Kafka headers: the producer injects headers into the Kafka message before emitting (similar to HTTP headers, carrying parent trace metadata), and the receiving consumer, once done processing the message, adds a child_span to the parent_span received from upstream.
Example of the above: https://newrelic.com/blog/how-to-relic/distributed-tracing-with-kafka
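For concreteness, here is a minimal sketch of that pattern, assuming confluent-kafka and dd-trace (ddtrace) on both sides; the topic name, span names, and broker address are placeholders:

# Sketch only: manual trace-context propagation over Kafka headers.
from confluent_kafka import Producer
from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator

producer = Producer({"bootstrap.servers": "localhost:9092"})

def emit_with_trace(payload: bytes) -> None:
    # Service "A": open a span and inject its context into the message headers.
    with tracer.trace("service-a.emit") as span:
        headers = {}
        HTTPPropagator.inject(span.context, headers)
        producer.produce("my-topic", value=payload, headers=list(headers.items()))
        producer.flush()

def process_with_trace(msg) -> None:
    # Downstream consumer: extract the parent context from the headers
    # and attach a child span to it.
    headers = {k: v.decode() for k, v in (msg.headers() or [])}
    parent_ctx = HTTPPropagator.extract(headers)
    tracer.context_provider.activate(parent_ctx)
    with tracer.trace("service-b.process"):
        ...  # process msg.value()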
Issue
The Argo Events Kafka event source does not parse any headers; it only passes the message body JSON for the downstream Workflow to use at eventData.Body.
[source code]
Simplified view of my Argo EventSource -> Sensor trigger -> Workflow:
# eventsource/my-kafka-eventsource.yaml
apiVersion: argoproj.io/v1alpha1
kind: EventSource
spec:
  kafka:
    my-kafka-eventsource:
      topic: <my-topic>
      version: "2.5.0"
# sensors/trigger-my-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
spec:
  dependencies:
    - name: my-kafka-eventsource-dep
      eventSourceName: my-kafka-eventsource
      eventName: my-kafka-eventsource
  triggers:
    - template:
        name: start-my-workflow
        k8s:
          operation: create
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              spec:
                entrypoint: my-sick-workflow
                arguments:
                  parameters:
                    - name: proto_message
                      value: needs to be overridden
                    # I would like to be able to add this
                    - name: msg_headers
                      value: needs to be overridden
                templates:
                  - name: my-sick-workflow
                    dag:
                      tasks:
                        - name: my-sick-workflow
                          templateRef:
                            name: my-sick-workflow
                            template: my-sick-workflow
          parameters:
            # content/body of consumed message
            - src:
                dependencyName: my-kafka-eventsource-dep
                dataKey: body
              dest: spec.arguments.parameters.0.value
            # I would like to do this - get msg.headers() if it exists.
            - src:
                dependencyName: my-kafka-eventsource-dep
                dataKey: headers
              dest: spec.arguments.parameters.1.value
# templates/my-sick-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
spec:
  templates:
    - name: my-sick-workflow
      container:
        image: <image>
        command: [ "python", "/main.py" ]
        # I want to add the 2nd arg - msg_headers - here
        args: [ "{{workflow.parameters.proto_message}}", "{{workflow.parameters.msg_headers}}" ]
        # so that in my Workflow DAG step source code,
        # I can access headers of the Kafka msg from upstream via
        # body=sys.argv[1], headers=sys.argv[2]
Confluent-Kafka API docs on accessing message headers: [doc]
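To illustrate what the DAG step would do with that second argument, here is a hypothetical /main.py (not existing code): it assumes the headers would arrive as a JSON object and that dd-trace is used to re-attach the parent context.

import json
import sys

from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator

def main() -> None:
    body = json.loads(sys.argv[1])     # {{workflow.parameters.proto_message}}
    headers = json.loads(sys.argv[2])  # {{workflow.parameters.msg_headers}} - hypothetical
    # Rebuild the parent trace context that the upstream producer injected.
    parent_ctx = HTTPPropagator.extract(headers)
    tracer.context_provider.activate(parent_ctx)
    with tracer.trace("my-sick-workflow.step"):
        ...  # actual processing of `body`

if __name__ == "__main__":
    main()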
Q's
Has anyone found a workaround for passing tracing context from an upstream to a downstream service when that context travels between a Kafka producer and Argo Events?
I considered changing my Argo Workflows sensor trigger to an HTTP trigger accepting payloads: a new Kafka consumer would listen for the message that currently triggers my Argo Workflow, then forward an HTTP payload with the parent trace metadata in its headers.
That's an anti-pattern relative to the rest of my workflows, though, so I would like to avoid it if there's a simpler solution.

As you pointed out, the only real workaround, short of forking some part of Argo Events or implementing your own source/sensor, would be to use a Kafka consumer (or Kafka Connect) and call a Webhook EventSource (or another event source type that can extract the information you need).
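A minimal sketch of that relay, assuming confluent-kafka and a Webhook EventSource reachable at a placeholder URL; the Sensor could then map the forwarded body and headers into Workflow parameters:

# Sketch only: relay consumer that forwards the Kafka message body + headers
# (including the trace context) to an Argo Events Webhook EventSource.
import urllib.request

from confluent_kafka import Consumer

WEBHOOK_URL = "http://webhook-eventsource-svc.argo-events.svc:12000/start-pipeline"  # placeholder

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "argo-relay",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["my-topic"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Lift the Kafka headers into HTTP headers so the Sensor can use them
    # alongside the body.
    kafka_headers = {k: v.decode() for k, v in (msg.headers() or [])}
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=msg.value(),
        headers={"Content-Type": "application/json", **kafka_headers},
        method="POST",
    )
    urllib.request.urlopen(req)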

Related

Unable to update or delete existing argo events sensor and EventSource

I'm experiencing an issue while modifying or deleting an existing Argo Events sensor.
Tried to modify a sensor
I tried to apply changes to an existing sensor, but the new changes are not taking effect: when it gets triggered, it still uses the old triggers.
Tried to delete a sensor
Unable to delete: kubectl delete hangs forever. The only way out is to delete the whole namespace.
Using:
Argo-events version - v1.7.5
Kubernetes - v1.24.4+k3s1 (testing locally - docker-desktop with k3d)
Since deleting everything and redoing it is not an option in a production environment, I'd like to know whether this is a known issue with argo-events or whether I'm doing something wrong.
As of release v1.7.5, there is a bug in the default Sensor & EventSource Kubernetes resource YAML values.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  ....
  finalizers:
    - sensor-controller
  ....
It has finalizers set to sensor-controller.
In v1.7.0+, the argo-events team merged the sensor controller & event source controller into argo-events-controller-manager.
I believe the Sensor and EventSource resources are pointing to the wrong controller; they should ideally be pointing to argo-events-controller.
To resolve this issue until the bug is fixed in the argo-events Kubernetes charts:
Update your Sensor & EventSource definitions to set finalizers to an empty array.
# example sensor with empty finalizers
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: minio
  finalizers: [] # <-- this one
spec:
  dependencies:
    - name: test-dep
      eventSourceName: minio
      eventName: example
  triggers:
    - template:
        name: http-trigger
        http:
          url: http://http-server.argo-events.svc:8090/hello
          payload:
            - src:
                dependencyName: test-dep
                dataKey: notification.0.s3.bucket.name
              dest: bucket
            - src:
                dependencyName: test-dep
                contextKey: type
              dest: type
          method: POST
      retryStrategy:
        steps: 3
        duration: 3s
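If a Sensor is already stuck terminating because of the leftover finalizer, one way to unblock it is to patch the finalizers off the live object. A rough sketch using the official kubernetes Python client; the namespace and name are placeholders:

# Sketch only: clear the stale finalizer so the pending delete can complete.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

api.patch_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="argo-events",   # placeholder
    plural="sensors",
    name="minio",              # placeholder
    body={"metadata": {"finalizers": []}},
)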

KNative service not handling requests concurrently when using eventing with KafkaSource

Summary:
I'm trying to use KNative eventing to expose a simple web application via a Kafka topic. The server should be able to handle multiple requests concurrently, but, unfortunately, it seems to handle them sequentially when I send them via Kafka. When making simple HTTP requests directly to the service, though, the concurrency is working fine.
Setup:
The setup only uses a KafkaSource that points to my Knative Service, with a Kafka instance deployed via the bitnami/kafka Helm chart.
The version I'm using is v1.7.1 for KNative serving and eventing, and v1.7.0 for the Kafka eventing integration (from knative-sandbox/eventing-kafka).
Code:
The service I'm trying to deploy is a Python FastAPI application that, upon receiving a request (with an ID of sorts), logs the received request, sleeps for 5 seconds, then returns a dummy message:
import asyncio
import logging

from fastapi import FastAPI
from pydantic import BaseModel

logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    level=logging.DEBUG, datefmt="%Y-%m-%d %H:%M:%S",
)

app = FastAPI()

class Item(BaseModel):
    id: str

@app.post("/")
async def root(item: Item):
    logging.debug(f"Request received with ID: {item.id}")
    await asyncio.sleep(5)
    logging.debug(f"Request complete for ID: {item.id}")
    return {"message": "Hello World"}
The app is served using uvicorn:
FROM python:3.9-slim
RUN pip install fastapi uvicorn
ADD main.py .
ENTRYPOINT uvicorn --host 0.0.0.0 --port 8877 main:app
The service deployment spec shows that I'm setting a containerConcurrency value that's greater than 1:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: concurrency-test
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "5"
    spec:
      containerConcurrency: 5
      containers:
        - name: app
          image: dev.local/concurrency-test:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8877
---
apiVersion: sources.knative.dev/v1beta1
kind: KafkaSource
metadata:
  name: concurrency-test
spec:
  consumerGroup: concurrency-test-group
  bootstrapServers:
    - kafka.default.svc.cluster.local:9092
  topics:
    - concurrency-test-requests
  sink:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: concurrency-test
Note: I also tried with spec.consumers: 2 in the KafkaSource but the behavior was the same.
Logs:
When sending two concurrent requests to the service directly over HTTP, the logs look like this (both requests finish within 6 seconds, so concurrency is in effect):
2022-09-12 02:14:36 DEBUG Request received with ID: abc
2022-09-12 02:14:37 DEBUG Request received with ID: def
2022-09-12 02:14:41 DEBUG Request complete for ID: abc
INFO: 10.42.0.7:0 - "POST / HTTP/1.1" 200 OK
2022-09-12 02:14:42 DEBUG Request complete for ID: def
INFO: 10.42.0.7:0 - "POST / HTTP/1.1" 200 OK
When sending requests via Kafka, though, the logs look like this (the requests are being processed one after the other):
2022-09-12 02:14:55 DEBUG Request received with ID: 111
2022-09-12 02:15:00 DEBUG Request complete for ID: 111
INFO: 10.42.0.7:0 - "POST / HTTP/1.1" 200 OK
2022-09-12 02:15:00 DEBUG Request received with ID: 222
2022-09-12 02:15:05 DEBUG Request complete for ID: 222
INFO: 10.42.0.7:0 - "POST / HTTP/1.1" 200 OK
Please let me know whether this sequential request handling is the expected behavior when using eventing with just a KafkaSource; I hope there are ways of enabling concurrency in this setup.
Kafka provides ordering within a partition (the implementation is a distributed log). You may need to increase the number of partitions on your Kafka topic to achieve higher parallelism; you may also be able to use the spec.consumers value to increase throughput (untested).
I'd also encourage filing an issue in the eventing-kafka repo with your problem and any additional knobs you'd like, if there is other behavior you're looking for.
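For the partition-count suggestion, a rough sketch using confluent-kafka's admin client; the broker address and topic name are taken from the question, and the target count of 5 is chosen here only to match containerConcurrency:

# Sketch only: grow the topic so the KafkaSource can dispatch in parallel.
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "kafka.default.svc.cluster.local:9092"})

# Increase "concurrency-test-requests" to 5 partitions in total.
futures = admin.create_partitions([NewPartitions("concurrency-test-requests", 5)])
for topic, future in futures.items():
    future.result()  # raises if the broker rejected the change
    print(f"{topic} partition count increased")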

Argo Event webhook authentication with Github

I'm trying to integrate a GitHub repo with the Argo Events webhook event source, following the example (link). When the configured webhook is called from the GitHub event, it returns an error:
'Invalid Authorization Header'.
Code:
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: ci-pipeline-webhook
spec:
  service:
    ports:
      - port: 12000
        targetPort: 12000
  webhook:
    start-pipeline:
      port: "12000"
      endpoint: /start-pipeline
      method: POST
      authSecret:
        name: my-webhook-token
        key: my-token
If you want to use a secure GitHub webhook as an event source, you will need to use the GitHub event source type. GitHub webhooks send a special authorization header, X-Hub-Signature/X-Hub-Signature-256, which contains an HMAC hash of the request payload keyed by the webhook secret. The "regular" webhook event source instead expects a standard Bearer token in an authorization header of the form "Authorization: Bearer <webhook-secret>".
You can read more about GitHub webhook delivery headers here. You can then compare that to the Argo Events webhook event source authentication documentation here.
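For reference, this is roughly how an X-Hub-Signature-256 header is validated, which is why a plain Bearer-token check cannot accept it (a sketch, not Argo Events' actual code):

import hashlib
import hmac

def verify_github_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    # GitHub sends "sha256=<hex HMAC of the request body, keyed by the webhook secret>".
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)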
There are basically two options when creating the GitHub webhook event source.
Provide GitHub API credentials in a Kubernetes secret so Argo Events can make the API call to GitHub to create the webhook on your behalf.
Omit the GitHub API credentials in the EventSource spec and create the webhook yourself either manually or through whichever means you normally create a webhook (Terraform, scripted API calls, etc).
Here is an example for the second option:
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: github-events
  namespace: my-namespace
spec:
  service:
    ports:
      - name: http
        port: 12000
        targetPort: 12000
  github:
    default:
      owner: my-github-org-or-username
      repository: my-github-repo-name
      webhook:
        url: https://my-argo-events-server-fqdn
        endpoint: /push
        port: "12000"
        method: POST
      events:
        - "*"
      webhookSecret:
        name: my-secret-name
        key: my-secret-key
      insecure: false
      active: true
      contentType: "json"

Knative Eventing Dead Letter Sink Not Triggered

I've got sequences working in my Knative environment. We're trying to configure and confirm that the DLQ/dead letter sink works so we can write tests and such against sequences. I can't for the life of me get Knative to send anything to the dead letter sink. I've approached this two ways. The first was setting up a Broker, Trigger, Services and a Sequence; in the Broker I defined a service to use for the DLQ, and I then set up a service in the sequence to intentionally return a non-200 status. When I view the logs for the channel dispatcher in the knative-eventing namespace, I believe what I read is that it thinks there was a failure.
I read some things about the default MT broker maybe not handling the DLQ correctly, so then I installed Kafka. Got that all working, and essentially it appears to do the same thing.
I started to wonder: OK, maybe within a sequence you can't do DLQ. After all, the documentation only talks about DLQ with subscriptions and brokers, and maybe Knative believes that the message was successfully delivered from the broker to the sequence, even if it dies within the sequence. So I manually set up a channel and a subscription and sent the data straight to the channel, and again what I got was essentially the same thing, which is:
The sequence will stop on whatever step doesn't return a 2XX status code, but nothing gets sent to the DLQ. I even made the subscription go straight to the service (instead of a sequence), and that service returned a 500, and still nothing went to the DLQ.
The log item below is from the channel dispatcher pod running in the knative-eventing namespace. It basically looks the same with the in-memory channel or Kafka, i.e. expected 2xx, got 500.
{"level":"info","ts":"2021-11-30T16:01:05.313Z","logger":"kafkachannel-dispatcher","caller":"consumer/consumer_handler.go:162","msg":"Failure while handling a message","knative.dev/pod":"kafka-ch-dispatcher-5bb8f84976-rpd87","knative.dev/controller":"knative.dev.eventing-kafka.pkg.channel.consolidated.reconciler.dispatcher.Reconciler","knative.dev/kind":"messaging.knative.dev.KafkaChannel","knative.dev/traceid":"957c394a-1636-44ad-b024-fb0dde9c8440","knative.dev/key":"kafka/test-sequence-kn-sequence-0","topic":"knative-messaging-kafka.kafka.test-sequence-kn-sequence-0","partition":0,"offset":4,"error":"unable to complete request to http://cetf.kafka.svc.cluster.local: unexpected HTTP response, expected 2xx, got 500"}
{"level":"warn","ts":"2021-11-30T16:01:05.313Z","logger":"kafkachannel-dispatcher","caller":"dispatcher/dispatcher.go:314","msg":"Error in consumer group","knative.dev/pod":"kafka-ch-dispatcher-5bb8f84976-rpd87","error":"unable to complete request to http://cetf.kafka.svc.cluster.local: unexpected HTTP response, expected 2xx, got 500"}
Notes on setup: I deployed literally everything to the same namespace for testing. I essentially followed the guide here to set up my broker when doing the broker/trigger and for deploying Kafka. My broker looked like this:
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  annotations:
    # case-sensitive
    eventing.knative.dev/broker.class: Kafka
  name: default
  namespace: kafka
spec:
  # Configuration specific to this broker.
  config:
    apiVersion: v1
    kind: ConfigMap
    name: kafka-broker-config
    namespace: knative-eventing
  # Optional dead letter sink, you can specify either:
  # - deadLetterSink.ref, which is a reference to a Callable
  # - deadLetterSink.uri, which is an absolute URI to a Callable (it can potentially be
  #   out of the Kubernetes cluster)
  delivery:
    deadLetterSink:
      ref:
        apiVersion: serving.knative.dev/v1
        kind: Service
        name: dlq
        namespace: kafka
When I manually created the subscription and channel my subscription looked like this:
apiVersion: messaging.knative.dev/v1
kind: Subscription
metadata:
  name: test-sub # Name of the Subscription.
  namespace: kafka
spec:
  channel:
    apiVersion: messaging.knative.dev/v1beta1
    kind: KafkaChannel
    name: test-channel
  delivery:
    deadLetterSink:
      backoffDelay: PT1S
      backoffPolicy: linear
      retry: 1
      ref:
        apiVersion: serving.knative.dev/v1
        kind: Service
        name: dlq
        namespace: kafka
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: cetf
No matter what I do, I NEVER see the dlq pod spin up. I've adjusted the retry settings, waited and waited, used the default channel/broker, Kafka, etc. I simply cannot see the pod ever run. Is there something I'm missing? What on earth could be wrong? I can set the subscriber to be a junk URI and then the DLQ pod spins up, but shouldn't it also spin up if the service it sends events to returns error codes?
Can anyone provide a couple of very basic YAML files to deploy the simplest version of a working DLQ to test with?
There was some issue with dead letter sinks not being propagated in pre-GA releases. Can you make sure you are using Knative 1.0?
This is working for me as expected using the in-memory channel:
https://gist.github.com/odacremolbap/f6ce029caf4fa6fbb3cc1e829f188788
curl producing cloud events to a broker
broker with DLS configured to an event-display
event display service as DLS receiver
trigger from broker to a replier service
replier service (can ack and nack depending on the incoming event)
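For completeness, the replier in that last item can be as simple as the sketch below (my own minimal version, not the gist's code): any non-2xx response should count as a failed delivery and, after retries are exhausted, route the event to the dead letter sink.

from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.post("/")
async def nack(request: Request) -> Response:
    body = await request.body()
    print(f"Rejecting event: {body!r}")
    # Returning a non-2xx status tells the channel/broker the delivery failed.
    return Response(status_code=500)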
I never found an example of this in the docs, but the API docs for SequenceStep do show a delivery property, which, when assigned, uses the DLQ.
steps:
  - ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: service-step
    delivery:
      # DLS to an event-display service
      deadLetterSink:
        ref:
          apiVersion: serving.knative.dev/v1
          kind: Service
          name: dlq-service
          namespace: ns-name
It seems odd to have to specify a delivery for EVERY step and not just the sequence as a whole.

Kafka producer and consumer on different Kubernetes clusters

Would Kafka need to be installed on the consumer cluster?
Presently, the same-cluster YAML configuration is:
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: sample-topic
spec:
  type: bindings.kafka
  version: v1
  metadata:
    # Kafka broker connection setting
    - name: brokers
      value: dapr-kafka.kafka:9092
    # consumer configuration: topic and consumer group
    - name: topics
      value: sample
    - name: consumerGroup
      value: group1
    # publisher configuration: topic
    - name: publishTopic
      value: sample
    - name: authRequired
      value: "false"
On different clusters, does each cluster require only either "name: publishTopic" or "name: consumerGroup" and not the other?
I'm not familiar with Dapr, but Kafka does not need to be installed in k8s, or in any specific location. Your only requirement should be client connectivity to that bootstrap servers list.
According to the Kafka Binding spec, consumerGroup is for incoming events and publishTopic is for outgoing events, so they are two different use cases, although one app should be able to have both event types. If the app only uses incoming or outgoing events, then use the appropriate binding configuration for that case.
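To make the split concrete, here is a hedged sketch of how each side typically uses the binding, assuming the Dapr Python SDK and the Component above; the producer cluster only needs the broker and publishTopic settings, while the consumer cluster only needs topics/consumerGroup:

# Sketch only (assumes the dapr Python SDK and FastAPI are available).
from dapr.clients import DaprClient
from fastapi import FastAPI, Request

# Producer cluster: publish through the output binding named "sample-topic".
def publish(payload: bytes) -> None:
    with DaprClient() as d:
        d.invoke_binding(binding_name="sample-topic", operation="create", data=payload)

# Consumer cluster: Dapr delivers messages from the topic by POSTing to a route
# named after the binding component.
app = FastAPI()

@app.post("/sample-topic")
async def on_message(request: Request):
    body = await request.body()
    print(f"received: {body!r}")
    return {}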