KNative service not handling requests concurrently when using eventing with KafkaSource - kubernetes

Summary:
I'm trying to use KNative eventing to expose a simple web application via a Kafka topic. The server should be able to handle multiple requests concurrently, but, unfortunately, it seems to handle them sequentially when I send them via Kafka. When making simple HTTP requests directly to the service, though, the concurrency is working fine.
Setup:
The setup uses only a KafkaSource pointing to my Knative Service, with a Kafka instance deployed using the bitnami/kafka Helm chart.
I'm using v1.7.1 for Knative Serving and Eventing, and v1.7.0 for the Kafka eventing integration (from knative-sandbox/eventing-kafka).
Code:
The service I'm trying to deploy is a Python FastAPI application that, upon receiving a request (with an ID of sorts), logs the received request, sleeps for 5 seconds, then returns a dummy message:
import asyncio
import logging

from fastapi import FastAPI
from pydantic import BaseModel

logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(message)s",
    level=logging.DEBUG,
    datefmt="%Y-%m-%d %H:%M:%S",
)

app = FastAPI()


class Item(BaseModel):
    id: str


@app.post("/")
async def root(item: Item):
    logging.debug(f"Request received with ID: {item.id}")
    await asyncio.sleep(5)
    logging.debug(f"Request complete for ID: {item.id}")
    return {"message": "Hello World"}
The app is served using uvicorn:
FROM python:3.9-slim
RUN pip install fastapi uvicorn
ADD main.py .
ENTRYPOINT uvicorn --host 0.0.0.0 --port 8877 main:app
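As a quick sanity check that the async handler itself permits overlap (so any serialization must come from upstream), two awaited sleeps run concurrently finish in roughly one sleep's duration. A minimal sketch of the same pattern with shortened sleeps (plain asyncio, not the deployed service):

```python
import asyncio
import time

async def handle(request_id: str, delay: float) -> str:
    # Mirrors the service's handler: yield the event loop while "working".
    await asyncio.sleep(delay)
    return f"done {request_id}"

async def main():
    start = time.monotonic()
    # Two concurrent "requests"; with a blocking sleep this would take 2x delay.
    results = await asyncio.gather(handle("abc", 0.2), handle("def", 0.2))
    elapsed = time.monotonic() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")  # both finish in ~0.2s, not 0.4s
```

This is why the direct-HTTP test in the logs below shows overlapping requests: uvicorn runs the async handlers on one event loop, and `await asyncio.sleep` does not block it.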
The service deployment spec shows that I'm setting a containerConcurrency value that's greater than 1:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: concurrency-test
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "5"
    spec:
      containerConcurrency: 5
      containers:
        - name: app
          image: dev.local/concurrency-test:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8877
---
apiVersion: sources.knative.dev/v1beta1
kind: KafkaSource
metadata:
  name: concurrency-test
spec:
  consumerGroup: concurrency-test-group
  bootstrapServers:
    - kafka.default.svc.cluster.local:9092
  topics:
    - concurrency-test-requests
  sink:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: concurrency-test
Note: I also tried with spec.consumers: 2 in the KafkaSource but the behavior was the same.
Logs:
When sending two concurrent requests to the service directly with HTTP, the logs look like this (both requests finish within 6 seconds, so concurrency is in effect):
2022-09-12 02:14:36 DEBUG Request received with ID: abc
2022-09-12 02:14:37 DEBUG Request received with ID: def
2022-09-12 02:14:41 DEBUG Request complete for ID: abc
INFO: 10.42.0.7:0 - "POST / HTTP/1.1" 200 OK
2022-09-12 02:14:42 DEBUG Request complete for ID: def
INFO: 10.42.0.7:0 - "POST / HTTP/1.1" 200 OK
When sending requests via Kafka, though, the logs look like this (the requests are being processed one after the other):
2022-09-12 02:14:55 DEBUG Request received with ID: 111
2022-09-12 02:15:00 DEBUG Request complete for ID: 111
INFO: 10.42.0.7:0 - "POST / HTTP/1.1" 200 OK
2022-09-12 02:15:00 DEBUG Request received with ID: 222
2022-09-12 02:15:05 DEBUG Request complete for ID: 222
INFO: 10.42.0.7:0 - "POST / HTTP/1.1" 200 OK
Is this sequential request handling the expected behavior when using eventing with just a KafkaSource? I hope there are ways to enable concurrency in this setup.

Kafka provides ordering within a partition (the underlying implementation is a distributed log). You may need to increase the number of partitions on your Kafka topic to achieve higher parallelism; you may also be able to use the spec.consumers value to increase throughput (untested).
I'd also encourage filing an issue in the eventing-kafka repo describing your problem and any additional knobs you'd want, if there is other behavior you're looking for.
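To illustrate why a single partition serializes delivery: a keyed record is assigned to `hash(key) % num_partitions`, so with one partition every record lands in the same strictly ordered log and the source dispatches them one at a time. A toy sketch of that assignment (plain Python with an md5 stand-in, not the murmur2 hash real Kafka clients use; key names are made up):

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Toy stand-in for Kafka's default partitioner: stable hash mod
    partition count. Any stable hash shows the same effect."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

keys = [b"req-111", b"req-222", b"req-333", b"req-444"]

# With a single partition, every record maps to partition 0 -> one ordered
# log, so messages are delivered to the sink strictly one after another.
print({k: partition_for(k, 1) for k in keys})  # every value is 0

# With 5 partitions the records spread out and can be consumed in parallel.
print({k: partition_for(k, 5) for k in keys})
```

With more partitions on `concurrency-test-requests`, independent keys can be consumed (and dispatched to the Knative Service) in parallel, which is where `containerConcurrency` starts to matter.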

Related

Argo Events Kafka triggers cannot parse message headers to enable distributed tracing

TL;DR - Argo Events Kafka eventsource triggers do not currently parse the headers of consumed Kafka messages, which is needed to enable distributed tracing. I submitted a feature request (here) - if you face the same problem please upvote, and I'm curious whether anyone has figured out a workaround.
====================================
Context
Common pattern of Argo Workflows we deploy are Kafka event-driven, asynchronous distributed workloads, e.g.:
Service "A" Kafka producer that emits message to topic
Argo Events eventsource Kafka trigger listening to that topic
Argo Workflow gets triggered, and post-processing...
... service "B" Kafka producer at end of workflow emits that work is done.
To monitor the entire system for user-centric metrics ("how long did it take & where are the bottlenecks?"), I'm looking to instrument distributed tracing from service "A" to service "B". We use Datadog as the aggregator, with dd-trace.
The pattern I've seen is manual propagation of the trace context via Kafka headers: inject headers (with the parent trace metadata, similar to HTTP headers) into Kafka messages before emitting, and once the receiving consumer is done processing the message, it adds a child_span to the parent_span received from upstream.
ex) of above: https://newrelic.com/blog/how-to-relic/distributed-tracing-with-kafka
Issue
The Argo Events Kafka eventsource trigger does not parse any headers; it only passes the JSON body downstream for the Workflow to use, at eventData.Body.
[source code]
Simplified views of my Argo Eventsource -> Trigger -> Workflow:
# eventsource/my-kafka-eventsource.yaml
apiVersion: argoproj.io/v1alpha1
kind: EventSource
spec:
  kafka:
    my-kafka-eventsource:
      topic: <my-topic>
      version: "2.5.0"
# sensors/trigger-my-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
spec:
  dependencies:
    - name: my-kafka-eventsource-dep
      eventSourceName: my-kafka-eventsource
      eventName: my-kafka-eventsource
  triggers:
    - template:
        name: start-my-workflow
        k8s:
          operation: create
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              spec:
                entrypoint: my-sick-workflow
                arguments:
                  parameters:
                    - name: proto_message
                      value: needs to be overridden
                    # I would like to be able to add this
                    - name: msg_headers
                      value: needs to be overridden
                templates:
                  - name: my-sick-workflow
                    dag:
                      tasks:
                        - name: my-sick-workflow
                          templateRef:
                            name: my-sick-workflow
                            template: my-sick-workflow
          parameters:
            # content/body of consumed message
            - src:
                dependencyName: my-kafka-eventsource-dep
                dataKey: body
              dest: spec.arguments.parameters.0.value
            # I would like to do this - get msg.headers() if it exists.
            - src:
                dependencyName: my-kafka-eventsource-dep
                dataKey: headers
              dest: spec.arguments.parameters.1.value
# templates/my-sick-workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
spec:
  templates:
    - name: my-sick-workflow
      container:
        image: <image>
        command: [ "python", "/main.py" ]
        # I want to add the 2nd arg - msg_headers - here
        args: [ "{{workflow.parameters.proto_message}}", "{{workflow.parameters.msg_headers}}" ]
        # so that in my Workflow DAG step source code,
        # I can access headers of the Kafka msg from upstream via
        # body=sys.argv[1], headers=sys.argv[2]
Confluent-Kafka API docs on accessing message headers: [doc]
Q's
Has anyone found a workaround on passing tracing context from upstream to downstream service that travels between Kafka Producer<>Argo Events?
I considered changing my Argo Workflows sensor trigger to an HTTP trigger accepting payloads: a new Kafka consumer listens for the message that currently triggers my Argo Workflow, then forwards an HTTP payload with the parent trace metadata in the headers.
But that's an anti-pattern relative to the rest of my workflows, so I would like to avoid it if there's a simpler solution.
As you pointed out, the only real workaround, short of forking some part of Argo Events or implementing your own Source/Sensor, would be to use a Kafka consumer (or Kafka Connect) and call a WebHook EventSource (or another one that can extract the information you need).
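The header-handling half of such a bridge can be sketched as below (plain Python with stdlib HTTP; the function names, the webhook URL, and the Datadog header keys are illustrative assumptions). Kafka client libraries generally expose record headers as a list of (key, bytes) pairs, which need flattening into HTTP headers before the POST to the WebHook EventSource:

```python
import urllib.request

def kafka_headers_to_http(headers) -> dict:
    """Flatten Kafka record headers ([(key, bytes-or-None), ...]) into an
    HTTP header dict, decoding values and keeping the last value per key."""
    result = {}
    for key, value in headers or []:
        result[key] = value.decode("utf-8") if value is not None else ""
    return result

def forward(webhook_url: str, body_bytes: bytes, kafka_headers) -> object:
    """Forward one consumed Kafka message to a WebHook EventSource,
    propagating the trace headers alongside the original body."""
    req = urllib.request.Request(
        webhook_url,
        data=body_bytes,
        headers={"Content-Type": "application/json",
                 **kafka_headers_to_http(kafka_headers)},
        method="POST",
    )
    return urllib.request.urlopen(req)

# Header flattening with hypothetical Datadog-style trace keys:
print(kafka_headers_to_http([("x-datadog-trace-id", b"12345"),
                             ("x-datadog-parent-id", b"67890")]))
# {'x-datadog-trace-id': '12345', 'x-datadog-parent-id': '67890'}
```

The consume loop itself would come from whichever Kafka client you already run; the point is only that the trace context survives the hop as HTTP headers, which the WebHook EventSource can then expose to the Sensor.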

I can't find a way to provide a "graceful shutdown" in Nest microservices, in particular using NATS

Hello everyone, I can't find a way to provide a "graceful shutdown" in Nest microservices, in particular using NATS.
Expected behavior:
The application in Kubernetes receives a 'SIGTERM' signal.
It stops listening for new incoming requests.
Already-accepted requests are served to completion and responses are returned.
The application closes all connections and shuts down.
You can use the Kubernetes spec to set the graceful shutdown period for the pod: terminationGracePeriodSeconds.
The default value of terminationGracePeriodSeconds is 30 seconds.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - name: test
          image: ...
      terminationGracePeriodSeconds: 60
Read the best practices: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace
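The shutdown sequence the question describes can be sketched generically as below (plain Python for illustration, not Nest-specific; in Nest the equivalent is `app.enableShutdownHooks()` plus the `onApplicationShutdown` lifecycle hook). All of it has to fit inside the pod's terminationGracePeriodSeconds budget:

```python
import signal
import threading

shutting_down = threading.Event()

def handle_sigterm(signum, frame):
    # Step 1: Kubernetes sent SIGTERM; flag that shutdown has begun.
    shutting_down.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def accept_request() -> bool:
    """Step 2: refuse new work once SIGTERM has been received, so the
    load balancer's remaining in-flight requests can drain."""
    return not shutting_down.is_set()

# Steps 3-4 would wait for in-flight requests to complete, then close
# NATS/database connections and exit, before the grace period expires.
```

The key point is that the SIGTERM handler only flips a flag; the actual draining and connection teardown happen in normal application code afterwards.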

Knative Eventing Dead Letter Sink Not Triggered

I've got sequences working in my Knative environment. We're trying to configure and confirm that the DLQ/Dead Letter Sink works, so we can write tests and things against sequences. I can't for the life of me get Knative to send anything to the Dead Letter Sink. I've approached this in two ways: the first was setting up a Broker, Trigger, Services, and a Sequence. In the Broker I defined a service to use for the DLQ. I then set up a service in the sequence to intentionally return a non-200 status. When I view the logs for the channel dispatcher in the knative-eventing namespace, I believe what I read is that it registers a failure.
I read some things about the default MT Broker maybe not handling the DLQ correctly, so I then installed Kafka. Got that all working, and essentially it appears to do the same thing.
I started to wonder: ok, maybe within a sequence you can't do DLQ. After all, the documentation only talks about DLQ with subscriptions and brokers, and maybe Knative believes the message was successfully delivered from the broker to the sequence, even if it dies within the sequence. So I manually set up a channel and a subscription and sent the data straight to the channel, and again what I got was essentially the same thing, which is:
The sequence will stop on whatever step doesn't return a 2XX status code, but nothing gets sent to the DLQ. I even made the subscription go straight to the service (instead of a sequence) and that service returned a 500 and still nothing to the DLQ.
The log item below is from the channel dispatcher pod running in the knative-eventing namespace. It basically looks the same with In memory channel or Kafka, i.e. expected 2xx got 500.
{"level":"info","ts":"2021-11-30T16:01:05.313Z","logger":"kafkachannel-dispatcher","caller":"consumer/consumer_handler.go:162","msg":"Failure while handling a message","knative.dev/pod":"kafka-ch-dispatcher-5bb8f84976-rpd87","knative.dev/controller":"knative.dev.eventing-kafka.pkg.channel.consolidated.reconciler.dispatcher.Reconciler","knative.dev/kind":"messaging.knative.dev.KafkaChannel","knative.dev/traceid":"957c394a-1636-44ad-b024-fb0dde9c8440","knative.dev/key":"kafka/test-sequence-kn-sequence-0","topic":"knative-messaging-kafka.kafka.test-sequence-kn-sequence-0","partition":0,"offset":4,"error":"unable to complete request to http://cetf.kafka.svc.cluster.local: unexpected HTTP response, expected 2xx, got 500"}
{"level":"warn","ts":"2021-11-30T16:01:05.313Z","logger":"kafkachannel-dispatcher","caller":"dispatcher/dispatcher.go:314","msg":"Error in consumer group","knative.dev/pod":"kafka-ch-dispatcher-5bb8f84976-rpd87","error":"unable to complete request to http://cetf.kafka.svc.cluster.local: unexpected HTTP response, expected 2xx, got 500"}
Notes on setup: I deployed literally everything to the same namespace for testing. I essentially followed the guide here to set up my broker for the broker/trigger approach and to deploy Kafka. My broker looked like this:
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  annotations:
    # case-sensitive
    eventing.knative.dev/broker.class: Kafka
  name: default
  namespace: kafka
spec:
  # Configuration specific to this broker.
  config:
    apiVersion: v1
    kind: ConfigMap
    name: kafka-broker-config
    namespace: knative-eventing
  # Optional dead letter sink, you can specify either:
  # - deadLetterSink.ref, which is a reference to a Callable
  # - deadLetterSink.uri, which is an absolute URI to a Callable (it can
  #   potentially be out of the Kubernetes cluster)
  delivery:
    deadLetterSink:
      ref:
        apiVersion: serving.knative.dev/v1
        kind: Service
        name: dlq
        namespace: kafka
When I manually created the subscription and channel my subscription looked like this:
apiVersion: messaging.knative.dev/v1
kind: Subscription
metadata:
  name: test-sub  # Name of the Subscription.
  namespace: kafka
spec:
  channel:
    apiVersion: messaging.knative.dev/v1beta1
    kind: KafkaChannel
    name: test-channel
  delivery:
    backoffDelay: PT1S
    backoffPolicy: linear
    retry: 1
    deadLetterSink:
      ref:
        apiVersion: serving.knative.dev/v1
        kind: Service
        name: dlq
        namespace: kafka
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: cetf
No matter what I do I NEVER see the dlq pod spin up. I've adjusted retry stuff, waited and waited, used the default channel/broker, Kafka, etc. I simply cannot see the pod ever run. Is there something I'm missing, what on earth could be wrong? I can set the subscriber to be a junk URI and then the DLQ pod spins up, but shouldn't it also spin up if the service it sends events to returns error codes?
Can anyone provide a couple of very basic YAML files to deploy the simplest version of a working DLQ to test with?
There was an issue with Dead Letter Sinks not being propagated in pre-GA releases. Can you make sure you are using Knative 1.0?
This is working for me as expected using the inmemory channel:
https://gist.github.com/odacremolbap/f6ce029caf4fa6fbb3cc1e829f188788
curl producing cloud events to a broker
broker with DLS configured to an event-display
event display service as DLS receiver
trigger from broker to a replier service
replier service (can ack and nack depending on the incoming event)
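A minimal nack-ing subscriber for this kind of test can be any HTTP server that returns a non-2xx status. A sketch with only the Python stdlib (the port and response body are arbitrary choices, and the CloudEvent payload is simply ignored):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class NackHandler(BaseHTTPRequestHandler):
    """Always-failing event subscriber: responds 500 to every POST, so the
    channel/broker should exhaust its retries and then route the event to
    the configured dead letter sink."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # the CloudEvent payload, ignored here
        self.send_response(500)
        self.end_headers()
        self.wfile.write(b"nack")

    def log_message(self, fmt, *args):
        pass  # keep output quiet

def make_server(port: int = 0) -> HTTPServer:
    # port=0 binds an ephemeral port; read it back from server_address.
    return HTTPServer(("127.0.0.1", port), NackHandler)
```

Pointing the subscription's subscriber at a service like this, and watching whether the DLQ service ever receives the event, isolates whether the delivery/deadLetterSink settings are being honored at all.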
I never found an example of this in the docs, but the API docs for the SequenceStep do show a delivery property, which, when assigned, uses the DLQ.
steps:
  - ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: service-step
    delivery:
      # DLS to an event-display service
      deadLetterSink:
        ref:
          apiVersion: serving.knative.dev/v1
          kind: Service
          name: dlq-service
          namespace: ns-name
It seems odd to have to specify a delivery for EVERY step and not just the sequence as a whole.

Disable Istio default retry strategy (at least on POST requests)

I have an application (microservices-based) running on Kubernetes with Istio 1.7.4.
The microservices have their own mechanisms of transaction compensation on integration failures.
But Istio is retrying requests when some integrations respond with a 503 status code. I need to disable that (at least on POST, which is non-idempotent),
and let the application take care of it.
But I've tried so many ways without success. Can someone help me?
Documentation
From the Istio Retries documentation: the default retry is hardcoded and its value is equal to 2.
The interval between retries (25ms+) is variable and determined
automatically by Istio, preventing the called service from being
overwhelmed with requests. The default retry behavior for HTTP
requests is to retry twice before returning the error.
Btw, it was initially 10, but was decreased to 2 in the "Enable retries for specific status codes and reduce num retries to 2" commit.
A workaround is to use virtual services:
you can adjust your retry settings on a per-service basis in virtual
services without having to touch your service code. You can also
further refine your retry behavior by adding per-retry timeouts,
specifying the amount of time you want to wait for each retry attempt
to successfully connect to the service.
Examples
The following example configures a maximum of 3 retries to connect to this service subset after an initial call failure, each with a 2 second timeout.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
    - ratings
  http:
    - route:
        - destination:
            host: ratings
            subset: v1
      retries:
        attempts: 3
        perTryTimeout: 2s
Your case. Disabling retries. Taken from Disable globally the default retry policy:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: no-retries-for-one-service
spec:
  hosts:
    - one-service.default.svc.cluster.local
  http:
    - retries:
        attempts: 0
      route:
        - destination:
            host: one-service.default.svc.cluster.local

How to dynamically scale a service in Openshift ? A Challenging scenario

I'm currently trying to deploy a backend service API for my application in OpenShift, which needs to be scalable such that each request runs in a new pod.
The service takes 5 minutes to serve a single request.
I have to hit the service 700 times.
Is there a way I can create 700 pods to serve the 700 requests, and scale down to 1 after all the requests are completed?
Start of the application:
1 pod <- 700 requests
Serving:
700 pod serves one request each
End of the application:
1 pod
Autoscaling in Kubernetes relies on metrics. From what I know, OpenShift supports CPU and memory utilization.
But I don't think this is what you are looking for.
I think you should be looking into Jobs - Run to Completion.
Each request will spawn a new Job, which will run until it's completed.
Example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
        - name: pi
          image: perl
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
This will run a job which computes π to 2000 places and prints it out.
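To get one Job per incoming request, a manifest like the one above can be templated per request and submitted through the Kubernetes API (e.g. the official Python client's `BatchV1Api.create_namespaced_job`). A sketch of the manifest builder (plain Python; the naming scheme, image, and command are placeholders, not part of the original answer):

```python
def build_job_manifest(request_id: str, image: str, command: list) -> dict:
    """Build a batch/v1 Job manifest for one request. The Job runs to
    completion and its pod is not restarted on success (restartPolicy: Never)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"request-{request_id}"},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "worker", "image": image, "command": command}
                    ],
                    "restartPolicy": "Never",
                }
            },
            "backoffLimit": 4,
        },
    }

# One Job per request: 700 requests -> 700 Jobs, each in its own pod, and
# the cluster garbage-collects back down once they complete.
manifests = [build_job_manifest(str(i), "perl", ["perl", "-e", "sleep 300"])
             for i in range(700)]
print(len(manifests))  # 700
```

This keeps the "scale to 700, then back to 1" behavior implicit: completed Jobs simply stop consuming pod capacity, rather than an autoscaler having to react to a metric.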