Netty starts channels but does not read from them in Kubernetes

netty-all:4.1.48.Final
I am having a cryptic issue with Netty that seems to show up only in Kubernetes. I have a clone of the project running on a cloud instance with fewer resources that does not have this issue. Both deployments receive the same amount of traffic (I am resending the same traffic from a third provider to both Netty servers).
In Kubernetes, every time a channel is opened (I send a message) I increment my session counter, and every time the channel reads data I increment a read counter. Since I send data on every connection, I would expect at least one read per session (more if the data were long enough), but never fewer. Instead, the counters drift apart fairly smoothly until the number of reads settles at around half the number of opened sessions.
Is there any way for me to diagnose this issue? Below is the bare-bones Netty server I am using (with its configuration, including an idle timer). Am I blocking Netty resources?
class Server {
    private val bossGroup = NioEventLoopGroup()
    private val workerGroup = NioEventLoopGroup()

    fun start() {
        ServerBootstrap()
            .group(bossGroup, workerGroup)
            .option(ChannelOption.SO_REUSEADDR, true)
            .option(ChannelOption.AUTO_CLOSE, false)
            .channel(NioServerSocketChannel::class.java)
            .option(ChannelOption.SO_KEEPALIVE, true)
            .option(ChannelOption.TCP_NODELAY, true)
            .childHandler(object : ChannelInitializer<SocketChannel>() {
                override fun initChannel(channel: SocketChannel) {
                    val idleTimeTrigger = 1
                    val idleStateHandler = IdleStateHandler(0, 0, idleTimeTrigger)
                    channel
                        .pipeline()
                        .addLast("idleStateHandler", idleStateHandler)
                        .addLast(Session(idleTimeTrigger))
                }
            })
            .bind(8888)
            .sync()
            .channel()
            .closeFuture()
            .sync()
    }
}
class Session(
    private val idleTimeTrigger: Int,
) : ChannelInboundHandlerAdapter() {

    // session counter
    val idleTimeout = 10
    var idleTickCounter = 0L

    override fun channelRead(ctx: ChannelHandlerContext, msg: Any) {
        // read counter is less than session counter... HUH????
        this.idleTickCounter = 0
        try {
            val data = (msg as ByteBuf).toString(CharsetUtil.UTF_8)
            // ... do my stuff ..
            // output counter is less than session counter
        } finally {
            ReferenceCountUtil.release(msg)
        }
    }

    override fun userEventTriggered(ctx: ChannelHandlerContext, evt: Any) {
        this.idleTickCounter++
        val idleTime = idleTimeTrigger * idleTickCounter
        if (idleTime > idleTimeout) {
            // idle timeout counter is always 0
            ctx.close()
        }
        super.userEventTriggered(ctx, evt)
    }

    override fun exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable) {
        // error counter is always 0
        ctx.close()
    }
}
The output is passed to a RabbitMQ AMQP client and sent to a queue. I don't know if this is relevant (with regard to resource usage), but the AMQP client uses Jetty.
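For diagnosis I was thinking of counting channelActive versus channelRead in a separate handler placed at the front of the pipeline, so I can tell whether connections become active but never produce a read (which would point at the network path into the pod rather than at my handler). A minimal sketch of that counting handler, assuming a single shared instance is added before the idleStateHandler (the handler and counter names are mine):

import io.netty.channel.ChannelHandler
import io.netty.channel.ChannelHandlerContext
import io.netty.channel.ChannelInboundHandlerAdapter
import java.util.concurrent.atomic.AtomicLong

// Hypothetical diagnostic handler: one shared instance sits at the head of every
// child pipeline so channelActive and channelRead can be compared independently
// of the Session logic (log or export the counters however is convenient).
@ChannelHandler.Sharable
class ConnectionStatsHandler : ChannelInboundHandlerAdapter() {
    val sessionsOpened = AtomicLong()   // channels that became active
    val readsSeen = AtomicLong()        // channelRead invocations of any size

    override fun channelActive(ctx: ChannelHandlerContext) {
        sessionsOpened.incrementAndGet()
        ctx.fireChannelActive()
    }

    override fun channelRead(ctx: ChannelHandlerContext, msg: Any) {
        readsSeen.incrementAndGet()
        ctx.fireChannelRead(msg)        // don't consume the buffer, just count and pass it on
    }
}

It would be created once (next to bossGroup/workerGroup) and registered first in initChannel, e.g. .addLast("stats", statsHandler) before the idleStateHandler.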

Related

Where's the bottleneck when I wait for a Kafka message then return a value in Actix Web?

I am trying to communicate between 2 microservices written in Rust and Node.js using Kafka.
I'm using actix-web as the web framework and rdkafka as the Kafka client for Rust. The Node.js side queries data from the database and returns it as JSON to the Rust server via Kafka.
The flow:
Request -> Actix Web -> Kafka -> Node -> Kafka -> Actix Web -> Response
The logic is: a request hits an endpoint on Actix Web, which produces a message asking another microservice for something, waits until that service sends the result back (matched by the Kafka message key), and returns it to the user as an HTTP response.
I got it to work, but the performance is very slow (I am stress-testing with wrk).
I'm not sure why it performs so slowly, but while digging in I found that if I add a 5-second delay on the Node.js side and create 2 requests to actix-web one second apart, they respond with 5- and 10-second delays respectively.
The benchmark is around 3k requests per second, using the following command:
wrk http://localhost:8080 -d 20s -t 2 -c 200
This makes me guess that something might be blocking the thread for each request.
Here is the source code and the repo:
use std::{
    sync::Arc,
    time::{
        Duration,
        Instant
    }
};
use actix_web::{
    App,
    HttpServer,
    get,
    rt,
    web::Data
};
use futures::TryStreamExt;
use tokio::time::sleep;
use num_cpus;
use rand::{
    distributions::Alphanumeric,
    Rng
};
use rdkafka::{
    ClientConfig,
    Message,
    consumer::{
        Consumer,
        StreamConsumer
    },
    producer::{
        FutureProducer,
        FutureRecord
    }
};

const TOPIC: &'static str = "exp-queue_general-5";

#[derive(Clone)]
pub struct AppState {
    pub producer: Arc<FutureProducer>,
    pub receiver: flume::Receiver<String>
}

fn generate_key() -> String {
    rand::thread_rng()
        .sample_iter(&Alphanumeric)
        .take(8)
        .map(char::from)
        .collect()
}

#[get("/")]
async fn landing(state: Data<AppState>) -> String {
    let key = generate_key();
    let t1 = Instant::now();
    let producer = &state.producer;
    let receiver = &state.receiver;
    producer
        .send(
            FutureRecord::to(&format!("{}-forth", TOPIC))
                .key(&key)
                .payload("Hello From Rust"),
            Duration::from_secs(8)
        )
        .await
        .expect("Unable to send message");
    println!("Producer take {} ms", t1.elapsed().as_millis());
    let t2 = Instant::now();
    let value = receiver
        .recv()
        .unwrap_or("".to_owned());
    println!("Receiver take {} ms", t2.elapsed().as_millis());
    println!("Process take {} ms\n", t1.elapsed().as_millis());
    value
}

#[get("/status")]
async fn heartbeat() -> &'static str {
    // ? Concurrency delay check
    sleep(Duration::from_secs(1)).await;
    "Working"
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // ? Assume that the whole node is just Rust instance
    let mut cpus = num_cpus::get() / 2 - 1;
    if cpus < 1 {
        cpus = 1;
    }
    println!("Cpus {}", cpus);
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("linger.ms", "25")
        .set("queue.buffering.max.messages", "1000000")
        .set("queue.buffering.max.ms", "25")
        .set("compression.type", "lz4")
        .set("retries", "40000")
        .set("retries", "0")
        .set("message.timeout.ms", "8000")
        .create()
        .expect("Kafka config");
    let (tx, rx) = flume::unbounded::<String>();
    rt::spawn(async move {
        let consumer: StreamConsumer = ClientConfig::new()
            .set("bootstrap.servers", "localhost:9092")
            .set("group.id", &format!("{}-back", TOPIC))
            .set("queued.min.messages", "200000")
            .set("fetch.error.backoff.ms", "250")
            .set("socket.blocking.max.ms", "500")
            .create()
            .expect("Kafka config");
        consumer
            .subscribe(&vec![format!("{}-back", TOPIC).as_ref()])
            .expect("Can't subscribe");
        consumer
            .stream()
            .try_for_each_concurrent(
                cpus,
                |message| {
                    let txx = tx.clone();
                    async move {
                        let result = String::from_utf8_lossy(
                            message
                                .payload()
                                .unwrap_or("Error serializing".as_bytes())
                        ).to_string();
                        txx.send(result).expect("Tx not sending");
                        Ok(())
                    }
                }
            )
            .await
            .expect("Error reading stream");
    });
    let state = AppState {
        producer: Arc::new(producer),
        receiver: rx
    };
    HttpServer::new(move || {
        App::new()
            .app_data(Data::new(state.clone()))
            .service(landing)
            .service(heartbeat)
    })
    .workers(cpus)
    .bind("0.0.0.0:8080")?
    .run()
    .await
}
I found some solved issues on GitHub that recommended using actors instead, which I also tried in a separate branch.
This has worse performance than the main branch, handling around 200-300 requests per second.
I don't know where the bottleneck is or what is blocking the requests.

Stop Flink Kafka consumer task programmatically

I'm using a Kafka consumer with Flink 1.9 (in Scala 2.12) and facing the following problem (similar to this question): the consumer should stop fetching data (and finish the task) when no new messages are received for a specific amount of time (the stream is potentially infinite, so there is no "end-of-stream" message in the topic itself).
I've tried using a ProcessFunction that calls consumer.close(), but this did not help (the consumer continues to run). Throwing an exception in the ProcessFunction kills the job completely, which is not what I want (the job consists of several stages, which are all canceled after an exception is thrown). Here is my ProcessFunction:
class TimeOutFunction( // delay after which an alert flag is thrown
    val timeOut: Long, consumer: FlinkKafkaConsumer[Row]
) extends ProcessFunction[Row, Row] {

  // state to remember the last timer set
  private var lastTimer: ValueState[Long] = _

  override def open(conf: Configuration): Unit = { // setup timer state
    val lastTimerDesc = new ValueStateDescriptor[Long]("lastTimer", classOf[Long])
    lastTimer = getRuntimeContext.getState(lastTimerDesc)
  }

  override def processElement(value: Row, ctx: ProcessFunction[Row, Row]#Context, out: Collector[Row]): Unit = { // get current time and compute timeout time
    val currentTime = ctx.timerService.currentProcessingTime
    val timeoutTime = currentTime + timeOut
    // register timer for timeout time
    ctx.timerService.registerProcessingTimeTimer(timeoutTime)
    // remember timeout time
    lastTimer.update(timeoutTime)
    // throughput the event
    out.collect(value)
  }

  override def onTimer(timestamp: Long, ctx: ProcessFunction[Row, Row]#OnTimerContext, out: Collector[Row]): Unit = {
    // check if this was the last timer we registered
    if (timestamp == lastTimer.value) {
      // it was, so no data was received afterwards.
      // stop the consumer.
      consumer.close()
    }
  }
}
The isEndOfStream() method on a deserialization schema is also no good, since it requires a next element (and my case is rather the opposite: the stream should stop when there is no next element for some time).
So, is there a way to do this (preferably without subclassing FlinkKafkaConsumer and/or using reflection)?

Akka HTTP Streaming server kills connection after last message from queue

I have a pretty simple app that consists of a Kafka consumer sitting behind an Akka HTTP streaming server. Upon receiving a request, the server starts up a new consumer for the specified user and begins reading messages from a queue:
def consumer(consumerGroup: String, from: Int) = {
  val topicsAndDate = Subscriptions.assignmentOffsetsForTimes(partitions.map(_ -> (System.currentTimeMillis() - from)): _*)
  Consumer.plainSource[String, GenericRecord](consumerSettings.withGroupId(consumerGroup), topicsAndDate)
    .map(record => record.timestamp() -> messageFormat.from(record.value()))
    .map {
      //convert to json
    }
}

def routes: Route = Route.seal(
  pathSingleSlash {
    complete(HttpEntity(ContentTypes.`text/html(UTF-8)`, "Say hello to akka-http"))
  } ~
    path("stream") {
      //some logic to validate user
      log.info("Received request from {} with 'from'={}", user, from)
      complete(consumer(user, from))
    })

startServer("0.0.0.0", 8080)
The service works fine until the consumer has reached the latest message on the queue. Sixty seconds after this latest message has been returned, the connection to the server is killed every time. I want to keep the connection alive as the queue is populated with more messages every couple of minutes.
I have tried various different config options, but none seem to give the desired outcome. My current config looks like this:
akka {
  http {
    client {
      idle-timeout = 300s
    }
    server {
      idle-timeout = 600s
      linger-timeout = 15 min
    }
    host-connection-pool {
      max-retries = 30
      max-connections = 20
      max-open-requests = 32
      connecting-timeout = 60s
      client {
        idle-timeout = 300s
      }
    }
  }
}
I have also tried using the server.websocket.periodic-keep-alive-max-idle = 1 second setting, but it doesn't seem to make any difference.
Let me know if I need to supply any more relevant info.

Long-living service with coroutines

I want to create a long-living service that can handle events.
It receives events via postEvent, stores them in a repository (with an underlying database), and sends them to an API in batches once enough events have accumulated.
I'd also like to be able to shut it down on demand.
Furthermore, I would like to test this service.
This is what I have come up with so far. Currently I'm struggling with unit testing it.
Either the database is shut down prematurely after events are sent to the service via fixture.postEvent(), or the test itself gets into some sort of deadlock (I was experimenting with various context + job configurations).
What am I doing wrong here?
class EventSenderService(
    private val repository: EventRepository,
    private val api: Api,
    private val serializer: GsonSerializer,
    private val requestBodyBuilder: EventRequestBodyBuilder,
) : EventSender, CoroutineScope {

    private val eventBatchSize = 25
    val job = Job()
    private val channel = Channel<Unit>()

    init {
        job.start()
        launch {
            for (event in channel) {
                val trackingEventCount = repository.getTrackingEventCount()
                if (trackingEventCount < eventBatchSize) continue
                readSendDelete()
            }
        }
    }

    override val coroutineContext: CoroutineContext
        get() = Dispatchers.Default + job

    override fun postEvent(event: Event) {
        launch(Dispatchers.IO) {
            writeEventToDatabase(event)
        }
    }

    override fun close() {
        channel.close()
        job.cancel()
    }

    private fun readSendDelete() {
        try {
            val events = repository.getTrackingEvents(eventBatchSize)
            val request = requestBodyBuilder.buildFor(events).blockingGet()
            api.postEvents(request).blockingGet()
            repository.deleteTrackingEvents(events)
        } catch (throwable: Throwable) {
            Log.e(throwable)
        }
    }

    private suspend fun writeEventToDatabase(event: Event) {
        try {
            val trackingEvent = TrackingEvent(eventData = serializer.toJson(event))
            repository.insert(trackingEvent)
            channel.send(Unit)
        } catch (throwable: Throwable) {
            throwable.printStackTrace()
            Log.e(throwable)
        }
    }
}
Test
@RunWith(RobolectricTestRunner::class)
class EventSenderServiceTest : CoroutineScope {

    @Rule
    @JvmField
    val instantExecutorRule = InstantTaskExecutorRule()

    private val api: Api = mock {
        on { postEvents(any()) } doReturn Single.just(BaseResponse())
    }
    private val serializer: GsonSerializer = mock {
        on { toJson<Any>(any()) } doReturn "event_data"
    }
    private val bodyBuilder: EventRequestBodyBuilder = mock {
        on { buildFor(any()) } doReturn Single.just(TypedJsonString.buildRequestBody("[ { event } ]"))
    }
    val event = Event(EventName.OPEN_APP)
    private val database by lazy {
        Room.inMemoryDatabaseBuilder(
            RuntimeEnvironment.systemContext,
            Database::class.java
        ).allowMainThreadQueries().build()
    }
    private val repository by lazy { database.getRepo() }
    val fixture by lazy {
        EventSenderService(
            repository = repository,
            api = api,
            serializer = serializer,
            requestBodyBuilder = bodyBuilder,
        )
    }

    override val coroutineContext: CoroutineContext
        get() = Dispatchers.Default + fixture.job

    @Test
    fun eventBundling_success() = runBlocking {
        (1..40).map { Event(EventName.OPEN_APP) }.forEach { fixture.postEvent(it) }
        fixture.job.children.forEach { it.join() }
        verify(api).postEvents(any())
        assertEquals(15, eventDao.getTrackingEventCount())
    }
}
After updating the code as @Marko Topolnik suggested - adding fixture.job.children.forEach { it.join() } - the test never finishes.
One thing you're doing wrong is related to this:
override fun postEvent(event: Event) {
    launch(Dispatchers.IO) {
        writeEventToDatabase(event)
    }
}
postEvent launches a fire-and-forget async job that will eventually write the event to the database. Your test creates 40 such jobs in rapid succession and, while they're queued, asserts the expected state. I can't work out, though, why you assert 15 events after posting 40.
To fix this issue you should use the line you already have:
fixture.job.join()
but change it to
fixture.job.children.forEach { it.join() }
and place it lower, after the loop that creates the events.
I failed to take into account the long-running consumer job you launch in the init block. This invalidates the advice I gave above to join all children of the master job.
Instead you'll have to make a few more changes: make postEvent return the job it launches, collect all these jobs in the test, and join them. This is more selective and avoids joining the long-living consumer job.
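A minimal sketch of that change, reusing the names from the question (the EventSender interface would need the matching signature change):

// Hypothetical adjustment: postEvent returns the Job it launches.
override fun postEvent(event: Event): Job =
    launch(Dispatchers.IO) {
        writeEventToDatabase(event)
    }

// In the test, join only the jobs returned by postEvent, not the
// long-running consumer started in init.
@Test
fun eventBundling_success() = runBlocking<Unit> {
    val jobs = (1..40).map { fixture.postEvent(Event(EventName.OPEN_APP)) }
    jobs.forEach { it.join() }
    verify(api).postEvents(any())
}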
As a separate issue, your batching approach isn't ideal because it will always wait for a full batch before doing anything. Whenever there's a lull period with no events, the events will be sitting in the incomplete batch indefinitely.
The best approach is natural batching, where you keep eagerly draining the input queue. When there's a big flood of incoming events, the batch will naturally grow, and when they are trickling in, they'll still be served right away. You can see the basic idea here.
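A rough sketch of that idea, assuming the events themselves (rather than Unit signals) are sent through the channel; drainBatch is a name I made up, and tryReceive needs a recent kotlinx.coroutines (older versions would use poll() instead):

import kotlinx.coroutines.channels.ReceiveChannel

// Suspend until at least one event is available, then greedily take whatever is
// already queued, up to maxSize. A flood grows the batch naturally; a trickle
// gets flushed right away instead of waiting for a full batch.
suspend fun <T> ReceiveChannel<T>.drainBatch(maxSize: Int): List<T> {
    val batch = mutableListOf(receive())
    while (batch.size < maxSize) {
        batch += tryReceive().getOrNull() ?: break
    }
    return batch
}

The consumer loop in init would then call channel.drainBatch(eventBatchSize) and hand each batch straight to the send-and-delete logic, instead of re-checking the repository count on every signal.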

When submitting a Spark Streaming receiver, how do you specify the host without "failing through"?

I want to create a server socket to listen on a host whose IP and hostname I know ahead of time (and it shows up with that hostname in the YARN node list). But I can't seem to get it to listen on that host without letting it fail an arbitrary number of times beforehand.
There's a Flume receiver that has the sort of host-specific functionality I'm looking for.
FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
My receiver code:
class TCPServerReceiver(hostname: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {

  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop() {
  }

  private def receive() {
    /* This is where the job fails until it happens to start on the correct host */
    val server = new ServerSocket(port, 50, InetAddress.getByName(hostname))
    var userInput: String = null
    while (true) {
      try {
        val s = server.accept()
        val in = new BufferedReader(new InputStreamReader(s.getInputStream()))
        userInput = in.readLine()
        while (!isStopped && userInput != null) {
          store(userInput)
          userInput = in.readLine()
        }
      } catch {
        case e: java.net.ConnectException =>
          restart("Error connecting to " + port, e)
        case t: Throwable =>
          restart("Error receiving data", t)
      }
    }
  }
}
And then to test it while it's running:
echo 'this is a test' | nc <hostname> <port>
This all works when I run it as a local client, but when it's submitted to a YARN cluster, the logs show it trying to run in other containers on different hosts, and all of them fail because the hostname doesn't match that of the container:
java.net.BindException: Cannot assign requested address
Eventually (after several minutes), it does create the socket once the receiver tries to start on the correct host, so the above code does work, but it takes a substantial amount of "boot time" and I'm worried that adding more nodes will cause it to take even longer!
Is there a way of ensuring that this receiver starts on the correct host on the first try?
The custom TCPServerReceiver implementation should also implement:
def preferredLocation: Option[String]
Override this to specify a preferred location (hostname).
In this case, something like:
def preferredLocation = Some(hostname)