I'm using Kafka consumer with Flink 1.9 (in Scala 2.12), and facing the following problem (similar to this question): the consumer should stop fetching data (and finish the task) when no new messages are received for a specific amount of time (since the stream is potentially infinite, so there is no "end-of-stream" message in the topic itself).
I've tried to use ProcessFunction which calls consumer.close(), but this did not help (consumer continues to run). Throwing an exception in ProcessFunction kills the job completely, which is not what I want (since the job consists of several stages, which are canceled after throwing an exception). Here is my ProcessFunction:
class TimeOutFunction( // delay after which an alert flag is thrown
val timeOut: Long, consumer: FlinkKafkaConsumer[Row]
) extends ProcessFunction[Row, Row] {
// state to remember the last timer set
private var lastTimer: ValueState[Long] = _
override def open(conf: Configuration): Unit = { // setup timer state
val lastTimerDesc = new ValueStateDescriptor[Long]("lastTimer", classOf[Long])
lastTimer = getRuntimeContext.getState(lastTimerDesc)
}
override def processElement(value: Row, ctx: ProcessFunction[Row, Row]#Context, out: Collector[Row]): Unit = { // get current time and compute timeout time
val currentTime = ctx.timerService.currentProcessingTime
val timeoutTime = currentTime + timeOut
// register timer for timeout time
ctx.timerService.registerProcessingTimeTimer(timeoutTime)
// remember timeout time
lastTimer.update(timeoutTime)
// throughput the event
out.collect(value)
}
override def onTimer(timestamp: Long, ctx: ProcessFunction[Row, Row]#OnTimerContext, out: Collector[Row]): Unit = {
// check if this was the last timer we registered
if (timestamp == lastTimer.value) {
// it was, so no data was received afterwards.
// stop the consumer.
consumer.close()
}
}
}
The isEndOfStream() method on a deserialization schema is also no good, since it requires nextElement (and my case is kind of vice-versa, since the stream should stop when there is no next element for some time).
So, there is a way to do this (preferably without subclassing FlinkKafkaConsumer and/or using reflection)?
Related
netty-all:4.1.48.Final
I am having a cryptic issue with Netty that seems to only show up in Kubernetes. I have a clone of the project running on a cloud instance with less resources that does not have this issue. Both projects receive the same amount of traffic (I am resending the same traffic from a third provider to both Netty servers).
In kubernetes, every time a channel is opened (I send a message) I increment my session counter. Every time the channel reads data, I increment a read counter. I am sending data every time so I would expect to see at the very least one read for every session (more if the data were long enough) but not less. The counters drift apart rather smoothly until the amount of reads stays around half of the amount of opened sessions.
Is there any way for me to diagnose this issue? I have written the barebones netty server I am using (with the configuration, including an idle timer). Am I blocking Netty resources?
class Server {
private val bossGroup = NioEventLoopGroup()
private val workerGroup = NioEventLoopGroup()
fun start() {
ServerBootstrap()
.group(bossGroup, workerGroup)
.option(ChannelOption.SO_REUSEADDR, true)
.option(ChannelOption.AUTO_CLOSE, false)
.channel(NioServerSocketChannel::class.java)
.option(ChannelOption.SO_KEEPALIVE, true)
.option(ChannelOption.TCP_NODELAY, true)
.childHandler(object : ChannelInitializer<SocketChannel>() {
override fun initChannel(channel: SocketChannel) {
val idleTimeTrigger = 1
val idleStateHandler = IdleStateHandler(0, 0, idleTimeTrigger)
channel
.pipeline()
.addLast("idleStateHandler", idleStateHandler)
.addLast(Session(idleTimeTrigger))
}
})
.bind(8888)
.sync()
.channel()
.closeFuture()
.sync()
}
}
class Session(
private val idleTimeTrigger: Int,
) : ChannelInboundHandlerAdapter() {
// session counter
val idleTimeout = 10
var idleTickCounter = 0L
override fun channelRead(ctx: ChannelHandlerContext, msg: Any) {
// read counter is less than session counter... HUH????
this.idleTickCounter = 0
try {
val data = (msg as ByteBuf).toString(CharsetUtil.UTF_8)
// ... do my stuff ..
// output counter is less than session counter
} finally {
ReferenceCountUtil.release(msg)
}
}
override fun userEventTriggered(ctx: ChannelHandlerContext, evt: Any) {
this.idleTickCounter++
val idleTime = idleTimeTrigger * idleTickCounter
if (idleTime > idleTimeout) {
// idle timeout counter is always 0
ctx.close()
}
super.userEventTriggered(ctx, evt)
}
override fun exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable) {
// error counter is always 0
ctx.close()
}
}
The output is being passed to a rabbit AMQP client and sent to a queue. I don't know if this is relevant (with regards to resource usage) but the AMQP client uses Jetty
I define a receiver to read data from Redis.
part of receiver simplified code:
class MyReceiver extends Receiver (StorageLevel.MEMORY_ONLY){
override def onStart() = {
while(!isStopped) {
val res = readMethod()
if (res != null) store(res.toIterator)
// using res.foreach(r => store(r)) the performance is almost the same
}
}
}
My streaming workflow:
val ssc = new StreamingContext(spark.sparkContext, new Duration(50))
val myReceiver = new MyReceiver()
val s = ssc.receiverStream(myReceiver)
s.foreachRDD{ r =>
r.persist()
if (!r.isEmpty) {
some short operations about 1s in total
// note this line ######1
}
}
I have a producer which produce much faster than consumer so that there are plenty records in Redis now, I tested with number 10000. I debugged, and all records could quickly be read after they are in Redis by readMethod() above. However, in each microbatch I can only get 30 records. (If store is fast enough it should get all of 10000)
With this suspect, I added a sleep 10 seconds code Thread.sleep(10000) to ######1 above. Each microbatch still gets about 30 records, and each microbatch process time increases 10 seconds. And if I increase the Duration to 200ms, val ssc = new StreamingContext(spark.sparkContext, new Duration(200)), it could get about 120 records.
All of these shows spark streaming only generate RDD in Duration? After gets RDD and in main workflow, store method is temporarily stopped? But this is a great waste if it is true. I want it also generates RDD (store) while the main workflow is running.
Any ideas?
I cannot leave a comment simply I don't have enough reputation. Is it possible that propertyspark.streaming.receiver.maxRate is set somewhere in your code ?
Situation
I am using akka actors to update data on my web-client. One of those actors is solely repsonsible for sending updates concerning single Agents. These agents are updated very rapidly (every 10ms). My goal now is to throttle this updating mechanism so that the newest version of every Agent is sent every 300ms.
My code
This is what I came up with so far:
/**
* Single agents are updated very rapidly. To limit the burden on the web-frontend, we throttle the messages here.
*/
class BroadcastSingleAgentActor extends Actor {
private implicit val ec: ExecutionContextExecutor = context.dispatcher
private var queue = Set[Agent]()
context.system.scheduler.schedule(0 seconds, 300 milliseconds) {
queue.foreach { a =>
broadcastAgent(self)(a) // sends the message to all connected clients
}
queue = Set()
}
override def receive: Receive = {
// this message is received every 10 ms for every agent present
case BroadcastAgent(agent) =>
// only keep the newest version of the agent
queue = queue.filter(_.id != agent.id) + agent
}
}
Question
This actor (BroadcastSingleAgentActor) works as expected, but I am not 100% sure if this is thread safe (updating the queue while potentionally clearing it). Also, this does not feel like I am making the best out of the tools akka provides me with. I found this article (Throttling Messages in Akka 2), but my problem is that I need to keep the newest Agent message while dropping any old version of it. Is there an example somewhere similar to what I need?
No, this isn't thread safe because the scheduling via the ActorSystem will happen on another thread than the receive. One potential idea is to do the scheduling within the receive method because incoming messages to the BroadcastSingleAgentActor will be handled sequentially.
override def receive: Receive = {
case Refresh =>
context.system.scheduler.schedule(0 seconds, 300 milliseconds) {
queue.foreach { a =>
broadcastAgent(self)(a) // sends the message to all connected clients
}
}
queue = Set()
// this message is received every 10 ms for every agent present
case BroadcastAgent(agent) =>
// only keep the newest version of the agent
queue = queue.filter(_.id != agent.id) + agent
}
iam new to kafka i want to implement a kafka message pipeline system for our play-scala project,i created a topic in which records are inserted, and i wrote a consumer code like this,
val recordBatch = new ListBuffer[CountryModel]
consumer.subscribe(Arrays.asList("session-queue"))
while (true) {
val records = consumer.poll(1000)
val iter = records.iterator()
while (iter.hasNext()) {
val record: ConsumerRecord[String, String] = iter.next()
recordBatch += Json.parse(record.value()).as[CountryModel]
}
processData(recordBatch)
// Thread.sleep(5 * 1000)
}
but after certain time the service is stopping, the processor is going to 100% after certain time machine is stopping, how can i handle this infinite loop.
in production environment i can not rely on this loop right i tried to sleep the thread for some time but it is not a elegant solution
I have an Akka Actor that I want to send "control" messages to.
This Actor's core mission is to listen on a Kafka queue, which is a polling process inside a loop.
I've found that the following simply locks up the Actor and it won't receive the "stop" (or any other) message:
class Worker() extends Actor {
private var done = false
def receive = {
case "stop" =>
done = true
kafkaConsumer.close()
// other messages here
}
// Start digesting messages!
while (!done) {
kafkaConsumer.poll(100).iterator.map { cr: ConsumerRecord[Array[Byte], String] =>
// process the record
), null)
}
}
}
I could wrap the loop in a Thread started by the Actor, but is it ok/safe to start a Thread from inside an Actor? Is there a better way?
Basically you can but keep in mind that this actor will be blocking and a thumb of rule is to never block inside actors. If you still want to do this, make sure that this actor runs in a separate thread pool than the native one so you don't affect Actor System performances. One another way to do it would be to send messages to itself to poll new messages.
1) receive a order to poll a message from kafka
2) Hand over the
message to the relevant actor
3) Send a message to itself to order
to pull a new message
4) Hand it over...
Code wise :
case object PollMessage
class Worker() extends Actor {
private var done = false
def receive = {
case PollMessage ⇒ {
poll()
self ! PollMessage
}
case "stop" =>
done = true
kafkaConsumer.close()
// other messages here
}
// Start digesting messages!
def poll() = {
kafkaConsumer.poll(100).iterator.map { cr: ConsumerRecord[Array[Byte], String] =>
// process the record
), null)
}
}
}
I am not sure though that you will ever receive the stop message if you continuously block on the actor.
Adding #Louis F. answer; depending on the configuration of your actors they will either drop all messages that they receive if at the given moment they are busy or put them in a mailbox aka queue and the messages will be processed later (usually in FIFO manner). However, in this particular case you are flooding the actor with PollMessage and you have no guarantee that your message will not be dropped - which appears to happen in your case.