How to "lock" records in kafka streams - apache-kafka

I have a stream application that reads from a topic and stores the state in a global store. There are several processors that write to the same topic by reading from the state store, updating a particular field, and writing it back to the topic.
I've noticed that some writes contain stale data and overwrite a record that was previously updated. What techniques can I use to achieve this "locking" of records, so that no processor updates a record while it is being read and processed by another? So far I believe this can be achieved by enabling exactly-once processing, but I would like your expert opinion on this and pointers to anything else I might be missing.
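(For reference, exactly-once processing is enabled through the processing.guarantee setting; a minimal sketch, where the application id and broker address are placeholder values:)

import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

// Illustrative configuration only; the id and broker address are placeholders.
val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "product-app")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
// Turns on exactly-once processing for the whole Streams application.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE)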
streamBuilder
  .addGlobalStore(
    Stores
      .keyValueStoreBuilder(productGlobalState, keySerde, valueSerde)
      .withLoggingDisabled(),
    "product-source-1",
    consumedInstance,
    processorSupplier
  )

streamBuilder
  .stream("product-source-1")
  .transform(transformer1)
  .to("product-sink")

streamBuilder
  .stream("product-source-2")
  .transform(transformer2)
  .to("product-sink")
//state store processor
val productProcessorSupplier: ProcessorSupplier[ProductId, ProductValue] =
  () =>
    new AbstractProcessor[ProductId, ProductValue] {
      private var store: KeyValueStore[ProductId, ProductValue] = _

      override def init(context: ProcessorContext): Unit = {
        super.init(context)
        this.store = context
          .getStateStore(ProductGlobalStoreName)
          .asInstanceOf[KeyValueStore[ProductId, ProductValue]]
      }

      override def process(key: ProductId, value: ProductValue): Unit = {
        store.put(key, value)
        context().commit()
      }
    }
//transformer1
val transformer1: ProductTransformer =
  () =>
    new Transformer[ProductId, ProductValue, KeyValue[ProductId, ProductValue]] {
      private var prodStore: KeyValueStore[ProductId, ProductValue] = _

      override def init(context: ProcessorContext): Unit = {
        this.prodStore = context
          .getStateStore(ProductGlobalStoreName)
          .asInstanceOf[KeyValueStore[ProductId, ProductValue]]
      }

      override def transform(key: ProductId, value: ProductValue): KeyValue[ProductId, ProductValue] = {
        val updatedProd = Option(prodStore.get(key)).map { p =>
          value.copy(isSale = p.isSale, isNew = p.isNew) // attempt to preserve these fields and not override them
        } getOrElse value
        KeyValue.pair(key, updatedProd)
      }

      override def close(): Unit = ()
    }
//transformer2
val transformer2 = () =>
  new Transformer[ProductId, ProductValue, KeyValue[ProductId, ProductValue]] {
    private var context: ProcessorContext = _
    protected var prodStore: KeyValueStore[ProductId, ProductValue] = _

    override def init(context: ProcessorContext): Unit = {
      this.context = context
      this.prodStore = context
        .getStateStore(ProductGlobalStoreName)
        .asInstanceOf[KeyValueStore[ProductId, ProductValue]]
    }

    override def transform(key: ProductId, value: ProductValue): KeyValue[ProductId, ProductValue] = {
      val product = prodStore.get(key)
      val newValue = product.copy(isNew = true)
      KeyValue.pair(key, newValue)
    }

    override def close(): Unit = ()
  }
//... other transformers are similar to transformer2 but update different fields.

Related

Kafka Streams Scala Memory Leak

I have the following use case:
I want to aggregate data for a specific time and then send them downstream. Since the built-in suppress feature does not support wall-clock time, I have to implement this on my own using a transformer.
After the time window is closed, I send the aggregated data downstream and delete them from the state store. I tested the behaviour with a limited amount of data, i.e. after all data have been processed, the state store should be empty again and the memory should decrease. Unfortunately, the memory always stays at the same level.
SuppressTransformer.scala
class SuppressTransformer[T](stateStoreName: String, windowDuration: Duration) extends Transformer[String, T, KeyValue[String, T]] {

  val scheduleInterval: Duration = Duration.ofSeconds(180)
  private val keySet = mutable.HashSet.empty[String]
  var context: ProcessorContext = _
  var store: SessionStore[String, Array[T]] = _

  override def init(context: ProcessorContext): Unit = {
    this.context = context
    this.store = context.getStateStore(stateStoreName).asInstanceOf[SessionStore[String, Array[T]]]
    this.context.schedule(
      scheduleInterval,
      PunctuationType.WALL_CLOCK_TIME,
      _ => {
        for (key <- keySet) {
          val storeEntry = store.fetch(key)
          while (storeEntry.hasNext) {
            val keyValue: KeyValue[Windowed[String], Array[T]] = storeEntry.next()
            val peekKey = keyValue.key
            val now = Instant.now()
            val windowAge: Long = ChronoUnit.SECONDS.between(peekKey.window().startTime(), now)

            if (peekKey.window().start() > 0 && windowAge > windowDuration.toSeconds) { // Check if the window is exceeded. If yes, downstream the data
              val windowedKey: Windowed[String] = keyValue.key
              val storeValue = keyValue.value
              context.forward(key, storeValue, To.all().withTimestamp(now.toEpochMilli))
              context.commit()
              this.store.remove(windowedKey) // Delete entry from the state store
              keySet -= key
            }
          }
          storeEntry.close() // Close iterator to avoid memory leak
        }
      }
    )
  }

  override def transform(key: String, value: T): KeyValue[String, T] = {
    if (!keySet.contains(key)) {
      keySet += key
    }
    null
  }

  override def close(): Unit = {}
}

class SuppressTransformerSupplier[T](stateStoreName: String, windowDuration: Duration) extends TransformerSupplier[String, T, KeyValue[String, T]] {
  override def get(): SuppressTransformer[T] = new SuppressTransformer(stateStoreName, windowDuration)
}
Topology.scala
val windowDuration = Duration.ofMinutes(5)

val stateStore: Materialized[String, util.ArrayList[Bytes], ByteArraySessionStore] =
  Materialized
    .as[String, util.ArrayList[Bytes]](
      new RocksDbSessionBytesStoreSupplier(stateStoreName, stateStoreRetention.toMillis)
    )

builder.stream[String, Bytes](Pattern.compile(topic + "(-\\d+)?"))
  .filter((k, _) => k != null)
  .groupByKey
  .windowedBy(SessionWindows `with` sessionWindowMinDuration `grace` sessionGracePeriodDuration)
  .aggregate(initializer = {
    new util.ArrayList[Bytes]()
  })(aggregator = (_: String, instance: Bytes, agg: util.ArrayList[Bytes]) => {
    agg.add(instance)
    agg
  }, merger = (_: String, state1: util.ArrayList[Bytes], state2: util.ArrayList[Bytes]) => {
    state1.addAll(state2)
    state1
  })(stateStore)
  .toStream
  .map((k, v) => (k.key(), v))
  .transform(new SuppressTransformerSupplier[util.ArrayList[Bytes]](stateStoreName, windowDuration), stateStoreName)
  .unsetRepartitioningRequired()
  .to(f"$topic-aggregated")
I don't think that is a memory leak. I mean, it could be, but from what you mention, it looks like normal JVM behavior.
What happens is that the JVM takes all the memory it can. That is the heap memory, and the maximum is configured with the -Xmx option. Your state takes it all (I assume, based on the graph) and then releases the objects, but the JVM normally doesn't release the memory back to the OS. That is the reason your pod always stays at its highest level.
There are a few garbage collectors that could possibly do that for you.
I personally use the GC that is fastest and let the JVM take as much memory as it requires. At the end of the day, that's the power of pod isolation. I normally set the heap max to 80% of the pod's max memory.
Here is a related question: Does GC release back memory to OS?
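As an illustration only (these options are not from the answer above), this is roughly what that looks like for a containerized JVM, assuming JDK 12 or newer:
# Cap the heap at 80% of the container memory instead of hard-coding -Xmx
-XX:MaxRAMPercentage=80.0
# G1 on JDK 12+ can return committed-but-unused heap back to the OS
-XX:+UseG1GC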

How do you test the read-side processor in Lagom?

I have a read side that is supposed to write entries to Cassandra. I would like to write a test that sends an event to the read side and then checks in Cassandra that the row has indeed been written. How am I supposed to access a Cassandra session within the test?
I do it the following way:
class MyProcessorSpec extends AsyncWordSpec with BeforeAndAfterAll with Matchers {

  private val server = ServiceTest.startServer(ServiceTest.defaultSetup.withCassandra(true)) { ctx =>
    new MyApplication(ctx) {
      override def serviceLocator = NoServiceLocator
      override lazy val readSide: ReadSideTestDriver = new ReadSideTestDriver
    }
  }

  override def afterAll(): Unit = server.stop()

  private val testDriver = server.application.readSide
  private val repository = server.application.repo
  private val offset = new AtomicInteger()

  "The event processor" should {
    "create an entity" in {
      for {
        _ <- feed(createdEvent.id, createdEvent)
        entity <- repository.getEntityIdByKey(createdEvent.keys.head)
        entities <- repository.getAllEntities
      } yield {
        entity should be(Some(createdEvent.id))
        entities.length should be(1)
      }
    }
  }

  private def feed(id: MyId, event: MyEvent): Future[Done] = {
    testDriver.feed(id.underlying, event, Sequence(offset.getAndIncrement))
  }
}

Random fail on KafkaStreams stateful application

Hi, here is a problem I have been stumbling upon for a few days and can't figure out by myself.
I am using the Kafka Streams Scala API v2.0.0.
I have two incoming streams, branched over two handlers for segregation, each declaring a Transformer that uses a common StateStore.
To give a quick overview, it looks like this:
def buildStream(builder: StreamsBuilder, config: Config) = {
  val store = Stores.keyValueStoreBuilder[String, AggregatedState](Stores.persistentKeyValueStore(config.storeName), ...)
  builder.addStateStore(store)

  val handlers = List(handler1, handler2)

  builder
    .stream(config.topic)
    .branch(handlers.map(_.accepts).toList: _*)                // Dispatch events to the first handler accepting them
    .zip(handlers.toList)                                      // (KStream[K, V], Handler)
    .map { case (stream, handler) => handler.handle(stream) }  // Process the events on the correct handler
    .reduce((s1, s2) => s1.merge(s2))                          // Merge them back, as they return the same type
    .to(config.output)

  builder
}
Each of my handlers looks the same: take an event, do some operations, pass it through the transform() method to derive a state, and emit an aggregate:
class Handler1(config: Config) {
  def accepts(key: String, value: Event): Boolean = ??? // Implementation not needed

  def handle(stream: KStream[String, Event]) = {
    stream
      .(join/map/filter)
      .transform(new Transformer1(config.storeName))
  }
}

class Handler2(config: Config) {
  def accepts(key: String, value: Event): Boolean = ??? // Implementation not needed

  def handle(stream: KStream[String, Event]) = {
    stream
      .(join/map/filter)
      .transform(new Transformer2(config.storeName))
  }
}
The transformers use the same StateStore, with the following logic: for a new event, check whether its aggregate exists; if yes, update it, store it, and emit the new aggregate; otherwise build the aggregate, store it, and emit it.
class Transformer1(storeName: String) {
  private var store: KeyValueStore[String, AggregatedState] = _

  override def init(context: ProcessorContext): Unit = {
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, AggregatedState]]
  }

  override def transform(key: String, value: Event): (String, AggregatedState) = {
    val existing: Option[AggregatedState] = Option(store.get(key))
    val agg = existing.map(_.updateWith(value)).getOrElse(new AggregatedState(value))
    store.put(key, agg)

    if (agg.isTerminal) {
      store.delete(key)
    }

    if (isDuplicate(existing, agg)) {
      null // Tombstone, we have a duplicate
    } else {
      (key, agg) // Emit the new aggregate
    }
  }

  override def close(): Unit = ()
}
class Transformer2(storeName: String) {
  private var store: KeyValueStore[String, AggregatedState] = _

  override def init(context: ProcessorContext): Unit = {
    store = context.getStateStore(storeName).asInstanceOf[KeyValueStore[String, AggregatedState]]
  }

  override def transform(key: String, value: Event): (String, AggregatedState) = {
    val existing: Option[AggregatedState] = Option(store.get(key))
    val agg = existing.map(_.updateWith(value)).getOrElse(new AggregatedState(value))
    store.put(key, agg)

    if (agg.isTerminal) {
      store.delete(key)
    }

    if (isDuplicate(existing, agg)) {
      null // Tombstone, we have a duplicate
    } else {
      (key, agg) // Emit the new aggregate
    }
  }

  override def close(): Unit = ()
}
Transformer2 is the same; only the business logic changes (how to merge a new event with an aggregated state).
The problem I have is that on stream startup, I can either get a normal startup or a boot exception:
15:07:23,420 ERROR org.apache.kafka.streams.processor.internals.AssignedStreamsTasks - stream-thread [job-tracker-prod-5ba8c2f7-d7fd-48b5-af4a-ac78feef71d3-StreamThread-1] Failed to commit stream task 1_0 due to the following error:
org.apache.kafka.streams.errors.ProcessorStateException: task [1_0] Failed to flush state store KSTREAM-AGGREGATE-STATE-STORE-0000000003
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:242)
at org.apache.kafka.streams.processor.internals.AbstractTask.flushState(AbstractTask.java:198)
at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:406)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:380)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:368)
at org.apache.kafka.streams.processor.internals.AssignedTasks$1.apply(AssignedTasks.java:67)
at org.apache.kafka.streams.processor.internals.AssignedTasks.applyToRunningTasks(AssignedTasks.java:362)
at org.apache.kafka.streams.processor.internals.AssignedTasks.commit(AssignedTasks.java:352)
at org.apache.kafka.streams.processor.internals.TaskManager.commitAll(TaskManager.java:401)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:1035)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:845)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736)
Caused by: java.lang.IllegalStateException: This should not happen as timestamp() should only be called while a record is processed
at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.timestamp(AbstractProcessorContext.java:161)
at org.apache.kafka.streams.state.internals.StoreChangeLogger.logChange(StoreChangeLogger.java:59)
at org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:66)
at org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:31)
at org.apache.kafka.streams.state.internals.InnerMeteredKeyValueStore.put(InnerMeteredKeyValueStore.java:206)
at org.apache.kafka.streams.state.internals.MeteredKeyValueBytesStore.put(MeteredKeyValueBytesStore.java:117)
at com.mycompany.streamprocess.Transformer1.transform(Transformer1.scala:49) // Line with store.put(key, agg)
I have already searched and found results saying "the transformer must use a factory pattern", which is what is used here (as .transform takes the transformer and creates a TransformerSupplier under the hood).
As the error is pseudo-random (I could re-create it sometimes), I guess it could be a race condition on startup, but I found nothing conclusive.
Is it because I use the same state-store on different transformers?
I assume you are hitting https://issues.apache.org/jira/browse/KAFKA-7250
It's fixed in version 2.0.1 and 2.1.0.
If you cannot upgrade, you need to pass in the TransformerSupplier explicitly, because the Scala API constructs the supplier incorrectly in 2.0.0:
.transform(() => new Transformer1(config.storeName))

Akka streams change return type of 3rd party flow/stage

I have a graph that reads from SQS, writes to another system, and then deletes from SQS. In order to delete from SQS I need a receipt handle on the SqsMessage object.
In the case of HTTP flows, the signature of the flow allows me to say which type gets emitted downstream of the flow:
Flow[(HttpRequest, T), (Try[HttpResponse], T), HostConnectionPool]
In this case I can set T to SqsMessage and I still have all the data I need.
However, some connectors, e.g. Google Cloud Pub/Sub, emit a (to me) completely useless pub sub id.
Downstream of the Pub/Sub flow I need to be able to access the SQS message id which I had prior to the Pub/Sub flow.
What is the best way to work around this without rewriting the Pub/Sub connector?
I conceptually want something a bit like this:
Flow[SqsMessage] //i have my data at this point
within(
.map(toPubSubMessage)
.via(pubSub))
... from here I have the same type I had before the within block; however, it still behaves like a linear graph with back pressure, etc.
You can use the PassThrough integration pattern.
As an example of its usage, look at akka-streams-kafka -> class akka.kafka.scaladsl.Producer -> method def flow[K, V, PassThrough].
So just implement your own stage with a PassThrough element; an example is akka.kafka.internal.ProducerStage[K, V, PassThrough]:
package my.package

import java.util.concurrent.atomic.AtomicInteger

import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success, Try}

import akka.stream._
import akka.stream.ActorAttributes.SupervisionStrategy
import akka.stream.stage._

final case class Message[V, PassThrough](record: V, passThrough: PassThrough)

final case class Result[R, PassThrough](result: R, message: PassThrough)

class PathThroughStage[R, V, PassThrough]
  extends GraphStage[FlowShape[Message[V, PassThrough], Future[Result[R, PassThrough]]]] {

  private val in = Inlet[Message[V, PassThrough]]("messages")
  private val out = Outlet[Future[Result[R, PassThrough]]]("result") // a Future is pushed downstream, matching the FlowShape above
  override val shape = FlowShape(in, out)

  override protected def createLogic(inheritedAttributes: Attributes) = {
    val logic = new GraphStageLogic(shape) with StageLogging {
      lazy val decider = inheritedAttributes.get[SupervisionStrategy]
        .map(_.decider)
        .getOrElse(Supervision.stoppingDecider)

      // The Future created in onPush needs an ExecutionContext; use the materializer's dispatcher.
      implicit lazy val ec: ExecutionContext = materializer.executionContext

      val awaitingConfirmation = new AtomicInteger(0)
      @volatile var inIsClosed = false
      var completionState: Option[Try[Unit]] = None

      override protected def logSource: Class[_] = classOf[PathThroughStage[R, V, PassThrough]]
      def checkForCompletion() = {
        if (isClosed(in) && awaitingConfirmation.get == 0) {
          completionState match {
            case Some(Success(_)) => completeStage()
            case Some(Failure(ex)) => failStage(ex)
            case None => failStage(new IllegalStateException("Stage completed, but there is no info about status"))
          }
        }
      }

      val checkForCompletionCB = getAsyncCallback[Unit] { _ =>
        checkForCompletion()
      }

      val failStageCb = getAsyncCallback[Throwable] { ex =>
        failStage(ex)
      }

      setHandler(out, new OutHandler {
        override def onPull() = {
          tryPull(in)
        }
      })

      setHandler(in, new InHandler {
        override def onPush() = {
          val msg = grab(in)
          val f = Future[Result[R, PassThrough]] {
            try {
              Result( // TODO YOUR logic
                msg.record,
                msg.passThrough)
            } catch {
              case exception: Exception =>
                decider(exception) match {
                  case Supervision.Stop =>
                    failStageCb.invoke(exception)
                  case _ =>
                    Result(exception, msg.passThrough)
                }
            }
            if (awaitingConfirmation.decrementAndGet() == 0 && inIsClosed) checkForCompletionCB.invoke(())
          }
          awaitingConfirmation.incrementAndGet()
          push(out, f)
        }

        override def onUpstreamFinish() = {
          inIsClosed = true
          completionState = Some(Success(()))
          checkForCompletion()
        }

        override def onUpstreamFailure(ex: Throwable) = {
          inIsClosed = true
          completionState = Some(Failure(ex))
          checkForCompletion()
        }
      })

      override def postStop() = {
        log.debug("Stage completed")
        super.postStop()
      }
    }
    logic
  }
}
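A rough usage sketch of how such a stage could be wired so the original SQS message survives the Pub/Sub hop; the message types and the toPubSubMessage helper below are made-up placeholders, not the real connector types:

import akka.NotUsed
import akka.stream.scaladsl.Flow

// Placeholder models for illustration only (not the real SQS / Pub/Sub classes).
final case class SqsMessage(receiptHandle: String, body: String)
final case class PubSubMessage(data: String)
final case class PubSubId(value: String)

def toPubSubMessage(msg: SqsMessage): PubSubMessage = PubSubMessage(msg.body)

// Wrap the record together with its SqsMessage, run it through the pass-through stage,
// resolve the Future the stage emits, and recover the original SqsMessage downstream.
val keepSqsMessage: Flow[SqsMessage, SqsMessage, NotUsed] =
  Flow[SqsMessage]
    .map(msg => Message(toPubSubMessage(msg), msg))
    .via(Flow.fromGraph(new PathThroughStage[PubSubId, PubSubMessage, SqsMessage]))
    .mapAsync(parallelism = 1)(identity)
    .map(_.message) // the pass-through field still carries the SqsMessage with its receipt handle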

Neo4j OGM example with Scala

I tried the example mentioned in Luanne's article The essence of Spring Data Neo4j 4 in Scala. The code can be found in the neo4j-ogm-scala repository.
package neo4j.ogm.scala.domain

import org.neo4j.ogm.annotation.GraphId
import scala.beans.BeanProperty
import org.neo4j.ogm.annotation.NodeEntity
import org.neo4j.ogm.annotation.Relationship
import org.neo4j.ogm.annotation.RelationshipEntity
import org.neo4j.ogm.annotation.StartNode
import org.neo4j.ogm.annotation.EndNode
import org.neo4j.ogm.session.Session
import org.neo4j.ogm.session.SessionFactory
import org.neo4j.ogm.transaction.Transaction

abstract class Entity {
  @GraphId
  @BeanProperty
  var id: Long = _

  override def equals(o: Any): Boolean = o match {
    case other: Entity => other.id.equals(this.id)
    case _ => false
  }

  override def hashCode: Int = id.hashCode()
}

@NodeEntity
class Category extends Entity {
  var name: String = _

  def this(name: String) {
    this()
    this.name = name
  }
}

@NodeEntity
class Ingredient extends Entity {
  var name: String = _

  @Relationship(`type` = "HAS_CATEGORY", direction = "OUTGOING")
  var category: Category = _

  @Relationship(`type` = "PAIRS_WITH", direction = "UNDIRECTED")
  var pairings: Set[Pairing] = Set()

  def addPairing(pairing: Pairing): Unit = {
    pairing.first.pairings += pairing
    pairing.second.pairings += pairing
  }

  def this(name: String, category: Category) {
    this()
    this.name = name
    this.category = category
  }
}

@RelationshipEntity(`type` = "PAIRS_WITH")
class Pairing extends Entity {
  @StartNode
  var first: Ingredient = _

  @EndNode
  var second: Ingredient = _

  def this(first: Ingredient, second: Ingredient) {
    this()
    this.first = first
    this.second = second
  }
}

object Neo4jSessionFactory {
  val sessionFactory = new SessionFactory("neo4j.ogm.scala.domain")

  def getNeo4jSession(): Session = {
    System.setProperty("username", "neo4j")
    System.setProperty("password", "neo4j")
    sessionFactory.openSession("http://localhost:7474")
  }
}

object Main extends App {
  val spices = new Category("Spices")
  val turmeric = new Ingredient("Turmeric", spices)
  val cumin = new Ingredient("Cumin", spices)
  val pairing = new Pairing(turmeric, cumin)
  cumin.addPairing(pairing)

  val session = Neo4jSessionFactory.getNeo4jSession()
  val tx: Transaction = session.beginTransaction()
  try {
    session.save(spices)
    session.save(turmeric)
    session.save(cumin)
    session.save(pairing)
    tx.commit()
  } catch {
    case e: Exception => // tx.rollback()
  } finally {
    // tx.commit()
  }
}
The problem is that nothing gets saved to Neo4j. Can you please point out the problem in my code?
Thanks,
Manoj.
Scala's Long is an instance of a value class. Value classes were explicitly introduced to avoid allocating runtime objects. At the JVM level, therefore, Scala's Long is equivalent to Java's primitive long, which is why it has the primitive type signature J. It therefore cannot be null and should not be used as a graphId. Although Scala will mostly auto-box between its own Long and Java's Long class, this doesn't apply to declarations, only to operations on those objects.
The @GraphId isn't being picked up on your entities. I have zero knowledge of Scala, but it looks like the Scala Long isn't liked much by the OGM; var id: java.lang.Long = _ works fine.
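A minimal sketch of the adjusted base class along those lines (the rest of the mapping left as in the question); the null checks are an addition, since a boxed id starts out as null:

import org.neo4j.ogm.annotation.GraphId
import scala.beans.BeanProperty

abstract class Entity {
  // java.lang.Long instead of scala.Long, so the id can stay null until the node is persisted
  @GraphId
  @BeanProperty
  var id: java.lang.Long = _

  override def equals(o: Any): Boolean = o match {
    case other: Entity => other.id == this.id // Scala's == is null-safe and delegates to equals
    case _ => false
  }

  override def hashCode: Int = if (id == null) 0 else id.hashCode()
}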