My question is about Kafka Streams KTable.groupBy().aggregate() and the resulting aggregated values.
Situation
I am trying to aggregate minute events per day.
I have a minute event generator (not shown here) that generates events for a few houses. Sometimes the event value is wrong and the minute event must be republished.
Minute events are published in the topic "minutes".
I am doing an aggregation of these events per day and per house using Kafka Streams groupBy and aggregate.
Problem
Since there are 1440 minutes in a day, an aggregation should never contain more than 1440 values.
There should also never be an aggregation with a negative number of events.
... But it happens anyway, and we do not understand what is wrong in our code.
Sample code
Here is simplified sample code to illustrate the problem. The IllegalStateException is sometimes thrown.
StreamsBuilder builder = new StreamsBuilder();
KTable<String, MinuteEvent> minuteEvents = builder.table(
"minutes",
Consumed.with(Serdes.String(), minuteEventSerdes),
Materialized.<String, MinuteEvent, KeyValueStore<Bytes, byte[]>>with(Serdes.String(), minuteEventSerdes)
.withCachingDisabled());
// perform daily aggregation
KStream<String, MinuteAggregate> dayEvents = minuteEvents
// group by house and day
.filter((key, minuteEvent) -> minuteEvent != null && StringUtils.isNotBlank(minuteEvent.house))
.groupBy((key, minuteEvent) -> KeyValue.pair(
minuteEvent.house + "##" + minuteEvent.instant.atZone(ZoneId.of("Europe/Paris")).truncatedTo(ChronoUnit.DAYS), minuteEvent),
Grouped.<String, MinuteEvent>as("minuteEventsPerHouse")
.withKeySerde(Serdes.String())
.withValueSerde(minuteEventSerdes))
.aggregate(
MinuteAggregate::new,
(String key, MinuteEvent value, MinuteAggregate aggregate) -> aggregate.addLine(key, value),
(String key, MinuteEvent value, MinuteAggregate aggregate) -> aggregate.removeLine(key, value),
Materialized
.<String, MinuteAggregate, KeyValueStore<Bytes, byte[]>>as(BILLLINEMINUTEAGG_STORE)
.withKeySerde(Serdes.String())
.withValueSerde(minuteAggSerdes)
.withLoggingEnabled(new HashMap<>())) // keep this aggregate state forever
.toStream();
// check daily aggregation
dayEvents.filter((key, value) -> {
if (value.nbValues < 0) {
throw new IllegalStateException("got an aggregate with a negative number of values " + value.nbValues);
}
if (value.nbValues > 1440) {
throw new IllegalStateException("got an aggregate with too many values " + value.nbValues);
}
return true;
}).to("days", Produced.with(Serdes.String(), minuteAggSerdes));
And here are the sample classes used in this code snippet:
public class MinuteEvent {
public final String house;
public final double sensorValue;
public final Instant instant;
public MinuteEvent(String house, double sensorValue, Instant instant) {
this.house = house;
this.sensorValue = sensorValue;
this.instant = instant;
}
}
public class MinuteAggregate {
public int nbValues = 0;
public double totalSensorValue = 0.;
public String house = "";
public MinuteAggregate addLine(String key, MinuteEvent value) {
this.nbValues = this.nbValues + 1;
this.totalSensorValue = this.totalSensorValue + value.sensorValue;
this.house = value.house;
return this;
}
public MinuteAggregate removeLine(String key, MinuteEvent value) {
this.nbValues = this.nbValues -1;
this.totalSensorValue = this.totalSensorValue - value.sensorValue;
return this;
}
public MinuteAggregate() {
}
}
If someone could tell us what we are doing wrong here, and why we get these unexpected values, that would be great.
Additional notes
We configure our streams job to run with 4 threads: properties.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
We are forced to use KTable.groupBy().aggregate() because minute values can be
republished with a different sensorValue for an already published Instant, and the daily aggregation must be modified accordingly.
KStream.groupBy().aggregate() does not have both an adder AND a subtractor.
I think it is actually possible for the count to become negative temporarily.
The reason is that each update to your first KTable sends two messages downstream: the old value, to be subtracted in the downstream aggregation, and the new value, to be added to the downstream aggregation. Both messages are processed independently in the downstream aggregation.
If the current count is zero and a subtraction is processed before the corresponding addition, the count becomes negative temporarily.
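If the range check is only meant as a sanity check, one option (a sketch, not a definitive fix) is to tolerate these transient intermediate results instead of throwing, for example by logging them with peek; this reuses dayEvents and the serdes from the question, and log is an assumed logger:
// Sketch: out-of-range values can be transient, because the subtract/add pair for a single
// upstream update arrives as two independent records; "log" is an assumed logger.
dayEvents
    .peek((key, value) -> {
        if (value.nbValues < 0 || value.nbValues > 1440) {
            log.warn("transient out-of-range aggregate for {}: {}", key, value.nbValues);
        }
    })
    .to("days", Produced.with(Serdes.String(), minuteAggSerdes));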
I have two Kafka sources.
I am trying to perform a word count and merge the counts from the two streams.
I have created a 1-minute window for both data streams and am applying CoGroupByKey; from the DoFn I am emitting <Key, Value> pairs of (word, count).
On top of this CoGroupByKey function, I am applying a stateful ParDo.
Say I get (Test,2) from stream 1 and (Test,3) from stream 2 in the same window; then in the CoGroupByKey function I will merge them as (Test,5). But if they do not fall in the same window, I will emit (Test,2) and (Test,3).
Now I apply state for merging these elements.
So finally I should get (Test,5) as the result, but I am not getting the expected result. All elements from stream 1 are going to one partition and
elements from stream 2 to another partition, and that is why I am getting:
(Test,2)
(Test,3)
// word count stream from kafka topic 1
PCollection<KV<String,Long>> stream1 = ...
// word count stream from kafka topic 2
PCollection<KV<String,Long>> stream2 = ...
PCollection<KV<String,Long>> windowed1 =
stream1.apply(
Window
.<KV<String,Long>>into(FixedWindows.of(Duration.millis(60000)))
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.millis(1000))
.discardingFiredPanes());
PCollection<KV<String,Long>> windowed2 =
stream2.apply(
Window
.<KV<String,Long>>into(FixedWindows.of(Duration.millis(60000)))
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.millis(1000))
.discardingFiredPanes());
final TupleTag<Long> count1 = new TupleTag<Long>();
final TupleTag<Long> count2 = new TupleTag<Long>();
// Merge collection values into a CoGbkResult collection.
PCollection<KV<String, CoGbkResult>> joinedStream =
KeyedPCollectionTuple.of(count1, windowed1).and(count2, windowed2)
.apply(CoGroupByKey.<String>create());
// applying a stateful operation after the CoGroupByKey
PCollection<KV<String,Long>> finalCountStream =
joinedStream.apply(ParDo.of(
new DoFn<KV<String, CoGbkResult>, KV<String,Long>>() {
@StateId(stateId)
private final StateSpec<MapState<String, Long>> mapState =
StateSpecs.map();
@ProcessElement
public void processElement(
ProcessContext processContext,
@StateId(stateId) MapState<String, Long> state) {
KV<String, CoGbkResult> element = processContext.element();
Iterable<Long> counts1 = element.getValue().getAll(count1);
Iterable<Long> counts2 = element.getValue().getAll(count2);
Long sumAmount =
StreamSupport
.stream(
Iterables.concat(counts1, counts2).spliterator(), false)
.collect(Collectors.summingLong(n -> n));
System.out.println(element.getKey()+"::"+sumAmount);
// processContext.output(element.getKey()+"::"+sumAmount);
Long currCount =
state.get(element.getKey()).read() == null
? 0L
: state.get(element.getKey()).read();
Long newCount = currCount+sumAmount;
state.put(element.getKey(),newCount);
processContext.output(KV.of(element.getKey(),newCount));
}
}));
finalCountStream
.apply("finalState", ParDo.of(new DoFn<KV<String,Long>, String>() {
@StateId(myState)
private final StateSpec<MapState<String, Long>> mapState =
StateSpecs.map();
@ProcessElement
public void processElement(
ProcessContext c,
@StateId(myState) MapState<String, Long> state) {
KV<String,Long> e = c.element();
Long currCount = state.get(e.getKey()).read()==null
? 0L
: state.get(e.getKey()).read();
Long newCount = currCount+e.getValue();
state.put(e.getKey(),newCount);
c.output(e.getKey()+":"+newCount);
}
}))
.apply(KafkaIO.<Void, String>write()
.withBootstrapServers("localhost:9092")
.withTopic("test")
.withValueSerializer(StringSerializer.class)
.values());
Alternatively, you can use a Flatten + Combine approach, which should give you simpler code:
PCollection<KV<String, Long>> pc1 = ...;
PCollection<KV<String, Long>> pc2 = ...;
PCollectionList<KV<String, Long>> pcs = PCollectionList.of(pc1).and(pc2);
PCollection<KV<String, Long>> merged = pcs.apply(Flatten.<KV<String, Long>>pCollections());
merged.apply(window...).apply(Combine.perKey(Sum.ofLongs()))
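Spelled out, that last line could look like this (a sketch only; the 1-minute fixed window is carried over from your snippets and is an assumption about what you actually want):
PCollection<KV<String, Long>> summed =
    merged
        // window choice is an assumption; reuse whatever windowing fits your latency needs
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.millis(60000))))
        .apply(Combine.perKey(Sum.ofLongs()));
Combine.perKey sums all the counts for a key within each window, so the merge happens in one place instead of in a stateful ParDo.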
You have set up both streams with the trigger Repeatedly.forever(AfterPane.elementCountAtLeast(1)) and discardingFiredPanes(). This will cause the CoGroupByKey to output as soon as possible after each input element and then reset its state each time. So it is normal behavior that it basically passes each input straight through.
Let me explain more: CoGroupByKey is executed like this:
All elements from stream1 and stream2 are tagged as you specified. So every (key, value1) from stream1 effectively becomes (key, (count1, value1)), and every (key, value2) from stream2 becomes (key, (count2, value2)).
These tagged collections are flattened together. So now there is one collection with elements like (key, (count1, value1)) and (key, (count2, value2)).
The combined collection goes through a normal GroupByKey. This is where triggers happen. So with the default trigger, you get (key, [(count1, value1), (count2, value2), ...]) with all the values for a key getting grouped. But with your trigger, you will often get separate (key, [(count1, value1)]) and (key, [(count2, value2)]) because each grouping fires right away.
The output of the GroupByKey is just wrapped in the CoGbkResult API. In many runners this is simply a filtered view of the grouped iterable.
Of course, triggers are nondeterministic and runners are also allowed to have different implementations of CoGroupByKey. But the behavior you are seeing is expected. You probably don't want to use a trigger like that, or discarding mode, or else you need to do more grouping downstream.
Generally, doing a join with CoGBK is going to require some work downstream, until Beam supports retractions.
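For example, here is one sketch of the windowing from the question without the element-count trigger and without discarding mode (keeping the rest of the pipeline as above; whether once-per-window output is acceptable for your latency needs is an assumption):
// Sketch: with the default trigger, each 1-minute window fires once, so CoGroupByKey
// sees all tagged values for a key in that window together.
PCollection<KV<String, Long>> windowed1 =
    stream1.apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.millis(60000))));
PCollection<KV<String, Long>> windowed2 =
    stream2.apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.millis(60000))));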
PipelineOptions options = PipelineOptionsFactory.create();
options.as(FlinkPipelineOptions.class)
.setRunner(FlinkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<KV<String,Long>> stream1 = new KafkaWordCount("localhost:9092","test1")
.build(p);
PCollection<KV<String,Long>> stream2 = new KafkaWordCount("localhost:9092","test2")
.build(p);
PCollectionList<KV<String, Long>> pcs = PCollectionList.of(stream1).and(stream2);
PCollection<KV<String, Long>> merged = pcs.apply(Flatten.<KV<String, Long>>pCollections());
merged.apply("finalState", ParDo.of(new DoFn<KV<String,Long>, String>() {
@StateId(myState)
private final StateSpec<MapState<String, Long>> mapState = StateSpecs.map();
@ProcessElement
public void processElement(ProcessContext c, @StateId(myState) MapState<String, Long> state) {
KV<String,Long> e = c.element();
System.out.println("Thread ID :"+ Thread.currentThread().getId());
Long currCount = state.get(e.getKey()).read()==null? 0L:state.get(e.getKey()).read();
Long newCount = currCount+e.getValue();
state.put(e.getKey(),newCount);
c.output(e.getKey()+":"+newCount);
}
})).apply(KafkaIO.<Void, String>write()
.withBootstrapServers("localhost:9092")
.withTopic("test")
.withValueSerializer(StringSerializer.class)
.values()
);
p.run().waitUntilFinish();
I created an application which uses Spark Streaming with a custom receiver for Google Pub/Sub.
I hit my performance limit and am interested in dropping messages without processing them. My idea was to store() only a subset of the messages that are read.
I used the apache/bahir receiver.
val pullResponse = client.projects().subscriptions().pull(subscriptionFullName, pullRequest).execute()
val receivedMessages = pullResponse.getReceivedMessages.asScala.toList
Utils.LOG.info(s"receivedMessages from PUB/SUB ${receivedMessages.size}")
rateLimiter.acquire(receivedMessages.size)
var factor: Int = 0
if (dropFactorBroad != null) {
factor = dropFactorBroad.value
} else {
Utils.LOG.info("dropFactorBroad is null")
}
val endIndex = if (factor > receivedMessages.length) receivedMessages.length else factor
val messagesToStore = receivedMessages.slice(0, receivedMessages.length - endIndex)
store(messagesToStore.map(x => {
val sm = new SparkPubsubMessage
sm.message = x.getMessage
sm
})
.iterator)
val ackRequest = new AcknowledgeRequest()
ackRequest.setAckIds(receivedMessages.map(x => x.getAckId).asJava)
client.projects().subscriptions().acknowledge(subscriptionFullName, ackRequest).execute()
dropFactorBroad is a Broadcast variable which is updated on every onBatchCompleted (unpersisted and created again).
It is not working; I'm getting:
java.lang.NullPointerException
at com.mag.ingester.ReceiverDropFactorBroadcaster.value(ReceiverDropFactorBroadcaster.scala:20)
at com.mag.pubSubReceiver.PubsubReceiver.receive(PubsubInputDStream.scala:260)
at com.mag.pubSubReceiver.PubsubReceiver$$anon$1.run(PubsubInputDStream.scala:244)
ReceiverDropFactorBroadcaster is dropFactorBroad
How can I control what the receiver stores?
Should I kill the receivers, change the variable, and start them again? (How can that be done?)
Thanks
So I was reading a tutorial about Akka and came across this: http://manuel.bernhardt.io/2014/04/23/a-handful-akka-techniques/. I think he explained it pretty well; I just picked up Scala recently and am having difficulties with the tutorial above.
I wonder what the difference is between RoundRobinRouter and the current RoundRobinRoutingLogic? Obviously the implementations are quite different.
Previously, the implementation with RoundRobinRouter was:
val workers = context.actorOf(Props[ItemProcessingWorker].withRouter(RoundRobinRouter(100)))
with processBatch
def processBatch(batch: List[BatchItem]) = {
if (batch.isEmpty) {
log.info(s"Done migrating all items for data set $dataSetId. $totalItems processed items, we had ${allProcessingErrors.size} errors in total")
} else {
// reset processing state for the current batch
currentBatchSize = batch.size
allProcessedItemsCount = currentProcessedItemsCount + allProcessedItemsCount
currentProcessedItemsCount = 0
allProcessingErrors = currentProcessingErrors ::: allProcessingErrors
currentProcessingErrors = List.empty
// distribute the work
batch foreach { item =>
workers ! item
}
}
}
Here's my implementation with RoundRobinRoutingLogic:
var mappings : Option[ActorRef] = None
var router = {
val routees = Vector.fill(100) {
mappings = Some(context.actorOf(Props[Application3]))
context watch mappings.get
ActorRefRoutee(mappings.get)
}
Router(RoundRobinRoutingLogic(), routees)
}
and I treated processBatch as such:
def processBatch(batch: List[BatchItem]) = {
if (batch.isEmpty) {
println(s"Done migrating all items for data set $dataSetId. $totalItems processed items, we had ${allProcessingErrors.size} errors in total")
} else {
// reset processing state for the current batch
currentBatchSize = batch.size
allProcessedItemsCount = currentProcessedItemsCount + allProcessedItemsCount
currentProcessedItemsCount = 0
allProcessingErrors = currentProcessingErrors ::: allProcessingErrors
currentProcessingErrors = List.empty
// distribute the work
batch foreach { item =>
// println(item.id)
mappings.get ! item
}
}
}
I somehow cannot run this tutorial, and it's stuck at the point where it's iterating the batch list. I wonder what I did wrong.
Thanks
First of all, you have to understand the difference between them.
RoundRobinRouter is a Router that uses round-robin to select a connection,
while
RoundRobinRoutingLogic uses round-robin to select a routee.
You can provide your own RoutingLogic (it helped me to understand how Akka works under the hood):
class RedundancyRoutingLogic(nbrCopies: Int) extends RoutingLogic {
val roundRobin = RoundRobinRoutingLogic()
def select(message: Any, routees: immutable.IndexedSeq[Routee]): Routee = {
val targets = (1 to nbrCopies).map(_ => roundRobin.select(message, routees))
SeveralRoutees(targets)
}
}
Link to the docs: http://doc.akka.io/docs/akka/2.3.3/scala/routing.html
P.S. This doc is very clear and it has helped me the most.
Actually I misunderstood the method, and found out that the solution was to use RoundRobinPool, as stated in http://doc.akka.io/docs/akka/2.3-M2/project/migration-guide-2.2.x-2.3.x.html
For example RoundRobinRouter has been renamed to RoundRobinPool or
RoundRobinGroup depending on which type you are actually using.
from
val workers = context.actorOf(Props[ItemProcessingWorker].withRouter(RoundRobinRouter(100)))
to
val workers = context.actorOf(RoundRobinPool(100).props(Props[ItemProcessingWorker]), "router2")
I have a scenario where a certain number of operations, including a group by, have to be applied to a number of small (~300MB each) files. The operation looks like this:
df.groupBy(....).agg(....)
Now, to process multiple files, I can use a wildcard "/**/*.csv"; however, that creates a single RDD and partitions it for the operations. However, looking at the operations, it is a group by and involves a lot of shuffle, which is unnecessary if the files are mutually exclusive.
What I am looking for is a way to create independent RDDs from the files and operate on them independently.
It is more an idea than a full solution and I haven't tested it yet.
You can start by extracting your data processing pipeline into a function.
def pipeline(f: String, n: Int) = {
sqlContext
.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load(f)
.repartition(n)
.groupBy(...)
.agg(...)
.cache // Cache so we can force computation later
}
If your files are small, you can adjust the n parameter to use as small a number of partitions as possible to fit the data from a single file and avoid shuffling. It means you are limiting concurrency, but we'll get back to this issue later.
val n: Int = ???
Next you have to obtain a list of input files. This step depends on the data source, but most of the time it is more or less straightforward:
val files: Array[String] = ???
Next you can map the above list using the pipeline function:
val rdds = files.map(f => pipeline(f, n))
Since we limit concurrency at the level of a single file, we want to compensate by submitting multiple jobs. Let's add a simple helper which forces evaluation and wraps it in a Future:
import scala.concurrent._
import ExecutionContext.Implicits.global
def pipelineToFuture(df: org.apache.spark.sql.DataFrame) = Future {
df.rdd.foreach(_ => ()) // Force computation
df
}
Finally we can use the above helper on the rdds:
val result = Future.sequence(
rdds.map(rdd => pipelineToFuture(rdd)).toList
)
Depending on your requirements you can add onComplete callbacks or use reactive streams to collect the results.
If you have many files, and each file is small (you say 300MB above, which I would count as small for Spark), you could try using SparkContext.wholeTextFiles, which will create an RDD where each record is an entire file.
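For example (a minimal sketch using Spark's Java API; jsc is an assumed existing JavaSparkContext and the input path is hypothetical):
// Sketch: one record per file, so each file can be parsed and aggregated independently,
// with no shuffle across files; "jsc" and the path are assumptions for illustration.
JavaPairRDD<String, String> perFile = jsc.wholeTextFiles("/data/input/*.csv");
JavaPairRDD<String, Long> linesPerFile =
    perFile.mapValues(contents -> (long) contents.split("\n").length);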
This way we can write multiple RDDs in parallel:
public class ParallelWriteSevice implements IApplicationEventListener {
private static final IprogramLogger logger = programLoggerFactory.getLogger(ParallelWriteSevice.class);
private static ExecutorService executorService=null;
private static List<Future<Boolean>> futures=new ArrayList<Future<Boolean>>();
public static void submit(Callable callable) {
if(executorService==null)
{
executorService = Executors.newFixedThreadPool(15); // increase this based on the number of target tables
}
futures.add(executorService.submit(callable));
}
public static boolean isWriteSucess() {
boolean writeFailureOccured = false;
try {
for (Future<Boolean> future : futures) {
try {
Boolean writeStatus = future.get();
if (writeStatus == false) {
writeFailureOccured = true;
}
} catch (Exception e) {
logger.error("Erorr - Scdeduled write failed " + e.getMessage(), e);
writeFailureOccured = true;
}
}
} finally {
resetFutures();
if (executorService != null)
executorService.shutdown();
executorService = null;
}
return !writeFailureOccured;
}
private static void resetFutures() {
logger.error("resetFutures called");
//futures.clear();
}
}
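A hypothetical usage sketch (the DataFrames, output paths, and format are assumptions, not part of the class above): submit one Callable per independent write, then check the combined status once at the end.
// Each submitted Callable performs one write and reports success; exceptions and false
// return values both surface through isWriteSucess().
ParallelWriteSevice.submit(() -> {
    df1.write().mode("overwrite").parquet("/tmp/out/table1"); // hypothetical target
    return Boolean.TRUE;
});
ParallelWriteSevice.submit(() -> {
    df2.write().mode("overwrite").parquet("/tmp/out/table2"); // hypothetical target
    return Boolean.TRUE;
});
boolean allWritesSucceeded = ParallelWriteSevice.isWriteSucess();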