How does a sawtooth pattern of Kafka consumer lag emerge? - apache-kafka

Some of my Kafka consumers (but not all) show an interesting pattern regarding their lag.
The following image shows two good examples:
dark-blue:
about 200 messages per second in topic
32 partitions
1 consumer in group (Python client, running on Kubernetes)
light-blue (same topic as dark-blue):
so also about 200 messages per second in topic
so also 32 partitions
1 consumer in group (also a Python client, running on Kubernetes)
brown:
about 1500 messages per second in topic
40 partitions
2 consumers in group (Java/Spring client, running on Kubernetes)
Both sawtoothy clients can handle a much larger throughput than that (tested by pausing, resuming, and letting them catch up), so they are not operating at their limits.
Rebalancing does happen sometimes (according to the logs), but much less often than the jumps in the diagram, and the few rebalancing events don't correlate in time with the jumps.
The messages also do not come in batches. (Additional information for one of the affected topics was attached here.)
Where can this pattern originate from?

Just found out that the low-frequency sawtooth pattern is not real. And the explanation is quite interesting. ;)
When I check the consumer lag using the command line (kafka-consumer-groups --bootstrap-server=[...] --group [...] --describe), I see that the total consumer lag (the sum of the per-partition lags) fluctuates very quickly. At one point it's around 6000, 2 seconds later it's around 1000, and again 2 seconds later it might be 9000.
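To watch this fluctuation yourself, here is a rough sketch that sums the per-partition lags from the CLI output. The broker address and group name are placeholders, and the column layout of the --describe output is an assumption that may vary between Kafka versions:
import subprocess

# Placeholders: point these at your own cluster and group.
out = subprocess.run(
    ["kafka-consumer-groups", "--bootstrap-server", "localhost:9092",
     "--group", "my-group", "--describe"],
    capture_output=True, text=True, check=True,
).stdout

rows = [line.split() for line in out.splitlines() if line.strip()]
header = next(row for row in rows if "LAG" in row)  # assumes a LAG column
lag_col = header.index("LAG")
# Non-numeric entries (e.g. "-") are skipped.
total = sum(int(row[lag_col]) for row in rows
            if len(row) > lag_col and row[lag_col].isdigit())
print(f"total lag: {total}")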
The graph shown, however, seems to be based on samples taken at a much lower frequency, which violates the Nyquist–Shannon sampling theorem. The fast fluctuation gets aliased into a slow apparent oscillation, and we see a Moiré pattern.
Conclusion: The sawtooth pattern is just an illusion.
For completeness, here is a simulation depicting the effect:
#!/usr/bin/env python3
"""Simulate the moiré effect seen in the Kafka consumer-lag graph."""
import random

import matplotlib.pyplot as plt


def x_noise_sampling() -> int:
    """Jitter the sample position a little, as a real scraper would."""
    return 31 + random.randint(-6, 6)


def main() -> None:
    max_x = 7000
    sample_rate = 97  # deliberately close to, but not equal to, the period
    xs = list(range(max_x))
    ys = [x % 100 for x in xs]  # fast sawtooth signal with period 100
    xs2 = [x + x_noise_sampling() for x in range(0, max_x - 100, sample_rate)]
    ys2 = [ys[x2] for x2 in xs2]
    plt.figure(figsize=(16, 9))
    plt.xlabel('Time')
    plt.xticks([])
    plt.yticks([])
    plt.ylabel('Consumer lag')
    signal, = plt.plot(xs, ys, '-')
    samples, = plt.plot(xs2, ys2, 'bo')
    interpolated, = plt.plot(xs2, ys2, '-')
    plt.legend([signal, samples, interpolated],
               ['Signal', 'Samples', 'Interpolated samples'])
    plt.savefig('sawtooth_moire.png', dpi=100)
    plt.show()


if __name__ == '__main__':
    main()

Related

Questions about DTMC: how to model this process?

In a certain manufacturing system, there are 2 machines, M1 and M2. M1 is a fast, high-precision machine, whereas M2 is a slow, low-precision machine. M2 is employed only when M1 is down, and it is assumed that M2 does not fail. Assume that the processing time of parts on M1, the processing time of parts on M2, the time to failure of M1, and the repair time of M1 are independent geometric random variables with parameters p1, p2, f, and r, respectively. Identify a suitable state space for the DTMC model of the above system and compute the TPM. Investigate the steady-state behavior of the DTMC.
How do I model this as a DTMC? What is the state space? I have tried to use a state space like this:
0: M1 is working and does not fail
1: M1 has failed, M2 is working, M1 is being repaired and does not finish repairing
But there are still some open questions: what happens after M1 finishes a part? Does it immediately process the next one, or does it first decide whether to fail? What happens if M1 fails while processing a part? And what is the transition probability matrix?
Thank you very much for your help!!!!
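For intuition, here is a minimal simulation sketch of one possible convention (failure and repair are checked once per time slot; the two-state reduction, the parameter values, and all names are assumptions for illustration, not the definitive model):
import random

# Illustrative values only; f and r are the geometric failure/repair
# parameters from the problem. p1 and p2 (processing times) would need
# a richer state space and are omitted in this two-state reduction.
f, r = 0.1, 0.4


def step(state: int) -> int:
    # State 0: M1 up (M2 idle). State 1: M1 down (M2 working).
    if state == 0:
        return 1 if random.random() < f else 0
    return 0 if random.random() < r else 1


def fraction_up(n: int = 100_000) -> float:
    """Estimate the long-run fraction of slots in which M1 is up."""
    state, up = 0, 0
    for _ in range(n):
        up += state == 0
        state = step(state)
    return up / n


print(fraction_up())  # should approach r / (f + r) = 0.8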

NTP, Unix and Messaging

I have two machines and both of them are set to UTC by the same NTP server. I have them set up so that Machine A sends messages to Machine B at time T and Machine B receives these messages at T + N. The messages seem to be receivable at Machine B for any positive value of N, and I'm wondering if anyone can tell me of a way to get Machine B to receive these messages at values of N which are less than zero.
The application of this is for frivolous message passing around magnetised water coolers where varying flux densities are related through Brown Capacitors on the fly.
I don't think this problem is one which can be solved in an instant, but it might merit some responses from those who know what I'm talking about.
You can use the type or code below. Say Machine B = 0 and Machine A = 1, or set them both up in a way that Time is greater than N, so the overall outcome of T + N is positive. The second piece of code I added was to unscramble the message, but I couldn't mix it with my old code, so if you could tweak it to fit your code it could fix your problem. So I removed all the unscrambling code and put in a list, so that when the message is transmitted it should be fixed:
Message = input("Please put your message: ")
Machine_B = 0
Machine_A = 1
while Machine_B <= Machine_A:
    Machine_B -= 1
    if Machine_B <= -10:  # sentinel reached: emit the message and stop
        Machine_B += 1
        print(Message)
        break
    else:
        Machine_B -= 1
message1 = [Message]  # wrap the message in a list before "transmitting" it
print(message1)

Can Kmeans total within sum of squares increase with number of clusters?

I am seeing an increase in the total within-cluster sum of squares when I use the code below. Is this even possible, or am I making some mistake in the code?
v <- foreach(i = 1:30, .combine = c) %dopar% {
  iter <- kmeans(clustering_data, centers = i, iter.max = 1000)
  iter$tot.withinss
}
K-means is a randomized algorithm. It is not guaranteed to find the optimum.
So you simply had a bad random start.
Yes. See Anony-Mousse's answer.
If you use the nstart = 25 argument of the kmeans() function, R runs the algorithm 25 times with different random starts and keeps the best solution (the one with the lowest tot.withinss) internally. This way you do not need to construct a foreach loop.
From the documentation of R's kmeans():
## random starts do help here with too many clusters
## (and are often recommended anyway!):
(cl <- kmeans(x, 5, nstart = 25))
You have to choose a reasonable value for nstart. Then bad random initializations are much less likely to dominate the result. (But there is no guarantee that tot.withinss is globally minimal after nstart runs.)
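For readers working in Python rather than R, here is an analogous multi-start check with scikit-learn, where n_init plays the role of nstart (the data is made up for illustration):
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # made-up data for illustration

# With 25 restarts per k, the inertia (total within-cluster sum of
# squares) should decrease with k far more reliably than with a
# single random start per k.
inertias = [KMeans(n_clusters=k, n_init=25).fit(X).inertia_
            for k in range(1, 11)]
print(inertias)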

Apache Spark flatMap time complexity

I've been trying to find a way to count the number of times sets of Strings occur in a transaction database (implementing the Apriori algorithm in a distributed fashion). The code I have currently is as follows:
val cand_br = sc.broadcast(cand)

transactions.flatMap(trans => freq(trans, cand_br.value))
  .reduceByKey(_ + _)

def freq(trans: Set[String], cand: Array[Set[String]]): Array[(Set[String], Int)] = {
  val res = ArrayBuffer[(Set[String], Int)]()
  for (c <- cand) {
    if (c.subsetOf(trans)) {
      res += ((c, 1))
    }
  }
  res.toArray
}
transactions starts out as an RDD[Set[String]], and I'm trying to convert it to an RDD[(K, V)], with K every element in cand and V the number of occurrences of each element of cand in the transaction list.
When watching performance in the UI, the flatMap stage takes about 3 minutes to finish, whereas the rest takes < 1 ms.
transactions.count() ~= 88000 and cand.length ~= 24000 for an idea of the data I'm dealing with. I've tried different ways of persisting the data, but I'm pretty positive that it's an algorithmic problem I am faced with.
Is there a more optimal solution to solve this subproblem?
PS: I'm fairly new to Scala / Spark framework, so there might be some strange constructions in this code
Probably, the right question to ask in this case would be: "what is the time complexity of this algorithm". I think it is very much unrelated to Spark's flatMap operation.
Rough O-complexity analysis
Given 2 collections of sets, of sizes m and n respectively, this algorithm counts how many elements of one collection are subsets of elements of the other, so it looks like complexity m x n. Looking one level deeper, we also see that subsetOf is linear in the number of elements of the subset (x subsetOf y amounts to checking every element of x against y), so the complexity is actually m x n x s, where s is the cardinality of the subsets being checked.
In other words, this flatMap operation has a lot of work to do.
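To put numbers on that, here is a back-of-the-envelope count (in Python for convenience, with an assumed average candidate-set cardinality of 3):
# Sizes from the question, with an assumed average candidate-set size.
m = 88_000  # transactions.count()
n = 24_000  # cand.length
s = 3       # assumed average cardinality of a candidate set
print(f"{m * n * s:.1e} element checks")  # about 6.3e9
Billions of elementary checks comfortably explain a multi-minute flatMap stage.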
Going Parallel
Now, going back to Spark, we can also observe that this algo is embarrassingly parallel, and we can use Spark's capabilities to our advantage.
To compare some approaches, I loaded the 'retail' dataset [1] and ran the algo on val cand = transactions.filter(_.size<4).collect. The data sizes are close to those in the question:
Transactions.count = 88162
cand.size = 15451
Some comparative runs on local mode:
Vanilla: 1.2 minutes
Increasing the number of transactions partitions up to the number of cores (8): 33 secs
I also tried an alternative implementation, using cartesian instead of flatMap:
transactions
  .cartesian(candRDD)
  .map { case (tx, cd) => (cd, if (cd.subsetOf(tx)) 1 else 0) }
  .reduceByKey(_ + _)
  .collect
But that resulted in much longer runs as seen in the top 2 lines of the Spark UI (cartesian and cartesian with a higher number of partitions): 2.5 min
Given I only have 8 logical cores available, going above that does not help.
Sanity checks:
Is there any added 'Spark flatMap time complexity'? Probably some, as it involves serializing closures and unpacking collections, but negligible in comparison with the function being executed.
Let's see if we can do a better job: I implemented the same algo in plain Scala:
val resLocal = reduceByKey(transLocal.flatMap(trans => freq(trans, cand)))
Where the reduceByKey operation is a naive implementation taken from [2]
Execution time: 3.67 seconds.
Spark gives you parallelism out of the box. This implementation is totally sequential and therefore takes longer to complete.
Last sanity check: a trivial flatMap operation:
transactions
  .flatMap(trans => Seq((trans, 1)))
  .reduceByKey(_ + _)
  .collect
Execution time: 0.88 secs
Conclusions:
Spark buys you parallelism and clustering, and this algo can take advantage of them. Use more cores and partition the input data accordingly.
There's nothing wrong with flatMap. The time complexity prize goes to the function inside it.

Strange Behaviour Using Scala Parallel Collections and setParallelism

I recently found out about parallel collections in Scala 2.9 and was excited to see that the degree of parallelism can be set using collection.parallel.ForkJoinTasks.defaultForkJoinPool.setParallelism.
However, when I tried an experiment of adding two vectors of size one million each, I found:
Using parallel collections with parallelism set to 64 is as fast as sequential (shown in the results).
Increasing setParallelism seems to change performance in a non-linear way. I would have at least expected monotonic behaviour (that is, performance should not degrade as I increase parallelism).
Can someone explain why this is happening?
object examplePar extends App {
  val Rnd = new Random()
  val numSims = 1
  val x = for (j <- 1 to 1000000) yield Rnd.nextDouble()
  val y = for (j <- 1 to 1000000) yield Rnd.nextDouble()
  val parInt = List(1, 2, 4, 8, 16, 32, 64, 128, 256)
  var avg: Double = 0.0
  var currTime: Long = 0

  for (j <- parInt) {
    collection.parallel.ForkJoinTasks.defaultForkJoinPool.setParallelism(j)
    avg = 0.0
    for (k <- 1 to numSims) {
      currTime = System.currentTimeMillis()
      (x zip y).par.map(x => x._1 + x._2)
      avg += (System.currentTimeMillis() - currTime)
    }
    println("Average Time to execute with Parallelism set to " + j.toString + " = " + (avg / numSims).toString + "ms")
  }

  currTime = System.currentTimeMillis()
  (x zip y).map(x => x._1 + x._2)
  println("Time to execute using Sequential = " + (System.currentTimeMillis() - currTime).toString + "ms")
}
The results of running the example using Scala 2.9.1 on a four-core processor are:
Average Time to execute with Parallelism set to 1 = 1047.0ms
Average Time to execute with Parallelism set to 2 = 594.0ms
Average Time to execute with Parallelism set to 4 = 672.0ms
Average Time to execute with Parallelism set to 8 = 343.0ms
Average Time to execute with Parallelism set to 16 = 375.0ms
Average Time to execute with Parallelism set to 32 = 391.0ms
Average Time to execute with Parallelism set to 64 = 406.0ms
Average Time to execute with Parallelism set to 128 = 813.0ms
Average Time to execute with Parallelism set to 256 = 469.0ms
Time to execute using Sequential = 406ms
Though these results are for one run, they are consistent when averaged over more runs.
Parallelism does not come free. It requires extra cycles to split the problem into smaller chunks, organize everything, and synchronize the result.
You can picture this as calling all your friends to help you move, waiting for them to get there, helping you load the truck, then taking them out to lunch, and finally, getting on with your task.
In your test case you are adding two doubles, which is a trivial exercise and takes so little time that the overhead from parallelization is greater than simply doing the task in a single thread.
Again, the analogy would be to call all your friends to help you move 3 suitcases. It would take you half a day to get rid of them, while you could finish by yourself in minutes.
To get any benefit from parallelization your task has to be complicated enough to warrant the extra overhead. Try doing some expensive calculations, for example a formula involving a mix of 5-10 trigonometric and logarithmic functions.
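The same effect is easy to demonstrate in any language. Here is a Python sketch (multiprocessing stands in for Scala's fork-join pool; the process count and the trig-heavy formula are arbitrary choices for illustration):
import math
import time
from multiprocessing import Pool


def cheap(pair):
    # Trivial per-element work, like the question's x._1 + x._2.
    return pair[0] + pair[1]


def expensive(pair):
    # Arbitrary stand-in for a costly formula (mix of trig and log).
    x, y = pair
    for _ in range(100):
        x = math.sin(x) + math.cos(y) + math.log1p(abs(x))
    return x


if __name__ == "__main__":
    data = [(float(i), float(i + 1)) for i in range(200_000)]
    for fn in (cheap, expensive):
        t0 = time.perf_counter()
        for p in data:
            fn(p)
        seq = time.perf_counter() - t0
        with Pool(4) as pool:
            t0 = time.perf_counter()
            pool.map(fn, data, chunksize=5_000)
            par = time.perf_counter() - t0
        print(f"{fn.__name__}: sequential {seq:.3f}s, parallel {par:.3f}s")
For the cheap function the parallel run is typically slower (the data has to be shipped to the workers); for the expensive one it wins.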
I would suggest studying and using the scala.testing.Benchmark trait to benchmark snippets of code. You have to take JIT, GC and other things into account when benchmarking on the JVM - see this paper. In short, you have to do each of the runs in the separate JVM after doing a couple of warm-up runs.
Also, note that the (x zip y) part does not occur in parallel, because x and y are not yet parallel - the zip is done sequentially. Next, I would suggest turning x and y into arrays (toArray) and then calling par - this will ensure that the benchmark uses parallel arrays, rather than parallel vectors (which are slower for transformer methods such as zip and map).