Kafka Streams - Filter messages that appear frequently in a time window - apache-kafka

I am trying to filter for any messages whose key appears more often than a threshold N in a given (hopping) time window of length T.
For example, in the following stream:
#time, key
0, A
1, B
2, A
3, C
4, D
5, A
6, B
7, C
8, C
9, D
10, A
11, D
12, D
13, D
14, D
15, D
and N=2 and T=3, the outcome should be
0, A
2, A
7, C
8, C
9, D
11, D
12, D
13, D
14, D
15, D
Alternatively, if the above is not possible, a simplification would be only to filter for the messages after the threshold has been met:
#time, key
2, A
8, C
11, D
12, D
13, D
14, D
15, D
Is this possible with Kafka Streams?
So far I have tried to create a windowed count (instance of KTable) of the stream and join it back to the original stream. I change the key of the windowed count back to the original key using KTable#toStream((k,v) -> k.key()) and perform a dummy aggregation to get back an instance of KTable. This seems to introduce a delay, which causes the leftJoin to miss messages that arrive very shortly after the threshold is exceeded.
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
KStream<String, Long> wcount = source.groupByKey()
.count(TimeWindows.of(TimeUnit.SECONDS.toMillis(5)),"Counts")
.toStream((k,v) -> k.key());
// perform dummy aggregation to get KTable
KTable<String, Long> wcountTable = wcount.groupByKey(stringSerde, longSerde)
.reduce((aggValue, newValue) -> newValue,
"dummy-aggregation-store");
// left join and filter with threshold N=1
source.leftJoin(wcountTable, (leftValue, rightValue) -> rightValue,stringSerde, stringSerde )
.filter((k,v) -> v!=null)
.filter((k,v) -> v>1)
.print("output");
I have also tried to perform a KStream-KStream join with an appropriate window (leaving out the dummy aggregation):
source.join(wcount, (leftValue, rightValue) -> rightValue, JoinWindows.of(TimeUnit.SECONDS.toMillis(5)),stringSerde, stringSerde, longSerde)
.filter((k,v) -> v!=null)
.filter((k,v) -> v>1)
.print("output");
This results in duplicate outputs since each UPSERT into wcount triggers an event.

This is certainly possible. You can apply a windowed aggregation that collects all raw data in a list (i.e., you manually materialize the window). Afterwards, you apply a flatMap that evaluates the window. If the threshold is not met yet, you emit nothing. If the threshold is met for the first time, you emit all buffered data. For every later call of flatMap with a count larger than the threshold, you emit only the latest element in the list (all earlier ones were already emitted by previous calls to flatMap, so only the newly added one is missing).
Note: you need to disable the KTable cache, i.e., set the config parameter "cache.max.bytes.buffering" = 0. Otherwise, the algorithm won't work correctly.
Something like this:
KStream<Windowed<K>, List<V>> windows = stream.groupByKey()
.aggregate(
/*init with empty list*/,
/*add value to list in agg*/,
TimeWindows.of()...),
...)
.toStream();
KStream<K,V> thresholdMetStream = windows.flatMap(
/* if List#size < threshold
then return empty-list, ie, nothing
elseif List#size == threshold
then return whole list
else [List#size > threshold]
then return last element from list
*/);
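The emit rule described above can be simulated outside Kafka Streams. The following is a minimal Python sketch, not Kafka Streams code: the function names `evaluate_window` and `filter_frequent` are made up for illustration, and it assumes tumbling (non-overlapping) windows rather than hopping ones to keep the bookkeeping short.

```python
from collections import defaultdict


def evaluate_window(buffered, threshold):
    """Decide what to emit right after one record was appended to a window buffer."""
    if len(buffered) < threshold:
        return []                 # threshold not met yet: emit nothing
    if len(buffered) == threshold:
        return list(buffered)     # threshold met for the first time: emit everything
    return [buffered[-1]]         # already above threshold: emit only the newest record


def filter_frequent(records, threshold, window_size):
    """Simulate the windowed aggregation plus flatMap over (time, key) records.

    Assumption: tumbling windows, i.e. a record belongs to window time // window_size.
    """
    windows = defaultdict(list)   # (key, window index) -> buffered records
    emitted = []
    for time, key in records:
        bucket = (key, time // window_size)
        windows[bucket].append((time, key))
        emitted.extend(evaluate_window(windows[bucket], threshold))
    return emitted


if __name__ == "__main__":
    stream = [(0, "A"), (1, "B"), (2, "A"), (3, "C"), (4, "D"), (5, "A")]
    print(filter_frequent(stream, threshold=2, window_size=3))
```

Because every appended record triggers one evaluation, this only works if each update reaches the flatMap step, which is why the cache has to be disabled in the real topology.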

AFAIK this is the perfect fit for the Count-Min-Sketch algorithm. See for example the stream-lib implementation:
https://github.com/addthis/stream-lib


Scala Spark : Difference in the results returned by df.stat.sampleBy()

I saw many questions posted on stratified sampling, but none of them answered my question, so I am asking it as a new post.
I have noticed a difference in the results returned by the Spark API sampleBy(). It is not significant for small DataFrames but is more noticeable for larger ones (>1000 rows).
sample code:
val inputRDD:RDD[(Any,Row)] =df.rdd.keyBy(x=> x.get(0))
val keyCount = inputRDD.countByKey()
val sampleFractions = keyCount.map(x => (x._1, (x._2.toDouble * sampleSize) / (totalCount * 100))).toMap
val sampleDF = df.stat.sampleBy(cols(0),fractions = sampleFractions,seed = 11L)
total dataframe count:200
Keys count:
A:16
B:91
C:54
D:39
fractions : Map(A -> 0.08, B -> 0.455, C -> 0.27, D -> 0.195)
I get only 69 rows as output from df.stat.sampleBy(), even though I specified an expected sample size of 100 (passed to the Spark API as fractions, of course).
Thanks
sampleBy doesn't guarantee that you'll get the exact fraction of rows. It includes each record independently with a probability equal to the fraction given for its key. The resulting count therefore varies from run to run (or seed to seed), and there is nothing unusual about it.
With these fractions the expected result is A -> 16 * 0.08, B -> 91 * 0.455, C -> 54 * 0.27, D -> 39 * 0.195 = (1.28 rows + 41.405 rows + 14.58 rows + 7.605 rows), which makes around 65 rows, not 100.
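The arithmetic can be checked with a few lines of Python. This is only a sketch of the expectation under the assumption that every row is kept independently with its key's fraction; `expected_sample_size` is a hypothetical helper, not part of Spark.

```python
import math


def expected_sample_size(key_counts, fractions):
    """Mean and standard deviation of the sample size under per-row Bernoulli sampling."""
    mean = sum(n * fractions[k] for k, n in key_counts.items())
    variance = sum(n * fractions[k] * (1 - fractions[k]) for k, n in key_counts.items())
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    key_counts = {"A": 16, "B": 91, "C": 54, "D": 39}
    fractions = {"A": 0.08, "B": 0.455, "C": 0.27, "D": 0.195}
    mean, std = expected_sample_size(key_counts, fractions)
    print(round(mean, 2), round(std, 2))
```

The mean comes out at about 64.9 rows with a standard deviation of about 6.4, so an observed 69 rows is well within the normal spread. To target roughly 100 of the 200 rows, each fraction would have to be 0.5.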

Torch: back-propagation from loss computed over a subset of the output

I have a simple convolutional neural network whose output is a single-channel 4x4 feature map. During training, the (regression) loss needs to be computed on only one of the 16 output values. The location of this value is decided after the forward pass. How do I compute the loss from just this one output while making sure all irrelevant gradients are zeroed out during back-prop?
Let's say I have the following simple model in torch:
require 'nn'
-- the input
local batch_sz = 2
local x = torch.Tensor(batch_sz, 3, 100, 100):uniform(-1,1)
-- the model
local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 128, 9, 9, 9, 9, 1, 1))
net:add(nn.SpatialConvolution(128, 1, 3, 3, 3, 3, 1, 1))
net:add(nn.Squeeze(1, 3))
print(net)
-- the loss (don't know how to employ it yet)
local loss = nn.SmoothL1Criterion()
-- forward'ing x through the network would result in a 2x4x4 output
y = net:forward(x)
print(y)
I have looked at nn.SelectTable and it seems like if I convert the output into tabular form I would be able to implement what I want?
This is my current solution. It works by splitting the output into a table, and then using nn.SelectTable():backward() to get the full gradient:
require 'nn'
-- the input
local batch_sz = 2
local x = torch.Tensor(batch_sz, 3, 100, 100):uniform(-1,1)
-- the model
local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 128, 9, 9, 9, 9, 1, 1))
net:add(nn.SpatialConvolution(128, 1, 3, 3, 3, 3, 1, 1))
net:add(nn.Squeeze(1, 3))
-- convert output into a table format
net:add(nn.View(1, -1)) -- vectorize
net:add(nn.SplitTable(1, 1)) -- split all outputs into table elements
print(net)
-- the loss
local loss = nn.SmoothL1Criterion()
-- forward'ing x through the network would result in a (2)x4x4 output
y = net:forward(x)
print(y)
-- returns the output table's index belonging to specific location
function get_sample_idx(feat_h, feat_w, smpl_idx, feat_r, feat_c)
local idx = (smpl_idx - 1) * feat_h * feat_w
return idx + feat_c + ((feat_r - 1) * feat_w)
end
-- I want to back-propagate the loss of this sample at this feature location
local smpl_idx = 2
local feat_r = 3
local feat_c = 4
-- get the actual index location in the output table (for a 4x4 output feature map)
local out_idx = get_sample_idx(4, 4, smpl_idx, feat_r, feat_c)
-- the (fake) ground-truth
local gt = torch.rand(1)
-- compute loss on the selected feature map location for the selected sample
local err = loss:forward(y[out_idx], gt)
-- compute loss gradient, as if there was only this one location
local dE_dy = loss:backward(y[out_idx], gt)
-- now convert into full loss gradient (zero'ing out irrelevant losses)
local full_dE_dy = nn.SelectTable(out_idx):backward(y, dE_dy)
-- do back-prop through the whole network
net:backward(x, full_dE_dy)
print("The full dE/dy")
print(table.unpack(full_dE_dy))
I would really appreciate it if somebody could point out a simpler or more efficient method.
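One way to picture what the full gradient has to look like: it has the same shape as the network output, is zero everywhere, and carries the loss derivative only at the selected location. The sketch below shows that idea in plain Python with nested lists rather than Torch tensors; `smooth_l1_grad` and `masked_gradient` are illustrative names, indices are 0-based, and the derivative assumes the usual smooth L1 definition with a switch point at 1.

```python
def smooth_l1_grad(pred, target):
    """Derivative of the smooth L1 loss w.r.t. pred (quadratic for |diff| < 1, linear otherwise)."""
    diff = pred - target
    if abs(diff) < 1.0:
        return diff
    return 1.0 if diff > 0 else -1.0


def masked_gradient(output, sample, row, col, target):
    """Return a gradient shaped like `output` that is zero except at one cell."""
    grad = [[[0.0 for _ in feat_row] for feat_row in fmap] for fmap in output]
    grad[sample][row][col] = smooth_l1_grad(output[sample][row][col], target)
    return grad


if __name__ == "__main__":
    # fake 2x4x4 network output
    y = [[[0.5] * 4 for _ in range(4)] for _ in range(2)]
    full_grad = masked_gradient(y, sample=1, row=2, col=3, target=0.2)
    print(full_grad[1][2])
```

The same idea should carry over to tensors: allocate a zero tensor shaped like the output, write the single loss gradient into it, and pass that to the network's backward step, which avoids the View/SplitTable detour.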

JAGS - hierarchical model comparison not jumping between models even with pseudopriors

I'm using the hierarchical modelling framework described by Kruschke to set up a comparison between two models in JAGS. The idea in this framework is to run and compare multiple versions of a model, by specifying each version as one level of a categorical variable. The posterior distribution of this categorical variable then can be interpreted as the relative probability of the various models.
In the code below, I'm comparing two models. The models are identical in form. Each has a single parameter that needs to be estimated, mE. As can be seen, the models differ in their priors. Both priors are beta distributions with a mode of 0.5. However, the prior distribution for model 2 is much more concentrated. Note also that I've used pseudo priors, which I had hoped would keep the chains from getting stuck on one of the models. But the model seems to get stuck anyway.
Here is the model:
model {
m ~ dcat( mPriorProb[] )
mPriorProb[1] <- .5
mPriorProb[2] <- .5
omegaM1[1] <- 0.5 #true prior
omegaM1[2] <- 0.5 #pseudo prior
kappaM1[1] <- 3 #true prior for Model 1
kappaM1[2] <- 5 #pseudo prior for Model 1
omegaM2[1] <- 0.5 #pseudo prior
omegaM2[2] <- 0.5 #true prior
kappaM2[1] <- 5 #pseudo prior for Model 2
kappaM2[2] <- 10 #true prior for Model 2
for ( s in 1:Nsubj ) {
mE1[s] ~ dbeta(omegaM1[m]*(kappaM1[m]-2)+1 , (1-omegaM1[m])*(kappaM1[m]-2)+1 )
mE2[s] ~ dbeta(omegaM2[m]*(kappaM2[m]-2)+1 , (1-omegaM2[m])*(kappaM2[m]-2)+1 )
mE[s] <- equals(m,1)*mE1[s] + equals(m,2)*mE2[s]
z[s] ~ dbin( mE[s] , N[s] )
}
}
Here is R code for the relevant data:
dataList = list(
z = c(81, 59, 36, 18, 28, 59, 63, 57, 42, 28, 47, 55, 38,
30, 22, 32, 31, 30, 32, 33, 32, 26, 13, 33, 30),
N = rep(96, 25),
Nsubj = 25
)
When I run this model, the MCMC spends every single iteration at m = 1, and never jumps over to m = 2. I've tried lots of different combinations of priors and pseudo priors, and can't seem to find a combination in which the MCMC will consider m = 2. I've even tried specifying identical priors and pseudo priors for models 1 and 2, but this was no help. In this situation, I would expect the MCMC to jump fairly frequently between models, spending about half the time considering one model, and half the time considering the other. However, JAGS still spent the whole time at m = 1. I've run chains as long as 6000 iterations, which should be more than long enough for a simple model like this.
I would really appreciate if anyone has any thoughts on how to resolve this issue.
Cheers,
Tim
I haven't been able to figure this out, but I thought that anybody else who works on this might appreciate the following code, which will reproduce the problem start-to-finish from R with rjags (must have JAGS installed).
Note that since there are only two competing models in this example, I changed m ~ dcat() to m ~ dbern(), and then replaced m with m+1 everywhere else in the code. I hoped this might ameliorate the behavior, but it did not. Note also that if we specify the initial value for m, it stays stuck at that value regardless of which value we pick, so m just fails to get updated properly (instead of getting weirdly attracted to one model or the other). A head-scratcher for me; could be worth posting for Martyn's eyes at http://sourceforge.net/p/mcmc-jags/discussion/
library(rjags)
load.module('glm')
dataList = list(
z = c(81, 59, 36, 18, 28, 59, 63, 57, 42, 28, 47, 55, 38,
30, 22, 32, 31, 30, 32, 33, 32, 26, 13, 33, 30),
N = rep(96, 25),
Nsubj = 25
)
sink("mymodel.txt")
cat("model {
m ~ dbern(.5)
omegaM1[1] <- 0.5 #true prior
omegaM1[2] <- 0.5 #pseudo prior
kappaM1[1] <- 3 #true prior for Model 1
kappaM1[2] <- 5 #pseudo prior for Model 1
omegaM2[1] <- 0.5 #pseudo prior
omegaM2[2] <- 0.5 #true prior
kappaM2[1] <- 5 #pseudo prior for Model 2
kappaM2[2] <- 10 #true prior for Model 2
for ( s in 1:Nsubj ) {
mE1[s] ~ dbeta(omegaM1[m+1]*(kappaM1[m+1]-2)+1 , (1-omegaM1[m+1])*(kappaM1[m+1]-2)+1 )
mE2[s] ~ dbeta(omegaM2[m+1]*(kappaM2[m+1]-2)+1 , (1-omegaM2[m+1])*(kappaM2[m+1]-2)+1 )
z[s] ~ dbin( (1-m)*mE1[s] + m*mE2[s] , N[s] )
}
}
", fill=TRUE)
sink()
inits <- function(){list(m=0)}
params <- c("m")
nc <- 1
n.adapt <-100
n.burn <- 200
n.iter <- 5000
thin <- 1
mymodel <- jags.model('mymodel.txt', data = dataList, inits=inits, n.chains=nc, n.adapt=n.adapt)
update(mymodel, n.burn)
mymodel_samples <- coda.samples(mymodel,params,n.iter=n.iter, thin=thin)
summary(mymodel_samples)
The trick is not assigning a fixed probability for the model, but rather estimating it (phi below) based on a uniform prior. You then want the posterior distribution for phi as that tells you the probability of selecting model 2 (ie, a "success" means m=1; Pr(model 1) = 1-phi).
sink("mymodel.txt")
cat("model {
m ~ dbern(phi)
phi ~ dunif(0,1)
omegaM1[1] <- 0.5 #true prior
omegaM1[2] <- 0.5 #pseudo prior
kappaM1[1] <- 3 #true prior for Model 1
kappaM1[2] <- 5 #pseudo prior for Model 1
omegaM2[1] <- 0.5 #pseudo prior
omegaM2[2] <- 0.5 #true prior
kappaM2[1] <- 5 #pseudo prior for Model 2
kappaM2[2] <- 10 #true prior for Model 2
for ( s in 1:Nsubj ) {
mE1[s] ~ dbeta(omegaM1[m+1]*(kappaM1[m+1]-2)+1 , (1-omegaM1[m+1])*(kappaM1[m+1]-2)+1 )
mE2[s] ~ dbeta(omegaM2[m+1]*(kappaM2[m+1]-2)+1 , (1-omegaM2[m+1])*(kappaM2[m+1]-2)+1 )
z[s] ~ dbin( (1-m)*mE1[s] + m*mE2[s] , N[s] )
}
}
", fill=TRUE)
sink()
inits <- function(){list(m=0)}
params <- c("phi")
See my comment above on Mark S's answer.
This answer is to show by example why we want inference on m and not on phi.
Imagine we have a model given by
data <- c(-1, 0, 1, .5, .1)
m~dbern(phi)
data[i] ~ m*dnorm(0, 1) + (1-m)*dnorm(100, 1)
Now, it is obvious that the true value of m is 1. But what do we know about the true value of phi? Obviously higher values of phi are more likely, but we don't actually have good evidence to rule out lower values of phi. For example, phi=0.1 still has a 10% chance of yielding m=1; and phi=0.5 still has a 50% chance of yielding m=1. So we don't have good evidence against fairly low values of phi, even though we have ironclad evidence that m=1. We want inference on m.
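The point can be made numerically. Assuming a uniform prior on phi (as in phi ~ dunif(0,1)) and treating m = 1 as known for certain, the posterior density of phi is proportional to phi, since P(m = 1 | phi) = phi. The sketch below integrates that on a grid; the function name and grid size are arbitrary choices for illustration.

```python
def posterior_phi_given_m1(grid_size=100_000):
    """Posterior mean of phi and P(phi < 0.5) when m = 1 is known and phi ~ Uniform(0, 1)."""
    step = 1.0 / grid_size
    mids = [(i + 0.5) * step for i in range(grid_size)]   # midpoint grid on (0, 1)
    weights = mids                                        # prior (1) times likelihood (phi)
    total = sum(weights)
    mean = sum(p * w for p, w in zip(mids, weights)) / total
    below_half = sum(w for p, w in zip(mids, weights) if p < 0.5) / total
    return mean, below_half


if __name__ == "__main__":
    mean, below_half = posterior_phi_given_m1()
    print(round(mean, 4), round(below_half, 4))
```

Even with m = 1 known exactly, the posterior mean of phi is only about 2/3, and about a quarter of the posterior mass still lies below 0.5. The posterior of phi is therefore a blurred summary of what the data say about m.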

Merging two sorted lists, one with additional 0s

Consider the following problem:
We are given two arrays A and B such that A and B are sorted
except A has B.length additional 0s appended to its end. For instance, A and B could be the following:
A = [2, 4, 6, 7, 0, 0, 0]
B = [1, 7, 9]
Our goal is to create one sorted list by inserting each entry of B
into A in place. For instance, running the algorithm on the above
example would leave
A = [1, 2, 4, 6, 7, 7, 9]
Is there a clever way to do this in better than O(n^2) time? The only way I could think of is to insert each element of B into A by scanning linearly and performing the appropriate number of shifts, but this leads to the O(n^2) solution.
Some pseudo-code (sorta C-ish), assuming array indexing is 0-based:
pA = A + len(A) - 1;
pC = pA; // last element in A
while (! *pA) --pA; // find the last non-zero entry in A
pB = B + len(B) - 1;
while (pA >= A) && (pB >= B)
if *pA > *pB
*pC = *pA; --pA;
else
*pC = *pB; --pB;
--pC
while (pB >= B) // still some bits in B to copy over
*pC = *pB; --pB; --pC;
Not really tested, and just written off the top of my head, but it should give you the idea... May not have the termination and boundary conditions exactly right.
You can do it in O(n).
Work from the end, moving the largest element towards the end of A. This way you avoid a lot of trouble to do with where to keep the elements while iterating. This is pretty easy to implement:
int indexA = A.Length - B.Length - 1;
int indexB = B.Length - 1;
int insertAt = A.Length;
// once B is exhausted, the remaining prefix of A is already in place
while (indexB >= 0)
{
insertAt--;
if (indexA >= 0 && A[indexA] > B[indexB])
A[insertAt] = A[indexA--];
else
A[insertAt] = B[indexB--];
}
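For reference, here is the same back-to-front merge as a small runnable Python sketch. It assumes, as in the question, that the last len(b) slots of a are padding; the name `merge_in_place` is chosen for illustration.

```python
def merge_in_place(a, b):
    """Merge sorted list b into a, whose last len(b) slots are padding. Runs in O(len(a))."""
    i = len(a) - len(b) - 1   # last real element of a
    j = len(b) - 1            # last element of b
    k = len(a) - 1            # next slot to fill, moving right to left
    while j >= 0:             # when b is used up, the rest of a is already in place
        if i >= 0 and a[i] > b[j]:
            a[k] = a[i]
            i -= 1
        else:
            a[k] = b[j]
            j -= 1
        k -= 1
    return a


if __name__ == "__main__":
    print(merge_in_place([2, 4, 6, 7, 0, 0, 0], [1, 7, 9]))
```

Filling from the right never overwrites an unread element of a, because the write position k always stays at or ahead of the read position i.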

How to check if a number can be represented as a sum of some given numbers

I've got a list of some integers, e.g. [1, 2, 3, 4, 5, 10]
And I've another integer (N). For example, N = 19.
I want to check if my integer can be represented as a sum of any amount of numbers in my list:
19 = 10 + 5 + 4
or
19 = 10 + 4 + 3 + 2
Every number from the list can be used only once. N can raise up to 2 thousand or more. Size of the list can reach 200 integers.
Is there a good way to solve this problem?
Four and a half years later, this question was answered by Jonathan.
I want to post two implementations (bruteforce and Jonathan's) in Python and their performance comparison.
def check_sum_bruteforce(numbers, n):
    # This bruteforce approach can be improved (for some cases) by
    # returning True as soon as the needed sum is found;
    sums = []
    for number in numbers:
        for sum_ in sums[:]:
            sums.append(sum_ + number)
        sums.append(number)
    return n in sums
def check_sum_optimized(numbers, n):
    sums1, sums2 = [], []
    numbers1 = numbers[:len(numbers) // 2]
    numbers2 = numbers[len(numbers) // 2:]
    for sums, numbers_ in ((sums1, numbers1), (sums2, numbers2)):
        for number in numbers_:
            for sum_ in sums[:]:
                sums.append(sum_ + number)
            sums.append(number)
    # a sum built from only one half may already equal n
    if n in sums1 or n in sums2:
        return True
    for sum_ in sums1:
        if n - sum_ in sums2:
            return True
    return False
assert check_sum_bruteforce([1, 2, 3, 4, 5, 10], 19)
assert check_sum_optimized([1, 2, 3, 4, 5, 10], 19)
import timeit
print(
"Bruteforce approach (10000 times):",
timeit.timeit(
'check_sum_bruteforce([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 200)',
number=10000,
globals=globals()
)
)
print(
"Optimized approach by Jonathan (10000 times):",
timeit.timeit(
'check_sum_optimized([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 200)',
number=10000,
globals=globals()
)
)
Output (the float numbers are seconds):
Bruteforce approach (10000 times): 1.830944365834205
Optimized approach by Jonathan (10000 times): 0.34162875449254027
The brute force approach requires generating 2^(array_size)-1 subsets to be summed and compared against target N.
The run time can be dramatically improved by splitting the problem in two. Store, in sets, all of the possible sums for one half of the array and for the other half separately. The answer can then be determined by checking, for every number n in one set, whether the complement N-n exists in the other set (and whether N itself already appears in either set).
This optimization brings the number of generated subsets down to approximately: 2^(array_size/2)-1+2^(array_size/2)-1=2^(array_size/2 + 1)-2
That is roughly twice the square root of the original count, a far larger saving than a factor of two.
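To get a feel for the two counts, the formulas above can be evaluated directly. This is a small Python sketch; `subset_counts` is a hypothetical helper that also handles odd sizes by giving the extra element to the second half.

```python
def subset_counts(array_size):
    """Number of non-empty subsets enumerated by brute force vs. by the two-halves split."""
    brute = 2 ** array_size - 1
    first = array_size // 2
    second = array_size - first
    split = (2 ** first - 1) + (2 ** second - 1)
    return brute, split


if __name__ == "__main__":
    for size in (10, 20, 40):
        print(size, subset_counts(size))
```

For 10 elements this gives 1023 versus 62 subsets, and for 20 elements 1048575 versus 2046. For the 200-element lists mentioned in the question, however, even the split version would still have to enumerate 2^101 - 2 subsets.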
Here is a c++ implementation using this idea.
#include <bits/stdc++.h>
using namespace std;
bool sum_search(vector<int> myarray, int N) {
//values for splitting the array in two
int right=myarray.size()-1,middle=(myarray.size()-1)/2;
set<int> all_possible_sums1,all_possible_sums2;
//iterate over the first half of the array
for(int i=0;i<middle;i++) {
//buffer set that will hold new possible sums
set<int> buffer_set;
//every value currently in the set is used to make new possible sums
for(set<int>::iterator set_iterator=all_possible_sums1.begin();set_iterator!=all_possible_sums1.end();set_iterator++)
buffer_set.insert(myarray[i]+*set_iterator);
all_possible_sums1.insert(myarray[i]);
//transfer buffer into the main set
for(set<int>::iterator set_iterator=buffer_set.begin();set_iterator!=buffer_set.end();set_iterator++)
all_possible_sums1.insert(*set_iterator);
}
//iterator over the second half of the array
for(int i=middle;i<right+1;i++) {
set<int> buffer_set;
for(set<int>::iterator set_iterator=all_possible_sums2.begin();set_iterator!=all_possible_sums2.end();set_iterator++)
buffer_set.insert(myarray[i]+*set_iterator);
all_possible_sums2.insert(myarray[i]);
for(set<int>::iterator set_iterator=buffer_set.begin();set_iterator!=buffer_set.end();set_iterator++)
all_possible_sums2.insert(*set_iterator);
}
//a sum drawn from only one half may already equal N
if(all_possible_sums1.count(N)||all_possible_sums2.count(N))
return true;
//for every element in the first set, check if the second set has the complement to make N
for(set<int>::iterator set_iterator=all_possible_sums1.begin();set_iterator!=all_possible_sums1.end();set_iterator++)
if(all_possible_sums2.find(N-*set_iterator)!=all_possible_sums2.end())
return true;
return false;
}
Ugly and brute force approach:
a = [1, 2, 3, 4, 5, 10]
b = []
1.upto(a.size) do |c|
b << a.combination(c).select{|d| d.reduce(&:+) == 19 }
end
puts b.flatten(1).inspect