I have a scenario that could be described like this. Imagine there are 2 types of keys coming in: A and B (in reality, there are more). Let's say (different) records enter the KStream in the following order:
A1
A2
A3
B1
B2
A4
A5
In my scenario, the order of operations is important and I cannot process A4 before B1 or B2. However, I'd like to be able to process the records as batches. The way I see it, the best option to batch this input looks like this, i.e. reduce the input to 3 "batch" objects:
[A1 A2 A3]
[B1 B2]
[A4 A5]
I could then apply a function to each batch using foreach.
Complexity: the application is time-sensitive, and it is not acceptable to keep aggregating records until the key changes (and a new batch is required). In other words, if the time between A1 and A2 is more than some time t, a batch containing only A1 should be generated when t expires. The output would then look like this (assuming all other records entered the stream in close succession):
[A1]
[A2 A3]
[B1 B2]
[A4 A5]
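To pin down the desired semantics before reaching for Kafka Streams operators, here is a minimal plain-Python sketch (not Kafka Streams; the record tuples and tick timestamps are made up for illustration) that starts a new batch whenever the key changes or the gap to the previous record exceeds t:

```python
def batch(records, t):
    """Group consecutive (key, value, timestamp) records into batches.

    A new batch starts when the key changes or when the gap since the
    previous record exceeds t (the timeout described above).
    """
    batches = []
    current = []
    for key, value, ts in records:
        if current and (key != current_key or ts - last_ts > t):
            batches.append(current)
            current = []
        current.append(value)
        current_key, last_ts = key, ts
    if current:
        batches.append(current)
    return batches

# A1 arrives, then a gap longer than t=3 before A2; the rest arrive quickly
records = [("A", "A1", 0), ("A", "A2", 5), ("A", "A3", 6),
           ("B", "B1", 7), ("B", "B2", 8), ("A", "A4", 9), ("A", "A5", 10)]
print(batch(records, t=3))  # [['A1'], ['A2', 'A3'], ['B1', 'B2'], ['A4', 'A5']]
```

This reproduces the four batches above, but only as a single-threaded reference for the expected output, not as a streaming implementation.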
Question: How do I obtain a KStream with such batch objects while also taking time windows into account?
My initial solution (which probably doesn't work) for the final scenario (with delay between A1 and A2):
[KStream] incoming data with 2 possible keys: A or B, e.g. [ (A, A1), (A, A2), ...]
|
| selectKey(key + something to separate A1, A2, A3 from A4 and A5, because B1 and B2 are in between)
v
[KStream] e.g. [ (A-group1, A1), (A-group1, A2), ... , (B-group2, B1), ..., (A-group3, A4) ]
|
| groupBy(key) // So either A-group1, B-group2 or A-group3
v
[KGroupedStream] 3 different streams
|
| WindowedBy e.g. 1 second
v
[TimeWindowedKStream] (still 3 different streams I guess?)
|
| reduce() --> Make "batch" objects out of window, e.g. a batch object is [A2, A3]
v
[KTable] (no idea what this looks like; I guess one row per time window?)
|
| toStream()
v
[KStream] 1 stream with 4 entries like I described in the final scenario above
Could this work? Is this efficient? What are your thoughts?
I have a for-loop, which I am trying to parallelize for a GPU in MATLAB, where the current index of a vector depends on the previous indices.
A is an nx1 known vector
B is an nx1 output vector that is initialized to zeros.
The code is as follows:
for n = 1:numel(A)
    B(n+1) = B(n) + A(n)*B(n) + A(n)^k + B(n)^2;
end
I have looked at this similar question and tried to find a simple closed form for the recurrence relation, but couldn't find one.
I could do a prefix sum as mentioned in the first link over the A(n)^k term, but I was hoping there would be another method to speed up the loop.
Any advice is appreciated!
P.S. My real code involves 3D arrays that index and sum along 2D slices, but any help for the 1D case should transfer to a 3D scaling.
The word "parallelizing" sounds magical, but the scheduling rules still apply:
The problem is not the effort of converting a pure SEQ-process into its PAR re-representation, but the cost of doing so, if you indeed insist on going PAR at any cost.
m = numel(A); %{
+---+---+---+---+---+---+---+---+---+---+---+---+---+ .. +---+
const A[] := | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | | M |
+---+---+---+---+---+---+---+---+---+---+---+---+---+ .. +---+
:
\
\
\
\
\
\
\
\
\
\
\
+---+---+---+---+---+ .. + .. +---+---+---+---+---+ .. +---+
var B[] := | 0 | 0 | 0 | 0 | 0 | : | 0 | 0 | 0 | 0 | 0 | | 0 |
+---+---+---+---+---+ .. : .. +---+---+---+---+---+ .. +---+ }%
%% : ^ :
%% : | :
for n = 1:m %% : | :
B(n+1) =( %% ====:===+ : .STO NEXT n+1
%% : :
%% v :
B(n)^2 %% : { FMA B, B, .GET LAST n ( in SEQ :: OK, local data, ALWAYS )
+ B(n) %% v B } ( in PAR :: non-local data. CSP + bcast + many distributed-caches invalidates )
+ B(n) * A(n) %% { FMA B, A,
+ A(n)^k %% ApK}
);
end
Once the SEQ-process data dependency is recurrent (the LAST B(n-1) must be re-used for the assignment of the NEXT B(n)), any attempt to make such a SEQ calculation work in PAR has to introduce system-wide communication of the known values: a "new" value can be computed only after the respective "previous" B(n-1) has been evaluated and assigned through the purely serial SEQ chain of recurrent evaluation, i.e. not before all the previous cells have been processed serially, because the LAST piece is always needed for the NEXT step (ref. the "crossroads" in the for()-loop iterator dependency map). Given this, all the rest have to wait in a "queue" before they can perform their two primitive .FMA-s plus the .STO of the result for the next one in the recurrence-dictated "queue".
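The dependency chain is easy to see in a small plain-Python sketch of the same recurrence (k is chosen arbitrarily here): every step reads the value the previous step just wrote, so there is nothing independent to hand out to parallel workers.

```python
def recurrence(A, k=2):
    """Sequential evaluation of B(n+1) = B(n) + A(n)*B(n) + A(n)^k + B(n)^2.

    B[n] used on the right-hand side was only produced by iteration n-1,
    so the iterations cannot run independently of each other.
    """
    B = [0.0] * (len(A) + 1)
    for n in range(len(A)):
        B[n + 1] = B[n] + A[n] * B[n] + A[n] ** k + B[n] ** 2
    return B

print(recurrence([1.0, 2.0, 3.0]))  # [0.0, 1.0, 8.0, 105.0]
```

Because of the B(n)^2 term the recurrence is non-linear, which is also why no simple prefix-sum (scan) reformulation applies to the whole loop.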
Yes, one can "enforce" the formula to become PAR-executed, but the very cost of communicating each LAST value "across" the PAR-execution fabric (towards the NEXT) is typically prohibitive: in terms of resources and accrued delays, it either damages the SIMT-optimised scheduler's latency masking, or it blocks all the threads until they receive the "neighbour"-assigned LAST value that they rely on and cannot proceed without. Either effect devastates any potential benefit from all the effort invested into going PAR.
Even just a pair of FMA-s is not enough code to justify the add-on costs; it is indeed an extremely small amount of work for all the PAR effort.
Unless some mathematically very "dense" processing is in place, the additional costs are not easily amortized, and such an attempt to introduce a PAR mode of computing exhibits nothing but a negative (adverse) effect instead of the wished-for speedup. In all professional cases, one ought to account for all the add-on costs during the Proof-of-Concept phase (PoC), before deciding whether any feasible PAR approach is possible at all, and how to achieve a speedup of >> 1.0x.
Relying on advertised theoretical GFLOPS and TFLOPS figures is nonsense. Your actual GPU kernel will never repeat the advertised benchmarks' performance (unless you run exactly the same optimised layout and code, which you do not need, do you?). One typically needs to implement one's own specific algorithmisation, related to one's own problem domain, without artificially aligning all the toy-problem elements so that the GPU silicon never has to wait for real data and can enjoy tweaked cache/register-based ILP artifacts that are practically unachievable in most real-world problem solutions. If there is one step to recommend: always evaluate an overhead-fair PoC first, to see whether any chance of a speedup exists at all, before sinking resources, time, and money into prototyping, detailed design, and testing.
Recurrent and compute-weak GPU kernel payloads will, in almost every case, struggle just to amortize their additional overhead times (the bidirectional data transfers (H2D + D2H) plus the kernel-code-related loads).
I started to learn q/KDB a while ago, so forgive me in advance for the trivial question, but I am facing the following problem and don't know how to solve it.
I have a table named "res" showing the side, the summation of orders, and the average price of some symbols:
sym side | sum_order avg_price
----------| -------------------
ALPHA B | 95109 9849.73
ALPHA S | 91662 9849.964
BETA B | 47 9851.638
BETA S | 60 9853.383
with these types
c | t f a
---------| -----
sym | s p
side | s
sum_order| f
avg_price| f
I would like to calculate the close and open positions, the average points made by the close position, and the average price of the open position.
I have used this query, which I believe is pretty bizarre (I am sure there is a more professional way to do it), but it works as expected:
position_summary:select
close_position:?[prev[sum_order]>sum_order;sum_order;prev[sum_order]],
average_price:avg_price-prev[avg_price],
open_pos:prev[sum_order]-sum_order,
open_wavgprice:?[sum_order>next[sum_order];avg_price;next[avg_price]][0]
by sym from res
giving me the following table
sym | close_position average_price open_pos open_wavgprice
----------| ----------------------------------------------------
ALPHA | 91662 0.2342456 3447 9849.73
BETA | 47 1.745035 -13 9853.38
and types are
c | t f a
--------------| -----
sym | s s
close_position| F
average_price | F
open_pos | F
open_wavgprice| f
Now my problem starts here. Imagine I join the position_summary table with another table, appending another column "current_price" of type f.
What I want to do is to determine the points of the open positions.
I have tried this way:
select
?[open_pos>0;current_price-open_wavgprice;open_wavgprice-current_price]
from position_summary
but I got a 'type error,
surely because sum_order is type F while open_wavgprice and current_price are f. I have searched on the internet but did not find much about the F type.
First: how can I handle this? I have tried a cast and raze, but with no effect, and moreover I am not sure they are right for this particular occasion.
Second: is there a better way to use "if-then" logic when querying tables (for example, in plain English: if this row of this column, then take the previous/next of another column, or the second or third previous/next of another column)?
Thank you for your help
Let me rephrase your question using a slightly simpler table:
q)show res:([sym:`A`A`B`B;side:`B`S`B`S]size:95 91 47 60;price:49.7 49.9 51.6 53.3)
sym side| size price
--------| ----------
A B | 95 49.7
A S | 91 49.9
B B | 47 51.6
B S | 60 53.3
You are trying to find the closing position for each symbol using a query like this:
q)show summary:select close:?[prev[size]>size;size;prev[size]] by sym from res
sym| close
---| -----
A | 91
B | 47
The result seems to have one number in each row of the "close" column, but in fact it has two. You may notice an extra space before each number in the display above, or you can display the first row:
q)first 0!summary
sym | `A
close| 0N 91
and see that the first row of the "close" column is 0N 91. Since a missing value such as 0N is displayed as a space, it was hard to see it in the earlier display.
It is not hard to understand how you got these two values. Since you select by sym, each column gets grouped by symbol, and for the symbol A you have
q)show size:95 91
95 91
and
q)prev size
0N 95
that leads to
q)?[prev[size]>size;size;prev[size]]
0N 91
(Recall that 0N is smaller than any other integer.)
As a side note, ?[a>b;b;a] is element-wise minimum and can be written as a & b in q, so your conditional expression could be written as
q)size & prev size
0N 91
Now we can see why ? gave you the type error
q)close:exec close from summary
q)close
91
47
While the display is deceiving, "close" above is a list of two vectors:
q)first close
0N 91
and
q)last close
0N 47
The vector conditional does not support that:
q)?[close>0;10;20]
'type
[0] ?[close>0;10;20]
^
One can probably cure that by using each:
q)?[;10;20]each close>0
20 10
20 10
But I don't think this is what you want. Your problem started when you computed the summary table. I would expect the closing position to be the sum of "B" orders minus the sum of "S" orders that can be computed as
q)select close:sum ?[side=`B;size;neg size] by sym from res
sym| close
---| -----
A | 4
B | -13
Now you should be able to fix the rest of the columns in your summary query. Just make sure that you use an aggregation function such as sum in the expression for every column.
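As a cross-check of that last query, the same "sum of signed sizes per symbol" logic can be written in plain Python (not q; using the toy data from the simplified table above):

```python
# (sym, side, size) rows from the simplified res table
rows = [("A", "B", 95), ("A", "S", 91), ("B", "B", 47), ("B", "S", 60)]

close = {}
for sym, side, size in rows:
    # buys add to the position and sells subtract,
    # mirroring q's sum ?[side=`B;size;neg size]
    close[sym] = close.get(sym, 0) + (size if side == "B" else -size)

print(close)  # {'A': 4, 'B': -13}
```

This reproduces the close values 4 and -13 shown in the q output above.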
Type F means the "cell" in the column contains a vector of floats rather than an atom. So your column is actually a vector of vectors rather than a flat vector.
In your case you have a vector of size 1 in each cell, so you could just do:
select first each close_position, first each average_price.....
which will give you a type f.
I'm not 100% sure what you were trying to do in the first query, and I don't have a q terminal to hand to check, but you could put this into your query:
select close_position:?[prev[sum_order]>sum_order;last sum_order; last prev[sum_order].....
i.e. get the last sum_order in the list.
I'm trying to obtain the following observable (with a buffer capacity of 10 ticks):
Time 0 5 10 15 20 25 30 35 40
|----|----|----|----|----|----|----|----|
Source A B C D E F G H
Result A E H
B F
C G
D
Phase |<------->|-------|<------->|<------->|
B I B B
That is, the behavior is very similar to the Buffer observable, with the difference that the buffering phase does not occur in precise time slots but starts at the first element pushed during the idle phase. I mean, in the example above the buffering phases start with the 'A', 'E', and 'H' symbols.
Is there a way to compose the observable or do I have to implement it from scratch?
Any help will be appreciated.
Try this:
IObservable<T> source = ...;
IScheduler scheduler = ...;
IObservable<IList<T>> query = source
.Publish(obs => obs
.Buffer(() => obs.Take(1).IgnoreElements()
.Concat(Observable.Return(default(T)).Delay(duration, scheduler))
.Amb(obs.IgnoreElements())));
The buffer closing selector is called once at the start and then once whenever a buffer closes. The selector says "The buffer being started now should be closed duration after the first element of this buffer, or when the source completes, whichever occurs first."
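A quick way to sanity-check the intended timing is a plain-Python simulation (not Rx; the tick timestamps below are made up to match the marble diagram): a buffer opens at the first element that arrives while idle and closes duration ticks later.

```python
def buffer_from_first(events, duration):
    """events: list of (timestamp, value); returns the list of buffers.

    A buffer opens when an element arrives while no buffer is open,
    and it closes `duration` ticks after it opened.
    """
    buffers = []
    close_at = None
    for ts, value in sorted(events):
        if close_at is None or ts >= close_at:
            buffers.append([])          # idle: this element opens a buffer
            close_at = ts + duration
        buffers[-1].append(value)
    return buffers

# Roughly the marble diagram above, with a 10-tick buffer capacity
events = [(2, "A"), (5, "B"), (7, "C"), (10, "D"),
          (14, "E"), (17, "F"), (20, "G"), (31, "H")]
print(buffer_from_first(events, 10))
# [['A', 'B', 'C', 'D'], ['E', 'F', 'G'], ['H']]
```

This matches the A-D / E-G / H grouping in the question's diagram and the "closed duration after the first element" wording above.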
Edit: Based on your comments, if you want to make multiple subscriptions to query share a single subscription to source, you can do that by appending .Publish().RefCount() to the query.
IObservable<IList<T>> query = source
.Publish(obs => obs
.Buffer(() => obs.Take(1).IgnoreElements()
.Concat(Observable.Return(default(T)).Delay(duration, scheduler))
.Amb(obs.IgnoreElements())))
.Publish()
.RefCount();
I've got a Spark program that essentially does this:
def foo(a: RDD[...], b: RDD[...]) = {
val c = a.map(...)
c.persist(StorageLevel.MEMORY_ONLY_SER)
var current = b
for (_ <- 1 to 10) {
val next = some_other_rdd_ops(c, current)
next.persist(StorageLevel.MEMORY_ONLY)
current.unpersist()
current = next
}
current.saveAsTextFile(...)
}
The strange behavior I'm seeing is that the Spark stages corresponding to val c = a.map(...) are happening 10 times. I would have expected them to happen only once because of the immediate caching on the next line, but that's not the case. When I look at the "Storage" tab of the running job, very few of the partitions of c are cached.
Also, 10 copies of that stage immediately show as "active". 10 copies of the stage corresponding to val next = some_other_rdd_ops(c, current) show up as pending, and they roughly alternate execution.
Am I misunderstanding how to get Spark to cache RDDs?
Edit: here is a gist containing a program to reproduce this: https://gist.github.com/jfkelley/f407c7750a086cdb059c. It expects as input the edge list of a graph (with edge weights). For example:
a b 1000.0
a c 1000.0
b c 1000.0
d e 1000.0
d f 1000.0
e f 1000.0
g h 1000.0
h i 1000.0
g i 1000.0
d g 400.0
Lines 31-42 of the gist correspond to the simplified version above. I get 10 stages corresponding to line 31 when I would only expect 1.
The problem here is that calling cache is lazy. Nothing will be cached until an action is triggered and the RDD is evaluated. All the call does is set a flag in the RDD to indicate that it should be cached when evaluated.
Unpersist, however, takes effect immediately. It clears the flag indicating that the RDD should be cached and also begins a purge of data from the cache. Since you only have a single action at the end of your application, this means that by the time any of the RDDs are evaluated, Spark does not see that any of them should be persisted!
I agree that this is surprising behaviour. The way that some Spark libraries (including the PageRank implementation in GraphX) work around this is by explicitly materializing each RDD between the calls to cache and unpersist. For example, in your case you could do the following:
def foo(a: RDD[...], b: RDD[...]) = {
val c = a.map(...)
c.persist(StorageLevel.MEMORY_ONLY_SER)
var current = b
for (_ <- 1 to 10) {
val next = some_other_rdd_ops(c, current)
next.persist(StorageLevel.MEMORY_ONLY)
next.foreachPartition(x => {}) // materialize before unpersisting
current.unpersist()
current = next
}
current.saveAsTextFile(...)
}
Caching doesn't reduce the number of stages; it just means a cached stage won't be recomputed every time.
In the first iteration, you can see in the stage's "Input Size" that the data is coming from Hadoop and that it reads shuffle input. In subsequent iterations, the data comes from memory with no more shuffle input, and the execution time is vastly reduced.
New map stages are created whenever shuffles have to be written, for example when there's a change in partitioning, in your case adding a key to the RDD.
If I make a two way summary statistics table in Stata using table, can I add another column that is the difference of two other columns?
Say that I have three variables (a, b, c). I generate quintiles on a and b then generate a two-way table of means of c in each quintile-quintile intersection. I would like to generate a sixth column that is the difference of mean c between the top and bottom quintiles of b for each quintile of a.
I can generate the table of mean c for each quintile-quintile intersection, but I can't figure out the difference column.
* generate data
clear
set obs 2000
generate a = rnormal()
generate b = rnormal()
generate c = rnormal()
* generate quantiles for a and b
xtile a_q = a, nquantiles(5)
xtile b_q = b, nquantiles(5)
* calculate the means of each quintile intersection
table a_q b_q, c(mean c)
* if I want the top and bottom b quantiles
table a_q b_q if b_q == 1 | b_q == 5, c(mean c)
Update: Here's an example of what I would like to do.
With the collapse command you can create customized tables like the one you have in mind.
preserve
collapse (mean) c, by(a_q b_q)
keep if inlist(b_q, 1, 5)
reshape wide c, i(a_q) j(b_q)
gen c5_c1 = c5 - c1
set obs `=_N + 1'
replace c1 = c1[`=_N - 1'] - c1[1] if mi(a_q)
replace c5 = c5[`=_N - 1'] - c5[1] if mi(a_q)
replace c5_c1 = c5_c1[`=_N - 1'] - c5_c1[1] if mi(a_q)
list, sep(0) noobs
restore
Then you should obtain something like this in your output:
+-----------------------------------------+
| a_q c1 c5 c5_c1 |
|-----------------------------------------|
| 1 .2092651 .1837719 -.0254932 |
| 2 .0256483 -.0118134 -.0374617 |
| 3 .022957 .0586441 .0356871 |
| 4 .0431809 .0876745 .0444935 |
| 5 -.0859874 .0199202 .1059076 |
| . -.2952525 -.1638517 .1314008 |
+-----------------------------------------+
If you are not very familiar with Stata, the following help pages might be useful in understanding the code:
help _variables
help subscripting
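For readers more comfortable outside Stata, the same collapse-and-difference logic can be sketched in plain Python (the toy data below is hypothetical, standing in for the 2000 random observations):

```python
# rows of (a_q, b_q, c): toy data with two a_q quintiles
rows = [(1, 1, 0.2), (1, 1, 0.4), (1, 5, 0.1),
        (2, 1, 0.0), (2, 5, 0.3), (2, 5, 0.5)]

# collapse (mean) c, by(a_q b_q) -- restricted to b_q 1 and 5
sums = {}
for a_q, b_q, c in rows:
    if b_q in (1, 5):
        total, count = sums.get((a_q, b_q), (0.0, 0))
        sums[(a_q, b_q)] = (total + c, count + 1)
means = {key: total / count for key, (total, count) in sums.items()}

# one row per a_q: mean c for b_q==1, mean c for b_q==5, and the difference,
# analogous to the reshaped c1, c5, and c5_c1 columns above
for a_q in sorted({key[0] for key in means}):
    c1, c5 = means[(a_q, 1)], means[(a_q, 5)]
    print(a_q, round(c1, 4), round(c5, 4), round(c5 - c1, 4))
```

The per-a_q row printed here corresponds to one row of the listed Stata output (without the extra bottom row of top-minus-bottom differences).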