I am using the match_recognize syntax for CEP querying with Esper. I noticed that after matching some events, it ignores them for future matches. For example, using the following simple pattern:
select * from Event
match_recognize (
measures A as a, B as b, C as c
pattern (A B C)
)
it matches events number 1, 2, and 3 in the stream, and after that it matches events number 4, 5, and 6. But I want it to match events 1, 2, 3, then events 2, 3, 4, then 3, 4, 5, and so forth (of course I'll add more conditions later).
Is there some simple adjustment to this syntax that could do that?
Look at after match skip in the syntax (see the Esper documentation).
match_recognize (
...
after match skip to current row
pattern (...)
)
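Applied to the original statement, that would look something like this (untested sketch, simply combining the query from the question with the skip clause):

select * from Event
match_recognize (
  measures A as a, B as b, C as c
  after match skip to current row
  pattern (A B C)
)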
I have some complex code in which I'm using when to create a new column under certain conditions. Consider the following code:
df.select(
    '*',
    F.when((A) | (B) | (C), top_val['val']).alias('match'))
Let A, B, and C be my conditions. I want to put an order on these conditions, like this:
If A is satisfied, then don't check B and C.
If B is satisfied, then don't check C.
Is there any way to enforce this order?
As stated in this blog and quoted in this answer, I don't think you can guarantee the order of evaluation of an or expression.
Spark SQL (including SQL and the DataFrame and Dataset API) does not guarantee the order of evaluation of subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical AND and OR expressions do not have left-to-right “short-circuiting” semantics.
However, you can nest the when() inside .otherwise() to form a series like this and achieve what you want:
df.select(
    '*',
    F.when(A, top_val['val'])
     .otherwise(F.when(B, top_val['val'])
                 .otherwise(F.when(C, top_val['val'])))
     .alias('match'))
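As a side note, the same ordered evaluation can also be expressed by chaining when calls; the chain compiles to a single CASE WHEN expression whose branches are tested top to bottom (a sketch assuming the same A, B, C, and top_val as above):

from pyspark.sql import functions as F

df.select(
    '*',
    F.when(A, top_val['val'])   # tested first
     .when(B, top_val['val'])   # tested only if A was false
     .when(C, top_val['val'])   # tested only if A and B were false
     .alias('match'))           # null when no condition matches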
I'd like to replace the following ABAP OpenSQL snippet (in the where clause of a much bigger statement) with an equivalent join.
... AND tf~tarifart = ( SELECT MAX( tf2~tarifart ) FROM ertfnd AS tf2 WHERE tf2~tariftyp = e1~tariftyp AND tf2~bis >= e1~bis AND tf2~ab <= e1~ab ) ...
My motivation: query migration to ABAP CDS views (basically plain SQL with somewhat reduced expressiveness in comparison). Alas, correlated subqueries and EXISTS statements are not supported.
I googled a bit and found a possible solution (last post) here https://archive.sap.com/discussions/thread/3824523
However, the proposal
Selecting MAX(value)
Your scenario using inner join to first CDS view
doesn't work in my case.
tf.bis (and tf.ab) need to be in the selection list of the new view to limit the right-hand side of the join (the new view) to the correct time frames.
Alas, there could be multiple (non-overlapping) sub time frames (contained within [tf.ab, tf.bis]) with the same tf.tarifart.
Since these can't be grouped together, this results in multiple rows on the right-hand side.
The original query does not have a problem with that (no join -> no Cartesian product).
I hope the following fiddle (working example) clears things up a bit: http://sqlfiddle.com/#!9/8d1f48/3
Given these constraints, to me it seems that an equivalent join is indeed impossible. Suggestions or even confirmations?
select doc_belzart,
       doc_tariftyp,
       doc_ab,
       doc_bis,
       max(tar_tarifart)
from
(
  select document.belzart  as doc_belzart,
         document.tariftyp as doc_tariftyp,
         document.ab       as doc_ab,
         document.bis      as doc_bis,
         tariff.tarifart   as tar_tarifart,
         tariff.tariftyp   as tar_tariftyp,
         tariff.ab         as tar_ab,
         tariff.bis        as tar_bis
  from dberchz1 as document
  inner join ertfnd as tariff
    on  tariff.tariftyp = document.tariftyp
    and tariff.ab      <= document.ab
    and tariff.bis     >= document.bis
) as max_tariff
group by doc_belzart,
         doc_tariftyp,
         doc_ab,
         doc_bis
Translated into English, you seem to want to determine the maximum applicable tariff for a set of documents.
I'd refactor this into separate steps:
Determine all applicable tariffs, meaning all tariffs that completely cover the document's time interval. This will become your first CDS view, and in my answer forms the sub-query.
Determine for all documents the max applicable tariff. This will form your second CDS view, and in my answer forms the outer query. This one has the MAX / GROUP BY to reduce the result set to one row per document. A rough CDS sketch of both views follows.
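Sketched as CDS DDL, the two views could look roughly like this (untested; view names and annotations are placeholders, and the exact syntax depends on your release):

@AbapCatalog.sqlViewName: 'ZAPPLTARIFF'
define view Z_Applicable_Tariff as
  select from dberchz1 as document
    inner join ertfnd as tariff
      on  tariff.tariftyp = document.tariftyp
      and tariff.ab       <= document.ab
      and tariff.bis      >= document.bis
{
  document.belzart  as doc_belzart,
  document.tariftyp as doc_tariftyp,
  document.ab       as doc_ab,
  document.bis      as doc_bis,
  tariff.tarifart   as tar_tarifart
}

@AbapCatalog.sqlViewName: 'ZMAXTARIFF'
define view Z_Max_Tariff as
  select from Z_Applicable_Tariff
{
  doc_belzart,
  doc_tariftyp,
  doc_ab,
  doc_bis,
  max(tar_tarifart) as max_tarifart
}
group by doc_belzart, doc_tariftyp, doc_ab, doc_bis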
Assume events of type A, B, C, or D are being emitted. I want to detect whenever an event of type A is followed by an event of type B. In other words, I want to detect sequences, for which Esper's EPL provides the -> operator.
However, what I described above is ambiguous; what I want is the following: whenever a B is detected, I want it to be matched with the most recent A.
I have been playing around with EPL's syntax, but the best I could come up with was that:
select * from pattern [(every a=A) -> b=B]
This, however, matches each B with the oldest A that occurred after the last match. Weird...
Help is much appreciated! :P
I use joins a lot for this kind of simple matching. The other option is match_recognize. The join looks like this:
select * from B unidirectional, A.std:lastevent()
I got the following tables:
Teams
Matches
I want to get an output like:
matches.semana | teams.nom_equipo | teams.nom_equipo | Winner
1              | AMERICA          | CRUZ AZUL        | AMERICA
1              | SANTOS           | MORELIA          | MORELIA
1              | LEON             | CHIVAS           | LEON
The columns matches.num_eqpo_loc and matches.num_eqpo_vis reference teams.num_eqpo, which in turn is used to get the name of each team (teams.nom_equipo) based on its id.
Edit: I have used the following:
SELECT m.semana, t_loc.nom_equipo AS LOCAL, t_vis.nom_equipo AS VISITANTE,
CASE WHEN m.goles_loc > m.goles_vis THEN 'home'
WHEN m.goles_vis > m.goles_loc THEN 'visitor'
ELSE 'tie'
END AS Vencedor
FROM matches AS m
JOIN teams AS t_loc ON (m.num_eqpo_loc = t_loc.num_eqpo)
JOIN teams AS t_vis ON (m.num_eqpo_vis = t_vis.num_eqpo)
ORDER BY m.semana;
But as you can see in the Matches table, in row #5 the goles_loc (home team goals) and goles_vis (visitor goals) columns hold 2 vs 2, a tie, yet when I run the code I get something that is not a tie:
Matches' score
Resultset from Select:
I also noticed that from row #5 onwards, the names of both teams (home and visitor) in the matches are not correct.
So the SELECT brings back correct data, but in an order different from the original order of the matches table.
The order from the second week must be:
row | matches.semana | teams.nom_equipo | teams.nom_equipo | Winner
5   | 2              | CRUZ AZUL        | TOLUCA           | TIE
6   | 2              | MORELIA          | LEON             | LEON
7   | 2              | CHIVAS           | SANTOS           | TIE
Row 8 from the Resultset must be Row # 5 and so on.
Any help would be really appreciated!
When a SELECT hard-codes NULL as the value for a column, that's the value it will always have, so winner in your case will never be populated.
Something like this is probably more along the lines of what you want:
SELECT m.semana, t_loc.nom_equipo AS loc_equipo, t_vis.nom_equipo AS vis_equipo,
CASE WHEN m.goles_loc - m.goles_vis > 0 THEN t_loc.nom_equipo
WHEN m.goles_vis - m.goles_loc > 0 THEN t_vis.nom_equipo
ELSE NULL
END AS winner
FROM matches AS m
JOIN teams AS t_loc ON (m.num_eqpo_loc = t_loc.num_eqpo)
JOIN teams AS t_vis ON (m.num_eqpo_vis = t_vis.num_eqpo)
ORDER BY m.semana;
Untested, but this should provide the general approach. Basically, you JOIN to the teams table twice, but using different conditions, and then you need to calculate the scores. I'm using NULL to indicate a tie, here.
Edit in response to comment from OP:
It's the same table -- teams -- but the JOINs produce different results, because the query uses different JOIN conditions in each JOIN.
The first JOIN, for t_loc, compares m.num_eqpo_loc to t_loc.num_eqpo. This means it gets the teams row for the home team.
The second JOIN, for t_vis, compares m.num_eqpo_vis to t_vis.num_eqpo. This means it gets the teams row for the visiting team.
Therefore, in the CASE expression, t_loc refers to the home team while t_vis refers to the visiting one, so the correct name can be picked for the winner.
Edit in response to follow-up comment from OP:
My original query was sorting only by m.semana, which means the other columns can appear in any order (essentially whichever order Postgres finds most efficient).
If you want the resulting table to be sorted exactly the same way as the matches table, then use the same tuple of columns in its ORDER BY.
So, the ORDER BY clause would then become:
ORDER BY m.semana, m.num_eqpo_loc, m.num_eqpo_vis
Basically, the matches table's PRIMARY KEY tuple.
The code in the following gist is lifted almost verbatim out of a lecture in Martin Odersky's Functional Programming Principles in Scala course on Coursera:
https://gist.github.com/aisrael/7019350
The issue occurs in line 38, within the definition of union in class NonEmpty:
def union(other: IntSet): IntSet =
// The following expression doesn't behave associatively
((left union right) union other) incl elem
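For context, the surrounding definitions from the lecture look roughly like this (a reconstructed sketch, not verbatim; contains is omitted):

abstract class IntSet {
  def incl(x: Int): IntSet
  def union(other: IntSet): IntSet
}

object Empty extends IntSet {
  def incl(x: Int): IntSet = new NonEmpty(x, Empty, Empty)
  def union(other: IntSet): IntSet = other // nothing to add
}

class NonEmpty(elem: Int, left: IntSet, right: IntSet) extends IntSet {
  def incl(x: Int): IntSet =
    if (x < elem) new NonEmpty(elem, left incl x, right)
    else if (x > elem) new NonEmpty(elem, left, right incl x)
    else this

  // the problematic, left-associated variant from the gist
  def union(other: IntSet): IntSet =
    ((left union right) union other) incl elem
}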
With the given expression, ((left union right) union other), largeSet.union(Empty) takes an inordinate amount of time to complete for sets with 100 elements or more.
When that expression is changed to (left union (right union other)), then the union operation finishes relatively instantly.
ADDED: Here's an updated worksheet that shows how even with larger sets/trees with random elements, the expression ((left ∪ right) ∪ other) can take forever but (left ∪ (right ∪ other)) will finish instantly.
https://gist.github.com/aisrael/7020867
The answer to your question is very much connected to relational databases and the smart choices they make. When a database "unions" tables, a smart query planner will make decisions around things like: "How large is Table A? Would it make more sense to join A & B first, or A & C?" when the user writes:
A Join B Join C
Anyhow, you can't expect the same behavior when you are writing the code by hand, because you have specified exactly the order you want, using parentheses. None of those smart decisions can happen automatically. (Though in theory they could, and that's why Oracle, Teradata, and MySQL exist.)
Consider a ridiculously large example:
Set A - 1 Billion Records
Set B - 500 Million Records
Set C - 10 Records
For argument's sake, assume that the union operator takes O(N) time, where N is the size of the SMALLEST of the two sets being joined. This is reasonable: each key of the smaller set can be looked up in the other as a hashed retrieval:
A & B runtime = O(N) = 500 million comparisons
(let's assume the class is just smart enough to use the smaller of the two for lookups)
so
(A & B) & C
Results in:
500,000,000 + 10 = 500,000,010 comparisons
Again pointing to the fact that it was forced to compare 1 billion records against 500 million records FIRST, per the inner parentheses, and then pull in 10 more.
But consider this:
A & (B & C)
Well now something amazing happens:
(B & C) runtime O(N) = 10 record comparisons (each of the 10 C records is checked against B for existence)
then
A & (result) = O(N) = 10
Total = 20 comparisons
Notice that once (B & C) was completed, we only had to bump 10 records against 1 billion!
Both examples produce the exact same result; one in about 20 comparisons, the other in 500,000,010!
To summarize, this problem illustrates in just a small way some of the complex thinking that goes into database design and the smart optimization that happens inside that software. These things do not happen automatically in programming languages unless you've coded them that way, or by using a library of some sort. You could, for example, write a function that takes several sets and intelligently decides the union order, though the problem becomes unbelievably complex if other set operations have to be mixed in (see the sketch below).
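For instance, a crude size-aware helper might look like this in Scala (a heuristic sketch using the standard library's immutable Set, not the course's IntSet):

// Hypothetical helper: union many sets smallest-first, so the
// intermediate results stay small for as long as possible.
def smartUnion[A](sets: List[Set[A]]): Set[A] =
  sets.sortBy(_.size).foldLeft(Set.empty[A])(_ union _)

Hope this helps.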
Associativity is not about performance. Two expressions may be equivalent by associativity but one may be vastly harder than the other to actually compute:
(23 * (14/2)) * (1/7)
Is the same as
23 * ((14/2) * (1/7))
But if it were me evaluating the two, I'd reach the answer (23) in a jiffy with the second one, but take longer if I forced myself to work with just the first one.