I am working on an Apache Beam pipeline to run a SQL aggregation function. Reference: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlDslJoinTest.java#L159.
The example there works fine. However, when I replace the source with an actual unbounded source and do an aggregation, I see no results.
Steps in my pipeline:
Read bounded data from a source and convert it to a collection of rows.
Read unbounded JSON data from a websocket source.
Assign a timestamp to every element of the source stream via a DoFn.
Convert the unbounded JSON to an unbounded row collection.
Apply a window on the row collection.
Apply a SQL statement.
Output the result of the SQL.
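Roughly, the windowing and SQL steps look like this (a simplified sketch; sensorRows and areaRows stand for my actual row collections):
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;

// Window the unbounded row collection (step 5).
PCollection<Row> windowedSensor = sensorRows.apply(
    "windowing", Window.<Row>into(FixedWindows.of(Duration.standardSeconds(2))));

// Register both collections as SQL tables and apply the statement (step 6).
PCollection<Row> joined = PCollectionTuple
    .of(new TupleTag<>("SENSOR"), windowedSensor)
    .and(new TupleTag<>("AREA"), areaRows)
    .apply(SqlTransform.query(
        "SELECT o1.detectedCount, o1.sensor se, o2.sensor sa "
            + "FROM SENSOR o1 LEFT JOIN AREA o2 ON o1.sensor = o2.sensor"));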
A normal SQL statement executes and outputs the results. However, when I use a GROUP BY in the SQL, there is no output. For example:
SELECT
o1.detectedCount,
o1.sensor se,
o2.sensor sa
FROM SENSOR o1
LEFT JOIN AREA o2
on o1.sensor = o2.sensor
The results are continuous and look like the output shown below.
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":0,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":1,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":0,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
The results don't show up at all when I change the SQL to:
SELECT
COUNT(o1.detectedCount), o2.sensor sa
FROM SENSOR o1
LEFT JOIN AREA o2
on o1.sensor = o2.sensor
GROUP BY o2.sensor
Is there anything I am doing wrong in this implementation? Any pointers would be really helpful.
Some suggestions that came up when reading your code:
Extend the window to allow lateness and to emit early-arriving data:
.apply("windowing", Window.<Row>into(FixedWindows.of(Duration.standardSeconds(2)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(1)))
.withLateFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(2))))
.withAllowedLateness(Duration.standardMinutes(10))
.discardingFiredPanes());
Try removing the join and check whether the window produces output without it; for example:
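This is only a sketch: windowedSensor stands for your windowed row collection, and PCOLLECTION is Beam SQL's built-in table name for a single input.
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Aggregate the windowed SENSOR rows alone; if this also produces no
// output, the problem is in the windowing/triggering, not in the join.
PCollection<Row> counts = windowedSensor.apply(
    SqlTransform.query(
        "SELECT sensor, COUNT(detectedCount) AS detected "
            + "FROM PCOLLECTION GROUP BY sensor"));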
Try giving the window more time, because sometimes it is too short to shuffle the data between the workers, and the joined streams aren't emitted at the same time.
outputWithTimestamp will output the rows with a different timestamp, and they can then be dropped when you don't allow lateness.
Read the docs for outputWithTimestamp; this API is a bit risky:
If the input {@link PCollection} elements have timestamps, the output
timestamp for each element must not be before the input element's
timestamp minus the value of {@link getAllowedTimestampSkew()}. If an
output timestamp is before this time, the transform will throw an
{@link IllegalArgumentException} when executed. Use {@link
withAllowedTimestampSkew(Duration)} to update the allowed skew.
CAUTION: Use of {@link #withAllowedTimestampSkew(Duration)} permits
elements to be emitted behind the watermark. These elements are
considered late, and if behind the {@link
Window#withAllowedLateness(Duration) allowed lateness} of a downstream
{@link PCollection} may be silently dropped.
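To illustrate the skew mechanics, a minimal sketch assuming a hypothetical eventTime row field; rows stands for the un-timestamped collection, and WithTimestamps exposes the withAllowedTimestampSkew knob quoted above:
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.joda.time.Duration;
import org.joda.time.Instant;

// Assign event timestamps from an assumed "eventTime" field and allow some
// skew; skewed elements may be emitted behind the watermark and can then be
// dropped as late data downstream, which is exactly the risk described above.
PCollection<Row> timestamped = rows.apply(
    WithTimestamps.<Row>of(row -> new Instant(row.getDateTime("eventTime").getMillis()))
        .withAllowedTimestampSkew(Duration.standardMinutes(1)));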
Finally, a corrected version of your aggregation query; every non-aggregated column in the SELECT list must also appear in the GROUP BY:
SELECT
COUNT(o1.detectedCount) as number
,o1.sensor se
,o2.sensor sa
FROM SENSOR o1
LEFT OUTER JOIN AREA o2
on o1.sensor = o2.sensor
GROUP BY o1.sensor, o2.sensor
I'm trying to get the count of events in a ksqlDB table within an arbitrary time window.
The table my_table was created with a WINDOW SESSION.
It is important to note the query is being run after all data was processed, and the ksqlDB server is basically doing nothing.
My query looks something like this:
SELECT count(*) as count
FROM my_table
WHERE WINDOWSTART < (1602010972370 + 5000) AND WINDOWEND > 1602010972370
GROUP BY 1 EMIT CHANGES;
Running this kind of query will very often return one result row, and immediately afterwards a second result row with the actual "final" result.
It doesn't look like it's a result of values in the table not being "settled" yet, because if I repeat the same query (as many times as I want) I get the exact same behavior.
I'm assuming there is some configuration value that will let ksqlDB wait just a little longer (on the order of one second) before returning the result, so I can get the final result in the first row?
BTW, using EMIT FINAL will not work on the query itself since it only applies to windowed queries.
I have a table containing a large number of records. There's a column defining the type of the record. I'd like to collect records with a specific value in that column. Something like:
SELECT * FROM myVeryOwnTable WHERE type = 'VERY_IMPORTANT_TYPE'
What I've noticed is that I can't use a WHERE clause in a custom query when I choose incremental (+ timestamp) mode; otherwise I'd need to take care of the filtering on my own.
The background of what I'd like to achieve: I use Logstash to transfer some types of data from MySQL to ES. That's easily achievable there by using a query that can contain a WHERE clause. However, with Kafka I can transfer my data much quicker (almost instantly) after inserting new rows in the DB.
Thank you for any hints or advice.
Thanks to @wardziniak I was able to set it up.
query=select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p
topic.prefix=test-mysql-jdbc-
incrementing.column.name=id
However, I was expecting a topic test-mysql-jdbc-myVeryOwnTable, so I had registered my consumer to that. It turns out that when the query shown above is used, the table name is skipped, so my topic was named exactly as the prefix defined above. I've just updated my properties to topic.prefix=test-mysql-jdbc-myVeryOwnTable and it seems to be working just fine.
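For reference, a minimal consumer against the prefix-named topic (plain Java client; the broker address and deserializers are just the usual boilerplate, not from the original setup):
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// With a custom "query", the topic name is exactly topic.prefix, so after
// updating the prefix this subscribes to the expected topic.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-consumer-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("test-mysql-jdbc-myVeryOwnTable"));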
You can use a subquery in your JDBC Source Connector query property.
Sample JDBC Source Connector configuration:
{
...
"query": "select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p",
"incrementing.column.name": "id",
...
}
I am new to Esper and I am trying to filter event properties from event streams that have multiple events arriving at high velocity.
I am using Kafka to send a CSV row by row from producer to consumer, and at the consumer I convert those rows to HashMaps and create events at run-time in Esper.
For example, I have the events listed below, which arrive every second:
WeatherEvent Stream:
E1 = {hum=51.0, precipi=1, precipm=1, tempi=57.9, icon=clear, pickup_datetime=2016-09-26 02:51:00, tempm=14.4, thunder=0, windchilli=, wgusti=, pressurei=30.18, windchillm=}
E2 = {hum=51.5, precipi=1, precipm=1, tempi=58.9, icon=clear, pickup_datetime=2016-09-26 02:55:00, tempm=14.5, thunder=0, windchilli=, wgusti=, pressurei=31.18, windchillm=}
E3 = {hum=52, precipi=1, precipm=1, tempi=59.9, icon=clear, pickup_datetime=2016-09-26 02:59:00, tempm=14.6, thunder=0, windchilli=, wgusti=, pressurei=32.18, windchillm=}
where E1, E2, ..., EN are multiple events in the WeatherEvent stream.
In the events above I just want to filter out properties like hum, tempi, tempm, and pressurei, because they change as time proceeds (over 4 seconds); I don't care about the properties that don't change at all or change really slowly.
Using the EPL query below I am able to filter out properties like tempm and hum:
@Name('Out') select * from weatherEvent.win:time(10 sec)
match_recognize (
partition by pickup_datetime?
measures A.tempm? as a_temp, B.tempm? as b_temp
pattern (A B)
define
B as Math.abs(B.tempm? - A.tempm?) > 0
)
The problem is that I can only do this when I specify tempm or hum in the query for pattern matching.
But the data comes from a CSV with many dimensions/features, so I don't know the properties of the events beforehand.
I want Esper to automatically detect the features/properties that are changing (during run-time) and filter them out, without me specifying the event properties.
Any ideas how to do this? Is it even possible with Esper? If not, can I do it with other CEP engines like Siddhi or Oracle CEP?
You may add a "?" to the event property name to get the value of those properties that are not known at the time the event type is defined. This is called dynamic property see documentation . The type returned is Object so you need to downcast.
From a stream (k,v), I want to calculate a stream (k, (v,f)) where f is the frequency of the occurrences of a given key in the last n seconds.
Given a topic (t1), if I use a windowed table to calculate the frequency:
KTable<Windowed<Integer>,Long> t1_velocity_table = t1_stream.groupByKey().windowedBy(TimeWindows.of(n*1000)).count();
This will give a windowed table with the frequency of each key.
Assuming I won't be able to join with a windowed key, instead of the table above I am mapping the stream to a table with a simple key:
t1_Stream.groupByKey()
.windowedBy(TimeWindows.of( n*1000)).count()
.toStream().map((k,v)->new KeyValue<>(k.key(), Math.toIntExact(v))).to(frequency_topic);
KTable<Integer,Integer> t1_frequency_table = builder.table(frequency_topic);
If I now look up in this table when a new key arrives in my stream, how do I know whether the lookup table will be updated first or the join will occur first (the latter would cause the stale frequency to be added to the record rather than the current, updated one)? Would it be better to create a stream instead of a table and then do a windowed join?
I want to look up the table with something like this:
KStream<Integer,Tuple<Integer,Integer>> t1_enriched = t1_Stream.join(t1_frequency_table, (l,r) -> new Tuple<>(l, r));
So instead of having just a stream of (k,v) I have a stream of (k,(v,f)) where f is the frequency of key k in the last n seconds.
Any thoughts on what would be the right way to achieve this? Thanks.
For the particular program you shared, the stream-side record will be processed first. The reason is that you pipe the data through a topic.
When the record is processed, it updates the aggregation result, which emits an update record that is written to the through-topic. Directly afterwards, the record is processed by the join operator. Only after that will a new poll() call eventually read the aggregation result from the through-topic and update the table side of the join.
Using the DSL, it seems not to be possible to achieve what you want. However, you can write a custom Transformer that re-implements the stream-table join and provides the semantics you need.
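A rough sketch of that Transformer approach, assuming the windowed count is materialized as a store named "freq-store" (e.g. via Materialized.as("freq-store")), that the store can be connected to the transformer by name, that values are Integers, and that Tuple is your own pair class:
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

class FrequencyEnricher implements ValueTransformerWithKey<Integer, Integer, Tuple<Integer, Integer>> {
    private final long windowSizeMs;
    private ProcessorContext context;
    private WindowStore<Integer, Long> freqStore;

    FrequencyEnricher(long windowSizeMs) { this.windowSizeMs = windowSizeMs; }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.freqStore = (WindowStore<Integer, Long>) context.getStateStore("freq-store");
    }

    @Override
    public Tuple<Integer, Integer> transform(Integer key, Integer value) {
        long now = context.timestamp();
        long frequency = 0L;
        // Sum the counts of all windows overlapping the last n seconds.
        try (WindowStoreIterator<Long> it = freqStore.fetch(key, now - windowSizeMs, now)) {
            while (it.hasNext()) {
                frequency += it.next().value;
            }
        }
        return new Tuple<>(value, (int) frequency);
    }

    @Override
    public void close() {}
}

// Usage: t1_Stream.transformValues(() -> new FrequencyEnricher(n * 1000L), "freq-store");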
I have used sc.broadcast for lookup files to improve performance.
I also came to know there is a function called broadcast in the Spark SQL functions.
What is the difference between the two?
Which one should I use for broadcasting reference/lookup tables?
Short answer:
1) The org.apache.spark.sql.functions.broadcast() function is a user-supplied, explicit hint for a given SQL join.
2) sc.broadcast is for broadcasting a read-only shared variable.
More details about the broadcast function (1):
Here is the Scala doc from sql/execution/SparkStrategies.scala, which says:
Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold, or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides of the join are eligible to be broadcasted then the ...
Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
Sort merge: if the matching join keys are sortable.
If there are no joining keys, join implementations are chosen with the following precedence:
BroadcastNestedLoopJoin: if one side of the join could be broadcasted
CartesianProduct: for inner join
BroadcastNestedLoopJoin
The method below controls the behavior based on the size we set for spark.sql.autoBroadcastJoinThreshold; by default it is 10 MB.
Note: smallDataFrame.join(largeDataFrame) does not do a broadcast hash join, but largeDataFrame.join(smallDataFrame) does.
/** Matches a plan whose output should be small enough to be used in broadcast join. */
private def canBroadcast(plan: LogicalPlan): Boolean = {
  plan.statistics.isBroadcastable ||
    plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold
}
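For completeness, a hedged example of adjusting that threshold at runtime via the Java API (the value is in bytes; setting it to -1 disables automatic broadcast joins):
// spark is an existing SparkSession; 50 MB threshold as an example.
spark.conf().set("spark.sql.autoBroadcastJoinThreshold", String.valueOf(50L * 1024 * 1024));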
Note that some of these configurations may be deprecated in coming versions of Spark.
If you want to achieve a broadcast join in Spark SQL you should use the broadcast function (combined with the desired spark.sql.autoBroadcastJoinThreshold configuration). It will:
Mark given relation for broadcasting.
Adjust SQL execution plan.
When the output relation is evaluated it will take care of collecting the data, broadcasting it, and applying the correct join mechanism.
SparkContext.broadcast is used to handle local objects and is not applicable for use with Spark DataFrames.
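To make the contrast concrete, here is a minimal sketch using the Java API (paths and column names are made up):
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.broadcast;

SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
Dataset<Row> large = spark.read().parquet("/data/large");    // hypothetical path
Dataset<Row> lookup = spark.read().parquet("/data/lookup");  // hypothetical path

// 1) functions.broadcast(): explicit plan hint; the lookup side is broadcast
//    and a broadcast hash join is used.
Dataset<Row> joined = large.join(broadcast(lookup), "id");

// 2) sc.broadcast: a read-only shared variable for local objects, usable
//    inside closures; it is not a join hint.
Map<String, String> localLookup = new HashMap<>();
Broadcast<Map<String, String>> bc =
    JavaSparkContext.fromSparkContext(spark.sparkContext()).broadcast(localLookup);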