Message batching: Apache Beam pipeline triggering immediately instead of after fixed window

I have an Apache Beam pipeline that reads from a Google Pub/Sub topic, applies deduplication, and should emit the messages to another Pub/Sub topic at the end of 15-minute fixed windows. I managed to get the deduplication working; the issue is that the messages are sent to the sink topic immediately instead of being held until the end of the 15 minutes.
Even after applying Window.triggering(AfterWatermark.pastEndOfWindow()) the behaviour didn't change, i.e. messages are still emitted immediately. (Ref: https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html)
Can someone help me figure out what is wrong with my code? Full code below.
Also, would it be correct to assume that the deduplication takes the fixed window as its bound, or do I need to set the time domain for the deduplication separately? (Ref: https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/transforms/Deduplicate.html seems to say it defaults to a time domain, which I assume would be the fixed window defined.)
pipeline
    .apply("Read from source topic", PubsubIO.<String>readStrings().fromTopic(SOURCE_TOPIC))
    .apply("De-duplication", Deduplicate.<String>values())
    .apply(windowDurationMins + " min Window",
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(windowDurationMins)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
    .apply("Format to JSON", ParDo.of(new DataToJson()))
    .apply("Emit to sink topic", PubsubIO.writeStrings().to(SINK_TOPIC));
[Update]
Tried the following, but nothing seemed to work:
- Removed the deduplication
- Changed to .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
- Read from the topic with a timestamp attribute: PubsubIO.<String>readStrings().fromTopic(SOURCE_TOPIC).withTimestampAttribute("publishTime")
Windowing seems to require some kind of timestamp associated with each data element, but .withTimestampAttribute("publishTime") from PubsubIO didn't seem to help. Is there something else I could try to attach a timestamp to my data for windowing?
[Update 2]
Tried manually attaching a timestamp based on this ref (https://beam.apache.org/documentation/programming-guide/#adding-timestamps-to-a-pcollections-elements) as below, but it STILL doesn't work:
.apply("Add timestamp", ParDo.of(new ApplyTimestamp()))
public class ApplyTimestamp extends DoFn<String, String> {
    @ProcessElement
    public void addTimestamp(ProcessContext context) {
        try {
            String data = context.element();
            Instant timestamp = new Instant();
            context.outputWithTimestamp(data, timestamp);
        } catch (Exception e) {
            LOG.error("Error timestamping", e);
        }
    }
}
At this point I feel like I'm about to go crazy lol...

A GroupByKey (GBK) transform is needed between the windowing, which should be applied immediately after reading from the source, and the deduplication logic. The Window transform only assigns windows; they take effect at the next GroupByKey, including GroupByKeys inside composite transforms, which group elements by the combination of key and window. Also note that the default trigger is already AfterWatermark.pastEndOfWindow() with zero allowed lateness, so setting it explicitly does not change the behaviour.
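As a rough, untested sketch of that reordering: window right after the read, then force a GroupByKey before the rest of the pipeline. The constant key via WithKeys and the Values/Flatten.iterables un-batching are just one way to introduce the GBK and are not part of the original code:
pipeline
    .apply("Read from source topic", PubsubIO.readStrings().fromTopic(SOURCE_TOPIC))
    .apply(windowDurationMins + " min Window",
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(windowDurationMins))))
    // Give every element the same key so the next GroupByKey groups purely by window.
    // A single constant key keeps the sketch simple at the cost of parallelism.
    .apply("Key by constant", WithKeys.<String, String>of(""))
    .apply("Group per window", GroupByKey.<String, String>create())
    // Each grouped bundle now belongs to a closed 15-minute window; flatten it back out.
    .apply("Extract values", Values.<Iterable<String>>create())
    .apply("Un-batch", Flatten.<String>iterables())
    .apply("De-duplication", Deduplicate.<String>values())
    .apply("Format to JSON", ParDo.of(new DataToJson()))
    .apply("Emit to sink topic", PubsubIO.writeStrings().to(SINK_TOPIC));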

Related

Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run using the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goal is to extract data from Cloud SQL, transform it and write it to Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  // produce one global window with one pane per ~500 records
  .withGlobalWindow(WindowOptions(
    trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
    accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
  ))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)
  .withShardNameTemplate("-P-S")
  .withWindowedWrites() // gets us one file per window & pane

pipe.saveAsCustomOutput("writer", out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
    try (PreparedStatement statement =
            connection.prepareStatement(
                query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
        statement.setFetchSize(fetchSize);
        parameterSetter.setParameters(context.element(), statement);
        try (ResultSet resultSet = statement.executeQuery()) {
            while (resultSet.next()) {
                context.output(rowMapper.mapRow(resultSet));
            }
        }
    }
}
In the end, I had two problems to resolve:
1. the process would run out of memory, and
2. the data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver: the driver loads the entire result set within a single call to executeQuery. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
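For illustration only, a sketch of that row-number trick (assuming the rows flow through a single DoFn instance, as they do with a non-splittable JDBC read; the class and constant names are made up):
static class RowTimestampFn extends DoFn<String, String> {
    private static final int ROWS_PER_WINDOW = 500; // illustrative batch size
    private long rowCount = 0; // per-instance counter, not global

    @ProcessElement
    public void processElement(ProcessContext c) {
        // Derive a synthetic event time from the running row count: every
        // ROWS_PER_WINDOW rows advance the timestamp by one second.
        long windowIndex = rowCount++ / ROWS_PER_WINDOW;
        c.outputWithTimestamp(c.element(), new Instant(windowIndex * 1000L));
    }
}

// One-second fixed windows then hold roughly ROWS_PER_WINDOW rows each, and
// withWindowedWrites() produces one file per window.
rows.apply(ParDo.of(new RowTimestampFn()))
    .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(1))));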
elementCountAtLeast is a lower bound, so producing only one pane is a valid choice for a runner.
You have a couple of options when doing this for a batch pipeline:
1. Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")

pipe.saveAsCustomOutput("writer", out)
This is typically the fastest option when the TextIO is preceded by a GroupByKey or by a source that supports splitting. To my knowledge, JDBC doesn't support splitting, so your best option is to add a Reshuffle after the jdbcSelect; that enables parallel processing of the data after it has been read from the database (see the sketch right after this).
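A minimal Java-SDK fragment of that suggestion, where jdbcRows stands in for the PCollection produced by the JDBC read (the names here are illustrative):
// Reshuffle breaks fusion after the non-splittable JDBC read, so the runner can
// redistribute elements and run the downstream ParDos and the write in parallel.
PCollection<String> reshuffled =
    jdbcRows.apply("Break fusion after JDBC read", Reshuffle.<String>viaRandomKey());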
2. Manually group into batches using the GroupIntoBatches transform:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  // note: GroupIntoBatches operates on KV pairs, so the rows need to be keyed first
  .apply(GroupIntoBatches.ofSize(500))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)

pipe.saveAsCustomOutput("writer", out)
In general, this will be slower than option #1, but it does allow you to choose how many records are written per file.
There are a few other ways to do this, each with its own pros and cons, but the above two are likely the closest to what you want. If you add more details to your question, I may revise this answer further.

Keeping entries in a state store just for a defined time

Problem: I need to find out who sent a message in the last, e.g., 24 hours. I have the following stream and state store for lookups.
@SendTo(Bindings.MESSAGE_STORE)
@StreamListener(Bindings.MO)
public KStream<?, ?> groupBySender(KStream<String, Message> messages) {
    return messages.selectKey((key, message) -> message.from)
        .map((k, v) -> new KeyValue<>(k, v.sentAt.toString()))
        .groupByKey()
        .reduce((oldTimestamp, newTimestamp) -> newTimestamp,
            Materialized.as(AggregatorApplication.MESSAGE_STORE))
        .toStream();
}
It works fine:
[
"key=123 value=2019-06-21T13:29:05.509Z",
"key=from value=2019-06-21T13:29:05.509Z",
]
so a lookup looks like:
store.get(from);
But I would like to automatically remove entries older than 24 h from the store; currently they will be persisted, probably forever.
Is there a better way to do it? Maybe some windowing operation?
At the moment, KTables (which are basically key-value stores) don't support a TTL (cf. https://issues.apache.org/jira/browse/KAFKA-4212).
The current recommendation is to use a windowed store if you want to expire data. For more flexibility, you might want to use a custom .transform() instead of windowedBy().reduce() (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html); a sketch of that approach follows.
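A rough, untested sketch of such a transform (imports omitted; the store name "message-store" and the class name are illustrative, and the store has to be registered with the topology and passed to transform() by name):
public class TtlTransformer implements Transformer<String, String, KeyValue<String, String>> {
    private static final Duration TTL = Duration.ofHours(24);
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<String, String>) context.getStateStore("message-store");
        // Periodically scan the store and drop entries whose timestamp is older than the TTL.
        context.schedule(Duration.ofMinutes(10), PunctuationType.WALL_CLOCK_TIME, now -> {
            try (KeyValueIterator<String, String> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    if (Instant.parse(entry.value).isBefore(Instant.now().minus(TTL))) {
                        store.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String sender, String sentAt) {
        store.put(sender, sentAt); // remember the latest send time per sender
        return KeyValue.pair(sender, sentAt);
    }

    @Override
    public void close() {}
}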

Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition?

I have a Spark Streaming application that needs to take these steps:
1. Take a string and apply some map transformations to it
2. Map again: if this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the Spark environment)
3. collect() and save in a specific directory
4. Apply some other transformation/enrichment
5. collect() and save in another directory.
As you can see, this implies two lazily activated computations, which perform the OUTSIDE action twice. I am trying to avoid caching, because at some hundreds of lines per second it would kill my server.
I am also trying to maintain the order of operations, though this is not as important. Is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD {
    splittedRDD = arg.map { split the string };
    assRDD = splittedRDD.map { associate to a table };
    flaggedRDD = assRDD.map { add a boolean parameter under an if condition + send mail };
    externalClass.saveStaticMethod( flaggedRDD.collect() and save in file );
    enrichRDD = flaggedRDD.map { enrich with external data };
    externalClass.saveStaticMethod( enrichRDD.collect() and save in file );
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.
In the end, these are the approaches I found:
1. In the DStream transformation before the side-effecting one, branch the DStream: one branch goes on with the transformations, the other gets the .foreachRDD { outside action }. There is no major downside to this, as it is just one more RDD on a worker node.
2. Extract the { outside action } from the transformation and filter on the mails already sent, i.e. filter out elements whose mail has already been sent. This is almost a superfluous operation, as it will filter out all of the RDD elements.
3. Caching before going on (although I was trying to avoid it, there was not much else to do).
If you want to avoid caching, solution 1 is the way to go; a rough sketch follows.
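A loose Java sketch of solution 1, with placeholder helpers (assStream, addFlag, isFlagged, sendMail and enrich stand in for the question's own logic):
JavaDStream<String[]> flagged = assStream.map(row -> addFlag(row));

// Branch 1: side effect only; the mail is sent once per batch here instead of
// being re-triggered by every downstream action.
flagged.foreachRDD(rdd ->
    rdd.filter(row -> isFlagged(row)).foreach(row -> sendMail(row)));

// Branch 2: the normal pipeline keeps going from the same DStream.
JavaDStream<String[]> enriched = flagged.map(row -> enrich(row));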

Job executed with no data in Spark Streaming

My code:
// messages is JavaPairDStream<K, V>
Fun01(messages)
Fun02(messages)
Fun03(messages)
Fun01, Fun02 and Fun03 all contain transformations and output operations (foreachRDD).
Fun01 and Fun03 both executed as expected, which proves that "messages" is not null or empty.
On the Spark application UI, I found Fun02's output stage under "Spark stages", which proves it executed.
The first line of Fun02 is a map function, and I added logging to it. I also added logging for every step in Fun02; all of it shows that no data comes through.
Does somebody know the possible reasons? Thanks very much.
@maasg Fun02's logic is:
msg_02 = messages.mapToPair(...)
msg_03 = msg_02.reduceByKeyAndWindow(...)
msg_04 = msg_03.mapValues(...)
msg_05 = msg_04.reduceByKeyAndWindow(...)
msg_06 = msg_05.filter(...)
msg_07 = msg_06.filter(...)
msg_07.cache()
msg_07.foreachRDD(...)
I have tested on Spark 1.1 and Spark 1.2, which are the versions my company's Spark cluster supports.
It seems that this is a bug in Spark 1.1 and Spark 1.2, fixed in Spark 1.3.
I post my test result here: http://secfree.github.io/blog/2015/05/08/spark-streaming-reducebykeyandwindow-data-lost.html .
When two reduceByKeyAndWindow calls are used back to back, data loss may appear depending on the window and slide values.
I cannot find the bug in Spark's issue list, so I cannot get the patch.

Incorrect partition detected by PartitionScanner in custom TextEditor for eclipse

I have a PartitionScanner that extends RuleBasedPartitionScanner in my custom text editor plugin for Eclipse. I am having issues with the partition scanner detecting character sequences within larger strings, resulting in the document being partitioned incorrectly. For example, within the constructor of my partition scanner I have the following rule set up:
public MyPartitionScanner() {
    ...
    rules.add(new MultiLineRule("SET", "ENDSET", mytoken));
    ...
}
However, if I happen to use an identifier that contains the character sequence "SET", the partition scanner continues searching for the end sequence ("ENDSET") and marks the rest of the document as a single partition set to "mytoken":
var myRESULTSET34 = ...
Is there a way to make the partition scanner ignore the "SET" inside the identifier above and only recognize the whole word "SET"?
Thank you.
Using MultiLineRule as is, you won't be able to differentiate. But you can create your own subclass that overrides sequenceDetected and does a look-back/look-ahead when the super implementation returns true, to make sure the match is preceded/followed by EOF or whitespace. If it isn't, push the characters back onto the scanner and return false.
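A rough, untested sketch of the look-ahead half of that idea (the class name WordBoundaryMultiLineRule is made up; a symmetric look-back check would be needed for the character before the match):
public class WordBoundaryMultiLineRule extends MultiLineRule {

    public WordBoundaryMultiLineRule(String start, String end, IToken token) {
        super(start, end, token);
    }

    @Override
    protected boolean sequenceDetected(ICharacterScanner scanner, char[] sequence, boolean eofAllowed) {
        if (!super.sequenceDetected(scanner, sequence, eofAllowed)) {
            return false;
        }
        // Look ahead one character: the sequence must be followed by EOF or whitespace.
        int next = scanner.read();
        scanner.unread();
        if (next != ICharacterScanner.EOF && !Character.isWhitespace((char) next)) {
            // Not a whole word: push back the characters the super call consumed and reject.
            for (int i = 1; i < sequence.length; i++) {
                scanner.unread();
            }
            return false;
        }
        return true;
    }
}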