Why does the memory sink write nothing in append mode? - scala

I used Spark's Structured Streaming to stream messages from Kafka. The data was then aggregated and written to a memory sink in append mode. However, when I tried to query the in-memory table, it returned nothing. Below is the code:
Dataset<Row> result = model
        .withColumn("timeStamp", col("startTimeStamp").cast("timestamp"))
        .withWatermark("timeStamp", "5 minutes")
        .groupBy(window(col("timeStamp"), "5 minutes").alias("window"))
        .agg(count("*").alias("total"));

// writing to memory
StreamingQuery query = result.writeStream()
        .outputMode(OutputMode.Append())
        .queryName("datatable")
        .format("memory")
        .start();

// query data in memory
new Timer().scheduleAtFixedRate(new TimerTask() {
    @Override
    public void run() {
        sparkSession.sql("SELECT * FROM datatable").show();
    }
}, 10000, 10000);
The result is always:
+------+-----+
|window|total|
+------+-----+
+------+-----+
If I use outputMode = complete, I can get the aggregated data, but that is not an option because the requirement is to use append mode.
Is there any problem with the code?
Thanks,

In append mode:
The output of a windowed aggregation is delayed by the late threshold specified in withWatermark().
In your case the delay is 5 minutes. I know nothing about your input data, but you will probably see nothing for a given window until events arrive with timestamps at least 5 minutes past that window's end, because only then does the watermark allow the window's row to be appended.
I suggest you read (again?) the docs for Structured Streaming.
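To make the timing concrete, here is a hedged sketch of the same aggregation (names taken from the question) annotated with when a row actually reaches the memory sink in append mode:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch only: "model" stands for the streaming Dataset read from Kafka, as in the question.
Dataset<Row> result = model
        .withColumn("timeStamp", col("startTimeStamp").cast("timestamp"))
        .withWatermark("timeStamp", "5 minutes")
        .groupBy(window(col("timeStamp"), "5 minutes").alias("window"))
        .agg(count("*").alias("total"));

// In append mode the row for window [10:00, 10:05) is appended only after the
// watermark (max event time seen so far minus the 5-minute threshold) passes 10:05,
// i.e. only once an event with timeStamp >= 10:10 has arrived and a subsequent
// micro-batch has run. Until then "SELECT * FROM datatable" is legitimately empty.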

Related

pyspark.sql.utils.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame

I have a Glue streaming job, and I need to write the data as a stream after applying some processing, so I did the following:
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true", "startingPosition": starting_position_of_kinesis_iterator}
)

glueContext.forEachBatch(
    frame=data_frame_DataSource0,
    batch_function=processBatch,
    options={
        "windowSize": window_size,
        "checkpointLocation": s3_path_spark
    }
)
In processBatch I do some processing, and at the end of it I do the following:
df.writeStream.format("hudi").options(**combinedConf).outputMode('append').start()
I am getting the following error:
pyspark.sql.utils.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame
As far as I understand, the df I am trying to write is not streaming, which is why it gives this error. I am not sure how to change that from the Glue context, or how to apply the processing on the streaming data and then writeStream it.
Any ideas?
The forEachBatch method processes a streaming Dataset/DataFrame in micro-batches, so when you call it on data_frame_DataSource0, the df passed to processBatch is a normal (batch) Dataset/DataFrame containing one batch of data.
You have two options to fix this:
Deal with the df inside processBatch as a normal DataFrame:
df.write.format("hudi").options(**combinedConf).mode("append").save()
Apply your stream processing directly to data_frame_DataSource0:
data_frame_DataSource0 = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=kinesis_table_name,
    transformation_ctx="DataSource0",
    additional_options={"inferSchema": "true", "startingPosition": starting_position_of_kinesis_iterator}
)

(
    data_frame_DataSource0.writeStream.format("hudi")
        .options(**combinedConf)
        .option("inferSchema", "true")
        .option("startingPosition", starting_position_of_kinesis_iterator)
        .outputMode('append')
        .start()
)
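The Glue-specific wiring aside, the same rule applies in plain Spark Structured Streaming: the frame handed to a foreachBatch callback is a batch DataFrame and is written with write(), not writeStream(). A minimal Java sketch of that pattern (the source, sink, and paths here are placeholders, not taken from the question):

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class ForeachBatchSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("foreachBatch-sketch").getOrCreate();

        // Placeholder streaming source; in the question this is the Kinesis-backed frame.
        Dataset<Row> streamingDf = spark.readStream().format("rate").load();

        StreamingQuery query = streamingDf.writeStream()
                .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
                    @Override
                    public void call(Dataset<Row> batchDf, Long batchId) {
                        // batchDf.isStreaming() is false here, so use write(), not writeStream().
                        batchDf.write().format("parquet").mode("append").save("/tmp/sketch-output");
                    }
                })
                .option("checkpointLocation", "/tmp/sketch-checkpoint")
                .start();

        query.awaitTermination();
    }
}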

Unit testing Kafka streams with groupByKey/windowedBy/count

My question is similar to: How to unit test a kafka stream application that uses session window
The topology looks like:
.filter()
.groupByKey()
.windowedBy(SessionWindows.with(30).grace(5))
.count()
.toStream()
.selectKey((k, v)->k.key())
.to(outTopic)
When I run this application and send data like:
key1, {somejson}
key1, {somejson}
key1, {somejson}
In the output topic, I correctly see the record after 30 seconds as expected
key1, 3
When I write a unit test for the same (after reading the other question about advanceWallClockTime), my test code looks like:
final Instant now = Instant.now();
// Send messages with one second difference timestamps
testDriver.pipeInput(consumerRecordFactory.create(inputTopicName, "key1", json, now.toEpochMilli()));
testDriver.pipeInput(consumerRecordFactory.create(inputTopicName, "key1", json, now.plusMillis(1000L).toEpochMilli()));
testDriver.pipeInput(consumerRecordFactory.create(inputTopicName, "key1", json, now.plusMillis(2000L).toEpochMilli()));
testDriver.advanceWallClockTime(35000L);
Then I try to compare the results
ProducerRecord<String, Long> life = testDriver.readOutput(outputTopicName, stringSerde.deserializer(), longSerde.deserializer());
Assert.assertEquals(life.value(), Long.valueOf(3));
I expect it to be 3, but it seems it's always 1. However, if I write something like:
List<ProducerRecord<String, Long>> expectedList = Arrays.asList(
        new ProducerRecord<String, Long>(outputTopicName, "key1", 1L),
        new ProducerRecord<String, Long>(outputTopicName, "key1", 2L),
        new ProducerRecord<String, Long>(outputTopicName, "key1", 3L)
);

for (ProducerRecord<String, Long> expected : expectedList) {
    ProducerRecord<String, Long> actual = testDriver.readOutput(outputTopicName, stringSerde.deserializer(), longSerde.deserializer());
    Assert.assertEquals(expected.value(), actual.value());
}
then my test passes.
What am I doing wrong? Eventually, I would like to add data for two different keys and test that both of them come out with count 3L.
The difference you see comes down to how the TopologyTestDriver works. For context, it helps to first explain how Kafka Streams treats stateful operations.
When you run the Kafka Streams application "for real", records from stateful operations are buffered by the internal cache. Kafka Streams flushes the internal cache when either of the two following conditions is met:
Committing records (default commit interval is 30 seconds)
The cache is full.
From what you describe above, you observe the count of 3 only after Kafka Streams commits the consumed offsets: the first two counts were replaced in the cache, and only the final count of 3 is emitted.
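If you ever want a "real" run to emit every intermediate update rather than only the flushed result, one option (a hedged sketch, not something this answer requires) is to disable the record cache and, optionally, shorten the commit interval:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// 0 bytes of cache: every update from count() is forwarded downstream immediately.
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
// Commit more often than the 30-second default mentioned above.
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);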
Now with the TopologyTestDriver, there is no internal caching; the test driver forwards each record. As a result, you'll have to call testDriver.readOutput for each record you've submitted.
So your line above
ProducerRecord<String, Long> life = testDriver.readOutput(outputTopicName, stringSerde.deserializer(), longSerde.deserializer());
returns only the first output record (the count of 1 produced from your first testDriver.pipeInput call), since you called testDriver.readOutput just once.
You'll notice in your second code example:
for (ProducerRecord<String, Long> expected : expectedList) {
    ProducerRecord<String, Long> actual = testDriver.readOutput(outputTopicName, stringSerde.deserializer(), longSerde.deserializer());
    Assert.assertEquals(expected.value(), actual.value());
}
You get the expected result because you execute testDriver.readOutput the same number of times as you've input test records.
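If you would rather assert only on the final count, and later cover several keys at once, one option is to drain everything the driver has forwarded and keep the latest value per key. A hedged sketch using the same test-driver API as in the question (readOutput returns null when no record is available):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.junit.Assert;

// Drain all forwarded updates and remember only the latest count per key.
Map<String, Long> latestCounts = new HashMap<>();
ProducerRecord<String, Long> rec;
while ((rec = testDriver.readOutput(outputTopicName,
        stringSerde.deserializer(), longSerde.deserializer())) != null) {
    latestCounts.put(rec.key(), rec.value());
}
Assert.assertEquals(Long.valueOf(3L), latestCounts.get("key1"));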
HTH,
Bill

Spark Streaming store method only works in Duration window but not in foreachRDD workflow in customized receiver

I define a receiver to read data from Redis.
Part of the receiver code, simplified:
class MyReceiver extends Receiver(StorageLevel.MEMORY_ONLY) {
  override def onStart() = {
    while (!isStopped) {
      val res = readMethod()
      if (res != null) store(res.toIterator)
      // using res.foreach(r => store(r)) the performance is almost the same
    }
  }
}
My streaming workflow:
val ssc = new StreamingContext(spark.sparkContext, new Duration(50))
val myReceiver = new MyReceiver()
val s = ssc.receiverStream(myReceiver)
s.foreachRDD { r =>
  r.persist()
  if (!r.isEmpty) {
    // some short operations, about 1s in total
    // note this line ######1
  }
}
I have a producer which produces much faster than the consumer, so there are plenty of records in Redis; I tested with 10000. When debugging, all records could be read quickly by readMethod() above once they were in Redis. However, in each micro-batch I only get about 30 records. (If store were fast enough, it should get all 10000.)
Suspecting this, I added a 10-second sleep, Thread.sleep(10000), at ######1 above. Each micro-batch still gets about 30 records, and each micro-batch's processing time increases by 10 seconds. If I increase the Duration to 200 ms, val ssc = new StreamingContext(spark.sparkContext, new Duration(200)), it gets about 120 records.
All of this suggests Spark Streaming only generates the RDD within the Duration window? Is the store method temporarily stopped once the RDD is handed to the main workflow? That would be a great waste if true. I want it to keep generating RDDs (via store) while the main workflow is running.
Any ideas?
I cannot leave a comment because I don't have enough reputation. Is it possible that the property spark.streaming.receiver.maxRate is set somewhere in your code?
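For what it's worth, a quick way to check from the driver (a hedged sketch in Java; spark is the SparkSession from the snippet above) is to print the effective value of that property:

// Prints Some(<value>) if a per-receiver rate limit is configured, None otherwise.
System.out.println(
        spark.sparkContext().getConf().getOption("spark.streaming.receiver.maxRate"));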

How to access the data from streaming query in "memory" table for subsequent batch queries?

Given a writeStream call:
val outDf = (sdf.writeStream
  .outputMode(outputMode)
  .format("memory")
  .queryName("MyInMemoryTable")
  .trigger(Trigger.ProcessingTime(interval))
  .start())
How can I run SQL against MyInMemoryTable, e.g.
val df = spark.sql("""select Origin,Dest,Carrier,avg(DepDelay) avgDepDelay
from MyInMemoryTable group by 1,2,3""")
The documentation for Spark Structured Streaming says that batch and streaming queries can be intermixed, but the above is not working:
org.apache.spark.sql.AnalysisException: 'writeStream' can be called only
on streaming Dataset/DataFrame;
So how can the InMemoryTable be used in subsequent queries?
The following post on the Hortonworks site has an approach that seems promising: https://community.hortonworks.com/questions/181979/spark-structured-streaming-formatmemory-is-showing.html
Here is the sample writeStream - which is of the same form as my original question:
StreamingQuery initDF = df.writeStream()
        .outputMode("append")
        .format("memory")
        .queryName("initDF")
        .trigger(Trigger.ProcessingTime(1000))
        .start();

sparkSession.sql("select * from initDF").show();

initDF.awaitTermination();
And here is the response:
Okay, the way it works is: in simple terms, the main thread of your code launches another thread in which your streaming query logic runs. Meanwhile, your main code is blocked by initDF.awaitTermination().
sparkSession.sql("select * from initDF").show() runs on the main thread, and execution reaches that line only once.
So update your code to:
StreamingQuery initDF = df.writeStream()
        .outputMode("append")
        .format("memory")
        .queryName("initDF")
        .trigger(Trigger.ProcessingTime(1000))
        .start();

while (initDF.isActive()) {
    Thread.sleep(10000);
    sparkSession.sql("select * from initDF").show();
}
Now the main thread of your code goes through the loop over and over again, querying the table each time.
Applying the suggestions to my code results in:
while (outDf.isActive) {
  Thread.sleep(30000)
  strmSql(s"select * from $table", doCnt = false, show = true, nRows = 200)
}
outDf.awaitTermination(1 * 20000)
Update: this worked great. I am seeing updated results after each mini-batch.

Apache Beam: Error assigning event time using WithTimestamps

I have an unbounded Kafka stream sending data with the following fields
{"identifier": "xxx", "value": 10.0, "ts":"2019-01-16T10:51:26.326242+0000"}
I read the stream using the Apache Beam SDK for Kafka:
import org.apache.beam.sdk.io.kafka.KafkaIO;

pipeline.apply(KafkaIO.<Long, String>read()
        .withBootstrapServers("kafka:9092")
        .withTopic("test")
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
        .updateConsumerProperties(ImmutableMap.of("group.id", "Consumer1"))
        .commitOffsetsInFinalize()
        .withoutMetadata());
Since I want to window using event time ("ts" in my example), I parse the incoming string and assign the "ts" field of the incoming data stream as the timestamp.
PCollection<Temperature> tempCollection = p.apply(new SetupKafka())
        .apply(ParDo.of(new ReadFromTopic()))
        .apply("ParseTemperature", ParDo.of(new ParseTemperature()));

tempCollection.apply("AssignTimeStamps", WithTimestamps.of(us -> new Instant(us.getTimestamp())));
The window function and the computation are applied as below:
PCollection<Output> output = tempCollection.apply(Window
        .<Temperature>into(FixedWindows.of(Duration.standardSeconds(30)))
        .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(10))))
        .withAllowedLateness(Duration.standardDays(1))
        .accumulatingFiredPanes())
        .apply(new ComputeMax());
I stream data into the input topic with a lag of 5 seconds from the current UTC time, since in practical scenarios the event timestamp is usually earlier than the processing timestamp.
I get the following error:
Cannot output with timestamp 2019-01-16T11:15:45.560Z. Output
timestamps must be no earlier than the timestamp of the current input
(2019-01-16T11:16:50.640Z) minus the allowed skew (0 milliseconds).
See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing
the allowed skew.
If I comment out the AssignTimeStamps line, there are no errors, but then I guess it is using processing time.
How do I ensure my computation and windows are based on event time and not on processing time?
Please provide some input on how to handle this scenario.
To be able to use a custom timestamp, you first need to implement a custom timestamp policy by extending TimestampPolicy<KeyT, ValueT>.
For example:
public class CustomFieldTimePolicy extends TimestampPolicy<String, Foo> {

    protected Instant currentWatermark;

    public CustomFieldTimePolicy(Optional<Instant> previousWatermark) {
        currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
    }

    @Override
    public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, Foo> record) {
        currentWatermark = new Instant(record.getKV().getValue().getTimestamp());
        return currentWatermark;
    }

    @Override
    public Instant getWatermark(PartitionContext ctx) {
        return currentWatermark;
    }
}
Then you need to pass your custom TimestampPolicy when setting up your KafkaIO source, using the functional interface TimestampPolicyFactory:
KafkaIO.<String, Foo>read()
        .withBootstrapServers("http://localhost:9092")
        .withTopic("foo")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializerAndCoder(KafkaAvroDeserializer.class, AvroCoder.of(Foo.class)) // if you use Avro
        .withTimestampPolicyFactory((tp, previousWatermark) -> new CustomFieldTimePolicy(previousWatermark))
        .updateConsumerProperties(kafkaProperties)
This line is responsible for creating a new timestamp policy, passing the related partition and the previously checkpointed watermark; see the documentation:
withTimestampPolicyFactory((tp, previousWatermark) -> new CustomFieldTimePolicy(previousWatermark))
Have you had a chance to try this using the timestamp policy? Sorry, I have not tried this one out myself, but I believe with 2.9.0 you should look at using the policy along with the KafkaIO read.
https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/kafka/KafkaIO.Read.html#withTimestampPolicyFactory-org.apache.beam.sdk.io.kafka.TimestampPolicyFactory-