I'm using PySpark Structured Streaming with Kafka as the reader stream. For the sake of unit testing, I would like to replace the Kafka reader stream with a mock. How can I do this?
The following question is equivalent to mine, but I use Python (PySpark) instead of Scala, and I couldn't find MemoryStream in PySpark.
How to perform Unit testing on Spark Structured Streaming?
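For reference, a minimal sketch of the MemoryStream pattern from the linked Scala question (MemoryStream is not exposed in PySpark, so there the usual workaround is to factor the transformation into a function that takes a DataFrame and test it against a static DataFrame instead; the transformation below is only a stand-in):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local[*]").appName("mock-kafka-source").getOrCreate()
import spark.implicits._
implicit val sqlCtx = spark.sqlContext

// MemoryStream stands in for the Kafka source during the test.
val source = MemoryStream[String]

// Stand-in for the job's real transformation logic.
val transformed = source.toDS().map(_.toUpperCase)

val query = transformed.writeStream
  .format("memory")            // collect results into an in-memory table
  .queryName("test_output")
  .outputMode("append")
  .start()

source.addData("a", "b")
query.processAllAvailable()

// Assert on what the query produced.
assert(spark.table("test_output").as[String].collect().sorted.sameElements(Array("A", "B")))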
Related
I am trying to write a unit test case for my Spark BigQuery implementation.
Say I want to test the piece of code below:
val wordsDF = spark.read.format("bigquery")
  .option("table", "bigquery-public-data:samples.shakespeare")
  .load()
  .cache()
And then do some assertions on wordsDF.
I am trying to think along the lines of:
Embedded Kafka for Kafka
Embedded Cassandra for Cassandra
Do we have any such library or alternative that can be used to write Spark BigQuery unit tests?
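One possible approach (a sketch; loadWords and defaultBigQueryLoader are hypothetical names, not part of any library) is to hide the reader behind a function that tests can override, so the assertions run against a local DataFrame instead of a real BigQuery table:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Production path: read from BigQuery with the same options as the code under test.
def defaultBigQueryLoader(spark: SparkSession): DataFrame =
  spark.read.format("bigquery")
    .option("table", "bigquery-public-data:samples.shakespeare")
    .load()

// The code under test takes a loader, so a test can inject a local DataFrame.
def loadWords(spark: SparkSession, loader: SparkSession => DataFrame): DataFrame =
  loader(spark).cache()

// In a test (e.g. ScalaTest with a local SparkSession):
// import spark.implicits._
// val fakeDF = Seq(("spark", 10L), ("test", 3L)).toDF("word", "word_count")
// val wordsDF = loadWords(spark, _ => fakeDF)
// assert(wordsDF.count() == 2)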
I want to create a SparkSession within a NiFi custom processor written in Scala. So far I can create my SparkSession in a plain Scala project, but when I add this SparkSession inside the onTrigger method of the NiFi custom processor, the session is never created. Is there any way to achieve this? So far I have imported the spark-core and spark-sql libraries.
Any feedback is appreciated.
Not possible with FlowFiles. Period.
You need Kafka in between NiFi and Spark Streaming or Spark Structured Streaming. Here is a good read, by the way: https://community.cloudera.com/t5/Community-Articles/Spark-Structured-Streaming-with-NiFi-and-Kafka-using-PySpark/ta-p/245068
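A minimal sketch of that pattern, assuming NiFi publishes FlowFile content to a Kafka topic via a PublishKafka processor and Spark Structured Streaming consumes it (broker address and topic name are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nifi-kafka-spark").getOrCreate()

// Consume the topic that NiFi writes to.
val fromNifi = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
  .option("subscribe", "nifi-output")                // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS payload")

fromNifi.writeStream
  .format("console")
  .start()
  .awaitTermination()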
I have a Scala Spark application in which I need to switch between streaming from Kafka and Kinesis based on the application configuration.
Both the Spark APIs for Kafka streaming (spark-streaming-kafka-0-10_2.11) and Kinesis streaming (spark-streaming-kinesis-asl_2.11) return an InputDStream when creating a stream, but the value types are different.
Creating a Kafka stream returns InputDStream[ConsumerRecord[String, String]],
whereas creating a Kinesis stream returns InputDStream[Array[Byte]].
Is there any API that returns a generic InputDStream irrespective of Kafka or Kinesis, so that my stream processing can have a generic implementation instead of separate code paths for Kafka and Kinesis?
I tried assigning both streams to an InputDStream[Any], but that did not work.
I would appreciate any ideas on how this can be done.
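One way to get there (a sketch; fromKafka, fromKinesis and process are hypothetical helpers, not an existing API) is to map each source to a common element type right after creation, so everything downstream works on a plain DStream[String]:

import java.nio.charset.StandardCharsets

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.{DStream, InputDStream}

// Normalize the Kafka stream (ConsumerRecord values) to DStream[String].
def fromKafka(stream: InputDStream[ConsumerRecord[String, String]]): DStream[String] =
  stream.map(_.value())

// Normalize the Kinesis stream (raw byte arrays) to DStream[String].
def fromKinesis(stream: InputDStream[Array[Byte]]): DStream[String] =
  stream.map(bytes => new String(bytes, StandardCharsets.UTF_8))

// Shared processing logic, written once against DStream[String].
def process(records: DStream[String]): Unit =
  records.foreachRDD(rdd => println(rdd.count()))

// Based on configuration, build whichever stream is configured and hand it over:
// process(if (useKafka) fromKafka(kafkaStream) else fromKinesis(kinesisStream))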
My understanding is that Spark Structured Streaming is built on top of Spark SQL and not Spark Streaming. Hence the following question: do the properties that apply to Spark Streaming also apply to Spark Structured Streaming, such as:
spark.streaming.backpressure.initialRate
spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate
No, these settings are applicable only to the DStream API.
Spark Structured Streaming does not have a backpressure mechanism. You can find more details in this discussion: How Spark Structured Streaming handles backpressure?
No.
Spark Structured Streaming processes data as soon as possible by default, i.e. right after finishing the current micro-batch. You can control the rate of processing for the various source types, e.g. maxFilesPerTrigger for file sources and maxOffsetsPerTrigger for Kafka.
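For example, a minimal sketch of capping the per-trigger intake (broker, topic, path, schema and values are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("rate-limited-sources").getOrCreate()

// Kafka source: consume at most 10,000 offsets per micro-batch.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "10000")
  .load()

// File source: pick up at most 10 new files per micro-batch
// (streaming file sources need an explicit schema).
val fileSchema = new StructType().add("line", StringType)
val fileStream = spark.readStream
  .schema(fileSchema)
  .format("json")
  .option("maxFilesPerTrigger", "10")
  .load("/data/incoming")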
This link http://javaagile.blogspot.com/2019/03/everything-you-needed-to-know-about.html explains that backpressure is not relevant.
It quotes: "Structured Streaming cannot do real backpressure, because, such as, Spark cannot tell other applications to slow down the speed of pushing data into Kafka."
I am not sure this aspect is relevant, as Kafka buffers the data. Nonetheless, the article has good merit IMHO.
I continuously have data being written to Cassandra from an outside source.
Now I am using Spark Streaming to continuously read this data from Cassandra with the following code:
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

val ssc = new StreamingContext(sc, Seconds(5))

// Full-table scan; the RDD is recomputed on every batch, so each batch rescans the table.
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
val dstream = new ConstantInputDStream(ssc, cassandraRDD)

dstream.foreachRDD { rdd =>
  println("\n" + rdd.count())
}

ssc.start()
ssc.awaitTermination()
sc.stop()
However, the following line:
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
takes the entire table's data from Cassandra every time, not just the newest data saved into the table.
What I want is for Spark Streaming to read only the latest data, i.e., the data added after its previous read.
How can I achieve this? I tried to Google it but found very little documentation on the subject.
I am using Spark 1.4.1, Scala 2.10.4 and Cassandra 2.1.12.
Thanks!
EDIT:
The suggested duplicate question (asked by me) is NOT a duplicate, because it talks about connecting Spark Streaming and Cassandra, whereas this question is about streaming only the latest data. BTW, streaming from Cassandra IS possible using the code I provided. However, it takes the entire table every time and not just the latest data.
There will be some low-level work on Cassandra that will allow notifying external systems (an indexer, a Spark stream, etc.) of new mutations coming into Cassandra. Read this: https://issues.apache.org/jira/browse/CASSANDRA-8844
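Until something like that lands, a common workaround is to keep a high-water mark on the driver and filter each batch with a WHERE clause. A sketch, assuming the table has a queryable timestamp column, here called inserted_at (a hypothetical name; it would need to be a clustering column or indexed for the predicate to be efficient):

import com.datastax.spark.connector._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

val ssc = new StreamingContext(sc, Seconds(5))

// Dummy DStream just to drive the batch interval; the real read happens inside foreachRDD.
val ticks = new ConstantInputDStream(ssc, sc.parallelize(Seq(0)))

var lastSeen = new java.util.Date(0L)  // high-water mark, kept on the driver

ticks.foreachRDD { rdd =>
  val cutoff = lastSeen
  lastSeen = new java.util.Date()
  // Only rows written since the previous batch.
  val newRows = rdd.sparkContext.cassandraTable("keyspace2", "feeds").where("inserted_at > ?", cutoff)
  println("\n" + newRows.count())
}

ssc.start()
ssc.awaitTermination()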