I am trying to write to two streams in parallel, but only the last one is working.
Here is my code:
val trump_topic = filteredQueryTrump.select($"ID".as("key"), $"title".as("value"))
.writeStream.format("kafka")
.trigger(Trigger.ProcessingTime("10 seconds"))
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.mechanism", "SCRAM-SHA-256")
.option("kafka.sasl.jaas.config", """org.apache.kafka.common.security.scram.ScramLoginModule required username="" password="";""")
.option("kafka.bootstrap.servers", "rocket-01.srvs.cloudkafka.com:9094")
.option("topic", "trump")
.option("checkpointLocation", "/tmp").start()
val biden_topic = filteredQueryBiden.select($"ID".as("key"), $"title".as("value"))
.writeStream.format("kafka")
.trigger(Trigger.ProcessingTime("4 seconds"))
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.mechanism", "SCRAM-SHA-256")
.option("kafka.sasl.jaas.config", """org.apache.kafka.common.security.scram.ScramLoginModule required username="" password="";""")
.option("kafka.bootstrap.servers", "rocket-01.srvs.cloudkafka.com:9094")
.option("topic", "biden")
.option("checkpointLocation", "/tmp").start()
I tried spark.streams.awaitAnyTermination() but it didn't help either.
You need two different checkpoint locations, one for each output stream.
Also keep the awaitAnyTermination call you already tried, to make sure the jobs keep running continuously:
spark.streams.awaitAnyTermination()
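A minimal sketch of the corrected setup, assuming the same connection options as in the question (the SASL options are omitted here for brevity) and using one dedicated, illustrative checkpoint directory per query:
import org.apache.spark.sql.streaming.Trigger

// One dedicated checkpoint directory per query; all other options as in the question.
val trumpQuery = filteredQueryTrump.select($"ID".as("key"), $"title".as("value"))
.writeStream.format("kafka")
.trigger(Trigger.ProcessingTime("10 seconds"))
.option("kafka.bootstrap.servers", "rocket-01.srvs.cloudkafka.com:9094")
.option("topic", "trump")
.option("checkpointLocation", "/tmp/checkpoint-trump")
.start()

val bidenQuery = filteredQueryBiden.select($"ID".as("key"), $"title".as("value"))
.writeStream.format("kafka")
.trigger(Trigger.ProcessingTime("4 seconds"))
.option("kafka.bootstrap.servers", "rocket-01.srvs.cloudkafka.com:9094")
.option("topic", "biden")
.option("checkpointLocation", "/tmp/checkpoint-biden")
.start()

// Block the driver until either query terminates.
spark.streams.awaitAnyTermination()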
As an alternative, you could combine those two queries into one:
val combinedDf = filteredQueryTrump.select($"ID".as("key"), $"title".as("value"), lit("trump").as("topic"))
.union(filteredQueryBiden.select($"ID".as("key"), $"title".as("value"), lit("biden").as("topic")))
The content of the topic column is interpreted by the spark-sql-kafka-0-10 library to decide which topic each row should be written to, so you can use that technique to write to multiple Kafka topics within one stream. When doing this, it is important not to set the topic option when calling writeStream. However, this way you cannot have two different trigger durations; there is only one for the overall stream.
For example, your code could then look like this:
val outputQuery = combinedDf.select($"key", $"value", $"topic")
.writeStream
.format("kafka")
.trigger(Trigger.ProcessingTime("4 seconds"))
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.mechanism", "SCRAM-SHA-256")
.option("kafka.sasl.jaas.config", """org.apache.kafka.common.security.scram.ScramLoginModule required username="" password="";""")
.option("kafka.bootstrap.servers", "rocket-01.srvs.cloudkafka.com:9094")
.option("checkpointLocation", "/tmp")
outputQuery.start()
outputQuery.awaitTermination()
Related
I am not able to stream my data to multiple HDFS locations filtered by key, so the code below is not working. Please help me find the correct way to write it.
val ER_stream_V1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", configManager.getString("Kafka.Server"))
.option("subscribe", "Topic1")
.option("startingOffsets", "latest")
.option("failOnDataLoss", "false")
.load()
val ER_stream_V2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", configManager.getString("Kafka.Server"))
.option("subscribe", "Topic1")
.option("startingOffsets", "latest")
.option("failOnDataLoss", "false")
.load()
ER_stream_V1.toDF()
.select(col("key"), col("value").cast("string"))
.filter(col("key")==="Value1")
.select(functions.from_json(col("value").cast("string"), Value1Schema.schemaExecution).as("value")).select("value.*")
.writeStream
.format("orc")
.option("metastoreUri", configManager.getString("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/tmp/teststreaming/execution/checkpoint2005")
.option("path", "/tmp/test/value1")
.trigger(Trigger.ProcessingTime("5 Seconds"))
.partitionBy("jobid")
.start()
ER_stream_V2.toDF()
.select(col("key"), col("value").cast("string"))
.filter(col("key")==="Value2")
.select(functions.from_json(col("value").cast("string"), Value2Schema.schemaJobParameters).as("value"))
.select("value.*")
.writeStream
.format("orc")
.option("metastoreUri", configManager.getString("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/tmp/teststreaming/jobparameters/checkpoint2006")
.option("path", "/tmp/test/value2")
.trigger(Trigger.ProcessingTime("5 Seconds"))
.partitionBy("jobid")
.start()
You should not need two readers: create one and filter twice. You might also want to consider setting startingOffsets to earliest to read existing topic data.
For example:
val ER_stream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", configManager.getString("Kafka.Server"))
.option("subscribe", "Topic1")
.option("startingOffsets", "latest") // maybe change?
.option("failOnDataLoss", "false")
.load()
.toDF()
.select(col("key").cast("string").as("key"), col("value").cast("string"))
val value1Stream = ER_stream
.filter(col("key") === "Value1")
.select(functions.from_json(col("value"), Value1Schema.schemaExecution).as("value"))
.select("value.*")
val value2Stream = ER_stream
.filter(col("key") === "Value2")
.select(functions.from_json(col("value"), Value2Schema.schemaJobParameters).as("value"))
.select("value.*")
value1Stream.writeStream.format("orc")
...
.start()
value2Stream.writeStream.format("orc")
...
.start()
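For completeness, a sketch of the two writers, reusing the sink options (format, metastore URI, paths, trigger, and checkpoint locations) from the question's own code:
import org.apache.spark.sql.streaming.Trigger

// Each writer keeps its own checkpoint location and output path, as in the question.
value1Stream.writeStream
.format("orc")
.option("metastoreUri", configManager.getString("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/tmp/teststreaming/execution/checkpoint2005")
.option("path", "/tmp/test/value1")
.trigger(Trigger.ProcessingTime("5 Seconds"))
.partitionBy("jobid")
.start()

value2Stream.writeStream
.format("orc")
.option("metastoreUri", configManager.getString("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/tmp/teststreaming/jobparameters/checkpoint2006")
.option("path", "/tmp/test/value2")
.trigger(Trigger.ProcessingTime("5 Seconds"))
.partitionBy("jobid")
.start()

// Wait until either query terminates.
spark.streams.awaitAnyTermination()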
I am trying to run simple variations of examples from the official Spark tutorial and the book "Spark Streaming in Action".
The content of the exception is strange. What is wrong with my code?
First of all I start the Kafka ZooKeeper, server, producer and 2 consumers. Then I run the following code:
// read from kafka
val df = sparkService.sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", topic1)
.load()
// write to kafka
import sparkService.sparkSession.implicits._
val query = df.selectExpr("CAST(key as STRING)", "CAST(value as STRING)")
.writeStream
.outputMode(OutputMode.Append())
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", topic2)
.option("checkpointLocation", "/home/pt/Dokumenty/tmp/")
.option("failOnDataLoss", "false") // only when testing
.start()
query.awaitTermination(30000)
The error occurs when writing to Kafka:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got 1
1609627750463
In Kafka I get new topics dynamically and I have to process them using Spark Streaming from a specific offset. Is there a way to pass the JSON value from a variable? For example, consider the code below:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.load()
Here I want to dynamically update the value for startingOffsets. I tried passing the value as a string variable, but it did not work; if I hardcode the same value in startingOffsets, it works. How do I use a variable in this scenario? This is what I tried:
val start_offset= """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}"""
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", start_offset)
.load()
java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}"""
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[*]").setAppName("ReadSpecificOffsetFromKafka");
val spark = SparkSession.builder().config(conf).getOrCreate();
spark.sparkContext.setLogLevel("error");
import spark.implicits._;
val start_offset = """{"first_topic" : {"0" : 15, "1": -2, "2": 6}}"""
val fromKafka = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092, localhost:9093")
.option("subscribe", "first_topic")
// .option("startingOffsets", "earliest")
.option("startingOffsets", start_offset)
.load();
val selectedValues = fromKafka.selectExpr("cast(value as string)", "cast(partition as integer)");
selectedValues.writeStream
.format("console")
.outputMode("append")
// .trigger(Trigger.Continuous("3 seconds"))
.start()
.awaitTermination();
}
This is working code to fetch specific offsets from Kafka using Spark Structured Streaming and Scala: the offsets JSON is held in a plain val and passed to startingOffsets. Note that the error in the question shows the literal triple quotes inside the value that was passed, which suggests the string was built with an extra layer of quoting rather than assigned as plain JSON.
It looks like your job is checkpointing the Kafka offsets onto some persistent storage. Try cleaning those out and re-running your job. Also try renaming your job and running it again.
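If this refers to the writer from the earlier question, a minimal sketch of that idea: pointing the query at a fresh, empty checkpoint directory (the "-fresh" path is purely illustrative) has the same effect as cleaning out the old checkpointed offsets.
import org.apache.spark.sql.streaming.OutputMode

// Sketch: a new, empty checkpoint directory means no stale Kafka offsets are read back on restart.
val cleanQuery = df.selectExpr("CAST(key as STRING)", "CAST(value as STRING)")
.writeStream
.outputMode(OutputMode.Append())
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", topic2)
.option("checkpointLocation", "/home/pt/Dokumenty/tmp-fresh/") // illustrative fresh directory
.option("failOnDataLoss", "false")
.start()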
Spark reads the stream via readStream, so try again with startingOffsets in the format shown in the error message to get rid of the error, for example:
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribePattern", "topic.*")
.option("startingOffsets", """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}""") // the example format from the error message
.load()
I have a Spark Structured Streaming query:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.option("subscribe", "topic")
.load()
I want to write the data to the file system using DataStreamWriter:
val query = df
.writeStream
.outputMode("append")
.format("parquet")
.start("data")
But zero files are being created in the data folder; only _spark_metadata is created.
However, I can see the data on the console when the format is console:
val query = df
.writeStream
.outputMode("append")
.format("console")
.start()
+--------------------+------------------+------------------+
| time| col1| col2|
+--------------------+------------------+------------------+
|49368-05-11 20:42...|0.9166470338147503|0.5576946794171861|
+--------------------+------------------+------------------+
I cannot understand the reason behind it.
Spark - 2.1.0
I had a similar problem but for different reasons; posting here in case someone has the same issue. When writing your output stream to files in append mode with watermarking, Structured Streaming has an interesting behavior: it won't actually write any data until a time bucket is older than the watermark. If you're testing Structured Streaming with an hour-long watermark, you won't see any output for at least an hour.
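A minimal sketch of that behavior, assuming the source provides an event-time column named timestamp and using illustrative window and watermark durations; in append mode, a window's row only reaches the output files once the watermark has passed the end of that window:
import org.apache.spark.sql.functions.{col, window}

// Aggregate over 10-minute event-time windows with a 1 hour watermark (illustrative values).
val windowedCounts = df
.withWatermark("timestamp", "1 hour")
.groupBy(window(col("timestamp"), "10 minutes"))
.count()

// With a 1 hour watermark, no rows are written for at least an hour after the first events arrive.
val fileQuery = windowedCounts
.writeStream
.outputMode("append")
.format("parquet")
.option("checkpointLocation", "checkpoint-windowed") // illustrative path
.start("data-windowed")                              // illustrative path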
I resolved this issue. When I tried to run the structured streaming query in spark-shell, it gave an error that endingOffsets is not valid in streaming queries, i.e.:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.option("subscribe", "topic")
.load()
java.lang.IllegalArgumentException: ending offset not valid in streaming queries
at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:374)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:373)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:373)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:60)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:199)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
... 48 elided
So, I removed endingOffsets from the streaming query:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("subscribe", "topic")
.load()
Then I tried to save the streaming query's results in Parquet files, during which I learned that the checkpoint location must be specified, i.e.:
val query = df
.writeStream
.outputMode("append")
.format("parquet")
.start("data")
org.apache.spark.sql.AnalysisException: checkpointLocation must be specified either through option("checkpointLocation", ...) or SparkSession.conf.set("spark.sql.streaming.checkpointLocation", ...);
at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:207)
at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:204)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:203)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:206)
... 48 elided
So, I added checkpointLocation:
val query = df
.writeStream
.outputMode("append")
.format("parquet")
.option("checkpointLocation", "checkpoint")
.start("data")
After these modifications, I was able to save the streaming query's results in Parquet files.
But it is strange that when I ran the same code as an sbt application it didn't throw any errors, while running it via spark-shell threw these errors. I think Apache Spark should throw these errors when run via an sbt/maven app too; it seems like a bug to me!
I'm trying to read messages from Kafka (version 0.10) in Spark and print them:
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.config("spark.master", "local")
.getOrCreate()

import spark.implicits._
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA")
.load()
ds1.collect.foreach(println)
ds1.writeStream
.format("console")
.start()
ds1.printSchema()
I am getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
You are branching the query plan: from the same ds1 you are trying to:
ds1.collect.foreach(...)
ds1.writeStream.format(...){...}
But you are only calling .start() on the second branch, leaving the first one dangling without a termination, which in turn throws the exception you are getting.
The solution is to start both branches and await termination.
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA")
.load()
// collect cannot be used on a streaming Dataset, so write this branch to a sink instead
val query1 = ds1.writeStream
.format("console")
.start()
val query2 = ds1.writeStream
.format("console")
.start()
ds1.printSchema()
query1.awaitTermination()
query2.awaitTermination()
I struggled a lot with this issue. I tried each of the suggested solutions from various blogs. In my case there were a few statements between calling start() on the query and finally, at the end, calling awaitTermination(), and that was causing it.
Please try it in this fashion; it works perfectly for me.
Working example:
val query = df.writeStream
.outputMode("append")
.format("console")
.start().awaitTermination();
If you write it this way, it will cause an exception/error:
val query = df.writeStream
.outputMode("append")
.format("console")
.start()
// some statement
// some statement
query.awaitTermination();
This will throw the given exception and close your streaming driver.
I fixed the issue by using the following code:
val df = session
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", "streamTest2")
.load();
val query = df.writeStream
.outputMode("append")
.format("console")
.start()
query.awaitTermination()
Kindly remove ds1.collect.foreach(println) and ds1.printSchema(); use outputMode and awaitAnyTermination for the background process, waiting until any of the queries on the associated spark.streams has terminated:
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.config("spark.master", "local[*]")
.getOrCreate()
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA") .load()
val consoleOutput1 = ds1.writeStream
.outputMode("update")
.format("console")
.start()
spark.streams.awaitAnyTermination()
Sample console output (an empty batch here):
+---+-----+-----+---------+------+
|key|value|topic|partition|offset|
+---+-----+-----+---------+------+
+---+-----+-----+---------+------+
I was able to resolve this issue with the following code. In my scenario, I had multiple intermediate DataFrames, which were basically transformations made on the inputDF:
val query = joinedDF
.writeStream
.format("console")
.option("truncate", "false")
.outputMode(OutputMode.Complete())
.start()
.awaitTermination()
joinedDF is the result of the last transformation performed.
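For illustration, a hedged sketch of such a chain; inputDF, referenceDF, and the join/aggregation below are hypothetical, the point being that only the final DataFrame needs writeStream, start, and awaitTermination:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.OutputMode

// Hypothetical intermediate transformations; only the last DataFrame is written out.
val parsedDF   = inputDF.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
val enrichedDF = parsedDF.join(referenceDF, Seq("key"))  // referenceDF: a static lookup DataFrame (hypothetical)
val joinedDF   = enrichedDF.groupBy(col("key")).count()  // aggregation, so Complete output mode is valid

joinedDF
.writeStream
.format("console")
.option("truncate", "false")
.outputMode(OutputMode.Complete())
.start()
.awaitTermination()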