Scala 2.13 overloaded foreachBatch

I am trying to write a streaming Dataset to Snowflake as shown below:
val query = expandedDF.writeStream
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .foreachBatch { (batchDF: DataFrame, batchID: Long) =>
    batchDF.write
      .format(SNOWFLAKE_SOURCE_NAME)
      .options(sfOptions)
      .option("dbtable", "TWITTER")
      .mode(SaveMode.Append)
      .save()
  }
  .outputMode("update")
  .start()
query.awaitTermination()
However, whenever this command is executed I run into the error:
overloaded method foreachBatch with alternatives
Does anyone know how I can resolve this?
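For reference, the usual workaround for this ambiguity (a minimal sketch, assuming the same expandedDF, SNOWFLAKE_SOURCE_NAME, and sfOptions as in the question) is to give the batch function an explicit Scala function type, so the compiler no longer has to choose between the Scala and Java overloads of foreachBatch:

import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.streaming.Trigger

// Explicitly typed as a Scala Function2; only the
// (Dataset[Row], Long) => Unit overload of foreachBatch can accept it.
val writeToSnowflake: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  batchDF.write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(sfOptions)
    .option("dbtable", "TWITTER")
    .mode(SaveMode.Append)
    .save()
}

val query = expandedDF.writeStream
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .foreachBatch(writeToSnowflake)
  .outputMode("update")
  .start()
query.awaitTermination()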

Related

Databricks, When should I use ".start()" with writeStream?

I am practicing with Databricks. In sample notebooks, I have seen different uses of writeStream, with or without the .start() method, and I have a few questions in this regard.
Samples are below:
Without .start():
(spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", source_format)
  .option("cloudFiles.schemaLocation", checkpoint_directory)
  .load(data_source)
  .writeStream
  .option("checkpointLocation", checkpoint_directory)
  .option("mergeSchema", "true")
  .table(table_name))
With .start():
(myDF
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath)
  .outputMode("append")
  .start(path)
)
With .start():
query = (streaming_df.writeStream
  .foreachBatch(streaming_merge.upsert_to_delta)
  .outputMode("update")
  .option("checkpointLocation", checkpoint_directory)
  .trigger(availableNow=True)
  .start())
query.awaitTermination()
Q1) I don't understand where I should or shouldn't use the .start() method. I would appreciate it if you could guide me on this.
Q2) If I don't pass a path to start(), where will the data files be written?
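One detail that may help with Q1 (a sketch, not a full answer; df and checkpointDir are placeholder names): in Scala, DataStreamWriter.toTable (Spark 3.1+) is the counterpart of Python's .table, and it both configures and starts the query, returning a StreamingQuery. That is why the first sample needs no separate .start():

// toTable starts the query itself, so no .start() is needed
val query = df.writeStream
  .option("checkpointLocation", checkpointDir)
  .toTable("my_table")

// A file-format sink started with .start() needs an explicit target path,
// either as the argument or via .option("path", ...); it does not fall
// back to some default directory.
val fileQuery = df.writeStream
  .format("delta")
  .option("checkpointLocation", checkpointDir)
  .start("/mnt/output/my_table")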

Structured Streaming exception: Append output mode not supported for streaming aggregations

I am getting the following error when I run my Spark job:
org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets;;
I am not sure if the issue is caused by the lack of a watermark, which I don't know how to apply in this context.
The following aggregation is applied:
def aggregateByValue(): DataFrame = {
  df.withColumn("Value", expr("(BookingClass, Value)"))
    .groupBy("AirlineCode", "Origin", "Destination", "PoS", "TravelDate", "StartSaleDate", "EndSaleDate", "avsFlag")
    .agg(collect_list("Value").as("ValueSeq"))
    .drop("Value")
}
Usage:
val theGroupedDF = theDF
  .multiplyYieldByHundred
  .explodeDates
  .aggregateByValue

val query = theGroupedDF.writeStream
  .outputMode("append")
  .format("console")
  .start()
query.awaitTermination()
Changing the outputMode to complete solved the issue: in complete mode the whole updated result table is emitted on every trigger, so Spark never has to decide whether an aggregate row is final.
val query = theGroupedDF.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()
Adding the following would also solve the problem:
val theGroupedDF = theDF
  .multiplyYieldByHundred
  .explodeDates
  .aggregateByValue
  // the two lines below are the addition
  .withColumn("timestamp", current_timestamp())
  .withWatermark("timestamp", "10 minutes")

Not able to write Data in Parquet File using Spark Structured Streaming

I have a Spark Structured Streaming job:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .option("subscribe", "topic")
  .load()
I want to write the data to the filesystem using DataStreamWriter:
val query = df
  .writeStream
  .outputMode("append")
  .format("parquet")
  .start("data")
But no files are getting created in the data folder; only _spark_metadata is created.
However, I can see the data on the console when the format is console:
val query = df
  .writeStream
  .outputMode("append")
  .format("console")
  .start()
+--------------------+------------------+------------------+
| time| col1| col2|
+--------------------+------------------+------------------+
|49368-05-11 20:42...|0.9166470338147503|0.5576946794171861|
+--------------------+------------------+------------------+
I cannot understand the reason behind it.
Spark - 2.1.0
I had a similar problem but for different reasons; posting here in case someone has the same issue. When writing your output stream to file in append mode with watermarking, Structured Streaming has an interesting behavior: it won't actually write any data until a time bucket is older than the watermark. If you're testing Structured Streaming and have an hour-long watermark, you won't see any output for at least an hour.
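A sketch of that behavior (hypothetical column names; assumes the stream has an event-time column eventTime and a SparkSession named spark is in scope):

import org.apache.spark.sql.functions.window
import spark.implicits._

val counts = df
  .withWatermark("eventTime", "1 hour")
  .groupBy(window($"eventTime", "10 minutes"))
  .count()

// In append mode, a 10-minute window is flushed to the Parquet files only
// once the watermark (max event time seen minus 1 hour) passes the end of
// that window, so output lags the input by at least the watermark delay.
counts.writeStream
  .outputMode("append")
  .format("parquet")
  .option("checkpointLocation", "checkpoint")
  .start("data")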
I resolved this issue. When I tried to run the Structured Streaming job in spark-shell, it gave an error that endingOffsets is not valid in streaming queries, i.e.:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .option("subscribe", "topic")
  .load()
java.lang.IllegalArgumentException: ending offset not valid in streaming queries
at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:374)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:373)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:373)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:60)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:199)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
... 48 elided
So, I removed endingOffsets from the streaming query:
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "earliest")
  .option("subscribe", "topic")
  .load()
Then I tried to save the streaming query's results in Parquet files, at which point I learned that a checkpoint location must be specified, i.e.:
val query = df
  .writeStream
  .outputMode("append")
  .format("parquet")
  .start("data")
org.apache.spark.sql.AnalysisException: checkpointLocation must be specified either through option("checkpointLocation", ...) or SparkSession.conf.set("spark.sql.streaming.checkpointLocation", ...);
at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:207)
at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:204)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:203)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:206)
... 48 elided
So, I added checkpointLocation:
val query = df
  .writeStream
  .outputMode("append")
  .format("parquet")
  .option("checkpointLocation", "checkpoint")
  .start("data")
After making these modifications, I was able to save the streaming query's results in Parquet files.
But it is strange that when I ran the same code via an sbt application, it didn't throw any errors, whereas running it via spark-shell did. I think Apache Spark should throw these errors when run via an sbt/Maven app too; it seems like a bug to me!
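As the error message itself points out, an alternative to setting the option on each query is a session-wide default checkpoint root (the path here is just an example); each query then checkpoints into a subdirectory of it:

// session-wide default instead of .option("checkpointLocation", ...)
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")

val query = df
  .writeStream
  .outputMode("append")
  .format("parquet")
  .start("data")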

Queries with streaming sources must be executed with writeStream.start();

I'm trying to read messages from Kafka (version 0.10) in Spark and print them.
val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .config("spark.master", "local")
  .getOrCreate()
import spark.implicits._

val ds1 = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicA")
  .load()

ds1.collect.foreach(println)
ds1.writeStream
  .format("console")
  .start()
ds1.printSchema()
I am getting the error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
You are branching the query plan: from the same ds1 you are trying to:
ds1.collect.foreach(...)
ds1.writeStream.format(...){...}
But you are only calling .start() on the second branch, leaving the other dangling without a termination, which in turn throws the exception you are getting back.
The solution is to start both branches and await termination. (Note that collect is not supported on a streaming Dataset, so the first branch has to become a streaming query of its own, for example another console sink.)
val ds1 = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicA")
  .load()

// first branch: print the data through a console sink instead of collect
val query1 = ds1.writeStream
  .format("console")
  .start()

// second branch
val query2 = ds1.writeStream
  .format("console")
  .start()

ds1.printSchema()
query1.awaitTermination()
query2.awaitTermination()
I struggled a lot with this issue and tried every suggested solution from various blogs.
In my case there were a few statements between calling start() on the query and, at the very end, calling awaitTermination(), and that was the cause.
Please try it in this fashion; it works perfectly for me.
Working example:
val query = df.writeStream
  .outputMode("append")
  .format("console")
  .start()
  .awaitTermination()
If you write it this way, it will cause an exception/error:
val query = df.writeStream
  .outputMode("append")
  .format("console")
  .start()
// some statement
// some statement
query.awaitTermination()
This will throw the given exception and close your streaming driver.
I fixed the issue by using the following code:
val df = session
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", "streamTest2")
  .load()

val query = df.writeStream
  .outputMode("append")
  .format("console")
  .start()
query.awaitTermination()
Kindly remove ds1.collect.foreach(println) and ds1.printSchema(); use outputMode and awaitAnyTermination for the background process, waiting until any of the queries on the associated spark.streams has terminated:
val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .config("spark.master", "local[*]")
  .getOrCreate()

val ds1 = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicA")
  .load()

val consoleOutput1 = ds1.writeStream
  .outputMode("update")
  .format("console")
  .start()

spark.streams.awaitAnyTermination()
Sample output:
+---+-----+-----+---------+------+
|key|value|topic|partition|offset|
+---+-----+-----+---------+------+
+---+-----+-----+---------+------+
I was able to resolve this issue with the following code. In my scenario, I had multiple intermediate DataFrames, which were basically transformations made on the inputDF.
val query = joinedDF
  .writeStream
  .format("console")
  .option("truncate", "false")
  .outputMode(OutputMode.Complete())
  .start()
  .awaitTermination()
joinedDF is the result of the last transformation performed.

How to persist output of window() function in JDBC with Spark SQL DataFrame?

When the following snippet executes:
...
stream
  .map(_.value())
  .flatMap(MyParser.parse(_))
  .foreachRDD(rdd => {
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._

    val dataFrame = rdd.toDF()
    val countsDf = dataFrame.groupBy($"action", window($"time", "1 hour")).count()
    val query = countsDf.write.mode("append").jdbc(url, "stats_table", prop)
  })
...
This error happens: java.lang.IllegalArgumentException: Can't get JDBC type for struct<start:timestamp,end:timestamp>
How would one go about saving the output of the org.apache.spark.sql.functions.window() function to a MySQL DB?
I ran into the same problem using Spark SQL:
val query3 = dataFrame
  .groupBy(org.apache.spark.sql.functions.window($"timeStamp", "10 minutes"), $"data")
  .count()
  .writeStream
  .outputMode(OutputMode.Complete())
  .options(prop)
  .option("checkpointLocation", "file:///tmp/spark-checkpoint1")
  .option("table", "temp")
  .format("com.here.olympus.jdbc.sink.OlympusDBSinkProvider")
  .start
And I solved it by adding a user-defined function:
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions.udf

val toString = udf { (window: GenericRowWithSchema) => window.mkString("-") }
A String works for me, but you can change the function according to your needs; you could even have two functions that return the start and end separately.
My query changed to:
val query3 = dataFrame
  .groupBy(org.apache.spark.sql.functions.window($"timeStamp", "10 minutes"), $"data")
  .count()
  .withColumn("window", toString($"window"))
  .writeStream
  .outputMode(OutputMode.Complete())
  .options(prop)
  .option("checkpointLocation", "file:///tmp/spark-checkpoint1")
  .option("table", "temp")
  .format("com.here.olympus.jdbc.sink.OlympusDBSinkProvider")
  .start
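A UDF isn't strictly required, though: window() produces a struct<start:timestamp,end:timestamp>, and the struct's fields can be selected out as plain timestamp columns that JDBC maps directly. A sketch under the same assumptions as the query above:

import org.apache.spark.sql.functions.window

val flattened = dataFrame
  .groupBy(window($"timeStamp", "10 minutes"), $"data")
  .count()
  // pull the struct fields out as ordinary timestamp columns,
  // then drop the struct column that JDBC cannot map
  .withColumn("windowStart", $"window.start")
  .withColumn("windowEnd", $"window.end")
  .drop("window")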