I'm using Spark Structured Streaming and trying to save the output of my query to HDFS. But there is a dilemma: because I'm using an aggregation, I have to set the output mode to "complete", while the file sink only supports "append" mode. Is there a workaround for this? My code is below:
val userSchema = new StructType().add("user1", "integer")
.add("user2", "integer")
.add("timestamp", "timestamp")
.add("interaction", "string")
val tweets = spark
.option("sep", ",")
val windowedCounts = tweets.filter("interaction='MT'")
window($"timestamp", "10 seconds", "10 seconds")
val query = windowedCounts.writeStream
.option("path", "partb_q2")
.option("checkpointLocation", "checkpoint")
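One workaround that is often used for this situation, sketched here under assumptions: a windowed aggregation is allowed in append mode once a watermark is defined on the event-time column, and append mode is what the file sink accepts. The 10-second watermark delay, the Parquet format, and the wrapper object name are assumptions for illustration; `tweets` stands for the streaming DataFrame from the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical wrapper object, used only for this sketch.
object WindowedCountsToFiles {
  // Counts 'MT' interactions per 10-second window and appends finalized windows to Parquet files.
  def start(tweets: DataFrame): StreamingQuery = {
    val windowedCounts = tweets
      .filter("interaction = 'MT'")
      // Assumed lateness bound; the watermark is what makes append mode legal for the aggregation.
      .withWatermark("timestamp", "10 seconds")
      .groupBy(window(col("timestamp"), "10 seconds", "10 seconds"))
      .count()

    windowedCounts.writeStream
      .outputMode("append")
      .format("parquet")
      .option("path", "partb_q2")
      .option("checkpointLocation", "checkpoint")
      .start()
  }
}
```

With this approach a window is written only after the watermark has passed the end of that window, so results reach the files later than they would appear in complete mode.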
I have many CSV sources, each read with spark.readStream from a different location, and I have to checkpoint all of them in Scala. I defined a query for every stream, but when I run the job I get this message:
java.lang.IllegalArgumentException: Cannot start query with name "query1" as a query with that name is already active
I solved my problem by creating a separate streaming query for each source, like this:
val spark = SparkSession
.config("spark.master", "local[*]")
val event1 = spark
.readStream
.option("header", "true")
.option("sep", ",")
val query = event1.writeStream
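A minimal sketch of the idea, assuming the name clash comes from reusing one query name for several streams: each source gets its own query name and its own checkpoint directory. The object name, the `sources` map, the `checkpointRoot` parameter, and the console sink are hypothetical choices for illustration, not taken from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.types.StructType

// Hypothetical helper, used only for this sketch.
object MultiCsvStreams {
  // Starts one query per (queryName -> inputDir) entry.
  // Every query has a distinct name and a distinct checkpoint directory.
  def startAll(
      spark: SparkSession,
      schema: StructType,
      sources: Map[String, String],
      checkpointRoot: String): Seq[StreamingQuery] =
    sources.toSeq.map { case (name, inputDir) =>
      spark.readStream
        .option("header", "true")
        .option("sep", ",")
        .schema(schema) // streaming file sources need an explicit schema by default
        .csv(inputDir)
        .writeStream
        .queryName(name) // must be unique among the active queries of this session
        .format("console")
        .option("checkpointLocation", s"$checkpointRoot/$name")
        .start()
    }
}
```

After starting the queries, calling spark.streams.awaitAnyTermination() keeps the driver alive until one of them stops.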
I'm reading Parquet files, converting them to JSON, and sending the result to Kafka. The problem is that the job reads all the Parquet data and sends it to Kafka in one go, but I want to send the JSON data line by line or in batches:
object WriteParquet2Kafka {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession
.appName("Write Parquet to Kafka")
import spark.implicits._
val ds: DataFrame = spark.readStream
val df: DataFrame = ds.select($"vin" as "key", to_json( struct( ds.columns.map(col(_)):_* ) ) as "value" )
.filter($"key" isNotNull)
val ddf = df
.option("topic", topics)
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/test")
.trigger(Trigger.ProcessingTime("10 seconds"))
Is it possible to do this?
I finally figured out how to solve my problem: just add an option and set a suitable value for maxFilesPerTrigger:
val df: DataFrame = spark
.option("maxFilesPerTrigger", 1)
Note: maxFilesPerTrigger must be set to 1, so that each Parquet file is read in its own micro-batch.
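For reference, a hedged sketch of a file-based streaming read that limits how many new files enter each micro-batch. The object name and the `schema`, `inputDir`, and `filesPerBatch` parameters are placeholders introduced here; the original post does not show its input path or schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical helper, used only for this sketch.
object ParquetFileStream {
  // Streams a directory of Parquet files, admitting at most `filesPerBatch` new files per micro-batch.
  def read(
      spark: SparkSession,
      schema: StructType,
      inputDir: String,
      filesPerBatch: Long = 1L): DataFrame =
    spark.readStream
      .schema(schema) // streaming file sources need an explicit schema by default
      .option("maxFilesPerTrigger", filesPerBatch)
      .parquet(inputDir)
}
```

The option caps files per trigger, not rows, so the size of each batch sent to Kafka still depends on how large the individual files are.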
I have a simple stream that reads some data from a Kafka topic:
val ds = spark
.option("kafka.bootstrap.servers", "host1:port1")
.option("subscribe", "topic1")
.option("startingOffsets", "earliest")
val df = ds.selectExpr("cast (value as string) as json")
.select(from_json($"json", schema).as("data"))
I want to store this data in S3 based on the day it's received, so something like:
When I want to write the data I do:
.option("path", s3_path)
But this way I can only specify one path. Is there a way to change the S3 path dynamically based on the date?
Use the partitionBy clause:
import org.apache.spark.sql.functions._
dayofmonth(current_date()) as "day",
month(current_date()) as "month",
year(current_date()) as "year",
.partitionBy("year", "month", "day")
... // all other options
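The fragments above can be assembled roughly as follows. This is a sketch under assumptions: the Parquet format, the object name, and the `outputPath` and `checkpointPath` parameters are illustrative and do not come from the original answer.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{current_date, dayofmonth, month, year}
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper, used only for this sketch.
object DailyPartitionedSink {
  // Adds year/month/day columns from the processing date and partitions the output by them.
  def start(df: DataFrame, outputPath: String, checkpointPath: String): StreamingQuery =
    df.withColumn("year", year(current_date()))
      .withColumn("month", month(current_date()))
      .withColumn("day", dayofmonth(current_date()))
      .writeStream
      .format("parquet") // assumed output format
      .partitionBy("year", "month", "day")
      .option("path", outputPath)
      .option("checkpointLocation", checkpointPath)
      .start()
}
```

Spark writes partitioned output into nested key=value directories below the single configured path, for example year=.../month=.../day=..., which gives the per-day layout without changing the path itself.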
I'm trying to read messages from Kafka (version 10) in Spark and print them.
import spark.implicits._
val spark = SparkSession
.config("spark.master", "local")
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA")
I'm getting this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
You are branching the query plan: from the same ds1 you are trying to:
But you are only calling .start() on the second branch, leaving the other dangling without a termination, which in turn throws the exception you are getting back.
The solution is to start both branches and await termination.
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA")
val query1 = ds1.collect.foreach(println)
val query2 = ds1.writeStream
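Note that calling collect on a streaming Dataset is itself one of the operations that raises this exception, so printing has to go through a sink. A minimal sketch, assuming the goal is only to print records: the object and method names are hypothetical, and `stream` is expected to carry Kafka-style `key` and `value` columns.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper, used only for this sketch.
object ConsoleDebug {
  // Prints every micro-batch through the console sink instead of collect/foreach(println).
  def printStream(stream: DataFrame): StreamingQuery =
    stream
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("console")
      .outputMode("append")
      .start()
}
```

A caller would pass the DataFrame returned by the Kafka readStream and then call awaitTermination() on the returned query so the driver does not exit.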
I struggled a lot with this issue and tried every solution suggested on various blogs.
But in my case there were a few statements between calling start() on the query and finally calling awaitTermination(), and that caused this.
Please try it this way; it works perfectly for me.
Working example:
val query = df.writeStream
If you write it this way, it will cause the exception:
val query = df.writeStream
// some statement
// some statement
This will throw the given exception and shut down your streaming driver.
I fixed the issue by using the following code.
val df = session
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", "streamTest2")
val query = df.writeStream
Remove ds1.collect.foreach(println) and ds1.printSchema(), then use outputMode and awaitAnyTermination, which waits in the background until any of the queries on the associated spark.streams has terminated.
val spark = SparkSession
.config("spark.master", "local[*]")
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA")
.load()
val consoleOutput1 = ds1.writeStream
I was able to resolve this issue with the following code. In my scenario, I had multiple intermediate DataFrames, which were basically transformations of inputDF.
val query = joinedDF
.option("truncate", "false")
joinedDF is the result of the last transformation performed.
When the following snippet executes:
.foreachRDD(rdd => {
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
val dataFrame = rdd.toDF();
val countsDf = dataFrame.groupBy($"action", window($"time", "1 hour")).count()
val query = countsDf.write.mode("append").jdbc(url, "stats_table", prop)
This error happens: java.lang.IllegalArgumentException: Can't get JDBC type for struct<start:timestamp,end:timestamp>
How would one go about saving the output of org.apache.spark.sql.functions.window() function to a MySQL DB?
I ran into the same problem using Spark SQL:
val query3 = dataFrame
.groupBy(org.apache.spark.sql.functions.window($"timeStamp", "10 minutes"), $"data")
.option("checkpointLocation", "file:///tmp/spark-checkpoint1")
.option("table", "temp")
I solved it by adding a user-defined function:
val toString = udf{(window:GenericRowWithSchema) => window.mkString("-")}
A String works for me, but you can change the function to suit your needs; you could even have two functions that return start and end separately.
My query changed to:
val query3 = dataFrame
.groupBy(org.apache.spark.sql.functions.window($"timeStamp", "10 minutes"), $"data")
.option("checkpointLocation", "file:///tmp/spark-checkpoint1")
.option("table", "temp")
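As an alternative to a UDF, the two fields of the window struct can be selected directly, which yields plain timestamp columns that JDBC can map. This is a sketch, not the answerer's method: the object name and the output column names `window_start` and `window_end` are chosen here for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, used only for this sketch.
object WindowColumns {
  // Replaces the struct column `window` with two timestamp columns.
  def flatten(counts: DataFrame): DataFrame =
    counts
      .withColumn("window_start", col("window.start"))
      .withColumn("window_end", col("window.end"))
      .drop("window")
}
```

Applied to the grouped result before the write, for example WindowColumns.flatten(countsDf), this keeps start and end as real timestamps in the database instead of a concatenated string.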