Checkpoint for many streaming source - scala

im working with zeppelin ,I read many files from many source in spark streaming like this :
val var1 = spark
.readStream
.schema(var1_raw)
.option("sep", ",")
.option("mode", "PERMISSIVE")
.option("maxFilesPerTrigger", 100)
.option("treatEmptyValuesAsNulls", "true")
.option("newFilesOnly", "true")
.csv(path_var1 )
val chekpoint_var1 = var1
.writeStream
.format("csv")
.option("checkpointLocation", path_checkpoint_var1)
.option("Path",path_checkpoint )
.option("header", true)
.outputMode("Append")
.queryName("var1_backup")
.start().awaitTermination()
val var2 = spark
.readStream
.schema(var2_raw)
.option("sep", ",")
.option("mode", "PERMISSIVE") //
.option("maxFilesPerTrigger", 100)
.option("treatEmptyValuesAsNulls", "true")
.option("newFilesOnly", "true")
.csv(path_var2 )
val chekpoint_var2 = var2
.writeStream
.format("csv")
.option("checkpointLocation", path_checkpoint_var2) //
.option("path",path_checkpoint_2 )
.option("header", true)
.outputMode("Append")
.queryName("var2_backup")
.start().awaitTermination()
when i re run the job i got this message :
java.lang.IllegalArgumentException: Cannot start query with name var1_backup as a query with that name is already active
*****************the solution*******************
val spark = SparkSession
.builder
.appName("test")
.config("spark.local", "local[*]")
.getOrCreate()
spark.sparkContext.setCheckpointDir(path_checkpoint)
and after i call the checkpoint function on the dataframe

*****************the solution*******************
val spark = SparkSession
.builder
.appName("test")
.config("spark.local", "local[*]")
.getOrCreate()
spark.sparkContext.setCheckpointDir(path_checkpoint)
and after i call the checkpoint function on the dataframe

Related

How to ingest data from two producers in kafka and join using spark structured streaming?

I am trying to read data from two kafka topics, but I am unable to join and find teh final dataframe.
My kafka topics are CSVStreamRetail and OrderItems.
val spark = SparkSession
.builder
.appName("Spark-Stream-Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
val ordersSchema = new StructType()
.add("order_id", IntegerType)
.add("order_date", StringType)
.add("order_customer_id", IntegerType)
.add("order_status", StringType)
val orderItemsSchema = new StructType()
.add("order_item_id",IntegerType)
.add("order_item_order_id",IntegerType)
.add("order_item_product_id",IntegerType)
.add("order_item_quantity",IntegerType)
.add("order_item_subtotal",DoubleType)
.add("order_item_product_price", DoubleType)
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "CSVStreamRetail")
.load()
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "OrderItems")
.load()
val ordersDF = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value", ordersSchema).as("orders_data"),$"timestamp")
.select("orders_data.*","timestamp")
val orderItemsDF = df2.selectExpr("CAST(value as STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value",orderItemsSchema).as("order_items_data"),$"timestamp")
.select("order_items_data.*","timestamp")
val finalDF = orderItemsDF.join(ordersDF, orderItemsDF("order_item_order_id")===ordersDF("order_id"))
finalDF
.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
The output I am receiving is an empty dataframe.
First of all please check whether you are receiving data in your kafka topics.
You should always provide watermarking at least in one stream in case of a stream-stream join. I see you want to perform an inner join.
So I have added 200 seconds watermarking and now it is showing data in the output dataframe.
val spark = SparkSession
.builder
.appName("Spark-Stream-Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
val ordersSchema = new StructType()
.add("order_id", IntegerType)
.add("order_date", StringType)
.add("order_customer_id", IntegerType)
.add("order_status", StringType)
val orderItemsSchema = new StructType()
.add("order_item_id",IntegerType)
.add("order_item_order_id",IntegerType)
.add("order_item_product_id",IntegerType)
.add("order_item_quantity",IntegerType)
.add("order_item_subtotal",DoubleType)
.add("order_item_product_price", DoubleType)
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "CSVStreamRetail")
.load()
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "OrderItems")
.load()
val ordersDF = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value", ordersSchema).as("orders_data"),$"timestamp")
.select("orders_data.*","timestamp")
.withWatermark("timestamp","200 seconds")
val orderItemsDF = df2.selectExpr("CAST(value as STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value",orderItemsSchema).as("order_items_data"),$"timestamp")
.select("order_items_data.*","timestamp")
.withWatermark("timestamp","200 seconds")
val finalDF = orderItemsDF.join(ordersDF, orderItemsDF("order_item_order_id")===ordersDF("order_id"))
finalDF
.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
Use the eventTimestamp for joining.
Let me know if this helps.

Except and ExceptAll functions for apache spark's dataset are giving an empty dataframe during streaming

Except and ExceptAll function for apache spark's dataset are giving empty dataframe during streaming.
I am using two datasets, both are batch,
the left-one has few rows that are not present in the right-one.
Upon running it gives correct output i.e. rows that are in the left-one but not in the right one.
Now I am repeating the same in streaming, the left-one is streaming while the right-one is batch source, in this scenario, I am getting an empty dataframe upon which dataframe.writeStream throws an exception None.get error.
package exceptPOC
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
object exceptPOC {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.setAppName("pocExcept")
.setMaster("local")
.set("spark.driver.host", "localhost")
val sparkSession =
SparkSession.builder().config(sparkConf).getOrCreate()
val schema = new StructType()
.add("sepal_length", DoubleType, true)
.add("sepal_width", DoubleType, true)
.add("petal_length", DoubleType, true)
.add("petal_width", DoubleType, true)
.add("species", StringType, true)
var df1 = sparkSession.readStream
.option("header", "true")
.option("inferSchema", "true")
.option("treatEmptyValuesAsNulls", "false")
.option("delimiter", ",")
.schema(schema)
.csv(
"some/path/exceptPOC/streamIris" //contains iris1
)
var df2 = sparkSession.read
.option("header", "true")
.option("inferSchema", "true")
.option("treatEmptyValuesAsNulls", "false")
.option("delimiter", ",")
.schema(schema)
.csv(
"/some/path/exceptPOC/iris2.csv"
)
val exceptDF = df1.except(df2)
exceptDF.writeStream
.format("console")
.start()
.awaitTermination()
}
}

OneHotEncoder with Streaming Dataframe

I would like to apply OneHotEncoder to multiple Columns in my Streaming Dataframe, but I've got the following error.
Any suggestions?
Many thanks!
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources
must be executed with writeStream.start();;
CODE:
// Read csv
val Stream = spark.read
.format("csv")
.option("header", "true")
.option("delimiter", ";")
.option("header", "true")
.schema(DFschema)
.load("C:/[...]"/
// Kafka
val properties = new Properties()
//val topic = "mongotest"
properties.put("bootstrap.servers", "localhost:9092")
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
Stream.selectExpr("CAST(Col AS STRING) AS KEY",
"to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.option("topic", "predict")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:[...]")
.start()
Subscribe to topic
val lines = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "predict")
.load()
val df = Stream
.selectExpr("CAST(value AS STRING)")
val jsons = df.select(from_json($"value", DFschema) as "data").select("data.*")
ETL[...]
Apply funcion Bucketizer() to field
val Msplits = Array(Double.NegativeInfinity,7, 14, 21, Double.PositiveInfinity)
val bucketizerM = new Bucketizer()
.setInputCol("MEASURE")
.setOutputCol("MEASURE_c")
.setSplits(Msplits)
val bucketedData1 = bucketizerD.transform(out)
val bucketedData2 = bucketizerM.transform(bucketedData1) # Works
Error using OneHotEncoder()
val indexer = new StringIndexer()
.setInputCol("CODE")
.setOutputCol("CODE_index")
val encoder = new OneHotEncoder()
.setInputCol("CODE")
.setOutputCol("CODE_encoded")
val vectorAssembler = new VectorAssembler()
.setInputCols(Array("A","B", "CODE_encoded"))
.setOutputCol("features")
val transformationPipeline = new Pipeline()
.setStages(Array(indexer, encoder, vectorAssembler))
val fittedPipeline = transformationPipeline.fit(bucketedData2) # Does't work

How to checkpoint many source of spark streaming

I have many CSV spark.readStream in a different locations, I have to checkpoint all of them with scala, I specified a query for every stream but when I run the job, I got this message
java.lang.IllegalArgumentException: Cannot start query with name "query1" as a query with that name is already active
I solved my problem by creating a many streaming query like this :
val spark = SparkSession
.builder
.appName("test")
.config("spark.local", "local[*]")
.getOrCreate()
spark.sparkContext.setCheckpointDir(path_checkpoint)
val event1 = spark
.readStream //
.schema(schema_a)
.option("header", "true")
.option("sep", ",")
.csv(path_a)
val query = event1.writeStream
.outputMode("append")
.format("console")
.start()
spark.streams.awaitAnyTermination()
val spark = SparkSession
.builder
.appName("test")
.config("spark.local", "local[*]")
.getOrCreate()
spark.sparkContext.setCheckpointDir(path_checkpoint)
val event1 = spark
.readStream //
.schema(schema_a)
.option("header", "true")
.option("sep", ",")
.csv(path_a)
val query = event1.writeStream
.outputMode("append")
.format("console")
.start()
spark.streams.awaitAnyTermination()

Why does streaming query not write any data to HDFS?

I'm using Spark Structured Streaming with Spark 2.3.1 and below is my code:
val sparkSession = SparkSession
.builder
.appName("xxx")
.config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
.config("spark.rpc.netty.dispatcher.numThreads", "2")
.config("spark.shuffle.compress", "true")
.config("spark.rdd.compress", "true")
.config("spark.sql.inMemoryColumnarStorage.compressed", "true")
.config("spark.io.compression.codec", "snappy")
.config("spark.broadcast.compress", "true")
.config("spark.sql.hive.thriftServer.singleSession", "true")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.streaming.receiver.writeAheadLog.enable","true")
.enableHiveSupport()
.getOrCreate()
val rawStreamDF = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", <value>)
.option("subscribe", <value>)
.option("key.serializer", <value>)
.option("value.serializer", <value>)
.option("startingOffsets", "earliest")
.option("auto.offset.reset",earliest)
.option("group.id", <value>)
.option("fetchOffset.numRetries", 3)
.option("fetchOffset.retryIntervalMs", 10)
.option("IncludeTimestamp", true)
.option("enable.auto.commit", <value>)
.option("security.protocol", <value>)
.option("ssl.keystore.location", <value>)
.option("ssl.keystore.password", <value>)
.option("ssl.truststore.location", <value>)
.option("ssl.truststore.password", <value>)
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
I'm trying to write the data to a file in the hdfs_path:
val query = rawStreamDF
.writeStream
.format("json")
.option("startingOffsets", "latest")
.option("path", "STREAM_DATA_PATH")
.option("checkpointLocation", "checkpointPath")
.trigger(Trigger.ProcessingTime("5 seconds"))
.outputMode("append")
.start
Logger.log.info("Status:"+query.status)
print("Streaming Status1:"+query.status)
query.awaitTermination(450)
But, I'm getting query.status value as below:
Status:{ "message" : "Initializing sources", "isDataAvailable" : false, "isTriggerActive" : false }
Could you let me know where I'm going wrong?
But, I'm getting query.status value as below.
Status:{ "message" : "Initializing sources", "isDataAvailable" :false, "isTriggerActive" : false }
Could you let me know where I'm going wrong?
All seems fine. The streaming engine of Spark Structured Streaming didn't seem to start the query yet, but just mark it as to be started on a separate thread.
If you created a separate thread for monitoring the structured query, you'd notice the status would change right after processing the very first batch.
Consult the official documentation in Structured Streaming Programming Guide.