Not able to write Data in Parquet File using Spark Structured Streaming - scala

I have a Spark Structured Streaming:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.option("subscribe", "topic")
.load()
I want to write data to FileSystem using DataStreamWriter,
val query = df
.writeStream
.outputMode("append")
.format("parquet")
.start("data")
But zero files are getting created in data folder. Only _spark_metadata is getting created.
However, I can see the data on console when format is console:
val query = df
.writeStream
.outputMode("append")
.format("console")
.start()
+--------------------+------------------+------------------+
| time| col1| col2|
+--------------------+------------------+------------------+
|49368-05-11 20:42...|0.9166470338147503|0.5576946794171861|
+--------------------+------------------+------------------+
I cannot understand the reason behind it.
Spark - 2.1.0

I had a similar problem but for different reasons, posting here in case someone has the same issue. When writing your output stream to file in append mode with watermarking, structured streaming has an interesting behavior where it won't actually write any data until a time bucket is older than the watermark time. If you're testing structured streaming and have an hour long water mark, you won't see any output for at least an hour.

I resolved this issue. Actually when I tried to run the Structured Streaming on spark-shell, then it gave an error that endingOffsets are not valid in streaming queries, i.e.,:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.option("subscribe", "topic")
.load()
java.lang.IllegalArgumentException: ending offset not valid in streaming queries
at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:374)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:373)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:373)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:60)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:199)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
... 48 elided
So, I removed endingOffsets from streaming query.
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("subscribe", "topic")
.load()
Then I tried to save streaming queries' result in Parquet files, during which I came to know that - checkpoint location must be specified, i.e.,:
val query = df
.writeStream
.outputMode("append")
.format("parquet")
.start("data")
org.apache.spark.sql.AnalysisException: checkpointLocation must be specified either through option("checkpointLocation", ...) or SparkSession.conf.set("spark.sql.streaming.checkpointLocation", ...);
at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:207)
at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:204)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:203)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:206)
... 48 elided
So, I added checkPointLocation:
val query = df
.writeStream
.outputMode("append")
.format("parquet")
.option("checkpointLocation", "checkpoint")
.start("data")
After doing these modifications, I was able to save streaming queries' results in Parquet files.
But, it is strange that when I ran the same code via sbt application, it didn't threw any errors, but when I ran the same code via spark-shell these errors were thrown. I think Apache Spark should throw these errors when run via sbt/maven app too. It is seems to be a bug to me !

Related

Where to write HDFS data such they can be read with HIVE

Given I write HDFS with apache spark like this:
var df = spark.readStream
.format("kafka")
//.option("kafka.bootstrap.servers", "kafka1:19092")
.option("kafka.bootstrap.servers", "localhost:29092")
.option("subscribe", "my_event")
.option("includeHeaders", "true")
.option("startingOffsets", "earliest")
.load()
df = df.selectExpr("CAST(topic AS STRING)", "CAST(partition AS STRING)", "CAST(offset AS STRING)", "CAST(value AS STRING)")
val emp_schema = new StructType()
.add("id", StringType, true)
.add("timestamp", TimestampType, true)
df = df.select(
functions.col("topic"),
functions.col("partition"),
functions.col("offset"),
functions.from_json(functions.col("value"), emp_schema).alias("data"))
df = df.select("topic", "partition", "offset", "data.*")
val query = df.writeStream
.format("csv")
.option("path", "hdfs://172.30.0.5:8020/test")
.option("checkpointLocation", "checkpoint")
.start()
query.awaitTermination()
Here hdfs://172.30.0.5:8020 is the namenode. It seems this spark program is writing data successfully to the nameode.
How can I query this data from hive? Do I have to write the data into a special folder that hive can see it? Must I define a database for this folder? And how is this done? Where is the location of test then on the file-system?
Where is the location of test then on the file-system?
It's at /test
Note: if you properly configure fs.defaultFS in the core-site.xml, then you don't need to specify the full namenode address.
Do I have to write the data into a special folder that hive can see it?
You can, and that would be easiest, but the docs cover both options of "managed" (a dedicated HDFS location) and "external" (any other directory, with other restrictions) Hive tables
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
How can I query this data from hive?
See above link.
FWIW, Confluent has a Kafka Connector that can write data to HDFS and create Hive tables

Read from Kafka topic process the data and write back to Kafka topic using scala and spark

Hi Im reading froma kafka topic and i want to process the data received from kafka such as tockenization, filtering out unncessary data, removing stop words and finally I want to write back to another Kafka topic
// read from kafka
val readStream = existingSparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
.load()
val df = readStream.selectExpr("CAST(value AS STRING)" )
df.show(false)
val df_json = df.select(from_json(col("value"), mySchema.defineSchema()).alias("parsed_value"))
val df_text = df_json.withColumn("text", col("parsed_value.payload.Text"))
// perform some data processing actions such as tokenization etc and return cleanedDataframe as the final result
// write back to kafka
val writeStream = cleanedDataframe
.writeStream
.outputMode("append")
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("topic", "writing.val")
.start()
writeStream.awaitTermination()
Then I am getting the below error
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Queries with streaming sources must be executed with
writeStream.start();;
Then I have edited my code as follows to read from kafka and write into console
// read from kafka
val readStream = existingSparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
.load()
// write to console
val df = readStream.selectExpr("CAST(value AS STRING)" )
val query = df.writeStream
.outputMode("append")
.format("console")
.start().awaitTermination();
// then perform the data processing part as mentioned in the first half
With the second method, continuously data was displaying in the console but it never run through data processing part. Can I know how can I read from a kafka topic and then perform some actions ( tokenization, removing stop words) on the received data and finally writing back to a new kafka topic?
EDIT
Stack Trace is pointing at df.show(false) in the above code during the error
There are two common problems in your current implementation:
Apply show in a streaming context
Code after awaitTermination will not be executed
To 1.
The method show is an action (as opposed to a tranformation) on a dataframe. As you are dealing with streaming dataframes this will cause an error as streaming queries need to be started with start (just as the Excpetion text is telling you).
To 2.
The method awaitTermination is a blocking method which means that subsequent code will not be executed in each micro-batch.
Overall Solution
If you want to read and write to Kafka and in-between want to understand what data is being processed by showing the data in the console you can do the following:
// read from kafka
val readStream = existingSparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
.load()
// write to console
val df = readStream.selectExpr("CAST(value AS STRING)" )
df.writeStream
.outputMode("append")
.format("console")
.start()
val df_json = df.select(from_json(col("value"), mySchema.defineSchema()).alias("parsed_value"))
val df_text = df_json.withColumn("text", col("parsed_value.payload.Text"))
// perform some data processing actions such as tokenization etc and return cleanedDataframe as the final result
// write back to kafka
// the columns `key` and `value` of the DataFrame `cleanedDataframe` will be used for producing the message into the Kafka topic.
val writeStreamKafka = cleanedDataframe
.writeStream
.outputMode("append")
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("topic", "writing.val")
.start()
existingSparkSession.awaitAnyTermination()
Note the existingSparkSession.awaitAnyTermination() at the very end of the code without using awaitTermination directly after the start. Also, remember that the columns key and value of the DataFrame cleanedDataframe will be used for producing the message into the Kafka topic. However, a column key is not required, see also here
In addition, in case you are using checkpointing (recommended) then you need to have two different locations set: one for the console stream and the other one for the kafka output stream. It is important to keep in mind that those the streaming queries run independently.

Stream from kafka to kafka with Spark Structured Streaming in scala

I am trying to run simple variations of examples from official spark tutorial and a book "spark streaming in action".
The content of exceptions are strange. What is wrong with my code?
First of all I start kafka zookeeper, server, producer and 2 consumers. Then I run following code:
// read from kafka
val df = sparkService.sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", topic1)
.load()
// write to kafka
import sparkService.sparkSession.implicits._
val query = df.selectExpr("CAST(key as STRING)", "CAST(value as STRING)")
.writeStream
.outputMode(OutputMode.Append())
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", topic2)
.option("checkpointLocation", "/home/pt/Dokumenty/tmp/")
.option("failOnDataLoss", "false") // only when testing
.start()
query.awaitTermination(30000)
Error occurs on writting to kafka:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got 1
1609627750463

Can we fetch data from Kafka from specific offset in spark structured streaming batch mode

In kafka I get new topics dynamically and I have to process it using spark streaming from a specific offset. Is there a possibility to pass the json value from a variable. For example consider the below code
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.load()
In this I want to dynamically update value for startingOffsets... I tried to pass the value in string and called it but it did not work... If I am giving the same value in startingOffsets it is working. How to use a variable in this scenario?
val start_offset= """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}"""
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", start_offset)
.load()
java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}"""
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[*]").setAppName("ReadSpecificOffsetFromKafka");
val spark = SparkSession.builder().config(conf).getOrCreate();
spark.sparkContext.setLogLevel("error");
import spark.implicits._;
val start_offset = """{"first_topic" : {"0" : 15, "1": -2, "2": 6}}"""
val fromKafka = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092, localhost:9093")
.option("subscribe", "first_topic")
// .option("startingOffsets", "earliest")
.option("startingOffsets", start_offset)
.load();
val selectedValues = fromKafka.selectExpr("cast(value as string)", "cast(partition as integer)");
selectedValues.writeStream
.format("console")
.outputMode("append")
// .trigger(Trigger.Continuous("3 seconds"))
.start()
.awaitTermination();
}
This is the exact code to fetch specific offset from kafka using spark structured streaming and scala
Looks like your job is check pointing the Kafka offsets onto some
persistent storage. Try cleaning those. and Re run your Job.
Also try renaming your job and running it.
Spark can read the stream via readStream. So try with an offset displayed in the error message to get rid of the error.
spark
.readStream
.format("kafka")
.option("subscribePattern", "topic.*")

Queries with streaming sources must be executed with writeStream.start();

I'm trying to read the messages from kafka (version 10) in spark and trying to print it.
import spark.implicits._
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.config("spark.master", "local")
.getOrCreate()
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA")
.load()
ds1.collect.foreach(println)
ds1.writeStream
.format("console")
.start()
ds1.printSchema()
getting an error Exception in thread "main"
org.apache.spark.sql.AnalysisException: Queries with streaming sources
must be executed with writeStream.start();;
You are branching the query plan: from the same ds1 you are trying to:
ds1.collect.foreach(...)
ds1.writeStream.format(...){...}
But you are only calling .start() on the second branch, leaving the other dangling without a termination, which in turn throws the exception you are getting back.
The solution is to start both branches and await termination.
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA")
.load()
val query1 = ds1.collect.foreach(println)
.writeStream
.format("console")
.start()
val query2 = ds1.writeStream
.format("console")
.start()
ds1.printSchema()
query1.awaitTermination()
query2.awaitTermination()
I struggled a lot with this issue. I tried each of suggested solution from various blog.
But I my case there are few statement in between calling start() on query and finally at last i was calling awaitTerminate() function that cause this.
Please try in this fashion, It is perfectly working for me.
Working example:
val query = df.writeStream
.outputMode("append")
.format("console")
.start().awaitTermination();
If you write in this way that will cause exception/ error:
val query = df.writeStream
.outputMode("append")
.format("console")
.start()
// some statement
// some statement
query.awaitTermination();
will throw given exception and will close your streaming driver.
i fixed issue by using following code.
val df = session
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", "streamTest2")
.load();
val query = df.writeStream
.outputMode("append")
.format("console")
.start()
query.awaitTermination()
Kindly remove ds1.collect.foreach(println) and ds1.printSchema() , use outputMode and awaitAnyTermination for background process Waiting until any of the queries on the associated spark.streams has terminated
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.config("spark.master", "local[*]")
.getOrCreate()
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA") .load()
val consoleOutput1 = ds1.writeStream
.outputMode("update")
.format("console")
.start()
spark.streams.awaitAnyTermination()
|key|value|topic|partition|offset|
+---+-----+-----+---------+------+
+---+-----+-----+---------+------+
I was able to resolves this issue by following code. In my scenario, I had multiple intermediate Dataframes, which were basically the transformations made on the inputDF.
val query = joinedDF
.writeStream
.format("console")
.option("truncate", "false")
.outputMode(OutputMode.Complete())
.start()
.awaitTermination()
joinedDF is the result of the last transformation performed.