Spark / Kafka Streaming : write a single file per hour - scala

I am trying to write a single Parquet file per hour from data coming from Kafka, but instead Spark writes a bunch of files.
Here is my code:
kafkaStreamDf
  .select(col("topic"), col("timestamp"), date_format(current_timestamp(), "HHddMMyyyy").alias("date"))
  .repartition(col("date"))
  .writeStream
  .format("parquet")
  .outputMode("append")
  .option("path", "/usr/data/kafkaStream")
  .option("checkpointLocation", "/tmp/checkpoint")
  .partitionBy("date")
  .start()
  .awaitTermination()
All the data is stored under /usr/data/kafkaStream/date=HHddMMyyyy, as I wanted.
But I don't get why Spark wrote the output data into multiple files, even though the DataFrame is partitioned by "date" and repartitioned on the same column. I'm using Spark 3.0.1.
What am I doing wrong?
Thanks a lot for the help.
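For context, Structured Streaming writes at least one new file per output partition on every micro-batch, so repartitioning by "date" alone cannot yield a single hourly file. Below is a minimal sketch of one common workaround, assuming a single output partition per trigger is acceptable: coalesce to one partition and trigger roughly once per hour.
import org.apache.spark.sql.functions.{col, current_timestamp, date_format}
import org.apache.spark.sql.streaming.Trigger

// Sketch only: coalesce(1) forces a single write task per micro-batch, and the
// hourly trigger means roughly one micro-batch (hence one file) per hour and
// per "date" directory. Paths and column names are taken from the question.
kafkaStreamDf
  .select(col("topic"), col("timestamp"), date_format(current_timestamp(), "HHddMMyyyy").alias("date"))
  .coalesce(1)
  .writeStream
  .format("parquet")
  .outputMode("append")
  .option("path", "/usr/data/kafkaStream")
  .option("checkpointLocation", "/tmp/checkpoint")
  .partitionBy("date")
  .trigger(Trigger.ProcessingTime("1 hour"))
  .start()
  .awaitTermination()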

Related

PySpark structured streaming trigger=availableNow gets stuck on occasion

I have several tasks of streaming tables in pyspark (running on Databricks).
For the most part, it looks something like this:
stream = (
    spark
    .readStream
    .option("maxFilesPerTrigger", 20)
    .table("my_table")
)
transformed_df = stream.transform(some_function)
_ = (
    transformed_df
    .writeStream
    .trigger(availableNow=True)
    .outputMode("append")
    .option("checkpointLocation", "/path/to/_checkpoints/output_table/input_table/")
    .foreachBatch(lambda df, epochId: batch_writer(df, epochId, "output_table"))
    .start()
    .awaitTermination()
)
This works as expected and I can see each batch commit and offset in the checkpoints path. The problem is that, on occasion, the stream writer just carries on (with no new data coming in) and I can see the commits and offsets go into the thousands. If I stop the job and restart it, it does not happen again, but then it happens again later.
I am running PySpark on Databricks (Apache Spark 3.2.1, Scala 2.12).
Should I do a check like if len(transformed_df.take(1)) > 0: # write stream? I suspect this will almost always be true, and the stream is only evaluated at write time.
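For what it's worth, a take(1) check cannot run on the streaming DataFrame itself; it would have to live inside the foreachBatch function, where each micro-batch is a plain DataFrame. Here is a minimal Scala sketch of that idea; batchWriter, the target table name, and saveAsTable are hypothetical stand-ins for the batch_writer above.
import org.apache.spark.sql.DataFrame

// Hypothetical Scala counterpart of batch_writer from the question.
def batchWriter(batchDf: DataFrame, epochId: Long, table: String): Unit = {
  // Guard against empty micro-batches here, per batch, instead of probing
  // the streaming DataFrame up front.
  if (!batchDf.isEmpty) {
    batchDf.write.mode("append").saveAsTable(table)
  }
}

// An explicitly typed function value avoids the foreachBatch overload ambiguity in Scala 2.12.
val writeBatch: (DataFrame, Long) => Unit =
  (df, epochId) => batchWriter(df, epochId, "output_table")

transformedDf.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/path/to/_checkpoints/output_table/input_table/")
  .foreachBatch(writeBatch)
  .start()
  .awaitTermination()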

Spark Structured Streaming groupByKey on a time Window not working

I need to batch my Kafka stream into time windows of 10 minutes each and then run some batch processing on it.
Note: records below have a timestamp field
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokerPool)
  .option("subscribe", topic)
  .option("startingOffsets", kafkaOffset)
  .load()
I add a time window to each record using:
.withColumn("window", window($"timing", windowDuration))
I created some helper classes like
case class TimingWindow(
  start: java.sql.Timestamp,
  end: java.sql.Timestamp
)
case class RecordWithWindow(
  record: MyRecord,
  groupingWindow: TimingWindow
)
Now I have a DF of type [RecordWithWindow]
All this works very well.
Next,
metricsWithWindow
  .groupByKey(_.groupingWindow)
  // By grouping, I get several records per time window,
  // resulting in an object of the type below, which I write out to HDFS:

case class WindowWithRecords(
  records: Seq[MyRecord],
  window: TimingWindow
)
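For reference, here is a sketch of how that grouping step might be completed; the mapGroups call is an assumption (the question elides the aggregation), and spark.implicits._ and case-class encoders for MyRecord are assumed to be in scope.
import org.apache.spark.sql.Dataset

// Sketch only: collect all records that share the same window into a single
// WindowWithRecords per key.
val windowed: Dataset[WindowWithRecords] = metricsWithWindow
  .groupByKey(_.groupingWindow)
  .mapGroups { (window, records) =>
    WindowWithRecords(records.map(_.record).toSeq, window)
  }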
When I examine HDFS, here is what I see.
Expected: each WindowWithRecords object having a unique TimingWindow
WindowWithRecordsA(TimingWindowA, Seq(MyRecordA, MyRecordB, MyRecordC))
Actual: more than one WindowWithRecords object with the same TimingWindow
WindowWithRecordsA(TimingWindowA, Seq(MyRecordA, MyRecordB))
WindowWithRecordsB(TimingWindowA, Seq(MyRecordC))
Looks like the groupByKey logic is not working well.
I hope my question is clear. Any pointers would be helpful.
Found the problem:
I was not using an explicit trigger when processing the window. As a result, Spark was creating micro-batches as soon as it could, as opposed to doing so at the end of the window.
import org.apache.spark.sql.streaming.Trigger

streamingQuery
  .writeStream
  .trigger(Trigger.ProcessingTime(windowDuration))
  ...
  .start()
This was a result of my misunderstanding the Spark documentation.
Note: groupByKey uses the object's hashCode, so it is important to make sure that the hashCode of the key object is consistent.

writeStream of spark generates many small files

I am using Spark Structured Streaming (2.3) to write Parquet data to buckets in the cloud (Google Cloud Storage).
I am using the following function:
def writeStreaming(data: DataFrame, format: String, options: Map[String, String], partitions: List[String]): DataStreamWriter[Row] = {
  var dataStreamWrite = data.writeStream
    .format(format)
    .options(options)
    .trigger(Trigger.ProcessingTime("120 seconds"))
  if (!partitions.isEmpty)
    dataStreamWrite = dataStreamWrite.partitionBy(partitions: _*)
  dataStreamWrite
}
Unfortunately, with this approach, I am getting many small files.
I tried to use the trigger approach in order to avoid this, but this didn't work either. Do you have any idea how to handle this, please?
Thanks a lot
The reason you have many small files despite using a trigger can be that your DataFrame has many partitions. To reduce the Parquet output to one file per 2 minutes, you can coalesce to one partition before writing the Parquet files.
var dataStreamWrite = data
  .coalesce(1)
  .writeStream
  .format(format)
  .options(options)
  .trigger(Trigger.ProcessingTime("120 seconds"))
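Put back into the helper from the question, a sketch with coalesce(1) as the only change could look like this; the caller still has to call .start() (and usually .awaitTermination()) on the returned writer.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.streaming.{DataStreamWriter, Trigger}

// Same helper as in the question, with coalesce(1) added so each 120-second
// micro-batch writes a single Parquet file per output partition.
def writeStreaming(data: DataFrame, format: String, options: Map[String, String],
                   partitions: List[String]): DataStreamWriter[Row] = {
  var dataStreamWrite = data
    .coalesce(1)
    .writeStream
    .format(format)
    .options(options)
    .trigger(Trigger.ProcessingTime("120 seconds"))
  if (partitions.nonEmpty)
    dataStreamWrite = dataStreamWrite.partitionBy(partitions: _*)
  dataStreamWrite
}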

How to write a Dataset to Kafka topic?

I am using Spark 2.1.0 and Kafka 0.9.0.
I am trying to push the output of a batch Spark job to Kafka. The job is supposed to run every hour, but not as streaming.
While looking for an answer on the net, I could only find Kafka integration with Spark Streaming and nothing about integration with a batch job.
Does anyone know if such a thing is feasible?
Thanks
UPDATE:
As mentioned by user8371915, I tried to follow what was done in Writing the output of Batch Queries to Kafka.
I used a Spark shell:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
Here is the simple code that I tried:
val df = Seq(("Rey", "23"), ("John", "44")).toDF("key", "value")
val newdf = df.select(to_json(struct(df.columns.map(column): _*)).alias("value"))
newdf.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "alerts")
  .save()
But I get the error:
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:497)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 50 elided
Any idea what this is related to?
Thanks
tl;dr You use an outdated Spark version. Writes are enabled in 2.2 and later.
Out of the box you can use the Kafka SQL connector (the same one used with Structured Streaming). Include spark-sql-kafka in your dependencies.
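For example, with sbt (the version shown is an assumption; pick the one matching your Spark 2.2+ build):
// build.sbt: Kafka source/sink for Spark SQL and Structured Streaming
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"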
Convert your data to a DataFrame containing at least a value column of type StringType or BinaryType.
Write data to Kafka:
df
.write
.format("kafka")
.option("kafka.bootstrap.servers", server)
.save()
Follow Structured Streaming docs for details (starting with Writing the output of Batch Queries to Kafka).
If you have a DataFrame and you want to write it to a Kafka topic, you first need to convert the columns into a "value" column that contains the data in JSON format. In Scala it is:
import org.apache.spark.sql.functions._
val kafkaServer: String = "localhost:9092"
val topicSampleName: String = "kafkatopic"
df.select(to_json(struct("*")).as("value"))
.selectExpr("CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", kafkaServer)
.option("topic", topicSampleName)
.save()
For this error
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
I think you need to convert the message into a key-value pair. Your DataFrame should have a value column.
Let's say you have a DataFrame with student_id and scores.
df.show()
>> student_id | scores
1 | 99.00
2 | 98.00
Then you should modify your DataFrame to:
value
{"student_id":1,"score":99.00}
{"student_id":2,"score":98.00}
To convert, you can use code similar to this:
df.select(to_json(struct($"student_id",$"score")).alias("value"))
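A sketch of the full write for that example would then be as follows; the broker address and topic name are placeholders, and spark.implicits._ is assumed for the $ syntax.
import org.apache.spark.sql.functions.{struct, to_json}

// Placeholders: replace the broker address and topic name with your own.
df.select(to_json(struct($"student_id", $"score")).alias("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "students")
  .save()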

Why does my query fail with AnalysisException?

I am new to Spark streaming. I am trying Spark Structured Streaming with local CSV files and am getting the exception below while processing.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
This is my code.
val df = spark
.readStream
.format("csv")
.option("header", "false") // Use first line of all files as header
.option("delimiter", ":") // Specifying the delimiter of the input file
.schema(inputdata_schema) // Specifying the schema for the input file
.load("file:///home/Teju/Desktop/SparkInputFiles/*.csv")
val filterop = spark.sql("select tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID,first(rssi_weightage(RSSI)) as RSSI_Weight from my_table where RSSI > -127 group by tagShortID,Timestamp,ListenerShortID,rootOrgID,subOrgID order by Timestamp ASC")
val outStream = filterop.writeStream.outputMode("complete").format("console").start()
I created a cron job so that every 5 minutes I get one input CSV file. I am trying to parse it with Spark streaming.
(This is not a solution but more of a comment, but given its length it ended up here. I'm going to make it an answer eventually, right after I've collected enough information for the investigation.)
My guess is that you're doing something incorrect with df that you have not included in your question.
Since the error message is about a FileSource with the path below, and it is a streaming dataset, it must be df that's in play.
FileSource[file:///home/Teju/Desktop/SparkInputFiles/*.csv]
Given the other lines, I guess that you register the streaming dataset as a temporary table (i.e. my_table) that you then use in spark.sql to execute SQL and writeStream to the console.
df.createOrReplaceTempView("my_table")
If that's correct, the code you've included in the question is incomplete and does not show the reason for the error.
Add .writeStream.start to your df, as the Exception is telling you.
Read the docs for more detail.
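To make that concrete, here is a minimal sketch of the flow being described; the createOrReplaceTempView call is the piece assumed to be missing from the question, and inputdata_schema and the rssi_weightage UDF are assumed to be defined and registered as in the question.
// Register the streaming DataFrame so spark.sql can resolve my_table (assumed missing step).
df.createOrReplaceTempView("my_table")

// Run the SQL from the question against the streaming view...
val filterop = spark.sql("select tagShortID, Timestamp, ListenerShortID, rootOrgID, subOrgID, first(rssi_weightage(RSSI)) as RSSI_Weight from my_table where RSSI > -127 group by tagShortID, Timestamp, ListenerShortID, rootOrgID, subOrgID order by Timestamp ASC")

// ...and start it with writeStream.start(), as the exception demands.
val outStream = filterop.writeStream
  .outputMode("complete")
  .format("console")
  .start()

outStream.awaitTermination()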