Spark Dataframe.write() completed percentage - scala

I am trying to write a Dataframe to a file. As the data frame is quite large, I want to know what is the status of the write operation in terms of Progress percentage, because it continues execution for a good amount of time.
myDataFrame
.filter(myFilter)
.write
.json(ExportPath)
Is there any way to know the percentage of data written to file?
Or at least get the number of partitions that have completed individually?

For a quick manual check, you can check the processed amount of data in the Spark UI. For a more automated way of accessing the data, either the REST API or the Metrics library is helpful.

Related

Calling Trigger once in Databricks to process Kinesis Stream

I am looking a way to trigger my Databricks notebook once to process Kinesis Stream and using following pattern
import org.apache.spark.sql.streaming.Trigger
// Load your Streaming DataFrame
val sdf = spark.readStream.format("json").schema(my_schema).load("/in/path")
// Perform transformations and then write…
sdf.writeStream.trigger(Trigger.Once).format("delta").start("/out/path")
It looks like it's not possible with AWS Kinesis and that's what Databricks documentation suggest as well. My Question is what else can we do to Achieve that?
As you mentioned in the question the trigger once isn't supported for Kinesis.
But you can achieve what you need by adding into the picture the Kinesis Data Firehose that will write data from Kinesis into S3 bucket (you can select format that you need, like, Parquet, ORC, or just leave in JSON), and then you can point the streaming job to given bucket, and use Trigger.Once for it, as it's a normal streaming source (For efficiency it's better to use Auto Loader that is available on Databricks). Also, to have the costs under the control, you can setup retention policy for your S3 destination to remove or archive files after some period of time, like 1 week or month.
A workaround is to stop after X runs, without trigger. It'll guarantee a fix number of rows per run.
The only issue is that if you have millions of rows waiting in the queue you won't have the guarantee to process all of them
In scala you can add an event listener, in python count the number of batches.
from time import sleep
s = sdf.writeStream.format("delta").start("/out/path")
#by defaut keep spark.sql.streaming.numRecentProgressUpdates=100 in the list. Stop after 10 microbatch
#maxRecordsPerFetch is 10 000 by default, so we will consume a max value of 10x10 000= 100 000 messages per run
while len(s.recentProgress) < 10:
print("Batchs #:"+str(len(s.recentProgress)))
sleep(10)
s.stop()
You can have a more advanced logic counting the number of message processed per batch and stopping when the queue is empty (the throughput should lower once it's all consumed as you'll only get the "real-time" flow, not the history)

dask dataframe groupby resulting in one partition memory issue

I am reading in 64 compressed csv files (probably 70-80 GB) into one dask data frame then run groupby with aggregations.
The job never completed because appereantly the groupby creates a data frame with only one partition.
This post and this post already addressed this issue but focusing on the computational graph and not the memory issue you run into, when your resulting data frame is too large.
I tried a workaround with repartioning but the job still wont complete.
What am I doing wrong, will I have to use map_partition? This is very confusing as I expect Dask will take care of partitioning everything even after aggregation operations.
from dask.distributed import Client, progress
client = Client(n_workers=4, threads_per_worker=1, memory_limit='8GB',diagnostics_port=5000)
client
dask.config.set(scheduler='processes')
dB3 = dd.read_csv("boden/expansion*.csv", # read in parallel
blocksize=None, # 64 files
sep=',',
compression='gzip'
)
aggs = {
'boden': ['count','min']
}
dBSelect=dB3.groupby(['lng','lat']).agg(aggs).repartition(npartitions=64)
dBSelect=dBSelect.reset_index()
dBSelect.columns=['lng','lat','bodenCount','boden']
dBSelect=dBSelect.drop('bodenCount',axis=1)
with ProgressBar(dt=30): dBSelect.compute().to_parquet('boden/final/boden_final.parq',compression=None)
Most groupby aggregation outputs are small and fit easily in one partition. Clearly this is not the case in your situation.
To resolve this you should use the split_out= parameter to your groupby aggregation to request a certain number of output partitions.
df.groupby(['x', 'y', 'z']).mean(split_out=10)
Note that using split_out= will significantly increase the size of the task graph (it has to mildly shuffle/sort your data ahead of time) and so may increase scheduling overhead.

Spark Structured Streaming Aggregation Output Interval

I'm reviewing the StructuredNetworkWordCountWindowed example in Apache Spark Structured Streaming and am having trouble finding information about how I can update the example to control the output intervals. When I run the example I receive output every time a micro batch is processed. I understand that this is intended because the main case is to process data and emit results in real time but what about the case where I want to process data in real time but output the state at some specific interval? Does Spark Structured Streaming support this scenario? I reviewed the programming guide and the only similar concept that is mentioned is the Trigger.ProcessingTime option. Unfortunately, this option is not quite what is needed since it applies to the batch processing time and the scenario described above still requires processing data in real time.
Is this feature supported? More specifically, how do I only output the state at the time the window ends assuming there are no late arrivals and using a tumbling window?

How processing Rate and Trigger interval inter-play in spark structured streaming?

I'd like to understand the following:
In Spark Structured streaming, there is the notion of trigger that says at which interval spark will try to read data to start a processing. What I would like to know is how long does the the readying operation may last? In particular in the context of Kafka, what exactly happens? Let say, we have configured spark to retrieve the latest offsets always. What I want to know is, does Spark try to read an arbitrary amount of data (as in from where it last left off up to the latest offset available) on each trigger? What if the readying operation is longer than the interval? What is supposed to happen at that point?
I wonder if there is a readying operation time that can be set, as in every trigger, keep readying for this amount of time? Or is the rate actually controlled in the two following ways:
Manually with maxOffsetsPerTrigger, and in that case, the trigger does not really matter,
Choose a trigger that make sense with respect to how much data you may have available and be able to process between triggers.
The second options sounds quite difficult to calibrate.

Spark Structured Streaming - Processing each row

I am using structured streaming with Spark 2.1.1. I need to apply some business logic to incoming messages (from Kafka source).
essentially, I need to pick up the message, get some key values, look them up in HBase and perform some more biz logic on the dataset. the end result is a string message that needs to be written out to another Kafka queue.
However, since the abstraction for incoming messages is a dataframe (unbounded table - structured streaming), I have to iterate through the dataset received during a trigger through mapPartitions (partitions due to HBase client not being serializable).
During my process, i need to iterate through each row for executing the business process for the same.
Is there a better approach possible that could help me avoid the dataFrame.mapPartitions call? I feel its sequential and iterative !!
Structured streaming basically forces me to generate an output data frame out of my business process, whereas there is none to start with. What other design pattern can I use to achieve my end goal ?
Would you recommend an alternative approach ?
When you talk about working with Dataframes in Spark, speaking very broadly, you can do one of 3 things
a) Generate a Dataframe
b) Transform a data frame
c) Consume a data frame
In structured streaming, a streaming DataFrame is generated using a DataSource. Normally you create sources using methods exposed sparkSession.readStream method. This method returns a DataStreamReader which has several methods for reading from various kinds of input. All of there return a DataFrame. Internally it creates a DataSource. Spark allows you to implement your own DataSource, but they recommend against it, because as of 2.2, the interface is considered experimental
You transform data frames mostly using map or reduce, or using spark SQL. There are different flavors of map (map, mapPartition, mapParititionWithIndex), etc. All of them basically take a row and return a row. Internally Spark does the work of parallelizing the calls to your map method. It partitions the data, spreads it around on executors on the cluster, and calls your map method in the executor. You don't need to worry about parallelism. It's built under the hood. mapParitions is not "sequential". Yes, rows within a partition are executed sequentially, but multiple partitions are executed in parallel. You can easily control the degree of parallelism by partitioning your dataframe. You have 5 partitions, you will have 5 processes running in parallel. You have 200, you can have 200 of them running in parallel if you have 200 cores
Note that there is nothing stopping you from going out to external systems that manage state inside your transformation. However, your transformations should be idempotent. Given a set of input, they should always generate the same output, and leave the system in the same state over time. This can be difficult if you are talking to external systems inside your transformation. Structured Streaming provides at least once guarantee. The means that the same row might be transformed multiple times. So, if you are doing something like adding money to a bank account, you might find that you have added the same amount of money twice to some of the accounts.
Data is consumed by sinks. Normally, you add a sink by calling the format method on a Dataframe and then calling start. StructuredStreaming has a handful of inbuilt sinks which (except for one) are more or less useless.You can create your custom Sink but again it's not recommended because the interface is experimental. The only useful sink is what you would implement. It is called ForEachSink. Spark will call your for each sink with all the rows in your partition. You can do whatever you want with the rows, which includes writing it to Hbase. Note that because of the at least once nature of Structured Streaming, the same row might be fed to your ForEachSink multiple times. You are expected to implement it in an idempotent manner. Also, if you have multiple sinks, data is written to sinks in parallel. You cannot control in what order the sinks are called. It can happen that one sink is getting data from one micro batch while another sink is still processing data for the previous micro batch. Essentially, the Sinks are eventually consistent, not immediately consistent.
Generally, the cleanest way to build your code is to avoid going to outside systems inside your transformations. Your transformations should purely transform data in data frames. If you want data from HBase, get it into a data frame, join it with your streaming data frame, and then transform it. This is because when you go to outside systems, it becomes difficult to scale. You want to scale up your transformations by increasing partitioning on your data frames and adding nodes. However, too many nodes talking to external systems can increase the load on the external systems and cause bottlenecks, Separating transformation from data retrieval allows you to scale them independently.
BUT!!!! there are big buts here......
1) When you talk about Structured streaming, there is no way to implement a Source that can selectively get data from your HBase based on the data in your input. You have to do this inside a map(-like) method. So, IMO, what you have is perfectly fine if the data in Hbase changes or there is a lot of data that you don't want to keep in memory. If your data in HBase is small and unchanging, then it's better to read it into a batch data frame, cache it and then join it with your streaming data frame. Spark will load all the data into its own memory/disk storage, and keep it there. If your data is small and changing very frequently, it's better to read it in a data frame, don't cache it and join it with a streaming data frame. Spark will load the data from HBase every time it runs a micro batch.
2) there is no way to order the execution of 2 separate Sinks. So, if your requirement requires you to write to a database, and write to Kafka, and you want to guarantee that a row in Kafka is written after the row is committed in the database, then the only way to do that is to
a) do both writes in a For each Sink.
b)write to one system in a map-like function and the other in a for each sink
Unfortunately, if you have a requirement that requires you to read data from a streaming source, join it with data from batch source, transform it, write it to database, call an API, get the result from the API and write the result of the API to Kafka, and those operations have to be done in exact order, then the only way you can do this is by implementing sink logic in a transformation component. You have to make sure you keep the logic separate in separate map functions, so you can parallelize them in an optimal manner.
Also, there is no good way to know when a micro-batch is completely processed by your application, especially if you have multiple sinks
try ForeachWriter, In ForeachWriter process() method receives single row from data frame.
and you can process the data as you want. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/ForeachWriter.html