trace expensive part of code in spark ui back to part of pyspark - pyspark

I have some pyspark code with a very large number of joins and aggregations. I've enabled the Spark UI and I've been digging into the event timeline, job stages, and DAG visualization. I can find the task id and executor id for the expensive parts. Does anyone have a tip on how I can tie the expensive parts from the Spark UI output (task id, executor id) back to parts of my pyspark code? For example, I can tell from the output that the expensive parts are caused by a large amount of shuffling from all my joins, but it would be really handy to identify which join was the main culprit.

Your best approach is to start applying actions to your dataframes at various points in the code. Pick a place, write the dataframe to a file, read it back, and continue.
This will let you identify your bottlenecks, since you can then observe smaller portions of the execution in the UI as well.
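For example (Scala shown for brevity; the PySpark calls have the same names), assuming df is the DataFrame right after one of the suspect joins and /tmp/debug is just a placeholder scratch path:

// Materialize the intermediate result so everything up to this point runs as its
// own job(s) and becomes attributable in the Spark UI timeline.
df.write.mode("overwrite").parquet("/tmp/debug/after_join_1")

// Resume the rest of the pipeline from the materialized data and compare job/stage times.
val resumed = spark.read.parquet("/tmp/debug/after_join_1")

Moving the split point around lets you bisect the pipeline until the expensive join is isolated.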

Related

Talend dynamic jobs slower than tFilterRow

I've got a parent job reading files/messages, and based on the type of message it should run a child job that contains all the logic required for processing that type of message. Using dynamic jobs makes it incredibly slow and I don't know why.
Currently we have a growing tree of specific jobs, which I have trouble grouping back together for a finishing task that they all share.
I tried to use a dynamic job because the job that needs to be called is known in the model; however, this makes the job painfully slow.
Are there 'better' solutions, ways to speed this up or anything that I could be missing?

Performance Improvement in scala dataframe operations

I am using a table that is partitioned by the load_date column and is optimized weekly with the Delta OPTIMIZE command as the source dataset for my use case.
The table schema is as shown below:
+-----------------+--------------------+------------+---------+--------+---------------+
| ID| readout_id|readout_date|load_date|item_txt| item_value_txt|
+-----------------+--------------------+------------+---------+--------+---------------+
Later this table is pivoted on the columns item_txt and item_value_txt, and many operations are applied using multiple window functions, as shown below:
val windowSpec = Window.partitionBy("id", "readout_date")
val windowSpec1 = Window.partitionBy("id", "readout_date").orderBy(col("readout_id").desc)
val windowSpec2 = Window.partitionBy("id").orderBy("readout_date")
val windowSpec3 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val windowSpec4 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
These window functions are used to implement several pieces of logic on the data. There are also a few joins used to process the data.
The final table is partitioned by readout_date and id, and the performance is very poor: it takes a long time even for 100 ids and 100 readout_dates.
If I do not partition the final table, I get the error below.
Job aborted due to stage failure: Total size of serialized results of 129 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
The expected count of ids in production is in the billions, and I expect even more throttling and performance issues when processing the complete data.
The cluster configuration and utilization metrics are provided below.
Please let me know if I am doing anything wrong with the repartitioning, and whether there are methods to improve cluster utilization and performance.
Any leads appreciated!
spark.driver.maxResultSize is just a setting; you can increase it. BUT it's set at 4 GiB to warn you that you are doing something expensive and should optimize your work. You are doing the right thing by asking for help to optimize.
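For completeness, if you do decide to simply raise the limit (which treats the symptom, not the cause), it is a driver setting that needs to be in place when the session is created; the value here is only an example, not a recommendation:

import org.apache.spark.sql.SparkSession

// Example only: raising spark.driver.maxResultSize just defers the warning.
val spark = SparkSession.builder()
  .appName("readout-processing") // placeholder name
  .config("spark.driver.maxResultSize", "8g") // example value
  .getOrCreate()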
The first thing I suggest, if you care about performance, is to get rid of the windows. The first three windows you use could be achieved with a groupBy, and that will perform better. The last two windows are definitely harder to reframe as a group by, but with some reframing of the problem you might be able to do it. The trick could be to use multiple queries instead of one. You might think that would perform worse, but I'm here to tell you that if you can avoid using a window, you will get better performance almost every time. Windows aren't bad things; they are a tool to be used, but they do not perform well on unbounded data. (Can you do anything as an intermediate step to reduce the data the window needs to examine? Or can you use aggregate functions to complete the work without having to use a window?) You should explore your options.
Given your other answers, you should be grouping by id, not windowing by id, and likely using aggregates (sum) by week of year or month. This would likely give you really speedy performance at the cost of some granularity, and it would give you enough insight to decide whether to look into something deeper... or not.
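A rough sketch of that reframing, assuming the pivoted frame is called readouts and has a numeric signal column sig_value (both names are placeholders, not from your post):

import org.apache.spark.sql.functions._

// Aggregate per id and week instead of windowing over every row per id.
val weekly = readouts
  .withColumn("year", year(col("readout_date")))
  .withColumn("week", weekofyear(col("readout_date")))
  .groupBy("id", "year", "week")
  .agg(
    sum("sig_value").as("sig_sum"),
    max("readout_date").as("latest_readout")
  )

The groupBy reduces the rows as it aggregates, whereas a window has to keep every row around.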
If you wanted more accuracy, I'd suggest the following:
Convert your nulls to 0s.
val windowSpec1 = Window.partitionBy("id").orderBy(col("readout_date").asc) // asc is important, as it flips the relationship so that the preceding nulls are grouped together
Then create a running total on the SIG_XX VAL, or whatever signal you want to look into, and call the new column 'null-partitions'.
This effectively lets you group the numbers (by null-partitions), and you can then run aggregate functions with a group by to complete your calculations. A window and a group by can do the same thing; the window is just more expensive in how it moves data, which slows things down. A group by uses more of the cluster to do the work and speeds up the process.
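A minimal sketch of that idea, again with placeholder names (readouts, sig_value, null_partitions); only the grouping key is built with a window, and the heavy lifting is done by the group by:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("id").orderBy(col("readout_date").asc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val grouped = readouts
  .withColumn("sig_value", coalesce(col("sig_value"), lit(0))) // nulls -> 0s
  .withColumn("null_partitions", sum(col("sig_value")).over(w)) // running total
  // rows that were null keep the running total of the previous non-null row,
  // so they fall into the same group as that row
  .groupBy("id", "null_partitions")
  .agg(count(lit(1)).as("rows_in_group"), max("readout_date").as("group_end"))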

What happens to a Spark DataFrame used in Structured Streaming when its underlying data is updated at the source?

I have a use case where I am joining a streaming DataFrame with a static DataFrame. The static DataFrame is read from a parquet table (a directory containing parquet files).
This parquet data is updated by another process once a day.
My question is what would happen to my static DataFrame?
Would it update itself because of the lazy execution or is there some weird caching behavior that can prevent this?
Can the update process make my code crash?
Would it be possible to force the DataFrame to update itself once a day in any way?
I don't have any code to share for this because I haven't written any yet; I am just exploring the possibilities. I am working with Spark 2.3.2.
A big (set of) question(s).
I have not implemented all aspects of this myself (yet), but this is my understanding, together with input from colleagues who implemented one aspect of it that I found compelling and logical. I note that there is not enough information out there on this topic.
So, if you have a JOIN (streaming --> static), then:
If standard coding practices (as per Databricks) are applied and .cache is used, the Spark Structured Streaming program will read the static source only once; no changes will be seen on subsequent processing cycles and there will be no program failure.
If standard coding practices (as per Databricks) are applied and caching is NOT used, the program will read the static source on every micro-batch, and all changes will be seen on subsequent processing cycles from then on.
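A minimal sketch of the second (no-cache) case, under the assumption of a parquet directory and a file-based stream; the paths, schema, and column names are hypothetical:

import org.apache.spark.sql.types._

// Static side: defined once, NOT cached, so per the point above it is re-read
// on each micro-batch and picks up the daily refresh. Adding .cache() here
// would instead freeze it at its first materialization.
val staticDf = spark.read.parquet("/data/reference/customers")

// Streaming side (a file source just for illustration).
val eventSchema = new StructType()
  .add("customer_id", StringType)
  .add("amount", DoubleType)
val streamDf = spark.readStream.schema(eventSchema).json("/data/incoming/events")

// Stream-static join: each micro-batch joins against the static source as read above.
val query = streamDf.join(staticDf, Seq("customer_id"), "left")
  .writeStream
  .format("console")
  .outputMode("append")
  .start()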
But JOINing against LARGE static sources is not a good idea. If a large dataset is evident, use HBase or some other key-value store, with mapPartitions, whether the data is volatile or non-volatile. This is more difficult, though. It was done at an airline company I worked at, and the data engineer/designer told me it was no easy task. Indeed, it is not that easy.
So, we can say that updates to the static source will not cause any crash.
"...Would it be possible to force the DataFrame to update itself once a day in any way..." I have not seen any approach like this in the docs or here on SO. You could make the static source a DataFrame using var and use a counter on the driver. Since the micro-batch physical plan is evaluated and generated every time, my take is that there is no issue with broadcast-join aspects or optimization. Whether this is the most elegant approach is debatable, and it is not my preference.
If your data is small enough, the alternative is to read it with a JOIN and perform the lookup via the primary key, augmented with some max value in a technical column added to the key to make it a compound primary key, with the data updated in the background as a new set rather than overwritten. This is easiest, in my view, if you know the data is volatile and small. Versioning means others may still read older data; that is why I state this, as it may be a shared resource.
The final say for me is that I would NOT want to JOIN with the latest info if the static source is large - e.g. some Chinese companies have 100M customers! In that case I would use a KV store as a lookup via mapPartitions, as opposed to a JOIN. See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc, which provides some insights. Also, this is old but still an applicable source of information: https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. Both are good reads, but they require some experience and the ability to see the forest for the trees.
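To give the mapPartitions lookup pattern a rough shape (assuming an existing SparkSession named spark): the client below is entirely made up, a real HBase/Redis client API will differ (see the linked posts), and the dataset and column names are illustrative only:

import spark.implicits._

// Hypothetical stand-in for an HBase/Redis/etc. client.
class KvClient(host: String) {
  private val fake = Map("c1" -> "gold", "c2" -> "silver") // dummy data for the sketch
  def get(key: String): Option[String] = fake.get(key)
}

case class Event(customerId: String, amount: Double)
case class Enriched(customerId: String, amount: Double, segment: Option[String])

val eventsDs = Seq(Event("c1", 10.0), Event("c3", 5.0)).toDS()

// One client per partition instead of one per row, and no shuffle of the large
// reference data: the lookup happens where the rows already are.
val enriched = eventsDs.mapPartitions { rows =>
  val client = new KvClient("kv-host:2181") // hypothetical endpoint
  rows.map(e => Enriched(e.customerId, e.amount, client.get(e.customerId)))
}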

Spark Scala - processing different child dataframes in parallel, in bulk

I am working on a fraudulent-transaction detection project that makes use of Spark and primarily uses a rule-based approach to risk-score incoming transactions. For this rule-based approach, several maps are created from the historical data to represent the various patterns in transactions, and these are later used while scoring a transaction. Due to the rapid increase in data size, we are now modifying the code to generate these maps at the account level.
Earlier the code was, for example:
createProfile(clientdata)
but now it becomes
accountList.map(account=>createProfile(clientData.filter(s"""account=${account}""")))
Using this approach the profiles are generated, but since these operations happen sequentially, it doesn't seem feasible.
Also, the createProfile function itself makes use of DataFrames and the SparkContext/SparkSession, which leads to the issue of not being able to send these tasks to worker nodes since, based on my understanding, only the driver can access DataFrames and the SparkSession/SparkContext. Hence, the following code does not work:
import sparkSession.implicits._
val accountListRdd = accountList.toSeq.toDF("accountNumber")
accountListRdd.rdd.map(accountrow => createProfile(clientData.filter(s"""account=${accountrow.get(0).toString}""")))
The above code does not work, but it represents the logic for the desired behaviour.
Another approach I am looking at is multithreading at the driver level using Scala Futures. But even in this scenario, many JVM objects are created in a single createProfile call, so increasing the number of threads, even if this approach works, can lead to a lot of JVM objects, which itself can lead to garbage collection and memory overhead issues.
Just to put a timing perspective on it: createProfile takes about 10 minutes on average for a single account and we have 3000 accounts, so sequentially it will take many days. Even with multithreading, if we achieve a factor of 10 it will still take days. So we need parallelism on the order of 100s.
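Roughly, the kind of thing I have in mind for the Future-based approach (the thread-pool size is just an example; createProfile and the account filter are as above):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Bounded pool on the driver; Spark's scheduler is thread-safe, so each
// createProfile call submits its own jobs concurrently.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

val inFlight = accountList.toSeq.map { account =>
  Future {
    createProfile(clientData.filter(s"""account=${account}"""))
  }
}
Await.result(Future.sequence(inFlight), Duration.Inf)

This caps concurrency at the pool size, which still falls short of the 100s of parallelism we need.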
One thing that could have worked, if it existed, is something like a Spark groupBy within a groupBy kind of operation, where at the first level we could group by "account" and then do the other operations
(currently the issue is that a UDF would not be able to handle the kind of operations we want to perform).
Another solution, if practically possible, is the way Spark Streaming works:
it has a foreachRDD method and also the spark.streaming.concurrentJobs parameter, which allows processing of multiple RDDs in parallel. I am not sure how it works, but maybe that kind of implementation may help.
Above is the problem description and my current views on it.
Please let me know if anyone has any ideas about this! Also, I would prefer a logical change rather than a suggestion of a different technology.

How to do parallel pipeline?

I have built a Scala application on Spark v1.6.0 that combines various functionalities. I have code for scanning a dataframe for certain entries, code that performs certain computations on a dataframe, code for creating an output, etc.
At the moment the components are 'statically' combined, i.e., in my code I call the code from a component X doing a computation, take the resulting data, and call a method of component Y that takes that data as input.
I would like to make this more flexible, letting a user simply specify a pipeline (possibly one with parallel executions). I would assume that the workflows are rather small and simple, as in the following picture:
However, I do not know how to best approach this problem.
I could build the whole pipeline logic myself, which will probably result in quite some work and possibly some errors too...
I have seen that Apache Spark comes with a Pipeline class in the ML package; however, it does not support parallel execution if I understand correctly (in the example, the two ParquetReaders could read and process the data at the same time).
There is apparently the Luigi project that might do exactly this (however, the page says Luigi is for long-running workflows, whereas I just need short-running workflows; Luigi might be overkill?).
What would you suggest for building work/dataflows in Spark?
I would suggest using Spark's MLlib pipeline functionality; what you describe sounds like it would fit the case well. One nice thing about it is that it allows Spark to optimize the flow for you, probably in a smarter way than you could yourself.
You mention it can't read the two Parquet files in parallel, but it can read each separate file in a distributed way. So rather than having N/2 nodes process each file separately, you would have N nodes process them in series, which I'd expect to give you a similar runtime, especially if the mapping to y-c is 1-to-1. Basically, you don't have to worry about Spark underutilizing your resources (if your data is partitioned properly).
But actually things may even be better, because Spark is smarter at optimising the flow than you are. An important thing to keep in mind is that Spark may not do things exactly in the way and in the separate steps as you define them: when you tell it to compute y-c it doesn't actually do that right away. It is lazy (in a good way!) and waits until you've built up the whole flow and ask it for answers, at which point it analyses the flow, applies optimisations (e.g. one possibility is that it can figure out it doesn't have to read and process a large chunk of one or both of the Parquet files, especially with partition discovery), and only then executes the final plan.
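For illustration, a minimal sketch of chaining stages with the ML Pipeline API; the SQLTransformer stages here are just stand-ins for your own components (scan, compute, output), which you would write as custom Transformers:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.SQLTransformer

// Each stage receives the previous stage's DataFrame via the __THIS__ placeholder.
val scanStage = new SQLTransformer()
  .setStatement("SELECT * FROM __THIS__ WHERE value IS NOT NULL")
val computeStage = new SQLTransformer()
  .setStatement("SELECT *, value * 2 AS doubled FROM __THIS__")

val pipeline = new Pipeline().setStages(Array(scanStage, computeStage))

// fit() does no estimation for pure Transformers but returns a reusable PipelineModel.
val model = pipeline.fit(inputDf) // inputDf: whatever DataFrame the workflow starts from
val result = model.transform(inputDf)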