Tips on decreasing network shuffle in Spark (Scala)

I have this use case where I am joining two dataframes in Spark, A and B.
A -> Huge dataframe approx size: 100 TB
B -> Smaller dataframe approx size: 100 MB
Two questions:
How can I reduce network shuffle? The Spark UI showed a shuffle read of about 30 GB.
The number of tasks is also huge, approx. 1,000,000. Any tips to reduce them?
I have tried caching dataframe A, but surprisingly it only made the job slower.
Any help would be appreciated.

You can try increasing spark.sql.autoBroadcastJoinThreshold to 100 MB in order to trigger a map-side (broadcast) join, or, if that doesn't help, explicitly broadcast your smaller dataframe B:
val result = dfA.join(broadcast(dfB),...
That should eliminate join-related shuffle completely.
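For reference, a minimal sketch of both options; the dataframe names are from the question, but the join key column user_id and the exact threshold value are placeholders, not from the original post:
import org.apache.spark.sql.functions.broadcast
// Option 1: raise the broadcast threshold above B's size so Spark plans the broadcast join itself
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 134217728L)  // 128 MB, in bytes
// Option 2: force it with an explicit broadcast hint on the small side
val result = dfA.join(broadcast(dfB), Seq("user_id"))
Either way B is shipped to every executor once and A is joined in place, so A itself never has to be shuffled for the join.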

Related

PySpark: Efficient strategy of splitting my dataframe when writing to a delta table

I would like to know if there is an efficient strategy for writing my Spark dataframe to a Delta table in the data lake.
As a rule of thumb I am partitioning the dataframe by some column that has between 70 and 300 distinct values.
The 'trick' I use to see which column is the candidate for "partitionBy" is the following.
I transform my dataframe into a temporary table and look at the cardinality.
df.createOrReplaceTempView("my_table")
%sql
select
count(distinct(column1)) as column1,
count(distinct(column2)) as column2,
...
from my_table
Then I pick the column with a cardinality between 70 and 300, depending on the size of the table,
mentally calculating table_size / 128 MB --> is this correct?
df.write.partitionBy("column_candidate")
.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.save(outputPath)
This method does not seem very scientific, and I would like to know if there is a better way to estimate it. I have also seen that there is something called "repartition", but I don't know how to use it or whether it is relevant here.
How can I calculate the partitions in a more scientific way?
The number of partitions in Spark should be decided thoughtfully, based on the cluster configuration and the requirements of the application. Increasing the number of partitions makes each partition hold less data, or possibly no data at all. Apache Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster. If a cluster has 30 cores, then you want your RDDs to have at least 30 partitions, or maybe 2 or 3 times that.
Some commonly cited guidelines for the number of partitions in Spark are as follows:
The number of partitions usually falls between 100 and 10K; the lower and upper bounds should be determined based on the size of the cluster and of the data.
- The lower bound for Spark partitions is 2 x the number of cores in the cluster available to the application.
- For the upper bound, each task should take at least 100 ms to execute; if it takes less, the partitioned data is probably too small and the application is spending extra time scheduling tasks.
For more information, refer to this article.
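To make the 128 MB rule of thumb concrete, here is a rough sketch using repartition (which the question also asks about); the total size is an assumed placeholder you would estimate yourself, e.g. from the source files, and outputPath is the question's output path:
// Assumed, not measured: total data size in bytes (e.g. ~50 GB)
val totalSizeBytes = 50L * 1024 * 1024 * 1024
val targetFileBytes = 128L * 1024 * 1024  // aim for ~128 MB per output file
// apply the lower bound of 2 x available cores from the guideline above
val numPartitions = math.max((totalSizeBytes / targetFileBytes).toInt,
  2 * spark.sparkContext.defaultParallelism)
df.repartition(numPartitions)
  .write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .save(outputPath)
Unlike partitionBy, which creates a directory per column value, repartition only controls how many files are written, so it is the simpler knob when the goal is evenly sized files rather than a partitioned directory layout.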

Memory-efficient row-wise shuffle Polars

A simple row-wise shuffle in Polars with
df = df.sample(frac=1.0)
has a peak memory usage of 2x the size of the dataframe (profiling with mprof).
Is there any fast way to perform a row-wise shuffle in Polars while keeping the memory usage down as much as possible? Shuffling column by column (or a batch of columns at a time) with the same seed (or .take with random index) does the trick but is quite slow.
A shuffle is not in-place; Polars memory is often shared between columns/series/Arrow.
A shuffle therefore has to allocate a new memory buffer. If we shuffle the whole DataFrame in parallel (which sample does), we allocate new buffers in parallel and write the shuffled data into them, hence the 2x peak memory usage.

How to perform large computations on Spark

I have 2 tables in Hive, user and item, and I am trying to calculate the cosine similarity between 2 features of each table for the Cartesian product of the 2 tables, i.e. a cross join.
There are around 20,000 users and 5,000 items, resulting in 100 million rows of calculation. I am running the computation using Scala Spark on a Hive cluster with 12 cores.
The code goes a little something like this:
val pairs = userDf.crossJoin(itemDf).repartition(100)
val results = pairs.mapPartitions(computeScore) // computeScore is a function to compute the similarity scores I need
The Spark job will always fail due to memory issues (GC Allocation Failure) on the Hadoop cluster. If I reduce the computation to around 10 million, it will definitely work - under 15 minutes.
How do I compute the whole set without increasing the hardware specifications? I am fine if the job takes longer to run and does not fail halfway.
If you take a look at the Spark documentation, you will see that Spark uses different strategies for data management. These policies are enabled by the user via the Spark configuration files, or directly in the code or script.
Below is the gist of the documentation on these storage policies:
The "MEMORY_AND_DISK" policy would be good for you, because if the data (RDD) does not fit in RAM, the remaining partitions are stored on disk. This strategy can be slow, though, if you have to access the disk often.
There are a few steps to doing that:
1. Check the expected data volume after the cross join and divide it by 200, since spark.sql.shuffle.partitions defaults to 200. If that works out to more than about 1 GB of raw data per partition, you need more partitions.
2. Calculate the size of each row and multiply it by the other table's row count to estimate the rough volume. The process works much better with Parquet than with CSV files.
3. Set spark.sql.shuffle.partitions based on total data volume / 500 MB (see the sketch after this list).
4. Set spark.shuffle.minNumPartitionsToHighlyCompress a little lower than the shuffle partition count.
5. Bucket the source Parquet data on the joining column for both files/tables.
6. Provide high Spark executor memory, and manage the Java heap memory too, keeping the available heap space in mind.
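A sketch of steps 1, 3 and 4 in code; the estimated volume is a placeholder you would fill in from step 2, not a real measurement:
// Assumed, not measured: rough data volume after the cross join (from step 2)
val estimatedVolumeBytes = 2L * 1024 * 1024 * 1024 * 1024  // e.g. ~2 TB
val perPartitionTarget = 500L * 1024 * 1024                // ~500 MB per partition (step 3)
val shufflePartitions = estimatedVolumeBytes / perPartitionTarget
spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions)
// Step 4: spark.shuffle.minNumPartitionsToHighlyCompress is a core Spark setting rather than a SQL one,
// so pass it at submit time, e.g. --conf spark.shuffle.minNumPartitionsToHighlyCompress=<slightly fewer>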

Time to groupBy and sum a Spark DF rises proportionally to the number of sums?

df.groupBy("c1").agg(sum("n1")).distinct.count()
would take 10 seconds
df.groupBy("c1").agg(sum("n1"), sum("n2")).distinct.count()
would take 20 seconds
This surprises me, given the row-based storage of DFs. Do you have the same experience, and how does this make sense? Also, any ideas on how to make 2 sums run in a time closer to 1 sum? Spark 2.2.0.
I don't think "agg" takes that much more time in the second case. I would look towards distinct.
You're executing distinct over the extra column n2, which gives a broader distribution and increases the complexity of the distinct calculation.
It makes sense:
You double the number of computations.
You increase the shuffle size by roughly 50%.
Both changes will impact overall performance, even if the final result is small and the impact on distinct is negligible.
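If you want to see where the time goes, comparing the two plans and timings is straightforward; this is just the question's own queries wrapped in explain and spark.time (which prints the elapsed time):
import org.apache.spark.sql.functions.sum
// physical plans: the two-sum version shuffles wider rows after the partial aggregation
df.groupBy("c1").agg(sum("n1")).explain()
df.groupBy("c1").agg(sum("n1"), sum("n2")).explain()
// rough wall-clock comparison
spark.time(df.groupBy("c1").agg(sum("n1")).distinct.count())
spark.time(df.groupBy("c1").agg(sum("n1"), sum("n2")).distinct.count())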

Speed up collaborative filtering for large dataset in Spark MLLib

I'm using MLlib's matrix factorization to recommend items to users. I have a big implicit interaction matrix of about M = 20 million users and N = 50k items. After training the model I want to get a short list (e.g. 200) of recommendations for each user. I tried recommendProductsForUsers in MatrixFactorizationModel, but it's very, very slow (it ran for 9 hours and was still far from finished; I'm testing with 50 executors, each with 8 GB memory). This might be expected, since recommendProductsForUsers needs to calculate all M*N user-item interactions and take the top items for each user.
I'll try using more executors, but from what I saw in the application detail on the Spark UI, I doubt it can finish in hours or even a day, even with 1000 executors (after 9 hours it's still in the flatMap here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L279-L289, with 10,000 total tasks and only ~200 finished).
Are there any other things that I can tune to speed up the recommendation process beside increasing # of executors?
Here is sample code:
val ratings = input.map(r => Rating(r.getString(0).toInt, r.getString(1).toInt, r.getLong(2))).cache()
val rank = 20
val alpha = 40
val maxIter = 10
val lambda = 0.05
val checkpointInterval = 5
val als = new ALS()
.setImplicitPrefs(true)
.setCheckpointInterval(checkpointInterval)
.setRank(rank)
.setAlpha(alpha)
.setIterations(maxIter)
.setLambda(lambda)
val model = als.run(ratings)
val recommendations = model.recommendProductsForUsers(200)
recommendations.saveAsTextFile(outdir)
@Jack Lei: Did you find the answer to this?
I myself tried a few things, but they only helped a little.
For example, I tried
javaSparkContext.setCheckpointDir("checkpoint/");
This helps because it avoids repeated computation in between.
I also tried adding more memory per executor and more Spark memory overhead:
--conf spark.driver.maxResultSize=5g --conf spark.yarn.executor.memoryOverhead=4000
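For completeness, a minimal sketch of the same two tweaks in Scala; the checkpoint path and memory values are the ones mentioned in this answer, not tuned recommendations:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .set("spark.driver.maxResultSize", "5g")
  .set("spark.yarn.executor.memoryOverhead", "4000")
val sc = new SparkContext(conf)
// used together with setCheckpointInterval on the ALS instance to avoid recomputing long lineages
sc.setCheckpointDir("checkpoint/")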