How to optimally decide the number of partitions in a DataFrame dynamically? - scala

I have two dataframes of around 11 million records. After transformations and some window analytic functions I have around 7 million records. I am currently trying to find a dynamic way to calculate the number of partitions. Normally I take the size of the dataframe from the UI, divide it by 256 MB (partition bytes, which is 128 MB by default), and use the result as the number of partitions. I want to avoid these manual steps and would like to know if there is a dynamic, programmatic way of doing the same. Any help on this will be appreciated.
Thanks
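The manual arithmetic described in the question (total size divided by a target partition size) can be written down as a small helper. This is only a sketch in plain Python: `partitions_for_size` is a hypothetical name, and the total size still has to come from somewhere, such as the Spark UI or an estimate of your own.

```python
import math

MB = 1024 * 1024

def partitions_for_size(total_bytes, target_partition_bytes=128 * MB):
    """Smallest partition count such that no partition exceeds the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# Example: a hypothetical 10 GiB dataframe with a 256 MiB target per partition.
print(partitions_for_size(10 * 1024 * MB, 256 * MB))  # 40
```

The result would then be passed to `repartition` on the dataframe.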

Related

PySpark: Efficient strategy of splitting my dataframe when writing to a delta table

I would like to know if there is an efficient strategy for writing my Spark dataframe to a Delta table in the data lake.
As a rule of thumb I partition the dataframe by a column that has between 70 and 300 distinct values.
The 'trick' I use to see which column is a candidate for "partitionBy" is the following.
I register my dataframe as a temporary view and look at the cardinality of each column.
df.createOrReplaceTempView("my_table")
%sql
select
count(distinct(column1)) as column1,
count(distinct(column2)) as column2,
...
from my_table
Then I pick the column with a cardinality between 70 and 300, depending on the size of the table,
which I estimate mentally as table_size / 128 MB. Is this correct?
(
    df.write.partitionBy("column_candidate")
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save(outputpath)
)
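The selection rule described above (keep columns whose distinct count lies between 70 and 300) can be expressed as a small filter over the cardinality query's result. This is a plain-Python sketch; `candidate_columns` is a hypothetical helper and the counts below are made-up values, not measurements.

```python
def candidate_columns(cardinalities, low=70, high=300):
    """Return column names whose distinct count is in [low, high], lowest count first."""
    picked = [(count, name) for name, count in cardinalities.items()
              if low <= count <= high]
    return [name for count, name in sorted(picked)]

# Hypothetical output of the count(distinct ...) query, one entry per column.
counts = {"column1": 12, "column2": 150, "column3": 48000, "column4": 90}
print(candidate_columns(counts))  # ['column4', 'column2']
```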
The method I use does not seem very scientific, and I would like to know if there is a better way to estimate it. I have also seen that there is something called "repartition", but I don't know how to use it or whether it is relevant here.
How can I calculate the partitions in a more scientific way?
The number of partitions in Spark should be decided thoughtfully, based on the cluster configuration and the requirements of the application. Increasing the number of partitions will make each partition have less data or no data at all. Apache Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster. If a cluster has 30 cores, then programmers want their RDDs to have at least 30 partitions, or maybe 2 or 3 times that.
Some commonly cited guidelines for the number of partitions in Spark are as follows:
The number of partitions usually falls between 100 and 10K, depending on the size of the cluster and the data, so a lower and an upper bound should be determined.
- The lower bound for Spark partitions is 2 x the number of cores in the cluster available to the application.
- For the upper bound, each task should take at least 100 ms to execute. If tasks take less time, the partitions may be too small, or the application may be spending extra time scheduling tasks.
For more information, refer to this article.
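The lower bound above can be combined with the size-based estimate from the question. The sketch below is one possible way to do that, not a rule taken from the article: it takes the larger of "2 x cores" and "total size / target partition size". `suggest_partitions` and the 128 MiB default are assumptions for illustration, and the 100 ms upper bound still has to be checked by measuring task durations.

```python
import math

MB = 1024 * 1024

def suggest_partitions(total_cores, total_bytes, target_partition_bytes=128 * MB):
    """Larger of the core-based lower bound and the size-based estimate."""
    lower_bound = 2 * total_cores                        # guideline: 2 x cores
    by_size = math.ceil(total_bytes / target_partition_bytes)
    return max(lower_bound, by_size)

# Hypothetical 30-core cluster with 2 GiB and 50 GiB of data.
print(suggest_partitions(30, 2 * 1024 * MB))   # 60  (core-based bound wins)
print(suggest_partitions(30, 50 * 1024 * MB))  # 400 (size-based estimate wins)
```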

How to perform large computations on Spark

I have 2 tables in Hive: user and item and I am trying to calculate cosine similarity between 2 features of each table for a cartesian product between the 2 tables, i.e. Cross Join.
There are around 20000 users and 5000 items resulting in 100 million rows of calculation. I am running the compute using Scala Spark on Hive Cluster with 12 cores.
The code goes a little something like this:
val pairs = userDf.crossJoin(itemDf).repartition(100)
val results = pairs.mapPartitions(computeScore) // computeScore is a function to compute the similarity scores I need
The Spark job will always fail due to memory issues (GC Allocation Failure) on the Hadoop cluster. If I reduce the computation to around 10 million, it will definitely work - under 15 minutes.
How do I compute the whole set without increasing the hardware specifications? I am fine if the job takes longer to run and does not fail halfway.
If you take a look at the Spark documentation you will see that Spark uses different strategies for data management. These policies are enabled by the user via settings in the Spark configuration files or directly in the code or script.
Below is the documentation about data management policies:
The "MEMORY_AND_DISK" policy would be good for you because if the data (RDD) does not fit in RAM, the remaining partitions are stored on disk. But this strategy can be slow if you have to access the disk often.
There are a few steps for doing that:
1. Check the expected data volume after the cross join and divide it by 200, since spark.sql.shuffle.partitions defaults to 200. It has to be more than 1 GB of raw data for each partition.
2. Calculate the size of each row and multiply it by the row count of the other table to estimate the rough volume. The process will work much better with Parquet than with CSV files.
3. Set spark.sql.shuffle.partitions to total data volume / 500 MB.
4. Set spark.shuffle.minNumPartitionsToHighlyCompress a little lower than the number of shuffle partitions.
5. Bucket the source Parquet data on the joining column for both files/tables.
6. Provide high Spark executor memory, and manage the Java heap memory too, considering the heap space.
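Steps 1 to 3 are plain arithmetic and can be sketched as follows. The 20,000 x 5,000 row counts come from the question; the 1,000 bytes per joined row is a made-up figure, and `cross_join_shuffle_partitions` is a hypothetical helper, not a Spark API.

```python
import math

MB = 1024 * 1024

def cross_join_shuffle_partitions(rows_left, rows_right, bytes_per_row,
                                  target_bytes=500 * MB):
    """Estimate cross-join rows, volume in bytes, and a shuffle partition count."""
    total_rows = rows_left * rows_right
    total_bytes = total_rows * bytes_per_row
    partitions = max(1, math.ceil(total_bytes / target_bytes))
    return total_rows, total_bytes, partitions

# bytes_per_row=1_000 is an assumed size of one joined row.
rows, volume, parts = cross_join_shuffle_partitions(20_000, 5_000, 1_000)
print(rows)   # 100000000
print(parts)  # 191
```

The resulting count is the value that would go into spark.sql.shuffle.partitions.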

Does the time to groupBy and sum a Spark DF rise proportionally to the number of sums?

df.groupBy("c1").agg(sum("n1")).distinct.count()
would take 10 seconds
df.groupBy("c1").agg(sum("n1"), sum("n2")).distinct.count()
would take 20 seconds
It surprises me, given the row storage of DFs. Do you have the same experience, and how does this make sense? Also, any ideas on how to make 2 sums run in a time closer to 1 sum? Spark 2.2.0
I don't think "agg" takes that much more time in the second case. I would look at distinct.
You're executing distinct over the extra column n2, which gives a broader distribution and increases the complexity of the distinct calculation.
It makes sense:
You increase the number of computations twofold.
You increase the shuffle size by roughly 50%.
Both changes will impact overall performance, even if the final result is small and the impact on distinct is negligible.
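The "roughly 50%" figure follows from the width of each shuffled row, under the simplifying assumption that the group key and every partial sum take the same number of bytes. The helper name and the 8-byte field width are assumptions for illustration.

```python
def shuffle_row_width(num_group_keys, num_aggregates, bytes_per_field=8):
    """Bytes per shuffled row if every field has the same (assumed) width."""
    return (num_group_keys + num_aggregates) * bytes_per_field

one_sum = shuffle_row_width(1, 1)   # c1 + sum(n1)
two_sums = shuffle_row_width(1, 2)  # c1 + sum(n1) + sum(n2)
print((two_sums - one_sum) / one_sum)  # 0.5
```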

Is there an alternative solution without a cross join in Spark 2?

Stackoverflow!
I wonder if there is a fancy way in Spark 2.0 to solve the situation below.
The situation is like this.
Dataset1 (TargetData) has this schema and about 20 million records.
id (String)
vector of embedding result (Array, 300 dim)
Dataset2 (DictionaryData) has this schema and about 9,000 records.
dict key (String)
vector of embedding result (Array, 300 dim)
For each record's vector in dataset 1, I want to find the dict key in dataset 2 that gives the maximum cosine similarity.
Initially, I tried to cross join dataset1 and dataset2 and calculate the cosine similarity of all records, but the amount of data is too large to handle in my environment.
I have not tried it yet, but I thought of collecting dataset2 as a list and then applying a udf.
Is there any other method for this situation?
Thanks,
There might be two options. The first is to broadcast Dataset2, since you need to scan it for each row of Dataset1; this avoids the network delay of accessing it from a different node. Of course, in this case you first need to consider whether your cluster can handle the memory cost, which is 9,000 rows x 300 columns (not too big, in my opinion). You still need your join, although with broadcasting it should be faster. The other option is to populate a RowMatrix from your existing vectors and let Spark do the calculations for you.
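As a rough illustration of the first option, the per-row work with a broadcast dictionary amounts to scanning the small table in memory and keeping the key with the highest cosine similarity. The sketch below is plain Python with a toy 3-dimensional dictionary instead of 300-dimensional embeddings; `cosine` and `best_key` are hypothetical helpers, and in Spark this logic would sit inside a udf or a mapPartitions function that reads the broadcast value.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_key(vector, dictionary):
    """Dictionary key whose vector is most similar to the given vector."""
    return max(dictionary, key=lambda k: cosine(vector, dictionary[k]))

# Toy stand-in for Dataset2 (made-up keys and vectors).
dictionary = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0], "fish": [0.0, 0.0, 1.0]}
print(best_key([0.1, 0.9, 0.2], dictionary))  # dog

# Raw size of 9,000 x 300 values stored as 8-byte doubles, in MiB.
print(9_000 * 300 * 8 / (1024 * 1024))
```

At 8 bytes per value the raw dictionary is about 21.6 million bytes, which is consistent with the "not too big" estimate; object overhead in a real process would add to that.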

multiple negative lag (aka lead) variable in h2o

Background:
Given that
I have a large table (~20 million rows, 40 columns).
I am using h2o to fit some models on the data in the table.
I found there is "h2o.difflag1" that gives a lag-1 transform.
I want to create a column that is "lead ~8000".
Question
How do I make a 3600 lead without making 8000 lags on 39 columns using h2o on my 20-million row dataset? Is there a better way to do this?
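To make the target concrete: a lead of k rows is a single shift of the column by k positions, not k successive lag-1 steps. The sketch below shows that definition on a plain Python list. It is not h2o code; `lead` is a hypothetical helper, and how best to express the same shift in h2o is left open.

```python
def lead(values, k, fill=None):
    """Return a list where position i holds values[i + k]; the tail is padded with fill."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return values[k:] + [fill] * min(k, len(values))

print(lead([10, 20, 30, 40, 50], 2))  # [30, 40, 50, None, None]
```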