Speed up repetitive KMeans in PySpark

I currently have a dataframe of a billion rows on the commute times of 1 million people. There are two columns: the unique ID of each person, and a commute time.
I want to perform KMeans clustering for each person based on their commute times. Selecting the commute times of a single person and running KMeans on them is very fast, but doing it a million times would take a very long time.
What I am doing right now:
for i in distinct_commuter_id:
    df_input = df.filter(df['id'] == i)
    ...  # run KMeans on df_input
Any suggestions would be greatly appreciated.
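One pattern that might help (a minimal sketch, not tested on your data): if you are on Spark 3.x, a grouped pandas UDF via applyInPandas runs a Python function once per group on the executors, so all the per-person fits happen in parallel instead of in a driver-side loop. The column names ("id", "time"), the types in the schema, k=3, and scikit-learn being installed on the executors are all my assumptions here.

import pandas as pd
from sklearn.cluster import KMeans

def fit_kmeans(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains every commute time for one person
    km = KMeans(n_clusters=3, n_init=10).fit(pdf[["time"]].to_numpy())
    pdf["cluster"] = km.labels_
    return pdf

result = (
    df.groupBy("id")
      .applyInPandas(fit_kmeans, schema="id long, time double, cluster int")
)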

Related

How to optimally decide the number of partitions in a dataframe dynamically?

I have two dataframes of around 11 million records. After a transformation and some window analytic functions I am left with around 7 million records. I am currently trying to find a dynamic way to calculate the number of partitions. Normally I take the size of the dataframe from the UI and divide it by 256 MB (the partition-bytes setting, which defaults to 128 MB) to decide the number of partitions. I want to avoid this manual step and would like to know if there is any other dynamic, programmatic way of doing the same. Any help on this will be appreciated.
Thanks
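One programmatic option (my own sketch, not from the thread, and it leans on Spark internals that are not a stable public API): read Catalyst's size estimate for the optimized plan and derive a partition count from a target partition size. This assumes Spark 2.3+ (where stats() takes no arguments), and the estimate can be far off for complex plans.

target_bytes = 256 * 1024 * 1024
# sizeInBytes is Catalyst's estimate of the plan output size (a Scala BigInt)
size_bytes = int(df._jdf.queryExecution().optimizedPlan()
                   .stats().sizeInBytes().toString())
num_partitions = max(1, -(-size_bytes // target_bytes))  # ceiling division
df = df.repartition(num_partitions)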

Does the time to groupBy and sum a Spark DF rise proportionally to the number of sums?

df.groupBy("c1").agg(sum("n1")).distinct.count()
would take 10 seconds
df.groupBy("c1").agg(sum("n1"), sum("n2")).distinct.count()
would take 20 seconds
This surprises me, given the row storage of DFs. Do you have the same experience, and how does this make sense? Also, any ideas on how to make 2 sums run in a time closer to 1 sum? Spark 2.2.0.
I don't think "agg" takes too much more time in the second case. I would look towards distinct.
You're executing distinct with the extra column n2, which gives a broader distribution and increases the complexity of the distinct calculation.
It makes sense:
You double the number of computations.
You increase the shuffle size by roughly 50%.
Both changes will impact overall performance, even if the final result is small and the impact on distinct is negligible.
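To see the effect concretely (a sketch of my own, in PySpark): distinct is compiled into an aggregation keyed on every output column, so the extra sum widens both the aggregate keys and the rows being shuffled. Comparing the physical plans makes this visible.

from pyspark.sql.functions import sum as sum_

one = df.groupBy("c1").agg(sum_("n1")).distinct()
two = df.groupBy("c1").agg(sum_("n1"), sum_("n2")).distinct()

one.explain()  # HashAggregate keys include c1 and sum(n1)
two.explain()  # keys now also include sum(n2) -> wider shuffle rows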

Is there an alternative solution without a cross-join in Spark 2?

Hello, Stack Overflow!
I wonder if there is a fancy way in Spark 2.0 to solve the situation below.
The situation is this:
Dataset1 (TargetData) has this schema and about 20 million records:
id (String)
vector of embedding result (Array, 300 dim)
Dataset2 (DictionaryData) has this schema and about 9,000 records:
dict key (String)
vector of embedding result (Array, 300 dim)
For each vector in Dataset1, I want to find the dict key in Dataset2 that gives the maximum cosine similarity.
Initially, I tried cross-joining Dataset1 and Dataset2 and calculating the cosine similarity for all records, but the amount of data is too large for my environment.
I have not tried it yet, but I thought of collecting Dataset2 as a list and then applying a UDF.
Is there any other method for this situation?
Thanks,
There might be two options. One is to broadcast Dataset2, since you need to scan it for each row of Dataset1; broadcasting avoids the network delay of accessing it from a different node. Of course, in this case you first need to consider whether your cluster can handle the memory cost, which is 9,000 rows x 300 columns (not too big, in my opinion). You still need the join, although with broadcasting it should be faster. The other option is to populate a RowMatrix from your existing vectors and let Spark do the calculations for you.
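For the broadcast route, a rough PySpark sketch of my own (the names dataset1, dataset2, vec, and dict_key are placeholders, and it assumes the embedding columns are array<float>):

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

rows = dataset2.collect()                      # ~9,000 rows, fits in driver memory
keys = [r["dict_key"] for r in rows]
mat = np.array([r["vec"] for r in rows])       # shape (9000, 300)
mat /= np.linalg.norm(mat, axis=1, keepdims=True)
b_keys = spark.sparkContext.broadcast(keys)
b_mat = spark.sparkContext.broadcast(mat)

def best_key(vec):
    v = np.asarray(vec)
    v = v / np.linalg.norm(v)
    sims = b_mat.value @ v                     # cosine similarities vs. every key
    return b_keys.value[int(np.argmax(sims))]

best_key_udf = F.udf(best_key, StringType())
result = dataset1.withColumn("best_dict_key", best_key_udf("vec"))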

Nested loop (scoring matrix factorisation) in Spark: how to do it efficiently?

I have been trying to generate recommendations for some selected users in Spark. This is done by taking the dot product of the user factor (a vector of n floats) with each product factor (a vector of n floats) and then ordering in descending order.
So, let's say I have customer factors as (customerId, Array[Float]) and product factors as (productId, Array[Float]). I have to compute the score of each product for every customer and produce (customerId, productId, score), keeping only the top N results for each customer. So I do this:
val customers = ... // (customerId, Array[Float])
val products = ...  // (productId, Array[Float])
val combination = customers.cartesian(products)
val result = combination.map(x => (x._1._1, x._2._1,
  dotProd(x._1._2, x._2._2)))
... // then filter top N for each customer using a dataframe
But this is taking ages, and one reason is that the cartesian product makes the data size huge, repeating the same product factors for each and every customer.
As you can see, this is 11 TB of data for 100K customers and 300K products. And this is the DAG created (I do a select and keep only the top N scores, hence the repartition):
[DAG screenshot]
What would you suggest? How can I improve the process to get around the huge IO?
Thanks
UPDATE
In the end, it took 10 hours to run this on 48 cores.
And with 80 TB of IO!
Update 2
I suspect the solution is to collect and broadcast the two RDDs, build the cartesian product on just the IDs, and then look up the factors. This should massively reduce the IO.
I will give it a go.
[NOTE: I will not accept my own answer, since this is just an improvement and not materially better.]
As I described, I broadcast the customer and product factors, which sped up the process by almost 3x and reduced the IO to 2.4 TB.
There could be even better approaches, but I guess this is OK for now.
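For the record, here is roughly what that broadcast variant looks like (my own reconstruction, in PySpark rather than the question's Scala): the cartesian product runs over bare IDs, and the factor vectors travel once inside broadcast variables instead of once per pair. It assumes the factor maps for 100K customers and 300K products fit in executor memory, and that sc is the SparkContext.

import numpy as np

b_cust = sc.broadcast(dict(customers.collect()))   # {customerId: factors}
b_prod = sc.broadcast(dict(products.collect()))    # {productId: factors}

def score(pair):
    cid, pid = pair
    return cid, pid, float(np.dot(b_cust.value[cid], b_prod.value[pid]))

# only IDs go through the shuffle; the factors come from the broadcasts
scores = customers.keys().cartesian(products.keys()).map(score)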

Multiple negative lag (aka lead) variables in h2o

Background:
Given that
I have a large table (~20 million rows, 40 columns).
I am using h2o to fit some models on the data in the table.
I found there is "h2o.difflag1", which gives a lag-1 transform.
I want to create a column that is a lead of ~8000.
Question
How do I make a 3600-row lead without making 8000 lag transforms on 39 columns, using h2o on my 20-million-row dataset? Is there a better way to do this?
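I have not verified this against h2o's current API, but one hypothetical sketch in Python: build the lead by row-shifting a single column and rebinding, instead of chaining difflag1 thousands of times. The frame hf, the column name "y", and k are my placeholders, and the frame is assumed to already be in the desired row order.

import h2o

k = 8000
n = hf.nrows
# rows k..n of "y", padded with NaNs at the bottom, becomes lead-k of "y"
pad = h2o.H2OFrame([[float("nan")] for _ in range(k)], column_names=["y"])
lead = hf[k:n, ["y"]].rbind(pad)
lead.set_names(["y_lead"])
hf = hf.cbind(lead)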