Complex multiple operations in Pyspark avoiding loops - pyspark

My problem, simplified, is this:
I have several hundred subtables within one large PySpark dataset, visualized in the picture below. In those subtables there are missing values (marked in light blue) that I have to calculate. One such operation is shown in the picture. To calculate the missing value, I have to sum the three values to its left and divide the sum by the value below it. The formula is also shown in the picture.
Calculating the value in the picture is no problem, but in the columns further to the right I have to use the results of the calculations in the columns to their left. My task is to fill in all the light-blue cells within one subtable, and to do so for all subtables.
My current solution is to export the PySpark dataset to pandas, loop over all columns (and subtables), read each column (and the three to its left) into a numpy array within the loop, do the calculation on that column, and write the result back to the pandas dataframe. The algorithm is quite fast, but if the number of subtables exceeds about 800 I get an out-of-memory error.
Now I'm looking for a solution in PySpark (or pandas) that avoids the loop and the out-of-memory problem, because the number of subtables is more than 6,000!
(Picture: one subtable with the formula for one missing value)

aggregate sets into bigger sets in pyspark

In my dataset, I have a column A that is a list of integers, that represent a set of integers.
I want to aggregate all these sets into the set that represents the union (still represented as a list I guess).
Is there a "simple" way to do this?
What I've done is:
agg(array_distinct(flatten(collect_list("A"))))
but this seems inefficient, since at some point the fully flattened list, including all duplicates, will be held in memory.

How to pass a vector from tableau to R

I need to pass a vector of arguments to Rserve from Tableau. Specifically, I am running IRR calculations in R (on Rserve), and I want to pass a vector of cash flows that are stored as columns in my table (instead of rows/measures). So I want to collect all those cash flows into one vector and pass it to Rserve. Passing them one at a time slows down IO.
SCRIPT_REAL("r_func(c(.arg1, .arg2, .arg3))",sum(cf1), sum(cf2), sum(cf3))
cf1..cfn are cash flows corresponding to various periods. The code above works well when there are only a few cash flows, but takes a long time when I have a few hundred. Furthermore, the time is not spent in the calculation but in IO when communicating with the remote Rserve. With a local Rserve the calculation finishes in a few seconds, while on a remote one it takes well over a minute.
I also want to point out that Tableau/Rserve sets one argument after another, and that takes time. My expectation is that once I have a vector, there would be just one transfer and one setting of arguments, and that this should speed things up.
The first step in understanding how Tableau interacts with R or Python, is understanding how Tableau's table calcs work.
Tableau Script_XXX() functions are table calculations which means that you invoke them on a vector of aggregate query results and the corresponding R or Python code needs to return a vector usually of the same size. (I think you may be able to return a scalar or smaller vector which gets replicated to appear like a vector of the same size as the argument -- but not certain)
You can control how your data is partitioned into vectors, and also the ordering of data in the vectors, by editing the table calc to specify the partitioning and addressing for that calc.
Partitioning determines how your aggregate query results are broken up into vectors for calculation purposes. Addressing determines how the elements of each vector are ordered. You can either do that based on the physical layout of the table structure, or (better) based on the specific dimensions.
See the Tableau online help for table calcs for more info, and look for online training videos from Tableau or blog entries (especially from anyone named Bora).
One way to test your understanding of these concepts is create a Tableau table (i.e., a viz with a mark type of text) with several dimensions on row and column shelves. Then create calculated fields for INDEX() and SIZE() and display them on text. Finally, change the partitioning and addressing in different ways by editing those table calcs. Try several different permutations. When you can confidently predict what those functions will produce for different settings, then you're ready to do more complex tasks - such as talking to R.
It is also instructive to experiment with FIRST(), LAST(), LOOKUP(), WINDOW_SUM() etc -- and finally dig into PREVIOUS_VALUE(). Warning, PREVIOUS_VALUE() is a bit odd, and does not behave the way you probably assume it does. Still, it is a useful technique that can implement a recursive calculation, and is about as close to a for loop as Tableau gets.

Tableau Dual Axis with different filters

I am trying to create a graph with two lines, with two filters from the same dimension.
I have a dimension which has 20+ values. I'd like one line to show data based on just one of the selected values and the other line to show a line excluding that same value.
I've tried the following:
-Creating a duplicate/copy dimension and filtering the original one with the first, and the copy with the 2nd. When I do this, the graphic disappears.
-Creating a calculated field that tries to split the measures up. This isn't letting me track the count.
I want this on the same axis; the best I've been able to do is create two sheets, one with the first filter and one with the 2nd, and stack them in a dashboard.
My end user wants the lines in the same visual, otherwise I'd be happy with the dashboard approach. Right now, though, I'd also like to know how to do this.
It is a little hard to tell exactly what you want to achieve, but the problem with filtering is common.
The principle that is important is that Tableau will filter the whole dataset by row. So duplicating the dimension you want to filter won't help as the filter on the original dimension will also filter the corresponding rows in the second dimension. Any solution has to be clever enough to work around this issue.
One solution is to build two new dimensions that use a calculation rather than a filter to create the new result. Let's say you have a dimension, [size], that has a range of numbers from 1 to 10, and you want to compare the total number of rows including and excluding the number 5. You could create a new field using a formula like if [size] <> 5 then 1 else 0 end
Summing the new field will give a count of the number of rows that don't contain a 5 and this can be compared directly to a rowcount of the original [size] field which will give the number including the value 5.
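To make the counting logic concrete, here is a minimal sketch in plain Python rather than Tableau. The sample values in sizes are made up for illustration; the list comprehension mirrors the calculated field above.

```python
# Hypothetical sample rows standing in for the [size] dimension.
sizes = [1, 5, 3, 5, 7, 10, 5, 2]

# Equivalent of the calculated field: if [size] <> 5 then 1 else 0 end
not_five_flag = [1 if s != 5 else 0 for s in sizes]

# Summing the flag counts the rows that are not 5 ...
rows_excluding_5 = sum(not_five_flag)
# ... while a plain row count still includes them.
rows_including_5 = len(sizes)

print(rows_excluding_5, rows_including_5)
```

Because nothing is filtered out, both numbers are available on the same set of rows, which is what lets them share one visualisation.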
This basic principle can be extended to much more complex logic. The essential point is to realise that filters act on every row in your data and can't, by themselves, show comparisons with alternative filter choices on a single visualisation.
Depending on the nature of your problem there may be other solutions worth looking at including sets and groups but you would need to provide more specific details for users here to tell you whether they would be useful.
We can make a set out of the values of the dimension and then place it on the required shelf. That way you have your dimension, which plots as usual, and the set, which holds the data you need. A filter alone can't give you that independence to show different selections side by side.

How to calculate the mean of a dataframe column and find the top 10%

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select groups of players whose stats meet certain criteria.
Once I have the subset of players I am interested in looking at further, I would like to find the mean of a column, e.g. Batting Average or RBIs. From there I would like to break all the players into percentile groups based on their average performance compared to all players: the top 10%, bottom 10%, 40-50%.
I've been able to use the DataFrame.describe() function to return a summary of a desired column (mean, stddev, count, min, and max) all as strings though. Is there a better way to get just the mean and stddev as Doubles, and what is the best way of breaking the players into groups of 10-percentiles?
So far my thoughts are to find the values that bookend the percentile ranges and writing a function that groups players via comparators, but that feels like it is bordering on reinventing the wheel.
I was able to get the percentiles by using window functions, applying ntile() and cumeDist() over the window. ntile() creates groups based on an input number: if you want things grouped by 10%, use ntile(10); if by 5%, use ntile(20). For a more fine-tuned result, cumeDist() applied over the window outputs a new column with the cumulative distribution, and the rows can be filtered from there through select(), where(), or a SQL query.
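To show what these two window functions compute, here is a small plain-Python sketch of their usual SQL semantics. It is an illustration only, not Spark's implementation; the function names and the batting averages are hypothetical, and ascending order is assumed so that the highest bucket holds the top performers.

```python
def ntile(values, n):
    """Bucket number (1..n) per value, ordered ascending, NTILE-style.

    When the rows do not divide evenly, the first buckets get one extra row.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    base, extra = divmod(len(values), n)
    buckets = [0] * len(values)
    pos = 0
    for b in range(1, n + 1):
        size = base + (1 if b <= extra else 0)
        for i in order[pos:pos + size]:
            buckets[i] = b
        pos += size
    return buckets


def cume_dist(values):
    """Fraction of rows whose value is <= the current row's value."""
    total = len(values)
    return [sum(1 for w in values if w <= v) / total for v in values]


# Hypothetical batting averages for ten players.
averages = [0.250, 0.310, 0.275, 0.290, 0.200,
            0.330, 0.260, 0.300, 0.240, 0.280]

deciles = ntile(averages, 10)
# Players in bucket 10 form the top 10% under ascending order.
top_10_percent = [a for a, d in zip(averages, deciles) if d == 10]
print(deciles, top_10_percent)
```

In Spark itself the same grouping would come from the built-in window functions, so a hand-written version like this is only useful for checking your understanding on a small sample.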

Quicksort pivot choice

I've read that the pivot can be the median of 3 numbers: bottom, middle and top. But could that generate an overflow? What happens if the median returns a value larger than the array size?
I assume this choice is made on the assumption that the array values can't be larger than the array size.
I think I'm confused about what a pivot really is.
The pivot is just the value that you compare other values against: lower values go to the left, higher to the right. The pivot can be chosen by taking any of the existing values in the array. If the array is completely unsorted, it won't matter which value you choose. If it is somewhat sorted, you should choose a value from the middle of the array.
UPDATE: Some reading informs me that a better pivot choice may be the median of 3 values in the array (such as middle, bottom and top, or 3 random positions). Some people advocate taking the median of 5 values. The worst-case performance of quicksort occurs when the pivot is close to the smallest or largest value in the array, and this tactic is intended to defend against that. This is just an optimisation for certain kinds of data; it is not a necessity.
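As a minimal sketch of the median-of-3 idea (one possible variant, using a Lomuto-style partition; the function names are my own), the code below picks the pivot from positions bottom, middle and top. Note that the pivot is a value taken from the array and is only ever compared against other values, never used as an index, so a value larger than the array size causes no overflow.

```python
def median_of_three_index(a, lo, hi):
    """Index of the median among a[lo], a[mid], a[hi]."""
    mid = (lo + hi) // 2
    # Sort the three (value, index) pairs and take the middle one.
    return sorted([(a[lo], lo), (a[mid], mid), (a[hi], hi)])[1][1]


def quicksort(a, lo=0, hi=None):
    """Sort list a in place using a median-of-three pivot."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = median_of_three_index(a, lo, hi)
    a[p], a[hi] = a[hi], a[p]      # park the pivot at the end
    pivot = a[hi]
    store = lo
    for i in range(lo, hi):
        if a[i] < pivot:           # lower values go to the left
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]  # pivot into its final position
    quicksort(a, lo, store - 1)
    quicksort(a, store + 1, hi)


# The value 1000 is far larger than the array length; that is fine.
data = [9, 1, 1000, 4, 7, 3]
quicksort(data)
print(data)
```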