aggregate sets into bigger sets in pyspark

In my dataset, I have a column A that is a list of integers representing a set of integers.
I want to aggregate all these sets into the set that represents the union (still represented as a list I guess).
Is there a "simple" way to do this?
What I've done is:
agg(array_distinct(flatten(collect_list("A"))))
but this seems inefficient, since at some point the fully flattened list, with all its duplicates, will exist in memory.
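Purely as a hedged sketch (the column name A comes from the question; the grouping key and sample data are made up), here is what that expression might look like, together with an alternative that explodes the arrays and aggregates with collect_set, so deduplication happens during the aggregation rather than after flattening:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: a grouping key and a column A holding lists of integers.
df = spark.createDataFrame(
    [("k1", [1, 2, 3]), ("k1", [2, 4]), ("k2", [5])],
    ["key", "A"],
)

# The approach from the question: flatten all lists, then deduplicate.
flat_union = df.groupBy("key").agg(
    F.array_distinct(F.flatten(F.collect_list("A"))).alias("A_union")
)

# Alternative: explode first, so collect_set deduplicates while aggregating.
exploded = df.select("key", F.explode("A").alias("elem"))
set_union = exploded.groupBy("key").agg(F.collect_set("elem").alias("A_union"))

set_union.show()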

Related

Complex multiple operations in Pyspark avoiding loops

My problem, simplified, is this:
I have several hundred subtables within one large PySpark dataset, visualized in the picture below. In those subtables there are missing values (marked in light blue) that I have to calculate. I executed one such operation, shown in the picture: to calculate the missing value, I have to sum up the three values to its left and divide by the value below it. The formula is also shown in the picture.
There is no problem calculating the value in the picture, but in the columns further to the right I have to use the results of the calculations in the columns to their left. My task is to fill out all the light-blue cells within one subtable, and to do so for all subtables.
My current solution is to export the PySpark dataset to pandas, create a for loop over all columns (and subtables), read each column (and the three to its left) out to a numpy array within the loop, do the calculation on that column, and write the result back to the pandas dataframe. The algorithm is quite fast, but if the number of subtables exceeds about 800 I get a memory overflow message.
Now I'm looking for a solution in PySpark (or pandas) that avoids the loop and the out-of-memory problem, because the number of subtables is more than 6,000!
(Image: one subtable with the formula for one missing value)
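As a rough, hypothetical sketch only (the real layout is defined by the picture, which is not reproduced here, so the row and column positions below are assumptions), the column-by-column pandas loop described above might look like this:

import numpy as np
import pandas as pd

# Hypothetical subtable: row 0 holds the values, with missing cells as NaN;
# row 1 holds "the value below" used as the divisor.
sub = pd.DataFrame([
    [1.0, 2.0, 3.0, np.nan, np.nan],
    [2.0, 2.0, 2.0, 2.0,    2.0],
])

# Fill missing cells from left to right: sum of the three cells to the left,
# divided by the cell below. Later columns reuse values just computed.
for col in range(3, sub.shape[1]):
    if pd.isna(sub.iat[0, col]):
        sub.iat[0, col] = sub.iloc[0, col - 3:col].sum() / sub.iat[1, col]

print(sub)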

Show number of elements in multiple sets in a chart

I create about 10 sets using my Tableau data. I want to show the number of elements in all sets in a chart, for example a bubble chart or a bar chart. When I move a single set to the sheet, select the number of records, and filter to the 'In' elements, I can see the number of elements in the set; however, I want to see the number of records in multiple sets simultaneously.
When I try to put multiple sets into, for example, a bubble chart, Tableau creates one single bubble instead of multiple bubbles.
Sets are very useful, but they may not be the best approach when you have a very large number of similar groupings to compare side by side and you are using them as dimensions.
Remember that the purpose of dimensions is to partition your data into non-overlapping blocks prior to aggregating measures. Since a data row may belong to multiple sets, using sets as dimensions doesn't fit the particular application you describe (though using sets as filters or as building blocks for calculations might).
So here is one approach that will give you some flexibility. Define a calculated field for each set that returns 1 if the record is in the set, and null otherwise. (One way to think of a set is as a boolean function.)
Number of Set 1 Records
if [Set_1] then 1 end
Then you can use SUM([Number of Set 1 Records]) as a measure as desired. You can use Measure Values to display multiple measures together.
This way your set definitions are used for calculating your measures, but not for partitioning the data rows.
If your sets are completely defined by a condition, and this is the only way you use them, you could simplify by using the condition directly in the calculated fields above and not creating the corresponding sets.

plans and subset of rows in pyfftw

I want to use pyfftw to repeatedly compute the discrete Fourier transform of a subset of rows of a two-dimensional array. I do not know in advance which rows I need to transform; that depends on the output from the previous round. I do know that doing it for all rows is wasteful.
It is my understanding that a 'plan' in FFTW3 is associated with the type of transform (c2c, r2c, etc.) and the input/output length, which is always a vector in the 1D case. In pyfftw it looks like a 'plan' is associated with the type of transform and the input/output shape, so my interpretation is that it uses the same FFTW3 plan for every row.
My question is: is it possible to use the same FFTW3 plan for some of the rows, without creating separate pyfftw.FFTW objects for all possible combinations of rows?
On a different note, I am wondering how pyfftw uses multiple cores: does it use multiple cores for each row (this appears natural in view of FFTW3 documentation) or does it farm out different rows to different cores (which was my initial assumption)?
If you can create a numpy array from a view, you can plan for it with pyFFTW - all valid numpy arrays should work just fine.
This means several things:
Your array needs to have regular strides, but those strides can be arbitrary.
ND arrays are planned as ND transforms, with the selected axes being used.
You can probably do something cunning with stride tricks and it will probably work (but might not do what you expect if you do something too nefarious like overlapping rows and then use threads).
One solution that I've used quite a bit is to copy the rows that you want to transform into an interim array, and transforming that. You might well find that's the fastest option (particularly when you can allow for getting the byte offset correct).
Obviously, this doesn't work if you always have a different number of rows. You might still find that if you plan for the largest number of rows that get transformed and then copy a subset in, it is still faster than the alternatives.
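A minimal sketch of that interim-array approach, assuming complex-to-complex transforms along rows; the sizes, thread count, and chosen rows below are made up for illustration:

import numpy as np
import pyfftw

n_rows_max, n_cols = 64, 1024

# Plan once for the largest number of rows you expect to transform.
buf_in = pyfftw.empty_aligned((n_rows_max, n_cols), dtype='complex128')
buf_out = pyfftw.empty_aligned((n_rows_max, n_cols), dtype='complex128')
fft_rows = pyfftw.FFTW(buf_in, buf_out, axes=(1,), threads=4)

data = np.random.standard_normal((500, n_cols)) + 0j
rows = [3, 17, 42]                  # rows chosen by the previous round

buf_in[:len(rows)] = data[rows]     # copy the selected rows into the buffer
buf_in[len(rows):] = 0              # unused rows are transformed too, so zero them
fft_rows()                          # execute the plan
result = buf_out[:len(rows)].copy()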
The problem you're going to come up against, even if you go down to the C level, is that the planning overhead might well dominate if you're changing your transform sizes often.
You could also try pyfftw.interfaces.numpy_fft, which is normally faster than numpy and has the ability to cache repeated transform sizes.
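For example, a short sketch of that drop-in interface with the cache enabled (the axis and data here are assumptions):

import numpy as np
import pyfftw.interfaces.cache
import pyfftw.interfaces.numpy_fft

pyfftw.interfaces.cache.enable()    # cache plans so repeated transform sizes are fast

data = np.random.standard_normal((500, 1024)) + 0j
rows = [3, 17, 42]
out = pyfftw.interfaces.numpy_fft.fft(data[rows], axis=1)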

Intersecting two tables on common first-column elements in MATLAB

I have two different tables (.csv files).
I need to merge these two tables in MATLAB, intersecting the first columns of both tables. I want to make a new table with six columns (the combined columns of both tables), whose number of rows equals the number of elements common to the first columns of the two tables.
How should I do the intersection and merging of these two tables?
I'm proposing an answer. I'm not claiming it is the best answer. In fact, I think there are probably much faster ones! Also note that I do not have MATLAB in front of me right now and can't test this. There might be some mistakes.
First of all, read the .csv files into memory. In table 1, convert the first column into numeric data (currently, it looks like it contains strings). After this step, you want to have two double matrices, which I'll call table1 (3296x5) and table2 (3184x3).
Second (this is where it gets mildly interesting; step 1 was the boring stuff), find all IDs that are common to both tables. This can be done by calling commonIDs = intersect(table1(:,1), table2(:,1));
Third, get the indices of the common rows for table1, then repeat for table2. This is done using the ismember function as follows:
goodEntries1=ismember(table1(:,1),commonIDs);
goodEntries2=ismember(table2(:,1),commonIDs);
Lastly, we extract data and combine to get a result. Note that I only include the ID column once:
result=[table1(goodEntries1,:) table2(goodEntries2,2:end)];
You will need to test this to make sure it is robust. I think that this will keep the right rows together, but depending on how ismember works, you might end up combining rows out of order (for instance, table1's ID=5 with table2's ID=6).

Hash a Sequence of positive/negative integers

I have a file with millions of lines (actually it's an online stream of data, which means we receive it line by line). Each line consists of an array of integers that is not sorted (positive and negative values); there is no limit on the numbers, the lengths differ, and we might have duplicate values within one line.
I want to remove the duplicate lines (if two lines have the same values, regardless of how they are ordered, we consider them duplicates). Is there any good hashing function for this?
We want to do this in O(n), where n is the number of lines (we can assume that the maximum possible number of elements in each line is constant, e.g. we have a maximum of 100 elements per line).
I've read some of the questions posted here on Stack Overflow and I also googled it, but most of them cover cases where the arrays have the same length, or the integers are positive, or they are sorted. Is there any way to solve this in the general case?
My solution:
First we sort each line using an O(n) sorting algorithm, e.g. counting sort; then we put the sorted values into a string and use MD5 hashing to put it into a hash set. If it's not in the set, we add it; if it's already in the set, we compare the arrays that have the same hash value.
Problem with this solution: sorting with counting sort takes a lot of space, since there is no limit on the numbers, and collisions are possible.
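A minimal Python sketch of that idea (not the poster's code: it uses the built-in sort instead of counting sort, since the values are unbounded, and keeps the sorted tuples per hash bucket so colliding lines can be compared directly):

import hashlib

seen = {}            # md5 digest -> list of canonical tuples sharing that digest
unique_lines = []

def canonical(values):
    # Order-independent representation; duplicate values within a line are kept.
    return tuple(sorted(values))

def is_duplicate(values):
    key = canonical(values)
    digest = hashlib.md5(repr(key).encode()).hexdigest()
    bucket = seen.setdefault(digest, [])
    if key in bucket:                 # same hash: compare the actual arrays
        return True
    bucket.append(key)
    return False

for line in [[3, -1, 2], [-1, 2, 3], [5, 5, 0]]:   # stand-in for the stream
    if not is_duplicate(line):
        unique_lines.append(line)

print(unique_lines)   # [[3, -1, 2], [5, 5, 0]]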
The problem with using a hashing algorithm on a set of data this large is that you have a high probability of two different lines hashing to the same value. You want to stay in O(n), but I am not sure that is possible with the size of the data and the accuracy needed. If you use heapsort, which is space efficient, and then traverse down the newly sorted data removing consecutive lines that are the same, you could accomplish this in O(n log n).