K-Means Clustering in PySpark

I have a really big data frame. I am considering 8 columns to make 8 clusters. Normally, the code would discard a row even if only 1 of the 8 column values is NaN. In my case, I have to write PySpark code that behaves in such a way that, even when a row has some NaN values, the code ignores them and takes the other column values into consideration, thereby performing clustering even for rows that have some NaN values. Any help or lead is highly appreciated.
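Spark's built-in KMeans has no notion of "ignore this NaN dimension for this row", so a common workaround is to impute the missing values first and keep every row. A minimal sketch, assuming 8 numeric columns with placeholder names c1..c8 and a made-up input path:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Imputer, VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("data.parquet")  # hypothetical input path

    cols = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]  # placeholder names
    imputed = [c + "_imp" for c in cols]

    # Replace NaN/null in each column with that column's mean, so no row is dropped.
    imputer = Imputer(inputCols=cols, outputCols=imputed, strategy="mean")
    df_filled = imputer.fit(df).transform(df)

    # Assemble the now-complete columns into a feature vector and cluster.
    assembler = VectorAssembler(inputCols=imputed, outputCol="features")
    features = assembler.transform(df_filled)

    model = KMeans(k=8, featuresCol="features", seed=1).fit(features)
    clustered = model.transform(features)  # adds a "prediction" cluster column

Mean-imputation does change the cluster geometry somewhat, but it keeps every row; truly skipping NaN dimensions per row would require a custom distance function, which the built-in KMeans does not expose.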


Complex multiple operations in Pyspark avoiding loops

My problem, simplified, is this:
I have several hundred subtables within one large PySpark dataset, visualized in the picture below. In those subtables there are missing values (marked in light blue) that I have to calculate. I executed one such operation, shown in the picture. To calculate that value, I have to sum up the three values to the left of the missing value and divide the sum by the value below. The formula is also shown in the picture.
There is no problem calculating the value in the picture, but in the next columns to the right, I have to use the results of the calculations in the columns to their left. My task is to fill out all the light-blue cells within one subtable, and to do this for all subtables.
My current solution is to export the PySpark dataset to pandas, create a for loop over all columns (and subtables), read out each column (and the three to its left) within one loop into a NumPy array, do the calculation on that column, and write the result back to the pandas dataframe. The algorithm is quite fast, but if the number of subtables exceeds about 800 I get a memory-overflow message.
Now I'm looking for a solution in PySpark (or pandas) that avoids the loop and that out-of-memory problem, because the number of subtables is more than 6,000!
[Figure: one subtable, with the formula for one missing value]
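One way to stay in PySpark is to let Spark run the per-subtable computation in parallel with groupBy(...).applyInPandas, so only one subtable at a time is held as a pandas frame. This is a sketch under assumptions not stated in the question: a subtable_id column identifying each subtable, a row_id column giving the row order, and grid columns named c0..cN.

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.read.parquet("subtables.parquet")  # hypothetical input

    def fill_missing(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf = pdf.sort_values("row_id").reset_index(drop=True)  # assumed order column
        cols = [c for c in pdf.columns if c.startswith("c")]    # the grid columns
        for j in range(3, len(cols)):                 # walk the columns left to right
            col = cols[j]
            for i in pdf.index[pdf[col].isna()]:      # only the light-blue cells
                left3 = pdf.loc[i, cols[j-3:j]].sum() # three values to the left
                below = pdf.loc[i + 1, col]           # value below (assumed present)
                pdf.loc[i, col] = left3 / below
        return pdf

    result = sdf.groupBy("subtable_id").applyInPandas(fill_missing, schema=sdf.schema)

The column-to-column dependency stays sequential inside each subtable, but the more than 6,000 subtables are processed independently across executors, so the whole dataset is never collected into driver memory at once.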

Sparse table in MATLAB, is it possible?

I am dealing with a matrix in MATLAB which is sparse and has many rows and columns. In this case, the rows and columns of the matrix are the ids for particular items. Let's call them id1 and id2.
It would be nice if the ids for rows and columns could be embedded, so I can access them easily without the need to create extra variables that keep the two ids.
The answer would probably be to use a table data type. Tables are an ideal answer for my need; however, I was wondering if I could create a table data type for a sparse matrix?
A [m*n] sparse matrix %% m & n are huge
id1 [1*m], id2 [1*n] %% two vectors containing numeric ids for rows and columns
Could we obtain
T [m*n] sparse table matrix?
Thanks for sharing your view with me.
I will address the question and the comments in order to clear some confusion.
The short answer
There is no sparse table class in MATLAB. It cannot be done. Use sparse() matrices.
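The usual workaround is what the question already sketches: keep the sparse matrix and the two id vectors side by side, with a lookup from id to index. Purely for illustration, here is that pattern in Python/SciPy (toy ids and values); the same idea carries over to MATLAB with sparse() plus containers.Map or ismember.

    import numpy as np
    from scipy.sparse import lil_matrix

    id1 = np.array([101, 205, 999])              # toy row ids
    id2 = np.array([11, 42])                     # toy column ids
    row_of = {v: i for i, v in enumerate(id1)}   # id -> row index
    col_of = {v: j for j, v in enumerate(id2)}   # id -> column index

    A = lil_matrix((len(id1), len(id2)))  # sparse storage; ids kept separately
    A[row_of[205], col_of[42]] = 3.14     # address cells by the original ids

    print(A[row_of[205], col_of[42]])     # look up by id -> 3.14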
The long answer
There is a reason why sparse tables make little sense:
Philosophically speaking, the advantage of having nice row and column labels is completely lost if you are working with a big panel of data and/or if the data is sparse.
Scrolling through 246829 rows and 33336 columns? That can only be useful at very isolated times, if you are debugging your code and a specific outlier is causing your results to go off. Also, all you might see is just a sea of zeros.
Technically, a table can have more columns for the same variable, i.e. table(rand(10,2), rand(10,1)) is a valid table. How would you define sparsity on such a table?
Fine, suppose you are working with a matrix-like table, i.e. one element per table cell and the same numeric class throughout. Still, none of the algebraic operators are defined on a table(). So you need to extract the content first, in order to be able to perform any operation that spans more than a single column of data. Just to be clear: once the data is extracted, you have e.g. your double (full) matrix or, in the ideal case, a double sparse matrix.
Now, a few misconceptions to clear:
Fewer variables imply clearer/cleaner code. Not true. You are probably thinking about the extreme case (of bad practice) of "how do I make a series of variables a1, a2, a3, etc.".
There is a sweet spot between verbosity, number of variables, amount of comments, and code clarity/maintainability. Only with time and experience do you find the right balance.
Control over data cannot go without visual inspection. This approach does NOT scale with big data, and the sooner you abandon it, the sooner your code will become more reliable. You need to verify your results systematically, rather than relying on visual inspection. The chance of failing to (visually) spot a problem in the data grows exponentially with its dimension, far faster than with systematic tests.
Some background info on my work:
I work with high-frequency prices, that's terabytes of data. I also extended the table() class with additional methods and fixes to help me with my work (see https://github.com/okomarov/tableutils), but I do not see how sparsity is a useful feature to add to table().

Big numbers and long loops in MATLAB?

How can I store a matrix with 2^100 rows in MATLAB? It is my search space and I really need to do it.
In your opinion, is it possible? If yes, please help me with how I can do it.
2^100 is about 10^30, which is much too large for you to fit in memory - so you won't be able to store this matrix.
A couple of alternatives that you might want to think about -
Are many of the entries in the matrix zero? If so, you could consider using a sparse matrix which is much more memory efficient.
Do you need to be able to access the rows in an arbitrary order, or sequentially? If sequentially, you can generate the rows on an as-needed basis (perhaps in blocks of ten thousand at a time).
Do you need to look at all the rows at all? If not, perhaps you can define a function which generates the entries on the fly, as they are requested (see the sketch below).
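As a toy sketch of that last idea (the function names and block size are made up), each row of the implicit 2^100-row {0,1} matrix can be derived from its index, so any row or block of rows can be produced on demand without ever storing the matrix:

    def row(index, n_bits=100):
        """Return row `index` of the implicit 2^100-row matrix as a list of bits."""
        return [(index >> b) & 1 for b in range(n_bits)]

    def blocks(start, stop, block_size=10_000, n_bits=100):
        """Yield rows start..stop-1 in blocks, for sequential processing."""
        for lo in range(start, stop, block_size):
            hi = min(lo + block_size, stop)
            yield [row(i, n_bits) for i in range(lo, hi)]

    # Process the first 30,000 rows without ever holding 2^100 of them.
    for block in blocks(0, 30_000):
        pass  # e.g. evaluate your objective on each row in `block`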

How can I find the minimum number of lines needed to cover all the zeros in a 2 dimensional array?

I'm trying to make a decent implementation of the Hungarian algorithm; however, I'm stuck at how to find the minimum number of lines that cover all the zeros in an array.
I also need to know these lines to make some computations later.
Here is the explanation:
http://www.ams.jhu.edu/~castello/362/Handouts/hungarian.pdf
In step 3 it says:
Use as few lines as possible to cover all the zeros in the matrix. There is no easy rule to do this - basically trial and error.
What does trial and error mean in terms of computation? If I have, for example, a 2D array of 5 rows and 5 columns, then
the first row can cover all the zeros, or the first and second rows, or the first row and first column, etc. - too many combinations.
Isn't there something more efficient than this?
Thanks in advance.
You need to implement a bipartite matching algorithm here. You have two partitions in the graph: the vertices of one partition represent the rows and the vertices of the other represent the columns of the table. There is an edge between row_i and column_j iff there is a zero in cell (i,j). Then you create a maximum bipartite matching. After the last iteration of the matching algorithm you need to figure out which vertices are connected to the source via an augmenting (incremental) path, i.e. a path that uses only edges with positive residual capacity. You should have a picture like:
source -> row_1 ... row_n      (an edge from the source to every row vertex)
row_i  -> col_j                (an edge for every zero in cell (i, j))
col_1 ... col_m -> sink        (an edge from every column vertex to the sink)
After you find the maximum bipartite matching, find which rows and columns are reachable from the source via edges with positive residual capacity. The minimum set of rows and columns you need to scratch out is then found as follows: you cross out all the rows that are NOT reachable from the source and all the columns that ARE reachable, and this is your optimal solution (this is Koenig's theorem).
Hope this answer helps you.
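For concreteness, here is a sketch of that procedure in Python (mine, not the original answerer's code): Kuhn's augmenting-path matching over the zero cells, followed by the Koenig-style marking described above. The function name and the toy matrix are made up for illustration.

    def min_zero_cover(matrix):
        n, m = len(matrix), len(matrix[0])
        # Bipartite edges: row i -- column j for every zero cell (i, j).
        zeros = [[j for j in range(m) if matrix[i][j] == 0] for i in range(n)]

        match_col = [-1] * m  # match_col[j] = row matched to column j, or -1

        def augment(i, seen):
            """Try to match row i, reassigning earlier matches along the way."""
            for j in zeros[i]:
                if not seen[j]:
                    seen[j] = True
                    if match_col[j] == -1 or augment(match_col[j], seen):
                        match_col[j] = i
                        return True
            return False

        for i in range(n):
            augment(i, [False] * m)  # build a maximum matching row by row

        matched_row = [-1] * n
        for j, i in enumerate(match_col):
            if i != -1:
                matched_row[i] = j

        # Koenig marking: start alternating paths from every unmatched row.
        row_mark, col_mark = [False] * n, [False] * m
        stack = [i for i in range(n) if matched_row[i] == -1]
        for i in stack:
            row_mark[i] = True
        while stack:
            i = stack.pop()
            for j in zeros[i]:               # row -> column along zero edges
                if not col_mark[j]:
                    col_mark[j] = True
                    i2 = match_col[j]        # column -> row along matching edge
                    if i2 != -1 and not row_mark[i2]:
                        row_mark[i2] = True
                        stack.append(i2)

        cover_rows = [i for i in range(n) if not row_mark[i]]  # unmarked rows
        cover_cols = [j for j in range(m) if col_mark[j]]      # marked columns
        return cover_rows, cover_cols

    mat = [[0, 0, 0],
           [1, 0, 1],
           [1, 0, 1]]
    print(min_zero_cover(mat))  # ([0], [1]): cross out row 0 and column 1

By Koenig's theorem the number of returned lines equals the size of the maximum matching, so the cover is minimal.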
I'm not quite sure why they told you to do trial and error. The Hungarian algorithm, however, does not take exponential time. Take a look at wikipedia, which walks you through an example of how to figure out the minimum number of lines (look at Step 3):
http://en.wikipedia.org/wiki/Hungarian_algorithm#Matrix_interpretation
The article also includes links to implementations, and some online course notes which give more complete (although also more technical) descriptions of the Hungarian algorithm than the one you're using.
Trial and error means exponential complexity - in the worst case you end up trying every combination of the n+m candidate lines.
At most you will only need to pick min(n,m) lines, as selecting all rows or all columns will cover all the zeros.
I would implement a dynamic programming algorithm to solve this step for large problems.

What element of the array would be the median if the size of the array was even and not odd?

I read that it's possible to make quicksort run in O(n log n).
The algorithm says on each step to choose the median as a pivot.
But suppose we have this array:
10 8 39 2 9 20
Which value will be the median?
In math, if I remember correctly, the median is (39+2)/2 = 41/2 = 20.5.
I don't have 20.5 in my array, though.
Thanks in advance.
You can choose either of them; asymptotically, as the input scales up, it does not matter which one you pick.
We're talking about the exact wording of the description of an algorithm here, and I don't have the text you're referring to. But I think in context, by "median" they probably meant not the mathematical median of the values in the list, but rather the middle point of the list, i.e. the median INDEX, which in this case would be 3 or 4. As coffNjava says, you can take either one.
The median is actually found by sorting the array first, so in your example the median is found by arranging the numbers as 2 8 9 10 20 39, and the median would be the mean of the two middle elements, (9+10)/2 = 9.5, which doesn't help you at all. Using the median is sort of an ideal situation, but it would work if the array were at least already partially sorted, I think.
With an even-numbered array, you can't find an exact pivot point, so I believe you can use either of the middle numbers. It'll throw off the efficiency a bit, but not substantially, unless you always end up sorting even arrays.
Finding the median of an unsorted set of numbers can be done in O(N) time, but it's not really necessary to find the true median for the purposes of quicksort's pivot. You just need to find a pivot that's reasonable.
As the Wikipedia entry for quicksort says:
In very early versions of quicksort, the leftmost element of the partition would often be chosen as the pivot element. Unfortunately, this causes worst-case behavior on already sorted arrays, which is a rather common use-case. The problem was easily solved by choosing either a random index for the pivot, choosing the middle index of the partition or (especially for longer partitions) choosing the median of the first, middle and last element of the partition for the pivot (as recommended by R. Sedgewick).
Finding the median of three values is much easier than finding it for the whole collection of values, and for collections that have an even number of elements, it doesn't really matter which of the two 'middle' elements you choose as the potential pivot.
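As a small illustration (my sketch, not from the answer), here is a median-of-three pivot pick applied to the asker's array; note that it never needs a fractional "median" - it just returns one of the three sampled positions:

    def median_of_three(a, lo, hi):
        """Return the index (lo, mid, or hi) holding the median of the three samples."""
        mid = (lo + hi) // 2  # for an even-length range this is one of the two middles
        x, y, z = a[lo], a[mid], a[hi]
        if (x <= y <= z) or (z <= y <= x):
            return mid
        if (y <= x <= z) or (z <= x <= y):
            return lo
        return hi

    a = [10, 8, 39, 2, 9, 20]
    p = median_of_three(a, 0, len(a) - 1)  # samples 10, 39, 20 -> median is 20
    print(p, a[p])                         # prints: 5 20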