How can I find the minimum number of lines needed to cover all the zeros in a 2 dimensional array? - hungarian-algorithm

I'm trying to write a decent implementation of the Hungarian algorithm, but I'm stuck on how to find the minimum number of lines that cover all the zeros in an array.
I also need to know which lines these are, because I use them in some computations later.
Here is the explanation I'm following:
http://www.ams.jhu.edu/~castello/362/Handouts/hungarian.pdf
in step 3 it says
Use as few lines as possible to cover all the zeros in the matrix. There is no easy rule to do this - basically trial and error.
What does trial and error mean in terms of computation? If I have, for example, a 2D array of 5 rows and 5 columns, then
the first row alone might cover all the zeros, or the first and second rows, or the first row and the first column, and so on. That is far too many combinations.
Isn't there something more efficient than this?
Thanks in advance.

You need to implement a bipartite matching algorithm here. The graph has two partitions: the vertices in one represent the rows of the table and the vertices in the other represent the columns. There is an edge between row_i and column_j iff there is a 0 in cell (i,j), since the zeros are what you want to cover. Then you compute a maximum bipartite matching. After the last iteration of the matching algorithm you need to figure out which vertices are connected to the source via an incremental path, that is, a path using only edges with positive (residual) capacity. You should have a picture like:
            row_1              col_1
          /   row_2            col_2   \
source --     ...  (some edges)  ...    -- sink
          \   row_n            col_m   /
After you find the maximum bipartite matching, find which rows and columns are reachable from the source via positive-capacity edges. The minimum set of rows and columns you need to cross out is then found as follows: cross out all the rows that are not reachable from the source and all the columns that are reachable from it. This is your optimal solution.
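The procedure in this answer can be sketched in Python. This is only an illustration under some assumptions: it uses simple augmenting-path matching (Kuhn's algorithm) instead of an explicit max-flow network, and the alternating search from unmatched rows plays the role of "reachable from the source". The function name min_line_cover is made up for this example.

```python
def min_line_cover(matrix):
    """Return (rows, cols): a minimum set of lines covering every zero."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    # adj[i] lists the columns j with matrix[i][j] == 0 (the graph's edges).
    adj = [[j for j in range(n_cols) if matrix[i][j] == 0]
           for i in range(n_rows)]
    match_row = [-1] * n_rows  # column matched to each row, or -1
    match_col = [-1] * n_cols  # row matched to each column, or -1

    def try_augment(i, seen):
        # Look for an augmenting path that starts at row i.
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_col[j] == -1 or try_augment(match_col[j], seen):
                match_row[i] = j
                match_col[j] = i
                return True
        return False

    for i in range(n_rows):
        try_augment(i, set())

    # Alternating search from every unmatched row:
    # row -> column along any zero, column -> row along its matched edge.
    stack = [i for i in range(n_rows) if match_row[i] == -1]
    visited_rows = set(stack)
    visited_cols = set()
    while stack:
        i = stack.pop()
        for j in adj[i]:
            if j in visited_cols:
                continue
            visited_cols.add(j)
            r = match_col[j]
            if r != -1 and r not in visited_rows:
                visited_rows.add(r)
                stack.append(r)

    # Cover = rows NOT reached, plus columns that WERE reached.
    cover_rows = [i for i in range(n_rows) if i not in visited_rows]
    cover_cols = sorted(visited_cols)
    return cover_rows, cover_cols


example = [[0, 1, 1],
           [0, 2, 3],
           [0, 4, 0]]
rows_to_cross, cols_to_cross = min_line_cover(example)
```

The number of lines returned equals the size of the maximum matching, so if it is smaller than the matrix dimension the Hungarian algorithm still has work to do on the uncovered entries.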
Hope this answer helps you.

I'm not quite sure why they told you to do trial and error. The Hungarian algorithm does not take exponential time. Take a look at Wikipedia, which walks you through an example of how to figure out the minimum number of lines (look at Step 3):
http://en.wikipedia.org/wiki/Hungarian_algorithm#Matrix_interpretation
The article also includes links to implementations, and some online course notes which give more complete (although also more technical) descriptions of the Hungarian algorithm than the one you're using.

Trial and error means trying subsets of the n+m possible lines, which is exponential: up to 2^(n+m) candidate subsets.
At most you will only need to pick min(n,m) lines, as selecting all rows or columns will cover all 0s.
I would implement a dynamic programming algorithm to solve this step for large problems.

Related

Generate subset of data with known mean

I have a dataset of n observations (nx1 vector) and would like to create a subset of this data, whose mean is known in advance, by selecting at random only n/3 observations (or within some constraint, ie where the mean of the data subset is within a range about the known mean).
Can someone please help me with the code to do this in MATLAB?
Note, I don't want to use the rand function to create random data as I already have my data collected.
For example on a smaller scale: If I had the following dataset of 12 observations:
data = [8;7;4;6;9;6;4;7;3;2;1;1];
but then wanted to randomly select a subset of this data containing only 4 observations with a mean of 4 (or with a mean between 3.5-4.5 for example):
Then the answer might be datasubset=[7;3;2;4] but the answer could also be datasubset=[6;4;2;4] or datasubset=[6;4;3;4].
It doesn't matter if there are several possible solutions, I just need one of them, though I'd like to know the alternative solutions also.
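One way to approach this is rejection sampling: draw random subsets of the existing data until one has a mean inside the accepted range. The sketch below is in Python rather than MATLAB and is only an illustration; the function names and the max_tries limit are assumptions made for this example. For small inputs, the second function lists every qualifying subset by brute force.

```python
import random
from itertools import combinations


def sample_subset_with_mean(data, k, lo, hi, max_tries=100_000, rng=None):
    """Draw k observations (without replacement) whose mean lies in [lo, hi].

    Returns None if no such subset is found within max_tries draws.
    """
    rng = rng or random.Random()
    for _ in range(max_tries):
        subset = rng.sample(data, k)
        if lo <= sum(subset) / k <= hi:
            return subset
    return None


def all_subsets_with_mean(data, k, lo, hi):
    """Enumerate every k-subset (by position) with mean in [lo, hi].

    Only practical for small inputs: there are C(len(data), k) candidates.
    """
    return [
        [data[i] for i in idx]
        for idx in combinations(range(len(data)), k)
        if lo <= sum(data[i] for i in idx) / k <= hi
    ]


data = [8, 7, 4, 6, 9, 6, 4, 7, 3, 2, 1, 1]
one_solution = sample_subset_with_mean(data, 4, 3.5, 4.5)
alternatives = all_subsets_with_mean(data, 4, 3.5, 4.5)
```

Rejection sampling gets slow when the target range is narrow or far from the overall mean of the data, so the try limit matters for real datasets.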

Sparse table in MATLAB, is it possible?

I am dealing with a matrix in MATLAB which is sparse and has many rows and columns. In this case, the rows and columns of the matrix correspond to the ids of particular items. Let's call them id1 and id2.
It would be nice if the ids for rows and columns could be embedded, so that I can access them easily without creating extra variables to hold the two ids.
The answer would probably be to use a table data type. Tables are an ideal fit for my need; however, I was wondering whether I can create a table data type for a sparse matrix?
A [m*n] sparse matrix %% m & n are huge
id1 [1*m] , id2 [1*n] %% two vectors containing numeric ids for rows and column
Could we obtain?
T [m*n] sparse table matrix
Thanks for sharing your view with me.
I will address the question and the comments in order to clear some confusion.
The short answer
There is no sparse table class in Matlab. Cannot do. Use sparse() matrices.
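For the narrower need in the question (looking up entries by id1 and id2), keeping two id-to-index lookups next to the sparse data is usually enough. The following is a minimal sketch of that idea in Python, not MATLAB; the LabeledSparse class and the ids used are hypothetical, and a plain dictionary stands in for a real sparse matrix.

```python
class LabeledSparse:
    """Dict-of-keys sparse storage addressed by row and column ids.

    Illustrative only: the ids are kept in two lookup dictionaries
    beside the data, not inside a table type.
    """

    def __init__(self, row_ids, col_ids):
        self.row_index = {rid: i for i, rid in enumerate(row_ids)}
        self.col_index = {cid: j for j, cid in enumerate(col_ids)}
        self.values = {}  # (row position, column position) -> nonzero value

    def set(self, row_id, col_id, value):
        key = (self.row_index[row_id], self.col_index[col_id])
        if value == 0:
            self.values.pop(key, None)  # zeros are never stored
        else:
            self.values[key] = value

    def get(self, row_id, col_id):
        key = (self.row_index[row_id], self.col_index[col_id])
        return self.values.get(key, 0)


# Hypothetical ids, chosen only for the example.
table = LabeledSparse(row_ids=[101, 205], col_ids=[7, 9, 11])
table.set(205, 9, 3.5)
value = table.get(205, 9)
```

Only the nonzero entries and the two id lookups take memory, which is the same trade-off as holding a sparse matrix plus two id vectors.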
The long answer
There is a reason why sparse tables make little sense:
Philosophically speaking, the advantage of having nice row and column labels, is completely lost if you are working with a big panel of data and/or if the data is sparse.
Scrolling through 246829 rows and 33336 columns? That can only be useful on very isolated occasions, when you are debugging your code and a specific outlier is causing your results to go off. Also, all you might see is just a sea of zeros.
Technically a table can have several columns for the same variable, i.e. table(rand(10,2), rand(10,1)) is a valid table. How would you define sparsity on such a table?
Fine, suppose you are working with a matrix-like table, i.e. one element per table cell and same numeric class. Still, none of the algebraic operators are defined on a table(). So you need to extract the content first, in order to be able to perform any operation that spans more than a single column of data. Just to be clear, once the data is extracted, then you have e.g. your double (full) matrix or in an ideal case a double sparse matrix.
Now, a few misconceptions to clear:
Fewer variables implies clearer/cleaner code. Not true. You are probably thinking about the extreme case (in bad practices) of "how do I make a series of variables a1, a2, a3, etc."
There is a sweet spot between verbosity and number of variables, amount of comments, and code clarity/maintainability. Only with time and experience do you find the right balance.
Control over data cannot go without visual inspection. This approach does NOT scale with big data, and the sooner you abandon it, the sooner your code will become more reliable. You need to verify your results systematically rather than relying on visual inspection. The chance of failing to (visually) spot a problem in the data grows exponentially with its dimension, much faster than it does with systematic tests.
Some background info on my work:
I work with high-frequency prices, that's terabytes of data. I also extended the table() class with additional methods and fixes to help me with my work (see https://github.com/okomarov/tableutils), but I do not see how sparsity is a useful feature to add to table().

MATLAB: Using CONVN for moving average on Matrix

I'm looking for a bit of guidance on using CONVN to calculate moving averages in one dimension on a 3d matrix. I'm getting a little caught up on the flipping of the kernel under the hood and am hoping someone might be able to clarify the behaviour for me.
A similar post that still has me a bit confused is here:
CONVN example about flipping
The Problem:
I have daily river flow and weather data for a watershed at different source locations.
The matrix is laid out as follows:
dim 1 (the rows) represent each site
dim 2 (the columns) represent the date
dim 3 (the pages) represent the different type of measurement (river height, flow, rainfall, etc.)
The goal is to try and use CONVN to take a 21 day moving average at each site, for each observation point for each variable.
As I understand it, I should just be able to use a kernel such as:
ker = ones(1,21) ./ 21.;
mat = randn(150,365*10,4);
avgmat = convn(mat,ker,'valid');
I tried playing around and created another kernel which should also work (I think) and set ker2 as:
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat,ker2,'valid');
The question:
The results don't quite match and I'm wondering if I have the dimensions incorrect here for the kernel. Any guidance is greatly appreciated.
Judging from the context of your question, you have a 3D matrix and you want to find the moving average of each row independently over all 3D slices. The code above should work (the first case). However, the valid flag returns a matrix whose size is valid in terms of the boundaries of the convolution. Take a look at the first point of the post that you linked to for more details.
Specifically, 20 entries per row will be missing because of the valid flag: a length-21 kernel first fits entirely inside a row once it reaches the 21st entry, so each row of the output has 365*10 - 21 + 1 = 3630 entries, and it's only from that point that you get valid results (no pun intended). If you'd like to see the entries at the boundaries, use the 'same' flag to keep the output the same size as the input, or the 'full' flag (which is the default), which gives you the output starting from the most extreme outer edges. Bear in mind that at the boundaries the moving average is computed against a bunch of implicit zeroes, so those edge entries wouldn't be what you expect anyway.
However, if I'm interpreting what you are asking correctly, then the valid flag is what you want, but bear in mind that you will have 20 entries missing per row to accommodate the edge cases. All in all, your code should work, but be careful how you interpret the results.
BTW, you have a symmetric kernel, and so flipping should have no effect on the convolution output. What you have specified is a standard moving averaging kernel, and so convolution should work in finding the moving average as you expect.
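To make the size bookkeeping concrete, here is a small one-dimensional sketch in Python (not MATLAB) of what a 'valid' convolution does to a single row. The function conv_valid is written for this example and is not a library call; the row of 100 values is made-up test data.

```python
def conv_valid(signal, kernel):
    """1-D 'valid' convolution: keep only full-overlap positions."""
    n, k = len(signal), len(kernel)
    flipped = kernel[::-1]  # true convolution flips the kernel
    return [
        sum(signal[start + t] * flipped[t] for t in range(k))
        for start in range(n - k + 1)
    ]


window = 21
kernel = [1.0 / window] * window      # same role as ones(1,21) ./ 21
row = [float(x) for x in range(100)]  # one site, one variable, 100 days

avg = conv_valid(row, kernel)

# A constant kernel reads the same in both directions,
# so the flip inside the convolution changes nothing.
flip_is_noop = kernel == kernel[::-1]

# The output is shorter than the input by window - 1 entries.
lost = len(row) - len(avg)
```

Applied along dimension 2 of a 150 x 3650 x 4 array, the same rule gives a 150 x 3630 x 4 result.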
Good luck!

tf-idf - accessing a large sparse scipy matrix & getting the highest values

For the tfidf result matrix, I want to get the top tfidf values. I saw how one can set a maximum number of features for the tfidf vectorizer, but that keeps the words with the top tf count. I still want the high tfidf values, which could include words with low tf. One idea I looked up is doing something like tf_idf_matrix.sum(axis=0), which sums up the columns. This works in my code, but because there are 113k columns, print won't show them all. If I could use something like argsort() to access the top K column sum values, that would be helpful.
This question stems off my original question which is here.
The reason is that I want to know which words I should look at more closely, and those are not necessarily the ones with the highest frequency. I would also like to know about the "anomalies", that is, words that might not appear in all or many documents/posts but could have a high tfidf in one or a few documents. I wanted to explain this in case there are other approaches I should consider.
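The "sum the columns, then take the top K" idea can be sketched without printing all 113k sums. This is a plain-Python illustration that assumes each document is given as a dictionary mapping column index to tfidf value; the function name and the toy data are made up for the example, and it does not use the scipy matrix directly.

```python
import heapq
from collections import defaultdict


def top_k_columns(sparse_rows, k):
    """Return the k (column, summed value) pairs with the largest sums.

    sparse_rows: iterable of dicts {column_index: tfidf_value},
    one dict per document, storing nonzero entries only.
    """
    col_sums = defaultdict(float)
    for row in sparse_rows:
        for col, value in row.items():
            col_sums[col] += value
    # nlargest avoids sorting every column when k is small.
    return heapq.nlargest(k, col_sums.items(), key=lambda item: item[1])


# Toy data: 3 documents, 3 vocabulary columns.
docs = [
    {0: 0.1, 2: 0.9},
    {1: 0.5, 2: 0.3},
    {0: 0.2},
]
top_two = top_k_columns(docs, 2)
top_columns = [col for col, _ in top_two]
```

With the real matrix, the analogous step would presumably be an argsort over the flattened column sums, keeping the last K indices; the column indices can then be mapped back to words through the vectorizer's vocabulary.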
Thanks

What element of the array would be the median if the size of the array was even and not odd?

I read that it's possible to make quicksort run in O(n log n).
The algorithm says to choose the median as the pivot on each step,
but, suppose we have this array:
10 8 39 2 9 20
which value will be the median?
In math if I remember correct the median is (39+2)/2 = 41/2 = 20.5
I don't have a 20.5 in my array though
thanks in advance
You can choose either of them; asymptotically it does not matter which one you pick as the input grows.
We're talking about the exact wording of the description of an algorithm here, and I don't have the text you're referring to. But I think in context by "median" they probably meant, not the mathematical median of the values in the list, but rather the middle point in the list, i.e. the median INDEX, which in this case would be 3 or 4 (counting from 1). As coffNjava says, you can take either one.
The median is actually found by sorting the array first, so in your example, the median is found by arranging the numbers as 2 8 9 10 20 39 and the median would be the mean of the two middle elements, (9+10)/2 = 9.5, which doesn't help you at all. Using the median is sort of an ideal situation, but would work if the array were at least already partially sorted, I think.
With an even-length array, there is no single exact middle element, so I believe you can use either of the two middle numbers. It'll throw off the efficiency a bit, but not substantially unless you always ended up sorting even-length arrays.
Finding the median of an unsorted set of numbers can be done in O(N) time, but it's not really necessary to find the true median for the purposes of quicksort's pivot. You just need to find a pivot that's reasonable.
As the Wikipedia entry for quicksort says:
In very early versions of quicksort, the leftmost element of the partition would often be chosen as the pivot element. Unfortunately, this causes worst-case behavior on already sorted arrays, which is a rather common use-case. The problem was easily solved by choosing either a random index for the pivot, choosing the middle index of the partition or (especially for longer partitions) choosing the median of the first, middle and last element of the partition for the pivot (as recommended by R. Sedgewick).
Finding the median of three values is much easier than finding it for the whole collection of values, and for collections that have an even number of elements, it doesn't really matter which of the two 'middle' elements you choose as the potential pivot.
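A minimal Python sketch of the median-of-three idea from the quote, using the array from the question. It is one possible arrangement (Lomuto-style partitioning with the pivot moved to the end), not the only way to write it, and the function names are chosen for this example. For an even-length range, the integer division simply picks the lower of the two middle positions.

```python
def median_of_three_index(items, lo, hi):
    """Index of the median among items[lo], items[mid], items[hi]."""
    mid = (lo + hi) // 2  # lower middle when the range has even length
    candidates = sorted([(items[lo], lo), (items[mid], mid), (items[hi], hi)])
    return candidates[1][1]


def quicksort(items, lo=0, hi=None):
    """Sort items in place using a median-of-three pivot."""
    if hi is None:
        hi = len(items) - 1
    if lo >= hi:
        return
    # Move the chosen pivot to the end, then partition around it.
    p = median_of_three_index(items, lo, hi)
    items[p], items[hi] = items[hi], items[p]
    pivot = items[hi]
    store = lo
    for i in range(lo, hi):
        if items[i] < pivot:
            items[i], items[store] = items[store], items[i]
            store += 1
    items[store], items[hi] = items[hi], items[store]
    quicksort(items, lo, store - 1)
    quicksort(items, store + 1, hi)


values = [10, 8, 39, 2, 9, 20]
# First, middle and last are 10, 39 and 20, so the first pivot is 20.
first_pivot = values[median_of_three_index(values, 0, len(values) - 1)]
quicksort(values)
```

The pivot here is never the true median of the whole array; it only needs to be reasonable, which is the point made in the answers above.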