KDB+/Q: implementation of numpy random choice with a probability distribution?

What is a canonical method to implement numpy random choice in kdb+/q?
Specifically, how would one replicate the following selection,
np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
where a probability distribution is provided? roll, deal and permute don't seem to take a probability distribution into account.
Thanks.

I think a hacky way of doing it might be this:
q){[n;k;p]k?raze p#'til n}[5;3;1 0 3 6 0]
3 2 3
Here, instead of a list of probabilities, you give a list of integers representing proportions (which still encode the same distribution).
I imagine there's a more canonical way of doing it though.
I think this works if you need probabilities though:
q){[n;k;p]k?raze ("j"$p*10 xexp max count each("."vs'string p)[;1])#'til n}[5;3;0.05 0 0.3 0.65 0]
2 3 3
Again, very hacky.
EDIT: as user20349 says in the comments, you can use an overload of where to do the above with one less argument:
q){[k;p]k?where p}[3;1 0 3 6 0]
3 0 3
q){[k;p]k?where("j"$p*10 xexp max count each("."vs'string p)[;1])}[3;0.05 0 0.3 0.65 0]
3 3 3

Related

Matlab nonlinear 2D figure plot

Let's say I have a set of data with x ranging from 0 to 5, evenly spaced with a spacing of 0.2. Within this range, the y values have fine features over x from 0 to 1; beyond that, the features are coarse.
So I want a figure that is effectively zoomed into the 0 to 1 range while still plotting the full x range, ideally without losing any data points.
Currently I have one potential solution: I pick sparse points from the 1 to 5 range (spacing of 1 instead of 0.2), plot those data points evenly spaced first, and then label them with the correct corresponding x values, as follows (these are just some random numbers used for explanation; this example doesn't actually have fine features in the 0 to 1 range):
x=[0 0.2 0.4 0.6 0.8 1 2 3 4 5];
x1=0:9;
y=[0.1 0.2 0.3 0.4 0.5 1 2 3 4 5];
figure(1)
plot(x1,y,'-o')
set(gca,'xticklabel',x)
But this will obviously lose some information from the 1 to 5 range.
Is there a better way that I can still plot the whole range from 0 to 5 with the original data points but with a detailed showing of 0 to 1 range?
Thanks!
Not sure I understand. What about this?
x=[0:.2:1 2 3 4 5];
y=[0.1 0.2 0.3 0.4 0.5 1 2 3 4 5];
figure(1)
plot(x,y,'-o')
I would say the only proper way to get non-uniform intervals on the x-axis is to use a logarithmic scale. Try
semilogx(x,y,'-o')

MATLAB help: shuffling a predefined vector without consecutively repeating numbers (with equal occurrences of all values)

I'm having trouble randomly shuffling a vector so that there are no consecutively repeating numbers (e.g. 1 1 is not acceptable but 1 2 is), given that each value is repeated an equal number of times.
More specifically, I would like to repeat the vector 1:4 ten times (40 elements in total) so that 1, 2, 3 and 4 each appear 10 times with no value occurring twice in a row.
If there is any clarification needed please let me know, I hope this question was clear.
This is what I have so far:
cond_order = repmat([1:4],10,1); %make matrix
cond_order = cond_order(:); %make sequence
I know randperm is quite relevant but I'm not sure how to use it with the one condition of non-repeating numbers.
EDIT: Thank you for all the responses.
I realize I was quite unclear. This is the kind of sequence I would like to reject: [1 1 2 2 4 4 4...].
So it doesn't matter if [1 2 3 4] occurs in that order as long as individual values are not repeated. (so both [1 2 3 4 1 2 3 4...] and [4 3 1 2...] are acceptable)
Preferably I am looking for a shuffled vector meeting these criteria:
it is random
there are no consecutively repeating values (e.g. 1 1 4 4)
all four values appear an equal number of times
Kind of working with the rejection sampling idea: just repeat with randperm until a permutation is found that has no consecutive repeated values.
cond_order = repmat(1:4,10,1); %// make matrix
N = numel(cond_order);         %// number of elements
sequence_found = false;
while ~sequence_found
    candidate = cond_order(randperm(N));
    if all(diff(candidate) ~= 0) %// check for no consecutive repeated values
        sequence_found = true;
    end
end
result = candidate;
The solution from mikkola got it methodically right, but I think there is a more efficient way:
He chose to sample based on equal quantities and check for differences between consecutive values. I chose to do it the other way round and ended up with a solution requiring far fewer iterations.
n=4;
k=10;
d=42; % arbitrary start value that fails the first check
while(~all(sum(bsxfun(@eq,d,(1:n).'),2)==k)) % check that all numbers appear exactly k times
    % generate a new random sample; the first element is uniform on 1..n, and each
    % subsequent step adds between 1 and n-1 (mod n), so consecutive values always differ
    d=mod(cumsum([randi(n,1,1),randi(n-1,1,(n*k)-1)]),n)+1;
end
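A quick sanity check of the two claimed properties (a sketch, using the d, n and k from above):
assert(all(diff(d) ~= 0));       % no two consecutive values are equal
assert(all(histc(d, 1:n) == k)); % each of 1..n appears exactly k times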
A subtle but important distinction: does the author need an equal probability of picking any feasible sequence?
A number of people have mentioned answers of the form, "Let's use randperm and then rearrange the sequence so that it's feasible." That may not work. What will make this problem quite hard is if the author needs an equal chance of choosing any feasible sequence. Let me give an example to show the problem.
Imagine the set of numbers [1 2 2 3 4]. First, let's enumerate the set of feasible sequences:
6 sequences beginning with 1: [1 2 3 2 4], [1 2 3 4 2], [1 2 4 2 3], [1 2 4 3 2], [1 3 2 4 2], [1 4 2 3 2].
Then there are 6 sequences beginning with [2 1]: [2 1 2 3 4], [2 1 2 4 3], [2 1 3 2 4], [2 1 3 4 2], [2 1 4 2 3], [2 1 4 3 2]. By symmetry, there are 18 sequences beginning with 2 (i.e. 6 of [2 1], 6 of [2 3], 6 of [2 4]).
By symmetry there are 6 sequences beginning with 3 and another 6 starting with 4.
Hence there are 6 * 3 + 18 = 36 possible sequences.
Sampling uniformly from feasible sequences, the probability the first number is 2 is 18/36 = 50 percent! BUT if you just went with a random permutation, the probability the first digit is 2 would be 40 percent! (i.e. 2/5 numbers in set are 2)
If equal probability of any feasible sequence is required, you want 50 percent of a 2 as the first number, but naive use of randperm and then rejiggering numbers at 2:end to make sequence feasible would give you a 40 percent probability of the first digit being two.
Note that rejection sampling would get the probabilities right as every feasible sequence would have an equal probability of being accepted. (Of course rejection sampling becomes very slow as probability of being accepted goes towards 0.)
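For the skeptical, these counts are easy to check by brute force (a sketch; only viable for tiny cases, since perms grows factorially):
v = [1 2 2 3 4];
P = unique(perms(v), 'rows');         % distinct orderings of the multiset
F = P(all(diff(P, 1, 2) ~= 0, 2), :); % keep only the feasible sequences
size(F, 1)                            % 36 feasible sequences
mean(F(:, 1) == 2)                    % 0.5, i.e. 50 percent start with a 2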
Following some of the discussion on here, I think that there is a trade-off between performance and the theoretical requirements of the application.
If a completely uniform draw from the set of all valid permutations is required, then a pure rejection sampling method will probably be required. The problem with this, of course, is that as the size of the problem increases, the rejection rate becomes very high. To demonstrate this, if we consider the base example in the question with n repetitions of [1 2 3 4], the number of samples rejected per valid draw grows rapidly with n (shown in the original post as a plot with a log y-axis).
My alternative method is to randomly sort the array, and then, whenever a consecutive duplicate is detected, randomly re-sort the remaining elements:
cond_order = repmat(1:4,10,1); %make matrix
cond_order = reshape(cond_order, numel(cond_order), 1); %make sequence
cond_order = cond_order(randperm(numel(cond_order))); %initial shuffle
i = 2;
while i <= numel(cond_order) %<= so that the final pair is checked as well
    if cond_order(i) ~= cond_order(i - 1)
        i = i + 1;
    else
        %consecutive duplicate found: reshuffle the remaining elements
        tmp = cond_order(i:end);
        cond_order(i:end) = tmp(randperm(numel(tmp)));
    end
end
cond_order
Note that there is no guarantee that this will converge, but when it becomes clear that it will not converge, we can simply start again, and that is still cheaper than recomputing the whole sequence.
This definitely meets the last two requirements of the question:
B) there are no consecutive values
C) all 4 values appear an equal number of times
The question is whether it meets the first requirement, randomness.
If we take the simplest version of the problem, with the input [1 2 3 4 1 2 3 4], there are 864 valid permutations (empirically determined!). If we run both methods for 100,000 draws, we would then expect roughly a Gaussian distribution centred on 115.7 draws per permutation.
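The 864 figure can be reproduced with the same kind of brute-force enumeration sketched in an earlier answer (perms on 8 elements gives 40320 rows, so this is only viable for the toy case):
v = [1 2 3 4 1 2 3 4];
P = unique(perms(v), 'rows');   % 2520 distinct orderings
nnz(all(diff(P, 1, 2) ~= 0, 2)) % 864 with no consecutive repeats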
As expected, the pure rejection sampling method gives roughly that expected distribution. My algorithm, however, does not: there is clearly a bias towards certain samples.
In the end, it depends on the requirements. Both methods sample over the whole distribution, so both fulfil the core requirements of the problem. I have not included performance comparisons, but for anything other than the simplest of cases I am confident that my algorithm would be much faster. However, the distribution of the draws is not perfectly uniform; whether it is good enough depends on the application and the size of the actual problem.

replace zero values with previous non-zero values

I need a fast way in Matlab to do something like this (I am dealing with huge vectors, so a normal loop takes forever!):
from a vector like
[0 0 2 3 0 0 0 5 0 0 7 0]
I need to get this:
[NaN NaN 2 3 3 3 3 5 5 5 7 7]
Basically, each zero value is replaced with the value of the previous non-zero one. The first entries are NaN because there is no previous non-zero element in the vector.
Try this; not sure about speed though. Got to run, so an explanation will have to come later if you need it:
interp1(1:nnz(A), A(A ~= 0), cumsum(A ~= 0), 'nearest')
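Edit: the explanation, plus a quick check against the example from the question. cumsum(A ~= 0) gives, at each position, the index (into the list of non-zero values) of the most recent non-zero element, and interp1 with 'nearest' looks that value up; the leading zeros query the out-of-range point 0, for which interp1 returns NaN.
A = [0 0 2 3 0 0 0 5 0 0 7 0];
B = interp1(1:nnz(A), A(A ~= 0), cumsum(A ~= 0), 'nearest')
% B = NaN NaN 2 3 3 3 3 5 5 5 7 7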
Try this (it uses the cummax function, introduced in R2014b):
i1 = x==0;                      % positions of the zeros
i2 = cummax((1:numel(x)).*~i1); % index of the most recent non-zero element (0 if none yet)
i3 = i1 & i2;                   % zeros that have a previous non-zero element
x(i3) = x(i2(i3));
x(~i2) = NaN;                   % positions before the first non-zero
Just for reference, here are some similar/identical functions from the MATLAB File Exchange and/or SO answers:
nearestpoint
the knnimpute function
or, best of all, a function designed to do exactly your task: repnan (obviously, first replace your zero values with NaN).
I had a similar problem once and decided that the most effective way to deal with it was to write a MEX file. The C++ loop is extremely trivial. Once you've figured out how to work with the MEX interface, it will be very easy.
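For reference, a sketch in plain MATLAB of the loop such a MEX file would implement (the function name is made up; the C++ version is a direct translation):
function x = fillPreviousNonZero(x) % hypothetical name, illustrative only
    last = NaN; % nothing to carry before the first non-zero
    for k = 1:numel(x)
        if x(k) == 0
            x(k) = last; % propagate the last non-zero value
        else
            last = x(k);
        end
    end
end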

MATLAB: Subtracting matrix subsets by specific rows

Here is an example of a subset of the matrix I would like to use:
1 3 5
2 3 6
1 1 1
3 5 4
5 5 5
8 8 0
This matrix is in fact 3000 x 3.
For the first 3 rows, I wish to subtract the first of these three rows from each of them.
For the second 3 rows, I wish to subtract the first of those three from each of them, and so on.
As such, the output matrix will look like:
0 0 0
1 0 1
0 -2 -4
0 0 0
2 0 1
5 3 -4
What code in MATLAB will do this for me?
You could also do this completely vectorized by using mat2cell, cellfun, then cell2mat. Assuming our matrix is stored in A, try:
numBlocks = size(A,1) / 3;
B = mat2cell(A, 3*ones(1,numBlocks), 3);
C = cellfun(@(x) x - x([1 1 1], :), B, 'UniformOutput', false);
D = cell2mat(C); %//Output
The first line figures out how many 3 x 3 blocks we need, assuming that the number of rows is a multiple of 3. The second line uses mat2cell to decompose the matrix into 3 x 3 blocks and places them into individual cells. The third line then uses cellfun so that for each cell in our cell array (a 3 x 3 matrix), it subtracts the first row from every row of that matrix. This is very much like what @David did, except I didn't use repmat, to minimize overhead. The fourth line then stacks these matrices back together so that we get our final matrix in the end.
Example (this is using the matrix that was defined in your post):
A = [1 3 5; 2 3 6; 1 1 1; 3 5 4; 5 5 5; 8 8 0];
numBlocks = size(A,1) / 3;
B = mat2cell(A, 3*ones(1, numBlocks), 3);
C = cellfun(@(x) x - x([1 1 1], :), B, 'UniformOutput', false);
D = cell2mat(C);
Output:
D =
0 0 0
1 0 1
0 -2 -4
0 0 0
2 0 1
5 3 -4
In hindsight, I think @David is right with respect to performance gains. Unless this code is repeated many times, I think the for loop will be more efficient. Either way, I wanted to provide another alternative. Cool exercise!
Edit: Timing and Size Tests
Because of our discussion earlier, I have decided to do timing and size tests. These tests were performed on an Intel i7-4770 @ 3.40 GHz CPU with 16 GB of RAM, using MATLAB R2014a on Windows 7 Ultimate. Basically, I did the following:
Test #1 - I set the random number generator's seed to 1 for reproducibility. I wrote a loop that cycled 10000 times; for each iteration in the loop, I generated a random integer 3000 x 3 matrix and ran each of the methods described here. I took note of how long each method took to complete the 10000 cycles. The timing results are:
David's method: 0.092129 seconds
rayryeng's method: 1.9828 seconds
natan's method: 0.20097 seconds
natan's bsxfun method: 0.10972 seconds
Divakar's bsxfun method: 0.0689 seconds
As such, Divakar's method is the fastest, followed by David's for loop method, followed closely by natan's bsxfun method, then natan's original kron method, and finally the sloth (a.k.a. mine).
Test #2 - I decided to see how fast this would get as you increase the size of the matrix. The set up was as follows. I did 1000 iterations, and at each iteration I increased the number of matrix rows by 3000. As such, iteration 1 used a 3000 x 3 matrix, the next a 6000 x 3 matrix, and so on. The random seed was set to 1 again, and at each iteration I noted the time taken to complete the code. To ensure fairness, the variables were cleared at each iteration before the processing code began. I made a stem plot of the timing for each matrix size, subsetted to display timings from 200000 x 3 to 300000 x 3. The horizontal axis records the number of rows at each iteration: the first stem is for 3000 rows, the next for 6000 rows, and so on. The columns remain fixed at 3 (of course).
I can't explain the random spikes throughout the graph; they are probably attributable to something happening in RAM. However, I'm very sure I cleared the variables at each iteration to ensure no bias. In any case, Divakar and David are closely tied. Next comes natan's bsxfun method, then natan's kron method, followed last by mine. It is interesting to see how Divakar's bsxfun method and David's for loop method are side by side in timing.
Test #3 - I repeated what I did for Test #2, but following natan's suggestion I went to a logarithmic scale. I did 6 iterations, starting at a 3000 x 3 matrix and increasing the number of rows tenfold each time. As such, the second iteration had 30000 x 3, the third 300000 x 3, and so on, up to the last iteration at 3e8 x 3.
I plotted with a logarithmic scale on the horizontal axis, while the vertical axis is still linear. Again, the horizontal axis gives the number of rows in the matrix.
I changed the vertical limits so we can see most of the methods. My method performs so poorly that it would squash the other timings towards the lower end of the graph, so I changed the viewing limits to take it out of the picture. Essentially, what was seen in Test #2 is verified here.
Here's another way to implement this with bsxfun, slightly different from natan's bsxfun implementation -
t1 = reshape(a,3,[]); %// a is the input matrix
out = reshape(bsxfun(@minus,t1,t1(1,:)),[],3); %// Desired output
A slightly shorter, vectorized way (if a is your matrix):
b=a-kron(a(1:3:end,:),ones(3,1));
let's test:
a=[1 3 5
2 3 6
1 1 1
3 5 4
5 5 5
8 8 0]
a-kron(a(1:3:end,:),ones(3,1))
ans =
0 0 0
1 0 1
0 -2 -4
0 0 0
2 0 1
5 3 -4
Edit
Here's a bsxfun solution (less elegant, but hopefully faster):
a-reshape(bsxfun(@times,ones(1,3),permute(a(1:3:end,:),[2 3 1])),3,[])'
ans =
0 0 0
1 0 1
0 -2 -4
0 0 0
2 0 1
5 3 -4
Edit 2
OK, this got me curious, as I know bsxfun starts to be less efficient for bigger array sizes. So I tried to compare my two solutions using timeit (easy, because they are one-liners). Here it is:
range=3*round(logspace(1,6,200));
for n=1:numel(range)
    a=rand(range(n),3);
    f=@() a-kron(a(1:3:end,:),ones(3,1));
    g=@() a-reshape(bsxfun(@times,ones(1,3),permute(a(1:3:end,:),[2 3 1])),3,[])';
    t1(n)=timeit(f);
    t2(n)=timeit(g);
end
semilogx(range,t1./t2);
So I didn't test the for loop or Divakar's bsxfun, but you can see that for arrays with fewer than about 3e4 rows kron is better than bsxfun, and this changes for larger arrays (a ratio < 1 means kron took less time at that array size). This was done on MATLAB 2012a, Windows 7 (i5 machine).
Simple for loop. This does each 3x3 block separately.
A=randi(5,9,3)
B=A(1:3:end,:)
for i=1:length(A(:,1))/3
    D(3*i-2:3*i,:)=A(3*i-2:3*i,:)-repmat(B(i,:),3,1)
end
D
Whilst it may be possible to vectorise this, I don't think the performance gains would be worth it, unless you will do this many times. For a 3000x3 matrix it doesn't take long at all.
Edit: In fact this seems to be pretty fast. I think that's because Matlab's JIT compilation can speed up simple for loops well.
You can do it using just indexing:
a(:) = a(:) - a(3*floor((0:numel(a)-1)/3)+1).';
Of course, the 3 above can be replaced by any other number. It works even if that number doesn't divide the number of rows.
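For example, on the matrix from the question this reproduces the expected output (idx is just a name for the index vector from the one-liner above):
a = [1 3 5; 2 3 6; 1 1 1; 3 5 4; 5 5 5; 8 8 0];
idx = 3*floor((0:numel(a)-1)/3) + 1; % first linear index of each 3-element block
a(:) = a(:) - a(idx).';
a % rows: 0 0 0; 1 0 1; 0 -2 -4; 0 0 0; 2 0 1; 5 3 -4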

Hidden Markov Model Multiple Observation values for each state

I am new to Hidden Markov Models. I understand the main idea and have tried some of MATLAB's built-in HMM functions to help me understand more.
If I have a sequence of observations and corresponding states,
e.g.
seq = 2 6 6 1 4 1 1 1 5 4
states = 1 1 2 2 2 2 2 2 2 2
then I can use the hmmestimate function to calculate the transition and emission probability matrices:
[TRANS_EST, EMIS_EST] = hmmestimate(seq, states)
TRANS_EST =
0.5000 0.5000
0 1.0000
EMIS_EST =
0 0.5000 0 0 0 0.5000
0.5000 0 0 0.2500 0.1250 0.1250
In the example, the observation is just a single value.
The example below describes my situation.
If I have states {Sleep, Work, Sport} and a set of observations {light off, light on, heart rate > 100, ...},
and I use a number to represent each observation, then in my situation each state has multiple observations at the same time:
seq = {2,3,5} {6,1} {2} {2,3,6} {4} {1,2} {1}
states = 1 1 2 2 2 2 2
I have no idea how to implement this in MATLAB to get the transition and emission probability matrices. I am quite lost; what should I do next? Am I using the right approach?
Thanks!
If you know the hidden state sequence, then max likelihood estimation is trivial: it's the normalized empirical counts. In other words, count up the transitions and emissions and then divide the elements in each row by the total counts in that row.
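As a sketch of that recipe on the sequence from the question (it reproduces the hmmestimate output above; bsxfun is used for the row normalization to stay compatible with older MATLAB releases, and it assumes every state actually occurs at least once):
seq = [2 6 6 1 4 1 1 1 5 4];
states = [1 1 2 2 2 2 2 2 2 2];
nStates = max(states); nSymbols = max(seq);
TRANS = zeros(nStates); EMIS = zeros(nStates, nSymbols);
for t = 1:numel(states)-1 % count transitions
    TRANS(states(t), states(t+1)) = TRANS(states(t), states(t+1)) + 1;
end
for t = 1:numel(states) % count emissions
    EMIS(states(t), seq(t)) = EMIS(states(t), seq(t)) + 1;
end
TRANS = bsxfun(@rdivide, TRANS, sum(TRANS, 2)) % normalize each row
EMIS = bsxfun(@rdivide, EMIS, sum(EMIS, 2))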
In the case where you have multiple observation variables, code the observations as a vector where each element gives the value of one of the random variables at that time step, e.g. {lights = 1, computer = 0, heart rate > 100 = 1, location = 0}. The key is that you need the same number of observations at each time step, or else things will be much more difficult.
I think you have two options.
1) Code multiple observations into one number. For example, if you know that the maximal possible value for an observation is N, and at each state you have at most K observations, then you can code any combination of observations as a number between 0 and N^K - 1. By doing this, you are assuming that {2,3,6} and {2,3,5} share nothing; they are two completely different observations.
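A sketch of option 1, assuming the observations at each time step are padded to a fixed-length vector of K values in 0..N-1 (the N, K and obs here are made-up illustrative values):
N = 7; % assumed number of distinct observation values
obs = [2 3 5]; % one time step's observations, K = 3
symbol = polyval(obs, N) % base-N code: 2*N^2 + 3*N + 5 = 124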
2) Or you can have multiple emission distributions for each state. I haven't used MATLAB's built-in functions for HMM estimation, so I don't know whether they support that. But the idea is: if you have multiple emission distributions at a state, the emission likelihood is just their product, i.e. the observation variables are treated as conditionally independent given the state. This is what jerad suggests.