I have a data file which contains time data. The list is quite long, 100,000+ points. There is a data point every 0.1 seconds, and the time stamps look like this:
'2010-10-10 12:34:56'
'2010-10-10 12:34:56.1'
'2010-10-10 12:34:56.2'
'2010-10-10 12:34:56.3'
etc.
Not every 0.1 second interval is necessarily present. I need to check whether a 0.1 second interval is missing, then insert this missing time into the date vector. Comparing strings seems unnecessarily complicated. I tried comparing seconds since midnight:
date_nums=datevec(time_stamps);
secs_since_midnight=date_nums(:,4)*3600+date_nums(:,5)*60+date_nums(:,6);
comparison_secs=linspace(0,86400,864000);
res=(ismember(comparison_secs,secs_since_midnight)~=1);
However, this approach doesn't work because of rounding error: the seconds-since-midnight values and the linspace vector I compare them to never quite match (presumably because of the 0.1-second resolution?). The intent is to later run an FFT on the data associated with these time stamps, so I want the data to be as uniform as possible (the data associated with the missing intervals will be interpolated). I've considered blocking the data into smaller chunks of time and checking the chunks one at a time, but I don't know whether that's the best way to go about it. Thanks!
Multiply your numbers-of-seconds by 10 and round to the nearest integer before comparing against your range.
There may be more efficient ways to do this than ismember. (I don't know offhand how clever the implementation of ismember is, but if it's The Simplest Thing That Could Possibly Work then you'll be taking O(N^2) time that way.) For instance, you could use the timestamps that are actually present (as integer numbers of 0.1-second intervals) as indices into an array.
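For example, a minimal sketch of that indexing idea, reusing the secs_since_midnight vector from the question (variable names assumed from the original code):
tenths = round(secs_since_midnight*10) + 1;   % which 0.1-s slot each stamp falls in (1-based)
present = false(1, 864000);                   % one slot per 0.1 s over a day
present(tenths) = true;                       % mark the slots that actually occur
missing_secs = (find(~present) - 1) / 10;     % seconds since midnight of the missing slots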
Since you're concerned with missing data records and not other timing issues such as a drifting time channel, you could check for missing records by converting the time values to seconds, doing a DIFF and finding those first differences that are greater than some tolerance. This would tell you the indices where the missing records should go. It's then up to you to do something about this. Remember, if you're going to use this list of indices to fill the gaps, process the list in descending index order since inserting the records will cause the index list to be unsynchronized with the data.
>> time_stamps = now:.1/86400:now+1; % Generate test data.
>> time_stamps(randi(length(time_stamps), 10, 1)) = []; % Remove 10 random records.
>> t = datenum(time_stamps); % Convert to date numbers.
>> t = 86400 * t; % Convert to seconds.
>> index = find(diff(t) > 1.999 * 0.1)' + 1 % Find missing records.
index =
30855
147905
338883
566331
566557
586423
642062
654682
733641
806963
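A rough sketch of the descending-order insertion described above (it assumes each gap is exactly one missing 0.1-s record; wider gaps would need more than one inserted value):
for k = numel(index):-1:1
    idx = index(k);
    t = [t(1:idx-1), t(idx-1) + 0.1, t(idx:end)];   % splice in the missing time, in seconds
end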
I have voltage and current signals from multiple days. The time vector is in seconds of the day (SOD), and the voltage and current vectors are in volts and amps respectively. However, the vector data from each day have different lengths. For example, Monday's data might be 1x100000 for both time and voltage/current, and Tuesday's might be 1x50000. I was asked to plot the different days of data on the same figure for comparison purposes. I have tried using the plot(x1,y1,x2,y2) method, but that didn't work due to the different vector lengths. I tried interpolating to the larger data set, but then realized that I would get all NaNs in the result since there is no overlap in time. I've run out of ideas and am desperately in need of help.
EDIT:
I guess I forgot to mention that I would like to overlay them, one on top of the other, in the same figure, not in subplots.
It sounds like you want a data vector of length n to span, I'm guessing, 24 hours = 86400 seconds, for any n (e.g. n=100000 or n=50000). Assuming the original data is uniformly sampled, this should do the trick:
x1=linspace(0,86400,length(x1));
x2=linspace(0,86400,length(x2));
plot(x1,y1,'r-',x2,y2,'b-');
If it is not uniformly sampled, we can still make it work:
t1=linspace(0,86400,length(x1));
t2=linspace(0,86400,length(x2));
newy1 = spline(x1,y1,t1);
newy2 = spline(x2,y2,t2);
plot(t1,newy1,'r-',t2,newy2,'b-');
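If you'd rather build up the overlay one day at a time (the EDIT asks for everything in one figure, no subplots), hold on does the same job; a quick sketch reusing the variables above:
figure; hold on;
plot(t1, newy1, 'r-');                 % day 1
plot(t2, newy2, 'b-');                 % day 2
xlabel('Seconds of day'); ylabel('Signal');
legend('Day 1', 'Day 2');
hold off;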
I am processing 1Hz timestamps (variable 'timestamp_1hz') from a logger which doesn't log at exactly the same time every second (the difference varies from 0.984 to 1.094 seconds, but is sometimes 0.5 or several seconds if the logger burps). The 1Hz dataset is used to build a 10-minute-averaged dataset, and each 10-minute interval must have 600 records. Because the logger doesn't log at exactly the same time every second, the timestamp slowly drifts through the 1-second mark. Issues come up when the timestamp crosses the 0 mark, as well as the 0.5 mark.
I have tried various ways to pre-process the timestamps. The timestamps with around 1 second between them should be considered valid. A few examples include:
% simple
% this screws up around half second and full second values
rawseconds = raw_1hz_new(:,6)+(raw_1hz_new(:,7)./1000);
rawsecondstest = rawseconds;
rawsecondstest(:,1) = floor(rawseconds(:,1))+ rawseconds(1,1);
% more complicated
% this screws up if there is missing data, then the issue compounds because k+1 timestamp is dependent on k timestamp
rawseconds = raw_1hz_new(:,6)+(raw_1hz_new(:,7)./1000);
A = diff(rawseconds);
numcheck = rawseconds(1,1);
integ = floor(numcheck);
fract = numcheck-integ;
if fract>0.5
    rawseconds(1,1) = rawseconds(1,1)-0.5;
end
for k=2:length(rawseconds)
    rawsecondstest(k,1) = rawsecondstest(k-1,1)+round(A(k-1,1));
end
I would like to pre-process the timestamps and then compare them to a contiguous 1Hz time vector using 'intersect' in order to find the missing, repeated, etc. data, like this:
% pull out the time stamp (round to 1hz and convert to serial number)
timestamp_1hz=round((datenum(raw_1hz_new(:,[1:6])))*86400)/86400;
% calculate new start time and end time to find contig time
starttime=min(timestamp_1hz);
endtime=max(timestamp_1hz);
% determine the contig time
contigtime=round([floor(mean([starttime endtime])):1/86400:ceil(mean([starttime endtime]))-1/86400]'*86400)/86400;
% find indices where logger time stamp matches real time and puts
% the indices of a and b
clear Ia Ib Ic Id
[~,Ia,Ib]=intersect(timestamp_1hz,contigtime);
% find indices where there is a value in real time that is not in
% logger time
[~,Ic] = setdiff(contigtime,timestamp_1hz);
% finds the indices that are unique
[~,Id] = unique(timestamp_1hz);
You can download 10 days of the raw_1hz_new timestamps here. Any help or tips would be much appreciated!
The problem you have is that you can't simply match these stamps up to a list of times, because you could be expecting a set of datapoints at seconds = 1000, 1001, 1002, but if there was an earlier blip you could have entirely legitimate data at 1000.5, 1001.5, 1002.5 instead.
If all you want is a list of valid times/their location in your series, why not just something like (times in seconds):
A = diff(times); % difference between times
n = find(abs(A-1)<0.1) % change 0.1 to whatever your tolerance is
times2 = times(n+1);
times2 should then be a list of all your timestamps where the previous timestamp was approximately 1 second earlier - this works on a small set of fake data I constructed; I didn't try it on yours. (For future reference: it would be more helpful to provide a small subset of your data, e.g. just a few minutes' worth, that you know contains a blip.)
I would then take the list of valid timestamps and split it up into 10 minute sections for averaging, counting how many valid timestamps were obtained in each section. If it's working, you should end up with no more than 600 - but not much less if the blips are occasional.
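For the 10-minute sections, a sketch along these lines might do (it assumes times2 is in seconds of the day; the 590 cutoff is arbitrary):
edges = 0:600:86400;                   % 10-minute bin edges over one day
counts = histc(times2, edges);         % number of valid 1 Hz records in each bin
counts(end) = [];                      % histc's last bin only counts values exactly equal to 86400
suspect = find(counts < 590);          % bins noticeably short of the expected 600 records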
Having read carefully the previous question
Random numbers that add to 100: Matlab
I am struggling to solve a similar but slightly more complex problem.
I would like to create an array of n elements that sums to 1; however, I want the added constraint that the minimum increment (or, if you like, the number of significant figures) for each element is fixed.
For example if I want 10 numbers that sum to 1 without any constraint the following works perfectly:
num_stocks=10;
num_simulations=100000;
temp = [zeros(num_simulations,1),sort(rand(num_simulations,num_stocks-1),2),ones(num_simulations,1)];
weights = diff(temp,[],2);
I foolishly thought that by scaling this I could add the constraint as follows
num_stocks=10;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp2 = [zeros(num_simulations,1),sort(round(rand(num_simulations,num_stocks-1)*scaling)/scaling,2),ones(num_simulations,1)];
weights2 = diff(temp2,[],2);
However, though this works for small values of n and small increments, if for example n = 1,000 and the increment is 0.1%, then over a large number of trials the first and last numbers have a mean that is consistently below 0.1%.
I am sure there is a logical explanation/solution to this, but I have been tearing my hair out trying to find it, and wondered whether anybody would be so kind as to point me in the right direction. To put the problem into context, I am creating random stock portfolios (hence the sum to 1).
Thanks in advance
Thank you for the responses so far, just to clarify (as I think my initial question was perhaps badly phrased), it is the weights that have a fixed increment of 0.1% so 0%, 0.1%, 0.2% etc.
I did try using integers initially
num_stocks=1000;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp = [zeros(num_simulations,1),sort(randi([0 scaling],num_simulations,num_stocks-1),2),ones(num_simulations,1)*scaling];
weights = (diff(temp,[],2)/scaling);
test=mean(weights);
but this was worse: the mean for the 1st and last weights is well below 0.1%.
Edit to reflect excellent answer by Floris & clarify
The original code I was using to solve this problem (before finding this forum) was
function x = monkey_weights_original(simulations,stocks)
stockmatrix=1:stocks;
base_weight=1/stocks;
r=randi(stocks,stocks,simulations);
x=histc(r,stockmatrix)*base_weight;
end
This runs very fast, which is important considering I want to run a total of 10,000,000 simulations: 10,000 simulations on 1,000 stocks takes just over 2 seconds on a single core, and I am running the whole code on an 8-core machine using the parallel toolbox.
It also gives exactly the distribution I was looking for in terms of means, and I think that it is just as likely to get a portfolio that is 100% in 1 stock as it is to get a portfolio that is 0.1% in every stock (though I'm happy to be corrected).
My issue is that although it works for 1,000 stocks and an increment of 0.1%, and I guess it works for 100 stocks and an increment of 1%, as the number of stocks decreases each pick becomes a very large percentage (in the extreme, with 2 stocks you will always get a 50/50 portfolio).
In effect I think this solution is like the binomial solution Floris suggests (but more limited).
However my question has arisen because I would like to make my approach more flexible and have the possibility of, say, 3 stocks and an increment of 1%, which my current code will not handle correctly; hence how I stumbled across the original question on Stack Overflow.
Floris's recursive approach will get to the right answer, but the speed will be a major issue considering the scale of the problem.
An example of the original research is here
http://www.huffingtonpost.com/2013/04/05/monkeys-stocks-study_n_3021285.html
I am currently working on extending it with more flexibility on portfolio weights and the number of stocks in the index, but it appears my programming and probability theory abilities are a limiting factor.
One problem I can see is that your formula allows numbers to be zero - when the rounding operation results in two consecutive numbers being the same after sorting. Not sure if you consider that a problem - but I suggest you think about it (it would mean your model portfolio has fewer than N stocks in it, since the contribution of one of the stocks would be zero).
The other thing to note is that the probability of getting the extreme values in your distribution is half of what you want it to be: if you have uniformly distributed numbers from 0 to 1000 and you round them, the numbers that round to 0 came from the interval [0, 0.5), while the ones that round to 1 came from [0.5, 1.5), which is twice as big. The last number (rounding to 1000) again comes from a smaller interval: [999.5, 1000]. Thus you will not get the first and last numbers as often as you think. If you use floor instead of round, I think you will get the answer you expect.
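A tiny sketch of the difference (only the round/floor call changes relative to the code in the question):
% With round, 0 and 1000 each come from a half-width interval, so the end
% breakpoints are under-represented; with floor, every value 0..999 comes
% from a full-width interval.
breaks_round = sort(round(rand(1,9)*1000)/1000);
breaks_floor = sort(floor(rand(1,9)*1000)/1000);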
EDIT
I thought about this some more, and came up with a slow but (I think) accurate method for doing this. The basic idea is this:
Think in terms of integers; rather than dividing the interval 0 - 1 in steps of 0.001, divide the interval 0 - 1000 in integer steps
If we try to divide N into m intervals, the mean size of a step should be N / m; but being integer, we would expect the intervals to be binomially distributed
This suggests an algorithm in which we choose the first interval as a binomially distributed variate with mean (N/m) - call the first value v1; then divide the remaining interval N - v1 into m-1 steps; we can do so recursively.
The following code implements this:
% random integers adding up to a definite sum
function r = randomInt(n, limit)
% returns an array of n random integers
% whose sum is limit
% calls itself recursively; slow but accurate
if n>1
    v = binomialRandom(limit, 1 / n);
    r = [v randomInt(n-1, limit - v)];
else
    r = limit;
end

function b = binomialRandom(N, p)
b = sum(rand(1,N)<p); % slow but direct
To get 10000 instances, you run this as follows:
tic
portfolio = zeros(10000, 10);
for ii = 1:10000
    portfolio(ii,:) = randomInt(10, 1000);
end
toc
This ran in 3.8 seconds on a modest machine (single thread) - of course the method for obtaining a binomially distributed random variate is the thing slowing it down; there are statistical toolboxes with more efficient functions but I don't have one. If you increase the granularity (for example, by setting limit=10000) it will slow down more since you increase the number of random number samples that are generated; with limit = 10000 the above loop took 13.3 seconds to complete.
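For example, if the Statistics Toolbox is available, binornd should be a faster drop-in for the helper above (an untested sketch):
function b = binomialRandom(N, p)
b = binornd(N, p);   % one binomial(N, p) draw, Statistics Toolbox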
As a test, I found mean(portfolio)' and std(portfolio)' as follows (with limit=1000):
100.20 9.446
99.90 9.547
100.09 9.456
100.00 9.548
100.01 9.356
100.00 9.484
99.69 9.639
100.06 9.493
99.94 9.599
100.11 9.453
This looks like a pretty convincing "flat" distribution to me. We would expect the numbers to be binomially distributed with a mean of 100, and standard deviation of sqrt(p*(1-p)*n). In this case, p=0.1 so we expect s = 9.4868. The values I actually got were again quite close.
I realize that this is inefficient for large values of limit, and I made no attempt at efficiency. I find that clarity trumps speed when you develop something new. But for instance you could pre-compute the cumulative binomial distributions for p=1./(1:10), then do a random lookup; but if you are just going to do this once, for 100,000 instances, it will run in under a minute; unless you intend to do it many times, I wouldn't bother. But if anyone wants to improve this code I'd be happy to hear from them.
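A rough sketch of that table-lookup idea for a single p, using only base MATLAB (limit and p here are example values, not taken from the code above):
limit = 1000; p = 1/10;                      % example parameters
k = 0:limit;
logpmf = gammaln(limit+1) - gammaln(k+1) - gammaln(limit-k+1) ...
         + k*log(p) + (limit-k)*log(1-p);    % binomial log-pmf via gammaln
cdftab = cumsum(exp(logpmf));                % pre-computed cumulative distribution
v = find(rand <= cdftab, 1) - 1;             % one binomial(limit, p) draw by table lookup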
Eventually I have solved this problem!
I found a paper by two academics at Johns Hopkins University, "Sampling Uniformly From The Unit Simplex":
http://www.cs.cmu.edu/~nasmith/papers/smith+tromble.tr04.pdf
In the paper they outline how naive algorithms don't work, in a way very similar to woodchips' answer to the Random numbers that add to 100 question. They then go on to show that the method suggested by David Schwartz can also be slightly biased, and propose a modified algorithm which appears to work.
If you want x numbers that sum to y:
1. Sample x-1 random numbers uniformly, without replacement, from the range 1 to x+y-1.
2. Sort them.
3. Add a zero at the beginning and x+y at the end.
4. Difference them and subtract 1 from each value.
5. If you want to scale them as I do, divide by y.
It took me a while to realise why this works when the original approach didn't, and it comes down to the probability of getting a zero weight (as highlighted by Floris in his answer). In the original version, to get a zero weight for anything other than the 1st or last weight your random numbers had to contain two identical values, but for the 1st and last weights a random number of zero or of the maximum value would also produce a zero weight, which is more likely.
In the revised algorithm, zero and the maximum number are not in the set of random choices, and a zero weight occurs only if you select two consecutive numbers, which is equally likely for every position.
I coded it up in MATLAB as follows:
function weights = unbiased_monkey_weights(num_simulations,num_stocks,min_increment)
scaling=1/min_increment;
sample=NaN(num_simulations,num_stocks-1);
for i=1:num_simulations
    allcomb=randperm(scaling+num_stocks-1);
    sample(i,:)=allcomb(1:num_stocks-1);
end
temp = [zeros(num_simulations,1),sort(sample,2),ones(num_simulations,1)*(scaling+num_stocks)];
weights = (diff(temp,[],2)-1)/scaling;
end
Obviously the loop is a bit clunky, and since I'm using the 2009 version of MATLAB the randperm function only allows you to generate permutations of the whole set; despite this I can run 10,000 simulations for 1,000 numbers in 5 seconds on my clunky laptop, which is fast enough.
The mean weights are now correct, and as a quick test I replicated woodchips' example of generating 3 numbers that sum to 1 with a minimum increment of 0.01%, and it also looks right.
Thank you all for your help; I hope this solution is useful to somebody else in the future.
The simple answer is to use the schemes that work well with NO minimum increment, then transform the problem. As always, be careful. Some methods do NOT yield uniform sets of numbers.
Thus, suppose I want 11 numbers that sum to 100, with the constraint of a minimum value of 5. I would first find 11 numbers that sum to 45, with no lower bound on the samples (other than zero). I could use a tool from the File Exchange for this. Simplest is to sample 10 numbers in the interval [0,45], sort them, then find the differences.
X = diff([0,sort(rand(1,10)),1]*45);
The vector X is a sample of numbers that sums to 45. The vector Y below then sums to 100, with a minimum value of 5.
Y = X + 5;
Of course, this is trivially vectorized if you wish to find multiple sets of numbers with the given constraint.
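For instance, a sketch of the vectorized form, generating nsets rows at once (nsets is just an illustrative name):
nsets = 10000;                                                   % number of sets to draw
X = diff([zeros(nsets,1), sort(rand(nsets,10),2), ones(nsets,1)]*45, [], 2);
Y = X + 5;                                                       % each row: 11 numbers, min 5, summing to 100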
I'm trying to assign ~1 Million values to a 100x100 logical matrix like this:
CC(Labels,LabelsXplusOne) = true;
where CC is 100x100 logical and Labels, LabelsXplusOne are 1024x768 int32.
The problem is that the above statement takes about 5 minutes to complete on a modern CPU.
Obviously it is badly implemented in MATLAB, so how can we make the above run faster without resorting to loops?
In case you are wondering, I need this statement to compute blobs in an integer (not binary) image.
And also:
max(max(Labels)) = 100
max(max(LabelsXplusOne)) = 100
EDIT:
OK, I got it. Maybe this will help others in the future:
tic; CC(sub2ind(size(CC),Labels,LabelsXplusOne)) = true; toc;
Elapsed time is 0.026414 seconds.
Much better now.
There are a couple of issues that jump out at me...
I have the feeling you are doing the matrix indexing wrong. As it stands now, what will happen is every value in Labels will be paired with every value in LabelsXplusOne, producing (1024*768)^2 total index pairs for your rows and columns of CC. That's likely what's taking so long.
What you probably want is to only use each pair of values as indices, like Labels(1,1),LabelsXplusOne(1,1), Labels(1,2),LabelsXplusOne(1,2), etc. To do this, you should convert your indices into linear indices using the function SUB2IND.
Additionally, your matrix CC only contains 10,000 entries, yet your index matrices each contain 786,432 integer values. This means you will end up assigning the value true to the same entry in CC many times over. You should first remove redundant sets of indices using the function UNIQUE, then use them to assign values to CC.
This is what I think you want:
CC(unique(sub2ind(size(CC), Labels, LabelsXplusOne))) = true;
I have two arrays of data that I'm trying to amalgamate. One contains actual latencies from an experiment in the first column (e.g. 0.345, 0.455... never more than 3 decimal places), along with other data from that experiment. The other contains what is effectively a 'look up' list of latencies ranging from 0.001 to 0.500 in 0.001 increments, along with other pieces of data. Both data sets are X-by-Y doubles.
What I'm trying to do is something like...
for i = 1:length(actual_latency)
row = find(predicted_data(:,1) == actual_latency(i))
full_set(i,1:4) = [actual_latency(i) other_info(i) predicted_info(row,2) ...
predicted_info(row,3)];
end
...in order to find the relevant row in predicted_data where the look-up latency corresponds to the actual latency. I then use this to create an amalgamated data set, full_set.
I figured this would be really simple, but the find function keeps failing by throwing up an empty matrix when looking for an actual latency that I know is in predicted_data(:,1) (as I've double-checked during debugging).
Moreover, if I replace find with a for loop to do the same job, I get a similar error. It doesn't appear to be systematic - using different participant data sets throws it up in different places.
Furthermore, during debugging mode, if I use find to try and find a hard-coded value of actual_latency, it doesn't always work. Sometimes yes, sometimes no.
I'm really scratching my head over this, so if anyone has any ideas about what might be going on, I'd be really grateful.
You are likely running into a problem with floating point comparisons when you do the following:
predicted_data(:,1) == actual_latency(i)
Even though your numbers appear to only have three decimal places of precision, they may still differ by very small amounts that are not being displayed, thus giving you an empty matrix since FIND can't get an exact match.
One feature of floating point numbers is that certain values can't be exactly represented, because they don't have a finite binary (base-2) expansion. This occurs with the numbers 0.1 and 0.001. If you repeatedly add or multiply one of these numbers you can see some unexpected behavior. Amro pointed out one example in his comment: 0.3 is not exactly equal to 3*0.1. This can also be illustrated by creating your look-up list of latencies in two different ways. You can use the normal colon syntax:
vec1 = 0.001:0.001:0.5;
Or you can use LINSPACE:
vec2 = linspace(0.001,0.5,500);
You'd think these two vectors would be equal to one another, but think again!:
>> isequal(vec1,vec2)
ans =
0 %# FALSE!
This is because the two methods create the vectors by performing successive additions or multiplications of 0.001 in different ways, giving ever so slightly different values for some entries in the vector. You can take a look at this technical solution for more details.
When comparing floating point numbers, you should therefore do your comparisons using some tolerance. For example, this finds the indices of entries in the look-up list that are within 0.0001 of your actual latency:
tolerance = 0.0001;
for i = 1:length(actual_latency)
row = find(abs(predicted_data(:,1) - actual_latency(i)) < tolerance);
...
The topic of floating point comparison is also covered in this related question.
You may try to do the following:
row = find(abs(predicted_data(:,1) - actual_latency(i)) < eps);
EPS is the floating-point relative accuracy.
Have you tried using a tolerance rather than == ?
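For example (a minimal sketch; the 1e-6 tolerance is arbitrary for latencies with three decimal places):
row = find(abs(predicted_data(:,1) - actual_latency(i)) < 1e-6);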