I have a problem. I have a database with 500k records. Each record stores latitude, longitude, species of animal, and date of observation. I need to draw a 15x10 grid over a MapKit view that shows the concentration of species in each grid cell. Each cell is a 32x32 box.
If I calculate this at run time it is very slow.
Does anybody have an idea how to cache it, either in memory or in the database?
Data structure:
Observation:
Latitude
Longitude
Date
Specie
some other unimportant data
Screen sample:
http://img6.imageshack.us/img6/7562/20091204201332.png
The opacity of each red box shows the count of species in that region.
The code I use now (data holds all observations in the map region, selected from the database):
for (int row = 0; row < rows; row++)
{
    for (int column = 0; column < columns; column++)
    {
        speciesPerBox = 0;

        // Longitude bounds of this cell
        minG = boxes[row][column].longitude;
        if (column != columns-1) {
            maxG = boxes[row][column+1].longitude;
        } else {
            maxG = buttomRight.longitude;
        }

        // Latitude bounds of this cell
        maxL = boxes[row][column].latitude;
        if (row != rows-1) {
            minL = boxes[row+1][column].latitude;
        } else {
            minL = buttomRight.latitude;
        }

        // Count distinct species among the sightings that fall in this cell
        for (int i = 0; i < sightingCount; i++) {
            l = data[i].latitude;
            g = data[i].longitude;
            if (l >= minL && l < maxL && g >= minG && g < maxG) {
                for (int j = 0; j < speciesPerBox; j++) {
                    if (speciesCountArray[j] == data[i].specie) {
                        hasSpecie = YES;
                    }
                }
                if (hasSpecie == NO) {
                    speciesCountArray[speciesPerBox] = data[i].specie;
                    speciesPerBox++;
                }
                hasSpecie = NO;
            }
        }

        mapData[row][column].count = speciesPerBox;
    }
}
Since your data is static, you can pre-compute the species present in each grid cell and store those in the database instead of querying all the individual location coordinates.
Since you have 15 x 10 = 150 cells, you'll end up with 150 * [num of species] records in the database, which should be a much smaller number.
Also, make sure you have indexes on the proper columns. Otherwise, your queries will have to scan every single record over and over again.
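As a minimal sketch of that pre-computation, assuming the observations live in a SQLite table named observation with columns latitude, longitude and specie (the table and column names are my assumptions, not something given in the question): the summary table is filled once, and at draw time you read at most 150 * [num of species] small rows instead of scanning 500k observations.
#import <sqlite3.h>

static void precomputeCellSpecies(sqlite3 *db,
                                  double minLat, double maxLat,
                                  double minLon, double maxLon)
{
    // One row per (cell, species); the primary key keeps re-runs idempotent.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS cell_species ("
        " cell_row INTEGER, cell_col INTEGER, specie TEXT, sightings INTEGER,"
        " PRIMARY KEY (cell_row, cell_col, specie));",
        NULL, NULL, NULL);

    // Bucket every observation into one of the 15 x 10 cells. Edge clamping
    // (latitude exactly on the bottom edge, longitude on the right edge) is
    // omitted to keep the sketch short.
    const char *fill =
        "INSERT OR REPLACE INTO cell_species (cell_row, cell_col, specie, sightings) "
        "SELECT cell_row, cell_col, specie, COUNT(*) "
        "FROM (SELECT CAST((?2 - latitude)  / ((?2 - ?1) / 10.0) AS INTEGER) AS cell_row, "
        "             CAST((longitude - ?3) / ((?4 - ?3) / 15.0) AS INTEGER) AS cell_col, "
        "             specie "
        "      FROM observation) "
        "GROUP BY cell_row, cell_col, specie;";

    sqlite3_stmt *stmt = NULL;
    if (sqlite3_prepare_v2(db, fill, -1, &stmt, NULL) == SQLITE_OK) {
        sqlite3_bind_double(stmt, 1, minLat);   // ?1 = southern edge of the map
        sqlite3_bind_double(stmt, 2, maxLat);   // ?2 = northern edge
        sqlite3_bind_double(stmt, 3, minLon);   // ?3 = western edge
        sqlite3_bind_double(stmt, 4, maxLon);   // ?4 = eastern edge
        sqlite3_step(stmt);
    }
    sqlite3_finalize(stmt);

    // Draw time: SELECT cell_row, cell_col, COUNT(*) FROM cell_species
    // GROUP BY cell_row, cell_col;  -- distinct species per cell, 150 rows max.
}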
The loop for (int i=0; i<sightingCount; i++) is killing your performance, especially the large number of if (l>=minL&&l<maxL&&g>=minG&&g<maxG) tests, where most of the sightings will be skipped.
How large is sightingCount?
First you should use a kind of spatial optimization; a simple one: store species count lists per cell (let's call these cells "zones"). Define those zones rather large so that you do not waste space, but keep in mind that smaller zones give better performance, while zones that are too small will reverse the effect. So make the zone size configurable and test different sizes to find a good compromise!
When it's time to sum up the number of species in a cell for rendering, determine which zones the given cell overlaps (a rather simple and fast "rectangle overlap" test). Then you only have to check the species counts of those zones. This greatly reduces the iterations of your inner loop!
That's the idea behind most "spatial optimizations": divide and conquer. Here you divide your space, and you can then reject a large number of irrelevant "sightings" early with minimal effort (the added effort is the rectangle overlap test, but each test rejects multiple sightings at once, whereas your current code tests every single sighting for relevance).
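To make the zone idea concrete, here is a rough sketch in the same C style as your loop. The Zone struct, the per-zone sighting index lists and the 20x20 zone grid size are my own illustration of the bulk-reject step, not code from your project; tune the sizes as described above.
#import <Foundation/Foundation.h>

#define ZONE_ROWS 20
#define ZONE_COLS 20

typedef struct {
    double minLat, maxLat, minLon, maxLon;   // bounding rectangle of the zone
    int    count;                            // number of sightings stored in the zone
    int   *sightingIndexes;                  // indexes into the data[] array
} Zone;

static Zone zones[ZONE_ROWS][ZONE_COLS];     // filled once when the data is loaded

// Cheap bulk-reject test: does the zone rectangle overlap the cell at all?
static BOOL zoneOverlapsCell(const Zone *z,
                             double cellMinLat, double cellMaxLat,
                             double cellMinLon, double cellMaxLon)
{
    return z->maxLat >= cellMinLat && z->minLat < cellMaxLat &&
           z->maxLon >= cellMinLon && z->minLon < cellMaxLon;
}

// Called once per display cell instead of scanning all 500k sightings:
// collects the indexes of the sightings that might fall in the cell.
static int candidateSightingsForCell(double minL, double maxL,
                                     double minG, double maxG,
                                     int *outIndexes)
{
    int found = 0;
    for (int zr = 0; zr < ZONE_ROWS; zr++) {
        for (int zc = 0; zc < ZONE_COLS; zc++) {
            const Zone *z = &zones[zr][zc];
            if (!zoneOverlapsCell(z, minL, maxL, minG, maxG))
                continue;                    // rejects all of this zone's sightings at once
            for (int k = 0; k < z->count; k++)
                outIndexes[found++] = z->sightingIndexes[k];
        }
    }
    return found;   // run your existing per-sighting species test on these indexes only
}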
In a second step, also apply some obvious code optimizations: e.g. minL and maxL do not change per column, so computing them can be moved out to the row loop (just before for (int column=0; ...)).
As the latitudes of the grid are evenly distributed, you can even drop them from your grid cells, which saves some time in your iteration. Here is an example (the spatial optimization not included):
maxL = boxes[0][0].latitude;                     // top edge of the first row
incL = (maxL - buttomRight.latitude) / rows;     // constant row height
minL = maxL - incL;                              // bottom edge of the first row
for( int row = 0; row < rows; row++ )
{
    for( int column = 0; column < columns; column++ )
    {
        speciesPerBox=0;
        minG=boxes[row][column].longitude;
        if (column!=columns-1) {
            maxG=boxes[row][column+1].longitude;
        } else {
            maxG=buttomRight.longitude;
        }
        ...
        ...
    }
    ...
    maxL = minL;       // top edge of the next row = bottom edge of this row
    minL -= incL;      // move the bottom edge down by one row height
}
Maybe this also works for the longitude loop, but longitude may not be evenly distributed (i.e. "incG" is different in each step).
Note that the spatial optimization will make a huge difference; the loop optimizations make only a small (but still worthwhile) difference.
With 500k records this sounds like a job for Core Data, preferably Core Data on a desktop. If the data isn't being updated in real time you should process the data on heavier hardware and just use the iPhone to display it. That would massively simplify the app, because you would just store the value for each map cell.
Even if you did want to process it on the iPhone, you should have the app process the data once and save the results. There appears to be no reason to have the app recalculate the species value of every map cell every time it wants to display a cell.
I would suggest creating an entity in Core Data to represent observations, then another entity to represent geographical squares. Set a relationship between the squares and the observations that fall within each square, then create a calculated species-count value in the square entity. You would then only have to recalculate the species value if one of the observations changed.
This is the kind of problem that object graphs were created for. Even if the data is being continuously updated, Core Data would only perform the calculations needed to accommodate the small number of observation objects that changed at any given time, and it would do so in a highly optimized manner.
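A rough Objective-C sketch of that model follows. The entity, class and property names (GridSquare, Observation, observations, square, specie) are my own assumptions for illustration, and the derived count is computed here with a KVC collection operator; in a real model you might store it as a cached attribute instead.
#import <CoreData/CoreData.h>

@class GridSquare;

@interface Observation : NSManagedObject
@property (nonatomic, retain) NSNumber   *latitude;
@property (nonatomic, retain) NSNumber   *longitude;
@property (nonatomic, retain) NSDate     *date;
@property (nonatomic, retain) NSString   *specie;
@property (nonatomic, retain) GridSquare *square;    // to-one, inverse of "observations"
@end

@interface GridSquare : NSManagedObject
@property (nonatomic, retain) NSSet *observations;   // to-many, inverse of "square"
- (NSUInteger)speciesCount;
@end

@implementation Observation
@dynamic latitude, longitude, date, specie, square;
@end

@implementation GridSquare
@dynamic observations;

// Number of distinct species among the observations linked to this square.
// Only needs to be recomputed (or a cached attribute updated) when the
// square's observations change.
- (NSUInteger)speciesCount
{
    return [[self.observations valueForKeyPath:@"@distinctUnionOfObjects.specie"] count];
}
@end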
Edit01:
Approaching the problem from a completely different angle, still within Core Data.
(1) Create an object graph of observation records such that each observation object has a reciprocal relationship to the other observation objects that are closest to it geographically. This would create an object graph that would look like a flat, irregular net.
(2) Create methods on the observation record class that (a) determine if the record lies within the bounds of an arbitrary geographic square, (b) ask each of its related records whether they are also in the square, and (c) return its own species count plus the count of all the related records.
(3) Divide your map into some reasonably small squares, e.g. one second of arc square. Within each square select one linked record and add it to a list. Choose some percentage of all records, like 1 in every 100 or 1,000, so that you cut the list down from 500k to a sublist that can be quickly searched with a brute-force predicate. Let's call the records in that list the gridflags.
(4) When the user zooms in, use brute force to find all the gridflag records within the geographical grid. Then ask each gridflag record to send messages to each of its linked records to see (a) whether they're inside the grid, (b) what their species count is and (c) what the count is for their linked records that are also within the grid. (Use a flag to make sure each record is only queried once per search, to prevent runaway recursion.)
This way, you only have to find one record inside each arbitrarily sized grid cell and that record will find all the other records for you. Instead of stepping through each record to see which record goes in what cell every time, you just have to process the records in each cell and those immediately adjacent. As you zoom in, the number of records you actually query shrinks instead of remaining constant. If a grid cell has only a handful of records, you only have to query a handful of records.
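As one hedged Objective-C sketch of step (2): the neighbours relationship, the isInsideRegion: helper and the use of MKCoordinateRegion below are my own illustration (MapKit import assumed), not existing API on your model. The distinct species end up in a set, so its count is the species count for the queried rectangle.
// On the Observation class: gather species for a query rectangle by asking the
// geographically linked records, visiting each record at most once per search.
- (void)collectSpeciesInRegion:(MKCoordinateRegion)region
                          into:(NSMutableSet *)species
                       visited:(NSMutableSet *)visited
{
    if ([visited containsObject:self]) return;        // each record is queried only once
    [visited addObject:self];

    if (![self isInsideRegion:region]) return;        // (a) bounds test
    [species addObject:self.specie];                  // (c) contribute this record's species

    for (Observation *neighbour in self.neighbours)   // (b) ask the linked records
        [neighbour collectSpeciesInRegion:region into:species visited:visited];
}
Step (4) would then call this on each gridflag found inside the cell and read off the count of the species set.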
This would take some effort and time to set up but once you did it would be rather efficient especially when zoomed way in. For the top level, just have a preprocessed static map.
Hope I explained that well enough. It's hard to convey verbally.
Related
I have a large table in MATLAB which contains over 1000 rows of data, in two columns. Column 1 is the ID of the sensor which gathered the data, and column two is the data itself (in this case a voltage).
I have been able to sort my table to gather all the data for sensors together. So, all the data from Sensor 1 is in rows 1 to 100, the data for Sensor 2 is in rows 101 to 179, the data for Sensor 3 is in rows 180 to 310, and so on. In other words, the number of rows which contain data for a given sensor is never the same.
Now, I want to split this main table into separate tables for each sensor ID, and I am having trouble figuring out a way to do it. I imagine I could do it with a loop where I cycle through the various IDs, but that doesn't seem like a very MATLAB way of doing it.
What would be an efficient way to complete this task? Or would a loop really be the only way?
I have attached a small screenshot of some of my data.
The screenshot you shared shows a 1244x1 structure array with 2 fields but the question describes a table. You could convert the structure array to a table using,
T = struct2table(S); % Assuming S is the name of your structure
Whether the variable is a structure or table, it's better to not separate the variable and to use indexing instead. For example, assuming the variable is a table, you can compute the mean voltage for sensor1 using,
mean(T.reported_voltage(strcmp(T.sensor_id,'Sensor1')))
and you could report the mean of all groups using,
groupsummary(T,'sensor_id', 'mean')
or
splitapply(@mean,T.reported_voltage,findgroups(T.sensor_id))
But if you absolutely must break apart the tidy, well-organized table, you can split it into sub-tables stored within a cell array using,
unqSensorID = unique(T.sensor_id);
C = arrayfun(@(id){T(strcmp(T.sensor_id, id),:)},unqSensorID)
In this case the for loop is fine because (I guess) there aren't that many different sensors and your code will likely spend most of its time processing the data anyway - the loop won't give you a significant overhead.
Assuming your table is called t, the following should do what you want.
unique_sensors = unique(t.sensor_id)
for i = 1:length(unique_sensors)
    % strcmp handles sensor IDs stored as a cell array of character vectors
    sensor_data = t(strcmp(t.sensor_id, unique_sensors(i)), :);
    % save or do some processing on this data
end
I'm doing a simple project where I calculate how many times the ball in a casino roulette game lands in each pocket after 20,000 spins.
I have data for 2 different roulette tables with approximately 10k spin results each. I count the number of occurrences of each result, which works fine.
The problem arises when I want to limit the calculation to, let's say, 5k results (rows of data) for each category (roulette name in this case) and see how many times the ball lands in each pocket after 5k spins.
How do I limit how many rows of data per category I pass to the COUNT() function?
An example of my workbook is shown in the attached image.
Try this:
You will have to add a pk field for each of your results so your pk will be 1-10000 or however many records you have.
Parameter Setup:
Choose pk from the Set From Field options
Create a calculated field named N Results
N Results:
IF (pk - 1) < [View for N Results] THEN [result] END
Add N Results to your filter shelf and Exclude NULL. Show View for N Results parameter and type in a value. Your grand total should now equal the parameter value.
Final Layout:
Let me know how that turns out.
I have a small MATLAB script (included below) for handling data read from a CSV file with two columns and hundreds of thousands of rows. Each entry is a natural number, with zeros only occurring in the second column. This code is taking a truly incredible amount of time (hours) to run what should be achievable in at most some seconds. The profiler identifies that approximately 100% of the run time is spent writing a matrix of zeros, whose size varies depending on input, but in all usage is smaller than 1000x1000.
The code is as follows
function [data] = DataHandler(D)
n = size(D,1);
s = max(D,1);
data = zeros(s,s);
for i = 1:n
    data(D(i,1),D(i,2)+1) = data(D(i,1),D(i,2)+1) + 1;
end
It's the data = zeros(s,s); line that takes around 100% of the runtime. I can make the code run quickly by just changing out the s's in this line for 1000, which is a sufficient upper bound to ensure it won't run into errors for any of the data I'm looking at.
Obviously there're better ways to do this, but being that I just bashed the code together to quickly format some data I wasn't too concerned. As I said, I fixed it by just replacing s with 1000 for my purposes, but I'm perplexed as to why writing that matrix would bog MATLAB down for several hours. New code runs instantaneously.
I'd be very interested if anyone has seen this kind of behaviour before, or knows why this would be happening. It's a little disconcerting, and it would be good to be able to be confident that I can initialize matrices freely without killing MATLAB.
Your call to zeros is incorrect. Looking at your code, D looks like an n x 2 array (n rows, two columns). However, your call s = max(D,1) actually generates another n x 2 array. By consulting the documentation for max, this is what happens when you call max the way you used it:
C = max(A,B) returns an array the same size as A and B with the largest elements taken from A or B. Either the dimensions of A and B are the same, or one can be a scalar.
Therefore, because you used max(D,1), you are essentially comparing every value in D with the value 1, so what you actually get is just a copy of D. Using this as input to zeros has rather undefined behaviour. What actually happens is that for each row of s, a temporary zeros matrix of that size is allocated and then tossed away; only the dimensions given by the last row of s are recorded. Because you have a very large matrix D, this is probably why the profiler hangs here at 100% utilization. Each parameter to zeros must be a scalar, yet your call that produces s yields a matrix.
What I believe you intended should have been:
s = max(D(:));
This finds the overall maximum of the matrix D by unrolling D into a single vector and finding the overall maximum. If you do this, your code should run faster.
As a side note, this post may interest you:
Faster way to initialize arrays via empty matrix multiplication? (Matlab)
It was shown in this post that doing zeros(n,n) is in fact slow and there are several neat tricks to initializing an array of zeros. One way is to accomplish this by empty matrix multiplication:
data = zeros(n,0)*zeros(0,n);
One of my personal favourites is that if you assume that data was not declared / initialized, you can do:
data(n,n) = 0;
If I can also comment, that for loop is quite inefficient. What you are doing is calculating a 2D histogram / accumulation of data. You can replace that for loop with a more efficient accumarray call. This also avoids allocating an array of zeros and accumarray will do that under the hood for you.
As such, your code would basically become this:
function [data] = DataHandler(D)
data = accumarray([D(:,1) D(:,2)+1], 1);
accumarray in this case takes all pairs of row and column coordinates, stored in D(i,1) and D(i,2) + 1 for i = 1, 2, ..., size(D,1), and places those that share the same row and column coordinates into the same 2D bin. It then adds up all of the occurrences, so the output at each 2D bin gives you the total tally of values mapped to that row and column coordinate.
I am recording voltage changes over a small circuit; this records mouse feeding. When the mouse is eating, the circuit voltage changes, I convert that into ones and zeroes, and all is well.
BUT: I want to calculate the number and duration of 'bursts' of feeding, that is, instances of circuit closing that occur within 250 ms (75 samples) of one another. If the gap between closings is larger than 250 ms I want to count it as a new 'burst'.
I guess I am looking for help in asking MATLAB to compare the sample number of each 1 in the digital file with the sample number of the next 1 down. If the difference is more than 75, call the first 1 the end of one bout and the second 1 the start of another bout, classifying the difference as a gap. If it is NOT, keep the sample number of the first 1 and compare it against the next, and the next, and the next, until there is a 75-sample difference.
I can compare each 1 to the next 1 down:
n=1; m=2;
for i = 1:length(bouts4)-1
    if bouts4(i+1) - bouts4(i) >= 75 % 250 ms gap at a sample rate of 300
        boutend4(n) = bouts4(i);
        boutstart4(m) = bouts4(i+1);
        m = m+1;
        n = n+1;
    end
end
I don't really want to iterate through i for both variables though...
any ideas??
-DB
You can try the following code
time_diff = diff(bouts4);
new_feeding = time_diff > 75;
boutend4 = bouts4(new_feeding);
boutstart4 = [0; bouts4(find(new_feeding) + 1)];
That's actually not too bad. We can actually make this completely vectorized. First, let's start with two signals:
A version of your voltages untouched
A version of your voltages that is shifted in time by 1 step (i.e. it starts at time index = 2).
Now the basic algorithm is really:
Go through each element and see if the difference is above a threshold (in your case 75).
Enumerate the locations of each one in separate arrays
Now onto the code!
%// Ensure a column vector - you'll see why soon
bouts4 = bouts4(:);

%// Make those signals
bouts4a = bouts4(1:end-1);
bouts4b = bouts4(2:end);

%// Step #1
loc = find(bouts4b - bouts4a >= 75);

%// Step #2
boutend4 = [bouts4(loc); 0];
boutstart4 = [0; bouts4(loc + 1)];
Aside:
Thanks to tail.b.lo, you can also use diff. It performs the same difference operation as copying and shifting the vectors like I did above. However, I decided not to use it so you can see exactly how the code you wrote translates over in a vectorized way. Only way to learn, right?
Back to it!
Let's step through this slowly. The first lines of code force the data into a column vector and then make those signals I was talking about: an original one (up to length(bouts4) - 1) and another one of the same length but shifted over by one time index. Next, we use find to locate the time slots where the gap was >= 75. After that, we use these locations to index the bouts4 array. The ending array accesses the original array at those locations, while the starting array accesses the same locations moved over by one time index.
The reason we need column vectors is the way I append information to the starting vector. I am not sure whether your data comes in rows or columns, so to make this completely independent of orientation, I make sure the data is in a column. If I appended a 0 to a row vector I would have to use a space to move to the next column, whereas for a column vector I have to use a semicolon to move to the next row. To completely avoid checking whether it's a row or column vector, I force it to be a column vector no matter what.
Looking at your code, m=2, which means that when you start writing into the start array, the first location stays 0. As such, I've artificially placed a 0 at the beginning of this array and followed it with the rest of the values.
Hope this helps!
I have a time series of measurements taken at different depths of a water column. I have divided these into individual cells (for later) and require some help on how to complete the following: e.g.
time = [733774,733774,733775,733775,733775,733776,733776];
bthD = [20,10,0,15,10,20,10];
bthA = (1000:100:1600);
%Hypsographic
Hypso = [(10:1:20)',(1000:100:2000)'];
d = [1,1.3,1,2.5,2.5,1,1.2];
data = horzcat(time',bthD',d');
uniqueTimes = unique(time);
counts = hist(time,uniqueTimes);
newData = mat2cell(data,counts,length(uniqueTimes));
So, in newData I have three cells that correspond to different days of measurements; in each cell, the first column is time, the second column is depth, and the third column is the measurement. I would like to find the area at each depth in the cells; the area at different depths is given in the variable 'Hypso'.
How could I achieve this?
Your problem formulation is excellent! Very easy to understand what you need here. All you need is the function interp1. Use the first column of Hypso, I assume, as your depth, and the second column as the area. You can use the vectorized ability of the interp1 function to find all values in one call:
areaAtDepth = interp1(Hypso(:,1),Hypso(:,2),bthD)
areaAtDepth =
Columns 1 through 6
2000 1000 NaN 1500 1000 2000
Column 7
1000
You'll notice the NaN in the third column of the output. This is because its associated depth, 0, is outside the range (or support) of the data. You'll need to decide what you want to do when data is outside the range, or perhaps it never should be, in which case an error should be logged; it's up to you! Let me know if you have any more questions!