Sane cubic interpolation for "large" data set, alternative to interp1d? - scipy

I am working with audio data, so my data sets are usually around 40000 to 120000 points (1 to 3 seconds). Currently I am using linear interpolation for some task and I would like to use cubic interpolation to improve some results.
I have been using interp1d with kind='linear' to generate an interpolation function. This works great and is very intuitive.
However, when I switch to kind='cubic', my computer goes nuts --- the memory starts thrashing, the Emacs window goes dark, the mouse pointer moves very slowly, and the hard drive becomes very active. I assume this is because it's using a lot of memory. I am forced to (very slowly) open a new terminal window, run htop, and kill the Python process. (I should have mentioned I am using Linux.)
My understanding of cubic interpolation is that it only needs to examine 5 points of the data set at a time, but maybe this is mistaken.
In any case, how can I most easily switch from linear interpolation to cubic interpolation without hitting this apparent brick wall of memory usage? All the examples of interp1d use very few data points, and it's not mentioned anywhere in the docs that it won't perform well for higher orders, so I have no clue what to try next.
Edit: I just tried UnivariateSpline, and it's almost what I'm looking for. The problem is that the interpolation does not touch all data points. I'm looking for something that generates smooth curves that pass through all data points.
Edit2: It looks like maybe InterpolatedUnivariateSpline is what I was looking for.
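For later readers, here is a minimal sketch of that approach; the sample rate and test signal below are made up, just to stand in for audio data:

import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# Made-up stand-in for ~1 second of audio at 44.1 kHz.
x = np.arange(44100) / 44100.0
y = np.sin(2 * np.pi * 440.0 * x)

# k=3 builds a cubic spline that passes through every data point.
f = InterpolatedUnivariateSpline(x, y, k=3)
y_new = f(np.linspace(0.0, x[-1], 200000))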

I had a similar problem in ND interpolation. My solution was to split the data into domains and construct interpolation functions for each domain.
In your case, you can split your data into chunks of 500 points each and interpolate over whichever chunk contains the point you need.
f1 = [0, ..., 495]
f2 = [490, ..., 990]
f3 = [985, ..., 1485]
...
and so on.
Also make sure the intervals of adjacent functions overlap; in the example above, the overlap is 5 points. I guess you have to do some experimenting to find the optimal overlap.
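A rough Python sketch of that idea, assuming sorted x data; the helper names are mine, and a real implementation should merge a trailing chunk shorter than 4 points into the previous one, since a cubic fit needs at least 4:

import numpy as np
from scipy.interpolate import interp1d

def build_chunked_interpolants(x, y, chunk=500, overlap=5):
    # One cubic interpolant per overlapping chunk of the data.
    funcs = []
    step = chunk - overlap
    for start in range(0, len(x), step):
        stop = min(start + chunk, len(x))
        funcs.append((x[start], x[stop - 1],
                      interp1d(x[start:stop], y[start:stop], kind='cubic')))
        if stop == len(x):
            break
    return funcs

def evaluate(funcs, xq):
    # Use the first chunk whose interval contains the query point.
    for x0, x1, f in funcs:
        if x0 <= xq <= x1:
            return f(xq)
    raise ValueError('query point outside the data range')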
I hope this helps.

Related

scipy.interpolate.griddata slow due to unnecessary data

I have a map with a 600*600 equidistant x,y grid with associated scalar values.
I have around 1000 x,y coordinates at which I would like to get the bilinearly interpolated map values. Those are randomly placed in an inner center area of the map of around 400*400 in size.
I decided to go with the griddata function with method linear. My understanding is that with linear interpolation I would only need the three nearest grid positions around each coordinate to get well-defined interpolated values. So I would require around 3000 data points of the map to perform the interpolation. The full 360k data points are largely unnecessary for this task.
Naively throwing the complete map in results in long execution times of about half a minute. Since it's easy to narrow the map down to the area of interest, I could reduce the execution time to nearly 20%.
I am now wondering if I overlooked something in my assumption that I need only the three nearest neighbours for my task. And if not, whether there is a fast way to filter those 3000 out of the 360k. I assume looping 3000 times over the 360k lines will take longer than just throwing in the inner map.
Edit: I also had a look at the comparison of the results with the 600*600 grid and the reduced data points. I am actually surprised and concerned by the observation that the interpolation results differ significantly in places.
So I found out that RegularGridInterpolator is the way to go for me. It's fast, and I have a regular grid already.
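For reference, a minimal sketch of that approach; the shapes and values below are stand-ins for the map described above:

import numpy as np
from scipy.interpolate import RegularGridInterpolator

x = np.arange(600)
y = np.arange(600)
values = np.random.rand(600, 600)          # stand-in for the map

interp = RegularGridInterpolator((x, y), values, method='linear')

# ~1000 query points in the inner 400*400 region of the map.
points = 100.0 + 400.0 * np.random.rand(1000, 2)
result = interp(points)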
I tried to sort out my findings about the differences in interpolation values and found that griddata shows unexpected behavior.
Check out the issue I created for details.
https://github.com/scipy/scipy/issues/17378

Why does putting a filter on an output modify the source signal? Is this a Simulink bug?

I know it sounds strange and that's a bad way to write a question, but let me show you this odd behavior.
As you can see, this signal, r5, is nice and clean: exactly what I expected from my simulation.
Now look at this:
This is EXACTLY the same simulation; the only difference is that the filter is now not connected. I tried for hours to find a reason, but it seems like a bug.
This is my file; you can test it yourself by disconnecting the filter.
Edit:
Tried it with Simulink 2014 and on a friend's 2013, on two different computers... if someone can test it on 2015, that would be great.
(Attaching the filter to any other r, r1-r4 included, ''fixes'' the noise on ALL of r1-r8; I tried putting it on other signals, but the noise won't go away.)
The expected result is exactly the smooth one. This file has proven quite robust in other simulations (so I guess the math inside the blocks is good), and this case happens only with one of the two ''link number'' inputs (one input on the top left) set to 4, though a small noise appears even with one ''link number'' set to 3.
Thanks in advance for any help.
It seems to me that the only thing the filter could affect is the time step used in the integration, assuming you are using a dynamic (variable) time step, which is the default. So my guess is that, if this is not a bug, your system is numerically unstable or chaotic. It could also be related to noise caused by differentiation; differentiating noise over a smaller time step mostly makes things even worse.
Solvers such as ode23 and ode45 use a dynamic time step. ode23 compares a second and third order integration and selects the third one if the difference between the two is not too big. If the difference is too big, it does another calculation with a smaller timestep. ode45 does the same with a fourth and fifth order calculation, more accurate, but more sensitive. Instabilities can occur if a smaller time step makes things worse, which could occur if you differentiate noise.
To overcome the problem, try using a fixed time step, change your precision or solver, or better: avoid differentiation altogether and use some type of state estimator to obtain derivatives, or calculate them analytically.
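A small Python sketch (not Simulink) of that last point, showing why differentiating a noisy signal gets worse as the step shrinks; the test signal and noise level are made up:

import numpy as np

def derivative_error(dt, noise_amp=1e-3):
    # Finite-difference derivative of a noisy sine vs. the true derivative.
    t = np.arange(0.0, 1.0, dt)
    x = np.sin(2 * np.pi * t) + noise_amp * np.random.randn(t.size)
    dx = np.diff(x) / dt
    true_dx = 2 * np.pi * np.cos(2 * np.pi * t[:-1])
    return np.max(np.abs(dx - true_dx))

for dt in (1e-2, 1e-3, 1e-4):
    # The noise contribution to the error grows roughly like noise_amp / dt.
    print(dt, derivative_error(dt))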

MATLAB: Using CONVN for moving average on Matrix

I'm looking for a bit of guidance on using CONVN to calculate moving averages in one dimension on a 3d matrix. I'm getting a little caught up on the flipping of the kernel under the hood and am hoping someone might be able to clarify the behaviour for me.
A similar post that still has me a bit confused is here:
CONVN example about flipping
The Problem:
I have daily river and weather flow data for a watershed at different source locations.
So the matrix is laid out as follows:
dim 1 (the rows) represent each site
dim 2 (the columns) represent the date
dim 3 (the pages) represent the different type of measurement (river height, flow, rainfall, etc.)
The goal is to try and use CONVN to take a 21 day moving average at each site, for each observation point for each variable.
As I understand it, I should just be able to use a kernel such as:
ker = ones(1,21) ./ 21.;
mat = randn(150,365*10,4);
avgmat = convn(mat,ker,'valid');
I tried playing around and created another kernel which should also work (I think) and set ker2 as:
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat,ker2,'valid');
The question:
The results don't quite match and I'm wondering if I have the dimensions incorrect here for the kernel. Any guidance is greatly appreciated.
Judging from the context of your question, you have a 3D matrix and you want to find the moving average of each row independently over all 3D slices. The code above should work (the first case). However, the valid flag returns a matrix whose size is valid in terms of the boundaries of the convolution. Take a look at the first point of the post that you linked to for more details.
Specifically, 20 entries per row will be trimmed due to the valid flag: for a row of length N and a 21-point kernel, the output has N - 21 + 1 = N - 20 columns. It's only once the convolution kernel is completely contained inside a row of the matrix (from the 21st entry onward) that you get valid results (no pun intended). If you'd like to see the entries at the boundaries, then you'll need the 'same' flag to maintain the same size matrix as the input, or the 'full' flag (the default), which gives you the output size starting from the most extreme outer edges; but bear in mind that the boundary entries of the moving average are computed against padded zeroes, and so they wouldn't be what you expect anyway.
However, if I'm interpreting what you are asking correctly, then the valid flag is what you want; just bear in mind that you will have 20 entries missing to accommodate the edge cases. All in all, your code should work, but be careful in how you interpret the results.
BTW, you have a symmetric kernel, and so flipping should have no effect on the convolution output. What you have specified is a standard moving averaging kernel, and so convolution should work in finding the moving average as you expect.
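A quick NumPy sketch (not MATLAB) of the one-dimensional size bookkeeping, for one row of ten years of daily data:

import numpy as np

x = np.random.randn(365 * 10)
ker = np.ones(21) / 21.0

avg = np.convolve(x, ker, mode='valid')
print(x.size, avg.size)              # 3650 -> 3630: 20 entries fewer

# Symmetric kernel: convolution (kernel flipped) equals correlation.
print(np.allclose(avg, np.correlate(x, ker, mode='valid')))   # True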
Good luck!

Where are jplephem ephemerides api documented?

I am working on what is likely a unique use case: I want to use Skyfield to do some calculations on a hypothetical star system. I would do this by creating my own ephemeris and using that instead of the actual one. The problem I am finding is that I cannot find documentation on the API for replacing the ephemerides with my own.
Is there documentation? Is Skyfield flexible enough to do what I am trying?
Edit:
To clarify what I am asking, I understand that I will have to do some gravitational modeling (and I am perfectly willing to configure every computer, tablet, cable box and toaster in this house to crunch on those numbers for a few days :), but before I really dive into it, I wanted to know what the data looks like. If it is just a module with a number of named numpy 2d arrays... that makes it rather easy, but I didn't see this documented anywhere.
The JPL-issued ephemerides used by Skyfield, like DE405 and DE406 and DE421, simply provide a big table of numbers for each planet. For example, Neptune’s position might be specified in 7-day increments, where for each 7-day period from the beginning to the end of the ephemeris the table provides a set of polynomial coefficients that can be used to estimate Neptune's position at any moment within that 7-day period. The polynomials are designed, if I understand correctly, so that their first and second derivatives mesh smoothly with those of the previous and following 7-day polynomials at the moment where one ends and the next begins.
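(The actual JPL files store Chebyshev coefficients. As a toy illustration of that "one set of polynomial coefficients per segment" layout, with entirely made-up numbers:)

import numpy as np
from numpy.polynomial.chebyshev import chebval

coeffs = np.random.randn(3, 8)          # made-up x, y, z coefficients
seg_start, seg_days = 2451545.0, 7.0    # hypothetical 7-day segment

def position(jd):
    # Map the date into the segment's [-1, 1] domain and evaluate.
    s = 2.0 * (jd - seg_start) / seg_days - 1.0
    return np.array([chebval(s, c) for c in coeffs])

print(position(2451548.5))              # mid-segment position (toy units)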
The JPL generates these huge tables by taking the positions of the planets as we have recorded them over human history, taking the rules by which we think an ideal planet would move given gravitational theory, the drag of the solar wind, the planet's own rotation and dynamics, its satellites, and so forth, and trying to choose a “real path” for the planet that agrees with theory while passing as close to the actual observed positions as best as it can.
This is a big computational problem that, I take it, requires quite a bit of finesse. If you cannot match all of the observations perfectly — which you never can — then you have to decide which ones to prioritize, and which ones are probably not as accurate to begin with.
For a hypothetical system, you are going to have to start from scratch by doing (probably?) a gravitational dynamics simulation. There are, if I understand correctly, several possible approaches that are documented in the various textbooks on the subject. Whichever one you choose should let you generate x,y,z positions for your hypothetical planets, and you would probably instantiate these in Skyfield as ICRS positions if you then wanted to use Skyfield to compute distances, observations, or to draw diagrams.
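If I remember Skyfield's internals right, raw positions live in skyfield.positionlib; here is a sketch, but treat the exact constructor signature as an assumption to verify against the Skyfield docs:

from skyfield.api import load
from skyfield.positionlib import ICRF

ts = load.timescale()
t = ts.utc(2024, 1, 1)

# Made-up x, y, z in au, standing in for your simulation's output.
p = ICRF([1.2, -0.4, 0.1], t=t)
print(p.distance())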
Though I have not myself used it, I have seen good reviews of:
http://www.amazon.com/Solar-System-Dynamics-Carl-Murray/dp/0521575974

Implementing runtime-constant persistent "LUT"s in MATLAB functions

I am implementing (for class, so no built-ins!) a variant of the Hough Transform to detect circles in an image. I have a working product, and I get correct results. In other words, I'm done with the assignment! However, I'd like to take it one step further and try to improve the performance a little bit. Also, to forestall any inevitable responses, I know MATLAB isn't exactly the most performance-based language, and I know that the Hough transform isn't exactly performance-based either, but hear me out.
When generating the Hough accumulation space, I end up needing to draw a LOT of circles (approximately 75 for every edge pixel in the image; one for each search radius). I wrote a nifty function to do this, and it's already fairly optimized. However, I end up recalculating lots of circles of the same radius (expensive) at different locations (cheap).
An easy way for me to optimize this was to precalculate a circle of each radius centered at zero once, then just select the proper circle and shift it into the correct position. This was easy, and it works great!
The trouble comes when trying to access this lookup table of circles.
I initially made it a persistent variable, as follows:
function [x_subs, y_subs] = get_circle_indices(circ_radius, circ_x_center, circ_y_center)

persistent circle_lookup_table;

% Make sure the table's already been generated; if not, generate it.
if (isempty(circle_lookup_table))
    circle_lookup_table = generate_circles(100); % upper bound on circle size
end

% Get the right circle from the struct, and center it at the requested center.
x_subs = circle_lookup_table(circ_radius).x_coords + circ_x_center;
y_subs = circle_lookup_table(circ_radius).y_coords + circ_y_center;

end
However, it turns out this is SLOW!
Over 200,000 function calls, MATLAB spent on average 9 microseconds per call just to establish that the persistent variable exists (not the isempty() call, but the actual persistent declaration). This is according to MATLAB's built-in profiler.
This added back most of the time gained from implementing the lookup table.
I also tried implementing it as a global variable (similar time spent checking that the variable is declared) and passing it in as an argument (which made the function call much more expensive).
So, my question is this:
How do I provide fast access inside a function to runtime-constant data?
I look forward to some suggestions.
It is NOT runtime-constant data, since your function has the ability to generate the table. So your main problem is to remove that check from the function: before any calls to this critical function, make sure the array is generated elsewhere, outside the function.
However, there is a nice trick I've read in MATLAB's own files, more specifically bwmorph. For each feature of this particular function that requires a LUT, they have created a function that returns the LUT itself (the LUT is written explicitly in the file). They also add the instruction coder.inline('always') to ensure that this function will be inlined. It seems to be quite efficient!
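For comparison, here is the same "hoist the table out of the hot path" idea as a Python sketch; all the names here are hypothetical:

import numpy as np

def _generate_circle(radius, n_points=64):
    # Integer x/y offsets of a circle of the given radius, centered at 0.
    theta = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return (np.round(radius * np.cos(theta)).astype(int),
            np.round(radius * np.sin(theta)).astype(int))

# Build the whole table once at import time, so the hot function never
# has to check whether the table already exists.
_CIRCLE_LUT = {r: _generate_circle(r) for r in range(1, 101)}

def get_circle_indices(radius, xc, yc):
    # Shift the precomputed circle of this radius to center (xc, yc).
    x_off, y_off = _CIRCLE_LUT[radius]
    return x_off + xc, y_off + yc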