scipy.interpolate.griddata slow due to unnecessary data - scipy

I have a map on a 600*600 equidistant x,y grid with associated scalar values.
I have around 1000 x,y coordinates at which I would like to get the bilinear-interpolated map values. They are randomly placed in an inner, central area of the map of around 400*400 in size.
I decided to go with the griddata function with method linear. My understanding is that with linear interpolation I only need the three nearest grid positions around each coordinate to get well-defined interpolated values. So I would need around 3000 data points of the map to perform the interpolation. The full 360k data points are unnecessary for this task.
Naively passing in the complete map results in long execution times of about half a minute. Since it is easy to narrow the map down to the area of interest, I could reduce the execution time to nearly 20%.
I am now wondering whether I overlooked something in my assumption that I need only the three nearest neighbours for my task. And if not, whether there is a fast way to filter those 3000 points out of the 360k. I assume looping 3000 times over the 360k lines will take longer than just passing in the inner map.
Edit: I also compared the results for the full 600*600 map and for the reduced set of data points. I am surprised and concerned to see that the interpolation results differ, in places significantly.

So I found out that RegularGridInterpolator is the way to go for me. It's fast and I already have a regular grid.
I tried to sort out my findings on the differences in the interpolated values and found that griddata shows behavior I did not expect.
Check out the issue I created for details.
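For anyone landing here, this is a minimal sketch of the RegularGridInterpolator route, not the asker's actual code. The grid extent (0 to 1), the test function and the query region are made-up stand-ins. With method="linear" it interpolates within the grid cell that contains each query point, so it never has to triangulate the 360k scattered points the way griddata does.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical 600 x 600 regular grid; the real axes would come from the map.
x = np.linspace(0.0, 1.0, 600)
y = np.linspace(0.0, 1.0, 600)
xx, yy = np.meshgrid(x, y, indexing="ij")
values = 2.0 * xx + 3.0 * yy  # stand-in scalar field, values[i, j] = f(x[i], y[j])

# Build the interpolator once; it only stores the axes and the value array.
interp = RegularGridInterpolator((x, y), values, method="linear")

# About 1000 query coordinates inside an inner region of the map.
rng = np.random.default_rng(0)
query = rng.uniform(0.2, 0.8, size=(1000, 2))
result = interp(query)  # one interpolated value per (x, y) row
```

Because the stand-in field is a plane, the interpolated values can be checked against the exact formula, which is a handy sanity test before switching to real data.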


Normalized histogram in MATLAB incorrect?

I have the following set of data:
4.659]; % Sample set
I fitted a Pareto distribution to this using the maximum likelihood method and I obtain the following graph:
Where the following bit of code is what draws the histogram:
[N,edges,bin] = histcounts(X,'BinMethod','auto');
Am I doing this right? I checked 100 times and the Pareto distribution is indeed optimal, but it seems awfully different from the histogram. Is there an error that may be causing this? Thank you!
I would agree with @tashuhka's comment that you need to think about how you're binning your data.
Imagine the extreme case where you lump everything together into one bin, and then try to fit that single point to a distribution. Your PDF would look nothing like your single square bar. Split into two bins, and now the fit still sucks, but at least one bar is (probably) a little bigger than the other, etc., etc. At the other extreme, every data point has its own bar and the bar graph is nothing but a random forest of bars with only one count.
There are a number of different strategies for choosing an "optimal" bin size that minimizes the number of bins but maximizes the representation of the underlying PDF.
Finally, note that you only have 30 points here, so your other problem may be that you just haven't collected enough data to really nail down the underlying PDF.
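To make the binning point concrete, here is a small sketch in Python (not the asker's MATLAB code) of how bar heights have to be scaled before they can be compared with a PDF: each count is divided by the number of samples times the bin width, so the bars enclose an area of 1. The sample values are made up, and Sturges' rule is only one of the bin-count strategies mentioned above. In MATLAB, histcounts offers the same scaling through its 'Normalization','pdf' option.

```python
import math

def density_histogram(data, n_bins=None):
    """Return bin edges, PDF-scaled bar heights and the bin width."""
    n = len(data)
    if n_bins is None:
        # Sturges' rule: one simple choice for the number of bins.
        n_bins = math.ceil(math.log2(n)) + 1
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp so the maximum value falls into the last bin.
        i = min(int((x - lo) / width), n_bins - 1)
        counts[i] += 1
    # Dividing by n * width makes the total bar area equal to 1.
    heights = [c / (n * width) for c in counts]
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, heights, width

# Made-up sample, only for illustration.
sample = [1.0, 1.1, 1.2, 1.3, 1.5, 1.8, 2.2, 2.9, 4.0, 4.6]
edges, heights, width = density_histogram(sample)
area = sum(h * width for h in heights)
```

If the bars are plotted as raw counts or as fractions of the total instead, they will sit on a different vertical scale than the fitted PDF even when the fit is fine.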

Keeping track of moves when pre-processing graphs

I am writing an algorithm in MATLAB to pre-process a large graph for use with a path-finding algorithm, and I am curious as to the best way that I can keep track of my moves in order to be able to reconstruct the solution and project it onto the original graph.
The pre-processing methods I am using so far are relatively simple; 3 techniques I am using are:
1) Remove long edges:
Any edge (a,b) that can be reached by sequence (a,c,b) where (a,b) > (a,c)+(c,b), is removed
2) Remove vertices with degree 1
If a vertex with one edge coming out of it is not either the start or end-point of the path, then that vertex will never be part of the path, and it can be removed
3) Remove vertices with degree 2
If a vertex b has two edges coming out of it, then b can be removed and edges (a,b) and (b,c) can be replaced by a single edge (a,c) with length (a,b) + (b,c).
The algorithm iterates through these 3 techniques until no further changes are possible in the graph, at which point it removes all the empty rows and columns in the graph adjacency matrix and returns the reduced graph for use with the path-finding algorithm.
The pre-processing algorithm works great; in some cases I am able to achieve a reduction of around 70% in the graph size, and my path-finding algorithm is able to find a path of the same quality as on the un-processed graph but an order of magnitude faster.
My problem now is in reconstructing the solution on the original graph, so-called "post-processing".
I feel like I should be keeping track of all the moves my pre-processing algorithm makes and then applying them in reverse order after it has finished. I am just not quite sure how I should go about that.
Here is what I had in mind:
First, keep track of all the empty rows and columns I removed from the matrix after pre-processing and re-insert them.
Then have a simple vector with indices representing the move number and the value representing what type of move.
Then have one cell array for each of the 3 move "types" containing the data from each move in the order they were performed, with their own iteration counter.
Then, if I iterate backwards over the move list, it will tell me which cell array to access, and I can apply the reverse operation that is next on that list (kind of like a stack data structure).
This seems a bit unwieldy to me, so I was wondering if anyone else had any ideas for a good method of keeping track of my moves that is easily reversible?
EDIT: I thought about posting this on the computer science stack exchange; but my question isn't really about the pre-processing methods themselves, but about data storage and retrieval and the implementation itself. But feel free to migrate it if you think it would be better suited elsewhere
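One possible simplification, sketched here in Python rather than MATLAB: keep a single list of (kind, data) records used as a stack, instead of one container per move type plus an index vector. Under the assumption that the reduced path only has to be mapped back to a vertex sequence, only the degree-2 contractions need undoing, since removed long edges and degree-1 vertices never appear in the path. The record layout and the function name are hypothetical, and the sketch assumes the contracted edge (a,c) did not coincide with an already existing edge between a and c.

```python
def expand_path(path, moves):
    """Map a path found on the reduced graph back onto the original graph.

    moves is a list of (kind, data) records in the order they were applied;
    for kind == "contract", data is (a, b, c): vertex b was removed and the
    edges (a, b) and (b, c) were replaced by a single edge (a, c).
    """
    for kind, data in reversed(moves):  # undo the newest move first
        if kind != "contract":
            continue  # removed edges/vertices cannot be part of the path
        a, b, c = data
        expanded = [path[0]]
        for v in path[1:]:
            u = expanded[-1]
            if (u, v) == (a, c) or (u, v) == (c, a):
                expanded.append(b)  # re-insert the contracted vertex
            expanded.append(v)
        path = expanded
    return path

# Chain 1-2-3-4-5: vertex 3 is contracted first, then vertex 2.
moves = [
    ("remove_vertex", 9),       # degree-1 vertex, nothing to undo
    ("contract", (2, 3, 4)),    # creates edge (2, 4)
    ("contract", (1, 2, 4)),    # creates edge (1, 4)
]
reduced_path = [1, 4, 5]
original_path = expand_path(reduced_path, moves)
```

Vertex ids are kept as in the original graph here, so the step that deletes empty rows and columns would only need an index map (new index to old index) applied to the path before expanding it.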

MATLAB: Using CONVN for moving average on Matrix

I'm looking for a bit of guidance on using CONVN to calculate moving averages in one dimension on a 3d matrix. I'm getting a little caught up on the flipping of the kernel under the hood and am hoping someone might be able to clarify the behaviour for me.
A similar post that still has me a bit confused is here:
CONVN example about flipping
The Problem:
I have daily river flow and weather data for a watershed at different source locations.
The matrix is laid out as follows:
dim 1 (the rows) represents each site
dim 2 (the columns) represents the date
dim 3 (the pages) represents the different types of measurement (river height, flow, rainfall, etc.)
The goal is to use CONVN to take a 21-day moving average at each site, for each observation point and each variable.
As I understand it, I should just be able to use a kernel such as:
ker = ones(1,21) ./ 21.;
mat = randn(150,365*10,4);
avgmat = convn(mat,ker,'valid');
I tried playing around and created another kernel which should also work (I think) and set ker2 as:
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat,ker2,'valid');
The question:
The results don't quite match and I'm wondering if I have the dimensions incorrect here for the kernel. Any guidance is greatly appreciated.
Judging from the context of your question, you have a 3D matrix and you want to find the moving average of each row independently over all 3D slices. The code above should work (the first case). However, the valid flag returns a matrix whose size is valid in terms of the boundaries of the convolution. Take a look at the first point of the post that you linked to for more details.
Specifically, 20 entries of each row will be missing due to the valid flag: with a 21-point kernel, each row of the output has 21 - 1 = 20 fewer columns than the input. Only from the 21st entry of each row onwards is the convolution kernel completely contained inside the row, and it's from that point that you get valid results (no pun intended). If you'd like to see the entries at the boundaries, use the 'same' flag to keep the output the same size as the input, or the 'full' flag (the default), which gives you the output starting from the most extreme outer edges. Bear in mind that the moving average is then computed with zero padding, so the entries near the boundaries wouldn't be what you expect anyway.
However, if I'm interpreting your question correctly, then the valid flag is what you want, but bear in mind that each row of the output will be 20 entries shorter to accommodate the edge cases. All in all, your code should work, but be careful how you interpret the results.
BTW, you have a symmetric kernel, and so flipping should have no effect on the convolution output. What you have specified is a standard moving averaging kernel, and so convolution should work in finding the moving average as you expect.
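For comparison, here is a rough Python analogue using scipy.signal.convolve in place of CONVN, on a small made-up array instead of the 150 x 3650 x 4 matrix. It shows the output size in 'valid' mode and what the 3 x 21 kernel from the question does: because it spans three rows, 'valid' also drops the first and last site, while the values themselves are the same row averages. Note that in NumPy the kernel needs an explicit third dimension.

```python
import numpy as np
from scipy.signal import convolve

rng = np.random.default_rng(0)
mat = rng.standard_normal((5, 60, 2))  # sites x days x variables (small stand-in)

window = 21
# Kernel of shape (1, 21, 1): averages along the date axis only.
ker = np.ones((1, window, 1)) / window
avg = convolve(mat, ker, mode="valid", method="direct")

# Analogue of ker2: the same kernel with a zero row above and below.
ker2 = np.zeros((3, window, 1))
ker2[1, :, 0] = 1.0 / window
avg2 = convolve(mat, ker2, mode="valid", method="direct")

print(avg.shape)   # date axis shrinks from 60 to 60 - 21 + 1 = 40
print(avg2.shape)  # site axis also shrinks, from 5 to 5 - 3 + 1 = 3
```

So the two results differ in size and are offset by one row, which is likely why a direct comparison does not match: avg2 lines up with avg[1:-1] rather than with avg.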
Good luck!

Find points near LineString in mongodb sorted by distance

I have an array of points representing a street (black line) and points representing places on a map (red points). I want to find all the points near the specified street, sorted by distance. I also need to be able to specify a max distance (blue and green areas). Here is a simple example:
I thought of using the $near operator but it only accepts Point as an input, not LineString.
How can MongoDB handle this type of query?
As you mentioned, Mongo currently doesn't support anything other than Point. Have you come across the concept of a route boxer [1]? It was very popular a few years back on Google Maps. Given the line that you've drawn, find stops that are within dist(x). It was done by creating a series of bounding boxes around each point in the line, and searching for points that fall within the bucket.
I stumbled upon your question after I just realised that Mongo only works with points, which is reasonable I assume.
I already have a few options for how to do it (they expand on what @mnemosyn says in the comment). With the dataset that I'm working on, it's all on the client side, so I could use the routeboxer, but I would like to implement it server-side for performance reasons. Here are my suggestions:
1) Break the LineString down into its individual coordinate sets, query with $near using each of those, combine the results and extract a unique set. There are algorithms out there for simplifying a complex line by reducing the number of points, but a simple one is easy to write.
2) Do the same as above, but as a stored procedure/function. I haven't played around with Mongo's stored functions, and I don't know how well they work with drivers, but this could be faster than the first option above as you won't have to do roundtrips, and depending on the machine that your instance(s) of Mongo is(are) hosted on, calculations could be faster by microseconds.
3) Implement the routeboxer approach server-side (has been done in PHP), and then use either of the above two to find stops that are $within the resulting bounding boxes. Heck, since the routeboxer method returns rectangles, it would be possible to merge all these rectangles into one polygon covering your route, and just do a $within on that. (What @mnemosyn suggested.)
EDIT: I thought of this but forgot about it, but it might be possible to achieve some of the above using the aggregation framework.
It's something that I'm going to be working on soon (hopefully), I'll open-source my result(s) based on which I end up going with.
EDIT: I must mention though that 1 and 2 have the flaw that if you have 2 points in a line that are, say, 2km apart, and you want points that are within 1.8km of your line, you'll miss some of the points alongside the middle of that part of your line. The solution is to inject points onto your line when simplifying it (I know, it defeats the purpose of reducing points when you add new ones back in).
The flaw with 3 then is that it won't always be accurate as some points within your polygon are likely to have a distance greater than your limit, though the difference wouldn't be a significant percentage of your limit.
[1] google maps utils routeboxer
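The "inject points" fix mentioned for options 1 and 2 could look roughly like this. It is a sketch that treats coordinates as planar (x, y) pairs, which is only an approximation for longitude/latitude over short distances; the function name and the max_gap parameter are made up. Each densified vertex would then be used as the centre of a separate $near query.

```python
import math

def densify(line, max_gap):
    """Insert evenly spaced points so that consecutive vertices of the
    line are at most max_gap apart (planar distances)."""
    out = [line[0]]
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        dist = math.hypot(x1 - x0, y1 - y0)
        pieces = max(1, math.ceil(dist / max_gap))
        for i in range(1, pieces + 1):
            t = i / pieces
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out

# A 2-unit segment, densified so that vertices are at most 0.5 apart.
street = [(0.0, 0.0), (2.0, 0.0)]
dense = densify(street, 0.5)
```

In this planar picture, a place at distance d from the line and halfway between two vertices that are s apart is sqrt(d^2 + (s/2)^2) away from the nearest vertex, so the query radius has to be slightly larger than the wanted limit, and the results filtered by exact distance afterwards.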
As you said, Mongo's $near only works with a point, not a line, as the centre. However, if you flip your premise from "find points near the line" to "find the line near the point", then you can use your points as the centre and the line as the target.
This is the difference between
foreach line find points near it
and
foreach point find line near it
If you have a large number of points to check, you can combine this with nevi_me's answer to reduce the list of points that need checking to a much smaller subset.
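Done client-side on an already reduced candidate list, the "for each point, find the line near it" idea might be sketched as below. This is plain Python with planar coordinates, not a MongoDB query, and the function names are made up. It computes each point's distance to the nearest segment of the street, drops points beyond the limit and sorts the rest by distance.

```python
import math

def dist_to_segment(p, a, b):
    """Planar distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def points_near_line(points, line, max_dist):
    """Return (distance, point) pairs within max_dist, nearest first."""
    scored = []
    for p in points:
        d = min(dist_to_segment(p, a, b) for a, b in zip(line, line[1:]))
        if d <= max_dist:
            scored.append((d, p))
    scored.sort()
    return scored

street = [(0, 0), (10, 0)]
places = [(5, 3), (5, 1), (20, 0), (-1, 0)]
nearby = points_near_line(places, street, max_dist=2)
```

For real longitude/latitude data the distance function would have to be swapped for a geodesic one; the filtering and sorting logic stays the same.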

Sane cubic interpolation for "large" data set, alternative to interp1d?

I am working with audio data, so my data sets are usually around 40000 to 120000 points (1 to 3 seconds). Currently I am using linear interpolation for some task and I would like to use cubic interpolation to improve some results.
I have been using interp1d with kind='linear' to generate an interpolation function. This works great and is very intuitive.
However, when I switch to kind='cubic', my computer goes nuts: the memory starts thrashing, the Emacs window goes dark, the mouse pointer starts moving very slowly, and the hard drive becomes very active. I assume this is because it's using a lot of memory. I am forced to (very slowly) open a new terminal window, run htop, and kill the Python process. (I should have mentioned I am using Linux.)
My understanding of cubic interpolation is that it only needs to examine 5 points of the data set at a time, but maybe this is mistaken.
In any case, how can I most easily switch from linear interpolation to cubic interpolation without hitting this apparent brick wall of memory usage? All the examples of interp1d use very few data points, and it's not mentioned anywhere in the docs that it won't perform well for higher orders, so I have no clue what to try next.
Edit: I just tried UnivariateSpline, and it's almost what I'm looking for. The problem is that the interpolation does not touch all data points. I'm looking for something that generates smooth curves that pass through all data points.
Edit2: It looks like maybe InterpolatedUnivariateSpline is what I was looking for.
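For reference, a minimal sketch of what that looks like, assuming one second of audio at a made-up 40 kHz sample rate and a sine wave as a stand-in signal. With k=3, InterpolatedUnivariateSpline builds a cubic spline that passes through every data point, unlike UnivariateSpline with its default smoothing.

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

rate = 40000                       # hypothetical sample rate, 1 second of audio
t = np.arange(rate) / rate         # strictly increasing sample times
signal = np.sin(2 * np.pi * 440 * t)

# k=3 requests a cubic interpolating spline through all samples.
spline = InterpolatedUnivariateSpline(t, signal, k=3)

# Evaluate on a 4x finer time grid.
t_fine = np.linspace(t[0], t[-1], 4 * rate)
resampled = spline(t_fine)
```

Evaluating the spline at the original sample times should reproduce the samples to within rounding, which is a quick check that it really interpolates rather than smooths.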
I had a similar problem in ND interpolation. My solution was to split the data into domains and construct interpolation functions for each domain.
In your case, you can split your data into bunches of 500 points and interpolate over them depending where you are.
f1 = [0,...,495]
f2 = [490,...,990]
f3 = [985,...,1485]
and so on.
Also make sure to have an overlap of the intervals of each function. In the example,
the overlap is 5 points. I guess you have to do some experimenting to see what is the optimal overlap.
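The index ranges for such overlapping chunks can be generated along these lines. This is only a sketch: the function name is made up, and the chunk size of 500 and overlap of 5 are the example values from above, not tuned recommendations. A separate interpolation function would then be built from each (start, stop) slice of the data.

```python
def overlapping_chunks(n, size=500, overlap=5):
    """Return (start, stop) index pairs that cover range(n), where
    consecutive chunks share `overlap` points. Requires size > overlap."""
    step = size - overlap
    chunks = []
    start = 0
    while True:
        stop = min(start + size, n)
        chunks.append((start, stop))
        if stop == n:
            break
        start += step
    return chunks

# 1200 data points split into chunks of up to 500 with 5 shared points.
chunks = overlapping_chunks(1200, size=500, overlap=5)
```

When evaluating, one would pick the chunk whose interior contains the query position, so that values near a chunk boundary come from the neighbour that still has data on both sides.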
I hope this helps.