Keeping track of moves when pre-processing graphs - MATLAB

I am writing an algorithm in MATLAB to pre-process a large graph for use with a path-finding algorithm, and I am curious about the best way to keep track of my moves so that I can later reconstruct the solution and project it onto the original graph.
The pre-processing methods I am using so far are relatively simple; the 3 techniques are:
1) Remove long edges:
Any edge (a,b) that can be bypassed by a sequence (a,c,b), where length(a,b) > length(a,c) + length(c,b), is removed.
2) Remove vertices with degree 1
If a vertex with only one edge coming out of it is neither the start nor the end-point of the path, then that vertex will never be part of the path, and it can be removed.
3) Remove vertices with degree 2
If a vertex b has two edges coming out of it, then b can be removed and the edges (a,b) and (b,c) replaced by a single edge (a,c) with length length(a,b) + length(b,c).
The algorithm iterates through these 3 techniques until no further changes are possible in the graph, at which point it removes all the empty rows and columns in the graph adjacency matrix and returns the reduced graph for use with the path-finding algorithm.
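To make the rules concrete, here is a much-simplified sketch of one reduction pass (Python with a dict-of-dicts graph purely for illustration; my actual implementation works on a MATLAB adjacency matrix):

def reduce_once(g, keep):
    # g: undirected graph as a dict of dicts, g[u][v] = edge length (symmetric).
    # keep: vertices that must never be removed (the start and end points).
    changed = False
    # 1) Remove edge (a,b) if some two-hop path (a,c,b) is shorter.
    for a in list(g):
        for b in list(g[a]):
            if any(c != b and b in g[c] and g[a][c] + g[c][b] < g[a][b]
                   for c in g[a]):
                del g[a][b], g[b][a]
                changed = True
    # 2) Remove degree-1 vertices that are not path endpoints.
    for v in list(g):
        if v not in keep and len(g[v]) == 1:
            (u,) = g[v]
            del g[u][v], g[v]
            changed = True
    # 3) Contract degree-2 vertices, merging their two edges into one.
    for v in list(g):
        if v in g and v not in keep and len(g[v]) == 2:
            a, b = g[v]
            w = g[v][a] + g[v][b]
            if b not in g[a] or w < g[a][b]:  # keep the shorter of the two
                g[a][b] = g[b][a] = w
            del g[a][v], g[b][v], g[v]
            changed = True
    return changed

# Iterate to a fixed point:  while reduce_once(graph, {start, goal}): pass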
The pre-processing algorithm works great: in some cases I achieve a reduction of around 70% in graph size, and my path-finding algorithm finds a path of the same quality as on the un-processed graph, but an order of magnitude faster.
My problem now is in reconstructing the solution on the original graph, so-called "post-processing".
I feel like I should be keeping track of all the moves my pre-processing algorithm makes and then applying them in reverse order after it has finished; I am just not quite sure how to go about that.
Here is what I had in mind:
First, keep track of all the empty rows and columns I removed from the matrix after pre-processing and re-insert them.
Then have a simple vector whose index represents the move number and whose value represents the move type.
Then have one cell array for each of the 3 move types, containing the data from each move in the order it was performed, each with its own iteration counter.
Then, if I iterate backwards over the move list, it tells me which cell array to access, and I can apply the reverse of the operation that is next on that list (kind of like a stack data structure).
This seems a bit unwieldy to me, so I was wondering whether anyone has ideas for a good method of keeping track of my moves that is easily reversible?
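For concreteness, here is roughly the record-and-undo flow I have in mind, collapsed into a single tagged stack (sketched in Python for brevity; all names and values are illustrative):

moves = []  # one stack of tagged moves, pushed in the order they happen

def record(kind, **data):
    moves.append((kind, data))

# During pre-processing, each rule records enough data to invert itself, e.g.:
record('remove_long_edge', edge=(1, 5), weight=9.0)
record('remove_deg1_vertex', vertex=7, edge=(7, 2), weight=3.5)
record('contract_deg2_vertex', vertex=4,
       old_edges=[((3, 4), 2.0), ((4, 6), 1.5)], new_edge=((3, 6), 3.5))

# Post-processing: pop in reverse order and apply each inverse operation.
for kind, data in reversed(moves):
    if kind == 'remove_long_edge':
        print('re-insert edge', data['edge'], 'with weight', data['weight'])
    elif kind == 'remove_deg1_vertex':
        print('re-insert vertex', data['vertex'], 'and edge', data['edge'])
    elif kind == 'contract_deg2_vertex':
        print('replace', data['new_edge'], 'by', data['old_edges'])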
EDIT: I thought about posting this on the Computer Science Stack Exchange, but my question isn't really about the pre-processing methods themselves; it is about data storage and retrieval and the implementation itself. Feel free to migrate it if you think it would be better suited elsewhere.

Related

scipy.interpolate.griddata slow due to unnecessary data

I have a map with a 600*600 equidistant x,y grid with associated scalar values.
I have around 1000 x,y coordinates at which I would like to get the bi-linearly interpolated map values. Those are randomly placed in an inner, central area of the map of around 400*400 size.
I decided to go with the griddata function with method linear. My understanding is that with linear interpolation I only need the three nearest grid positions around each coordinate to get well-defined interpolated values, so I would require around 3000 of the map's data points to perform the interpolation. The full 360k data points are unnecessary for this task.
Naively throwing in the complete map results in an execution time of half a minute. Since it's easy to narrow the map down to the area of interest, I could reduce the execution time to roughly 20% of that.
I am now wondering whether I overlooked something in my assumption that I only need the three nearest neighbours for my task, and if not, whether there is a fast way to filter those 3000 points out of the 360k. I assume that looping 3000 times over the 360k rows would take longer than just throwing in the inner map.
Edit: I also compared the results from the full 600*600 grid with those from the reduced data points. I am actually surprised and concerned to observe that the interpolation results differ significantly in places.
So I found out that RegularGridInterpolator is the way to go for me. It's fast, and I have a regular grid already.
I tried to sort out my findings on the differences in interpolation values and found that griddata shows unexpected behavior for me.
Check out the issue I created for details.
https://github.com/scipy/scipy/issues/17378
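For reference, the RegularGridInterpolator route is only a few lines; a minimal sketch (grid extents and sizes here are made up to match the question):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Illustrative data: a 600x600 grid of scalar values on [0, 1] x [0, 1].
x = np.linspace(0.0, 1.0, 600)
y = np.linspace(0.0, 1.0, 600)
values = np.random.rand(600, 600)

interp = RegularGridInterpolator((x, y), values, method='linear')

# ~1000 query points in the inner area of the map, as in the question.
points = np.random.uniform(0.3, 0.7, size=(1000, 2))
z = interp(points)
print(z.shape)  # (1000,)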

plans and subset of rows in pyfftw

I want to use pyfftw to repeatedly compute the discrete Fourier transform of a subset of rows of a two-dimensional array. I do not know in advance which rows I need to transform; that depends on the output from the previous round. I do know that doing it for all rows is wasteful.
It is my understanding that a 'plan' in FFTW3 is associated with the type of transform (c2c, r2c, etc) and the input/output length, which is always a vector in the 1D case. In pyfftw it looks like a 'plan' is associated to the type of transform and the input/output shape, so my interpretation is that it uses the same FFTW3 plan for every row.
My question is: is it possible to use the same FFTW3 plan for some of the rows, without creating separate pyfftw.FFTW objects for all possible combinations of rows?
On a different note, I am wondering how pyfftw uses multiple cores: does it use multiple cores for each row (this appears natural in view of FFTW3 documentation) or does it farm out different rows to different cores (which was my initial assumption)?
If you can create a numpy array from a view, you can plan for it with pyFFTW - all valid numpy arrays should work just fine.
This means several things:
Your array needs to have regular strides, but those strides can be arbitrary.
ND arrays are planned as ND transforms, with the selected axes being used.
You can probably do something cunning with stride tricks and it will probably work (but might not do what you expect if you do something too nefarious like overlapping rows and then use threads).
One solution that I've used quite a bit is to copy the rows that you want to transform into an interim array, and transforming that. You might well find that's the fastest option (particularly when you can allow for getting the byte offset correct).
Obviously, this doesn't work if you always have a different number of rows. You might still find that if you plan for the largest number of rows that ever get transformed and then copy in a subset, you still come out faster than otherwise.
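As a rough sketch of that interim-array approach (the sizes, dtype and thread count are illustrative assumptions, not anything mandated by pyfftw):

import numpy as np
import pyfftw

# Plan once for the largest number of rows ever transformed.
max_rows, n = 64, 1024
buf_in = pyfftw.empty_aligned((max_rows, n), dtype='complex128')
buf_out = pyfftw.empty_aligned((max_rows, n), dtype='complex128')
plan = pyfftw.FFTW(buf_in, buf_out, axes=(1,), threads=4)

def fft_selected_rows(data, rows):
    # Copy the rows of interest into the planned buffer and execute the plan;
    # unused buffer rows are zeroed and their transforms simply ignored.
    k = len(rows)
    buf_in[:k] = data[rows]
    buf_in[k:] = 0
    plan()
    return buf_out[:k].copy()

data = np.random.randn(200, n) + 1j * np.random.randn(200, n)
print(fft_selected_rows(data, [3, 17, 42]).shape)  # (3, 1024)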
The problem you're going to come up against, even if you go down to the C level, is that the planning overhead might well dominate if you're changing your transform sizes often.
You could also try pyfftw.interfaces.numpy_fft, which is normally faster than numpy and has the ability to cache repeated transform sizes.

Structural Sharing in Scala Vector

Structural sharing in Scala List is straightforward and easy to understand. But Scala Vector is a more complicated data structure than a list. How is structural sharing achieved in Scala Vector?
Vector is basically a tree (trie) with 32-way branching at each level. If you have a Vector of, say, 3000 elements and you want to index element 2069, for example, which is 100000010101 in binary, the index is decomposed into 5-bit blocks to use as indices into the tree: 10 (i.e. 2) in the first branch, then 00000 (i.e. 0) in the next, and finally 10101 (i.e. 21) in the terminal branch, and then there's the data.
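To make the decomposition concrete, here is a tiny sketch of the 5-bit block arithmetic (Python used here purely as neutral pseudocode):

def trie_path(index, depth=3):
    # Split an index into 5-bit blocks, most significant block first;
    # each block selects one child in a level of the 32-way trie.
    return [(index >> (5 * level)) & 0b11111 for level in reversed(range(depth))]

print(trie_path(2069))  # [2, 0, 21] -- the three branch indices above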
Given this structure, it's easy to see how to share things structurally: you can share any sub-trees that aren't changed. So if you make a new vector with a different element at index 2069, you don't have to change all 3000 elements; you recreate "only" three arrays of size 32: the terminal one is replaced by a copy with its element 21 updated; then its parent is replaced by a copy with this new child at index 0; then its parent is replaced by a copy with the correct subtree at index 2.
Now, this provides quite a lot of structural sharing as long as you have far more than 32 elements in your vector, but it's still a pretty big overhead. Because of this, additions to the end of the vector are special-cased so that you just add to the existing array. The old Vectors still point to that array, but they think the end is earlier (and that part is unchanged) so it works out okay.
There's a more complex but similar scheme to allow addition at the front of a vector in a similar fashion (basically, by leaving space at the front and keeping track of where to point via indices and offsets in addition to the indexing scheme).
The trick as implemented doesn't allow alternating additions to both the front and the back, though, so in that case you effectively rebuild the tree on every addition. Making a version with even better structural sharing would be possible, but it would probably be a bit slower to access.

MATLAB: Using CONVN for moving average on Matrix

I'm looking for a bit of guidance on using CONVN to calculate moving averages in one dimension on a 3d matrix. I'm getting a little caught up on the flipping of the kernel under the hood and am hoping someone might be able to clarify the behaviour for me.
A similar post that still has me a bit confused is here:
CONVN example about flipping
The Problem:
I have daily river flow and weather data for a watershed at different source locations.
The matrix is arranged as follows:
dim 1 (the rows) represents each site
dim 2 (the columns) represents the date
dim 3 (the pages) represents the different types of measurement (river height, flow, rainfall, etc.)
The goal is to use CONVN to take a 21-day moving average at each site, for each observation date, for each variable.
As I understand it, I should just be able to use a kernel such as:
ker = ones(1,21) / 21;
mat = randn(150,365*10,4);
avgmat = convn(mat,ker,'valid');
I tried playing around and created another kernel which should also work (I think) and set ker2 as:
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat,ker2,'valid');
The question:
The results don't quite match and I'm wondering if I have the dimensions incorrect here for the kernel. Any guidance is greatly appreciated.
Judging from the context of your question, you have a 3D matrix and you want to find the moving average of each row independently over all 3D slices. Your first case should work. Note, however, that ker2 is 3-by-21 rather than 1-by-21, so with the valid flag it also trims two rows off dimension 1 (150 rows become 148); the extra rows of zeroes contribute nothing to the sums, but the output sizes and alignment no longer match the first case, which is why the results differ. More generally, the valid flag returns a matrix whose size is valid in terms of the boundaries of the convolution. Take a look at the first point of the post that you linked to for more details.
Specifically, each row's output will be 20 entries shorter than the input due to the valid flag. The convolution kernel only becomes completely contained inside a row once it covers entries 1 through 21, and it's from that point onward that you get valid results (no pun intended). If you'd like to see the entries at the boundaries, then you'll need to use the 'same' flag if you want to maintain the same size matrix as the input, or the 'full' flag (the default), which gives you an output sized from the most extreme outer edges; but bear in mind that there the moving average is computed over a bunch of padded zeroes, and so the boundary entries wouldn't be what you expect anyway.
However, if I'm interpreting your question correctly, the valid flag is what you want; just bear in mind that each row loses 20 entries to accommodate the edge cases. All in all, your code should work, but be careful in how you interpret the results.
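If it helps to see the size arithmetic concretely, here is the same rule expressed with SciPy (Python purely as an illustration; the modes correspond directly to convn's 'full', 'same' and 'valid' flags):

import numpy as np
from scipy.signal import convolve

n, klen = 50, 21
row = np.random.randn(n)
ker = np.ones(klen) / klen  # same moving-average kernel as in the question

full = convolve(row, ker, mode='full')    # length n + klen - 1 = 70
same = convolve(row, ker, mode='same')    # length n = 50, zero-padded edges
valid = convolve(row, ker, mode='valid')  # length n - klen + 1 = 30
print(len(full), len(same), len(valid))   # 70 50 30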
BTW, you have a symmetric kernel, and so flipping should have no effect on the convolution output. What you have specified is a standard moving averaging kernel, and so convolution should work in finding the moving average as you expect.
Good luck!

What is the best way to reduce the complexity of geometries?

So I'm playing around with the http://www.gadm.org/ dataset.
I want to go from lat & lon to a country and state (or equivalent).
To simplify the data I'm grouping it up and unioning the geometries; so far so good. The results are great: I can pull back Belgium and it is fine.
I pull back Australia and I get Victoria, because the thing is too damn large.
Now, I honestly don't care too much about the level of detail; if lines are within 1 km of where they should be, I'm OK (as long as shapes are bigger, not smaller).
What is the best approach to reduce the complexity of my geospatial objects so I end up with a quick and simple view of the world?
All data is kept as Geometry data.
As you've tagged the question with "tsql", I'm assuming you're using SQL Server. Thus you already have a handy method called Reduce, which you can apply to the geometry data type to simplify it (it implements the Douglas-Peucker simplification algorithm).
For example (copied from the above link):
DECLARE @g geometry;
SET @g = geometry::STGeomFromText('LINESTRING(0 0, 0 1, 1 0, 2 1, 3 0, 4 1)', 0);
SELECT @g.Reduce(.75).ToString();
The method takes a tolerance argument specifying the simplification threshold.
I suppose complexity is determined only by the number of vertices in a shape. There are quite a number of shape-simplification algorithms to choose from (and probably existing source code too).
As a simplistic approach, you can iterate over the vertices and reject concave ones if doing so does not introduce too large an error (e.g. in terms of added area), preferably merging smaller segments into larger ones. A more sophisticated approach might split an existing segment to better remove smaller ones. A rough sketch of the simple vertex-rejection idea follows.
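Here is that sketch, using added triangle area as the error measure (Python for illustration; note it drops any low-impact vertex, so enforcing the "shapes may only grow" constraint from the question would additionally need a concavity test, e.g. the sign of the cross product):

def simplify(points, max_area_error):
    # Greedily drop a vertex when the triangle formed with its two
    # neighbours is small, i.e. removing it barely changes the shape.
    def tri_area(a, b, c):
        return abs((b[0] - a[0]) * (c[1] - a[1])
                   - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

    pts = list(points)
    changed = True
    while changed:
        changed = False
        i = 1
        while i < len(pts) - 1:
            if tri_area(pts[i - 1], pts[i], pts[i + 1]) < max_area_error:
                del pts[i]  # low-impact vertex: reject it
                changed = True
            else:
                i += 1
    return pts

# Example: the nearly-straight middle vertex is collapsed away.
print(simplify([(0, 0), (1, 0.001), (2, 0)], max_area_error=0.01))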