I have read a number of Dask examples, both from people's GitHub code and from the Dask issue tracker, but I still have a problem using SciPy interpolation with Dask parallel computing, and I hope someone here can help me solve it.
Specifically, I don't know how to expand the boundary of each partition. Please see my description below, and let me know if anything is unclear.
My data is unstructured, so I cannot treat it as a regular array.
My interpolation code runs, but some strange points appear in the output. I suspect this is an edge effect. For example:
The left panel is the output from the "dask" linearNDInterpolator, the middle panel is the original dataset, and the right panel uses linearNDInterpolator directly.
The left panel is a scatter plot linearly interpolated from the original dataset, while the right-hand panel uses Dask linear interpolation via dask.dataframe [parallel].
You can clearly see that the parallel result has no clear shape, and there may be some strange points within the map.
Here is my code 01: Using dask.array.
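# Imports assumed from the names used below (not shown in the original post)
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from dask import delayed
from scipy.interpolate import LinearNDInterpolator as lNDI

def lNDIwrap(src_lon, src_lat, src_elv, tag_lon, tag_lat):
    return lNDI(list(zip(src_lon, src_lat)), src_elv)(tag_lon, tag_lat)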
n_splits=96
#--- topodf is the topography dataset [pandas].
#--- df is the python h3 generated hexagon grid by topodf in resolution-12
dsrc = dd.from_pandas(topodf, npartitions=n_splits)
dtag = dd.from_pandas(df, npartitions=n_splits)
#--- Output the chunking dask array
slon,slat,data = dsrc.to_dask_array(lengths=True).T
tlon,tlat = dtag[['lon','lat']].to_dask_array(lengths=True).T
#--- Using **dask delayed** function to pass each partition into functions [lNDIwrap]
gd_chunked_lNDI = [delayed(lNDIwrap)(x1, y1, newarr, xx, yy) for x1, y1, newarr, xx, yy in \
                   zip(slon.to_delayed().flatten(),
                       slat.to_delayed().flatten(),
                       data.to_delayed().flatten(),
                       tlon.to_delayed().flatten(),
                       tlat.to_delayed().flatten())]
#--- Using **dask delayed** function to concat all partitions into one np.array
gd_lNDI = delayed(da.concatenate)(gd_chunked_lNDI, axis=0)
results_lNDI = np.array(gd_lNDI.compute())
Here is my code 02: Using dask.dataframe.
def DDlNDIwrap(df, data_name='nir'):
    dtag = target_hexgrid(df, hex_res=12)
    slon, slat, data = df[['lon','lat',data_name]].values.T
    tlon, tlat = dtag[['lon','lat']].values.T
    tout = lNDI(list(zip(slon, slat)), data)(tlon, tlat)
    dout = pd.DataFrame(np.vstack([tlon,tlat,tout]).T,
                        columns=['lon','lat',data_name]
                        )
    return dd.from_pandas(dout, npartitions=1)
n_splits=96
#--- ds is the sentinel-2 satellite dataset. Reading from .til file.
gd_chunked_lNDI = [ delayed(DDlNDIwrap)(ds) for ds in dsrc.to_delayed()]
gd_lNDI = delayed(dd.concat)(gd_chunked_lNDI, axis=0)
gd = gd_lNDI.compute().compute()
I suspect that the unknown patterns come from an edge effect. What I mean is that those points may lie near the edge of each partition, where there are not enough data points for interpolation. From the Dask manual, it looks like I might be able to use map_overlap, map_partitions, or map_blocks to solve this, but I keep failing. Could someone help me solve this?
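To make the suspected edge effect concrete, here is a minimal sketch of interpolating per spatial tile while letting each tile borrow source points from a surrounding buffer. It is not Dask code and not the method from the post: the function names `interpolate_tile` and `tiled_interpolate`, the tile count, the buffer width, and the synthetic jittered-grid data are all assumptions made for illustration. It only uses NumPy and SciPy's `LinearNDInterpolator`.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator


def interpolate_tile(src_xy, src_z, tgt_xy, bounds, buffer):
    """Interpolate the targets inside `bounds`, using every source point
    that lies inside `bounds` expanded by `buffer` on all sides."""
    xmin, xmax, ymin, ymax = bounds
    in_tile = ((tgt_xy[:, 0] >= xmin) & (tgt_xy[:, 0] < xmax) &
               (tgt_xy[:, 1] >= ymin) & (tgt_xy[:, 1] < ymax))
    near = ((src_xy[:, 0] >= xmin - buffer) & (src_xy[:, 0] <= xmax + buffer) &
            (src_xy[:, 1] >= ymin - buffer) & (src_xy[:, 1] <= ymax + buffer))
    idx = np.flatnonzero(in_tile)
    if idx.size == 0 or near.sum() < 3:
        # No targets in this tile, or fewer than three sources to triangulate.
        return idx, np.full(idx.size, np.nan)
    interp = LinearNDInterpolator(src_xy[near], src_z[near])
    return idx, interp(tgt_xy[idx])


def tiled_interpolate(src_xy, src_z, tgt_xy, n_tiles=4, buffer=0.1):
    """Split the target extent into n_tiles x n_tiles tiles and
    interpolate each tile independently (hypothetical helper)."""
    lo, hi = tgt_xy.min(axis=0), tgt_xy.max(axis=0)
    x_edges = np.linspace(lo[0], hi[0], n_tiles + 1)
    y_edges = np.linspace(lo[1], hi[1], n_tiles + 1)
    # Nudge the last edges outward so the maximum coordinates fall in a tile.
    x_edges[-1] = np.nextafter(x_edges[-1], np.inf)
    y_edges[-1] = np.nextafter(y_edges[-1], np.inf)
    out = np.full(len(tgt_xy), np.nan)
    for i in range(n_tiles):
        for j in range(n_tiles):
            bounds = (x_edges[i], x_edges[i + 1], y_edges[j], y_edges[j + 1])
            idx, values = interpolate_tile(src_xy, src_z, tgt_xy, bounds, buffer)
            out[idx] = values
    return out


# Synthetic, assumed data: a jittered grid of sources with a linear field.
rng = np.random.default_rng(0)
g = np.linspace(0.0, 1.0, 41)
gx, gy = np.meshgrid(g, g)
src_xy = np.column_stack([gx.ravel(), gy.ravel()])
src_xy += rng.uniform(-0.005, 0.005, size=src_xy.shape)
src_z = 2.0 * src_xy[:, 0] + 3.0 * src_xy[:, 1]
tgt_xy = rng.uniform(0.05, 0.95, size=(500, 2))

tiled = tiled_interpolate(src_xy, src_z, tgt_xy, n_tiles=4, buffer=0.1)
direct = LinearNDInterpolator(src_xy, src_z)(tgt_xy)
```

The point of the sketch is that partitions are defined by location rather than by row order, and that each tile sees sources beyond its own boundary. Each `interpolate_tile` call is independent of the others, so the calls could presumably be wrapped in `dask.delayed` and computed in parallel.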
PS:
The following is what I tried using the map_overlap function.
def maplNDIwrap(df, data_name='nir'):
    dtag = target_hexgrid(df, hex_res=12)
    slon, slat, data = df[['lon','lat',data_name]].values.T
    tlon, tlat = dtag[['lon','lat']].values.T
    tout = lNDI(list(zip(slon, slat)), data)(tlon, tlat)
    print(len(tlon), len(tout))
    dout = pd.DataFrame(np.vstack([tlon,tlat,tout]).T,
                        columns=['lon','lat',data_name]
                        )
    return dout
dtag = target_hexgrid(dsrc.compute(), hex_res=12)
gd_map_lNDI = dsrc.map_overlap(maplNDIwrap,1,1, meta=type(dsrc))
print(len(dtag)) #--> output: 353678
gd_map_lNDI.compute() #--> Not match expected size
Updates:
Here I define a few functions to generate a synthetic dataset.
def lonlat(lon_min,lon_max, lat_min, lat_max, res=5):
    xps = round((lon_max - lon_min)*110*1e3/res)
    yps = round((lat_max - lat_min)*110*1e3/res)
    return np.meshgrid(np.linspace(lon_min,lon_max,xps),
                       np.linspace(lat_min,lat_max,yps)
                       )

def xy_based_map(x,y):
    x = np.pi*x/180
    y = np.pi*y/180
    return np.log10((1 - x / 3. + x ** 2 + (2*x*y) ** 3) * np.exp(-x ** 2 - y ** 2))
Using a method similar to the one above, I get the result below.
You can clearly see lines in the interpolation output.
lon_min, lon_max, lat_min, lat_max = -70.42,-70.40,-30.42,-30.40
lon05, lat05 = lonlat(lon_min, lon_max, lat_min, lat_max,res=5)
z05 = xy_based_map(lon05,lat05)
df05= pd.DataFrame(np.vstack((lon05.ravel(), lat05.ravel(), z05.ravel())).T, columns=['lon','lat','z'])
df05= dd.from_pandas(df05, npartitions=n_splits)
lon30, lat30 = lonlat(lon_min, lon_max, lat_min, lat_max,res=30)
z30 = xy_based_map(lon30,lat30)
df30= pd.DataFrame(np.vstack((lon30.ravel(), lat30.ravel(), z30.ravel())).T, columns=['lon','lat','z'])
df30= dd.from_pandas(df30, npartitions=n_splits)
tlon, tlat = df05[['lon','lat']].values.T
slon, slat, data = df10.values.T
gd_chunked_lNDI = [delayed(lNDIwrap)(x1, y1, newarr, xx, yy) for x1, y1, newarr, xx, yy in \
                   zip(slon.to_delayed().flatten(),
                       slat.to_delayed().flatten(),
                       data.to_delayed().flatten(),
                       tlon.to_delayed().flatten(),
                       tlat.to_delayed().flatten())]
gd_chunked_cNDI = [delayed(cNDIwrap)(x1, y1, newarr, xx, yy) for x1, y1, newarr, xx, yy in \
                   zip(slon.to_delayed().flatten(),
                       slat.to_delayed().flatten(),
                       data.to_delayed().flatten(),
                       tlon.to_delayed().flatten(),
                       tlat.to_delayed().flatten())]
gd_chunked_rNDI = [delayed(rNDIwrap)(x1, y1, newarr, xx, yy) for x1, y1, newarr, xx, yy in \
                   zip(slon.to_delayed().flatten(),
                       slat.to_delayed().flatten(),
                       data.to_delayed().flatten(),
                       tlon.to_delayed().flatten(),
                       tlat.to_delayed().flatten())]
gd_lNDI = delayed(da.concatenate)(gd_chunked_lNDI, axis=0)
gd_cNDI = delayed(da.concatenate)(gd_chunked_cNDI, axis=0)
gd_rNDI = delayed(da.concatenate)(gd_chunked_rNDI, axis=0)
results_lNDI_10m = np.array(gd_lNDI.compute())
results_cNDI_10m = np.array(gd_cNDI.compute())
results_rNDI_10m = np.array(gd_rNDI.compute())
#--- No parallel computing
a,b,c,d,e = slon.compute(),slat.compute(),data.compute(),tlon.compute(),tlat.compute()
straight_lNDI_10m = lNDIwrap(a,b,c,d,e)
###--- 30m --> 5m
tlon, tlat = df05[['lon','lat']].values.T
slon, slat, data = df30.values.T
gd_chunked_lNDI = [delayed(lNDIwrap)(x1, y1, newarr, xx, yy) for x1, y1, newarr, xx, yy in \
                   zip(slon.to_delayed().flatten(),
                       slat.to_delayed().flatten(),
                       data.to_delayed().flatten(),
                       tlon.to_delayed().flatten(),
                       tlat.to_delayed().flatten())]
gd_chunked_cNDI = [delayed(cNDIwrap)(x1, y1, newarr, xx, yy) for x1, y1, newarr, xx, yy in \
                   zip(slon.to_delayed().flatten(),
                       slat.to_delayed().flatten(),
                       data.to_delayed().flatten(),
                       tlon.to_delayed().flatten(),
                       tlat.to_delayed().flatten())]
gd_chunked_rNDI = [delayed(rNDIwrap)(x1, y1, newarr, xx, yy) for x1, y1, newarr, xx, yy in \
                   zip(slon.to_delayed().flatten(),
                       slat.to_delayed().flatten(),
                       data.to_delayed().flatten(),
                       tlon.to_delayed().flatten(),
                       tlat.to_delayed().flatten())]
gd_lNDI = delayed(da.concatenate)(gd_chunked_lNDI, axis=0)
gd_cNDI = delayed(da.concatenate)(gd_chunked_cNDI, axis=0)
gd_rNDI = delayed(da.concatenate)(gd_chunked_rNDI, axis=0)
results_lNDI_30m = np.array(gd_lNDI.compute())
results_cNDI_30m = np.array(gd_cNDI.compute())
results_rNDI_30m = np.array(gd_rNDI.compute())
###--- No Parallel for 30m --> 5m
a,b,c,d,e = slon.compute(),slat.compute(),data.compute(),tlon.compute(),tlat.compute()
straight_lNDI_30m = lNDIwrap(a,b,c,d,e)
###--- For plots.
dout= pd.DataFrame(np.vstack((tlon.compute(),tlat.compute(),df05.z.values.compute(),
                              results_lNDI_10m,results_cNDI_10m, results_rNDI_10m,straight_lNDI_10m,
                              results_lNDI_30m,results_cNDI_30m, results_rNDI_30m,straight_lNDI_30m,
                              )).T, columns=['lon','lat','orig','lNDI10','cNDI10','rNDI10','stgh10','lNDI30','cNDI30','rNDI30','stgh30'])
I have a list of vectors, like this:
{x = 7, y = 0.}, {x = 2.5, y = 0.}, {x = -2.3, y = 0.}, {x = 2.5, y = 2.7}, {x = 2.5, y = -2.7}
How do I convert these to data I can plot? I've been trying with the "convert" function, but can't get it to work.
When I manually convert it to something like [[7, 0], [2.5, 0], [-2.3, 0], [2.5, 2.7], [2.5, -2.7]] it works, though there has to be an automatic way, right?
A little more info about what I'm doing if you're interested:
I have a function U(x,y), of which I calculate the gradient and then check where it becomes 0, like this:
solve(convert(Gradient(U(x, y), [x, y]), set), {x, y});
that gives me my list of points. Now I would like to plot these points on a graph.
Thanks!
S:={x = 7, y = 0.}, {x = 2.5, y = 0.}, {x = -2.3, y = 0.},
{x = 2.5, y = 2.7}, {x = 2.5, y = -2.7}:
T:=map2(eval,[x,y],[S]);
[[7, 0.], [2.5, 0.], [-2.3, 0.], [2.5, 2.7], [2.5, -2.7]]
I have large but sparse arrays and I want to rearrange them by swapping rows and columns. What is a good way to do this in scipy.sparse?
Some issues
I don't think permutation matrices are well suited to this task, since they more or less randomly change the sparsity structure. Also, such a manipulation always 'multiplies' all columns or rows, even if only a few swaps are necessary.
What is the best sparse matrix representation in scipy.sparse for this task?
Suggestions for implementation are very welcome.
I have tagged this with Matlab as well, since this question might find an answer that is not necessarily scipy specific.
CSC format keeps a list of the row indices of all non-zero entries, CSR format keeps a list of the column indices of all non-zero entries. I think you can take advantage of that to swap things around as follows, and I think there shouldn't be any side-effects to it:
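import numpy as np
import scipy.sparse

def swap_rows(mat, a, b):
    # CSC stores the row index of every non-zero entry in .indices.
    # copy=True so an input that is already CSC is not modified in place.
    mat_csc = scipy.sparse.csc_matrix(mat, copy=True)
    a_idx = np.where(mat_csc.indices == a)
    b_idx = np.where(mat_csc.indices == b)
    mat_csc.indices[a_idx] = b
    mat_csc.indices[b_idx] = a
    return mat_csc.asformat(mat.format)

def swap_cols(mat, a, b):
    # CSR stores the column index of every non-zero entry in .indices.
    # copy=True so an input that is already CSR is not modified in place.
    mat_csr = scipy.sparse.csr_matrix(mat, copy=True)
    a_idx = np.where(mat_csr.indices == a)
    b_idx = np.where(mat_csr.indices == b)
    mat_csr.indices[a_idx] = b
    mat_csr.indices[b_idx] = a
    return mat_csr.asformat(mat.format)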
You could now do something like this:
>>> mat = np.zeros((5,5))
>>> mat[[1, 2, 3, 3], [0, 2, 2, 4]] = 1
>>> mat = scipy.sparse.lil_matrix(mat)
>>> mat.todense()
matrix([[ 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 0., 1., 0., 1.],
[ 0., 0., 0., 0., 0.]])
>>> swap_rows(mat, 1, 3)
<5x5 sparse matrix of type '<type 'numpy.float64'>'
with 4 stored elements in LInked List format>
>>> swap_rows(mat, 1, 3).todense()
matrix([[ 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 1.],
[ 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
>>> swap_cols(mat, 0, 4)
<5x5 sparse matrix of type '<type 'numpy.float64'>'
with 4 stored elements in LInked List format>
>>> swap_cols(mat, 0, 4).todense()
matrix([[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 1., 0., 0.],
[ 1., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 0.]])
I have used a LIL matrix to show how you could preserve the type of your output. In your application you probably want to already be in CSC or CSR format, and select whether to swap rows or columns first based on it, to minimize conversions.
I've found using matrix operations to be the most efficient. Here's a function which will permute the rows and/or columns to a specified order. It can be modified to swap two specific rows/columns if you would like.
from scipy import sparse
def permute_sparse_matrix(M, row_order=None, col_order=None):
    """
    Reorders the rows and/or columns in a scipy sparse matrix to the specified order.
    Row k of M ends up at row row_order[k] of the result (likewise for columns).
    """
    if row_order is None and col_order is None:
        return M

    new_M = M
    if row_order is not None:
        I = sparse.eye(M.shape[0]).tocoo()
        I.row = I.row[row_order]
        new_M = I.dot(new_M)
    if col_order is not None:
        I = sparse.eye(M.shape[1]).tocoo()
        I.col = I.col[col_order]
        new_M = new_M.dot(I)
    return new_M
In Matlab you can just index the columns and rows the way you like:
Matrix = speye(10);
mycolumnorder = [1 2 3 4 5 6 10 9 8 7];
myroworder = [4 3 2 1 5 6 7 8 9 10];
Myorderedmatrix = Matrix(myroworder,mycolumnorder);
I think this preserves sparsity... Don't know about scipy though...
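For comparison, the MATLAB-style indexing above can be sketched in SciPy as well. This is only an illustration under assumptions: the helper name `reorder_by_index` and the small test matrix are made up here, and the input is converted to CSR first because that format supports indexing with index arrays.

```python
import numpy as np
from scipy import sparse


def reorder_by_index(M, row_order=None, col_order=None):
    """Reorder rows/columns by fancy indexing, like Matrix(rows, cols) in MATLAB.

    Row k of the result is row row_order[k] of M (hypothetical helper).
    """
    M = sparse.csr_matrix(M)
    if row_order is not None:
        M = M[np.asarray(row_order), :]
    if col_order is not None:
        M = M[:, np.asarray(col_order)]
    return M


# Small example matrix with a few zeros (values chosen for illustration).
dense = np.arange(12, dtype=float).reshape(3, 4)
dense[dense % 3 == 0] = 0.0
M = sparse.csr_matrix(dense)

row_order = [2, 0, 1]
col_order = [3, 1, 0, 2]
reordered = reorder_by_index(M, row_order, col_order)
```

The result stays sparse and keeps the same number of stored entries. Note that indexing pulls row `row_order[k]` into position `k`, which is the opposite direction from moving row `k` to position `row_order[k]`.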
I need to emulate the MATLAB function find, which returns the linear indices for the nonzero elements of an array. For example:
>> a = zeros(4,4)
a =
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
>> a(1,1) = 1
>> a(4,4) = 1
>> find(a)
ans =
1
16
NumPy has a similar function, nonzero, but it returns a tuple of index arrays. For example:
In [1]: from numpy import *
In [2]: a = zeros((4,4))
In [3]: a[0,0] = 1
In [4]: a[3,3] = 1
In [5]: a
Out[5]:
array([[ 1., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 1.]])
In [6]: nonzero(a)
Out[6]: (array([0, 3]), array([0, 3]))
Is there a function that gives me the linear indices without calculating them myself?
numpy has you covered:
>>> np.flatnonzero(a)
array([ 0, 15])
Internally it's doing exactly what Sven Marnach suggested.
>>> print inspect.getsource(np.flatnonzero)
def flatnonzero(a):
    """
    Return indices that are non-zero in the flattened version of a.
    This is equivalent to a.ravel().nonzero()[0].
    [more documentation]
    """
    return a.ravel().nonzero()[0]
The easiest solution is to flatten the array before calling nonzero():
>>> a.ravel().nonzero()
(array([ 0, 15]),)
If you have matplotlib installed, find is probably already available in the matplotlib.mlab module, along with some other functions intended for compatibility with MATLAB. And yes, it is implemented the same way as flatnonzero.
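One caveat when emulating find this way: MATLAB numbers elements from 1 and in column-major order, while flatnonzero counts from 0 in row-major order. The two only line up (apart from the offset) for entries on the diagonal, as in the example above. A minimal sketch of a closer emulation, where the helper name `matlab_find` is made up for illustration:

```python
import numpy as np


def matlab_find(a):
    """1-based, column-major linear indices of the non-zero entries."""
    return np.flatnonzero(np.ravel(a, order="F")) + 1


a = np.zeros((4, 4))
a[0, 0] = 1
a[3, 3] = 1
a[0, 1] = 1  # row 1, column 2 in MATLAB terms

row_major = np.flatnonzero(a)  # array([ 0,  1, 15])
col_major = matlab_find(a)     # array([ 1,  5, 16])
```

The off-diagonal entry is what separates the two conventions: it is index 1 in NumPy's row-major numbering but index 5 in MATLAB's column-major numbering.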