Pairwise distance between objects (Xarray) - scipy

I have 3 cars travelling in space (x,y) at 10 time steps.
For each time step I want to calculate the pairwise Euclidean distance between cars.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
import xarray as xr

data = np.random.rand(3, 2, 10)
times = pd.date_range('2000-01-01', periods=10)
space = ['x', 'y']
cars = ['a', 'b', 'c']
foo = xr.DataArray(data, coords=[cars, space, times], dims=['cars', 'space', 'time'])
The for-loop below works fine: each group is a 3x2 array, and pdist happily calculates a condensed distance matrix of all the pairwise distances between cars.
for label, group in foo.groupby('time'):
    print(group.shape, type(group), pdist(group))
(3, 2) <class 'xarray.core.dataarray.DataArray'> [0.45389929 0.96104589 0.51489773]
(3, 2) <class 'xarray.core.dataarray.DataArray'> [0.87532985 0.49758256 0.4418555 ]
(3, 2) <class 'xarray.core.dataarray.DataArray'> [0.44036486 0.17947479 0.39842543]
(3, 2) <class 'xarray.core.dataarray.DataArray'> [0.52294711 0.26278261 0.78106623]
(3, 2) <class 'xarray.core.dataarray.DataArray'> [0.30004324 0.62807379 0.40601505]
(3, 2) <class 'xarray.core.dataarray.DataArray'> [0.48351623 0.38331324 0.30677522]
(3, 2) <class 'xarray.core.dataarray.DataArray'> [0.83682031 0.38409803 0.455275 ]
(3, 2) <class 'xarray.core.dataarray.DataArray'> [0.33614753 0.50814237 0.49033016]
(3, 2) <class 'xarray.core.dataarray.DataArray'> [0.17365559 0.33567641 0.30382769]
(3, 2) <class 'xarray.core.dataarray.DataArray'> [0.76981095 0.18099241 0.91187884]
but this simple call (which, as I understand it, should perform the identical operation) fails:
foo.groupby('time').apply(pdist)
AttributeError: 'numpy.ndarray' object has no attribute 'dims'
It seems to be having trouble with the return shape. Do I need apply_ufunc here?
BTW, all these calls work fine and return as expected with a variety of shapes:
foo.groupby('time').apply(np.mean)
foo.groupby('time').apply(np.mean,axis=0)
foo.groupby('time').apply(np.mean,axis=1)
thanks in advance for any pointers...

pdist changes the array size, so xarray cannot map the result back onto the original coordinates.
How about the following?
In [12]: np.sqrt(((foo - foo.rename(cars='cars1'))**2).sum('space'))
Out[12]:
<xarray.DataArray (cars: 3, time: 10, cars1: 3)>
array([[[0. , 0.131342, 0.352521],
[0. , 0.329914, 0.859899],
[0. , 0.933117, 0.351842],
[0. , 0.802514, 0.426005],
[0. , 0.167081, 0.563704],
[0. , 0.9822 , 0.145496],
[0. , 0.894892, 0.457217],
[0. , 0.333222, 0.505805],
[0. , 0.377352, 0.604625],
[0. , 0.467771, 0.62544 ]],
[[0.131342, 0. , 0.243476],
[0.329914, 0. , 0.813076],
[0.933117, 0. , 0.847525],
[0.802514, 0. , 0.390665],
[0.167081, 0. , 0.562188],
[0.9822 , 0. , 0.957067],
[0.894892, 0. , 0.525863],
[0.333222, 0. , 0.835241],
[0.377352, 0. , 0.894856],
[0.467771, 0. , 0.594124]],
[[0.352521, 0.243476, 0. ],
[0.859899, 0.813076, 0. ],
[0.351842, 0.847525, 0. ],
[0.426005, 0.390665, 0. ],
[0.563704, 0.562188, 0. ],
[0.145496, 0.957067, 0. ],
[0.457217, 0.525863, 0. ],
[0.505805, 0.835241, 0. ],
[0.604625, 0.894856, 0. ],
[0.62544 , 0.594124, 0. ]]])
Coordinates:
* cars (cars) <U1 'a' 'b' 'c'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-10
* cars1 (cars1) <U1 'a' 'b' 'c'
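If you assign this result to a variable, say dist (a name assumed here for illustration), the time series of distances between any two cars falls out with a plain .sel:

dist = np.sqrt(((foo - foo.rename(cars='cars1'))**2).sum('space'))
# distance between cars 'a' and 'b' at each time step
dist.sel(cars='a', cars1='b')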
If you'd like an output similar to pdist's, apply_ufunc can be used:
In [21]: xr.apply_ufunc(pdist, foo, input_core_dims=[['cars', 'space']],
    ...:                output_core_dims=[['cars_pair']], vectorize=True)
Out[21]:
<xarray.DataArray (time: 10, cars_pair: 3)>
array([[0.131342, 0.352521, 0.243476],
[0.329914, 0.859899, 0.813076],
[0.933117, 0.351842, 0.847525],
[0.802514, 0.426005, 0.390665],
[0.167081, 0.563704, 0.562188],
[0.9822 , 0.145496, 0.957067],
[0.894892, 0.457217, 0.525863],
[0.333222, 0.505805, 0.835241],
[0.377352, 0.604625, 0.894856],
[0.467771, 0.62544 , 0.594124]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-10
Dimensions without coordinates: cars_pair
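If you also want to know which pair of cars each cars_pair entry refers to: pdist emits pairs in itertools.combinations order, so (a sketch, with the result bound to an assumed name result) you can attach pair labels as a coordinate:

from itertools import combinations

result = xr.apply_ufunc(pdist, foo, input_core_dims=[['cars', 'space']],
                        output_core_dims=[['cars_pair']], vectorize=True)
# pdist's condensed output follows combinations order: ('a','b'), ('a','c'), ('b','c')
pair_labels = ['-'.join(p) for p in combinations(foo.cars.values, 2)]
result = result.assign_coords(cars_pair=pair_labels)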

Related

TF Keras code adaptation from python2.7 to python3

I am working to adapt Python 2.7 code that uses Keras and TensorFlow to implement a CNN, but it looks like the Keras API has changed a bit since the original code was written. I keep getting an error about "Negative dimension after subtraction" and I cannot find out what is causing it.
Unfortunately I am not able to provide an executable piece of code, because I was not able to make the original code work, but the repository containing all the source files can be found here.
The piece of code:
from keras.callbacks import EarlyStopping
from keras.layers.containers import Sequential
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.core import Reshape, Flatten, Dropout, Dense
from keras.layers.embeddings import Embedding
from keras.models import Graph
from keras.preprocessing import sequence

filter_lengths = [3, 4, 5]
self.model = Graph()

'''Embedding Layer'''
self.model.add_input(name='input', input_shape=(max_len,), dtype=int)
self.model.add_node(Embedding(
    max_features, emb_dim, input_length=max_len),
    name='sentence_embeddings', input='input')

'''Convolution Layer & Max Pooling Layer'''
for i in filter_lengths:
    model_internal = Sequential()
    model_internal.add(
        Reshape(dims=(1, self.max_len, emb_dim), input_shape=(self.max_len, emb_dim))
    )
    model_internal.add(Convolution2D(
        nb_filters, i, emb_dim, activation="relu"))
    model_internal.add(
        MaxPooling2D(pool_size=(self.max_len - i + 1, 1))
    )
    model_internal.add(Flatten())
    self.model.add_node(model_internal, name='unit_' + str(i), input='sentence_embeddings')
What I have tried:
m = tf.keras.Sequential()
m.add(tf.keras.Input(shape=(max_len,), name="input"))
m.add(tf.keras.layers.Embedding(max_features, emb_dim, input_length=max_len))

filter_lengths = [3, 4, 5]
for i in filter_lengths:
    model_internal = tf.keras.Sequential(name=f'unit_{i}')
    model_internal.add(
        tf.keras.layers.Reshape((1, max_len, emb_dim), input_shape=(max_len, emb_dim))
    )
    model_internal.add(
        tf.keras.layers.Convolution2D(100, i, emb_dim, activation="relu")
    )
    model_internal.add(
        tf.keras.layers.MaxPooling2D(pool_size=(max_len - i + 1, 1))
    )
    model_internal.add(
        tf.keras.layers.Flatten()
    )
    m.add(model_internal)
I do not expect a complete solution; what I am really trying to understand is the cause of the following error:
Negative dimension size caused by subtracting 3 from 1 for '{{node conv2d_5/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], explicit_paddings=[], padding="VALID", strides=[1, 200, 200, 1], use_cudnn_on_gpu=true](Placeholder, conv2d_5/Conv2D/ReadVariableOp)' with input shapes: [?,1,300,200], [3,3,200,100].
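One observation, offered as a hypothesis rather than a confirmed fix: two things changed between these APIs. First, the old Convolution2D(nb_filter, nb_row, nb_col) took the kernel height and width as separate positional arguments, while tf.keras.layers.Conv2D takes (filters, kernel_size, strides, ...), so passing emb_dim as the third positional argument sets the stride, which is visible in the trace as strides=[1, 200, 200, 1]. Second, the (1, max_len, emb_dim) reshape is channels-first, but tf.keras defaults to channels-last, so the input is read as a 1-by-300 image with 200 channels, and a 3x3 kernel cannot fit a height of 1, hence "subtracting 3 from 1". A sketch of the adjusted block under those assumptions:

model_internal.add(
    # channels-last: height = max_len, width = emb_dim, one channel
    tf.keras.layers.Reshape((max_len, emb_dim, 1), input_shape=(max_len, emb_dim))
)
model_internal.add(
    # the kernel spans i tokens by the full embedding width; stride stays at 1
    tf.keras.layers.Conv2D(100, kernel_size=(i, emb_dim), activation="relu")
)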

Octave can not read .mat file saved by Matlab

I have a .mat file saved previously with Matlab, and the header says it is v5. When Octave opens it with load(), it complains:
warning: load: can not read non-ASCII portions of UTF characters; replacing unreadable characters with '?'
and the whole data structure is gone. Dumping the variable shows only:
scalar structure containing the fields:
whereas when loading with Matlab, the structure is clearly shown.
Opening it with SciPy, it looks like this:
{'__header__': b'MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Sat Aug 4 15:11:49 2018',
'__version__': '1.0',
'__globals__': [],
'doc': <10052x12337 sparse matrix of type '<class 'numpy.float64'>'
with 139589 stored elements in Compressed Sparse Column format>,
'embeddings': array([[ 0.25195 , -1.1312 , -0.016156, ..., -0.024497, -0.4867 ,
-0.42997 ],
[-0.17686 , -0.60787 , 0.29096 , ..., 0.13535 , 0.067657,
0.073915],
[ 0.42054 , 0.39829 , 0.65161 , ..., 0.19725 , 0.58798 ,
-0.04068 ],
...,
[-0.62199 , 0.74258 , -1.0865 , ..., 0.13148 , -1.2473 ,
0.34381 ],
[-0.23951 , 0.15795 , -0.22288 , ..., 0.50322 , -0.27619 ,
0.2259 ],
[-0.21121 , -0.9675 , -0.85478 , ..., -0.59731 , -0.048073,
-0.63362 ]]),
'label_names': array([[array(['business'], dtype='<U8'),
array(['computers'], dtype='<U9'),
array(['culture-arts-entertainment'], dtype='<U26'),
array(['education-science'], dtype='<U17'),
array(['engineering'], dtype='<U11'),
array(['health'], dtype='<U6'),
array(['politics-society'], dtype='<U16'),
array(['sports'], dtype='<U6')]], dtype=object),
'labels': array([[1],
[1],
[1],
...,
[8],
[8],
[8]], dtype=uint8),
'test_idx': array([[ 9868, 9869, 9870, ..., 12335, 12336, 12337]], dtype=uint16),
'train_idx': array([[ 1, 2, 3, ..., 9865, 9866, 9867]], dtype=uint16),
'vocabulary': array([[array(['manufacture'], dtype='<U11'),
array(['manufacturer'], dtype='<U12'),
array(['directory'], dtype='<U9'), ...,
array(['tufts'], dtype='<U5'), array(['reebok'], dtype='<U6'),
array(['chewing'], dtype='<U7')]], dtype=object)}
I tried rewriting the .mat file using different version numbers and the -nocompress option, but none of it worked.
How can I resave this data structure using Matlab so that Octave can open it without loss of information?
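Not part of the original question, but since SciPy can already read the file, one possible workaround (a sketch; file names assumed) is to round-trip the data through scipy.io and let SciPy write a fresh v5 file that Octave may accept:

from scipy.io import loadmat, savemat

data = loadmat('data.mat')
# loadmat adds metadata keys such as '__header__'; savemat only wants variables
clean = {k: v for k, v in data.items() if not k.startswith('__')}
savemat('data_for_octave.mat', clean)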

Networkx - Get edge properties in Dijkstra path

I have the following graph
import networkx as nx
g = nx.MultiGraph()
#link 0
g.add_edge("A","B",cost=20,index=0)
#link 1
g.add_edge("A","C",cost=20,index=1)
#link 2
g.add_edge("B","C",cost=10,index=2)
#link 3
g.add_edge("B","D",cost=150,index=3)
#link 4
g.add_edge("C","D",cost=150,index=5)
g.add_edge("C","D",cost=200,index=6)
I'm trying to find the shortest path between A and D, and that works:
path=nx.dijkstra_path(g,"A","D",weight='cost')
->['A', 'C', 'D']
What I need is to get the edge info (more specifically the index) along this path.
Tried so far:
edgesinpath = list(zip(path, path[1:]))  # list, so it can be iterated again below
for u, v in edgesinpath:
    print(u, v, g[u][v])
but of course this outputs all the edges that match the (u, v) pairs in the path:
A C {0: {'index': 1, 'cost': 20}}
C D {0: {'index': 5, 'cost': 150}, 1: {'index': 6, 'cost': 200}}
Any idea how to get the correct information? Is this available via networkx?
Thanks.
One possible solution:
edges_ids = []
for u, v in edgesinpath:
    edge = sorted(g[u][v], key=lambda x: g[u][v][x]['cost'])[0]
    edges_ids.append(g[u][v][edge]['index'])
This chooses, for each multi-edge along your shortest path, the parallel edge with minimal cost.
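For the graph above this yields edges_ids == [1, 5]: the A-C edge (index 1) and the cheaper of the two C-D edges (index 5). That is consistent with dijkstra_path, which on a MultiGraph weighs each node pair by its cheapest parallel edge:

print(edges_ids)
# -> [1, 5]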

HDF5 data or label incorrect [duplicate]

I have the training and label data in data.mat (200 training samples with 6000 features each; the labels are -1/+1).
I am trying to convert my data to HDF5 and run Caffe using:
load data.mat
hdf5write('my_data.h5', '/new_train_x', single( reshape(new_train_x,[200, 6000, 1, 1]) ) );
hdf5write('my_data.h5', '/label_train', single( reshape(label_train,[200, 1, 1, 1]) ), 'WriteMode', 'append' );
And my layer.prototxt (just data layer) is:
layer {
  type: "HDF5Data"
  name: "data"
  top: "new_train_x"   # note: same name as in HDF5
  top: "label_train"   #
  hdf5_data_param {
    source: "/path/to/list/file.txt"
    batch_size: 20
  }
  include { phase: TRAIN }
}
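For reference (an assumption, since the post does not show it): the file named by source is a plain-text list with one HDF5 file path per line, e.g. a single line containing /path/to/my_data.h5.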
but I get an error:
( Check failed: hdf_blobs_[i]->shape(0) == num (200 vs. 6000))
I1222 17:02:48.915861 3941 layer_factory.hpp:76] Creating layer data
I1222 17:02:48.915871 3941 net.cpp:110] Creating Layer data
I1222 17:02:48.915877 3941 net.cpp:433] data -> new_train_x
I1222 17:02:48.915890 3941 net.cpp:433] data -> label_train
I1222 17:02:48.915900 3941 hdf5_data_layer.cpp:81] Loading list of HDF5 filenames from: file.txt
I1222 17:02:48.915923 3941 hdf5_data_layer.cpp:95] Number of HDF5 files: 1
F1222 17:02:48.993865 3941 hdf5_data_layer.cpp:55] Check failed: hdf_blobs_[i]->shape(0) == num (200 vs. 6000)
*** Check failure stack trace: ***
@ 0x7fd2e6608ddd google::LogMessage::Fail()
@ 0x7fd2e660ac90 google::LogMessage::SendToLog()
@ 0x7fd2e66089a2 google::LogMessage::Flush()
@ 0x7fd2e660b6ae google::LogMessageFatal::~LogMessageFatal()
@ 0x7fd2e69f9eda caffe::HDF5DataLayer<>::LoadHDF5FileData()
@ 0x7fd2e69f901f caffe::HDF5DataLayer<>::LayerSetUp()
@ 0x7fd2e6a48030 caffe::Net<>::Init()
@ 0x7fd2e6a49278 caffe::Net<>::Net()
@ 0x7fd2e6a9157a caffe::Solver<>::InitTrainNet()
@ 0x7fd2e6a928b1 caffe::Solver<>::Init()
@ 0x7fd2e6a92c19 caffe::Solver<>::Solver()
@ 0x41222d caffe::GetSolver<>()
@ 0x408ed9 train()
@ 0x406741 main
@ 0x7fd2e533ca40 (unknown)
@ 0x406f69 _start
Aborted (core dumped)
Many thanks!!!! Any advice would be appreciated!
The problem
It seems like there is indeed a conflict in the ordering of array elements: Matlab arranges elements from the first dimension to the last (like Fortran), while Caffe and HDF5 store arrays from the last dimension to the first.
Suppose we have X of shape n-by-c-by-h-by-w; then the "second element of X" is X(2,1,1,1) in Matlab but X[0,0,0,1] in C (1-based vs. 0-based indexing doesn't make life easier at all).
Therefore, when you save an array of size=[200, 6000, 1, 1] in Matlab, what HDF5 and Caffe actually see is an array of shape=[6000, 200].
Using the h5ls command line tool can help you spot the problem.
In Matlab you saved:
>> hdf5write('my_data.h5', '/new_train_x', ...
       single( reshape(new_train_x, [200, 6000, 1, 1]) ) );
>> hdf5write('my_data.h5', '/label_train', ...
       single( reshape(label_train, [200, 1, 1, 1]) ), ...
       'WriteMode', 'append' );
Now you can inspect the resulting my_data.h5 using h5ls (in a Linux terminal):
user@host:~/$ h5ls ./my_data.h5
label_train Dataset {200}
new_train_x Dataset {6000, 200}
As you can see, the arrays are written "backwards".
Solution
Taking this conflict into account when exporting data from Matlab, you should permute the dimensions:
load data.mat
hdf5write('my_data.h5', '/new_train_x', ...
    single( permute(reshape(new_train_x, [200, 6000, 1, 1]), [4:-1:1]) ) );
hdf5write('my_data.h5', '/label_train', ...
    single( permute(reshape(label_train, [200, 1, 1, 1]), [4:-1:1]) ), ...
    'WriteMode', 'append' );
Inspecting the resulting my_data.h5 with h5ls now gives:
user@host:~/$ h5ls ./my_data.h5
label_train Dataset {200, 1, 1, 1}
new_train_x Dataset {200, 6000, 1, 1}
Which is what you expected in the first place.
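An alternative sketch, not from the original answer: writing the file from Python with h5py sidesteps the ordering issue entirely, since numpy's C order matches what Caffe expects (this assumes new_train_x in data.mat is 200x6000 and label_train is 200x1):

import numpy as np
import h5py
from scipy.io import loadmat

mat = loadmat('data.mat')
X = mat['new_train_x'].astype(np.float32).reshape(200, 6000, 1, 1)
y = mat['label_train'].astype(np.float32).reshape(200, 1, 1, 1)

with h5py.File('my_data.h5', 'w') as f:
    # h5py writes row-major, so h5ls reports {200, 6000, 1, 1} directly
    f.create_dataset('new_train_x', data=X)
    f.create_dataset('label_train', data=y)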
