Inconsistent edge data in graph in networkx

Using networkx I created a graph G and added the edge (1, 2). Strangely,
(1, 2) in G.edges() == False
but
G.has_edge(1, 2) == True
I'm baffled. This is a minimal summary of the problem; the actual graph data is too large to reproduce here in full.

The issue is that G.edges() lists each edge only once. Since the graph is undirected, an edge can show up as either (1, 2) or (2, 1) --- you wouldn't want both to appear, e.g. if you were iterating over the list and deleting each edge as you saw it. You can't be sure which order you'll get, because the edges come out of a Python dict, which has no predictable ordering. You could test (1,2) in G.edges() or (2,1) in G.edges(), but you really don't want to do that: building the edge list just for a membership test is inefficient.
So the test should use G.has_edge(1,2), which checks the edge correctly regardless of node order (and much more efficiently).
Just to show the lack of predictability across Python implementations, this is what I get:
In [3]: G=nx.Graph()
In [4]: G.add_edge(1,2)
In [5]: G.edges()
Out[5]: [(1, 2)]
In [6]: (1,2) in G.edges()
Out[6]: True
In [7]: G.add_edge(4,3)
In [8]: G.edges()
Out[8]: [(1, 2), (3, 4)]
In [9]: (4,3) in G.edges()
Out[9]: False
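By contrast, G.has_edge is insensitive to node order, so it is the right (and cheap) membership test. Continuing the same session, this is what I'd expect to see (a sketch, not copied from a run):
In [10]: G.has_edge(4, 3)
Out[10]: True
In [11]: G.has_edge(3, 4)
Out[11]: True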


Polars searchsorted with a Series

searchsorted is an incredibly useful utility in numpy and pandas for performing a binary search on every element in a list, especially for time-series data.
import numpy as np
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c']) # Returns [0, 2, 3]
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c'], side='right') # Returns [2, 3, 4]
I have a few questions about Polars
Is there any way to apply search_sorted on a list in polars in a vectorized manner?
Is there any way to specify side=right for search_sorted?
Can we use non-numeric data in search_sorted?
If the answer is no to these questions, what would be the recommended approach or workaround to achieve these functionalities?
(The ideal approach would be if search_sorted could be used as part of an expression, e.g. pl.col('A').search_sorted(pl.col('B')))
Here's what I have tried:
import polars as pl
pl.Series(['a', 'a', 'b', 'c']).search_sorted(['a', 'b', 'c']) # PanicException: not implemented for Utf8
pl.Series([0, 0, 1, 2]).search_sorted([0, 1, 2]) # PanicException: dtype List not implemented
list(map(pl.Series([0, 0, 1, 2]).search_sorted, [0, 1, 2])) # Returns [1, 2, 3], different from numpy results
pl.DataFrame({
    'a': [0, 0, 1, 2],
    'b': [0, 1, 2, 3],
}).with_columns([
    pl.col('a').search_sorted(pl.col('b')).alias('c')
])  # column 'c' is [1, 1, 1, 1], which is incorrect
I understand Polars is still a work in progress and some functionalities are missing, so any help is greatly appreciated!
To extend @ritchie46's answer, you need a rolling join so that values without an exact match can be joined to their nearest neighbor. Unfortunately rolling joins don't work on letters, or more precisely on Utf8 dtypes, so for your example you have to do an extra step.
Starting from:
df1 = (pl.Series("a", ["a", "a", "b", "c"])
       .set_sorted()
       .to_frame()
       .with_row_count("idx"))
df2 = pl.Series("a", ["a", "b", "c"]).set_sorted().to_frame()
Then we make a df to hold all the possible values of a and map each one to a numeric index:
dfindx = (pl.DataFrame(pl.concat([df1.get_column('a'), df2.get_column('a')]).unique())
          .sort('a')
          .with_row_count('valindx'))
Now we add that valindx to each of df1 and df2:
df1=df1.join(dfindx, on='a')
df2=df2.join(dfindx, on='a')
To get almost to the finish line you'd do:
df2.join_asof(df1, on='valindx', strategy='forward')
This will leave the last value missing (the 4 from the numpy case). Essentially what's happening is that the first value 'a' doesn't find a match, but its nearest forward neighbor is a 'b', so it takes that value, and so on; when a value (say an 'e') has nothing in df1 forward of it, there is no row to join to, so we need a minor hack of just filling in that null with the max idx + 1.
(df2
 .join_asof(df1, on='valindx', strategy='forward')
 .with_column(pl.col('idx').fill_null(df1.select(pl.col('idx').max() + 1)[0, 0]))
 .get_column('idx'))
Of course, if you're working with times or numerics then you can skip the first step. Additionally, I suspect that fetching these index values is only an intermediate step, and that the overall process could be done more efficiently without extracting the index values at all, going through a join_asof directly.
If you change the strategy of join_asof, that should be largely the same as switching the side, but you'd have to change the null-filling hack at the end too.
EDIT: I added the requested functionality and it will be available in the next release: https://github.com/pola-rs/polars/pull/6083
Old answer (wrong)
For a "normal" search sorted we can use a join.
# convert to DataFrame
# provide polars with the information the data is sorted (this speeds up many algorithms)
# set a row count
df1 = (pl.Series("a", ["a", "a", "b", "c"])
       .set_sorted()
       .to_frame()
       .with_row_count("idx"))
df2 = pl.Series("a", ["a", "b", "c"]).set_sorted().to_frame()
# join
# drop duplicates
# and only show the indices that were joined
df1.join(df2, on="a", how="semi").unique(subset=["a"])["idx"]
Series: 'idx' [u32]
[
0
2
3
]

Read a matlab .mat file using h5py

I want to use the Python 3 package h5py to read a MATLAB .mat file saved in the version 7.3 format.
It contains a MATLAB variable named results.
results is a 1x1 cell, and the values in the struct inside it are what I need.
In MATLAB, I can get this data with the following code:
load('.mat PATH');
results{1}.res
How should I read this data in h5py?
An example .mat file can be obtained from here
While h5py can read the HDF5-based files that MATLAB writes, figuring out what is there takes some exploring - looking at keys, groups and datasets (and possibly attrs). There's nothing in scipy that will help you (scipy.io.loadmat is for the old MATLAB mat format).
With the downloaded file:
In [61]: f = h5py.File('Downloads/Basketball_ECO_HC.mat','r')
In [62]: f
Out[62]: <HDF5 file "Basketball_ECO_HC.mat" (mode r)>
In [63]: f.keys()
Out[63]: <KeysViewHDF5 ['#refs#', 'results']>
In [65]: f['results']
Out[65]: <HDF5 dataset "results": shape (1, 1), type "|O">
In [66]: arr = f['results'][:]
In [67]: arr
Out[67]: array([[<HDF5 object reference>]], dtype=object)
In [68]: arr.item()
Out[68]: <HDF5 object reference>
I'd have to check the h5py docs to see if I can check that object reference further. I'm not familiar with it.
But exploring the other key:
In [69]: list(f.keys())[0]
Out[69]: '#refs#'
In [70]: f[list(f.keys())[0]]
Out[70]: <HDF5 group "/#refs#" (2 members)>
In [71]: f[list(f.keys())[0]].keys()
Out[71]: <KeysViewHDF5 ['a', 'b']>
In [72]: f[list(f.keys())[0]]['a']
Out[72]: <HDF5 dataset "a": shape (2,), type "<u8">
In [73]: _[:]
Out[73]: array([0, 0], dtype=uint64)
In [74]: f[list(f.keys())[0]]['b']
Out[74]: <HDF5 group "/#refs#/b" (7 members)>
In [75]: f[list(f.keys())[0]]['b'].keys()
Out[75]: <KeysViewHDF5 ['annoBegin', 'fps', 'fps_no_ftr', 'len', 'res', 'startFrame', 'type']>
In [76]: f[list(f.keys())[0]]['b']['fps']
Out[76]: <HDF5 dataset "fps": shape (1, 1), type "<f8">
In [77]: f[list(f.keys())[0]]['b']['fps'][:]
Out[77]: array([[22.36617883]])
In the OS shell, I can look at the file with h5dump. From that it looks like the res dataset holds most of the data. The datasets also have attributes. That may be a better way of getting an overview, which can then guide the h5py loads.
In [80]: f[list(f.keys())[0]]['b']['res'][:]
Out[80]:
array([[198., 196., 195., ..., 330., 328., 326.],
       [214., 214., 216., ..., 197., 196., 192.],
       [ 34.,  34.,  34., ...,  34.,  34.,  34.],
       [ 81.,  81.,  81., ...,  81.,  80.,  80.]])
In [81]: f[list(f.keys())[0]]['b']['res'][:].shape
Out[81]: (4, 725)
In [82]: f[list(f.keys())[0]]['b']['res'][:].dtype
Out[82]: dtype('<f8')
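As a follow-up on the object reference from In [68]: h5py can dereference an HDF5 object reference by indexing the file with it, which avoids spelling out the '#refs#' path by hand. A sketch of what that would look like here (not run against this file, so treat the layout assumptions as exactly that):
In [83]: ref = f['results'][0, 0]   # the <HDF5 object reference> seen earlier
In [84]: grp = f[ref]               # dereferencing should land on the '/#refs#/b' group
In [85]: grp['res'][:].shape        # expected (4, 725), matching the explicit path above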
If your question is asking generally how to read matfiles saved using v7.3 in Python, the hdf5storage package provides some utilities that might work for you. In the case of your file (after installing the package), you would run
In [0]: import hdf5storage as hdf5
In [1]: pyIn = hdf5.loadmat('Basketball_ECO_HC.mat')
In [2]: type(pyIn)
Out[2]: dict
In [3]: pyIn.keys()
Out[3]: dict_keys(['results'])
In [4]: type(pyIn['results'])
Out[4]: numpy.ndarray
In [5]: pyIn['results'].shape
Out[5]: (1, 1)
In [6]: pyIn['results'].dtype
Out[6]: dtype('O')
In [7]: pyIn['results'][0,0].dtype
Out[7]: dtype([('type', '<U4', (1, 1)), ('res', '<f8', (725, 4)), ('fps', '<f8', (1, 1)), ('fps_no_ftr', '<f8', (1, 1)), ('len', '<f8', (1, 1)), ('annoBegin', '<f8', (1, 1)), ('startFrame', '<f8', (1, 1))])
You can see it does a good job of parsing the input array, though it does things like collapsing the cell-in-a-cell that you would access in Matlab with results{1}{1} into a 2D numpy array you access with pyIn['results'][0,0] instead. Another odd thing I ran into with this data is the addition of a dimension in the deeper structure fields, as below:
In [8]: pyIn['results'][0,0]['res'].shape
Out[8]: (1, 725, 4)
In [9]: pyIn['results'][0,0]['res'][0,0,:]
Out[9]: array([198., 214., 34., 81.])
Not entirely sure why this happens, but in general it should work well.
That said, I did run into an issue with the latest version (0.2) of this package where for really deep array/cell/structure combos it became incredibly slow. The nice thing is that this package is still being maintained, so fixes for this might be in the pipeline. Nevertheless, this prompted me to write my own h5py reader for matfiles which is faster in these cases, and I'll discuss it as another answer.
As I mentioned in my other post on the hdf5storage package, I have run into problems with it being far too slow when it comes to loading deep arrays. So I implemented my own matfile loader, whose code might also be more useful (because it's compact) if you care about the specifics of how reading a v7.3 matfile into Python works. (That said, the code currently has very few comments, so maybe not that useful.)
For the case of my library, the outputs are very similar to hdf5storage, as shown here.
In [0]: from MatFileMethods import LoadMatFile
In [1]: pyIn = LoadMatFile('/Users/emilio/Downloads/Basketball_ECO_HC.mat')
In [2]: type(pyIn)
Out[2]: dict
In [3]: pyIn.keys()
Out[3]: dict_keys(['results'])
In [4]: type(pyIn['results'])
Out[4]: numpy.ndarray
In [5]: pyIn['results'].shape
Out[5]: (1, 1)
Note that as with the hdf5storage package, the cell-within-a-cell in Matlab, which gets called using results{1}{1} becomes a two-dimensional numpy.ndarray which gets called with pyIn['results'][0,0], as below.
In [6]: type(pyIn['results'][0,0])
Out[6]: dict
In [7]: pyIn['results'][0,0].keys()
Out[7]: dict_keys(['annoBegin', 'fps', 'fps_no_ftr', 'len', 'res', 'startFrame', 'type'])
In [8]: pyIn['results'][0,0]['res'].shape
Out[8]: (725, 4)
In [9]: pyIn['results'][0,0]['res'][0,:]
Out[9]: array([198., 214., 34., 81.])
In contrast with hdf5storage, I opt to make Matlab structures into Python dicts, so that the fields of the structures are the keys of the dictionaries.
In any case, this module is by no means fully tested, but it has served me well for loading ~500 MB and larger mat files that version 0.2 of hdf5storage doesn't seem to handle (~1.5 minutes for my own loader vs. a >10 minute loading time for hdf5storage, which hadn't finished when I stopped it at the 10-minute mark). (I'll note that the 1.5 minutes still pales in comparison to Matlab's own <15 s load times, so there's still room for improvement...)

Select One Element in Each Row of a Numpy Array by Column Indices [duplicate]

This question already has answers here:
NumPy selecting specific column index per row by using a list of indexes
(7 answers)
Closed 2 years ago.
Is there a better way to get the output_array from the input_array and select_id?
Can we get rid of range(input_array.shape[0])?
>>> input_array = numpy.array( [ [3,14], [12, 5], [75, 50] ] )
>>> select_id = [0, 1, 1]
>>> print input_array
[[ 3 14]
 [12  5]
 [75 50]]
>>> output_array = input_array[ range( input_array.shape[0] ), select_id ]
>>> print output_array
[ 3 5 50]
You can choose from a given array using numpy.choose, which constructs an array from an index array (in your case select_id) and a set of arrays (in your case input_array) to choose from. However, you may first need to transpose input_array to match the dimensions. The following shows a small example:
In [101]: input_array
Out[101]:
array([[ 3, 14],
       [12,  5],
       [75, 50]])
In [102]: input_array.shape
Out[102]: (3, 2)
In [103]: select_id
Out[103]: [0, 1, 1]
In [104]: output_array = np.choose(select_id, input_array.T)
In [105]: output_array
Out[105]: array([ 3, 5, 50])
(because I can't post this as a comment on the accepted answer)
Note that numpy.choose only works if you have 32 or fewer choices (in this case, the dimension of your array along which you're indexing must be of size 32 or smaller). Additionally, the documentation for numpy.choose says
To reduce the chance of misinterpretation, even though the following "abuse" is nominally supported, choices should neither be, nor be thought of as, a single array, i.e., the outermost sequence-like container should be either a list or a tuple.
The OP asks:
Is there a better way to get the output_array from the input_array and select_id?
I would say, the way you originally suggested seems the best out of those presented here. It is easy to understand, scales to large arrays, and is efficient.
Can we get rid of range(input_array.shape[0])?
Yes, as shown by other answers, but the accepted one doesn't, in general, work as well as what the OP already suggests doing.
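For reference, here is a minimal sketch of the approach the OP already uses, with np.arange in place of range (equivalent result, and it keeps everything as arrays):
import numpy as np

input_array = np.array([[3, 14], [12, 5], [75, 50]])
select_id = np.array([0, 1, 1])

# row i is paired with column select_id[i]
output_array = input_array[np.arange(input_array.shape[0]), select_id]
print(output_array)  # [ 3  5 50]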
I think enumerate is handy.
[input_array[enum, item] for enum, item in enumerate(select_id)]
How about:
[input_array[x,y] for x,y in zip(range(len(input_array[:,0])),select_id)]

Is there an issue with using NetworkX from multiple processes on different graphs?

If each process creates its own nx.Graph() and adds/removes nodes and edges in it, is there any reason for them to collide? I am noticing some weird phenomena and trying to debug them.
The general problem is that I am dumping a single graph as an edge list and recreating a subset of it in each process as a new graph. For some reason those new graphs are missing edges.
EDIT:
I think I found the part of the code that causes the problem for me; the question is whether the following is the intended behaviour of NetworkX or not:
>>> import networkx as nx
>>> g = nx.Graph()
>>> g.add_path([0,1,2,3])
>>> g.nodes()
[0, 1, 2, 3]
>>> g.edges()
[(0, 1), (1, 2), (2, 3)]
>>> g[1][0]
{}
>>> g[0][1] = {"test":1}
>>> g.edges(data=True)
[(0, 1, {'test': 1}), (1, 2, {}), (2, 3, {})]
>>> g[1][0]
{}
>>> g[0][1]
{'test': 1}
>>>
Since the graph is an undirected one, I would expect the edge data to appear in both directions regardless of the node ids used in the lookup; is that an incorrect assumption?
In general there should be no issue with multiple processes as long as they each have their own Graph() object.
EDIT:
In your case you are explicitly assigning data to the internal NetworkX Graph structure with the line
>>> g[0][1] = {"test":1}
While that is technically allowed, it bypasses the API and breaks the data structure. You should instead use
>>> g.add_edge(0,1,test=1)
which won't add a new edge here, only a new attribute. Doing it that way assigns the data to both g[0][1] and g[1][0] correctly.
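To illustrate the difference, here is a fresh minimal session (not from the original poster; the output is what I'd expect from NetworkX):
>>> import networkx as nx
>>> h = nx.Graph()
>>> h.add_edge(0, 1, test=1)
>>> h[0][1]
{'test': 1}
>>> h[1][0]
{'test': 1}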

How to convert a multi-edge graph to a non-multi-edge graph in NetworkX?

I have used read_graphml to load a graph, and it looks as if it returned a MultiGraph object, which I can't run the PageRank method on (it raises an exception saying the graph must not be multi-edged).
Is there a way to convert my graph to a non-multi-edge type? (I don't think I actually have parallel edges in the graph I loaded...)
Thanks
If the read_graphml() function returned a MultiGraph() object, it probably found parallel (multiple) edges in the input file. But you can convert that to a graph without parallel edges simply by passing it into a new Graph(), e.g.
In [1]: import networkx as nx
In [2]: G = nx.MultiGraph([(1,2),(1,2)])
In [3]: G.edges()
Out[3]: [(1, 2), (1, 2)]
In [4]: H = nx.Graph(G)
In [5]: H.edges()
Out[5]: [(1, 2)]
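Once H is a plain Graph, PageRank runs without the multigraph exception. A quick sketch continuing the session above (by symmetry both nodes should score 0.5, up to floating-point convergence; output omitted since I didn't run this):
In [6]: nx.pagerank(H)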