Get networkx filtered nodes from subgraph_view of filtered edges - networkx

I've created a subgraph_view by applying a filter to edges. When I call nodes() on the subgraph it still shows me all nodes, even if none of the edges use them. I need to get a list of only nodes that are still part of the subgraph.
import networkx as nx
G = nx.path_graph(6)
G[2][3]["cross_me"] = False
G[3][4]["cross_me"] = False
def filter_edge(n1, n2):
    return G[n1][n2].get("cross_me", True)
view = nx.subgraph_view(G, filter_edge=filter_edge)
# node 3 is no longer used by any edges in the subgraph
view.edges()
This produces
EdgeView([(0, 1), (1, 2), (4, 5)])
as expected. However, when I run view.nodes() I get
NodeView((0, 1, 2, 3, 4, 5))
What I expect to see is
NodeView((0, 1, 2, 4, 5))
This seems odd. Is there some way to extract only the nodes used by the subgraph?

The confusion stems from the definition of 'graph.' A disconnected node is still a part of a graph. In fact, you could have a graph with no edges at all. So the behavior of subgraph_view() is counterintuitive but correct.
If, however, you still want to achieve what you're describing, there are lots of potential ways, depending on your tolerance for modifying the original graph. I'll mention two that attempt to stay as close to your current method as possible and avoid deleting edges or nodes from G.
Method 1
The easiest way using your view object is to take it as input to edge_subgraph() (which only takes edges as input) like this:
final_view = view.edge_subgraph(view.edges())
final_view.nodes()
gives
NodeView((0, 1, 2, 4, 5))
Method 2
To me, Method 1 seems clunky and confusing because it defines an intermediate view. If we instead go back to G, we can define a filter_node function that checks the attributes of each node's edges and filters out that node if
all edges are flagged for removal, or
the node has no edges in the first place.
You could also do this by manually flagging the node itself, as you've done with the edges.
import networkx as nx
G = nx.path_graph(6)
G[2][3]["cross_me"] = False
G[3][4]["cross_me"] = False
def filter_edge(n1, n2):
    return G[n1][n2].get("cross_me", True)
def filter_node(n):
    # keep the node only if at least one of its edges is still crossable
    return any(d.get("cross_me", True) for _, _, d in G.edges(n, data=True))
view = nx.subgraph_view(G, filter_node=filter_node, filter_edge=filter_edge)
view.nodes()
also gives the expected
NodeView((0, 1, 2, 4, 5))
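As a library-free illustration of what both methods compute, the nodes "still in use" are simply the endpoints of the remaining edges. The sketch below uses the edge list printed in the question rather than a live view; the variable names are only illustrative. Since iterating view.edges() yields (u, v) tuples, the same expression should also work on the view directly.

```python
from itertools import chain

# Edge list copied from the EdgeView output shown in the question.
edges = [(0, 1), (1, 2), (4, 5)]

# Every node that is an endpoint of at least one remaining edge.
used_nodes = sorted(set(chain.from_iterable(edges)))
print(used_nodes)  # [0, 1, 2, 4, 5]
```

This gives a plain list rather than a graph view, so it only fits cases where the node labels are all that is needed.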

Related

PySpark filtering gives inconsistent behavior

So I have a data set where I do some transformations and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
    df = getattr(self, name)
    df = df.where(df[frequency_col] >= threshold)
    setattr(self, name, df)
    return self
The problem is a very strange behavior where if I put a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations. I've noticed weird behavior in Spark in the past: sometimes doing very simple things like this after a join or a union gives very strange results, and eventually the only solution is to write out the data, read it back in again, and do the operation in a completely separate script. I hope there is a better solution than this!

Reuse previous example in pytest parametrize

Consider a parametrized pytest test, which reuses the same complex
example a number of times. To keep the sample code as simple as possible,
I have simulated the 'complex' examples with very long integers.
from operator import add
from pytest import mark
parm = mark.parametrize
@parm(' left, right, result',
      ((9127384955, 1, 9127384956),
       (9127384955, 2, 9127384957),
       (9127384955, 3, 9127384958),
       (9127384955, 4, 9127384959),
       (4729336234, 1, 4729336235),
       (4729336234, 2, 4729336236),
       (4729336234, 3, 4729336237),
       (4729336234, 4, 4729336238),
       ))
def test_one(left, right, result):
    assert add(left, right) == result
The first four (and the next four) examples use exactly the same value for left but:
I have to read the examples carefully to realize this
This repetition is verbose
I would like to make it absolutely clear that exactly the same example is being reused, and save myself the need to repeat the same example many times. (Of course, I could bind the example to a global variable and use that variable, but the variable would have to be bound at some distant point outside my collection of examples, and I want to see the actual example in the context in which it is used, i.e. near the other values in this particular set, rather than having to look for it elsewhere.)
Here is an implementation which allows me to make this explicit using a syntax that I find perfectly acceptable, but the implementation itself is horrible: it uses a global variable and doesn't stand a chance of working with distributed test execution.
class idem: pass
@parm(' left, right, result',
      ((9127384955, 1, 9127384956),
       (   idem   , 2, 9127384957),
       (   idem   , 3, 9127384958),
       (   idem   , 4, 9127384959),
       (4729336234, 1, 4729336235),
       (   idem   , 2, 4729336236),
       (   idem   , 3, 4729336237),
       (   idem   , 4, 4729336238),
       ))
def test_two(left, right, result):
    global previous_left
    if left is idem: left = previous_left
    else           : previous_left = left
    assert add(left, right) == result
How can this idea be implemented in a more robust way? Is there some feature built in to pytest that could help?
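One possible direction, sketched here under assumptions rather than offered as a built-in pytest feature: resolve the idem placeholders once, when the example table is built, so that each test receives ordinary values and no state is shared between test runs. The helper name expand_idem is hypothetical, and it assumes every row has the same number of columns.

```python
class idem:
    """Placeholder meaning 'same value as in the row above' (name from the question)."""


def expand_idem(rows):
    """Return a list of rows in which every `idem` is replaced by the value
    from the same column of the previous (already expanded) row."""
    expanded = []
    for row in rows:
        if expanded:
            previous = expanded[-1]
            row = tuple(prev if value is idem else value
                        for value, prev in zip(row, previous))
        elif any(value is idem for value in row):
            raise ValueError("the first row cannot contain idem")
        expanded.append(tuple(row))
    return expanded


EXAMPLES = expand_idem((
    (9127384955, 1, 9127384956),
    (   idem   , 2, 9127384957),
    (   idem   , 3, 9127384958),
    (   idem   , 4, 9127384959),
    (4729336234, 1, 4729336235),
    (   idem   , 2, 4729336236),
    (   idem   , 3, 4729336237),
    (   idem   , 4, 4729336238),
))

# Hypothetical usage with pytest (not executed here):
#
#     @mark.parametrize('left, right, result', EXAMPLES)
#     def test_three(left, right, result):
#         assert add(left, right) == result
```

Because the substitution happens before the rows are handed to parametrize, the test body needs no global variable, and each parametrized case carries its own complete set of values.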

Create Guava ImmutableRangeSet from overlapping ranges

Apparently Guava's ImmutableRangeSet cannot store overlapping ranges. This makes sense, but is there an interface to resolve/merge overlapping ranges and then put the resultant ranges into an ImmutableRangeSet?
Currently I'm building a TreeRangeSet, which automatically merges overlapping ranges, and passing this as an argument to ImmutableRangeSet.builder().addAll(). This process works, but it seems a little too indirect just to resolve overlapping ranges.
Can you be more specific about your use case? I guess you have a collection of ranges and you're trying to create an ImmutableRangeSet using the copyOf method, which throws an IllegalArgumentException in case of overlapping ranges. Consider this test case:
@Test
public void shouldHandleOverlappingRanges()
{
    //given
    ImmutableList<Range<Integer>> ranges = ImmutableList.of(
        Range.closed(0, 2),
        Range.closed(1, 4),
        Range.closed(9, 10)
    );
    //when
    ImmutableRangeSet<Integer> rangeSet = ImmutableRangeSet.copyOf(ranges);
    //then
    assertThat(rangeSet.asSet(DiscreteDomain.integers()))
        .containsOnly(0, 1, 2, 3, 4, 9, 10);
}
which fails with
java.lang.IllegalArgumentException:
Overlapping ranges not permitted but found [0..2] overlapping [1..4]
In this case you should use unionOf instead of copyOf and it'd pass:
//when
ImmutableRangeSet<Integer> rangeSet = ImmutableRangeSet.unionOf(ranges);
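To illustrate the kind of merging a union of ranges performs, here is a minimal Python sketch for closed intervals. It is not Guava's implementation; the function name union_of and the tuple representation (lo, hi) are assumptions made for the example.

```python
def union_of(ranges):
    """Merge closed intervals (lo, hi) that overlap or share an endpoint."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1]:
            # Overlaps (or touches) the last merged interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Same three ranges as in the test case above.
ranges = [(0, 2), (1, 4), (9, 10)]
print(union_of(ranges))  # [(0, 4), (9, 10)]
```

The result covers 0 through 4 plus 9 and 10, which matches the set of integers the test case expects.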

filling a matrix with Scala library breeze

I'm new to Scala and I'm having a mental block on a seemingly easy problem. I'm using the Scala library Breeze and need to take a (mutable) ArrayBuffer and put its contents into a matrix. This should be simple, but Scala is so strictly typed, and Breeze seems really picky about which data types it will accept when making a DenseVector. This is just some prototype code, but can anyone help me come up with a solution?
Right now I have something like...
//9 elements that need to go into a 3x3 matrix (1-3 as top row, 4-6 as middle row, etc.)
val numbersForMatrix: ArrayBuffer[Double] = ArrayBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9)
//the empty 3x3 matrix
var M: breeze.linalg.DenseMatrix[Double] = DenseMatrix.zeros(3,3)
In breeze you can do stuff like
M(0,0) = 100 and set the first value to 100 this way,
You can also do stuff like:
M(0, 0 to 2) := DenseVector(1, 2, 3)
which sets the first row to 1, 2, 3
But I cannot get it to do something like...
var dummyList: List[Double] = List(1, 2, 3) //this works
var dummyVec = DenseVector[Double](dummyList) //this works
M(0, 0 to 2) := dummyVec //this does not work
and successfully change the first row to the 1, 2,3.
And that's with a List, not even an ArrayBuffer.
Am willing to change datatypes from ArrayBuffer but just not sure how to approach this at all... could try updating the matrix values one by one but that seems like it would be VERY hacky to code up(?).
Note: I'm a Python programmer who is used to using numpy and just giving it arrays. The breeze documentation doesn't provide enough examples with other datatypes for me to have been able to figure this out yet.
Thanks!
In addition to being picky about types, Breeze is pretty picky about vector shape: DenseVectors are column vectors, but you are trying to assign to a slice of a row, which expects a transposed DenseVector:
M(0, 0 to 2) := dummyVec.t

Testing a tensorflow network: in_top_k() replacement for multilabel classification

I've created a neural network in tensorflow. This network is multilabel. Ergo: it tries to predict multiple output labels for one input set, in this case three. Currently I use this code to test how accurate my network is at predicting the three labels:
_, indices_1 = tf.nn.top_k(prediction, 3)
_, indices_2 = tf.nn.top_k(item_data, 3)
correct = tf.equal(indices_1, indices_2)
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
percentage = accuracy.eval({champion_data:input_data, item_data:output_data})
That code works fine. The problem is now that I'm trying to create code that tests if the top 3 items it finds in indices_1 are amongst the top 5 images in indices_2. I know tensorflow has an in_top_k() method, but as far as I know that doesn't accept multilabel. Currently I've been trying to compare them using a for loop:
_, indices_1 = tf.nn.top_k(prediction, 5)
_, indices_2 = tf.nn.top_k(item_data, 3)
indices_1 = tf.unpack(tf.transpose(indices_1, (1, 0)))
indices_2 = tf.unpack(tf.transpose(indices_2, (1, 0)))
correct = []
for element in indices_1:
    for element_2 in indices_2:
        if element == element_2:
            correct.append(True)
        else:
            correct.append(False)
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
percentage = accuracy.eval({champion_data:input_data, item_data:output_data})
However, that doesn't work. The code runs but my accuracy is always 0.0.
So I have one of two questions:
1) Is there an easy replacement for in_top_k() that accepts multilabel classification that I can use instead of custom writing code?
2) If not 1: what am I doing wrong that results in me getting an accuracy of 0.0?
When you do
correct = tf.equal(indices_1, indices_2)
you are checking not just whether those two indices contain the same elements but whether they contain the same elements in the same positions. This doesn't sound like what you want.
The setdiff1d op will tell you which indices are in indices_1 but not in indices_2, which you can then use to count errors.
I think being too strict with the correctness check might be what is causing you to get a wrong result.
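The difference between a positional check and a membership check can be seen in a small plain-Python sketch. It uses ordinary lists and sets rather than TensorFlow ops; the function names and the index values are made up for illustration.

```python
def positional_matches(predicted, actual):
    """Element-wise comparison: True only where the same index sits in the same position."""
    return [p == a for p, a in zip(predicted, actual)]


def missing_labels(predicted, actual):
    """Labels in `actual` that appear nowhere in `predicted` (a set difference)."""
    return sorted(set(actual) - set(predicted))


predicted_top5 = [7, 2, 9, 4, 1]  # made-up predicted indices, best first
actual_top3 = [2, 7, 4]           # made-up true label indices

# Position-by-position check on the first three predictions: nothing lines up.
print(positional_matches(predicted_top5[:3], actual_top3))  # [False, False, False]

# Set-based check: every true label is somewhere in the top 5, so nothing is missing.
missing = missing_labels(predicted_top5, actual_top3)
print(missing)  # []
print(1 - len(missing) / len(actual_top3))  # 1.0
```

Here the positional check reports no matches even though all three true labels are among the five predictions, which is the kind of overly strict result described above.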