How can I conditionally skip a parameterized pytest scenario? - pytest

I need to flag certain tests to be skipped. However, some of the tests are parameterized, and I need to be able to skip only certain scenarios.
I invoke the test using py.test -m "hermes_only" or py.test -m "not hermes_only" as appropriate.
Simple test cases are marked using:
@pytest.mark.hermes_only
def test_blah_with_hermes(self):
However, I have some parameterized tests:
outfile_scenarios = [('buildHermes'),
                     ('buildTrinity')]

@pytest.mark.parametrize('prefix', outfile_scenarios)
def test_blah_build(self, prefix):
    self._activator(prefix=prefix)
I would like a mechanism to filter the scenario list or otherwise skip certain tests if a pytest mark is defined.
More generally, how can I test for the definition of a pytest mark?
Thank you.

A nice solution from the documentation is this:
import sys
import pytest

@pytest.mark.parametrize(
    ("n", "expected"),
    [
        (1, 2),
        pytest.param(1, 0, marks=pytest.mark.xfail),
        pytest.param(1, 3, marks=pytest.mark.xfail(reason="some bug")),
        (2, 3),
        (3, 4),
        (4, 5),
        pytest.param(
            10, 11, marks=pytest.mark.skipif(sys.version_info >= (3, 0), reason="py2k")
        ),
    ],
)
def test_increment(n, expected):
    assert n + 1 == expected

Found it! It's elegant in its simplicity. I just mark the affected scenarios:
outfile_scenarios = [pytest.mark.hermes_only('buildHermes'),
                     ('buildTrinity')]
I hope this helps others.
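Note for readers on current pytest: applying a mark directly to a parameter value like this was removed in later pytest releases in favor of pytest.param. A rough equivalent, reusing the names from the question, would be:
import pytest

outfile_scenarios = [
    pytest.param('buildHermes', marks=pytest.mark.hermes_only),
    'buildTrinity',
]

@pytest.mark.parametrize('prefix', outfile_scenarios)
def test_blah_build(prefix):
    ...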

Related

Pyspark Usage of Col() Function

I am new to Spark. I am confused because there are multiple ways to access a column of a DataFrame. For example:
df.select(df.columnname).show()
Another way:
df.select(col("columnname")).show()
There are scenarios where we cannot use the column name directly and have to use col() instead. However, I am unable to find those scenarios where we must access a column using col() or else an error is thrown.
This is an example that would fail without using col():
df = df.withColumn('col3', when(df.col1 > 2, 5)) \
       .withColumn('col4', when(df.col3 < 6, 4))
It fails because it tries to access a column that is part of the projection chain but is not part of the DataFrame object yet.
This would work:
df = df.withColumn('col3', when(df.col1 > 2, 5)) \
       .withColumn('col4', when(col("col3") < 6, 4))
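For completeness, here is a minimal, self-contained sketch of the working version (the SparkSession setup and the sample data are illustrative additions, not from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (3,)], ['col1'])

# df.col3 would fail in the second withColumn: col3 is not an attribute of df yet.
# col("col3") is an unresolved column, resolved lazily against the projection chain.
df = df.withColumn('col3', when(df.col1 > 2, 5)) \
       .withColumn('col4', when(col("col3") < 6, 4))
df.show()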

PyTest Mark some parameters as slow but not others [duplicate]

I have been trying to parameterize my tests using @pytest.mark.parametrize, and I have a marker @pytest.mark.test("1234") whose value I use to post the results to JIRA. Note that the value given to the marker changes for every test_data. Essentially the code looks something like this:
@pytest.mark.foo
@pytest.mark.parametrize(("n", "expected"), [
    (1, 2),
    (2, 3)])
def test_increment(n, expected):
    assert n + 1 == expected
I want to do something like
@pytest.mark.foo
@pytest.mark.parametrize(("n", "expected"), [
    (1, 2, pytest.mark.test("T1")),
    (2, 3, pytest.mark.test("T2"))
])
How can I add the marker when using parameterized tests, given that the value of the marker is expected to change with each test?
It's explained here in the documentation: https://docs.pytest.org/en/stable/example/markers.html#marking-individual-tests-when-using-parametrize
To show it here as well, it'd be:
@pytest.mark.foo
@pytest.mark.parametrize(("n", "expected"), [
    pytest.param(1, 2, marks=pytest.mark.T1),
    pytest.param(2, 3, marks=pytest.mark.T2),
    (4, 5)
])
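Since the marker in the question takes an argument, the same pattern should also work with a parameterized mark; a hedged sketch reusing the question's mark names (register test and foo in pytest.ini to avoid unknown-mark warnings):
import pytest

@pytest.mark.foo
@pytest.mark.parametrize(("n", "expected"), [
    pytest.param(1, 2, marks=pytest.mark.test("T1")),
    pytest.param(2, 3, marks=pytest.mark.test("T2")),
])
def test_increment(n, expected):
    assert n + 1 == expected
Inside a fixture or hook, something like request.node.get_closest_marker("test").args[0] would then return "T1" or "T2" for posting to JIRA.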

Reuse previous example in pytest parametrize

Consider a parametrized pytest test, which reuses the same complex
example a number of times. To keep the sample code as simple as possible,
I have simulated the 'complex' examples with very long integers.
from operator import add
from pytest import mark
parm = mark.parametrize

@parm('left, right, result',
      ((9127384955, 1, 9127384956),
       (9127384955, 2, 9127384957),
       (9127384955, 3, 9127384958),
       (9127384955, 4, 9127384959),
       (4729336234, 1, 4729336235),
       (4729336234, 2, 4729336236),
       (4729336234, 3, 4729336237),
       (4729336234, 4, 4729336238),
      ))
def test_one(left, right, result):
    assert add(left, right) == result
The first four (and the next four) examples use exactly the same value for left but:
I have to read the examples carefully to realize this
This repetition is verbose
I would like to make it absolutely clear that exactly the same example is being reused, and to save myself the need to repeat the same example many times. (Of course, I could bind the example to a global variable and use that variable, but that variable would have to be bound at some distant point outside my collection of examples, and I want to see the actual example in the context in which it is used, i.e. near the other values used in this particular set, rather than having to look for it elsewhere.)
Here is an implementation which allows me to make this explicit using a syntax that I find perfectly acceptable, but the implementation itself is horrible: it uses a global variable and doesn't stand a chance of working with distributed test execution.
class idem: pass

@parm('left, right, result',
      ((9127384955, 1, 9127384956),
       ( idem     , 2, 9127384957),
       ( idem     , 3, 9127384958),
       ( idem     , 4, 9127384959),
       (4729336234, 1, 4729336235),
       ( idem     , 2, 4729336236),
       ( idem     , 3, 4729336237),
       ( idem     , 4, 4729336238),
      ))
def test_two(left, right, result):
    global previous_left
    if left is idem: left = previous_left
    else           : previous_left = left
    assert add(left, right) == result
How can this idea be implemented in a more robust way? Is there some feature built in to pytest that could help?
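Not a built-in pytest feature as far as I know, but one more robust sketch: resolve idem while building the parameter list, before it ever reaches parametrize, so each generated test carries concrete values and nothing depends on execution order or a global (the helper name expand_idem is made up for this example):
from operator import add
from pytest import mark

class idem:
    """Placeholder meaning 'same value as in the previous row'."""

def expand_idem(rows):
    # Replace idem placeholders with the value from the same column
    # of the previous row, at collection time.
    previous = None
    expanded = []
    for row in rows:
        if previous is not None:
            row = tuple(prev if cell is idem else cell
                        for prev, cell in zip(previous, row))
        expanded.append(row)
        previous = row
    return expanded

@mark.parametrize('left, right, result', expand_idem((
    (9127384955, 1, 9127384956),
    (idem,       2, 9127384957),
    (idem,       3, 9127384958),
    (4729336234, 1, 4729336235),
    (idem,       2, 4729336236),
)))
def test_two(left, right, result):
    assert add(left, right) == result
Because the placeholders are resolved at collection time, the generated test ids show the concrete values and the approach should also work with distributed execution (pytest-xdist).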

Obtaining inconsistent results in Spark

Have any Spark experts had a similarly strange experience: obtaining inconsistent map-reduce results using pyspark?
Suppose that, midway through the job, I have an RDD
....
rdd = sc.parallelize([(('Alex', item1), 3), (('Joe', item2), 1),...])
My goal is to count how many different users there are, so I do
print (set(rdd.map(lambda x: (x[0][0],1)).reduceByKey(add).collect()))
print (rdd.map(lambda x: (x[0][0],1)).reduceByKey(add).collect())
print (set(rdd.map(lambda x: (x[0][0],1)).reduceByKey(add).map(lambda x: x[0]).collect()))
These three prints should have the same content (though in different formats). For example, the first should be a set such as {('Alex', 1), ('John', 10), ('Joe', 2), ...}; the second a list such as [('Alex', 1), ('John', 10), ('Joe', 2), ...]. The number of items should equal the number of different users. The third should be a set such as {'Alex', 'John', 'Joe', ...}.
But instead I got a set such as {('Alex', 1), ('John', 2), ('Joe', 3), ...} and a list such as [('John', 5), ('Joe', 2), ...] ('Alex' is even missing here). The lengths of the set and the list are different.
Unfortunately, I cannot even reproduce the error with a short test script; that still gives the right results. Has anyone met this problem before?
I think I figured it out.
The reason is that if I use the same RDD repeatedly, I need to .cache() it.
If the RDD becomes
rdd = sc.parallelize([(('Alex', item1), 3), (('Joe', item2), 1),...]).cache()
then the inconsistency problem is solved.
Alternatively, if I first prepare the aggregated RDD as
aggregated_rdd = rdd.map(lambda x: (x[0][0],1)).reduceByKey(add)
print (set(aggregated_rdd.collect()))
print (aggregated_rdd.collect())
print (set(aggregated_rdd.map(lambda x: x[0]).collect()))
then there are no inconsistency problems either.
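Putting both fixes together, a minimal self-contained sketch under the same assumptions (the sample data and item names are placeholders standing in for the question's elided values):
from operator import add
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Cache the source RDD so every action sees the same materialized data.
rdd = sc.parallelize([(('Alex', 'item1'), 3), (('Joe', 'item2'), 1)]).cache()

# Compute the per-user aggregation once and reuse it for each print.
aggregated_rdd = rdd.map(lambda x: (x[0][0], 1)).reduceByKey(add)

print(set(aggregated_rdd.collect()))
print(aggregated_rdd.collect())
print(set(aggregated_rdd.map(lambda x: x[0]).collect()))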

Is there an issue with using NetworkX from multiple processes on different graphs?

If each process creates its own nx.Graph() and adds/removes nodes/edges to it, is there any reason for them to collide? I am noticing some weird phenomena and trying to debug them.
The general problem is that I am dumping a single graph as an edge list and recreating a new graph from a subset of it in each process. For some reason those new graphs are missing edges.
EDIT:
I think I found the part of code which causes the problems for me, the question is whether the following is the intended behaviour of NetworkX or not:
>>> import networkx as nx
>>> g = nx.Graph()
>>> g.add_path([0,1,2,3])
>>> g.nodes()
[0, 1, 2, 3]
>>> g.edges()
[(0, 1), (1, 2), (2, 3)]
>>> g[1][0]
{}
>>> g[0][1] = {"test":1}
>>> g.edges(data=True)
[(0, 1, {'test': 1}), (1, 2, {}), (2, 3, {})]
>>> g[1][0]
{}
>>> g[0][1]
{'test': 1}
>>>
Since the graph is an undirected one, I would expect the edge data to appear regardless of the order of the node ids in the lookup; is that an incorrect assumption?
In general there should be no issue with multiple processes as long as they each have their own Graph() object.
EDIT:
In your case you are explicitly assigning data to the internal NetworkX Graph structure with the line
>>> g[0][1] = {"test":1}
While that is technically allowed, it breaks the API and the data structure. You should instead use
>>> g.add_edge(0,1,test=1)
which won't add a new edge here, only a new attribute. Doing it that way assigns the data to g[0][1] and g[1][0] correctly.
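For readers who want to verify this, a quick sketch (note that the g.add_path method shown in the question's older NetworkX has since been removed; current releases use nx.add_path instead):
import networkx as nx

g = nx.Graph()
nx.add_path(g, [0, 1, 2, 3])   # equivalent of g.add_path([0, 1, 2, 3]) in old NetworkX

g.add_edge(0, 1, test=1)       # edge already exists, so this only sets the attribute

print(g[0][1])   # {'test': 1}
print(g[1][0])   # {'test': 1} -- both lookup orders see the same data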