Obtaining inconsistent results in Spark - pyspark

Have any Spark experts had this strange experience: obtaining inconsistent map-reduce results using pyspark?
Suppose that, midway through my job, I have an RDD
....
from operator import add
rdd = sc.parallelize([(('Alex', item1), 3), (('Joe', item2), 1),...])
My goal is to count how many distinct users there are, so I do
print (set(rdd.map(lambda x: (x[0][0],1)).reduceByKey(add).collect()))
print (rdd.map(lambda x: (x[0][0],1)).reduceByKey(add).collect())
print (set(rdd.map(lambda x: (x[0][0],1)).reduceByKey(add).map(lambda x: x[0]).collect()))
These three prints should have the same content (though in different formats). For example, the first should be a set like {('Alex', 1), ('John', 10), ('Joe', 2), ...}; the second a list like [('Alex', 1), ('John', 10), ('Joe', 2), ...]. The number of items should equal the number of distinct users. The third should be a set like {'Alex', 'John', 'Joe', ...}.
But instead I got the set {('Alex', 1), ('John', 2), ('Joe', 3), ...} and the list [('John', 5), ('Joe', 2), ...] ('Alex' is even missing here). The lengths of the set and the list are different.
Unfortunately, I cannot even reproduce the error in a short test script; there I always get the right results. Has anyone met this problem before?

I think I figured it out.
The reason is that if I reuse the same RDD several times, I need to .cache() it.
If the RDD becomes
rdd = sc.parallelize([(('Alex', item1), 3), (('Joe', item2), 1),...]).cache()
then the inconsistency goes away.
Or, if I first prepare the aggregated RDD as
aggregated_rdd = rdd.map(lambda x: (x[0][0],1)).reduceByKey(add)
print (set(aggregated_rdd.collect()))
print (aggregated_rdd.collect())
print (set(aggregated_rdd.map(lambda x: x[0]).collect()))
then there are no inconsistencies either.
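To make the failure mode concrete, here is a minimal, self-contained sketch; the random-key source is my own stand-in for a nondeterministic upstream step. Without cache(), every action replays the lineage, so a nondeterministic source can give different answers each time:
import random
from operator import add
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Stand-in for a nondeterministic upstream step: each time the
# lineage is recomputed, the keys may come out differently.
users = ['Alex', 'Joe', 'John']
flaky = sc.parallelize(range(100)).map(
    lambda i: ((random.choice(users), i), 1))

counts = flaky.map(lambda x: (x[0][0], 1)).reduceByKey(add)
print(counts.collect())  # recomputes the lineage
print(counts.collect())  # recomputes it again; may disagree

# cache() materializes the partitions on first use, so every
# later action reads the same data.
stable = flaky.cache()
counts = stable.map(lambda x: (x[0][0], 1)).reduceByKey(add)
print(counts.collect())
print(counts.collect())  # now consistent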

Related

PyTest Mark some parameters as slow but not others [duplicate]

I have been trying to parameterize my tests using @pytest.mark.parametrize, and I have a marker @pytest.mark.test("1234"); I use the value from the test marker to post the results to JIRA. Note that the value given for the marker changes for every set of test data. Essentially the code looks something like below.
@pytest.mark.foo
@pytest.mark.parametrize(("n", "expected"), [
    (1, 2),
    (2, 3)])
def test_increment(n, expected):
    assert n + 1 == expected
I want to do something like
@pytest.mark.foo
@pytest.mark.parametrize(("n", "expected"), [
    (1, 2, @pytest.mark.test("T1")),
    (2, 3, @pytest.mark.test("T2"))
])
How can I add the marker when using parameterized tests, given that the value of the marker changes with each test?
It's explained here in the documentation: https://docs.pytest.org/en/stable/example/markers.html#marking-individual-tests-when-using-parametrize
To show it here as well, it'd be:
@pytest.mark.foo
@pytest.mark.parametrize(("n", "expected"), [
    pytest.param(1, 2, marks=pytest.mark.T1),
    pytest.param(2, 3, marks=pytest.mark.T2),
    (4, 5)
])
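Tying this back to the JIRA-ID use case, a minimal runnable sketch might look like the following (the test marker and the get_closest_marker lookup mirror the asker's setup, not anything prescribed by the docs; a custom mark like this should also be registered in pytest.ini to avoid warnings):
import pytest

@pytest.mark.parametrize(("n", "expected"), [
    pytest.param(1, 2, marks=pytest.mark.test("T1")),
    pytest.param(2, 3, marks=pytest.mark.test("T2")),
])
def test_increment(n, expected, request):
    # Read the per-parameter marker value, e.g. to report to JIRA.
    marker = request.node.get_closest_marker("test")
    test_id = marker.args[0] if marker else None
    print(test_id)  # "T1" for the first case, "T2" for the second
    assert n + 1 == expected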

Print a specific partition of an RDD / DataFrame

I have been experimenting with partitions and repartitioning of PySpark RDDs.
I noticed, when repartitioning a small sample RDD from 2 to 6 partitions, that a few empty partitions are simply added.
rdd = sc.parallelize([1,2,3,43,54,678], 2)
rdd.glom().collect()
>>> [[1, 2, 3], [43, 54, 678]]
rdd6 = rdd.repartition(6)
rdd6.glom().collect()
>>> [[], [1, 2, 3], [], [], [], [43, 54, 678]]
Now, I wonder if that also happens in my real data.
It seems I can't use glom() on larger data (a df with 192497 rows):
df.rdd.glom().collect()
Because when I try, nothing happens. It makes sense, though; the resulting print would be enormous...
SO
I'd like to print each partition to check whether they are empty, or at least print the top 20 elements of each partition.
Any ideas?
PS: I found solutions for Spark, but I couldn't get them to work in PySpark...
How to print elements of particular RDD partition in Spark?
By the way: if someone can explain to me why I get those empty partitions in the first place, I'd be all ears...
Or how I know when to expect this and how to avoid it.
Or does it simply not influence performance if there are empty partitions in a dataset?
Apparently (and surprisingly), rdd.repartition does not split up the existing partitions' contents here (in PySpark the pickled elements travel in batches), so no wonder the distribution is unequal. One way to go is to use DataFrame.repartition:
rdd = sc.parallelize([1,2,3,43,54,678], 2)
rdd.glom().collect()
>>> [[1, 2, 3], [43, 54, 678]]
rdd6 = rdd.repartition(6)
rdd6.glom().collect()
>>> [[], [1, 2, 3], [], [], [], [43, 54, 678]]
import pyspark.sql.types as T

rdd6_df = spark.createDataFrame(rdd, T.IntegerType()).repartition(6).rdd
rdd6_df.glom().collect()
[[Row(value=678)],
[Row(value=3)],
[Row(value=2)],
[Row(value=1)],
[Row(value=43)],
[Row(value=54)]]
Concerning the possibility to check whether partitions are empty, I came across a few solutions myself:
(if there aren't that many partitions)
rdd.glom().collect()
>>>nothing happens
rdd.glom().collect()[1]
>>>[1, 2, 3]
Careful though: it will truly print the whole partition. For my data it resulted in a few thousand lines of print, but it worked!
source: How to print elements of particular RDD partition in Spark?
Count the lines in each partition and show the smallest/largest count:
# one (partition_index, row_count) pair per partition
l = df.rdd.mapPartitionsWithIndex(lambda x, it: [(x, sum(1 for _ in it))]).collect()
min(l,key=lambda item:item[1])
>>>(2, 61705)
max(l,key=lambda item:item[1])
>>>(0, 65875)
source: Spark Dataframes: Skewed Partition after Join
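And for the original ask of printing only the top 20 elements of each partition, a minimal sketch along the same lines (the helper name is mine):
from itertools import islice

def head_of_partition(idx, it):
    # keep only the first 20 elements of each partition
    yield (idx, list(islice(it, 20)))

for idx, head in df.rdd.mapPartitionsWithIndex(head_of_partition).collect():
    print(idx, head)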

How to read csv in pyspark as different types, or map dataset into two different types

Is there a way to map the RDD as
covidRDD = sc.textFile("us-states.csv") \
    .map(lambda x: x.split(","))

# reducing states and cases by key
reducedCOVID = covidRDD.reduceByKey(lambda accum, n: accum + n)
print(reducedCOVID.take(1))
The dataset consists of one column of states and one column of cases. When it's created, it is read as
[[u'Washington', u'1'],...]
Thus, I want to have a column of strings and a column of ints. I am doing a project on RDDs, so I want to avoid using dataframes. Any thoughts?
Thanks!
As the dataset contains key-value pairs, use groupByKey and aggregate the counts.
If you have a dataset like [['WH', 10], ['TX', 5], ['WH', 2], ['IL', 5], ['TX', 6]]
The code below gives this output - [('IL', 5), ('TX', 11), ('WH', 12)]
data.groupByKey().map(lambda row: (row[0], sum(row[1]))).collect()
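Note that in the asker's case the values arrive as strings, so cast them while building the pair RDD. A minimal sketch, assuming a headerless two-column CSV:
covidRDD = sc.textFile("us-states.csv") \
    .map(lambda line: line.split(",")) \
    .map(lambda cols: (cols[0], int(cols[1])))  # (str state, int cases)

print(covidRDD.reduceByKey(lambda accum, n: accum + n).take(1))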
You can also use aggregateByKey with a user-defined function. This method requires three parameters: the start (zero) value, an aggregation function within each partition, and an aggregation function across partitions.
This code produces the same result as above:
def addValues(a, b):
    return a + b

data.aggregateByKey(0, addValues, addValues).collect()
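The two functions only need to differ when the accumulated value has a different type from the input values. For instance, a sketch computing per-key averages (the function names here are mine):
# running value is a (sum, count) pair, a different type from the int inputs
def seq_op(acc, value):   # within a partition
    return (acc[0] + value, acc[1] + 1)

def comb_op(a, b):        # across partitions
    return (a[0] + b[0], a[1] + b[1])

sums = data.aggregateByKey((0, 0), seq_op, comb_op)
print(sums.mapValues(lambda sc: sc[0] / sc[1]).collect())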

Spark reduceByKey returns y

I have a dataset
1, india, delhi
2, chaina, bejing
3, russia, mosco
2, england, London
When I perform
df.map(rec => (rec.split(",")(0).toInt, rec))
.reduceByKey((x,y)=> y)
.map(rec => rec._2)
.foreach {println }
The above code returns the output below. Usually reduceByKey works with an accumulated value and the current value to sum the values for the same key, but how is it working internally here? What value is x, and what value is y? And how does it end up returning y?
1, india, delhi
2, chaina, bejing
3, russia, mosco
Re:"What value x and what value y", you can print to see their values. Make sure you check the executor logs and not driver to see this print statement. Moreover run it multiple times to see if they yield same values for x and y everytime. I do not think the order to read the records is guaranteed. It may not be evident with 4 records you are testing with above.
df.map(rec => (rec.split(",")(0).toInt, rec))
  .reduceByKey((x, y) => { println(s"x:$x,y:$y"); y })
  .map(rec => rec._2)
  .foreach { println }
Re:"how it is working internally"
reduceByKey merges values for a Key based on the given function. This function is first run locally on each partition. The output for each partition is then shuffled based on the keys and then another reduce operation happens. This is similar to combiner function in Map-reduce. This helps in less amount of data needed to shuffle.
Generally this is used in place of groupByKey(), which results in shuffling at the beginning and then you get a chance to work on the values for the keys.
Attaching a couple of diagrams to demonstrate this:
[Diagram: reduceByKey — values combined within each partition before the shuffle]
[Diagram: groupByKey — all values shuffled before grouping]
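A minimal PySpark sketch of the same point (translated from the Scala snippet above, since the rest of this page is pyspark): with a last-value reducer like (x, y) => y, which value survives depends on partitioning and merge order.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([(2, "chaina, bejing"), (2, "england, London")], 2)

# x is the value accumulated so far for the key, y is the next value;
# a reducer of (x, y) => y simply keeps the later-arriving value, and
# which one arrives later depends on partitioning and task order.
print(pairs.reduceByKey(lambda x, y: y).collect())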

PySpark filtering gives inconsistent behavior

So I have a data set where I do some transformations and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
    df = getattr(self, name)
    df = df.where(df[frequency_col] >= threshold)
    setattr(self, name, df)
    return self
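For context, here is one way a method like this might be wired up; the wrapper class and attribute names are my own illustration, not the asker's code:
# hypothetical wrapper class holding DataFrames as attributes
class Pipeline:
    def __init__(self, df):
        self.events = df

    filter_rows = filter_rows  # the method defined above

pipe = Pipeline(df).filter_rows(name='events', threshold=1)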
The problem is a very strange behavior: if I use a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that I perform some other transformations before this step. I've noticed weird behavior in Spark in the past: sometimes doing very simple things like this after a join or a union gives very strange results, and eventually the only solution is to write out the data, read it back in, and do the operation in a completely separate script. I hope there is a better solution than this!