aggregateByKey function with key-value pairs in PySpark

I have a question regarding aggregateByKey in PySpark.
I have an RDD dataset as follows:
premierRDD=[('Chelsea', ('2016–2017', 93)), ('Chelsea', ('2015–2016', 50))]
I wish to sum the scores 50 and 93 using the aggregateByKey function, and my expected output is:
[('Chelsea', '2016–2017', (93,143)), ('Chelsea', '2015–2016', (50,143))]
seqFunc = (lambda x, y: ('', x[0] + y[1]))
combFunc = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
premierAgg = premierMap.aggregateByKey((0,0), seqFunc, combFunc)
However, I get this output instead:
[('Chelsea', ('', 143))]
Can someone advise me on how to use the aggregateByKey function appropriately?

I adjusted your code to produce the desired result. First, seqFunc has to keep the year, so it now returns y[0] alongside the score. Second, combFunc has to return not only the sum but also each original score, paired with its year, as a tuple. On its own this yields [('Chelsea', [(u'2016-2017', (93, 143)), (u'2015-2016', (50, 143))])] because, as I explained in the comment, records with the same key are combined. To get your output with 'Chelsea' repeated for each record, apply an additional map, as shown here:
rdd = sc.parallelize([('Chelsea', (u"2016-2017", 93)), ('Chelsea', (u"2015-2016", 50))])
seqFunc = (lambda x, y: (y[0], x[0] + y[1]))
combFunc = (lambda x, y: [(x[0], (x[1],x[1] + y[1])),(y[0],(y[1],x[1]+y[1]))])
premierAgg = rdd.aggregateByKey((0,0), seqFunc,combFunc)
print premierAgg.map(lambda r: [(r[0], a) for a in r[1]]).collect()[0]
Output:
[('Chelsea', (u'2016-2017', (93, 143))), ('Chelsea', (u'2015-2016', (50, 143)))]
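The functions above appear to rely on each record being folded from the zero value on its own partition; if both records shared a partition, seqFunc would try to add a year string to a score. A minimal sketch of a variant that does not depend on partitioning follows. The names ZERO, seq_func, comb_func, expand and simulate_aggregate_by_key are hypothetical, and the simulator is only a plain-Python stand-in for Spark so the logic can be checked locally.

```python
# Accumulator per key: (list of (year, score) records, running total).
ZERO = ([], 0)

def seq_func(acc, value):
    """Fold one (year, score) value into a partition-local accumulator."""
    records, total = acc
    year, score = value
    return (records + [(year, score)], total + score)

def comb_func(a, b):
    """Merge two accumulators that were built on different partitions."""
    return (a[0] + b[0], a[1] + b[1])

def expand(pair):
    """Turn (key, (records, total)) into one row per original record."""
    key, (records, total) = pair
    return [(key, year, (score, total)) for year, score in records]

def simulate_aggregate_by_key(partitions, zero, seq, comb):
    """Hypothetical stand-in for rdd.aggregateByKey, for local checking only."""
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = seq(local.get(key, zero), value)
        for key, acc in local.items():
            merged[key] = comb(merged[key], acc) if key in merged else acc
    return list(merged.items())

data = [('Chelsea', ('2016-2017', 93)), ('Chelsea', ('2015-2016', 50))]

# The result should not depend on how the records are split across partitions.
one_partition = simulate_aggregate_by_key([data], ZERO, seq_func, comb_func)
two_partitions = simulate_aggregate_by_key([[data[0]], [data[1]]], ZERO, seq_func, comb_func)
rows = [row for pair in two_partitions for row in expand(pair)]
print(rows)

# With Spark (not run here), the equivalent call would be:
#   rdd.aggregateByKey(ZERO, seq_func, comb_func).flatMap(expand).collect()
```

Collecting the records in the accumulator keeps every year available until the total is known, at the cost of holding all values for a key in memory at once.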

Related

scipy.optimize.minimize - Multivariate optimization

I am looking to minimize an objective function subject to certain constraints.
The function that I am looking to minimize is:
def distance_function(choice_matrix, distance_matrix, factory_distance):
    hub_to_demand_distance = distance_matrix.dot(choice_matrix)
    hub_factory_distance = pd.concat([hub_to_demand_distance, factory_distance], axis=1)
    min_dist_to_demand = pd.DataFrame(hub_factory_distance.min(axis=1))
    transposed_choice = choice_matrix.T
    factory_to_hub = transposed_choice.dot(factory_distance)
    total_distance = min_dist_to_demand.sum(axis=0) + factory_to_hub.sum(axis=0)
    return total_distance
These are the constraints that I have defined:
cons = (
    {'type':'ineq','fun': lambda f: 1-choice_matrix[0][0]-choice_matrix[1][0]},
    {'type':'ineq','fun': lambda f: 1-choice_matrix[0][1]-choice_matrix[1][1]},
    {'type':'ineq','fun': lambda f: 1-choice_matrix[0][2]-choice_matrix[1][2]},
    {'type':'ineq','fun': lambda f: 1-choice_matrix[0][3]-choice_matrix[1][3]},
    {'type':'eq','fun': lambda f: choice_matrix[0][0]+choice_matrix[0][1]+choice_matrix[0][2]+choice_matrix[0][3]-1},
    {'type':'eq','fun': lambda f: choice_matrix[1][0]+choice_matrix[1][1]+choice_matrix[1][2]+choice_matrix[1][3]-1}
)
I have tried using Scipy Optimize to minimize the function as shown:
optimize.minimize(distance_function, choice_matrix, args=(distance_matrix, factory_distance),method='SLSQP',jac=None,constraints=cons)
When I run this, I get the following error:
ValueError: Dot product shape mismatch, (4, 4) vs (8,)
Could you please tell me:
Why is this happening, and what needs to be done?
In the code shown, I have taken the choice matrix to have 4 rows and 2 columns, so I manually defined 6 constraints (the sum of the elements in each row must be less than or equal to 1, and the sum of the elements in each column must equal 1).
If my choice matrix has 40 rows and 5 columns, is there a better way to define the constraints than manually entering 45 lines?
Thank you in advance for your help!
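The (8,) in the error suggests that the 4 x 2 choice matrix reaches the objective as a flat vector of 8 values, which is how scipy.optimize.minimize passes its variable, so the objective would likely need to reshape it before calling dot. The constraint lambdas in the question also take f but read the global choice_matrix, so they do not depend on the value being optimized. Under those assumptions, a minimal sketch of generating the constraints in a loop could look like this. The helper name build_constraints and the row-major layout of the flat vector are assumptions, not part of the original code.

```python
def build_constraints(n_rows, n_cols):
    """Return SLSQP-style constraint dicts for a flattened n_rows x n_cols matrix."""
    def idx(r, c):
        # Position of element (r, c) in the flat vector, assuming row-major order.
        return r * n_cols + c

    cons = []
    for r in range(n_rows):
        # Row sum <= 1, written as 1 - sum >= 0. The default argument r=r
        # freezes the loop variable for this particular lambda.
        cons.append({
            'type': 'ineq',
            'fun': lambda x, r=r: 1 - sum(x[idx(r, c)] for c in range(n_cols)),
        })
    for c in range(n_cols):
        # Column sum == 1, written as sum - 1 == 0.
        cons.append({
            'type': 'eq',
            'fun': lambda x, c=c: sum(x[idx(r, c)] for r in range(n_rows)) - 1,
        })
    return cons

cons = build_constraints(4, 2)
# Flat 4 x 2 example: rows are (1, 0), (0, 1), (0, 0), (0, 0).
x = [1, 0, 0, 1, 0, 0, 0, 0]
values = [con['fun'](x) for con in cons]
print(len(cons), values)
```

The resulting tuple or list can be passed as constraints=cons. For the 40 x 5 case the same call, build_constraints(40, 5), yields the 45 constraints without writing them by hand.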

Problem in Implementing a Graphical Model Using Pyro

I am trying to implement this graphical model using Pyro:
My implementation is:
def model(data):
    p = pyro.sample('p', dist.Beta(1, 1))
    label_axis = pyro.plate("label_axis", data.shape[0], dim=-3)
    f_axis = pyro.plate("f_axis", data.shape[1], dim=-2)
    with label_axis:
        l = pyro.sample('l', dist.Bernoulli(p))
    with f_axis:
        e = pyro.sample('e', dist.Beta(1, 10))
    with label_axis, f_axis:
        f = pyro.sample('f', dist.Bernoulli(1-e), obs=data)
    f = l*f + (1-l)*(1-f)
    return f
However, this doesn't seem right to me. The problem is "f", since its distribution is not a plain Bernoulli. To sample f, I drew a sample from a Bernoulli distribution and then changed the sampled value if l=0. But I don't think this changes the value that Pyro stores behind the scenes for "f". That would be a problem during inference, right?
I wanted to use iterative plates instead of vectorized one, to be able to use control statements inside my plate. But apparently, this is not possible since I am reusing plates.
How can I correctly implement this PGM? Do I need to write a custom distribution? Or can I hack Pyro and change the stored value for "f" myself? Any type of help is appreciated! Cheers!
Here is the correct implementation:
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

def model(data):
    p = pyro.sample('p', dist.Beta(1, 1))
    label_axis = pyro.plate("label_axis", data.shape[0], dim=-2)
    f_axis = pyro.plate("f_axis", data.shape[1], dim=-1)
    with label_axis:
        l = pyro.sample('l', dist.Bernoulli(p))
    with f_axis:
        e = pyro.sample('e', dist.Beta(1, 10))
    with label_axis, f_axis:
        prob = l * (1 - e) + (1 - l) * e
        return pyro.sample('f', dist.Bernoulli(prob), obs=data)

mcmc = MCMC(NUTS(model), 500, 500)
data = dist.Bernoulli(0.5).sample((20, 4))
mcmc.run(data)
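The key change above is that the label is folded into the Bernoulli parameter, so the observed "f" is scored under the intended likelihood instead of being altered after sampling. The following plain-Python check (no Pyro involved) is a small sketch of why the two descriptions agree in distribution; the function names are hypothetical.

```python
def prob_f_is_one(l, e):
    """P(f = 1) for label l in {0, 1} and error rate e, using the mixed parameter."""
    return l * (1 - e) + (1 - l) * e

def prob_by_flipping(l, e):
    """Same probability via the question's recipe: draw b ~ Bernoulli(1 - e),
    keep b when l = 1 and flip it when l = 0."""
    p_b_one = 1 - e
    return p_b_one if l == 1 else 1 - p_b_one

# The two recipes give the same probability for every label and error rate tried.
for l in (0, 1):
    for e in (0.0, 0.1, 0.5, 1.0):
        assert abs(prob_f_is_one(l, e) - prob_by_flipping(l, e)) < 1e-12

print(prob_f_is_one(1, 0.1), prob_f_is_one(0, 0.1))
```

With l = 1 the site behaves like Bernoulli(1 - e), and with l = 0 like Bernoulli(e), which is the flipped case from the question.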

Spark Exponential Moving Average

I have a dataframe of timeseries pricing data, with an ID, Date and Price.
I need to compute the Exponential Moving Average for the Price Column, and add it as a new column to the dataframe.
I have been using Spark's window functions before, and it looked like a fit for this use case, but given the formula for the EMA:
EMA: {Price - EMA(previous day)} x multiplier + EMA(previous day)
where
multiplier = (2 / (Time periods + 1)) //let's assume Time period is 10 days for now
I got a bit confused as to how I can access the previously computed value in the column while windowing over that same column.
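To make the dependency concrete, the recursive formula can be sketched in plain Python (outside Spark). Seeding the first EMA with the first price is an assumption, since the formula above does not say how to start; the function name ema_recursive is hypothetical.

```python
def ema_recursive(prices, periods=10):
    """Recursive EMA: each value is built from the previous EMA, not from raw prices alone."""
    multiplier = 2.0 / (periods + 1)
    emas = []
    for price in prices:
        if not emas:
            # Assumption: seed the series with the first price.
            emas.append(price)
        else:
            prev = emas[-1]
            emas.append((price - prev) * multiplier + prev)
    return emas

result = ema_recursive([10.0, 11.0, 12.0], periods=3)
print(result)
```

Each step reads emas[-1], the value computed one step earlier, which is exactly what a window over the raw Price column cannot see.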
With a simple moving average, it's simple, since all you need to do is compute a new column while averaging the elements in the window:
val window = Window.partitionBy("ID").orderBy("Date").rowsBetween(-windowSize, Window.currentRow)
dataFrame.withColumn("SMA", avg(col("Price")).over(window))
But it seems that with EMA it's a bit more complicated, since at every step I need the previously computed value.
I have also looked at Weighted moving average in Pyspark but I need an approach for Spark/Scala, and for a 10 or 30 days EMA.
Any ideas?
In the end, I analysed how the exponential moving average is implemented for pandas dataframes. Besides the recursive formula I described above, which is difficult to implement in any SQL or window function (because it's recursive), there is another one, which is detailed on their issue tracker:
y[t] = (x[t] + (1-a)*x[t-1] + (1-a)^2*x[t-2] + ... + (1-a)^n*x[t-n]) /
((1-a)^0 + (1-a)^1 + (1-a)^2 + ... + (1-a)^n).
Given this, and with additional Spark implementation help from here, I ended up with the following implementation, which is roughly equivalent to pandas_dataframe.ewm(span=window_size).mean().
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import scala.math.pow

// Note: dataFrame is not a parameter; it is expected to be in scope where this method is defined.
def exponentialMovingAverage(partitionColumn: String, orderColumn: String, column: String, windowSize: Int): DataFrame = {
  val window = Window.partitionBy(partitionColumn)
  val exponentialMovingAveragePrefix = "_EMA_"
  val emaUDF = udf((rowNumber: Int, columnPartitionValues: Seq[Double]) => {
    val alpha = 2.0 / (windowSize + 1)
    // Weight (1 - alpha)^(rowNumber - index): the current row gets weight 1, older rows decay.
    val adjustedWeights = (0 until rowNumber + 1).foldLeft(new Array[Double](rowNumber + 1)) { (accumulator, index) =>
      accumulator(index) = pow(1 - alpha, rowNumber - index); accumulator
    }
    (adjustedWeights, columnPartitionValues.slice(0, rowNumber + 1)).zipped.map(_ * _).sum / adjustedWeights.sum
  })
  dataFrame.withColumn("row_nr", row_number().over(window.orderBy(orderColumn)) - lit(1))
    .withColumn(s"$column$exponentialMovingAveragePrefix$windowSize", emaUDF(col("row_nr"), collect_list(column).over(window)))
    .drop("row_nr")
}
(I am presuming the type of the column for which I need to compute the exponential moving average is Double.)
I hope this helps others.
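For readers who want to check the weighted formula outside Spark, here is a minimal plain-Python sketch of the same computation. It is an illustration of the formula quoted above, not a port of the Scala code, and the function name ema_weighted is hypothetical.

```python
def ema_weighted(values, span):
    """Weighted-average form of the EMA: y[t] = sum(w_i * x_i) / sum(w_i),
    with w_i = (1 - alpha)^(t - i) for i = 0..t."""
    alpha = 2.0 / (span + 1)
    out = []
    for t in range(len(values)):
        # The current value gets weight 1; each step back multiplies by (1 - alpha).
        weights = [(1 - alpha) ** (t - i) for i in range(t + 1)]
        weighted_sum = sum(w * x for w, x in zip(weights, values[:t + 1]))
        out.append(weighted_sum / sum(weights))
    return out

result = ema_weighted([1.0, 2.0, 3.0], span=3)
print(result)
```

Because each row only needs the raw values up to that row, this form needs no access to previously computed results. It is not identical to the recursive formula on early rows, where the normalising denominator still differs noticeably from its limit.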

Consolidating a data table in Scala

I am working on a small data analysis tool, and practicing/learning Scala in the process. However I got stuck at a small problem.
Assume data of type:
X Gr1 x_11 ... x_1n
X Gr2 x_21 ... x_2n
..
X GrK x_k1 ... x_kn
Y Gr1 y_11 ... y_1n
Y Gr3 y_31 ... y_3n
..
Y Gr(K-1) ...
Here I have entries (X, Y, ...) that may or may not exist in up to K groups, with a series of values for each group. What I want to do is pretty simple (in theory): I would like to consolidate the rows that belong to the same "entity" in different groups. So instead of multiple lines that start with X, I want to have one row with all values from x_11 to x_kn in columns.
What makes things complicated, however, is that not all entities exist in all groups. So wherever there's "missing data" I would like to pad with, for instance, zeroes or some string that denotes a missing value. So if I have (X, Y, Z) in up to 3 groups, the type of table I want to have is as follows:
X  x_11  x_12  x_21  x_22  x_31  x_32
Y  y_11  y_12  N/A   N/A   y_31  y_32
Z  N/A   N/A   z_21  z_22  N/A   N/A
I have been stuck trying to figure this out. Is there a smart way to use List functions to solve this?
I wrote this simple loop:
for {
  (id, hitlist) <- hits.groupBy(_.acc)
  h <- hitlist
} println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
to be able to generate tables that look like the example above. Note that my original data has a different format and layout, but that has little to do with the problem at hand, so I have skipped all parsing steps. I should be able to use groupBy in a better way that actually solves this for me, but I can't seem to get there.
Then I modified my loop mapping the hits to ratios and appending them to one another:
for ((id, hitlist) <- hits.groupBy(_.acc)) {
  val l = hitlist.map(_.ratios).foldRight(List[Double]()) {
    (l1: List[Double], l2: List[Double]) => l1 ::: l2
  }
  println(id + "\t" + l.mkString("\t"))
  //println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
}
That gets me one step closer but still no cigar! Instead of a fully padded "matrix" I get a jagged table. Taking the example above:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 y_31 y_32
Z z_21 z_22
Any ideas as to how I can pad the table so that values from the respective groups are aligned with one another? I should be able to use _.sampleId, which holds the "group membership" for each "hit", but I am not sure exactly how. 'hits' is a List of type Hit, which is practically a wrapper for each row that gives convenience methods for getting individual values, so essentially a tuple with "named indices" (such as .acc, .sampleId, ...).
(I would like to solve this problem without hardcoding the number of groups, as it might change from case to case)
Thanks!
This is a bit of a contrived example, but I think you can see where this is going:
case class Hit(acc:String, subAcc:String, value:Int)
val hits = List(Hit("X", "x_11", 1), Hit("X", "x_21", 2), Hit("X", "x_31", 3))
val kMax = 4
val nMax = 2
for {
  (id, hitlist) <- hits.groupBy(_.acc)
  k <- 1 to kMax
  n <- 1 to nMax
} yield {
  val subId = "x_%s%s".format(k, n)
  val row = hitlist.find(h => h.subAcc == subId).getOrElse(Hit(id, subId, 0))
  println(row)
}
//Prints
Hit(X,x_11,1)
Hit(X,x_12,0)
Hit(X,x_21,2)
Hit(X,x_22,0)
Hit(X,x_31,3)
Hit(X,x_32,0)
Hit(X,x_41,0)
Hit(X,x_42,0)
If you provide more information on your hits lists, then we could probably come up with something a little more accurate.
I have managed to solve this problem with the following code. I am putting it here as an answer in case someone else runs into a similar problem and needs some help. The use of find() from Noah's answer was definitely very useful, so do give him a +1 if this code snippet helps you out.
val samples = hits.groupBy(_.sampleId).keys.toList.sorted
for ((id, hitlist) <- hits.groupBy(_.acc)) {
  val ratios =
    for (sample <- samples)
      yield hitlist.find(h => h.sampleId == sample).map(_.ratios)
        .getOrElse(List(Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN))
  println(id + "\t" + ratios.flatten.mkString("\t"))
}
I figure it's not a very elegant or efficient solution, as I have two calls to groupBy, and I would be interested to see better solutions to this problem.
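As one possible direction, the same padding idea can be sketched in Python with a single dictionary lookup per cell instead of repeated find() calls. This is only an illustration under assumptions: the Hit tuple below is a hypothetical stand-in for the wrapper described in the question, every group is assumed to carry the same number of values, and "N/A" is used as the pad marker.

```python
from collections import namedtuple

# Hypothetical stand-in for the Hit wrapper described in the question.
Hit = namedtuple("Hit", ["acc", "sample_id", "ratios"])

def consolidate(hits, pad="N/A"):
    """Return {acc: padded row}, with one block of columns per sample in sorted order."""
    samples = sorted({h.sample_id for h in hits})
    # Assumes every group carries the same number of values.
    width = max(len(h.ratios) for h in hits)
    # Index the hits once so each cell is a dictionary lookup.
    by_key = {(h.acc, h.sample_id): h.ratios for h in hits}
    accs = sorted({h.acc for h in hits})
    return {
        acc: [v for s in samples for v in by_key.get((acc, s), [pad] * width)]
        for acc in accs
    }

hits = [
    Hit("X", "Gr1", [1.0, 1.1]), Hit("X", "Gr2", [2.0, 2.1]), Hit("X", "Gr3", [3.0, 3.1]),
    Hit("Y", "Gr1", [4.0, 4.1]), Hit("Y", "Gr3", [5.0, 5.1]),
    Hit("Z", "Gr2", [6.0, 6.1]),
]
table = consolidate(hits)
for acc, row in table.items():
    print(acc + "\t" + "\t".join(str(v) for v in row))
```

Deriving the pad width from the data avoids hardcoding the number of values per group, and the sample list is computed from the hits, so the number of groups is not hardcoded either.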

Calculate sums of even/odd pairs on Hadoop?

I want to create a parallel scanLeft function (one that computes prefix sums for an associative operator) for Hadoop (Scalding in particular; see below for how this is done).
Given a sequence of numbers in a hdfs file (one per line) I want to calculate a new sequence with the sums of consecutive even/odd pairs. For example:
input sequence:
0,1,2,3,4,5,6,7,8,9,10
output sequence:
0+1, 2+3, 4+5, 6+7, 8+9, 10
i.e.
1,5,9,13,17,10
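The pairing step in this example can be sketched in plain Python (not Hadoop) as follows; the function name pair_sums is hypothetical, and a trailing unpaired element is passed through unchanged, as in the example.

```python
def pair_sums(xs):
    """Sum consecutive (even index, odd index) pairs; a trailing single element is kept as is."""
    xs = list(xs)
    return [sum(xs[i:i + 2]) for i in range(0, len(xs), 2)]

result = pair_sums(range(11))
print(result)
```

Each output element depends only on two adjacent inputs, which is what makes the step easy to parallelise as long as a pair is never split across workers.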
I think in order to do this, I need to write an InputFormat and InputSplits classes for Hadoop, but I don't know how to do this.
See this section 3.3 here. Below is an example algorithm in Scala:
// for simplicity assume input length is a power of 2
def scanadd(input: IndexedSeq[Int]): IndexedSeq[Int] =
  if (input.length == 1)
    input
  else {
    // calculate a new collapsed sequence which is the sum of sequential even/odd pairs
    val collapsed = IndexedSeq.tabulate(input.length / 2)(i => input(2 * i) + input(2 * i + 1))
    // recursively scan collapsed values
    val scancollapse = scanadd(collapsed)
    // now we can use the scan of the collapsed seq to calculate the full sequence
    IndexedSeq.tabulate(input.length) { i =>
      // an odd 0-based index closes a pair, so its prefix sum is already in the collapsed scan;
      // an even index takes the collapsed scan just before it and adds the value at the current index
      if (i % 2 == 1) scancollapse(i / 2)
      else if (i == 0) input(0)
      else scancollapse(i / 2 - 1) + input(i)
    }
  }
I understand that this might need a fair bit of optimization for it to work nicely with Hadoop. Translating this directly would, I think, lead to pretty inefficient Hadoop code. For example, you obviously can't use an IndexedSeq in Hadoop. I would appreciate hearing about any specific problems you see. I think it can probably be made to work well, though.
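For checking the recursion independently of Hadoop or Scala, here is a minimal Python sketch of the same collapse-and-expand scan. It computes inclusive prefix sums and, unlike the version above, also accepts lengths that are not a power of two; the function name scan_add is hypothetical.

```python
from itertools import accumulate

def scan_add(xs):
    """Inclusive prefix sums via pairwise collapsing, then expanding back."""
    xs = list(xs)
    if len(xs) <= 1:
        return xs
    # Collapse: sum consecutive pairs; a trailing single element is kept as is.
    collapsed = [sum(xs[i:i + 2]) for i in range(0, len(xs), 2)]
    scanned = scan_add(collapsed)
    out = []
    for i, x in enumerate(xs):
        if i % 2 == 1:
            # An odd index closes a pair, so the collapsed scan already covers it.
            out.append(scanned[i // 2])
        elif i == 0:
            out.append(x)
        else:
            # An even index: everything before it is covered by the previous pair's scan.
            out.append(scanned[i // 2 - 1] + x)
    return out

result = scan_add(range(11))
print(result)
```

The collapse and expand steps at each level touch every element independently, so each level could in principle be a parallel pass; only the recursion depth, roughly log2 of the input length, is sequential.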
Superfluous. You meant this code?
val vv = (0 to 1000000).grouped(2).toVector
vv.par.foldLeft((0L, 0L, false))((a, v) =>
  if (a._3) (a._1, a._2 + v.sum, !a._3) else (a._1 + v.sum, a._2, !a._3))
This was the best tutorial I found for writing an InputFormat and RecordReader. I ended up reading the whole split as one ArrayWritable record.