Can I Pair Two Sequences Together by a Matching Key?

Let's say sequence one is going out to the web to retrieve the contents of sites 1, 2, 3, 4, 5 (but will return in unpredictable order).
Sequence two is going to a database to retrieve context about these same records 1, 2, 3, 4, 5 (but for the purposes of this example will return in unpredictable order).
Is there an Rx extension method that will combine these into one sequence, emitting each matching pair as soon as it is ready in both sequences? I.e., if the first sequence returns in the order 4, 2, 3, 5, 1 and the second sequence returns in the order 1, 4, 3, 2, 5, the merged sequence would be (4,4), (3,3), (2,2), (1,1), (5,5) - each pair as soon as it is ready. I've looked at Merge and Zip, but they don't seem to be exactly what I'm looking for.
I wouldn't want to discard pairs that don't match, which I think rules out a simple .Where.Select combination.

var paired = Observable
    .Merge(aSource, bSource)
    .GroupBy(i => i)
    .SelectMany(g => g.Buffer(2).Take(1));
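Merge interleaves both sources into a single stream, GroupBy collects matching values into per-key groups, and Buffer(2).Take(1) makes each group emit its contents once both halves of a pair have arrived.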
The test below gives the correct results (note that the subjects must be declared before paired is defined). It's just taking ints at the moment; if you're using data with keys and values, you'll need to group by i.Key instead of i.
var aSource = new Subject<int>();
var bSource = new Subject<int>();
paired.Subscribe(g => Console.WriteLine("{0}:{1}", g.ElementAt(0), g.ElementAt(1)));
aSource.OnNext(4);
bSource.OnNext(1);
aSource.OnNext(2);
bSource.OnNext(4);
aSource.OnNext(3);
bSource.OnNext(3);
aSource.OnNext(5);
bSource.OnNext(2);
aSource.OnNext(1);
bSource.OnNext(5);
yields:
4:4
3:3
2:2
1:1
5:5
Edit in response to Brandon:
For the situation where the items are different classes (AClass and BClass), the following adjustment can be made.
using Pair = Tuple<AClass, BClass>;
var paired = Observable
    .Merge(
        aSource.Select(a => new Pair(a, null)),
        bSource.Select(b => new Pair(null, b)))
    .GroupBy(p => p.Item1 != null ? p.Item1.Key : p.Item2.Key)
    .SelectMany(g => g.Buffer(2).Take(1))
    .Select(g => new Pair(
        g.ElementAt(0).Item1 ?? g.ElementAt(1).Item1,
        g.ElementAt(0).Item2 ?? g.ElementAt(1).Item2));

So you have 2 observable sequences that you want to pair together?
Pair from Rxx, along with GroupBy, can help here. I think code similar to the following might do what you want:
var pairs = stream1.Pair(stream2)
    .GroupBy(pair => pair.Switch(source1 => source1.Key, source2 => source2.Key))
    .SelectMany(group => group.Take(2).ToArray()) // each group will have at most 2 results (1 left and 1 right)
    .Select(pair =>
    {
        T1 result1 = default(T1);
        T2 result2 = default(T2);
        foreach (var r in pair)
        {
            if (r.IsLeft) result1 = r.Left;
            else result2 = r.Right;
        }
        return new { result1, result2 };
    });
I've not tested it, and haven't added anything for error handling, but I think this is what you want.

Related

Scala - create a new list and update particular element from existing list

I am new to Scala and to OOP too. How can I update a particular element in a list while creating a new list?
val numbers= List(1,2,3,4,5)
val result = numbers.map(_*2)
I need to update the third element only -> multiply it by 2. How can I do that using map?
You can use zipWithIndex to map the list into a list of tuples, where each element is accompanied by its index. Then, using map with pattern matching - you single out the third element (index = 2):
val numbers = List(1,2,3,4,5)
val result = numbers.zipWithIndex.map {
  case (v, i) if i == 2 => v * 2
  case (v, _) => v
}
// result: List[Int] = List(1, 2, 6, 4, 5)
Alternatively, you can use patch, which replaces a sub-sequence with a provided one:
numbers.patch(from = 2, patch = Seq(numbers(2) * 2), replaced = 1) // result: List(1, 2, 6, 4, 5)
I think the clearest way of achieving this is by using updated(index: Int, elem: Int). For your example, it could be applied as follows:
val result = numbers.updated(2, numbers(2) * 2)
list.zipWithIndex creates a list of pairs with original element on the left, and index in the list on the right (indices are 0-based, so "third element" is at index 2).
val result = numbers.zipWithIndex.map {
  case (n, 2) => n * 2
  case (n, _) => n
}
This creates an intermediate list holding the pairs, and then maps over it to do your transformation. A slightly more efficient approach is to use an iterator. Iterators are 'lazy', so rather than creating an intermediate container, it will generate the pairs one by one and send them straight to .map:
val result = numbers.iterator.zipWithIndex.map {
  case (n, 2) => n * 2
  case (n, _) => n
}.toList
// result: List(1, 2, 6, 4, 5)
First and foremost, Scala is functional (FP), not just OOP. You can update any element of a list with the updated method; see the following example for details:
Signature: updated(index, value)
val numbers = List(1,2,3,4,5)
print(numbers.updated(2,10))
Here the first argument is the index and the second is the new value. Since lists are immutable, this prints a new list:
List(1, 2, 10, 4, 5)

Spark find previous value on each iteration of RDD

I have the following code:
val rdd = sc.cassandraTable("db", "table").select("id", "date", "gpsdt").where("id=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2) , entry(3))
val rddcopy = rdd.sortBy(row => row.get[String]("gpsdt"), false).zipWithIndex()
rddcopy.foreach { records =>
  val previousRow = ??? // pseudocode: the (records - 1)th row
  val currentRow = records
  // Some calculation based on both rows
}
So, the idea is to get the previous/next row on each iteration of the RDD. I want to calculate some field on the current row based on the value present on the previous row. Thanks,
EDIT II: I misunderstood the question: the approach below (OLD STUFF) gives tumbling-window semantics, but a sliding window is needed. Considering this is a sorted RDD,
import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRDD.sliding(2)
should do the trick. Note, however, that this uses a DeveloperApi.
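A minimal sketch of consuming those windows (assuming sortedRDD holds the sorted rows; all names here are illustrative):
import org.apache.spark.mllib.rdd.RDDFunctions._

// each window is an Array of two consecutive elements:
// window(0) is the previous row, window(1) the current one
val withPrevious = sortedRDD.sliding(2).map { window =>
  val previousRow = window(0)
  val currentRow = window(1)
  // some calculation based on both rows goes here
  (previousRow, currentRow)
}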
Alternatively, you can:
val l = sortedRDD.zipWithIndex.map(kv => (kv._2, kv._1))
val r = sortedRDD.zipWithIndex.map(kv => (kv._2 - 1, kv._1))
val sliding = l.join(r)
RDD joins should be inner joins (IIRC), thus dropping the edge cases where the tuples would be partially null.
OLD STUFF:
How do you identify the previous row? RDDs do not have any sort of stable ordering by themselves. If you have an incrementing dense key, you could add a new column calculated the following way: if (k % 2 == 0) k / 2 else (k - 1) / 2. This gives you a key that has the same value for two successive keys. Then you could just group by it.
But to reiterate: in most cases there is no really sensible notion of 'previous' for RDDs (it depends on partitioning, the data source, etc.).
EDIT: So now that you have a zipWithIndex and an ordering in your set, you can do what I mentioned above. You now have an RDD[(Long, YourData)] and can do:
rdd.map(kv => if (kv._1 % 2 == 0) (kv._1 / 2, kv._2) else ((kv._1 - 1) / 2, kv._2)).groupByKey.foreach(/* your stuff here */)
If you reduce at any point, consider using reduceByKey rather than groupByKey().reduce.
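A rough sketch of that substitution, under the same assumptions as above (rdd is the RDD[(Long, YourData)] after zipWithIndex; the combining function (a, b) => b is only a placeholder):
rdd
  .map { case (k, v) => (if (k % 2 == 0) k / 2 else (k - 1) / 2, v) } // same pair key as above
  .reduceByKey((a, b) => b) // substitute your real per-pair calculation here
reduceByKey combines values map-side before the shuffle, whereas groupByKey ships every value across the network first.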

How do I read the value of a continuous index from spark RDD

I have a problem with Spark Scala: I want to get the first value from each consecutive series of keys. I created a new RDD like this:
[(a,1),(a,2),(a,3),(a,4),(b,1),(b,2),(a,3),(a,4),(a,5),(b,8),(b,9)]
I want to fetch the result like this:
[(a,1),(b,1),(a,3),(b,8)]
How can I do this with Scala from an RDD?
As mentioned in comments, in order to be able to use the order of the elements in an RDD, you'd have to somehow represent this order in the data itself. For that purpose exactly, zipWithIndex was created - the index is added to the data; Then, with some manipulation (join on an RDD with modified indices) we can get what you need:
// add index to RDD:
val withIndex = rdd.zipWithIndex().map(_.swap)
// create another RDD with indices increased by one, to later join each element with the previous one
val previous = withIndex.map { case (index, v) => (index + 1, v) }
// join RDDs, filter out those where previous "key" is identical
val result = withIndex.leftOuterJoin(previous).collect {
  case (i, (left, None)) => (i, left) // keep the first element in the RDD
  case (i, (left, Some((key, _)))) if left._1 != key => (i, left) // keep only elements where the previous key is different
}.sortByKey().values // if you want to preserve the original order...
result.collect().foreach(println)
// (a,1)
// (b,1)
// (a,3)
// (b,8)

Breakdown of reduce function

I currently have:
x.collect()
# Result: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val a = x.reduce((x,y) => x+1)
# a: Int = 6
val b = x.reduce((x,y) => y + 1)
# b: Int = 12
I have tried to follow what has been said here (http://www.python-course.eu/lambda.php) but still don't quite understand what the individual operations are that lead to these answers.
Could anyone please explain the steps being taken here?
Thank you.
The reason is that the function (x, y) => x + 1 is not associative. reduce requires an associative function. This is necessary to avoid indeterminacy when combining results across different partitions.
You can think of the reduce() method as grabbing two elements from the collection, applying them to a function that results in a new element, and putting that new element back in the collection, replacing the two it grabbed. This is done repeatedly until there is only one element left. In other words, that previous result is going to get re-grabbed until there are no more previous results.
So you can see where (x,y) => x+1 results in a new value (x+1) which would be different from (x,y) => y+1. Cascade that difference through all the iterations and ...
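As a concrete illustration (the 5-partition split and left-to-right combining below are assumptions; Spark's actual partitioning depends on the cluster), plain Scala reproduces both answers:
// hypothetical reconstruction: x split into 5 partitions of 2 elements each
val partitions = List(List(1, 2), List(3, 4), List(5, 6), List(7, 8), List(9, 10))

// (x, y) => x + 1 reduces each partition to its first element + 1
val partialsA = partitions.map(_.reduce((x, y) => x + 1)) // List(2, 4, 6, 8, 10)
val a = partialsA.reduce((x, y) => x + 1) // 2, incremented once per remaining partial: 6

// (x, y) => y + 1 reduces each partition to its last element + 1
val partialsB = partitions.map(_.reduce((x, y) => y + 1)) // List(3, 5, 7, 9, 11)
val b = partialsB.reduce((x, y) => y + 1) // last partial + 1: 12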

Is it possible to update a variable in foldLeft function based on a condition?

I am trying to write Scala code that gives the maximum sum over a contiguous sub-array of a given array. For example, val arr = Array(-2, -3, 4, -1, -2, 1, 5, -3). In this array I need to get the maximum contiguous sub-array sum, i.e. 4 + (-1) + (-2) + 1 + 5 = 7. I wrote the following code to get this result.
scala> arr.foldLeft(0) { (currsum,newnum) => if((currsum+newnum)<0) 0 else { if(currsum<(currsum+newnum)) (currsum+newnum) else currsum }}
res5: Int = 10
but this deviates from the actual result, as I am unable to update the maximum_so_far value as the counting/summation goes on. Since I have used foldLeft for this, is there any possibility to update the maximum_so_far variable only when the sum of contiguous sub-array elements is greater than the previous max_sum?
reference link for better understanding of scenario
Well, obviously you have to propagate two values along your input data for this computation, just as you would need to do in the imperative case:
arr.foldLeft((0, 0)) {
  case ((maxSum, curSum), value) => {
    val newSum = Math.max(0, curSum + value)
    (Math.max(maxSum, newSum), newSum)
  }
}._1
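For the example array this returns the expected 7; tracing the (maxSum, curSum) accumulator step by step:
val arr = Array(-2, -3, 4, -1, -2, 1, 5, -3)
// (maxSum, curSum): (0,0) -> (0,0) -> (0,0) -> (4,4) -> (4,3) -> (4,1) -> (4,2) -> (7,7) -> (7,4)
// ._1 of the final pair is 7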
Another way would be to compute the intermediate results (lazily if you want) and then select the maximum:
arr.toIterator.scanLeft(0) {
  case (curSum, value) =>
    Math.max(0, curSum + value)
}.max
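On the example array the scan produces the running sums 0, 0, 0, 4, 3, 1, 2, 7, 4, so max again yields 7.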