Intersection of two Map RDDs in Scala

I have two RDDs, for example:
firstmapRDD - (0-14,List(0, 4, 19, 19079, 42697, 444, 42748))
secondmapRdd-(0-14,List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94))
I want to find the intersection of the values.
I tried var interResult = firstmapRDD.intersection(secondmapRdd), but it writes no results to the output file.
I also tried cogrouping on the keys with mapRDD.cogroup(secondMapRDD).filter(x=>), but I don't know how to compute the intersection of the two sets of values. Is it x=>x._1.intersect(x._2)? Can someone help me with the syntax?
Even this throws a compile-time error: mapRDD.cogroup(secondMapRDD).filter(x=>x._1.intersect(x._2))
var mapRDD = sc.parallelize(map.toList)
var secondMapRDD = sc.parallelize(secondMap.toList)
var interResult = mapRDD.intersection(secondMapRDD)
The intersection may be failing because the values are of type ArrayBuffer[List[]]. Is there a way to work around that?
I also tried this:
var interResult = mapRDD.cogroup(secondMapRDD).filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }.map { case (k, (l, r)) => (k, l.toList.intersect(r.toList)) }
I still get an empty list!

Since you want to intersect the values, you need to join the two RDDs on their keys to pair up the matching value lists, and then intersect each pair of lists.
Sample code:
val firstMap = Map(1 -> List(1,2,3,4,5))
val secondMap = Map(1 -> List(1,2,5))
val firstKeyRDD = sparkContext.parallelize(firstMap.toList, 2)
val secondKeyRDD = sparkContext.parallelize(secondMap.toList, 2)
val joinedRDD = firstKeyRDD.join(secondKeyRDD)
val finalResult = joinedRDD.map(tuple => {
  val matchedLists = tuple._2
  val intersectValues = matchedLists._1.intersect(matchedLists._2)
  (tuple._1, intersectValues)
})
finalResult.foreach(println)
The output will be
(1,List(1, 2, 5))
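The join-then-intersect idea can be checked without a cluster. Below is a minimal sketch that uses plain Scala collections instead of Spark; the object name IntersectByKey and the helper intersectByKey are made up for illustration, and the data is the example from the question. It also suggests why the earlier attempts came back empty: RDD.intersection compares whole (key, list) elements, so it only returns pairs that are identical in both RDDs, and after cogroup each side is a collection of lists, so intersect compares entire lists rather than the numbers inside them.

```scala
object IntersectByKey {
  // Hypothetical helper (not part of Spark): for every key present in both
  // maps, keep only the values that occur in both lists.
  def intersectByKey[K, V](left: Map[K, List[V]], right: Map[K, List[V]]): Map[K, List[V]] =
    left.collect {
      case (key, leftValues) if right.contains(key) =>
        key -> leftValues.intersect(right(key))
    }

  def main(args: Array[String]): Unit = {
    // Data taken from the question: one key, two value lists.
    val first = Map("0-14" -> List(0, 4, 19, 19079, 42697, 444, 42748))
    val second = Map("0-14" -> (1 to 94).toList)
    intersectByKey(first, second).foreach(println) // prints (0-14,List(4, 19))
  }
}
```

Keys that appear in only one map are dropped, which mirrors the behaviour of an inner join on the RDDs.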

CAST jsonb column into INT[]

I'm in a situation where I get a jsonb value (from the scrape field which is jsonb) that looks like this:
SELECT COALESCE(scrape->'amenity_ids', '[]'::jsonb) AS ids
FROM my_table
ids |
-------------------------------------------------------------------------------------------------------------+
[] |
[33, 34, 35, 4, 5, 37, 8, 40, 9, 41, 42, 11, 44, 45, 46, 47, 16, 21, 56] |
[129, 35, 4, 36, 37, 103, 40, 41, 45, 77, 17, 23, 30] |
[1, 33, 34, 35, 4, 36, 8, 40, 41, 44, 45, 77, 46, 47, 85, 56, 90, 91, 92, 93, 30, 95] |
[1, 129, 2, 4, 8, 9, 77, 85, 89, 90, 91, 92, 93, 30, 94, 95, 96, 33, 34, 100, 37, 38, 40, 41, 44, 45, 46, 57]|
Note that there are NULL values in the jsonb object. At this point ids is of type jsonb, but what I need is an array of integers, because I'm trying to run a query like:
SELECT int_array_ids @> '{33,34,35}' FROM my_table;
Once I can convert ids to INT[], I can create indexes to speed up my array-contains queries.
I tried a subquery using array_agg, but it's terribly slow:
SELECT array_agg(arrayed.am_id) FROM (
    SELECT
        id,
        jsonb_array_elements_text(scrape->'amenity_ids') AS am_id
    FROM my_table
) AS arrayed
GROUP BY arrayed.id

I am getting an error "value % is not a member of scala.collection.immutable.Range.Inclusive" while filtering

I am new to Scala. I am trying to find the even numbers from 1 to 100, but when I filter the list I get an error about scala.collection.immutable.Range.Inclusive:
scala> var a = List(1 to 100)
a: List[scala.collection.immutable.Range.Inclusive] = List(Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100))
scala> a.filter(x => (x % 2 == 0))
<console>:26: error: value % is not a member of scala.collection.immutable.Range.Inclusive
a.filter(x => (x % 2 == 0))
^
scala> val b = a.filter(x => x % 2 == 0)
<console>:25: error: value % is not a member of scala.collection.immutable.Range.Inclusive
val b = a.filter(x => x % 2 == 0)
^
You're creating a list of Range, not a list with the ints in that range. For that, change it to:
val a = (1 to 100).toList
But @Tim's right, you can filter directly on the Range.
You don't need to wrap the Range in a List, just do this:
val a = 1 to 100
a.filter(x => x % 2 == 0)
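To make the difference between the two shapes concrete, here is a small sketch; the object name EvenNumbers and the value names are chosen only for illustration. It shows that List(1 to 100) holds a single Range element, and a few ways to get from there to the even numbers.

```scala
object EvenNumbers {
  // A list with ONE element, which happens to be a Range.
  val wrapped: List[scala.collection.immutable.Range.Inclusive] = List(1 to 100)

  // A list with one hundred Int elements.
  val flat: List[Int] = (1 to 100).toList

  // Filtering works once the elements are Ints.
  val evens: List[Int] = flat.filter(_ % 2 == 0)

  // If the wrapped list already exists, flatten unwraps the Range first.
  val evensFromWrapped: List[Int] = wrapped.flatten.filter(_ % 2 == 0)

  // A stepped Range avoids the filter altogether.
  val stepped: List[Int] = (2 to 100 by 2).toList

  def main(args: Array[String]): Unit = {
    println(wrapped.size) // 1
    println(flat.size)    // 100
    println(evens.take(5))
  }
}
```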

infer type parameter from function argument

In the context of another stackoverflow question, I have this snippet:
import scala.annotation.tailrec

def orderedGroupBy[T, P](seq: Traversable[T], f: T => P): Traversable[Tuple2[P, Traversable[T]]] = {
  @tailrec
  def accumulator(seq: Traversable[T], f: T => P, res: List[Tuple2[P, Traversable[T]]]): Traversable[Tuple2[P, Traversable[T]]] = seq.headOption match {
    case None => res.reverse
    case Some(h) => {
      val key = f(h)
      val subseq = seq.takeWhile(f(_) == key)
      accumulator(seq.drop(subseq.size), f, (key -> subseq) :: res)
    }
  }
  accumulator(seq, f, Nil)
}
I'd like to use it just like one can use .groupBy, e.g.:
orderedGroupBy(1 to 100, (_ / 10))
But the compiler reports an error because it does not have enough type information:
<console>:10: error: missing parameter type for expanded function ((x$1) => x$1.$div(10))
orderedGroupBy(1 to 100, (_ / 10))
What is the idiomatic way to do this?
You can curry the parameters, so that T is inferred solely from seq: Traversable[T].
def orderedGroupBy[T, P](seq: Traversable[T])(f: T => P): Traversable[Tuple2[P, Traversable[T]]] = ???
scala> orderedGroupBy(1 to 100)(_ / 10)
res110: Traversable[(Int, Traversable[Int])] = List((0,Range(1, 2, 3, 4, 5, 6, 7, 8, 9)), (1,Range(10, 11, 12, 13, 14, 15, 16, 17, 18, 19)), (2,Range(20, 21, 22, 23, 24, 25, 26, 27, 28, 29)), (3,Range(30, 31, 32, 33, 34, 35, 36, 37, 38, 39)), (4,Range(40, 41, 42, 43, 44, 45, 46, 47, 48, 49)), (5,Range(50, 51, 52, 53, 54, 55, 56, 57, 58, 59)), (6,Range(60, 61, 62, 63, 64, 65, 66, 67, 68, 69)), (7,Range(70, 71, 72, 73, 74, 75, 76, 77, 78, 79)), (8,Range(80, 81, 82, 83, 84, 85, 86, 87, 88, 89)), (9,Range(90, 91, 92, 93, 94, 95, 96, 97, 98, 99)), (10,Range(100)))
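For reference, here is a complete curried version as a minimal sketch rather than the original author's code. It assumes Seq in place of Traversable (which is deprecated since Scala 2.13) and uses span instead of takeWhile plus drop; the object name OrderedGroupBy and the inner name loop are illustrative. In Scala 2, type parameters are solved one parameter list at a time, so by the time the compiler reaches the second list it already knows T and can type the lambda.

```scala
import scala.annotation.tailrec

object OrderedGroupBy {
  // Groups consecutive elements that map to the same key, preserving order.
  // The function lives in its own parameter list so T is fixed by `seq` first.
  def orderedGroupBy[T, P](seq: Seq[T])(f: T => P): List[(P, Seq[T])] = {
    @tailrec
    def loop(rest: Seq[T], acc: List[(P, Seq[T])]): List[(P, Seq[T])] =
      rest.headOption match {
        case None => acc.reverse
        case Some(h) =>
          val key = f(h)
          // span splits off the leading run of elements sharing this key
          val (group, tail) = rest.span(f(_) == key)
          loop(tail, (key -> group) :: acc)
      }
    loop(seq, Nil)
  }

  def main(args: Array[String]): Unit = {
    val groups = orderedGroupBy(1 to 100)(_ / 10)
    println(groups.map(_._1)) // the keys 0 through 10, in order
  }
}
```

If the single parameter list has to stay, annotating the lambda's parameter explicitly, as in orderedGroupBy(1 to 100, (x: Int) => x / 10), also gives the compiler the type it is missing.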

Is it bad practice to populate this array using a for loop?

Please forgive me for asking what is probably a real beginner's question. My searches on Google and Stack Overflow didn't turn up anything conclusive.
My array needs to contain the numbers 0 through 59. Here is a simple loop to populate the array:
var timeArray = [0]
var count = 1
while count < 60 {
    timeArray.append(count)
    count += 1
}
On the other hand, I could do this:
var timeArray = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
I guess the second is faster and maybe more readable. The first is maybe more concise.
What is the general best practice in this case? Is there another, better alternative?
Thanks.
Yes, you are right: the loop will be slower than the array literal.
I would use the second option, but with slightly different syntax to save typing:
var timeArray = Array(0..<60)

Histogram from two vectors in Matlab

Thanks in advance for the help.
I have two sets of parallel vectors:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 55];
x_count = [7721, 6475, 3890, 2138, 1152, 784, 674, 492, 424, 365, 309, 302, 232, 250, 220, 208, 190, 162, 144, 134, 97, 93, 89, 97, 92, 85, 77, 87, 64, 75, 72, 82, 61, 48, 46, 44, 35, 20, 28, 20, 21, 10, 6, 8, 4, 4, 4, 3, 1, 1];
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 55];
y_count = [88, 40, 24, 12, 8, 5, 1, 1, 1, 100];
where x, y are the categories, and x_count, y_count are the frequencies of each category. x and y can be of unequal lengths, and need not contain the same categories.
I want to create a side-by-side bar/histogram plot with the categories along the x-axis, like the one in "side by side multiply histogram in matlab". The frequency counts go along the y-axis.
I've tried googling around, but I'm still stuck on this. If someone could help, that would be great. The solution in "side by side multiply histogram in matlab" works only if x and y have the same length, but mine don't.
You can try this:
% create unique bins
bins = unique([x y]);
% create vectors of zeros the same size as bins
xBins = zeros(size(bins));
yBins = zeros(size(bins));
% find the position of each category within bins
[~, xIdx] = ismember(x, bins);
[~, yIdx] = ismember(y, bins);
% fill in counts in the respective spots
xBins(xIdx) = x_count;
yBins(yIdx) = y_count;
% plot the two series as grouped (side-by-side) bars
bar(bins, [xBins' yBins']);