combine elements into arrays in rdd - scala

How can I convert an RDD[(Int,Int)] to an RDD[Array[(Int,Int)]], combining the elements that share a key?
Let's say I have
(0,0),(1,0),(1,1),(0,1)
and I want an array arr1 = ((0,0),(1,0)) and an array arr2 = ((1,1),(0,1)),
so that the resulting RDD contains arr1 and arr2.

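What you're basically trying to do is group an RDD of tuples by their ith element. You can use
rdd.groupBy(_._1)
to create an
RDD[(K, Iterable[(Int, Int)])]
where the key is the ith element (i.e., 0 or 1 in your example). Note that the groups in your example share their second element, so there you would group with _._2 instead.
Then you can turn each group into an array and drop the keys with .values.map(_.toArray), which gives the RDD[Array[(Int,Int)]] you asked for.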
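As a rough local sketch of the grouping step, here is the same idea with a plain Seq standing in for the RDD. That substitution is an assumption made so the snippet runs without Spark; on a real RDD the grouped result is itself an RDD of (key, Iterable) pairs. The names pairs, grouped and arrays are illustrative only.

```scala
// A plain Seq stands in for the RDD (assumption, so this runs without Spark).
val pairs = Seq((0, 0), (1, 0), (1, 1), (0, 1))

// Group by the second element, which is how the question's example is grouped.
val grouped: Map[Int, Seq[(Int, Int)]] = pairs.groupBy(_._2)

// Turn every group into an array and drop the keys.
// Sorting by key only makes the order of the groups deterministic.
val arrays: Seq[Array[(Int, Int)]] =
  grouped.toSeq.sortBy(_._1).map(_._2.toArray)

arrays.foreach(a => println(a.toList))
```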

Related

Unique pairs in cartesian product of an RDD with itself

I have an RDD with 100 items and each item is of the format (String,Int,Option[Set[Int]])
I want to compute the cartesian product of this RDD with itself and keep only the unique pairs. For example, without removing the duplicate pairs I would get something like this:
(a,a),(a,b),(a,c),(b,a),(b,b),(b,c),(c,a),(c,b),(c,c)
The output that I want is (a,b),(a,c),(b,c)
I have managed to remove the instances where the pairs are the same value (a,a),(b,b),(c,c) but I am unsure how to remove the reverse pairs.
This is the code I used to do it:
val m = records.cartesian(records).filter{case (a,b) => a != b}.collect()
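One possible way to also drop the reversed pairs (not taken from the original thread) is to attach an index to each record and keep a pair only when the first index is smaller than the second. The sketch below uses a plain Vector of strings as a stand-in for the RDD and its records; that stand-in and the names indexed and uniquePairs are assumptions for illustration.

```scala
// A plain Vector stands in for the RDD of records (assumption).
val records = Vector("a", "b", "c")

// Attach a position to every record so that pairs can be ordered.
val indexed = records.zipWithIndex

// Keeping only i < j removes both (x, x) and the reversed duplicates.
val uniquePairs =
  for {
    (a, i) <- indexed
    (b, j) <- indexed
    if i < j
  } yield (a, b)

println(uniquePairs)
```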

is there a way in scala i can form a map based on the content of two arrays

I have two scala vectors
value - represents values
occurrences - represents number of occurrences of a character
Ex:
values : (a,b,c,d)
occurrences : (4,2,1,5)
How can I merge these two vectors into a Map of the form Map[Char,Int] = Map(a -> 4, b -> 2, c -> 1, d -> 5)?
(values zip occurrences).toMap
zip pairs up matching elements in two collections, and toMap converts a collection of pairs into a Map.
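A minimal sketch with the data from the question, assuming the values are Chars and the occurrences are Ints (the name counts is illustrative):

```scala
val values = Vector('a', 'b', 'c', 'd')
val occurrences = Vector(4, 2, 1, 5)

// zip pairs elements by position; toMap treats each pair as key -> value.
val counts: Map[Char, Int] = (values zip occurrences).toMap

println(counts('d'))
```

Note that zip stops at the end of the shorter collection, so any surplus elements in the longer vector are silently dropped.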

how to accumulate 2D array elements in Scala?

I have 2D array as:
Array(Array(1,1,0), Array(1,0,1))
and I would like to accumulate values over column so my final output look like
Array(Array(1,1,0), Array(2,1,1))
If this is 1D array, I can simply use 'scan' but I'm having trouble with using scan in 2D array.
can anyone help on this issue?
Here's one way to do it:
val t = Array(Array(1,1,0), Array(1,0,1))
val result = t.scanLeft(Array.fill(t(0).length)(0))((x, y) =>
  x.zip(y).map(e => e._1 + e._2)).drop(1)
// to see the results
result.foreach(e => println(e.toList))
gives:
List(1, 1, 0)
List(2, 1, 1)
The idea is to create an array filled with zeros (using Array.fill) and then scan the 2D array using that as an accumulator. In the end, drop(1) gets rid of the zero-filled array.
EDIT:
In response to the comment, this solution works for a matrix of any size. The zip function takes care of element-wise addition.
EDIT 2 (Step by step explanation):
You already know about scan on a one-dimensional array. The idea is essentially the same.
We initialize the accumulator with zero. In this case, zero means an array of zeros. Array.fill is used to create an array filled with zeros.
Instead of a single addition, we need to add arrays element-wise. This is what the combination of zip and map does. There are a lot of examples available on the Internet about how these methods work.
Finally, we drop the zero element using Scala's drop(1). The result is an array of arrays containing accumulated values.
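The steps can also be written out one at a time, which makes the intermediate values visible. This is only a sketch of the same approach; the names zeros, addRows and scanned are introduced here for illustration.

```scala
val t = Array(Array(1, 1, 0), Array(1, 0, 1))

// Step 1: the "zero" accumulator is a row of zeros as wide as the matrix.
val zeros = Array.fill(t(0).length)(0)

// Step 2: element-wise addition of two rows via zip + map.
def addRows(x: Array[Int], y: Array[Int]): Array[Int] =
  x.zip(y).map { case (a, b) => a + b }

// Step 3: scanLeft keeps every intermediate sum, starting with the zero row.
val scanned = t.scanLeft(zeros)(addRows)

// Step 4: drop the leading zero row.
val result = scanned.drop(1)

result.foreach(row => println(row.toList))
```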
I would solve it like this: for a given row r, sum row r and all the rows before it.
val accumulatedMatrix =
  for (row <- array.indices)
    yield array.take(row + 1).foldLeft(Array(0, 0, 0)) {
      case (a, b) => Array(a(0) + b(0), a(1) + b(1), a(2) + b(2))
    }
input:
val array = Array(
Array(1,1,0),
Array(1,0,1)
)
output:
1,1,0
2,1,1
Instead of summing all the previous rows again for every row, you can improve this by remembering the running sum.
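A minimal sketch of that improvement, assuming all rows have the same length: the fold carries the rows accumulated so far and builds each new row from the last accumulated one, so every input row is added only once. The name accumulated is illustrative.

```scala
val array = Array(
  Array(1, 1, 0),
  Array(1, 0, 1)
)

// Carry the accumulated rows forward instead of re-summing rows 0..r each time.
val accumulated: Vector[Array[Int]] =
  array.foldLeft(Vector.empty[Array[Int]]) { (done, row) =>
    val next = done.lastOption match {
      case Some(prev) => prev.zip(row).map { case (p, r) => p + r }
      case None       => row.clone() // first row: nothing to add yet
    }
    done :+ next
  }

accumulated.foreach(row => println(row.mkString(",")))
```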
Pretty much the same approach as a 1D array:
a.tail.scan(a.head)((acc, value) =>
  Array(acc(0) + value(0), acc(1) + value(1), acc(2) + value(2))
)
For an arbitrary number of columns (as long as all rows have the same length):
a.tail.scan(a.head)((acc, value) =>
  acc zip value map { case (x, y) => x + y }
)

Is there a way to filter out the elements of a List by checking them against elements of an Array in Scala?

I have a List in Scala:
val hdtList = hdt.split(",").toList
hdtList.foreach(println)
Output:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string
There is an array which is obtained from a dataframe and converting its column to array as below:
val partition_columns = spColsDF.select("partition_columns").collect.flatMap(x => x.getAs[String](0).split(","))
partition_columns.foreach(println)
Output:
source_system_name
period_year
Is there a way to filter out the elements source_system_name string and period_year bigint from hdtList by checking them against the elements of the array partition_columns, and put them into a new List?
I am confused about how to apply filter/map to the right collections and compare them.
Could anyone let me know how I can achieve that?
Unless I'm misunderstanding the question, I think this is what you need:
val filtered = hdtList.filter { x =>
!partition_columns.exists { col => x.startsWith(col) }
}
In your case you need to use filter, because you want to remove elements from hdtList.
map is a function that transforms elements; there is no way to remove elements from a collection using map. If you map over a List of X elements, you get exactly X elements back, no fewer and no more.
val newList = hdtList.filter( x => partition_columns.exists(x.startsWith) )
Be aware that combining filter and exists across two Lists is an O(N×M) algorithm. If your Lists are big, you will have a performance problem.
One way to solve that problem is using Sets.
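A sketch of the Set-based variant, under the assumption that the column name is everything before the first space, so an exact Set lookup can replace the startsWith scan. The sample data is abbreviated from the question, and columnName, partitionSet, kept and removed are names introduced here.

```scala
// Abbreviated sample data in the shape shown in the question.
val hdtList = List(
  "forecast_id bigint",
  "period_year bigint",
  "period_num bigint",
  "source_system_name string"
)
val partition_columns = Array("source_system_name", "period_year")

// Build the Set once; each lookup is then effectively constant time.
val partitionSet: Set[String] = partition_columns.toSet

// Assumption: the column name is the text before the first space.
def columnName(entry: String): String = entry.takeWhile(_ != ' ')

val kept = hdtList.filter(e => partitionSet.contains(columnName(e)))
val removed = hdtList.filterNot(e => partitionSet.contains(columnName(e)))

println(kept)
println(removed)
```

Unlike startsWith, the exact lookup does not match a longer column name that merely begins with a partition column name.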
It might be useful to have both lists: the hdt elements referenced in partition_columns, and the hdt elements that aren't.
val (pc, notPc) = hdtList.partition( w =>
  partition_columns.contains(w.takeWhile(_!=' ')))
//pc: List[String] = List(period_year bigint, source_system_name string)
//notPc: List[String] = List(forecast_id bigint, period_num bigint, ... etc.

How to sort an ArrayBuffer[Double] and save indices

Given an ArrayBuffer[Double], how can I sort its elements while also keeping track of their original indices? E.g., for
val arr: ArrayBuffer[Double] = ArrayBuffer(4,5.3,5,3,8,9)
the result has to be:
arrSorted = ArrayBuffer(3,4,5,5.3,8,9)
indices = ArrayBuffer(3,0,2,1,4,5) //here the data structure doesn't matter, it can be Array, List, Vector, etc.
Thanks
This is a one-liner:
val (addSorted, indices) = arr.zipWithIndex.sorted.unzip
Going step by step, zipWithIndex creates a collection of tuples with the index as the second value in each tuple:
scala> println(arr.zipWithIndex)
ArrayBuffer((4.0,0), (5.3,1), (5.0,2), (3.0,3), (8.0,4), (9.0,5))
sorted sorts these tuples lexicographically (which is almost certainly what you want, but you could also use sortBy(_._1) to be explicit about the fact that you want to sort by the values):
scala> println(arr.zipWithIndex.sorted)
ArrayBuffer((3.0,3), (4.0,0), (5.0,2), (5.3,1), (8.0,4), (9.0,5))
unzip then turns this collection of tuples into a tuple of collections, which you can deconstruct with val (addSorted, indices) = ....
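Put together as a self-contained snippet, here is a sketch of the explicit sortBy(_._1) variant mentioned above. The name sortedValues is illustrative; since sortBy is stable, equal values keep their original relative order.

```scala
import scala.collection.mutable.ArrayBuffer

val arr: ArrayBuffer[Double] = ArrayBuffer(4.0, 5.3, 5.0, 3.0, 8.0, 9.0)

// Pair each value with its index, sort by value only, then split the pairs apart.
val (sortedValues, indices) = arr.zipWithIndex.sortBy(_._1).unzip

println(sortedValues)
println(indices)
```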