How to convert a map into repeated tuples in Scala

I have a method that accepts repeated tuples
myfn(attributes: (String, Any)*)
Can I convert a map to repeated tuples? For example:
val m = Map("a1" -> 1, "a2" -> 8) // convert to repeated tuples
myfn(m)

Sure, why not?
myfn(m.toSeq: _*)
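A minimal runnable sketch of the whole thing (the body of myfn here is made up, just so the call has a visible effect):
def myfn(attributes: (String, Any)*): Unit =
  attributes.foreach { case (k, v) => println(s"$k -> $v") }

val m = Map("a1" -> 1, "a2" -> 8)
// A Map is not a Seq, so convert it first and then expand with : _*
myfn(m.toSeq: _*)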

Related

Filtering RDDs based on value of Key

I have two RDDs that wrap the following arrays:
Array((3,Ken), (5,Jonny), (4,Adam), (3,Ben), (6,Rhonda), (5,Johny))
Array((4,Rudy), (7,Micheal), (5,Peter), (5,Shawn), (5,Aaron), (7,Gilbert))
I need to write code such that, given an input of 3, it returns
Array((3,Ken), (3,Ben))
If input is 6, output should be
Array((6,Rhonda))
I tried something like this:
val list3 = list1.union(list2)
list3.reduceByKey(_+_).collect
list3.reduceByKey(6).collect
None of these worked, can anyone help me out with a solution for this problem?
Given the following, which you would have to define yourself:
// Provide your SparkContext and inputs here
val sc: SparkContext = ???
val array1: Array[(Int, String)] = ???
val array2: Array[(Int, String)] = ???
val n: Int = ???
val rdd1 = sc.parallelize(array1)
val rdd2 = sc.parallelize(array2)
You can use union and filter to reach your goal:
rdd1.union(rdd2).filter(_._1 == n)
Since filtering by key is something that you would probably want to do on several occasions, it makes sense to encapsulate this functionality in its own function.
It would also be interesting to make sure that this function works with any type of key, not only Int.
You can express this in the old RDD API as follows:
import org.apache.spark.rdd.RDD

def filterByKey[K, V](rdd: RDD[(K, V)], k: K): RDD[(K, V)] =
  rdd.filter(_._1 == k)
You can use it as follows:
val rdd = rdd1.union(rdd2)
val filtered = filterByKey(rdd, n)
Let's look at this method in a little more detail.
This method lets you filter by key an RDD that contains generic pairs, where the type of the first item is K and the type of the second is V (for key and value). It also accepts a key of type K that will be used to filter the RDD.
You then use the filter function, which takes a predicate (a function from some type - in this case the (K, V) pair - to a Boolean) and makes sure that the resulting RDD only contains items that satisfy this predicate.
We could also have written the body of the function as:
rdd.filter(pair => pair._1 == k)
or
rdd.filter { case (key, value) => key == k }
but we took advantage of the _ wildcard to express the fact that we want to act on the first (and only) parameter of this anonymous function.
To use it, you first parallelize your arrays into RDDs, call union on them, and then invoke the filterByKey function with the number you want to filter by (as shown in the example).
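A self-contained sketch putting all of this together with the sample arrays from the question (it assumes a SparkContext named sc already exists, e.g. in spark-shell):
val array1 = Array((3, "Ken"), (5, "Jonny"), (4, "Adam"), (3, "Ben"), (6, "Rhonda"), (5, "Johny"))
val array2 = Array((4, "Rudy"), (7, "Micheal"), (5, "Peter"), (5, "Shawn"), (5, "Aaron"), (7, "Gilbert"))

val rdd1 = sc.parallelize(array1)
val rdd2 = sc.parallelize(array2)

// Filter the union of both RDDs by the requested key
val result = filterByKey(rdd1.union(rdd2), 3).collect()
// result: Array((3,Ken), (3,Ben))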

Scala method toLowerCase in spark

val file = sc.textFile(filePath)
val sol1=file.map(x=>x.split("\t")).map(x=>Array(x(4),x(5),x(1)))
val sol2=sol1.map(x=>x(2).toLowerCase)
In sol1 I have created an RDD[Array[String]], and for every array I want to put the 3rd string element in lower case, so I call toLowerCase, which should do that, but instead it seems to transform the string into lower-case characters??
I assume you want to convert the 3rd array element to lower case:
val sol1=file.map(x=>x.split("\t"))
.map(x => Array(x(4),x(5),x(1).toLowerCase))
In your code, sol2 will be an RDD of strings, not an RDD of arrays.
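If you would rather keep sol1 exactly as in the question and lower-case the 3rd element in a separate step, while still ending up with an RDD[Array[String]], one possible sketch is:
// updated returns a copy of the array with the element at index 2 replaced
val sol1Lower = sol1.map(a => a.updated(2, a(2).toLowerCase))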

(Scala) How to convert List into a Seq

I have a function like
def myFunction(i:Int*) = i.map(a => println(a))
but I have a List of Int's.
val myList:List[Int] = List(1,2,3,4)
Desired output:
1
2
3
4
How can I programmatically convert myList so it can be inserted into myFunction?
Looking at your desired input and output, you want to pass a List where a varargs argument is expected.
A varargs method can receive zero or more arguments of the same type.
Inside the method, the varargs parameter actually has type Seq[T].
And you can pass any Seq to it by using "varargs ascription/expansion" (I don't know if there is an official name for this):
myFunction(myList: _*)
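Putting the pieces from the question together (plain Scala, runnable in a REPL; foreach is used instead of map since only the side effect matters):
def myFunction(i: Int*): Unit = i.foreach(println)

val myList: List[Int] = List(1, 2, 3, 4)
myFunction(myList: _*)
// prints 1, 2, 3 and 4, one per line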

How to unpack a map/list in scala to tuples for a variadic function?

I'm trying to create a PairRDD in spark. For that I need a tuple2 RDD, like RDD[(String, String)]. However, I have an RDD[Map[String, String]].
I can't work out how to get rid of the iterable so I'm just left with RDD[(String, String)] rather than e.g. RDD[List[(String, String)]].
A simple demo of what I'm trying to make work is this broken code:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The last line doesn't work because pairs is an RDD[Map[String, Int]] when it needs to be an RDD[(String, Int)].
So how can I get rid of the iterable in pairs above to convert the Map to just a tuple2?
You can actually just run:
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
Note the use of the identity function, which replicates the functionality of flatten on an RDD (each Map is flattened into its key-value pairs), and the nifty underscore notation in reduceByKey(_ + _) for conciseness.
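A sketch of the full pipeline with that fix applied (it assumes the same sparkContext and data.txt as in the question):
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))   // RDD[Map[String, Int]]
val counts = pairs
  .flatMap(identity)                      // RDD[(String, Int)]
  .reduceByKey(_ + _)                     // counts per distinct line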

Can reduceByKey be used to change type and combine values - Scala Spark?

In the code below I'm attempting to combine values:
val rdd: org.apache.spark.rdd.RDD[((String), Double)] =
  sc.parallelize(List(
    (("a"), 1.0),
    (("a"), 3.0),
    (("a"), 2.0)
  ))
val reduceByKey = rdd.reduceByKey((a , b) => String.valueOf(a) + String.valueOf(b))
reduceByKey should contain (a, 1,3,2) but I receive a compile-time error:
Multiple markers at this line - type mismatch; found: String, required: Double - type mismatch; found: String, required: Double
What determines the type of the reduce function? Can the type not be converted?
I could use groupByKey to achieve same result but just want to understand reduceByKey.
No. Given an RDD of type RDD[(K, V)], reduceByKey will take an associative function of type (V, V) => V.
If we want to apply a reduction that changes the type of the values to another arbitrary type, then we can use aggregateByKey:
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)
Using the zeroValue and the seqOp function, it provides a fold-like operation on the map side, while the associative function combOp combines the results of the seqOp into the final result, much like reduceByKey would do.
As we can appreciate from the signature, while the collection's values are of type V, the result of aggregateByKey will be of an arbitrary type U.
Applied to the example above, aggregateByKey would look like this:
rdd.aggregateByKey("")(
  { case (aggr, value) => aggr + String.valueOf(value) },
  (aggr1, aggr2) => aggr1 + aggr2
)
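As a self-contained sketch (assuming an existing SparkContext named sc), this is what the whole thing looks like:
val rdd = sc.parallelize(List(("a", 1.0), ("a", 3.0), ("a", 2.0)))
val aggregated = rdd.aggregateByKey("")(
  (aggr, value) => aggr + String.valueOf(value),  // seqOp: fold each Double into the running String
  (aggr1, aggr2) => aggr1 + aggr2                 // combOp: merge partial Strings
)
// aggregated.collect() gives something like Array((a,1.03.02.0));
// the exact concatenation order depends on partitioning.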
The problem with your code is a value-type mismatch. You can achieve the same output with reduceByKey, provided you change the type of the values in your RDD.
val rdd: org.apache.spark.rdd.RDD[((String), String)] =
  sc.parallelize(List(
    ("a", "1.0"),
    ("a", "3.0"),
    ("a", "2.0")
  ))
val reduceByKey = rdd.reduceByKey((a , b) => a.concat(b))
Here is the same example. As long as the function you pass to reduceByKey takes two parameters of the value type (Double in your case) and returns a single value of the same type, your reduceByKey will work.
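For completeness, a sketch of what collecting that result would give (again, the concatenation order depends on partitioning):
val collected = reduceByKey.collect()
// collected: Array[(String, String)], e.g. Array((a,1.03.02.0))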