I'm getting the following error when executing my application with the code snippet below:
error: Cannot prove that (Int, String, String, String, String, Double, String) <:< (T, U).
}.collect.toMap
val trains = sparkEnvironment.sc.textFile(dataDirectoryPath + "/trains.csv").map { line =>
val fields = line.split(",")
// format: (trainID,trainName,departure,arrival,cost,trainClass)
(fields(0).toInt, fields(1),fields(2),fields(3),fields(4).toDouble,fields(5))
}.collect.toMap
What could be the cause, and can anyone please suggest a solution?
If you want to call toMap on a Seq, you should have a Seq of Tuple2. The ScalaDoc of toMap states:
This method is unavailable unless the elements are members of Tuple2,
each ((T, U)) becoming a key-value pair in the map
So you should do:
val trains = sparkEnvironment.sc.textFile(dataDirectoryPath + "/trains.csv").map { line =>
val fields = line.split(",")
// format: (trainID,trainName,departure,arrival,cost,trainClass)
(fields(0).toInt, // first element of Tuple2 -> "key"
(fields(1),fields(2),fields(3),fields(4).toDouble,fields(5)) // 2nd element of Tuple2 -> "value"
)
}.collect.toMap
so that your map statement returns an RDD[(Int, (String, String, String, String, Double, String))].
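For instance, with a single hypothetical CSV line (sample data, not from the original post), the key/value split looks like this:
val line = "1,Express,08:00,12:00,49.99,FirstClass"
val fields = line.split(",")
// Tuple2 of (key, value): trainID -> remaining columns
val entry = (fields(0).toInt, (fields(1), fields(2), fields(3), fields(4).toDouble, fields(5)))
val trains = Map(entry)
println(trains(1)) // (Express,08:00,12:00,49.99,FirstClass)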
There are two RDDs which I am trying to join:
The join works when there are two parameters in each RDD; however, when I add a new parameter to the existingGTINs RDD, I am facing the error below:
Below is the code:
newGTS.collect()
(00070137115045,00070137115045)
(00799999150451,00799999150451)
existingGTS.collect()
(00799999150451,(00003306-808b-46da-bc7f-419c5ae223a7,2016-10-10 10:23:12.0))
(00016700000653,(00006d79-94ea-4651-be0c-0ce77958cd45,2021-05-31 01:20:39.291))
(00923846453024,(0000704b-b40d-4b9e-b266-f7c66723df0e,null))
(00610074049265,(0000a7a1-587c-4b13-a155-7846df82fdee,2020-03-20 12:16:55.873))
(00034100516079,(0002495f-6084-49dd-aadb-20cd137d9694,null))
val join1 = newGTINs.leftOuterJoin(existingGTINs) mapValues {
case (gtin, iUUID, createDt) => (iUUID.isEmpty, iUUID.getOrElse(UUID.randomUUID.toString))
}
error: constructor cannot be instantiated to expected type;
found : (T1, T2, T3)
required: (String, Option[(String, String)])
case (gtin, iUUID, createDt) => (iUUID.isEmpty, iUUID.getOrElse(UUID.randomUUID.toString))
^
PS: UUID.randomUUID.toString is used to create a random ID.
I am guessing that the newGTINs and existingGTINs used in the join are meant to be the same as the newGTS and existingGTS shown with the collects.
Since your newGTINs looks to be an RDD[(String, String)] and existingGTINs is an RDD[(String, (String, String))], your newGTINs.leftOuterJoin(existingGTINs) will be an RDD[(String, (String, Option[(String, String)]))].
This means that your mapValues will expect a function (String, Option[(String, String)]) => SomeNewType as a parameter. It can also accept a partial function satisfying similar type semantics.
But your { case (gtin, iUUID, createDt) => (iUUID.isEmpty, iUUID.getOrElse(UUID.randomUUID.toString)) } is a partial function which corresponds to the type (String, String, String) => SomeNewType.
Notice the difference, hence the error. You can fix this by providing an appropriate partial function to satisfy the mapValues requirement.
val join1 =
  newGTINs
    .leftOuterJoin(existingGTINs)
    .mapValues {
      case (gtin, Some((iUUID, createDt))) =>
        // a matching entry exists, so keep its UUID
        (false, iUUID)
      case (gtin, None) =>
        // what happens for GTINs without a matching element in the existing ones
        (true, UUID.randomUUID.toString)
    }
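As a quick way to check the types involved, here is a minimal self-contained sketch; the local SparkContext setup and sample rows are my own assumptions, not from the original post:
import java.util.UUID
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("join-demo"))
val newGTINs = sc.parallelize(Seq(
  ("00070137115045", "00070137115045"),
  ("00799999150451", "00799999150451")))
val existingGTINs = sc.parallelize(Seq(
  ("00799999150451", ("00003306-808b-46da-bc7f-419c5ae223a7", "2016-10-10 10:23:12.0"))))

// the leftOuterJoin yields RDD[(String, (String, Option[(String, String)]))]
val join1 = newGTINs.leftOuterJoin(existingGTINs).mapValues {
  case (_, Some((iUUID, _))) => (false, iUUID)
  case (_, None)             => (true, UUID.randomUUID.toString)
}
join1.collect.foreach(println)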
I'm using Flink via the Scala interface to do some data processing. I have some user data that comes in tuples:
(user1, "titanic")
(user1, "titanic")
(user1, "batman")
(user2, "star wars")
(user2, "star wars")
(user2, "batman")
I want to key by the user, create a window and then count the number of times that a user has viewed a particular movie within that window, so that I end up with a Map from each movie to the number of view counts for each user. For example, for user1, the correct output is Map("titanic" -> 2, "batman" -> 1).
I know that the first part of my code should look something like this:
keyedStream.keyBy(0).window(EventTimeSessionWindows.withGap(Time.minutes(10)))
But I don't know how to do a further aggregation within the window so that I end up with a Map of view counts for each user/window. I've attempted to write my own AggregateFunction that collects these counts into a mutable Map but unfortunately a mutable Map is not serializable, so it fails.
How might I do this?
You should be able to solve the problem by using an AggregateFunction:
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.windowing.time.Time

source
  .keyBy(0)
  .timeWindow(Time.seconds(10L))
  .aggregate(new AggregateFunction[(String, String), (String, Map[String, Int]), (String, Map[String, Int])] {
    override def createAccumulator(): (String, Map[String, Int]) = ("", Map())

    override def add(value: (String, String), accumulator: (String, Map[String, Int])): (String, Map[String, Int]) = {
      val counter = accumulator._2.getOrElse(value._2, 0)
      (value._1, accumulator._2 + (value._2 -> (counter + 1)))
    }

    override def getResult(accumulator: (String, Map[String, Int])): (String, Map[String, Int]) = accumulator

    override def merge(a: (String, Map[String, Int]), b: (String, Map[String, Int])): (String, Map[String, Int]) =
      (a._1, (a._2.keySet ++ b._2.keySet).map(k => k -> (a._2.getOrElse(k, 0) + b._2.getOrElse(k, 0))).toMap)
  })
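The add logic can be sanity-checked in plain Scala, without a Flink runtime, using the sample tuples from the question (a small sketch, not part of the original answer):
def add(value: (String, String), acc: (String, Map[String, Int])): (String, Map[String, Int]) = {
  val counter = acc._2.getOrElse(value._2, 0)
  (value._1, acc._2 + (value._2 -> (counter + 1)))
}
val events = Seq(("user1", "titanic"), ("user1", "titanic"), ("user1", "batman"))
val result = events.foldLeft(("", Map.empty[String, Int]))((acc, v) => add(v, acc))
println(result) // (user1,Map(titanic -> 2, batman -> 1))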
A previous process gave me the accumulator and count of every group in the following way:
val data: Array[(Int, (Double, Int))] = Array((2,(2.1463120403829962,7340)), (1,(1.4532644653720025,4280)))
the structure is (groupId, (accum, count))
Now I want to reduce it to get the sum of the tuples:
(k1,(a1,n1)),(k2,(a2,n2))
need:
(a1+a2),(n1+n2)
Sounds like a simple task, so I do:
val mainMean = groups.reduce((acc,v)=>(acc._1 + v._1,acc._2 + v._2))
And I get:
:33: error: type mismatch; found : (Double, Int)
required: String
val mainMean = groups.reduce((acc,v)=>(acc._1 + v._1,acc._2 + v._2))
I also tried:
val mainMean = groups.reduce((k,(acc,v))=>(acc._1 + v._1,acc._2 + v._2))
and tell me: Note: Tuples cannot be directly destructured in method or function parameters.
Either create a single parameter accepting the Tuple2,
or consider a pattern matching anonymous function
So:
val mainMean = groups.reduce({case(k,(acc,n))=>(k,(acc._1+n._1,acc._1+n._2))})
and I get:
error: type mismatch; found : (Int, (Double, Int)) required: Int
I know it's a newbie question, but I am stuck on it.
There can be some difficulties working with tuples.
Below you can see working code, but let me explain.
val data = Array((2,(2.1463120403829962,7340)), (1,(1.4532644653720025,4280)))
def tupleSum(t1: (Int, (Double, Int)), t2: (Int, (Double, Int))): (Int, (Double, Int)) =
(0,(t1._2._1 + t2._2._1, t1._2._2 + t2._2._2))
val mainMean = data.reduce(tupleSum)._2
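Running this on the sample data sums the inner tuples:
println(mainMean) // approximately (3.5995765057549987,11620)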
We can write the reduce arguments explicitly, like
data.reduce((tuple1, tuple2) => tupleSum(tuple1, tuple2))
where tuple1 acts as the accumulator. On the first iteration it takes the first value of the array, and every subsequent value is added to the accumulator's value.
So if you want to perform reduce using pattern matching it will look like this:
val mainMean = data.reduce((tuple1, tuple2) => {
val t1 = tuple1 match { case (i, t) => t }
val t2 = tuple2 match { case (i, t) => t }
// now t1 and t2 represents inner tuples of input tuples
(0, (t1._1 + t2._1, t1._2 + t2._2))}
)
UPD.
I rewrote the previous listing, adding type annotations and println statements. I hope it will help to get the point. There is some explanation after.
val data = Array((3, (3.0, 3)), (2,(2.0,2)), (1,(1.0,1)))
val mainMean = data.reduce((tuple1: (Int, (Double, Int)),
tuple2: (Int, (Double, Int))) => {
println("tuple1: " + tuple1)
println("tuple2: " + tuple2)
val t1: (Double, Int) = tuple1 match {
case (i: Int, t: (Double, Int)) => t
}
val t2: (Double, Int) = tuple2 match {
case (i: Int, t: (Double, Int)) => t
}
// now t1 and t2 represents inner tuples of input tuples
(0, (t1._1 + t2._1, t1._2 + t2._2))}
)
println("mainMean: " + mainMean)
And the output will be:
tuple1: (3,(3.0,3)) // 1st element of the array
tuple2: (2,(2.0,2)) // 2nd element of the array
tuple1: (0,(5.0,5)) // sum of 1st and 2nd elements
tuple2: (1,(1.0,1)) // 3rd element
mainMean: (0,(6.0,6)) // result sum
tuple1 and tuple2 are of type (Int, (Double, Int)). We know they will always be of this type, which is why we use only one case in the pattern matching. We unpack tuple1 into i: Int and t: (Double, Int). As far as we are not interested in the key, we return only t. Now t1 represents the inner tuple of tuple1. The same story with tuple2 and t2.
You can find more information about fold and reduce functions in the Scala collections documentation.
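For comparison, here is a small sketch of the same sum written with foldLeft, which makes the accumulator and its initial value explicit:
val data = Array((3, (3.0, 3)), (2, (2.0, 2)), (1, (1.0, 1)))
val sums = data.foldLeft((0.0, 0)) { (acc, kv) =>
  (acc._1 + kv._2._1, acc._2 + kv._2._2)
}
println(sums) // (6.0,6)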
I'm new to Scala and I'm trying to do something like this in a clean way.
I have a method that takes in several optional parameters. I want to create a map and only add items to the map if the optional parameter has a value. Here's a dummy example:
def makeSomething(value1: Option[String], value2: Option[String], value3: Option[String]): SomeClass = {
val someMap: Map[String, String] =
value1.map(i => Map("KEY_1_NAME" -> i.toString)).getOrElse(Map())
}
The case above kind of does what I want, but only for value1 - I would want this done for all of the optional values, putting each present one into the map. I know I can do something brute-force:
def makeSomething(value1: Option[String], value2: Option[String], value3: Option[String]): SomeClass = {
// create Map
// if value1 has a value, add to map
// if value2 has a value, add to map
// ... etc
}
but I was wondering if Scala has any features that would help me clean this up.
Thanks in advance!
You can create a Map[String, Option[String]] and then use collect to remove empty values and "extract" the present ones from their wrapping Option:
def makeSomething(value1: Option[String], value2: Option[String], value3: Option[String]): SomeClass = {
val someMap: Map[String, String] =
Map("KEY1" -> value1, "KEY2" -> value2, "KEY3" -> value3)
.collect { case (key, Some(value)) => key -> value }
// ...
}
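A quick check of the collect step with sample values (assumed for illustration):
val m = Map("KEY1" -> Some("a"), "KEY2" -> None, "KEY3" -> Some("c"))
  .collect { case (key, Some(value)) => key -> value }
println(m) // Map(KEY1 -> a, KEY3 -> c)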
.collect is one possibility. Alternatively, use the fact that Option is easily convertible to a Seq:
value1.map("KEY1" -> _) ++
value2.map("KEY2" -> _) ++
value3.map("KEY3" -> _) toMap
The following Iterable can be of size one, two, or (up to) three.
org.apache.spark.rdd.RDD[Iterable[(String, String, String, String, Long)]] = MappedRDD[17] at map at <console>:75
The second element of each tuple can have any of the following values: A, B, C. Each of these values can appear (at most) once.
What I would like to do is sort them based on the following order (B, A, C), and then create a string by concatenating the elements in the 3rd position. If the corresponding tag is missing, then concatenate an empty string `` instead. For example:
this:
CompactBuffer((blah,A,val1,blah,blah), (blah,B,val2,blah,blah), (blah,C,val3,blah,blah))
should result in:
val2,val1,val3
this:
CompactBuffer((blah,A,val1,blah,blah), (blah,C,val3,blah,blah))
should result in:
,val1,val3
this:
CompactBuffer((blah,A,val1,blah,blah), (blah,B,val2,blah,blah))
should result in:
val2,val1,
this:
CompactBuffer((blah,B,val2,blah,blah))
should result in:
val2,,
and so forth.
In your case, where A, B and C each appear at most once, you can add the corresponding values to a temporary map and retrieve them from that map in the correct order.
If we use getOrElse to read the values from the map, we can specify the empty string as the default value. This way we still get the correct result if our Iterable doesn't contain all of the tuples with A, B and C.
type YourTuple = (String, String, String, String, Long)
def orderTuples(order: List[String])(iter: Iterable[YourTuple]) = {
val orderMap = iter.map { case (_, key, value, _, _) => key -> value }.toMap
order.map(s => orderMap.getOrElse(s, "")).mkString(",")
}
We can use this function as follows:
val a = ("blah","A","val1","blah",1L)
val b = ("blah","B","val2","blah",2L)
val c = ("blah","C","val3","blah",3L)
val order = List("B", "A", "C")
val bacOrder = orderTuples(order) _
bacOrder(Iterable(a, b, c)) // String = val2,val1,val3
bacOrder(Iterable(a, c)) // String = ,val1,val3
bacOrder(Iterable(a, b)) // String = val2,val1,
bacOrder(Iterable(b)) // String = val2,,
def orderTuples(xs: Iterable[(String, String, String, String, Long)],
                order: (String, String, String) = ("B", "A", "C")): String = {
  type T = List[(String, String, String, String, Long)]
  type KV = List[(String, String)]
  val ord = List(order._1, order._2, order._3)
  def loop(ts: T, acc: KV, vs: List[String] = ord): KV = ts match {
    case Nil if vs.isEmpty => acc
    case Nil => vs.map((_, "")) ++: acc // fill missing tags with empty values
    case x :: rest => loop(rest, (x._2, x._3) :: acc, vs.filterNot(_ == x._2))
  }
  def comp(x: String) = ord.zipWithIndex.toMap.apply(x)
  // :: / Nil patterns need a List, hence the xs.toList conversion
  loop(xs.toList, Nil).sortBy(x => comp(x._1)).map(_._2).mkString(",")
}
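The same checks as above should hold for this version (using the question's tuple shape, with a Long in the last position):
val a = ("blah", "A", "val1", "blah", 1L)
val b = ("blah", "B", "val2", "blah", 2L)
orderTuples(Iterable(a, b)) // val2,val1,
orderTuples(Iterable(b))    // val2,,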