This question is specific to Scala.
How can I do the following transformation given an RDD with elements of the form
(List[String], String) => (String, String)
e.g.
([A,B,C], X)
([C,D,E], Y)
to
(A, X)
(B, X)
(C, X)
(C, Y)
(D, Y)
(E, Y)
So
scala> val l = List((List('a, 'b, 'c) -> 'x), List('c, 'd, 'e) -> 'y)
l: List[(List[Symbol], Symbol)] = List((List('a, 'b, 'c),'x),
(List('c, 'd, 'e),'y))
scala> l.flatMap { case (innerList, c) => innerList.map(_ -> c) }
res0: List[(Symbol, Symbol)] = List(('a,'x), ('b,'x), ('c,'x), ('c,'y),
('d,'y), ('e,'y))
With Spark you can solve your problem with:
import org.apache.spark.{SparkConf, SparkContext}

object App {
  def main(args: Array[String]): Unit = {
    val input = Seq((List("A", "B", "C"), "X"), (List("C", "D", "E"), "Y"))
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(input)
    // Pair every element of the inner list with that list's label
    val result = rdd.flatMap {
      case (list, label) => list.map((_, label))
    }
    result.foreach(println)
    sc.stop()
  }
}
This will output (the order may differ between runs, because the partitions are printed in parallel):
(C,Y)
(D,Y)
(A,X)
(B,X)
(E,Y)
(C,X)
I think the RDD method flatMapValues suits this case best.
val input = List((List("A", "B", "C"), "X"), (List("A", "B", "C"), "Y"))
val rdd = sc.parallelize(input)
val output = rdd.map(x => (x._2, x._1)).flatMapValues(x => x)
This pairs X with every value in List(A, B, C), and likewise for Y, giving an RDD of the pairs (X,A), (X,B), (X,C), (Y,A), (Y,B), (Y,C). Note that the label comes first in each pair; add .map(_.swap) if you need (A,X) as in the question.
val l = (List(1, 2, 3), "A")
val result = l._1.map((_, l._2))
println(result)
Will give you:
List((1,A), (2,A), (3,A))
Using beautiful for comprehensions and making the parameters generic
def convert[F, S](input: (List[F], S)): List[(F, S)] = {
for {
x <- input._1
} yield {
(x, input._2)
}
}
A sample call (note the explicit tuple, since convert takes a single pair as its argument)
convert((List(1, 2, 3), "A"))
will give you
List((1,A), (2,A), (3,A))
Related
val adjList = Map("Logging" -> List("Networking", "Game"))
// val adjList: Map[String, List[String]] = Map(Logging -> List(Networking, Game))
adjList.flatMap { case (v, vs) => vs.map(n => (v, n)) }.toList
// val res7: List[(String, String)] = List((Logging,Game))
adjList.map { case (v, vs) => vs.map(n => (v, n)) }.flatten.toList
// val res8: List[(String, String)] = List((Logging,Networking), (Logging,Game))
I am not sure what is happening here. I was expecting the same result from both of them.
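In the first case, .flatMap is Map's own .flatMap: the function returns key-value pairs, so the result is built as a Map. In the second case, the function returns a List rather than a pair, so .map falls back to Iterable's .map and the result is an Iterable of Lists.
Because a Map holds only one value per key, "Logging" -> "Networking" and "Logging" -> "Game" collapse into just the latter, "Logging" -> "Game".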
val adjList: Map[String, List[String]] = Map("Logging" -> List("Networking", "Game"))
val x0: Map[String, String] = adjList.flatMap { case (v, vs) => vs.map(n => (v, n)) }
//Map(Logging -> Game)
val x: List[(String, String)] = x0.toList
//List((Logging,Game))
val adjList: Map[String, List[String]] = Map("Logging" -> List("Networking", "Game"))
val y0: immutable.Iterable[List[(String, String)]] = adjList.map { case (v, vs) => vs.map(n => (v, n)) }
//List(List((Logging,Networking), (Logging,Game)))
val y1: immutable.Iterable[(String, String)] = y0.flatten
//List((Logging,Networking), (Logging,Game))
val y: List[(String, String)] = y1.toList
//List((Logging,Networking), (Logging,Game))
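If the goal is to keep every pair, one option is to leave the Map before flattening, so that the result is built as a List instead. A minimal sketch in plain Scala (the names viaList and viaIterator are made up for illustration):

```scala
val adjList: Map[String, List[String]] = Map("Logging" -> List("Networking", "Game"))

// Converting to a List first means flatMap builds a List, not a Map,
// so pairs that share a key are all kept.
val viaList: List[(String, String)] =
  adjList.toList.flatMap { case (v, vs) => vs.map(n => (v, n)) }

// The same pairs, written as a for comprehension over the Map's iterator.
val viaIterator: List[(String, String)] =
  (for {
    (v, vs) <- adjList.iterator
    n <- vs
  } yield (v, n)).toList
```

Both values contain (Logging,Networking) and (Logging,Game), because neither intermediate collection deduplicates by key.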
Also https://users.scala-lang.org/t/map-flatten-flatmap/4180
I have the following:
val x: List[(String, Int)] = List(("mealOne", 2), ("mealTwo", 1), ("mealThree", 2))
I want to replace each String with its price (a Double), using the values below and a map:
val mealOne = 5.99; val mealTwo = 6.99; val mealThree = 7.99
x.map{ x => if (x._1 == "mealOne") mealOne
else if (x._1 == "mealTwo") mealTwo
else mealThree
}
Result:
List[Double] = List(5.99, 6.99, 7.99)
but I want this:
List[(Double, Int)] = List((5.99,2), (6.99,1), (7.99,2))
How can I achieve this?
Thanks
Just don't drop the second component of the tuple then:
x.map{ x => (
if (x._1 == "mealOne") mealOne
else if (x._1 == "mealTwo") mealTwo
else mealThree,
x._2
)}
Of course, this works for arbitrary mappings from Strings to Doubles:
def replaceNamesByPrices(
nameToPrice: String => Double,
xs: List[(String, Int)]
): List[(Double, Int)] =
for ((name, amount) <- xs) yield (nameToPrice(name), amount)
so that you can then store the mapping of names to prices in a map, i.e.
val priceTable = Map(
"mealOne" -> 42.99,
"mealTwo" -> 5.99,
"mealThree" -> 2345.65
)
so that
replaceNamesByPrices(priceTable, x)
yields the desired result.
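One thing to watch when a Map is used as the lookup function: applying it to a name that is not in the table throws NoSuchElementException. A hedged sketch of two ways to handle that, using made-up sample data (prices, orders, and the unknownMeal entry are assumptions for illustration):

```scala
val prices: Map[String, Double] =
  Map("mealOne" -> 5.99, "mealTwo" -> 6.99, "mealThree" -> 7.99)

val orders: List[(String, Int)] =
  List(("mealOne", 2), ("mealTwo", 1), ("unknownMeal", 4))

// Option 1: drop orders whose name has no price.
// prices.get returns an Option, and flatMap discards the None cases.
val priced: List[(Double, Int)] =
  orders.flatMap { case (name, amount) =>
    prices.get(name).map(price => (price, amount))
  }

// Option 2: keep every order and substitute a fallback price (0.0 here).
val withDefault: List[(Double, Int)] =
  orders.map { case (name, amount) => (prices.getOrElse(name, 0.0), amount) }
```

Which option fits depends on whether an unknown name is an error to surface or data to skip; the fallback of 0.0 is only a placeholder.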
This works as follows (simplified further, thanks to Andrey Tyukin):
for(m<-x) yield (y(m._1),m._2)
for((m,n)<-x) yield (y(m),n)
or
x.map(t=>(y(t._1),t._2))
x.map{case (m,n)=>(y(m),n)}
Your lists (input), with y changed to a Map:
val x = List(("mealOne",2), ("mealTwo",1), ("mealThree",2))
val y = Map(("mealOne",5.99), ("mealTwo",6.99), ("mealThree",7.99))
In Scala REPL:
scala> for(m<-x) yield (y(m._1),m._2)
res35: List[(Double, Int)] = List((5.99,2), (6.99,1), (7.99,2))
scala> for((m,n)<-x) yield (y(m),n)
res60: List[(Double, Int)] = List((5.99,2), (6.99,1), (7.99,2))
scala> x.map(t=>(y(t._1),t._2))
res57: List[(Double, Int)] = List((5.99,2), (6.99,1), (7.99,2))
scala> x.map{case (m,n)=>(y(m),n)}
res59: List[(Double, Int)] = List((5.99,2), (6.99,1), (7.99,2))
If the format of the input is
(x1,(a,b,c,List(key1, key2)))
(x2,(a,b,c,List(key3)))
and I would like to achieve this output
(key1,(a,b,c,x1))
(key2,(a,b,c,x1))
(key3,(a,b,c,x2))
Here is the code:
var hashtags = joined_d.map(x => (x._1, (x._2._1._1, x._2._2, x._2._1._4, getHashTags(x._2._1._4))))
var hashtags_keys = hashtags.map(x => if(x._2._4.size == 0) (x._1, (x._2._1, x._2._2, x._2._3, 0)) else
x._2._4.map(y => (y, (x._2._1, x._2._2, x._2._3, 1))))
The function getHashTags() returns a list. If the list is not empty, we want to use each element in the list as a new key. How should I go about this?
With rdd created as:
val rdd = sc.parallelize(
Seq(
("x1",("a","b","c",List("key1", "key2"))),
("x2", ("a", "b", "c", List("key3")))
)
)
You can use flatMap like this:
rdd.flatMap{ case (x, (a, b, c, list)) => list.map(k => (k, (a, b, c, x))) }.collect
// res12: Array[(String, (String, String, String, String))] =
// Array((key1,(a,b,c,x1)),
// (key2,(a,b,c,x1)),
// (key3,(a,b,c,x2)))
Here's one way to do it:
val rdd = sc.parallelize(Seq(
("x1", ("a", "b", "c", List("key1", "key2"))),
("x2", ("a", "b", "c", List("key3")))
))
val rdd2 = rdd.flatMap{
case (x, (a, b, c, l)) => l.map( (_, (a, b, c, x) ) )
}
rdd2.collect
// res1: Array[(String, (String, String, String, String))] = Array((key1,(a,b,c,x1)), (key2,(a,b,c,x1)), (key3,(a,b,c,x2)))
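Both answers drop records whose list is empty, because flatMap over an empty list produces nothing. If such records should instead be kept under their original key, as the attempt in the question suggests, an extra case can handle that. A sketch on a plain List rather than an RDD (the sample data and the choice to reuse x as the key are assumptions):

```scala
val records = List(
  ("x1", ("a", "b", "c", List("key1", "key2"))),
  ("x2", ("a", "b", "c", List.empty[String]))
)

val rekeyed: List[(String, (String, String, String, String))] =
  records.flatMap {
    // Empty list: emit a single record that keeps the original key
    case (x, (a, b, c, Nil)) => List((x, (a, b, c, x)))
    // Non-empty list: emit one record per element, keyed by that element
    case (x, (a, b, c, keys)) => keys.map(k => (k, (a, b, c, x)))
  }
```

The same two cases should carry over to rdd.flatMap unchanged, since the function passed in only deals with one record at a time.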
I'm running a left join in a Spark RDD but sometimes I get an output like this:
(k, (v, Some(w)))
or
(k, (v, None))
How do I make it give me back just
(k, (v, (w)))
or
(k, (v, ()))
Here is how I'm combining the two files:
def formatMap3(
left: String = "", right: String = "")(m: String = "") = {
val items = m.map{k => {
s"$k"}}
s"$left$items$right"
}
val combPrdGrp = custPrdGrp3.leftOuterJoin(cmpgnPrdGrp3)
val combPrdGrp2 = combPrdGrp.groupByKey
val combPrdGrp3 = combPrdGrp2.map { case (n, list) =>
val formattedPairs = list.map { case (a, b) => s"$a $b" }
s"$n ${formattedPairs.mkString}"
}
If you're just interested in getting formatted output without the Somes/Nones, then something like this should work:
val combPrdGrp3 = combPrdGrp2.map { case (n, list) =>
val formattedPairs = list.map {
case (a, Some(b)) => s"$a $b"
case (a, None) => s"$a ()"
}
s"$n ${formattedPairs.mkString}"
}
If you have other uses in mind then you probably need to provide more details.
The leftOuterJoin() function in Spark returns tuples containing the join key, the left set's value, and an Option of the right set's value. To extract the value from the Option, call getOrElse() with a default on the right set's value in the resulting RDD. In the example below the default is the unit value (), which is why the value type widens to AnyVal:
scala> val rdd1 = sc.parallelize(Array(("k1", 4), ("k4", 7), ("k8", 10), ("k6", 1), ("k7", 4)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[13] at parallelize at <console>:21
scala> val rdd2 = sc.parallelize(Array(("k5", 4), ("k4", 3), ("k0", 2), ("k6", 5), ("k1", 6)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[14] at parallelize at <console>:21
scala> val rdd_join = rdd1.leftOuterJoin(rdd2).map { case (a, (b, c: Option[Int])) => (a, (b, (c.getOrElse(())))) }
rdd_join: org.apache.spark.rdd.RDD[(String, (Int, AnyVal))] = MapPartitionsRDD[18] at map at <console>:25
scala> rdd_join.take(5).foreach(println)
...
(k4,(7,3))
(k6,(1,5))
(k7,(4,()))
(k8,(10,()))
(k1,(4,6))
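If the values are needed for further computation rather than just for display, a default of the same type keeps the result typed as Int instead of AnyVal. A minimal sketch on a local List that stands in for the collected join result (the sample data and the default of 0 are assumptions):

```scala
// Stand-in for the output of leftOuterJoin, collected to a local List
val joined: List[(String, (Int, Option[Int]))] =
  List(("k4", (7, Some(3))), ("k7", (4, None)))

// A default of the same type keeps the right-hand value an Int
val withDefault: List[(String, (Int, Int))] =
  joined.map { case (k, (v, w)) => (k, (v, w.getOrElse(0))) }

// For display only: render a missing value as "()"
val rendered: List[String] =
  joined.map { case (k, (v, w)) =>
    val right = w.map(_.toString).getOrElse("()")
    s"($k,($v,$right))"
  }
```

Separating the typed version from the rendered one avoids carrying a unit value through later transformations.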
What's the idiomatic way to call map over a collection producing 0 or 1 result per entry?
Suppose I have:
val data = Array("A", "x:y", "d:e")
What I'd like as a result is:
val target = Array(("x", "y"), ("d", "e"))
(drop anything without a colon, split on colon and return tuples)
So in theory I think I want to do something like:
val attempt1 = data.map( arg => {
arg.split(":", 2) match {
case Array(l,r) => (l, r)
case _ => (None, None)
}
}).filter( _._1 != None )
What I'd like to do is avoid the need for the any-case and get rid of the filter.
I could do this by pre-filtering (but then I have to test the regex twice):
val attempt2 = data.filter( _.contains(":") ).map( arg => {
val Array(l,r) = arg.split(":", 2)
(l,r)
})
Last, I could use Some/None and flatMap, which does get rid of the need to filter, but is it what most Scala programmers would expect?
val attempt3 = data.flatMap( arg => {
arg.split(":", 2) match {
case Array(l,r) => Some((l,r))
case _ => None
}
})
It seems to me like there'd be an idiomatic way to do this in Scala, is there?
With a Regex extractor and collect :-)
scala> val R = "(.+):(.+)".r
R: scala.util.matching.Regex = (.+):(.+)
scala> Array("A", "x:y", "d:e") collect {
| case R(a, b) => (a, b)
| }
res0: Array[(String, String)] = Array((x,y), (d,e))
Edit:
If you want a map, you can do:
scala> val x: Map[String, String] = Array("A", "x:y", "d:e").collect { case R(a, b) => (a, b) }.toMap
x: Map[String,String] = Map(x -> y, d -> e)
If performance is a concern, you can use collection.breakOut (Scala 2.12 and earlier) as shown below to avoid creating an intermediate array:
scala> val x: Map[String, String] = Array("A", "x:y", "d:e").collect { case R(a, b) => (a, b) } (collection.breakOut)
x: Map[String,String] = Map(x -> y, d -> e)
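On Scala 2.13 and later, where collection.breakOut no longer exists, collecting over an iterator is one way to skip the intermediate array. A sketch under that assumption (the names viaIterator and viaSplit are made up), with a split-based variant for comparison:

```scala
val R = "(.+):(.+)".r
val data = Array("A", "x:y", "d:e")

// collect runs lazily over the iterator, so toMap builds the Map directly
val viaIterator: Map[String, String] =
  data.iterator.collect { case R(a, b) => (a, b) }.toMap

// Without a regex: split once, then keep only the two-element results
val viaSplit: List[(String, String)] =
  data.iterator.map(_.split(":", 2)).collect { case Array(l, r) => (l, r) }.toList
```

The two variants differ on inputs with more than one colon: the greedy regex splits "a:b:c" at the last colon, while split(":", 2) splits at the first.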