I have a function in Scala that I pass arguments to; I use it like this:
val evega = concat.map(_.split(",")).keyBy(_(0)).groupByKey().map{case (k, v) => (k, f(v))}
My function f is:
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
def f(v: Array[String]): Int = {
  val parsedDates = v.map(LocalDate.parse(_, formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}
And this is the error I get:
found : Iterable[Array[String]]
required: Array[String]
I already tried using:
val evega = concat.map(_.split(",")).keyBy(_(0)).groupByKey().map{case (k, v) => (k, for (date <- v) f(date))}
But I get massive errors.
Just to get a better picture, data in concat is:
1974,1974-06-22
1966,1966-07-20
1954,1954-06-19
1994,1994-06-27
1954,1954-06-26
2006,2006-07-04
2010,2010-07-07
1990,1990-06-30
...
It is of type RDD[String].
How can I properly iterate over that and get a single Int from that function f?
The RDD types at each step of your pipeline are:
concat.map(_.split(",")) gives an RDD[Array[String]]
for instance Array("1954", "1954-06-19")
concat.map(_.split(",")).keyBy(_(0)) gives RDD[(String, Array[String])]
for instance ("1954", Array("1954", "1954-06-19"))
concat.map(_.split(",")).keyBy(_(0)).groupByKey() gives RDD[(String, Iterable[Array[String]])]
for instance Iterable(("1954", Iterable(Array("1954", "1954-06-19"), Array("1954", "1954-06-24"))))
Thus when you map at the end, the type of values is Iterable[Array[String]].
Since your input is "1974,1974-06-22", the solution could consist of replacing your keyBy transformation with a map:
input.map(_.split(",")).map(x => x(0) -> x(1)).groupByKey().map{case (k, v) => (k, f(v))}
Indeed, .map(x => x(0) -> x(1)) (instead of .map(x => x(0) -> x) whose keyBy(_(0)) is syntactic sugar for) will provide for the value the second element of the split array instead of the array itself. Thus giving RDD[(String, String)] during this second step rather than RDD[(String, Array[String])].
My goal is to map every word in a text (index, line) to a list containing the indices of every line the word occurs in. I managed to write a function that returns a list of all words paired with a line index.
The following function should do the rest (map a list of indices to every word):
def mapIndicesToWords(l:List[(Int,String)]):Map[String,List[Int]] = ???
If I do this:
l.groupBy(x => x._2)
it returns a Map[String, List[(Int,String)]]. Now I just want to change the value to type List[Int].
I thought of using .mapValues(...) and folding the list somehow, but I'm new to Scala and don't know the correct approach for this.
So how do I convert the list?
You can also use foldLeft: you just need to specify an accumulator (in your case a Map[String, List[Int]]), which is returned as the result, and write the combining logic inside. Here is my implementation.
def mapIndicesToWords(l: List[(Int, String)]): Map[String, List[Int]] =
  l.foldLeft(Map[String, List[Int]]())((map, entry) =>
    map.get(entry._2) match {
      case Some(list) => map + (entry._2 -> (entry._1 :: list))
      case None => map + (entry._2 -> List(entry._1))
    }
  )
But with foldLeft, the elements of each list end up in reversed order, so you can use foldRight instead: just change foldLeft to foldRight and swap the input parameters, (map, entry) to (entry, map).
Be careful, though: foldRight is about twice as slow, since it is implemented by reversing the list and then applying foldLeft.
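For illustration, here is a sketch of the foldRight variant with the parameters swapped:
def mapIndicesToWords(l: List[(Int, String)]): Map[String, List[Int]] =
  l.foldRight(Map[String, List[Int]]())((entry, map) =>
    map.get(entry._2) match {
      case Some(list) => map + (entry._2 -> (entry._1 :: list))
      case None => map + (entry._2 -> List(entry._1))
    }
  )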
scala> val myMap: Map[String,List[(Int, String)]] = Map("a" -> List((1,"line1"), (2, "line")))
myMap: Map[String,List[(Int, String)]] = Map(a -> List((1,line1), (2,line)))
scala> myMap.mapValues(lst => lst.map(pair => pair._1))
res0: scala.collection.immutable.Map[String,List[Int]] = Map(a -> List(1, 2))
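Putting the two steps together, a minimal sketch of the whole function (on Scala 2.13, mapValues returns a view, so you would add .toMap at the end):
def mapIndicesToWords(l: List[(Int, String)]): Map[String, List[Int]] =
  l.groupBy(_._2).mapValues(_.map(_._1))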
I am trying some basic logic using Scala. I tried the code below, but it throws an error.
scala> val data = ("HI",List("HELLO","ARE"))
data: (String, List[String]) = (HI,List(HELLO, ARE))
scala> data.flatmap( elem => elem)
<console>:22: error: value flatmap is not a member of (String, List[String])
data.flatmap( elem => elem)
Expected Output :
(HI,HELLO,ARE)
Could someone help me fix this issue?
You are trying to flatMap over a tuple, which won't work. The following will work:
val data = List(List("HI"),List("HELLO","ARE"))
val a = data.flatMap(x => x)
This is trivial in Scala:
val data = ("HI",List("HELLO","ARE"))
println( data._1 :: data._2 )
What exact data structure are you working with?
If you are clear about your data structure:
type rec = (String, List[String])
val data : rec = ("HI",List("HELLO","ARE"))
val f = ( v: (String, List[String]) ) => v._1 :: v._2
f(data)
A couple of observations:
Currently there is no flatten method for tuples (unless you use shapeless).
flatMap cannot be applied directly to a list whose elements are a mix of plain values and collections.
In your case, you can make the element "HI" part of a List:
val data = List(List("HI"), List("HELLO","ARE"))
data.flatMap(identity)
Or, you can define a function to handle your mixed element types accordingly:
val data = List("HI", List("HELLO","ARE"))
def flatten(l: List[Any]): List[Any] = l.flatMap{
case x: List[_] => flatten(x)
case x => List(x)
}
flatten(data)
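For the data above, flatten(data) should yield List(HI, HELLO, ARE), typed as List[Any].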
You are trying to flatMap on a Tuple2, which is not available in the current API.
If you don't want to change your input, you can extract the first value from the tuple and then the individual elements of the second value (the list), as below:
val data = ("HI",List("HELLO","ARE"))
val output = (data._1, data._2(0), data._2(1))
println(output)
If that's what you want:
val data = ("HI",List("HELLO,","ARE").mkString(""))
println(data)
>>(HI,HELLO,ARE)
I am looking for a way to join two lists of tuples in Scala to get the same result that Apache Spark gives me using the join function.
Example:
Having two lists of tuples such as:
val l1 = List((1,1),(1,2),(2,1),(2,2))
l1: List[(Int, Int)] = List((1,1), (1,2), (2,1), (2,2))
val l2 = List((1,(1,2)), (2,(2,3)))
l2: List[(Int, (Int, Int))] = List((1,(1,2)), (2,(2,3)))
What is the best way to join by key both list to get the following result?
l3: List[(Int, (Int, (Int, Int)))] = List((1,(1,(1,2))), (1,(2,(1,2))), (2,(1,(2,3))), (2,(2,(2,3))))
You can use a for comprehension and take advantage of backticks in the pattern match. That is, it will only match when the key from the second list is the same as the one from the first list (the pattern `k` means the key in the tuple must be equal to the value of k).
val res = for {
(k, v1) <- l1
(`k`, v2) <- l2
} yield (k, (v1, v2))
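With the l1 and l2 above, res is List((1,(1,(1,2))), (1,(2,(1,2))), (2,(1,(2,3))), (2,(2,(2,3)))), which matches the expected l3.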
I hope you find this helpful.
You might want to do something like this:
val l3=l1.map(tup1 => l2.filter(tup2 => tup1._1==tup2._1).map(tup2 => (tup1._1, (tup1._2, tup2._2)))).flatten
It matches the same indexes, creates sublists, and then combines the list of lists with flatten.
This results to:
List((1,(1,(1,2))), (1,(2,(1,2))), (2,(1,(2,3))), (2,(2,(2,3))))
Try something like this:
val l2Map = l2.toMap
val l3 = l1.flatMap { case (k, v1) => l2Map.get(k).map(v2 => (k, (v1, v2))) }
which can be rewritten in a more general form using implicits:
package some.package
import scala.collection.TraversableLike
import scala.collection.generic.CanBuildFrom

package object collection {
  implicit class PairTraversable[K, V, C[A] <: TraversableLike[A, C[A]]](val seq: C[(K, V)]) {
    def join[V2, C2[A] <: TraversableLike[A, C2[A]]](other: C2[(K, V2)])
        (implicit canBuildFrom: CanBuildFrom[C[(K, V)], (K, (V, V2)), C[(K, (V, V2))]]): C[(K, (V, V2))] = {
      val otherMap = other.toMap
      seq.flatMap { case (k, v1) => otherMap.get(k).map(v2 => (k, (v1, v2))) }
    }
  }
}
and then simply:
import some.package.collection.PairTraversable
val l3 = l1.join(l2)
This solution converts the second sequence to a map (so it consumes some additional memory), but it is much faster than the solutions in the other answers (compare them on large collections, e.g. 10000 elements; on my laptop it is 5 ms vs 2500 ms).
A little late. This solution keeps the original size of l1 and returns None for values missing in l2 (a left join instead of an inner join).
val m2 = l2.map{ case(k,v) => (k -> v)}.toMap
val res2 = l1.map { case(k,v) =>
val v2 = m2.get(k)
(k, (v, v2))
}
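Note that v2 here is an Option[(Int, Int)], so res2 has type List[(Int, (Int, Option[(Int, Int)]))]; keys of l1 that are missing from l2 are paired with None.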
I have a PartialFunction[String,String] and a Map[String,String].
I want to apply the partial function to the map values and collect the entries for which it was applicable.
i.e. given:
val m = Map( "a"->"1", "b"->"2" )
val pf : PartialFunction[String,String] = {
case "1" => "11"
}
I'd like to somehow combine _._2 with pf and be able to do this:
val composedPf : PartialFunction[(String,String),(String,String)] = /*someMagicalOperator(_._2,pf)*/
val collected : Map[String,String] = m.collect( composedPf )
// collected should be Map( "a"->"11" )
so far the best I got was this:
val composedPf = new PartialFunction[(String,String),(String,String)]{
override def isDefinedAt(x: (String, String)): Boolean = pf.isDefinedAt(x._2)
override def apply(v1: (String, String)): (String,String) = v1._1 -> pf(v1._2)
}
is there a better way?
Here is the magical operator:
val composedPf: PartialFunction[(String, String), (String, String)] =
{case (k, v) if pf.isDefinedAt(v) => (k, pf(v))}
Another option, without creating a composed function, is this:
m.filter(e => pf.isDefinedAt(e._2)).mapValues(pf)
There is a function in Scalaz that does exactly that: second
scala> m collect pf.second
res0: scala.collection.immutable.Map[String,String] = Map(a -> 11)
This works, because PartialFunction is an instance of Arrow (a generalized function) typeclass, and second is one of the common operations defined for arrows.
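In other words, pf.second lifts pf to a partial function on pairs that applies pf to the second element and passes the first through unchanged, which is exactly the shape collect needs for a Map.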
This is a follow-up question from here. I am trying to implement k-means based on this implementation. It works great, but I would like to replace groupByKey() with reduceByKey(), and I am not sure how (I am not worried about performance now). Here is the relevant minified code:
val data = sc.textFile("dense.txt").map(
t => (t.split("#")(0), parseVector(t.split("#")(1)))).cache()
val read_mean_centroids = sc.textFile("centroids.txt").map(
t => (t.split("#")(0), parseVector(t.split("#")(1))))
var centroids = read_mean_centroids.takeSample(false, K, 42).map(x => x._2)
do {
var closest = read_mean_centroids.map(p => (closestPoint(p._2, centroids), p._2))
var pointsGroup = closest.groupByKey() // <-- THE VICTIM :)
var newCentroids = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()
..
Notice that println(newCentroids) will give:
Map(23 -> (-6.269305E-4, -0.0011746404, -4.08004E-5), 8 -> (-5.108732E-4, 7.336348E-4, -3.707591E-4), 17 -> (-0.0016383086, -0.0016974678, 1.45..
and println(closest):
MapPartitionsRDD[6] at map at kmeans.scala:75
Relevant question: Using reduceByKey in Apache Spark (Scala).
Some documentation:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Merge the values for each key using an associative reduce function.
def reduceByKey(func: (V, V) ⇒ V, numPartitions: Int): RDD[(K, V)]
Merge the values for each key using an associative reduce function.
def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]
Merge the values for each key using an associative reduce function.
def groupByKey(): RDD[(K, Iterable[V])]
Group the values for each key in the RDD into a single sequence.
You could use an aggregateByKey() (a bit more natural than reduceByKey()) like this to compute newCentroids:
val newCentroids = closest.aggregateByKey((Vector.zeros(dim), 0L))(
(agg, v) => (agg._1 += v, agg._2 + 1L),
(agg1, agg2) => (agg1._1 += agg2._1, agg1._2 + agg2._2)
).mapValues(agg => agg._1/agg._2).collectAsMap
For this to work you will need to compute the dimensionality of your data, i.e. dim, but you only need to do this once. You could probably use something like val dim = data.first._2.length.
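If you specifically want reduceByKey(), here is a sketch of the same idea, assuming a point type that supports element-wise + and division by a scalar:
// Pair each point with a count of 1, merge sums and counts pairwise,
// then divide the summed vector by the count (assumes the point type
// supports + and / with a scalar).
val newCentroids = closest
  .mapValues(p => (p, 1L))
  .reduceByKey { case ((v1, c1), (v2, c2)) => (v1 + v2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count.toDouble }
  .collectAsMap()
Each point is paired with a count of 1, the sums and counts are merged pairwise, and the mean is taken at the end.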