Given a constant value and a potentially long Sequence:
val a: String = "A"
val bs = List(1, 2, 3)
How can you most efficiently construct a Sequence of tuples whose first element equals a?
Seq(
( "A", 1 ),
( "A", 2 ),
( "A", 3 )
)
Just use a map:
val list = List(1,2,3)
list.map(("A",_))
Output:
res0: List[(String, Int)] = List((A,1), (A,2), (A,3))
Since the most efficient approach is to pass just the sequence on to the downstream receiver and let it tuple the elements lazily, I'd do it with views.
val first = "A"
val bs = (1 to 1000000).view
further( bs.map((first, _)) )
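For illustration, here is a minimal sketch of why the view helps: nothing is materialized until the result is consumed (the large range just stands in for any long sequence):
val first = "A"
val pairs = (1 to 1000000).view.map((first, _)) // no tuples built yet
pairs.take(3).toList // List((A,1), (A,2), (A,3)); only three tuples are ever created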
You can do it using map, just like in the answer provided by @Pedro, or you can use for and yield as below:
val list = List(1,2,3)
val tuple = for {
i <- list
} yield ("A",i)
println(tuple)
Output:
List((A,1), (A,2), (A,3))
You are also asking about the efficient way in your question. Different developers have different opinions between the efficiency of for and map. So, I guess going through the links below gives you more knowledge about the efficiency part.
for vs map in functional programming
Scala style: for vs foreach, filter, map and others
Getting the desugared part of a Scala for/comprehension expression?
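For what it's worth, the compiler desugars this particular for comprehension into exactly the map call from the first answer, so the two versions should perform identically:
// The for/yield above compiles down to:
list.map(i => ("A", i))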
I'm using SQLActionBuilder, e.g. sql"""select ...""", to create a common/wide SQL query, and I don't care how many result columns it has.
The documentation's examples use as[TupleX] to decide the result type; in my case I want to use List[String] in place of the TupleX type.
I have attempted sQLActionBuilder.as[List[String]] but I run into a compile error:
Error:(8, 187) could not find implicit value for parameter rconv: slick.jdbc.GetResult[List[String]]
val x = reportUtilRepository.find(List())("td_report_region")(1469635200000L, 1475251200000L)("region" :: Nil, "clicks" :: "CPC" :: Nil)(List("1", "2"), List("regionType" -> "1"))(_.as[List[String]]).map(x => println(x.toString))
and sQLActionBuilder.as[List[(String, String, String)]] works well. So how can I use List[String] to match a result with an arbitrary number of columns?
I think a straightforward way is to implement a GetResult[List[String]] as the compiler suggests, but I don't know how to do it. Other approaches are also welcome.
Thanks.
First of all, querying a database always returns a list of tuples, so the result type will be List[TupleX]: each row is represented as a record, and the columns of each row become the respective tuple elements.
Therefore your data will look like List((1,2,3),(3,4,5)), whose type is List[(Int, Int, Int)]. To produce a List[Int] you might do the following:
val a = List((1, 2, 3), (3, 4, 5))
a.flatMap { case (x, y, z) => List(x, y, z) }
res0: List[Int] = List(1, 2, 3, 3, 4, 5)
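That said, the compiler hint points at the direct route: you can supply the missing implicit yourself. A minimal sketch, assuming Slick's PositionedResult API (where numColumns and nextString read the current row):
import slick.jdbc.{GetResult, PositionedResult}

// Reads every column of the current row as a String, however many there are.
implicit val getListString: GetResult[List[String]] =
  GetResult { r: PositionedResult =>
    (1 to r.numColumns).map(_ => r.nextString()).toList
  }

// With this implicit in scope, sQLActionBuilder.as[List[String]] should compile.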
I have a Spark RDD[Seq[(String,String)]] which contains several groups of word pairs. Now I have to save them to a file in HDFS like this (no matter which Seq they are in):
dog cat
cat mouse
mouse milk
Could someone help me with this? Thanks a lot <3
EDIT:
Thanks for your help. Here is the solution
Code
val seqTermTermRDD: RDD[Seq[(String, String)]] = ...
val termTermRDD: RDD[(String, String)] = seqTermTermRDD.flatMap(identity)
val combinedTermsRDD: RDD[String] = termTermRDD.map{ case(term1, term2) => term1 + " " + term2 }
combinedTermsRDD.saveAsTextFile(outputFile)
RDDs have a neat function called flatMap that will do exactly what you want. Think of it as a map followed by a flatten (except implemented a little more intelligently): if the function produces multiple elements, each is added to the result separately. (You can also use this on many other collection-like objects in Scala.)
val seqRDD = sc.parallelize(Seq(Seq(("dog","cat"),("cat","mouse"),("mouse","milk"))),1)
val tupleRDD = seqRDD.flatMap(identity)
tupleRDD.collect //Array((dog,cat), (cat,mouse), (mouse,milk))
Note that I also use the standard identity function, because flatMap expects a function that turns an element of the RDD's type into a TraversableOnce, which a Seq counts as.
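From there, formatting and saving is one more map (the output path below is just a placeholder):
tupleRDD
  .map { case (term1, term2) => term1 + " " + term2 }
  .saveAsTextFile("hdfs:///path/to/output") // hypothetical path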
You can also use the mkString(sep) function (where sep is the separator) on Scala collections. Here are some examples (note that in your code you would replace the final .collect().mkString("\n") with saveAsTextFile(filepath) to save to Hadoop):
scala> val rdd = sc.parallelize(Seq( Seq(("a", "b"), ("c", "d")), Seq( ("1", "2"), ("3", "4") ) ))
rdd: org.apache.spark.rdd.RDD[Seq[(String, String)]] = ParallelCollectionRDD[6102] at parallelize at <console>:71
scala> rdd.map( _.mkString("\n")) .collect().mkString("\n")
res307: String =
(a,b)
(c,d)
(1,2)
(3,4)
scala> rdd.map( _.mkString("|")) .collect().mkString("\n")
res308: String =
(a,b)|(c,d)
(1,2)|(3,4)
scala> rdd.map( _.mkString("\n")).map(_.replace("(", "").replace(")", "").replace(",", " ")) .collect().mkString("\n")
res309: String =
a b
c d
1 2
3 4
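As a side note on the last variant: rather than patching up the default tuple formatting with replace, it may be cleaner to format each pair directly and emit one line per pair (a sketch over the same rdd, with a placeholder path):
rdd
  .flatMap(_.map { case (a, b) => s"$a $b" })
  .saveAsTextFile("hdfs:///path/to/output") // hypothetical path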
I have two lists
val list1 = List((List("AAA"),"B1","C1"),(List("BBB"),"B2","C2"))
val list2 = List(("AAA",List("a","b","c")),("BBB",List("c","d","e")))
I want to match the first element of each tuple in list2 with the first element in list1 and get a combined list.
I want the output to be:
List((List("AAA"),"B1","C1",List("a","b","c")))
How can I get the above output in Scala?
This is what I came up with:
scala> val l1 = List((List("AAA"),"B1","C1"),(List("BBB"),"B2","C2"))
l1: List[(List[String], String, String)] = List((List(AAA),B1,C1), (List(BBB),B2,C2))
scala> val l2 = List((List("AAA"), List("a", "b", "c")), (List("BBB"), List("c", "d", "e")))
l2: List[(String, List[String])] = List((AAA,List(a, b, c)), (BBB,List(c, d, e)))
scala> l1.collectFirst {
| case tp => l2.find(tp2 => tp2._1.head == tp._1.head).map(founded => (tp._1, tp._2, tp._3, founded._2))
| }.flatten
res2: Option[(List[String], String, String, List[String])] = Some((List(AAA),B1,C1,List(a, b, c)))
You can use collectFirst to skip the values you don't want; for each tuple you use find on the second list and map the match into the tuple you want.
A couple of notes: this is ugly. I don't know how you ended up with a Tuple4 in the first place, and I personally dislike all that tp._* notation because it's hard to read; consider using case classes to wrap all of this into a more manageable structure. Also, I had to use .head, which throws an exception on an empty list, so you may want to add a check before that. But as I said, I would review the code entirely rather than spend time building on a flawed structure.
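For example, a hypothetical case-class version of the same lookup (the names Record and Mapping are invented for illustration):
case class Record(keys: List[String], b: String, c: String)
case class Mapping(key: String, values: List[String])

val records  = List(Record(List("AAA"), "B1", "C1"), Record(List("BBB"), "B2", "C2"))
val mappings = List(Mapping("AAA", List("a", "b", "c")), Mapping("BBB", List("c", "d", "e")))

// headOption/contains avoid the .head exceptions mentioned above.
val combined = records.headOption.flatMap { r =>
  mappings.find(m => r.keys.headOption.contains(m.key))
          .map(m => (r.keys, r.b, r.c, m.values))
}
// combined: Option[(List[String], String, String, List[String])] = Some((List(AAA),B1,C1,List(a, b, c)))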
You can use zip to combine both lists:
val list1 = List((List("AAA"),"B1","C1"),(List("BBB"),"B2","C2"))
val list2 = List(("AAA",List("a","b","c")),("BBB",List("c","d","e")))
val combinedList = (list1 zip list2)
combinedList.head will give you the first combined element, though note that zip produces a nested pair ((List(AAA),B1,C1),(AAA,List(a,b,c))) rather than the flat tuple shown in the desired output.
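To flatten the zipped pairs into the requested shape, one map over the result does it (a sketch that assumes the two lists line up by position):
val flattened = (list1 zip list2).map {
  case ((keys, b, c), (_, values)) => (keys, b, c, values)
}
// flattened.head: (List(AAA),B1,C1,List(a, b, c))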
Below is a data structure of a List of tuples, of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences for each id, i.e. sum the Int values. So the above data structure should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List in this example for local testing, but the same solution should be compatible with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get the format I expect, which is of type
RDD[(String, Seq[(String, Int)])]
corresponding to RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is slightly better when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (i.e. all the items sharing the same id) to know the count.
@vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs partition f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a Map whose keys are the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = tuple._1
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
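Putting it together, a sketch of the whole pipeline (untested here, per the caveat above):
val data3 = List(("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1))

val counted = data3
  .groupBy(_._1) // Map[String, List[(String, String, Int)]]
  .map { case (id, xs) => (id, xs.head._2, xs.map(_._3).sum) }
  .toList
// counted: List[(String, String, Int)] = List((id1,a,3), (id2,a,1)) (group order not guaranteed)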
The following is the most readable, efficient and scalable
data.map {
case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]; a final map { case ((k1, k2), v) => (k1, k2, v) } recovers RDD[(String, String, Int)] if you need it. By using reduceByKey the summation will parallelize, i.e. for very large groups it will be distributed and the summation will happen on the map side. Think about the case where there are only 10 groups but billions of records; using .sum won't scale, as it will only be able to distribute to 10 cores.
A few more notes about the other answers:
Using head here is unnecessary: the grouping key already carries both fields, so .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum)) can be written as .map { case ((id, name), v) => (id, name, v.map(_._3).sum) }
Using a foldLeft here is really horrible when the above shows .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
I have two lists that I zip, and I go through the zipped result and call a function for each pair. This function returns a List of Strings as its response. I now want to collect all the responses, and I do not want some sort of buffer that collects the responses on each iteration.
seq1.zip(seq2).foreach((x: (Obj1, Obj1)) => {
  callMethod(x._1, x._2) // This method returns a Seq of String when called
})
What I want to avoid is to create a ListBuffer and keep collecting it. Any clues to do it functionally?
Why not use map() to transform each input into a corresponding output? Here's map() operating in a simple scenario:
scala> val l = List(1,2,3,4,5)
scala> l.map( x => x*2 )
res60: List[Int] = List(2, 4, 6, 8, 10)
so in your case it would look something like:
seq1.zip(seq2).map((x: (Obj1, Obj1)) => callMethod(x._1, x._2))
Given that your function returns a Seq of Strings, you could use flatMap() to flatten the results into one sequence.
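A sketch of that, reusing Obj1 and callMethod from the question:
val allResponses: Seq[String] =
  seq1.zip(seq2).flatMap { case (a, b) => callMethod(a, b) }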