I have three vals, each of type Array[String].
They are all equal in length:
val1.length == val2.length // true
Next, I created a case class as follows:
case class resource(name: String, count: Int, location: String)
I want to create a List[resource] in which each element is built from the corresponding elements of the vals, i.e., val1, val2, and val3.
Something like this:
val newList: List[resource] = List(resource(val1(0), val2(0).toInt, val3(0)),
resource(val1(1), val2(1).toInt, val3(1)),
...
resource(val1(val1.length - 1), val2(val2.length - 1).toInt, val3(val3.length - 1)))
I'm not sure how to proceed. Do I use flatMap, foreach, for-loops, or something else?
The idea is to create this abovementioned newList and compare it to the result obtained from a SQL database using doobie.
val comparator = sql"sql statement".query[resource]
comparator.to[List].transact(xa).unsafeRunSync()
You can zip your arrays, which combines the corresponding elements of the zipped sequences into tuples, and then map the resource.apply method over the combined sequence:
val val1: Array[String] = Array("name 1", "name 2", "name 3")
val val2: Array[String] = Array("1", "2", "3")
val val3: Array[String] = Array("loc 1", "loc 2", "loc 3")
scala> (val1, val2.map(_.toInt), val3).zipped.map(resource)
res1: Array[resource] = Array(resource(name 1,1,loc 1), resource(name 2,2,loc 2), resource(name 3,3,loc 3))
You can then convert this Array to List if needed:
scala> (val1, val2.map(_.toInt), val3).zipped.map(resource).toList
res2: List[resource] = List(resource(name 1,1,loc 1), resource(name 2,2,loc 2), resource(name 3,3,loc 3))
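For what it's worth, `.zipped` on tuples is deprecated as of Scala 2.13 in favour of `lazyZip`. Below is a minimal sketch that avoids both by calling plain `zip` twice, which should behave the same on 2.12 and 2.13. The capitalised `Resource` class and the `names`, `counts` and `locations` vals are illustrative stand-ins for the question's `resource`, `val1`, `val2` and `val3`.

```scala
case class Resource(name: String, count: Int, location: String)

val names: Array[String] = Array("name 1", "name 2", "name 3")
val counts: Array[String] = Array("1", "2", "3")
val locations: Array[String] = Array("loc 1", "loc 2", "loc 3")

// zip pairs elements by position; zipping twice yields nested pairs
// of the form ((name, count), location), which the pattern unpacks.
val resources: List[Resource] =
  names.zip(counts).zip(locations).map {
    case ((name, count), location) => Resource(name, count.toInt, location)
  }.toList
```

Like `zipped`, `zip` silently stops at the shortest input, so it is worth checking the lengths first if a mismatch would indicate bad data.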
Let's assume you have three arrays such as:
case class resource(name: String, count: Int, location: String)
val list1 = Array("s1","s2","s3")
val list2 = Array("1","2","3")
val list3 = Array("s4","s5","s6")
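You can use a for comprehension over the indices to get the desired list of resource. (Putting all three arrays in the same for comprehension as separate generators would instead produce every combination of their elements, 27 here, rather than one element per position.)
val result = (for (i <- list1.indices)
yield resource(list1(i), list2(i).toInt, list3(i))).toList
Note: this code works only if every element of the second array parses as an integer; otherwise toInt throws a NumberFormatException.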
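If the NumberFormatException is a concern, one option is to wrap the conversion in `scala.util.Try`. This is only a sketch: the `Resource` class and the `names`, `counts` and `locations` vals are made up for illustration, and silently dropping unparseable rows is an assumption about what you want, not a recommendation.

```scala
import scala.util.Try

case class Resource(name: String, count: Int, location: String)

val names = Array("s1", "s2", "s3")
val counts = Array("1", "oops", "3") // the second entry is deliberately not a number
val locations = Array("s4", "s5", "s6")

// Try(...).toOption turns a failed parse into None instead of throwing.
val parsed: List[Option[Resource]] =
  names.indices.toList.map { i =>
    Try(counts(i).toInt).toOption.map(n => Resource(names(i), n, locations(i)))
  }

// flatten keeps only the rows whose count parsed successfully.
val valid: List[Resource] = parsed.flatten
```

Keeping `parsed` around as a `List[Option[Resource]]` lets you see which positions failed before deciding what to do with them.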
Given a list of lists, where each list has an object that represents the key, I need to write a full outer join that combines all the lists. Each record in the resulting list is the combination of all the fields of all the lists. In case that one key is present in list 1 and not present in list 2, then the fields in list 2 should be null or empty.
One solution I thought of is to embed an in-memory database, create the tables, run a select and get the result. However, I'd like to know if there are any libraries that handle this in a simpler way. Any ideas?
For example, let's say I have two lists, where the key is the first field in the list:
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
val allLists = List (list1, list2)
The full outer joined list would be:
val allListsJoined = List ((1,2,"A"), (3,4,None), (5,6,None), (7,None,"B"))
NOTE: the solution needs to work for N lists
def fullOuterJoin[K, V1, V2](xs: List[(K, V1)], ys: List[(K, V2)]): List[(K, Option[V1], Option[V2])] = {
val map1 = xs.toMap
val map2 = ys.toMap
val allKeys = map1.keySet ++ map2.keySet
allKeys.toList.map(k => (k, map1.get(k), map2.get(k)))
}
Example usage:
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
println(fullOuterJoin(list1, list2))
Which prints:
List((1,Some(2),Some(A)), (3,Some(4),None), (5,Some(6),None), (7,None,Some(B)))
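One thing to keep in mind is that the keys come out of a Set, whose iteration order is not guaranteed in general, so the row order may differ for larger inputs. Here is a small sketch of a variant that sorts the keys, assuming the key type has an `Ordering`; the name `fullOuterJoinSorted` is made up for this example.

```scala
def fullOuterJoinSorted[K: Ordering, V1, V2](
    xs: List[(K, V1)],
    ys: List[(K, V2)]): List[(K, Option[V1], Option[V2])] = {
  val left = xs.toMap   // for duplicate keys, toMap keeps the last value
  val right = ys.toMap
  // Sort the union of keys so the output order is deterministic.
  (left.keySet ++ right.keySet).toList.sorted
    .map(k => (k, left.get(k), right.get(k)))
}

val joined = fullOuterJoinSorted(
  List((5, 6), (1, 2), (3, 4)),
  List((7, "B"), (1, "A")))
```

With these inputs the rows come back ordered by key (1, 3, 5, 7) regardless of the order of the input lists.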
Edit per suggestion in comments:
If you're interested in joining an arbitrary number of lists and don't care about type info, here's a version that does that:
def fullOuterJoin[K](xs: List[List[(K, Any)]]): List[(K, List[Option[Any]])] = {
val maps = xs.map(_.toMap)
val allKeys = maps.map(_.keySet).reduce(_ ++ _)
allKeys.toList.map(k => (k, maps.map(m => m.get(k))))
}
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
val list3 = List((1, 3.5), (7, 4.0))
val lists = List(list1, list2, list3)
println(fullOuterJoin(lists))
which outputs:
List((1,List(Some(2), Some(A), Some(3.5))), (3,List(Some(4), None, None)), (5,List(Some(6), None, None)), (7,List(None, Some(B), Some(4.0))))
If you want both an arbitrary number of lists and well-typed results, that's probably beyond the scope of a stackoverflow answer but could probably be accomplished with shapeless.
Here is a way to do it using collect separately on both lists:
val list1Ite = list1.collect{
case ele if list2.filter(e=> e._1 == ele._1).size>0 => { //if list2 _1 contains ele._1
val left = list2.find(e=> e._1 == ele._1) //find the available element
(ele._1, ele._2, left.get._2) //perform join
}
case others => (others._1, others._2, None) //others add None as _3
}
//list1Ite: List[(Int, Int, java.io.Serializable)] = List((1,2,A), (3,4,None), (5,6,None))
Do a similar operation on list2, but keep only the elements whose key does not appear in list1:
val list2Ite = list2.collect{
case ele if list1.filter(e=> e._1 == ele._1).size==0 => (ele._1, None , ele._2)
}
//list2Ite: List[(Int, None.type, String)] = List((7,None,B))
Combine list1Ite and list2Ite to get the result:
val result = list1Ite.++(list2Ite)
result: List[(Int, Any, java.io.Serializable)] = List((1,2,A), (3,4,None), (5,6,None), (7,None,B))
I am trying to read a dataset and process it. The row type is (String, String, String, Map[String, String]), and each map has between 1 and 3 keys, so one row should become 1 to 3 rows of the form (String, String, String, k, v).
I currently do this with the following code:
var arr = new ArrayBuffer[Array[String]]()
myDataset.collect.foreach{
f:(String,String,String,Map[String,String]) =>
val ma = f._4
for((k,v)<-ma) {
arr += Array(f._1,f._2,f._3,k,v)
}
}
The original data looks like this (one row of myDataset, which has hundreds of millions of rows):
val a = ("111","222","333",Map("k1"->"v1","k2"->"v2"))
Expected output:
("111","222","333","k1","v1")
("111","222","333","k2","v2")
But with this much data the code runs out of memory (OOM). Is there another way to accomplish this, or a way to optimize my code to avoid the OOM?
You can just explode the map column and then select the exploded columns:
val df = sc.parallelize(Array(
("111","222","333",Map("k1"->"v1","k2"->"v2"))
)).toDF("a", "b", "c", "d")
df.select($"*", explode($"d") )
.select("a", "b", "c" ,"key", "value")
.as[(String, String, String, String, String)]
.first
// (String, String, String, String, String) = (111,222,333,k1,v1)
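To make the row expansion concrete, here is a sketch of the same per-row logic on an ordinary Scala collection, with no Spark involved. `Row`, `rows` and `exploded` are illustrative names only. On a typed Dataset, a similar function could presumably be passed to `flatMap` so the work stays distributed instead of being collected to the driver, but that is not shown or tested here.

```scala
type Row = (String, String, String, Map[String, String])

val rows: List[Row] = List(
  ("111", "222", "333", Map("k1" -> "v1", "k2" -> "v2")),
  ("444", "555", "666", Map("k3" -> "v3"))
)

// One output row per map entry; the first three fields are repeated.
val exploded: List[(String, String, String, String, String)] =
  rows.flatMap { case (a, b, c, m) =>
    m.toList.map { case (k, v) => (a, b, c, k, v) }
  }
```

The key point is that each input row is expanded independently, so nothing needs to be accumulated in a single buffer the way the `ArrayBuffer` in the question is.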
I have two lists, "firstName" and "lastName", and I want to create a new list "fullName" by concatenating corresponding elements (Strings) from the first list and the second list, as shown below.
Input Lists:
firstName: List[String] = List("Rama","Dev")
lastName: List[String] = List("krish","pandi")
Expected output:
fullName: List[String] = List("Rama krish", "Dev pandi")
Could you please let me know how this can be achieved in functional style?
Taking this step by step, the first thing to do is to zip the lists to get a new list of pairs:
scala> val firstName: List[String] = List("Rama","Dev")
firstName: List[String] = List(Rama, Dev)
scala> val lastName: List[String] = List("krish","pandi")
lastName: List[String] = List(krish, pandi)
scala> firstName.zip(lastName)
res0: List[(String, String)] = List((Rama,krish), (Dev,pandi))
Note that if the lists aren't the same length, the longer one will be truncated.
Next you can use map to take each pair and turn it into a new string:
scala> firstName.zip(lastName).map {
| case (first, last) => s"$first $last"
| }
res1: List[String] = List(Rama krish, Dev pandi)
I'm using string interpolation here (the s"$variableName" part), but you could also just use + to concatenate the strings.
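As a sketch of the truncation behaviour noted above, and of one way around it: `zipAll` pads the shorter list with a default value instead of dropping elements. The extra name "Solo" and the vals `first`, `last`, `truncated` and `padded` are invented for this example, and padding with an empty string is just one possible choice.

```scala
val first: List[String] = List("Rama", "Dev", "Solo")
val last: List[String] = List("krish", "pandi")

// zip stops at the shorter list, so "Solo" is dropped.
val truncated: List[String] =
  first.zip(last).map { case (f, l) => f + " " + l }

// zipAll pads the shorter side with "" and trim removes the stray space.
val padded: List[String] =
  first.zipAll(last, "", "").map { case (f, l) => s"$f $l".trim }
```

Here `truncated` has two elements and `padded` has three, the last being just "Solo".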
I'd like to get the top N items per key after a groupByKey on an RDD, and convert topNPerGroup (below) to an RDD[(String, Int)] in which the List[Int] values are flattened.
The data is
val data = sc.parallelize(Seq("foo"->3, "foo"->1, "foo"->2,
"bar"->6, "bar"->5, "bar"->4))
The top N items per group are computed as:
val topNPerGroup: RDD[(String, List[Int])] = data.groupByKey.map {
case (key, numbers) =>
key -> numbers.toList.sortBy(-_).take(2)
}
The result is
(bar,List(6, 5))
(foo,List(3, 2))
which was printed by
topNPerGroup.collect.foreach(println)
If I achieve this, topNPerGroup.collect.foreach(println) will print the expected result:
(bar, 6)
(bar, 5)
(foo, 3)
(foo, 2)
I've been struggling with this same issue recently, but my need was a little different in that I needed the top K values per key with a data set like (key: Int, (domain: String, count: Long)). While your dataset is simpler, there is still a scaling/performance issue with using groupByKey, as noted in the documentation:
When called on a dataset of (K, V) pairs, returns a dataset of (K,
Iterable<V>) pairs. Note: If you are grouping in order to perform an
aggregation (such as a sum or average) over each key, using
reduceByKey or combineByKey will yield much better performance.
In my case I ran into problems very quickly because my Iterable in (K, Iterable<V>) was very large (more than 1 million elements), so sorting it and taking the top N became very expensive and created potential memory issues.
After some digging (see the references below), here is a full example using combineByKey to accomplish the same task in a way that will perform and scale.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object TopNForKey {
var SampleDataset = List(
(1, ("apple.com", 3L)),
(1, ("google.com", 4L)),
(1, ("stackoverflow.com", 10L)),
(1, ("reddit.com", 15L)),
(2, ("slashdot.org", 11L)),
(2, ("samsung.com", 1L)),
(2, ("apple.com", 9L)),
(3, ("microsoft.com", 5L)),
(3, ("yahoo.com", 3L)),
(3, ("google.com", 4L)))
//sort and trim a traversable (String, Long) tuple by _2 value of the tuple
def topNs(xs: TraversableOnce[(String, Long)], n: Int) = {
var ss = List[(String, Long)]()
var min = Long.MaxValue
var len = 0
xs foreach { e =>
if (len < n || e._2 > min) {
ss = (e :: ss).sortBy((f) => f._2)
min = ss.head._2
len += 1
}
if (len > n) {
ss = ss.tail
min = ss.head._2
len -= 1
}
}
ss
}
def main(args: Array[String]): Unit = {
val topN = 2
val sc = new SparkContext("local", "TopN For Key")
val rdd = sc.parallelize(SampleDataset).map((t) => (t._1, t._2))
//use combineByKey to allow spark to partition the sorting and "trimming" across the cluster
val topNForKey = rdd.combineByKey(
//seed a list for each key to hold your top N's with your first record
(v) => List[(String, Long)](v),
//add the incoming value to the accumulating top N list for the key
(acc: List[(String, Long)], v) => topNs(acc ++ List((v._1, v._2)), topN).toList,
//merge top N lists returned from each partition into a new combined top N list
(acc: List[(String, Long)], acc2: List[(String, Long)]) => topNs(acc ++ acc2, topN).toList)
//print results sorting for pretty
topNForKey.sortByKey(true).foreach((t) => {
println(s"key: ${t._1}")
t._2.foreach((v) => {
println(s"----- $v")
})
})
}
}
And what I get in the returned RDD:
(1, List(("stackoverflow.com", 10L),
("reddit.com", 15L)))
(2, List(("apple.com", 9L),
("slashdot.org", 11L)))
(3, List(("google.com", 4L),
("microsoft.com", 5L)))
References
https://www.mail-archive.com/user@spark.apache.org/msg16827.html
https://stackoverflow.com/a/8275562/807318
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
Spark 1.4.0 solves this problem.
Take a look at https://github.com/apache/spark/commit/5e6ad24ff645a9b0f63d9c0f17193550963aa0a7
This uses BoundedPriorityQueue with aggregateByKey:
def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
seqOp = (queue, item) => {
queue += item
},
combOp = (queue1, queue2) => {
queue1 ++= queue2
}
).mapValues(_.toArray.sorted(ord.reverse)) // This is a min-heap, so we reverse the order.
}
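As far as I know, BoundedPriorityQueue is internal to Spark, so here is a rough plain-Scala sketch of the same idea using `scala.collection.mutable.PriorityQueue` as a size-bounded min-heap. The function name `topN` is made up for illustration, and this is not Spark's actual implementation.

```scala
import scala.collection.mutable

def topN[A](items: Iterable[A], n: Int)(implicit ord: Ordering[A]): List[A] = {
  // With the reversed ordering, heap.head is the smallest item kept so far.
  val heap = mutable.PriorityQueue.empty[A](ord.reverse)
  items.foreach { item =>
    if (heap.size < n) heap.enqueue(item)
    else if (n > 0 && ord.gt(item, heap.head)) {
      heap.dequeue()      // evict the current smallest
      heap.enqueue(item)  // keep the larger newcomer
    }
  }
  // Return the kept items from largest to smallest.
  heap.toList.sorted(ord.reverse)
}

val topFoo = topN(List(3, 1, 2), 2)
val topBar = topN(List(6, 5, 4), 2)
```

The heap never holds more than `n` items, which is what keeps memory bounded per key no matter how many values that key has.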
Your question is a little confusing, but I think this does what you're looking for:
val flattenedTopNPerGroup =
topNPerGroup.flatMap({case (key, numbers) => numbers.map(key -> _)})
and in the REPL it prints out what you want:
flattenedTopNPerGroup.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
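The same `flatMap` pattern can be tried on a local List, without a SparkContext, to see what it does. This is only an illustration: the local `topNPerGroup` below is a hand-written stand-in for the RDD in the question.

```scala
val topNPerGroup: List[(String, List[Int])] =
  List("bar" -> List(6, 5), "foo" -> List(3, 2))

// For each (key, numbers) pair, emit one (key, number) pair per number;
// flatMap then concatenates the per-key lists into a single list.
val flattened: List[(String, Int)] =
  topNPerGroup.flatMap { case (key, numbers) => numbers.map(key -> _) }
```

On a local List the order of the input is preserved; on an RDD the order of keys after collect depends on partitioning.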
Just use topByKey:
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD
val topTwo: RDD[(String, Int)] = data.topByKey(2).flatMapValues(x => x)
topTwo.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
It is also possible to provide an alternative Ordering (not required here). For example, if you wanted the n smallest values:
data.topByKey(2)(scala.math.Ordering.by[Int, Int](- _))
I have two tuples in Scala of the following form:
val array1 = (bucket1, Seq((dateA, Amount11), (dateB, Amount12), (dateC, Amount13)))
val array2 = (bucket2, Seq((dateA, Amount21), (dateB, Amount22), (dateC, Amount23)))
What is the quickest way to make a .csv file in Scala such that:
date* is the pivot (one row per date).
bucket* are the column names.
Amount* values fill the table.
It needs to look something like this:
Dates______________bucket1__________bucket2
dateA______________Amount11________Amount21
dateB______________Amount12________Amount22
dateC______________Amount13________Amount23
You can make it shorter by chaining some operations, but here it is step by step:
scala> val array1 = ("bucket1", Seq(("dateA", "Amount11"), ("dateB", "Amount12"), ("dateC", "Amount13")))
array1: (String, Seq[(String, String)]) =
(bucket1,List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13)))
scala> val array2 = ("bucket2", Seq(("dateA", "Amount21"), ("dateB", "Amount22"), ("dateC", "Amount23")))
array2: (String, Seq[(String, String)]) =
(bucket2,List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23)))
// Single array to work with
scala> val arrays = List(array1, array2)
arrays: List[(String, Seq[(String, String)])] = List(
(bucket1,List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13))),
(bucket2,List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23)))
)
// Split between buckets and the values
scala> val (buckets, values) = arrays.unzip
buckets: List[String] = List(bucket1, bucket2)
values: List[Seq[(String, String)]] = List(
List((dateA,Amount11), (dateB,Amount12), (dateC,Amount13)),
List((dateA,Amount21), (dateB,Amount22), (dateC,Amount23))
)
// Format the data
// Note that this does not keep the 'dateX' order
scala> val grouped = values.flatten
.groupBy(_._1)
.map { case (date, list) => date::(list.map(_._2)) }
grouped: scala.collection.immutable.Iterable[List[String]] = List(
List(dateC, Amount13, Amount23),
List(dateB, Amount12, Amount22),
List(dateA, Amount11, Amount21)
)
// Join everything, and add the "Dates" column in front of the buckets
scala> val table = ("Dates"::buckets)::grouped.toList
table: List[List[String]] = List(
List(Dates, bucket1, bucket2),
List(dateC, Amount13, Amount23),
List(dateB, Amount12, Amount22),
List(dateA, Amount11, Amount21)
)
// Join the rows by ',' and the lines by "\n"
scala> val string = table.map(_.mkString(",")).mkString("\n")
string: String =
Dates,bucket1,bucket2
dateC,Amount13,Amount23
dateB,Amount12,Amount22
dateA,Amount11,Amount21
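Since groupBy loses the date order, here is a sketch of a variant that keeps dates in first-seen order and then writes the result to a file. The temp-file destination and the names `buckets`, `lookups` and `out` are placeholders for this example, and it does no CSV escaping, so it assumes the values contain no commas, quotes or newlines.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

val buckets: List[(String, Seq[(String, String)])] = List(
  ("bucket1", Seq(("dateA", "Amount11"), ("dateB", "Amount12"), ("dateC", "Amount13"))),
  ("bucket2", Seq(("dateA", "Amount21"), ("dateB", "Amount22"), ("dateC", "Amount23")))
)

val header: List[String] = "Dates" :: buckets.map(_._1)
// distinct keeps the first occurrence of each date, preserving input order.
val dates: List[String] = buckets.flatMap(_._2.map(_._1)).distinct
// One date -> amount lookup per bucket; a missing date becomes an empty cell.
val lookups: List[Map[String, String]] = buckets.map(_._2.toMap)
val rows: List[List[String]] = dates.map(d => d :: lookups.map(_.getOrElse(d, "")))

val csv: String = (header :: rows).map(_.mkString(",")).mkString("\n")

// Placeholder destination: a temp file; swap in a real path as needed.
val out: Path = Files.createTempFile("pivot", ".csv")
Files.write(out, csv.getBytes(StandardCharsets.UTF_8))
```

For real data with free-form text, a CSV library that handles quoting would be a safer choice than `mkString`.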