IndexedSeq and removing multiple elements - scala

I have mySeq: IndexedSeq[A] and another myIncludedSeq: IndexedSeq[A], where every element of myIncludedSeq is contained in mySeq.
I want to create a new IndexedSeq[A] from mySeq without all elements from myIncludedSeq.
I can't find any nice functional approach for this problem. How would you approach it?
Example:
val mySeq = IndexedSeq("a", "b", "a", "c", "d", "a")
val myIncludedSeq = IndexedSeq("a", "d", "a")
//magic
val expectedResult = IndexedSeq("b", "c", "a") //the order does not matter

How about this?
val original = IndexedSeq("a", "b", "a", "c", "d", "a")
val exclude = IndexedSeq("a", "d", "a")
val result = original.diff(exclude)
// IndexedSeq[String] = Vector(b, c, a)
From list's diff doc:
Computes the multiset difference between this list and another sequence.
Returns a new list which contains all elements of this list except
some of occurrences of elements that also appear in "excluding" list. If an
element value x appears n times in that, then the first n occurrences
of x will not form part of the result, but any following occurrences
will.
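To make the multiset behaviour concrete, here is a small sketch that compares diff with a plain filterNot on the same data. The names multisetDiff and setStyleRemoval are illustrative only and do not come from the question.

```scala
val original = IndexedSeq("a", "b", "a", "c", "d", "a")
val exclude  = IndexedSeq("a", "d", "a")

// diff removes one occurrence from `original` per occurrence in `exclude`,
// so the third "a" survives.
val multisetDiff = original.diff(exclude)

// filterNot drops every element that appears anywhere in `exclude`,
// regardless of how many times it appears there.
val setStyleRemoval = original.filterNot(x => exclude.contains(x))

println(multisetDiff)
println(setStyleRemoval)
```

If the counts in the excluded sequence matter, as in the question, diff is the one that matches the expected result.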

Related

SparkMap[RDD] - map toMany-One

I have an RDD in the following form:
[ ("a") -> (pos3, pos5), ("b") -> (pos1, pos7), .... ]
and
(pos1 ,pos2, ............, posn)
Q: How can I map each position to its key, to get something like the following?
("b", "e", "a", "d", "a" .....)
// "b" corresponds to pos 1, "e" corresponds to pos 2, and so on
Example(edit) :
// chunk of my data
val data = Vector(("a",(124)), ("b",(125)), ("c",(121, 123)), ("d",(122)),..)
val rdd = sc.parallelize(data)
// from rdd I can create my position rdd which is something like:
val positions = Vector(1,2,3,4,.......125) // my positions
// I want to map each position to my tokens ("a", "b", "c", ....) to achieve:
Vector("a", "b", "a", ...)
// "a" corresponds to pos1, "b" corresponds to pos2 ...
I'm not sure you need Spark for this specific use case (starting with a Vector and ending with a Vector containing all your data characters).
Nevertheless, here is a suggestion in case it suits your needs:
val data = Vector(("a",Set(124)), ("b", Set(125)), ("c", Set(121, 123)), ("d", Set(122)))
val rdd = spark.sparkContext.parallelize(data)
val result = rdd.flatMap{case (k,positions) => positions.map(p => Map(p -> k))}
.reduce(_ ++ _) //here, we aggregate the Map objects together, reducing partitions first and then merging executors results
.toVector
.sortBy(_._1) //We sort data based on position
.map(_._2) // We only keep characters
.mkString
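Since the answer notes that Spark may not be needed here, the same inversion can be sketched with plain Scala collections. This is only an illustration on the sample data above; tokensByPosition is a made-up name, and it returns a Vector rather than a String.

```scala
val data = Vector(("a", Set(124)), ("b", Set(125)), ("c", Set(121, 123)), ("d", Set(122)))

// Emit one (position, token) pair per position, sort by position,
// then keep only the tokens.
val tokensByPosition: Vector[String] =
  data
    .flatMap { case (token, positions) => positions.map(p => p -> token) }
    .sortBy(_._1)
    .map(_._2)

println(tokensByPosition)
```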

Splitting a Scala List into parts using a given list of part sizes.[Partitioning]

I got two lists:
val list1:List[Int] = List(5, 2, 6)
val list2:List[Any] = List("a", "b", "c", "d", "e", "f", "g", "h", "i", "j","k")
such that list1.sum >= list2.size
I want a list of lists formed with elements in list2 consecutively
with the sizes mentioned in list1.
For example:
if list1 is List(5,2,4) the result I want is:
List(List("a", "b", "c", "d", "e"),List("f", "g"),List("h", "i", "j","k"))
if list1 is List(5,4,6) the result I want is:
List(List("a", "b", "c", "d", "e"),List("f", "g","h", "i"),List("j","k"))
How can I do that with concise code?
Turn list2 into an Iterator then map over list1.
val itr = list2.iterator
list1.map(itr.take(_).toList)
//res0: List[List[Any]] = List(List(a, b, c, d, e), List(f, g), List(h, i, j, k))
update:
While this appears to give the desired results, it has been pointed out elsewhere that reusing the iterator is actually unsafe and its behavior is not guaranteed.
With some modifications a safer version can be achieved.
val itr = list2.iterator
list1.map(List.fill(_)(if (itr.hasNext) Some(itr.next()) else None).flatten)
-- or --
import scala.util.Try
val itr = list2.iterator
list1.map(List.fill(_)(Try(itr.next()).toOption).flatten)
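Another way to avoid a shared iterator altogether is to thread the remaining elements through a fold and cut each chunk with splitAt. This is a sketch rather than part of the original answer; splitBySizes is a hypothetical helper name.

```scala
// Splits `items` into consecutive chunks whose lengths are given by `sizes`.
// If `items` runs out early, the later chunks are shorter (possibly empty).
def splitBySizes[A](sizes: List[Int], items: List[A]): List[List[A]] = {
  val (chunks, _) = sizes.foldLeft((List.empty[List[A]], items)) {
    case ((acc, remaining), size) =>
      val (chunk, rest) = remaining.splitAt(size)
      (chunk :: acc, rest) // chunks are accumulated in reverse order
  }
  chunks.reverse
}

val letters = List("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
println(splitBySizes(List(5, 2, 4), letters))
println(splitBySizes(List(5, 4, 6), letters))
```

Because no mutable state is involved, the result does not depend on how take behaves on a reused iterator.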
You can slice list2 based on the sizes you get from list1:
def main(args: Array[String]): Unit = {
val groupingList: List[Int] = List(5, 2, 4)
val data = List("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
//0, 5
//5, (5 + 2)
//5 + 2, (5 + 2 + 4)
val grouped = groupingList.zipWithIndex.map { case (_, index) =>
val start = groupingList.slice(0, index).sum
val end = groupingList.slice(0, index + 1).sum
data.slice(start, end)
}
println(grouped)
}
result: List(List(a, b, c, d, e), List(f, g), List(h, i, j, k))
Also read: How slice works
I'd like to show it using an example:
list1=List("one","two","three")
List[String] = List(one, two, three)
list2=List("red","green","blue")
List[String] = List(red, green, blue)
list1::list2
List[java.io.Serializable] = List(List(one, two, three), red, green, blue)
var listSum=list1::list2
List[java.io.Serializable] = List(List(one, two, three), red, green, blue)
listSum
List[java.io.Serializable] = List(List(one, two, three), red, green, blue)
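We can use "::" to insert one list into another list in Scala. Note that it prepends list1 as a single element, as the output above shows; ":::" concatenates the two lists instead.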
I also found that the following code works, although the code posted by jwvh is more concise. I understand that making list1 a Vector would make more sense for element-access performance in the following code:
list1.scan(0)(_+_).sliding(2).toList.map(x=>list2.drop(x(0)).take(x(1)-x(0)))
or
list1.scan(0)(_+_).zip(list1).map(x=>list2.drop(x._1).take(x._2))
or
list1.scan(0)(_+_).sliding(2).map(x=>list2.slice(x(0),x(1))).toList

Get indexes of elements containing some value and the corresponding elements from another list

The question is a mouthful, but the idea is pretty simple.
I have 3 lists and a string.
val a = List("x", "y", "z")
val b = List("a1", "a2", "b1", "b2", "c1", "c2", "d1", "d2")
val c = List("1", "1", "2", "2", "3", "3", "4", "4")
val d = "xc1b1"
I need to check if d contains elements from a. If it does, I check the positions of all the elements from b that are present in d and return a set of the elements from c that correspond to these positions.
The result for the given example is
Set("3", "2")
But when I try
if(a.exists(d.contains)) c(b.indexWhere(d.contains))
I only get
Any = 2
which corresponds to the first encountered element from b, i.e. b1.
How would I get the set?
-
if(a.exists(d.contains)) b.zip(c).collect{
case (x, y) if d.contains(x) => y
}
// res1: Any = List(2, 3)
If you need a Set:
if(a.exists(d.contains)) b.zip(c).collect{
case (x, y) if d.contains(x) => y
}.toSet
// res2: Any = Set(2, 3)
I think I've understood what you need to do here, although the question could do with some clarification.
These are the two ways of getting to your set that I found:
if(a.exists(d.contains)) b.collect {
case x if d.contains(x) => c(b.indexOf(x))
}.toSet
if(a.exists(d.contains)) b.filter(d.contains).map(b.indexOf).map(c).toSet
Both find elements of b that are in d, then find their index in b and find their relative elements in c. The first way is more explicit in what it's doing, while the second way is more concise.
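Note that an if without an else is typed as Any, which is why the REPL output above shows Any. A possible variant, sketched here with the data from the question, adds an else branch so the result is a proper Set[String]; the name matches is illustrative only.

```scala
val a = List("x", "y", "z")
val b = List("a1", "a2", "b1", "b2", "c1", "c2", "d1", "d2")
val c = List("1", "1", "2", "2", "3", "3", "4", "4")
val d = "xc1b1"

// Pair each element of b with the element of c at the same index,
// then keep the c-values whose b-partner occurs in d.
val matches: Set[String] =
  if (a.exists(s => d.contains(s)))
    b.zip(c).collect { case (x, y) if d.contains(x) => y }.toSet
  else
    Set.empty[String] // assumption: an empty set is the right "no match" value

println(matches)
```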

Filter elements of RDD[Array[String]] by an array of 0s and 1s

I want to select some elements (features) of an RDD based on a binary array. I have an array of 0s and 1s of size 40 that specifies whether the element at each index should be kept or not.
My RDD was created from the kddcup99 dataset:
val rdd=sc.textFile("./data/kddcup.txt")
val data=rdd.map(_.split(','))
How can I filter or select the elements of data (RDD[Array[String]]) whose corresponding index in the binary array has the value 1?
If I understood your question correctly, you have an array like:
val arr = Array(1, 0, 1, 1, 1, 0)
And an RDD[Array[String]] which looks like:
val rdd = sc.parallelize(Array(
Array("A", "B", "C", "D", "E", "F") ,
Array("G", "H", "I", "J", "K", "L")
) )
Now, to get the elements at the indices where arr has 1, you first need the indices whose value in arr is 1:
val requiredIndices = arr.zipWithIndex.filter(_._1 == 1).map(_._2)
requiredIndices: Array[Int] = Array(0, 2, 3, 4)
Then, similarly for the RDD, you can use zipWithIndex and contains to check whether each index is present in your requiredIndices array:
rdd.map(_.zipWithIndex.filter(x => requiredIndices.contains(x._2) ).map(_._1) )
// after .collect(): Array[Array[String]] = Array(Array(A, C, D, E), Array(G, I, J, K))
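As an alternative sketch, each row can be zipped directly with the 0/1 array, which avoids the repeated contains lookup. The example below uses plain arrays so it runs without Spark; selectByMask is a hypothetical helper, and it assumes the mask and the rows have the same length.

```scala
val mask = Array(1, 0, 1, 1, 1, 0)

// Keep the values whose mask entry at the same index is 1.
def selectByMask(row: Array[String], mask: Array[Int]): Array[String] =
  row.zip(mask).collect { case (value, 1) => value }

val rows = Array(
  Array("A", "B", "C", "D", "E", "F"),
  Array("G", "H", "I", "J", "K", "L")
)

val selected = rows.map(row => selectByMask(row, mask))
selected.foreach(r => println(r.mkString(", ")))
```

On an RDD the same function could presumably be applied with rdd.map(row => selectByMask(row, mask)).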

Scala: Create a new list where each element is the element of old list repeated with different suffix

This seems like it should be really easy but I can't quite put it together. I want to take a list of strings and create a new list that contains two of each element from the first list but with a different suffix. So:
List("a", "b", "c") -> List("a_x", "a_y", "b_x", "b_y", "c_x", "c_y")
I tried
val list2 = list1.map(i=> i+"_x", i+"_y")
but scala said I had too many arguments. This got close:
val list2 = list1.map(i=> (i+"_x", i+"_y"))
but it produced List(("a_x", "a_y"), ("b_x", "b_y"), ("c_x", "c_y")) which is not what I want. I'm sure I'm missing something obvious.
You want flatMap, to first map, then flatten the structure of the result into a flat list. Each individual result must itself be a collection (not a tuple):
scala> List("a", "b", "c").flatMap(i => List(i + "-x", i + "-y"))
res0: List[String] = List(a-x, a-y, b-x, b-y, c-x, c-y)
With a for comprehension:
scala> val prefixes = List("a", "b", "c")
prefixes: List[String] = List(a, b, c)
scala> val suffixes = List("x", "y")
suffixes: List[String] = List(x, y)
scala> for (prefix <- prefixes; suffix <- suffixes) yield prefix + "_" + suffix
res1: List[String] = List(a_x, a_y, b_x, b_y, c_x, c_y)
This is basically just syntactic sugar for Seth Tisue's answer.
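To illustrate the "syntactic sugar" remark, the sketch below writes the same computation both ways. The compiler rewrites a for comprehension with two generators into a flatMap wrapping a map; the names viaFor and viaFlatMap are illustrative.

```scala
val prefixes = List("a", "b", "c")
val suffixes = List("x", "y")

// For-comprehension form.
val viaFor = for (prefix <- prefixes; suffix <- suffixes) yield prefix + "_" + suffix

// Equivalent explicit form: outer generator becomes flatMap, inner becomes map.
val viaFlatMap = prefixes.flatMap(prefix => suffixes.map(suffix => prefix + "_" + suffix))

println(viaFor)
println(viaFlatMap)
```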