Filter a list by item index? - scala

val data = List("foo", "bar", "bash")
val selection = List(0, 2)
val selectedData = data.filter(datum => selection.contains(datum.MYINDEX))
// INVALID CODE HERE ^
// selectedData: List("foo", "bash")
Say I want to filter a List given a list of selected indices. If, in the filter method, I could reference the index of a list item then I could solve this as above, but datum.MYINDEX isn't valid in the above case.
How could I do this instead?

How about using zipWithIndex to keep a reference to the item's index, filtering as such, then mapping the index away?
data.zipWithIndex
  .filter { case (datum, index) => selection.contains(index) }
  .map(_._1)

It's neater to do it the other way around, although it's potentially slow with Lists, since indexing into a List is O(n); Vectors would be better. On the other hand, calling contains on selection for every item in data, as the other solution does, isn't exactly fast either. (A sketch combining both fixes follows the example below.)
val data = List("foo", "bar", "bash")
//> data : List[String] = List(foo, bar, bash)
val selection = List(0, 2)
//> selection : List[Int] = List(0, 2)
selection.map(index => data(index))
//> res0: List[String] = List(foo, bash)
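If performance matters, both points above can be addressed by converting once up front; a minimal sketch of that idea, reusing data and selection from above (the dataVec/selectionSet names are just for illustration):
val dataVec = data.toVector            // O(1) indexed access for picking by index
val picked = selection.map(dataVec)    //> picked: List[String] = List(foo, bash)
val selectionSet = selection.toSet     // O(1) membership tests for the contains-based approach
val filtered = data.zipWithIndex.collect {
  case (datum, index) if selectionSet(index) => datum
}                                      //> filtered: List[String] = List(foo, bash)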

First solution that came to my mind was to create a list of pairs (element, index), filter every element by checking whether selection contains its index, then map the resulting list to keep only the raw elements (omitting the index). The code is self-explanatory:
data.zipWithIndex.filter(pair => selection.contains(pair._2)).map(_._1)
or more readable:
val elemsWithIndices = data.zipWithIndex
val filteredPairs = elemsWithIndices.filter(pair => selection.contains(pair._2))
val selectedElements = filteredPairs.map(_._1)

This works:
val data = List("foo", "bar", "bash")
val selection = List(0, 2)
val selectedData = data.filter(datum => selection.contains(data.indexOf(datum)))
println (selectedData)
Output:
List(foo, bash)
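One caveat worth adding (not in the original answer): indexOf returns the index of the first occurrence, so this version misbehaves when data contains duplicates. For example:
val dupData = List("foo", "foo", "bar")
dupData.filter(d => List(1).contains(dupData.indexOf(d))) // List(), even though index 1 is "foo"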

Since you have a list of indices already, the most efficient way is to pick those indices directly:
val data = List("foo", "bar", "bash")
val selection = List(0, 2)
val selectedData = selection.map(index => data(index))
or even:
val selectedData = selection.map(data)
or if you need to preserve the order of the items in data:
val selectedData = selection.sorted.map(data)
UPDATED
In the spirit of finding all the possible algorithms, here's the version using collect:
val selectedData = data
  .zipWithIndex
  .collect {
    case (item, index) if selection.contains(index) => item
  }

The following is probably the most scalable way to do it in terms of efficiency and, unlike many answers on SO, it actually follows the official Scala style guide.
import scala.collection.immutable.HashSet
val selectionSet = HashSet(selection: _*)
data.zipWithIndex.collect {
  case (datum, index) if selectionSet.contains(index) => datum
}
If the resulting collection is to be passed through further map, flatMap, etc. calls, I suggest turning data into a lazy sequence. In fact, you should perhaps do that anyway to avoid two passes (one for zipWithIndex, one for collect), although I doubt benchmarking would show much of a gain.
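A sketch of the lazy, single-pass idea mentioned above, using a view over data (illustrative only, not benchmarked):
import scala.collection.immutable.HashSet
val selectionSet = HashSet(selection: _*)
val selectedData = data.view
  .zipWithIndex
  .collect { case (datum, index) if selectionSet.contains(index) => datum }
  .toList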

There is actually an easier way to filter by index using the map method; it works because a Scala Seq is a PartialFunction[Int, A], so data itself can be used as the mapping function. Here is an example:
val indices = List(0, 2)
val data = List("a", "b", "c")
println(indices.map(data)) // will print List("a", "c")

Appending to a list in Scala

I have no clue why Scala decided to make this such a chore, but I simply want to add an item to a list
var previousIds: List[String] = List()
I have tried the following:
previousIds ::: List(dataListItem.id)
previousIds :: List(dataListItem.id)
previousIds :+ List(dataListItem.id)
previousIds +: List(dataListItem.id)
previousIds :+ dataListItem.id
previousIds +: dataListItem.id
In every one of these instances, the line runs, but the list still contains 0 items.
When I try to add a new list:
val list = List[String](dataListItem.id)
previousIds += list
I get an error that list needs to be a string. When I add a string
previousIds += dataListItem.id
I get an error that it needs to be a list
For some reason, the only thing that will work is the following:
previousIds :::= List[String](dataListItem.id)
which seems really excessive, since adding to a list should be trivial. I have no idea why nothing else works, though.
How do you add an item to an existing list in Scala without having to make a new list like I am doing?
The following code should help you get started.
My assumption is that you are dealing with mutable collections:
val buf = scala.collection.mutable.ListBuffer.empty[String]
buf += "test"
buf.toList
In case you are dealing with immutable collections, the next approach would help:
val previousIds = List[String]("A", "TestB")
val newList = previousIds :+ "TestB"
Please refer to the documentation for more details:
http://www.scala-lang.org/api/current/scala/collection/immutable/List.html
Use MutableList
scala> var a = scala.collection.mutable.MutableList[String]()
a: scala.collection.mutable.MutableList[String] = MutableList()
scala> a += "s"
res0: scala.collection.mutable.MutableList[String] = MutableList(s)
scala> a :+= "s"
scala> a
res1: scala.collection.mutable.MutableList[String] = MutableList(s, s)
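An extra note, not from the original answers: if previousIds is declared as a var holding an immutable List, the asker's :+ attempt works as soon as the result is assigned back; also, as far as I recall, MutableList was removed in Scala 2.13, so ListBuffer (shown above) is the safer mutable choice.
var previousIds: List[String] = List()
previousIds :+= "someId" // hypothetical id; shorthand for previousIds = previousIds :+ "someId"
// previousIds: List[String] = List(someId)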

What would be the idiomatic way to map this data?

I'm pretty new to Scala and while working I found the need to map some data found within a log file. The log file follows this format (values changed from original):
1343,37284.ab1-tbd,283
1344,37284.ab1-tbd,284
1345,37284.ab1-tbd,0
1346,28374.ab1-tbd,107
1347,28374.ab1-tbd,0
...
The first number is not important, but the number portion of the second field and the third field are what need to be mapped. I need the map to have keys that correspond to the number portion of the second field that map to a list of every 3rd field that follows it. That was a bad explanation, so as an example here is what I would need after parsing the above log:
{
37284 => { 283, 284, 0 }
28374 => { 107, 0 }
}
The solution I came up with is this:
val data = for (line <- Source fromFile "path/to/log" getLines) yield line.split(',')
val ls = data.toList
val keys = ls.map(_(1).split('.')(0).toInt)
val vals = ls.map(_(2).toInt)
val keys2vals = for {
  (k, v) <- (keys zip vals).groupBy(_._1)
  list = v.map(_._2)
} yield (k, list)
Is there a more idiomatic way to do this in Scala? This seems kinda awkward and convoluted to me. (When explaining, please assume little to no background knowledge of language features, etc.) Also, if later down the line I wanted to exclude the number zero from the mappings, how would I do so?
EDIT:
In addition, how would I similarly turn the data into the form:
{
{ 37284, { 283 ,284, 0 } }
{ 28374, { 107, 0 } }
}
i.e. a List[(Int, List[Int])]? (This form is for use with Apache Spark's indexed RDDs.)
How about:
val assocList = for {
  line <- Source.fromFile("path/to/log").getLines
  Array(_, snd, thd) = line.split(',')
} yield (snd.split('.')(0).toInt, thd.toInt)
assocList.toList.groupBy(_._1).mapValues(_.map(_._2))
If you want a List[(Int, List[Int])], add .toList.
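To also drop the zero values, as the follow-up asks, a guard in the same comprehension does it; a minimal sketch under the same assumptions about the input file:
val assocList = for {
  line <- Source.fromFile("path/to/log").getLines
  Array(_, snd, thd) = line.split(',')
  if thd.toInt != 0
} yield (snd.split('.')(0).toInt, thd.toInt)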
I might be tempted to write it in fewer lines (arguably clearer too) like this:
val l = List((1343,"37284.ab1-tbd",283),
(1344,"37284.ab1-tbd",284),
(1345,"37284.ab1-tbd",0),
(1346,"28374.ab1-tbd",107),
(1347,"28374.ab1-tbd",0))
// drop the unused data
val m = l.map(a => a._2.split('.')(0).toInt -> a._3)
// transform to Map of key -> matchedValues
m.groupBy(_._1) mapValues (_ map (_._2))
gives:
m: List[(Int, Int)] = List((37284,283), (37284,284), (37284,0), (28374,107), (28374,0))
res0: scala.collection.immutable.Map[Int,List[Int]] = Map(37284 -> List(283, 284, 0), 28374 -> List(107, 0))
"Also, if later down the line I wanted to exclude the number zero from the mappings, how would I do so?" - You could filter the initial list:
val m = l.filter(_._3 != 0).map(a => a._2.split('.')(0).toInt -> a._3)
To convert to List[(Int, List[Int])] you just need to call .toList on the resulting Map.
val lines = io.Source.fromFile("path/to/log").getLines.toList
lines.map { x =>
  val Array(_, second, _, fourth) = x.split("[,.]")
  (second, fourth)
}.groupBy(_._1)
 .mapValues(_.map(_._2))
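Note that second and fourth stay Strings here; if the Int-keyed map (or the List[(Int, List[Int])] form from the EDIT) is wanted, a small tweak along these lines should do it:
lines.map { x =>
  val Array(_, second, _, fourth) = x.split("[,.]")
  (second.toInt, fourth.toInt)
}.groupBy(_._1)
 .mapValues(_.map(_._2))
 .toList // List[(Int, List[Int])]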

How to efficiently delete subset in spark RDD

While working on some research, I found it somewhat difficult to delete all the subsets in a Spark RDD.
The data structure is RDD[(key,set)]. For example, it could be:
RDD[ ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ]
Since the set of mike (Set(1,3)) is a subset of peter's (Set(1,2,3)), I want to delete "mike", which will end up with
RDD[ ("peter",Set(1,2,3)), ("jack",Set(5)) ]
It is easy to implement locally in Python with two for loops, but when I want to scale this out with Scala and Spark, it is not that easy to find a good solution.
Thanks
I doubt we can escape comparing each element to each other (the equivalent of a double loop in a non-distributed algorithm). The subset relation between sets is not symmetric, meaning that we need to check both whether "alice" is a subset of "bob" and whether "bob" is a subset of "alice".
To do this using the Spark API, we can resort to multiplying the data with itself using a cartesian product and verifying each entry of the resulting matrix:
val data = Seq(("peter",Set(1,2,3)), ("mike",Set(1,3)), ("anne", Set(7)),("jack",Set(5,4,1)), ("lizza", Set(5,1)), ("bart", Set(5,4)), ("maggie", Set(5)))
// expected result from this dataset = peter, anne, jack
val userSet = sparkContext.parallelize(data)
val prod = userSet.cartesian(userSet)
val subsetMembers = prod.collect{case ((name1, set1), (name2,set2)) if (name1 != name2) && (set2.subsetOf(set1)) && (set1 -- set2).nonEmpty => (name2, set2) }
val superset = userSet.subtract(subsetMembers)
// lets see the results:
superset.collect()
// Array[(String, scala.collection.immutable.Set[Int])] = Array((peter,Set(1, 2, 3)), (anne,Set(7)), (jack,Set(5, 4, 1)))
This can be achieved by using the RDD.fold function.
In this case the output required is a "List" (ItemList) of superset items. For this, the input should also be converted to a "List" (an RDD of ItemList).
import org.apache.spark.rdd.RDD
// type aliases for convenience
type Item = Tuple2[String, Set[Int]]
type ItemList = List[Item]
// Source RDD
val lst:RDD[Item] = sc.parallelize( List( ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ) )
// Convert each element as a List. This is needed for using fold function on RDD
// since the data-type of the parameters are the same as output parameter
// data-type for fold function
val listOflst:RDD[ItemList] = lst.map(x => List(x))
// for each element in second ItemList
// - Check if it is not subset of any element in first ItemList and add first
// - Remove the subset of newly added elements
def combiner(first: ItemList, second: ItemList): ItemList = {
  def helper(lst: ItemList, i: Item): ItemList = {
    val isSubset: Boolean = lst.exists(x => i._2.subsetOf(x._2))
    if (isSubset) lst else i :: lst.filterNot(x => x._2.subsetOf(i._2))
  }
  second.foldLeft(first)(helper)
}
listOflst.fold(List())(combiner)
You can use filter after a map. You can build a map step that returns the line for what you want to keep and None for what you want to delete. First build a function:
def filter_mike(line):
    if line[1] != {1, 3}:
        return line
    else:
        return None
Then you can filter like this:
your_rdd.map(filter_mike).filter(lambda x: x != None)
This will work.
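The snippet above is Python/PySpark; since the question asks about Scala, a rough Scala equivalent (with a hypothetical rdd value of type RDD[(String, Set[Int])], and hard-coding mike's set exactly as the original does) would be:
val filtered = rdd.filter { case (_, s) => s != Set(1, 3) } // keeps everything except the ("mike", Set(1,3)) entry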

Scala - Getting sequence id of elements during map

Scala newbie question.
I want to map a list to another list, but I want every object to know its sequence number.
In the following simple code, what is the right alternative to the usage of var v?
class T (s: String, val sequence:Int)
val stringList = List("a","b","C")
var v = 0
val tList = stringList.map(s => { v=v+1; new T(s, v);})
You can use zipWithIndex to get a tuple for each element containing the actual element and the index, then just map that tuple to your object:
List("a", "b", "C")
.zipWithIndex
.map(e => new T(e._1, e._2))
val tList = List.tabulate(stringList.length)(idx => new T(stringList(idx), idx))
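Two small additions, not from the original answers: the zipWithIndex version reads a bit more clearly with a pattern match, and note that tabulate's stringList(idx) is indexed access, which is O(n) per element on a List (fine for three elements, costly for long lists):
val tList = stringList.zipWithIndex.map { case (s, i) => new T(s, i) }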

Appending element to list in Scala

val indices: List[Int] = List()
val featValues: List[Double] = List()
for (f <- feat) {
  val q = f.split(':')
  if (q.length == 2) {
    println(q.mkString("\n")) // works fine, displays info
    indices :+ (q(0).toInt)
    featValues :+ (q(1).toDouble)
  }
}
println(indices.mkString("\n") + indices.length) // prints nothing and 0?
indices and featValues are not being filled. I'm at a loss here.
You cannot append anything to an immutable data structure such as List stored in a val (immutable named slot).
What your code is doing is creating a new list every time with one element appended, and then throwing it away (by not doing anything with it) — the :+ method does not modify the collection in place (even on a mutable collection such as ArrayBuffer) but always returns a new one.
In order to achieve what you want, the quickest way (as opposed to the right way) is either to use a var (typically preferred):
var xs = List.empty[Int]
xs :+= 123 // same as `xs = xs :+ 123`
or a val containing a mutable collection:
import scala.collection.mutable.ArrayBuffer
val buf = ArrayBuffer.empty[Int]
buf += 123
However, if you really want to make your code idiomatic, you should instead just use a functional approach:
val indiciesAndFeatVals = feat.map { f =>
  val Array(q0, q1) = f.split(':') // pattern matching in action
  (q0.toInt, q1.toDouble)
}
which will give you a sequence of pairs, which you can then unzip to 2 separate collections:
val (indicies, featVals) = indiciesAndFeatVals.unzip
This approach will avoid the use of any mutable data structures as well as vars (i.e. mutable slots).
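One caveat to add: the pattern val Array(q0, q1) = f.split(':') throws a MatchError for lines that do not split into exactly two parts, whereas the original loop guarded with q.length == 2. A collect keeps that guard while staying functional (a sketch, assuming feat is a collection of Strings as in the question):
val indiciesAndFeatVals = feat.map(_.split(':')).collect {
  case Array(q0, q1) => (q0.toInt, q1.toDouble)
}
val (indicies, featVals) = indiciesAndFeatVals.unzip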