How to skip elements that fail a condition inside map in Scala? - scala

I have a situation in which I need to transform a collection with map while skipping the elements that fail a condition, as in the code below:
// Say I have a list
// I don't want to apply a filter function ...
val myList = List(2,3,4,5)
val evenList = myList.map(x => {
  if (x % 2 == 0) x
  else 0
})
// And the output is: List(2,0,4,0)
// The output actually needed was List(2,4), without applying a filter on top like myList.filter
// I have case class objects instead of numbers, so the output becomes: List(object1, None, object2, None)
// But the actual output needed was: List(object1, object2)
// The updated scenario
val basket = List(2,4,5,6)
case class Apple(name: Option[String], size: Option[Int])
val listApples: List[Apple] = basket.map(x => {
  val r = new scala.util.Random
  val size = r.nextInt(10)
  if (x % 2 != 0) {
    Apple(None, None)
  }
  else Apple(Some("my-apple"), Some(size))
})
Current Output :
Apple(Some(my-apple),Some(2))
Apple(Some(my-apple),Some(0))
Apple(None,None)
Apple(Some(my-apple),Some(4))
Expected was :
Apple(Some(my-apple),Some(2))
Apple(Some(my-apple),Some(0))
Apple(Some(my-apple),Some(4))

I believe collect best suits your case. It takes a partial function as an argument; only the elements for which that function is defined are transformed and added to the result:
val myList = List(2,3,4,5)
case class Wrapper(i: Int)
val evenList = myList.collect {
  case x if x % 2 == 0 => Wrapper(x)
}
In this case only 2 and 4 will be wrapped inside Wrapper:
List(Wrapper(2), Wrapper(4))
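To make the "partial function" part more concrete, here is a minimal sketch. The name halveEven is only an illustration and not part of the question; isDefinedAt shows which elements collect would keep.

```scala
// Hypothetical helper: a partial function defined only for even numbers.
val halveEven: PartialFunction[Int, Int] = {
  case x if x % 2 == 0 => x / 2
}

val numbers = List(2, 3, 4, 5)

// isDefinedAt reports, per element, whether the partial function applies.
val defined = numbers.map(halveEven.isDefinedAt)

// collect applies the function only where it is defined and drops the rest.
val halves = numbers.collect(halveEven)
```

Here defined is List(true, false, true, false) and halves is List(1, 2), so the odd numbers never reach the result.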

I'm not sure if I understand you correctly, but why not just use a filter directly:
val myList = List(2,3,4,5)
myList.filter(_ % 2 == 0)
If you want to have the Filter as a function:
def even(n:Int) = n % 2 == 0
myList.filter(even)
After the question update, here is the difference between filter and collect:
Filter:
myList
  .filter(even)
  .map(s => Apple(Some("my-apple"), Some(s)))
Collect:
myList
  .collect { case s if even(s) => Apple(Some("my-apple"), Some(s)) }
Both return List(Apple(Some(my-apple),Some(2)), Apple(Some(my-apple),Some(4)))
So the only difference is that collect lets you do both steps at once.
However, I mostly find it more readable to keep these two steps separate.

Related

In scala, how do I get access to specific index in tuple?

I am implementing a function that takes a random index and returns the element of a tuple at that index.
I know that for a tuple like val a=(1,2,3), a._1 returns 1.
However, when I use a random index, val index=random_index (an integer smaller than the size of the tuple), a._index doesn't work.
You can use productElement; note that it is zero-based and has a return type of Any:
val a=(1,2,3)
a.productElement(1) // returns 2nd element
If you know random_index only at runtime, the best you can have is (as GuruStron answered)
val a = (1,2,3)
val i = 1
val x = a.productElement(i)
x: Any // 2
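For the random-index case from the question, a minimal sketch could look like this. It assumes the index is drawn with scala.util.Random and uses productArity to stay within bounds; the names index, picked and asInt are only illustrative.

```scala
import scala.util.Random

val a = (1, 2, 3)

// productArity is the number of elements in the tuple (3 here).
val index = Random.nextInt(a.productArity)

// productElement is zero-based and returns Any.
val picked: Any = a.productElement(index)

// Pattern matching recovers a concrete type when one is needed.
val asInt: Option[Int] = picked match {
  case n: Int => Some(n)
  case _      => None
}
```

Since every element of this tuple is an Int, asInt is always defined here; for a tuple of mixed types the fallback case would return None for the non-Int elements.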
If you know random_index at compile time you can do
import shapeless.syntax.std.tuple._
val a = (1,2,3)
val x = a(1)
x: Int // 2 // not just Any
// a(4) // doesn't compile
val i = 1
// a(i) // doesn't compile
https://github.com/milessabin/shapeless/wiki/Feature-overview:-shapeless-2.0.0#hlist-style-operations-on-standard-scala-tuples
Although this a(1) seems to be pretty similar to the standard a._2.

Need some clarity on for loop usage in Spark scala

I am trying to run the code below to create pairs from a Spark RDD. When I run the code for only one mapping it works fine, but when I use a for loop to iterate over all the elements I don't get the expected output.
val file = sc.textFile("filepath")
file.collect.foreach(println)
1,Abc,300
2,Def,200
3,Xyz,400
file.map(x => x.split(",")).map(x => (x(0)->x(1))).collect.foreach(println)
Output is coming as expected :-
(1,Abc)
(2,Def)
(3,Xyz)
Using for loop:-
file.map(x => x.split(",")).map(x => {
  for (i <- 0 to 2) {
    x(0) -> x(i)
  }
}).collect.foreach(println)
Output is coming as (which is not the expected output):-
()
()
()
Expected output is:-
(1,1)
(2,2)
(3,3)
(1,Abc)
(2,Def)
(3,Xyz)
(1,300)
(2,200)
(3,400)
I tried using yield in the for loop but got some syntax errors.
First, let me explain the output you obtain. A for loop simply returns an object of type Unit, regardless of what's in it. Here is a way to verify that using the REPL:
scala> val test = for(i<- 0 to 2) { i }
test: Unit = ()
NB: () is the only object of type Unit
If you want to change that, you need to use yield, as you suggested. Here is an example:
scala> val test = for(i<- 0 to 2) yield { i }
test: scala.collection.immutable.IndexedSeq[Int] = Vector(0, 1, 2)
That's more like it.
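One way to see why the two forms differ is to write them as the method calls they roughly correspond to. This is a small sketch on a plain Range, not Spark code:

```scala
val range = 0 to 2

// Without yield, the loop corresponds roughly to foreach, which returns Unit.
val withoutYield: Unit = range.foreach(i => i)

// With yield, it corresponds roughly to map, which returns a collection.
val withYield = range.map(i => i)

// The for-comprehension form produces the same collection as map.
val viaFor = for (i <- range) yield i
```

So the body of a loop without yield is evaluated and thrown away, while the body of a loop with yield becomes the elements of the result.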
In your case, adding yield is not enough. It would yield collections of tuples like this:
Vector((1,1), (1,Abc), (1,300))
Vector((2,2), (2,Def), (2,200))
Vector((3,3), (3,Xyz), (3,400))
What you need is the flatMap function, which flattens the collections (i.e. it transforms an RDD of collections of elements into an RDD of elements).
file.map(x => x.split(",")).flatMap(x => {
  for (i <- 0 to 2) yield {
    x(0) -> x(i)
  }
}).collect.foreach(println)
which gives you the pairs you expect, grouped by line:
(1,1)
(1,Abc)
(1,300)
(2,2)
(2,Def)
(2,200)
(3,3)
(3,Xyz)
(3,400)
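The same behaviour can be tried without a Spark context. In this sketch a plain List stands in for the RDD (an assumption made only so the snippet runs locally), and the last part shows one way to get the column-by-column order listed in the question:

```scala
// Sample lines taken from the question; a plain List stands in for the RDD.
val lines = List("1,Abc,300", "2,Def,200", "3,Xyz,400")
val rows = lines.map(_.split(","))

// map + yield keeps one inner collection per line.
val nested = rows.map(x => for (i <- 0 to 2) yield x(0) -> x(i))

// flatMap + yield flattens those inner collections into one list of pairs.
val flat = rows.flatMap(x => for (i <- 0 to 2) yield x(0) -> x(i))

// Iterating over the column index first gives the column-by-column order
// from the question (shown here for local collections only).
val byColumn = for {
  i <- (0 to 2).toList
  x <- rows
} yield x(0) -> x(i)
```

flat starts with (1,1), (1,Abc), (1,300), while byColumn starts with (1,1), (2,2), (3,3), (1,Abc).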

Scala - conditional product/join of two arrays with default values using for comprehensions

I have two Sequences, say:
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")
How do I get a product in which each element of the first array is joined with every element of the second array that starts with it, and which yields a default result when no element of the second array meets the condition?
Effectively, I want this output:
expectedProductArray = Array("B-B25", "B-B80", "B-B50", "L-Default", "T-T70")
I tried doing:
val myProductArray: Array[String] = for {
  f <- first
  s <- second if s.startsWith(f)
} yield s"""$f-$s"""
and I get:
myProductArray = Array("B-B25", "B-B80", "B-B50", "T-T70")
Is there an idiomatic way to add a default value for elements of the first sequence that have no matching element in the second sequence under the given criterion? Appreciate your thoughts.
Here's one approach: turn array second into a Map and look up each element of array first in that Map with getOrElse:
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")
val m = second.groupBy(_(0).toString)
// m: scala.collection.immutable.Map[String,Array[String]] =
// Map(M -> Array(M100), A -> Array(A50), B -> Array(B25, B80, B50), T -> Array(T70))
first.flatMap(x => m.getOrElse(x, Array("Default")).map(x + "-" + _))
// res1: Array[String] = Array(B-B25, B-B80, B-B50, L-Default, T-T70)
In case you prefer using a for-comprehension:
for {
  x <- first
  y <- m.getOrElse(x, Array("Default"))
} yield s"$x-$y"
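Note that grouping by the first character assumes every prefix in first is a single character. If prefixes could be longer, a variant that keeps the startsWith test from the question might look like the sketch below (an alternative, not the approach above):

```scala
val first = Array("B", "L", "T")
val second = Array("T70", "B25", "B80", "A50", "M100", "B50")

// For each prefix, keep the matching elements or fall back to a default.
val joined: Array[String] = first.flatMap { f =>
  val matches = second.filter(_.startsWith(f))
  if (matches.isEmpty) Array(s"$f-Default")
  else matches.map(s => s"$f-$s")
}
```

For the arrays in the question, joined contains B-B25, B-B80, B-B50, L-Default, T-T70. It scans second once per prefix, which is fine for small inputs but slower than the Map lookup for large ones.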

How do I remove empty dataframes from a sequence of dataframes in Scala

How do I remove empty data frames from a sequence of data frames? In the code snippet below, there are many empty data frames in twoColDF. I also have a second question about the for loop below: is there a way to make it more efficient? I tried rewriting it as the line below, but it didn't work.
//finalDF2 = (1 until colCount).flatMap(j => groupCount(j).map( y=> finalDF.map(a=>a.filter(df(cols(j)) === y)))).toSeq.flatten
var twoColDF: Seq[Seq[DataFrame]] = null
if (colCount == 2) {
  val i = 0
  for (j <- i + 1 until colCount) {
    twoColDF = groupCount(j).map(y => {
      finalDF.map(x => x.filter(df(cols(j)) === y))
    })
  }
}
finalDF = twoColDF.flatten
Given a set of DataFrames, you can access each DataFrame's underlying RDD and use isEmpty to filter out the empty ones:
val input: Seq[DataFrame] = ???
val result = input.filter(!_.rdd.isEmpty())
As for your other question - I can't understand what your code tries to do, but I'd first try to convert it into something more functional (remove use of vars and imperative conditionals). If I'm guessing the meaning of your inputs, here's something that might be equivalent to what you're trying to do:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
val input: Seq[DataFrame] = ???
// map of column index to column values -
// for each combination we'd want a new DF where that column has that value
// I'm assuming values are Strings, can be anything else
val groupCount: Map[Int, Seq[String]] = ???
// for each combination of DF + column + value - produce the filtered DF where this column has this value
val perValue: Seq[DataFrame] = for {
  df <- input
  index <- groupCount.keySet
  value <- groupCount(index)
} yield df.filter(col(df.columns(index)) === value)
// remove empty results:
val result: Seq[DataFrame] = perValue.filter(!_.rdd.isEmpty())

How to copy matrix to column array

I'm trying to copy a column of a matrix into an array; I also want to make this matrix public.
Here's my code:
val years = Array.ofDim[String](1000, 1)
val bufferedSource = io.Source.fromFile("Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val i=0;
//println("THEME, TITLE, ARTIST, YEAR, SPOTIFY_URL")
for (line <- bufferedSource.getLines) {
val cols = line.split(",").map(_.trim)
years(i)=cols(3)(i)
}
I want cols to be a global matrix and to copy column 3 into years, but because of the way I obtain cols I don't know how to define it.
There are three different problems in your attempt:
1. Splitting on a bare comma will fail for this dataset, because some fields contain commas inside double quotes. I suggest you split on this regex instead:
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
With it, a comma inside a block wrapped in double quotes is not treated as a separator (courtesy of Luke Sheppard on regexr).
2. This val i=0; is not very Scala-ish / functional. We can replace it by a zipWithIndex in the for comprehension:
for ((line, count) <- bufferedSource.getLines.zipWithIndex)
3. You can create the "global matrix" by extracting elements from each line (val Array (...)) and returning them as the value of the for-comprehension block (yield).
It looks like this:
for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
  val Array(theme,title,artist,year,spotify_url) = line....
  ...
  (theme,title,artist,year,spotify_url)
}
And here is the complete solution:
val bufferedSource = io.Source.fromFile("/tmp/Top_1_000_Songs_To_Hear_Before_You_Die.csv")
val years = Array.ofDim[String](1000, 1)
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
val iteratorMatrix = for ((line, count) <- bufferedSource.getLines.zipWithIndex) yield {
  val Array(theme,title,artist,year,spotify_url) = line.split(regex, -1).map(_.trim)
  years(count) = Array(year)
  (theme,title,artist,year,spotify_url)
}
// will actually consume the iterator and fill in globalMatrix AND years
val globalMatrix = iteratorMatrix.toList
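To see what the regex changes, it can be tried on a single line. The sample line below is made up for illustration (it is not taken from the CSV file); its title contains a comma inside double quotes.

```scala
// Regex from the answer: matches a comma followed by an even number of double quotes.
val regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"

// Hypothetical sample line; the title contains a comma inside double quotes.
val line = "Love,\"Hello, Goodbye\",The Beatles,1967,spotify:track:example"

// A bare comma split breaks the quoted title into two fields.
val naive = line.split(",").map(_.trim)

// The quote-aware split keeps it as one field (-1 keeps trailing empty fields).
val fields = line.split(regex, -1).map(_.trim)
val year = fields(3)
```

With the bare comma split the line has six fields and index 3 is the artist; with the regex it has five fields and index 3 is the year. The surrounding double quotes stay part of the title field.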
Here's a function that will get the col column from the CSV. There is no error handling here for any empty row or other conditions. This is a proof of concept so add your own error handling as you see fit.
val years = (fileName: String, col: Int) => scala.io.Source.fromFile(fileName)
  .getLines()
  .map(_.split(",")(col).trim())
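As one possible form of the error handling mentioned above, rows that are too short can be skipped instead of throwing. This is only a sketch: the helper name column and the sample rows are made up, and it works on any Iterator[String] so it can be tried without the file.

```scala
// Hypothetical helper: returns the requested column, skipping rows that are too short.
def column(lines: Iterator[String], col: Int): List[String] =
  lines
    .map(_.split(",", -1).map(_.trim))
    // lift returns None instead of throwing when the index is out of range.
    .flatMap(cols => cols.lift(col).toList)
    .toList

// Hypothetical sample rows; the second one has no fourth field.
val sample = Iterator("a,b,c,1967", "too,short", "d,e,f,1971")
val sampleYears = column(sample, 3)
```

sampleYears is List(1967, 1971); the short row is dropped silently. Whether dropping, logging, or failing is the right reaction depends on the data, so treat this as a starting point.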
Here's a suggestion if you are looking to keep the contents of the file in a map. Again there's no error handling just proof of concept.
val yearColumn = 3
val fileName = "Top_1_000_Songs_To_Hear_Before_You_Die.csv"
def lines(file: String) = scala.io.Source.fromFile(file).getLines()
val mapRow = (row: String) => row.split(",").zipWithIndex.foldLeft(Map[Int, String]()) {
  case (acc, (v, idx)) => acc.updated(idx, v.trim())
}
def mapColumns = (values: Iterator[String]) =>
  values.zipWithIndex.foldLeft(Map[Int, Map[Int, String]]()) {
    case (acc, (v, idx)) => acc.updated(idx, mapRow(v))
  }
val parser = lines _ andThen mapColumns
val matrix = parser(fileName)
val years = matrix.flatMap(_.swap._1.get(yearColumn))
This will build a Map[Int, Map[Int, String]] which you can use elsewhere. The key of the outer map is the row number and the key of the inner map is the column number. years is an Iterable[String] that contains the year values; the expression _.swap._1 simply selects the inner map of each (row number, row) entry.
Consider adding contents to a collection at the same time as it is created, rather than allocating space first and then updating it; for instance:
val rawSongsInfo = io.Source.fromFile("Top_Songs.csv").getLines
val cols = for (rsi <- rawSongsInfo) yield rsi.split(",").map(_.trim)
val years = cols.map(_(3))