Using fold or map to convert a collection - scala

This code converts a List of comma-separated Strings to tuples of Doubles, dropping the first field of each String:
val points = List(("A1,2,10"), ("A2,2,5"), ("A3,8,4"), ("A4,5,8"), ("A5,7,5"), ("A6,6,4"), ("A7,1,2"), ("A8,4,9"))
points.map (m => (m.split(",")(1).toDouble , m.split(",")(2).toDouble))
//> res0: List[(Double, Double)] = List((2.0,10.0), (2.0,5.0), (8.0,4.0), (5.0,8.0), (7.0,5.0), (6.0,4.0), (1.0,2.0), (4.0,9.0))
Can this be rewritten using fold or map so that the number of elements in each CSV string is not hardcoded? Currently it is only correct when each String contains 3 CSV elements, and I'm unsure how to rewrite it for N elements, such as ("A1,2,10,4,5").
Update: Here is a possible solution:
points.map (m => (m.split(",").tail).map(m2 => m2.toDouble))
Can this be achieved with a single traversal instead of two?

scala> val points = List(("A1,2,10"), ("A2,2,5,6,7,8,9"))
points: List[String] = List(A1,2,10, A2,2,5,6,7,8,9)
scala> points.map(_.split(",").tail.map(_.toDouble))
res0: List[Array[Double]] = List(Array(2.0, 10.0), Array(2.0, 5.0, 6.0, 7.0, 8.0, 9.0))
EDIT
Pretty much what you proposed. As to whether it is doable without a nested .map, it's pretty doubtful: your CSV represents a matrix, and matrices are usually processed with nested for loops (or nested .map calls).

Tuples are not the right choice here, since a tuple's arity is fixed at compile time and is only useful if you know the number of elements in advance.
You could use arrays though and take advantage of the fact that you can treat arrays as collections:
points.map(_.split(',').drop(1).map(_.toDouble))
.split(',') splits at the comma separator
.drop(1) removes the first element
.map(_.toDouble) converts the strings to floating point numbers
Update: This is equivalent to your proposed solution.

This has one iteration over the outer list:
points.map(_.split(",").tail.map(_.toDouble))
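Since the question title also asks about fold, here is a minimal sketch of the same conversion written both ways. It assumes the sample input shown above and converts each row to a List[Double] rather than an Array so the results can be compared with ==; the names viaMap and viaFold are only illustrative. Both versions still map over the fields of each row, so the fold does not remove the inner traversal.

```scala
val points = List("A1,2,10", "A2,2,5,6,7,8,9")

// map: one pass over the outer list, one pass over each row's fields
val viaMap: List[List[Double]] =
  points.map(_.split(",").toList.tail.map(_.toDouble))

// foldRight: builds the same result by prepending each converted row
val viaFold: List[List[Double]] =
  points.foldRight(List.empty[List[Double]]) { (line, acc) =>
    line.split(",").toList.tail.map(_.toDouble) :: acc
  }
```

Note that toDouble throws a NumberFormatException on a non-numeric field, so this sketch assumes clean input.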

Related

Is there a way to filter out the elements of a List by checking them against elements of an Array in Scala?

I have a List in Scala:
val hdtList = hdt.split(",").toList
hdtList.foreach(println)
Output:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string
There is an array which is obtained from a dataframe and converting its column to array as below:
val partition_columns = spColsDF.select("partition_columns").collect.flatMap(x => x.getAs[String](0).split(","))
partition_columns.foreach(println)
Output:
source_system_name
period_year
Is there a way to filter out the elements source_system_name string and period_year bigint from hdtList by checking them against the elements of the array partition_columns, and put them into a new List?
I am confused about how to apply filter/map to the right collections and compare them.
Could anyone let me know how I can achieve that?
Unless I'm misunderstanding the question, I think this is what you need:
val filtered = hdtList.filter { x =>
!partition_columns.exists { col => x.startsWith(col) }
}
In your case you need to use filter, because you need to remove elements from hdtList.
map is a function that transforms elements; there is no way to remove elements from a collection using map. If you have a List of X elements, you still have exactly X elements after map runs, no fewer and no more.
To keep the matching elements in a new list instead of removing them, drop the negation:
val newList = hdtList.filter( x => partition_columns.exists(x.startsWith) )
Be aware that combining filter with exists across two collections is an O(N×M) algorithm. If your collections are big, you will have a performance problem.
One way to solve that problem is using Sets.
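A minimal sketch of the Set-based approach, assuming a shortened sample of the data from the question. The helper columnName and the names partitionSet, kept and dropped are hypothetical, introduced only for illustration. Unlike startsWith, this matches the column name exactly, so a partition column named period would not accidentally match period_year bigint.

```scala
// Shortened sample of the question's data (assumption for illustration)
val hdtList = List(
  "forecast_id bigint",
  "period_year bigint",
  "period_num bigint",
  "source_system_name string")
val partitionColumns = Array("source_system_name", "period_year")

// Build the Set once; each membership test is then effectively constant time
val partitionSet: Set[String] = partitionColumns.toSet

// Hypothetical helper: the column name is everything before the first space
def columnName(entry: String): String = entry.takeWhile(_ != ' ')

val kept    = hdtList.filter(e => partitionSet.contains(columnName(e)))
val dropped = hdtList.filterNot(e => partitionSet.contains(columnName(e)))
```

With the Set in place the work is roughly proportional to N + M rather than N×M.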
It might be useful to have both lists: the hdt elements referenced in partition_columns, and the hdt elements that aren't.
val (pc, notPc) = hdtList.partition(w =>
  partition_columns.contains(w.takeWhile(_ != ' ')))
//pc: List[String] = List(period_year bigint, source_system_name string)
//notPc: List[String] = List(forecast_id bigint, period_num bigint, ... etc.

How to sort a list in scala

I am a newbie in Scala and I need to sort a very large list of 40000 integers.
The operation is performed many times. So performance is very important.
What is the best method for sorting?
You can sort the list with List.sortWith() by providing a suitable function literal. For example, the following code sorts the initial list alphabetically by the lowercased first character of each element and prints the result:
val initial = List("doodle", "Cons", "bible", "Army")
val sorted = initial.sortWith((s: String, t: String)
=> s.charAt(0).toLower < t.charAt(0).toLower)
println(sorted)
With Scala's type inference, a much shorter version is:
val initial = List("doodle", "Cons", "bible", "Army")
val sorted = initial.sortWith((s, t) => s.charAt(0).toLower < t.charAt(0).toLower)
println(sorted)
For integers there is List.sorted, just use this:
val list = List(4, 3, 2, 1)
val sortedList = list.sorted
println(sortedList)
Just check the docs.
List has several methods for sorting. myList.sorted works for types that already have a defined ordering (such as Int, String and others). myList.sortWith and myList.sortBy take a function that helps define the order.
Also, first link on google for scala List sort: http://alvinalexander.com/scala/how-sort-scala-sequences-seq-list-array-buffer-vector-ordering-ordered
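You can use (1 to 400000).toList.sorted. Note that List(1 to 400000) would build a one-element list containing a Range rather than a list of integers, and calling .sorted on it does not compile.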
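Because the question stresses performance, here is a hedged sketch comparing the immutable List.sorted with sorting a primitive array in place via scala.util.Sorting.quickSort. The random input and the names input, sortedList and arr are assumptions for illustration; which route is faster depends on your workload, so measure before committing to one.

```scala
import scala.util.{Random, Sorting}

// Hypothetical input: 40000 pseudo-random integers (fixed seed for repeatability)
val rng = new Random(42)
val input: List[Int] = List.fill(40000)(rng.nextInt())

// Immutable route: returns a new sorted List, input is left unchanged
val sortedList: List[Int] = input.sorted

// Array route: copy once, then sort the primitive array in place
val arr: Array[Int] = input.toArray
Sorting.quickSort(arr)
```

If the data is sorted many times, keeping it in an Array[Int] from the start may avoid the repeated List-to-Array copies.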

Sorting three lists using the ordering from one of them in Scala

Given three distinct lists of the same length, I want to sort all three of them, using the ordering from one of them. For example, for the given three lists:
val a = Seq(2, 1, 3)
val b = Seq("Hi", "there", "world")
val c = Seq(1.0, 2.0, 3.0)
...and assuming that we sort by the ordering from a, I want the result to look like this:
aSorted: Seq[Int] = List(1, 2, 3) // Sorted by its own order
bSorted: Seq[String] = List("there", "Hi", "world") // Reordered the same way as aSorted
cSorted: Seq[Double] = List(2.0, 1.0, 3.0) // Reordered the same way as aSorted
All functions from Sorting appear to work on sequences without any way to specify a swap operation. So do I have to resort to writing my own code to sort? Or should I implement some custom sequence type? If so, how?
You can do this pretty cleanly with zip, sortBy, and unzip.
val (aSorted, pair) = a.zip(b.zip(c)).sortBy(_._1).unzip
val (bSorted, cSorted) = pair.unzip
zip takes two sequences and returns a sequence of pairs (dropping any extra elements if the lengths don't match). This means b.zip(c) is a sequence of (String, Double) elements, and a.zip(b.zip(c)) is a sequence of (Int, (String, Double)).
We can then use sortBy(_._1) to sort this sequence by the elements from the first sequence.
Lastly unzip just undoes zip, turning a sequence of (Int, (String, Double)) into a pair of sequences—one of Int elements and one of (String, Double) elements. Then we just do the same operation again on the second of these two sequences, and you've got the result you want.
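An alternative sketch, not taken from the answer above: compute the sorting permutation of a once and apply it to each sequence. It assumes the inputs are indexed sequences (Vector here) so that lookup by index is cheap; the name order is hypothetical.

```scala
val a = Vector(2, 1, 3)
val b = Vector("Hi", "there", "world")
val c = Vector(1.0, 2.0, 3.0)

// Indices of a, arranged so that a(order(0)) <= a(order(1)) <= ...
val order = a.indices.sortBy(i => a(i))

// Apply the same permutation to every sequence
val aSorted = order.map(i => a(i))
val bSorted = order.map(i => b(i))
val cSorted = order.map(i => c(i))
```

This avoids nesting tuples and scales to any number of parallel sequences, at the cost of an index lookup per element.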

How to create a map from a RDD[String] using scala?

My file is,
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
Here there are 7 rows and 5 columns (0, 1, 2, 3, 4).
I want the output as,
Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))
The output must be of type Map[Int, Set[String]].
EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark
Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.
Splitting each string into words can be done in partitions. Getting the set of values used in each column is a bit more tricky - the naive approach of initialising a set then adding every value from every row is inherently serial/local, since there's only one set (per column) we're adding the value from each row to.
However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
So, the algorithm:
1. Split each row into an array of strings, then change this into an array of sets of the single string value for each column. This can all be done with one map, and distributed.
2. Now reduce this using an operation that merges the set for each column in turn. This can also be distributed.
3. Turn the single row that results into a Map.
It's no coincidence that we do a map, then a reduce, which should remind you of something :)
Here's a one-liner that produces the single row:
val data = List(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val row = data.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
Converting it to a Map as the question asks:
val theMap = row.zipWithIndex.map(_.swap).toMap
1. Zip the list with the index, since that's what we need as the key of the map.
2. The elements of each tuple are unfortunately in the wrong order for .toMap, so swap them.
3. Then we have a list of (key, value) pairs, which .toMap will turn into the desired result.
These don't need to change AT ALL to work with Spark. We just need to use an RDD instead of the List. Let's convert data into an RDD just to demo this:
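import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val rdd = sc.makeRDD(data)
// Same map and reduce as on the List; Spark distributes both steps
val row = rdd.map(_.split("\\W+").map(s => Set(s)))
  .reduce { (a, b) => (a zip b).map { case (l, r) => l ++ r } }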
(This can be converted into a Map as before)
An earlier oneliner works neatly (transpose is exactly what's needed here) but is very difficult to distribute (transpose inherently needs to visit every row)
data.map(_.split("\\W+")).transpose.map(_.toSet)
(Omitting the conversion to Map for clarity)
Split each string into words.
Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
Convert each of those to a set.
Maybe this does the trick:
val a = Array(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val b = new Array[Map[String, Set[String]]](5)
for (i <- 0 to 4)
b(i) = Map(i.toString -> (Set() ++ (for (s <- a) yield s.split(",")(i))) )
println(b.mkString("\n"))
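The loop above yields an array of one-entry maps keyed by String. A possible sketch that produces a single Map[Int, Set[String]], the type the question asks for, using the transpose idea mentioned earlier on plain Scala collections (no Spark). The names rows and columns are hypothetical, and values keep their original case (FALSE/TRUE), unlike the lowercase output shown in the question.

```scala
val rows = List(
  "sunny,hot,high,FALSE,no",
  "sunny,hot,high,TRUE,no",
  "overcast,hot,high,FALSE,yes",
  "rainy,mild,high,FALSE,yes",
  "rainy,cool,normal,FALSE,yes",
  "rainy,cool,normal,TRUE,no",
  "overcast,cool,normal,TRUE,yes")

val columns: Map[Int, Set[String]] =
  rows.map(_.split(",").toList) // one List of fields per row
    .transpose                  // one List of values per column
    .map(_.toSet)               // distinct values per column
    .zipWithIndex               // (values, columnIndex)
    .map(_.swap)                // (columnIndex, values)
    .toMap
```

transpose assumes every row has the same number of fields; a ragged file would make it throw.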

In Scala, how to get a slice of a list from nth element to the end of the list without knowing the length?

I'm looking for an elegant way to get a slice of a list from element n onwards without having to specify the length of the list. Let's say we have a multiline string which I split into lines, and then I want to get a list of all lines from line 3 onwards:
string.split("\n").slice(3,X) // But I don't know what X is...
What I'm really interested in here is whether there's a way to get hold of a reference of the list returned by the split call so that its length can be substituted into X at the time of the slice call, kind of like a fancy _ (in which case it would read as slice(3,_.length)) ? In python one doesn't need to specify the last element of the slice.
Of course I could solve this by using a temp variable after the split, or creating a helper function with a nice syntax, but I'm just curious.
Just drop the first n elements you don't need:
List(1,2,3,4).drop(2)
res0: List[Int] = List(3, 4)
or in your case:
string.split("\n").drop(2)
There is also the paired method .take(n), which does the opposite; you can think of it as .slice(0, n).
In case you need both parts, use .splitAt:
val (left, right) = List(1,2,3,4).splitAt(2)
left: List[Int] = List(1, 2)
right: List[Int] = List(3, 4)
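A small sketch relating drop to the slice call from the question, assuming a hypothetical five-line string. It shows that drop(n) gives the same result as slice(n, length), and that slice clamps an upper bound larger than the collection, so no length lookup is needed either way.

```scala
// Hypothetical multiline input
val text = "line1\nline2\nline3\nline4\nline5"
val lines = text.split("\n").toList

val n = 2
val fromThird = lines.drop(n)                // everything after the first n lines
val viaSlice  = lines.slice(n, lines.length) // same result, but needs the length
val viaMax    = lines.slice(n, Int.MaxValue) // slice clamps an oversized upper bound

// Dropping more elements than exist yields an empty list, not an exception
val none = lines.drop(10)
```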
If you want the last n elements rather than everything after the first n, takeRight(n) is the right answer:
"communism is sharing => resource saver".takeRight(3)
//> res0: String = ver
You can use Scala's List method takeRight. It does not throw an exception when the list is shorter than the requested length:
val t = List(1, 2, 3, 4, 5)
t.takeRight(3)
res1: List[Int] = List(3, 4, 5)
If the list is shorter than the number of elements you want to take, you simply get the whole list:
val t = List(4, 5)
t.takeRight(3)
res1: List[Int] = List(4, 5)
To get the last 2 elements as an iterator (note that they come out in reverse order, 5 then 4):
List(1,2,3,4,5).reverseIterator.take(2)