Strange behavior in Scala Parallel View

According to Scaladoc,
A view is a lazy version of some collection. Collection transformers such as map or filter or ++ do not traverse any elements when applied on a view. Instead they create a new view which simply records the fact that the operation needs to be applied.
That means the operations won't be applied until the elements are accessed. But what about parallel views?
Take a look at this example:
def tn = Thread.currentThread.getName
val strList = List("I", "am", "a" , "student", ".", "I", "come", "from", "China", ".","I","love","peace")
val pvs = strList.par.view.filter{ s => println("f "+ tn); s == "I"}.map{s => println("m " + tn); s.toLowerCase}
Evaluating the second line already prints output like the following:
When you apply foreach on pvs, it outputs:
I can't understand why the behaviour of the parallel version is not the same as the normal one:
val strList = List("I", "am", "a" , "student", ".", "I", "come", "from", "China", ".","I","love","peace") // or read from a text file , e.g. article.txt
strList.view.filter{s => println("f"); s == "I"}.map{s => println("m"); s.toLowerCase}.foreach(s => println("p"))

Because the interpreter has to print the value of the expression, and when that expression is a parallel collection view it does so by forcing the view (in essence, it evaluates everything just so it can print it). Try either running this as a standalone Scala program, or assign the result instead of letting the REPL display it:
scala> object foo { var bar: AnyRef = null }
scala> foo.bar = strList.par.view.filter{ s => println("f "+ tn); s == "I"}.map{s => println("m " + tn); s.toLowerCase}
EDIT:
Another issue above is the filter method on parallel views - unlike regular views, it is implemented by forcing the collection. This means that the moment you call filter on the parallel view, the entire filtered collection will be forced into an array and the predicate associated with the filter will have to be called. Methods like groupBy do the same thing on regular views.
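To see this in isolation, here is a minimal sketch (assuming a pre-2.13 Scala, where .par.view is still available; ParViewDemo is just a hypothetical name). Because filter on a parallel view forces the filtered elements eagerly, the f lines print as soon as pvs is defined, while the m lines only print once foreach forces the rest of the pipeline:
object ParViewDemo extends App {
  def tn = Thread.currentThread.getName

  val strList = List("I", "am", "a", "student", ".", "I", "come", "from", "China", ".", "I", "love", "peace")

  // Defining the pipeline: the filter predicate runs here already,
  // but the map function is still deferred.
  val pvs = strList.par.view
    .filter { s => println("f " + tn); s == "I" }
    .map { s => println("m " + tn); s.toLowerCase }

  println("--- forcing with foreach ---")
  pvs.foreach(s => println("p " + s + " " + tn))
}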

Related

Scala: Compare elements at same position in two arrays

I'm in the process of learning Scala and am trying to write some sort of function that will compare one element in a list against an element in another list at the same index. I know that there has to be a more Scalatic way to do this than to write two for loops and keep track of the current index of each manually.
Let's say that we're comparing URLs, for example. Say that we have the following two Lists that are URLs split by the / character:
val incomingUrl = List("users", "profile", "12345")
and
val urlToCompare = List("users", "profile", ":id")
Say that I want to treat any element that begins with the : character as a match, but any element that does not begin with a : will not be a match.
What is the best and most Scalatic way to go about doing this comparison?
Coming from a OOP background, I would immediately jump to a for loop, but I know that there has to be a good FP way to go about it that will teach me a thing or two about Scala.
EDIT
For completion, I found this outdated question shortly after posting mine that relates to the problem.
EDIT 2
The implementation that I chose for this specific use case:
def doRoutesMatch(incomingURL: List[String], urlToCompare: List[String]): Boolean = {
  // if the lengths don't match, return immediately
  if (incomingURL.length != urlToCompare.length) return false
  // merge the lists into a list of pairs
  urlToCompare.zip(incomingURL)
    // iterate over it
    .foreach {
      // get each pair of path segments
      case (existingPath, pathToCompare) =>
        if (
          // check if this is some value supplied to the url, such as `:id`
          existingPath(0) != ':' &&
          // if this isn't a placeholder for a value that the route needs, then check if the strings are equal
          existingPath != pathToCompare
        )
          // if neither matches, it doesn't match the existing route
          return false
    }
  // return true if a `false` didn't get returned in the above foreach loop
  true
}
You can use zip, which, when invoked on a Seq[A] with a Seq[B], results in a Seq[(A, B)]. In other words, it creates a sequence of tuples pairing up the elements of both sequences:
incomingUrl.zip(urlToCompare).map { case(incoming, pattern) => f(incoming, pattern) }
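As a sketch of what f could look like for the route-matching case in the question (assuming a segment matches when the pattern starts with : or when the strings are equal):
def segmentMatches(incoming: String, pattern: String): Boolean =
  pattern.startsWith(":") || incoming == pattern

val incomingUrl = List("users", "profile", "12345")
val urlToCompare = List("users", "profile", ":id")

val routeMatches =
  incomingUrl.length == urlToCompare.length &&
    incomingUrl.zip(urlToCompare).forall { case (incoming, pattern) =>
      segmentMatches(incoming, pattern)
    }
// routeMatches: Boolean = true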
There is already another answer to the question, but I am adding another one since there is one corner case to watch out for. If you don't know the lengths of the two Lists, you need zipAll. Since zipAll needs a default value to insert if no corresponding element exists in the List, I am first wrapping every element in a Some, and then performing the zipAll.
object ZipAllTest extends App {
  val incomingUrl = List("users", "profile", "12345", "extra")
  val urlToCompare = List("users", "profile", ":id")

  val list1 = incomingUrl.map(Some(_))
  val list2 = urlToCompare.map(Some(_))
  val zipped = list1.zipAll(list2, None, None)
  println(zipped)
}
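From the zipped value, the comparison itself could then look roughly like this (a None on either side means the two lists had different lengths, so the route cannot match):
val routeMatches = zipped.forall {
  case (Some(incoming), Some(pattern)) => pattern.startsWith(":") || incoming == pattern
  case _ => false // one list ran out of elements
}
println(routeMatches) // false here, because "extra" has no counterpart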
One thing that might bother you is that we are making several passes through the data. If that's something you are worried about, you can use lazy collections or else write a custom fold that makes only one pass over the data. That is probably overkill though. If someone wants to, they can add those alternative implementations in another answer.
Since the OP is curious to see how we would use lazy collections or a custom fold to do the same thing, I have included a separate answer with those implementations.
The first implementation uses lazy collections. Note that lazy collections have poor cache properties, so in practice it often does not make sense to use lazy collections as a micro-optimization. Although lazy collections will minimize the number of times you traverse the data, as has already been mentioned, the underlying data structure does not have good cache locality. To understand why lazy collections minimize the number of passes you make over the data, read chapter 5 of Functional Programming in Scala.
object LazyZipTest extends App {
  val incomingUrl = List("users", "profile", "12345", "extra").view
  val urlToCompare = List("users", "profile", ":id").view

  val list1 = incomingUrl.map(Some(_))
  val list2 = urlToCompare.map(Some(_))
  val zipped = list1.zipAll(list2, None, None)
  println(zipped)
}
The second implementation uses a custom fold to go over lists only one time. Since we are appending to the rear of our data structure, we want to use IndexedSeq, not List. You should rarely be using List anyway. Otherwise, if you are going to convert from List to IndexedSeq, you are actually making one additional pass over the data, in which case, you might as well not bother and just use the naive implementation I already wrote in the other answer.
Here is the custom fold.
object FoldTest extends App {
  val incomingUrl = List("users", "profile", "12345", "extra").toIndexedSeq
  val urlToCompare = List("users", "profile", ":id").toIndexedSeq

  def onePassZip[T, U](l1: IndexedSeq[T], l2: IndexedSeq[U]): IndexedSeq[(Option[T], Option[U])] = {
    val folded = l1.foldLeft((l2, IndexedSeq[(Option[T], Option[U])]())) { (acc, e) =>
      acc._1 match {
        case x +: xs => (xs, acc._2 :+ (Some(e), Some(x)))
        case IndexedSeq() => (IndexedSeq(), acc._2 :+ (Some(e), None))
      }
    }
    folded._2 ++ folded._1.map(x => (None, Some(x)))
  }

  println(onePassZip(incomingUrl, urlToCompare))
  println(onePassZip(urlToCompare, incomingUrl))
}
If you have any questions, I can answer them in the comments section.

How can I take the highest ranking filter condition that ends up matching in a dataframe?

The wording of my question might be confusing, so let me explain. Say I have an array of strings, ranked in order of preference: ideally the pattern at index 0 matches something in the dataframe column, but if it doesn't, then index 1 is the next best option. I have written this logic like the code below, but I don't feel it is the most efficient way to do this. Is there a better way?
The datasets are quite small, but I fear this can't really scale very well.
import scala.util.control.Breaks._

val df = spark.createDataFrame(data)
val nameArray = Array[String]("Name", "Name%", "%Name%", "Person Name", "Person Name%", "%Person Name%")

breakable {
  nameArray.foreach(x => {
    val nameDf = df.where("text like '" + x + "'")
    if (nameDf.count() > 0) {
      nameDf.show(1)
      break()
    }
  })
}
If the values are ordered by preference from left (highest precedence) to right (lowest precedence), and lower-precedence patterns already cover the higher-precedence ones (it doesn't look like that is the case in your example), you can generate an expression like this:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

def matched(df: DataFrame, nameArray: Seq[String], c: String = "text") = {
  val matchIdx = nameArray.zipWithIndex.foldRight(lit(-1)) {
    case ((s, i), acc) => when(col(c) like s, lit(i)).otherwise(acc)
  }
  df.select(max(matchIdx)).first match {
    case Row(-1) => None // No pattern matches all records
    case Row(i: Int) => Some(nameArray(i))
  }
}
Usage examples:
matched(Seq("Some Name", "Name", "Name Surname").toDF("text"), Seq("Name", "Name%", "%Name%"))
// Option[String] = Some(%Name%)
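For intuition, with the patterns from that call, the foldRight unrolls matchIdx into roughly this nested conditional (a sketch, assuming the default column name text):
import org.apache.spark.sql.functions._

val matchIdx =
  when(col("text") like "Name", lit(0))
    .otherwise(when(col("text") like "Name%", lit(1))
      .otherwise(when(col("text") like "%Name%", lit(2))
        .otherwise(lit(-1))))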
There are two advantages of this method:
Only one action is required.
Pattern matching can be short circuited.
If the pre-conditions are not satisfied, you can instead count, for each pattern, how many records it fails to match:
import org.apache.spark.sql.functions._

// assumes spark.implicits._ is in scope for the $"text" syntax
val unmatchedCount: Map[String, Long] = df.select(
  nameArray.map(s => count(when(not($"text" like s), 1)).alias(s)): _*
).first.getValuesMap[Long](nameArray)
Unlike the first approach it will check all patterns, but it requires only one action.
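From the resulting map, picking the best pattern is then just a lookup in preference order; a pattern whose unmatched count is zero matches every record (a small sketch):
val bestMatch: Option[String] =
  nameArray.find(s => unmatchedCount.getOrElse(s, Long.MaxValue) == 0L)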

Does this specific exercise lend itself well to a 'functional style' design pattern?

Say we have an array of one-dimensional JavaScript objects contained in a file Array.json, for which the key schema isn't known; that is, the keys aren't known until the file is read.
Then we wish to output a CSV file with a header (first line) which is a comma-delimited set of the keys from all of the objects.
Each subsequent line of the file should contain the comma-separated values which correspond to each key from the file.
Array.json
[
  { "abc": 123, "xy": "yz", "s12": 13 },
  ...
  { "abc": 1, "s": 133 }
]
A valid output:
abc,xy,s12,s
123,yz,13,
1,,,133
I'm teaching myself 'functional style' programming but I'm thinking that this problem doesn't lend itself well to a functional solution.
I believe that this problem requires some state to be kept for the output header and that subsequently each line depends on that header.
I'm looking to solve the problem in a single pass. My goals are efficiency for a large data set, minimal traversals, and if possible, parallelizability. If this isn't possible then can you give a proof or reasoning to explain why?
EDIT: Is there a way to solve the problem like this, functionally? Say you pass through the array once, in some particular order. From the start the header set looks like abc,xy,s12 for the first object, with CSV entry 123,yz,13. Then on the next object we add an additional key to the header set, so abc,xy,s12,s would be the header and the CSV entry would be 1,,,133. In the end we wouldn't need to pass through the data set a second time; we could just append extra commas to the result set. This is one way we could approach a single pass.
Are there functional tools (functions) designed to solve problems like this, and what should I be considering? [By functional tools I mean monads, flatMap, filter, etc.] Alternatively, should I be considering things like Futures?
Currently I've been trying to approach this using Java 8, but am open to solutions from Scala, etc. Ideally I would be able to determine whether Java 8's functional approach can solve the problem, since that's the language I'm currently working in.
Since the csv output will change with every new line of input, you must hold that in memory before writing it out. If you consider creating an output text format from an internal representation of a csv file another "pass" over the data (the internal representation of the csv is practically a Map[String,List[String]] which you must traverse to convert it to text) then it's not possible to do this in a single pass.
If, however, this is acceptable, then you can use a Stream to read a single item from your json file, merge that into the csv file, and do this until the stream is empty.
Assuming that the internal representation of the CSV file is
trait CsvFile {
  def merge(line: Map[String, String]): CsvFile
}
And you can represent a single item as
trait Item {
  def asMap: Map[String, String]
}
You can implement it using foldLeft:
def toCsv(items: Stream[Item]): CsvFile =
  items.foldLeft(CsvFile(Map()))((csv, item) => csv.merge(item.asMap))
or use recursion to get the same result
import scala.annotation.tailrec

@tailrec def toCsv(items: Stream[Item], prevCsv: CsvFile): CsvFile =
  items match {
    case Stream.Empty => prevCsv
    case item #:: rest =>
      val newCsv = prevCsv.merge(item.asMap)
      toCsv(rest, newCsv)
  }
Note: Of course you don't have to create types for CsvFile or Item, you can use Map[String,List[String]] and Map[String,String] respectively
UPDATE:
As more detail was requested for the CsvFile trait/class, here's an example implementation:
case class CsvFile(lines: Map[String, List[String]], rowCount: Int = 0) {
  def merge(line: Map[String, String]): CsvFile = {
    val orig = lines.withDefaultValue(List.fill(rowCount)(""))
    val current = line.withDefaultValue("")
    val newLines = (lines.keySet ++ line.keySet) map {
      k => (k, orig(k) :+ current(k))
    }
    CsvFile(newLines.toMap, rowCount + 1)
  }
}
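For instance, feeding the two objects from the question through merge accumulates the columns like this (a quick sketch using the case class above):
val csv = CsvFile(Map())
  .merge(Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"))
  .merge(Map("abc" -> "1", "s" -> "133"))

// csv.lines is now:
// Map(abc -> List(123, 1), xy -> List(yz, ""), s12 -> List(13, ""), s -> List("", 133))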
This could be one approach:
val arr = Array(Map("abc" -> 123, "xy" -> "yz", "s12" -> 13), Map("abc" -> 1, "s" -> 133))
val keys = arr.flatMap(_.keys).distinct // get the distinct keys for header
arr.map(x => keys.map(y => x.getOrElse(y,""))) // get an array of rows
It's completely OK to have state in functional programming, but having mutable state, or mutating state in place, is not allowed.
Functional programming advocates creating new, changed state instead of mutating the state in place.
So it's OK to read and access state created in the program, as long as you are not mutating it or side-effecting.
Coming to the point.
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.map { inner => inner.map { case (k, v) => k}}.flatten
list.map { inner => inner.map { case (k, v) => v}}.flatten
REPL
scala> val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list: List[List[(String, String)]] = List(List((abc,123), (xy,yz)), List((abc,1)))
scala> list.map { inner => inner.map { case (k, v) => k}}.flatten
res1: List[String] = List(abc, xy, abc)
scala> list.map { inner => inner.map { case (k, v) => v}}.flatten
res2: List[String] = List(123, yz, 1)
or use flatMap instead of map and flatten
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.flatMap { inner => inner.map { case (k, v) => k}}
list.flatMap { inner => inner.map { case (k, v) => v}}
In functional programming, mutable state is not allowed. But immutable states/values are fine.
Assuming that you have read your JSON file into a value input: List[Map[String,String]], the code below will solve your problem:
val input = List(Map("abc"->"123", "xy"->"yz" , "s12"->"13"), Map("abc"->"1", "s"->"33"))
val keys = input.map(_.keys).flatten.toSet
val keyvalues = input.map(kvs => keys.map(k => (k->kvs.getOrElse(k,""))).toMap)
val values = keyvalues.map(_.values)
val result = keys.mkString(",") + "\n" + values.map(_.mkString(",")).mkString("\n")
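With the sample input this prints something close to the following (the exact column order depends on the Set's iteration order, which is not guaranteed):
println(result)
// abc,xy,s12,s
// 123,yz,13,
// 1,,,33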

Concatenation of Sets in Scala, unexpected results

The following is a bit of code adapted from an online Scala tutorial:
object Test {
  def main(args: Array[String]) {
    val fruit1 = List("apples", "oranges", "pears")
    val fruit2 = List("mangoes", "banana")
    val veg1 = Set("potatoes", "carrots", "peas")
    val veg2 = Set("cabbage", "squash")

    var fruit = fruit1 ++ fruit2
    println( "fruit1 ++ fruit2 : " + fruit )

    var vegetable = veg1 ++ veg2
    println( "veg1 ++ veg2 : " + vegetable )
  }
}
And here are the results:
fruit1 ++ fruit2 : List(apples, oranges, pears, mangoes, banana)
veg1 ++ veg2 : Set(cabbage, peas, squash, carrots, potatoes)
The List concatenation works as I expected, but I would never have predicted the resulting order of the Set concatenation. Can anyone explain how the results are arrived at? The online tutorial is not much help here.
scala.collection.Set doesn't provide any guarantees about traversal order—if you ask for a string representation of one, the order you get is unspecified and depends on the implementation.
Some subclasses of scala.collection.Set do provide guarantees about order—e.g. the ordering you see when iterating through a collection.immutable.SortedSet will depend on the Ordering instance that you used to construct the set, and collection.mutable.LinkedHashSet will iterate through its elements in insertion order. When you're working with an arbitrary Set, though, you should never, ever, rely on the order that you happen to see when using toString, foreach, etc.
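A quick sketch of those ordered variants, with the expected iteration orders shown as comments:
import scala.collection.immutable.SortedSet
import scala.collection.mutable.LinkedHashSet

// SortedSet orders elements by the implicit Ordering (alphabetical for Strings):
println(SortedSet("potatoes", "carrots", "peas") ++ SortedSet("cabbage", "squash"))
// TreeSet(cabbage, carrots, peas, potatoes, squash)

// LinkedHashSet iterates in insertion order:
println(LinkedHashSet("potatoes", "carrots", "peas") ++ LinkedHashSet("cabbage", "squash"))
// elements in insertion order: potatoes, carrots, peas, cabbage, squash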
Note that since Set is an unordered data structure, implementations are free to use any order they like.
The default implementation of Set uses a HashSet under the hood. This is implemented using a hashing approach, which makes it effectively constant time (O(1)) to determine whether an item exists in the Set. The ordering on output is essentially the order of the hashes generated per item, which is effectively random.
An alternative implementation is TreeSet, which stores the data as a tree, so it is O(log(n)) to determine whether an item exists in the set. In this case, the output ordering is the sorted order of the items, i.e. an in-order traversal of the tree.
The following code demonstrates this:
import scala.collection.immutable.{TreeSet, HashSet}
val veg1 = HashSet("potatoes", "carrots", "peas")
val veg2 = HashSet("cabbage", "squash")
val vegetable = veg1 ++ veg2
println( "veg1 ++ veg2 : " + vegetable )
//veg1 ++ veg2 : Set(cabbage, peas, squash, carrots, potatoes)
val trVeg1 = TreeSet("potatoes", "carrots", "peas")
val trVeg2 = TreeSet("cabbage", "squash")
val trVegetable = trVeg1 ++ trVeg2
println( "trVeg1 ++ trVeg2 : " + trVegetable )
//trVeg1 ++ trVeg2 : TreeSet(cabbage, carrots, peas, potatoes, squash)
EDIT:
As pointed out in the comments by Travis Brown, it's not exactly true that the "default implementation" of Set is a HashSet. Scala has specialized implementations of Set for small numbers of items. However, when a set of more than 4 items is created, or when the ++ operator is used on instances of Set() to create a Set of more than 4 items, the result is a HashSet:
val set1 = Set("abc", "def", "ghi")
val set2 = Set("abcd", "defg", "ghij")
set1.getClass
//scala.collection.immutable.Set$Set3
set2.getClass
//scala.collection.immutable.Set$Set3
val concSet = set1 ++ set2
concSet.getClass
//scala.collection.immutable.HashSet$HashTrieSet
The small Set implementations only go up to Sets of 4 items, so:
val set3 = Set("abcd", "defg", "ghij", "fghtre", "dfghdd")
set3.getClass
//scala.collection.immutable.HashSet$HashTrieSet
https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/immutable/Set.scala#L184

How to create a List of Wildcard elements Scala

I'm trying to write a function that returns a list (for querying purposes) that has some wildcard elements:
def createPattern(query: List[(String,String)]) = {
  val l = List[(_,_,_,_,_,_,_)]
  var iter = query
  while (iter != null) {
    val x = iter.head._1 match {
      case "userId" => 0
      case "userName" => 1
      case "email" => 2
      case "userPassword" => 3
      case "creationDate" => 4
      case "lastLoginDate" => 5
      case "removed" => 6
    }
    l(x) = iter.head._2
    iter = iter.tail
  }
  l
}
So, the user enters some query terms as a list. The function parses through these terms and inserts them into val l. The fields that the user doesn't specify are entered as wildcards.
Val l is causing me troubles. Am I going the right route or are there better ways to do this?
Thanks!
Gosh, where to start. I'd begin by getting an IDE (IntelliJ / Eclipse) which will tell you when you're writing nonsense and why.
Read up on how List works. It's an immutable linked list so your attempts to update by index are very misguided.
Don't use tuples - use case classes.
You shouldn't ever need to use null and I guess here you mean Nil.
Don't use var and while - use for-expression, or the relevant higher-order functions foreach, map etc.
Your code doesn't make much sense as it is, but it seems you're trying to return a 7-element list with the second element of each tuple in the input list mapped via a lookup to position in the output list.
To improve it... don't do that. What you're doing (as programmers have done since arrays were invented) is to use the index as a crude proxy for a Map from Int to whatever. What you want is an actual Map. I don't know what you want to do with it, but wouldn't it be nicer if it were from these key strings themselves, rather than by a number? If so, you can simplify your whole method to
def createPattern(query: List[(String,String)]) = query.toMap
at which point you should realise you probably don't need the method at all, since you can just use toMap at the call site.
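For example (a small sketch):
val query = List("userId" -> "42", "email" -> "foo@example.com")
val pattern = query.toMap

pattern.get("email")    // Some(foo@example.com)
pattern.get("userName") // None: absent keys need no explicit wildcard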
If you insist on using an Int index, you could write
def createPattern(query: List[(String,String)]) = {
  def intVal(x: String) = x match {
    case "userId" => 0
    case "userName" => 1
    case "email" => 2
    case "userPassword" => 3
    case "creationDate" => 4
    case "lastLoginDate" => 5
    case "removed" => 6
  }
  val tuples = for ((key, value) <- query) yield (intVal(key), value)
  tuples.toMap
}
Not sure what you want to do with the resulting list, but you can't create a List of wildcards like that.
What do you want to do with the resulting list, and what type should it be?
Here's how you might build something if you wanted the result to be a List[String], and if you wanted wildcards to be "*":
def createPattern(query: List[(String,String)]) = {
  val wildcard = "*"
  def orElseWildcard(key: String) = query.find(_._1 == key).getOrElse("", wildcard)._2
  orElseWildcard("userId") ::
  orElseWildcard("userName") ::
  orElseWildcard("email") ::
  orElseWildcard("userPassword") ::
  orElseWildcard("creationDate") ::
  orElseWildcard("lastLoginDate") ::
  orElseWildcard("removed") ::
  Nil
}
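For example, any field that isn't supplied comes back as the "*" wildcard:
createPattern(List("userId" -> "42", "email" -> "a@b.com"))
// List(42, *, a@b.com, *, *, *, *)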
You're not using List, Tuple, iterator, or wild-cards correctly.
I'd take a different approach - maybe something like this:
case class Pattern(valueMap: Map[String,String]) {
  def this(valueList: List[(String,String)]) = this(valueList.toMap)

  val Seq(
    userId, userName, email, userPassword, creationDate,
    lastLoginDate, removed
  ): Seq[Option[String]] = Seq(
    "userId", "userName", "email", "userPassword", "creationDate",
    "lastLoginDate", "removed"
  ).map(valueMap.get(_))
}
Then you can do something like this:
scala> val pattern = new Pattern( List( "userId" -> "Fred" ) )
pattern: Pattern = Pattern(Map(userId -> Fred))
scala> pattern.email
res2: Option[String] = None
scala> pattern.userId
res3: Option[String] = Some(Fred)
Or you can just use the map directly.