MapReduce - Reducing to Specify last colum - scala

I have this code:
1. var data = sc.textFile("test3.tsv")
2. var satir = data.map(line=> ((line.split("\t")(1),line.split("\t")(2)),(1,1)))
3. satir.reduce(((a,b),(c,k)) => k + k)
First and second works properly. What I want is by reducing (a,b), specify last '1'
For example, like this:
((a,b),(1,1))
But when I compile third one I get this error:
<console>:29: error: type mismatch;
found : (Int, Int)
required: String
satir.reduce({ case ((a,b),(k,o)) =>o+o})
What should I do?

When you reduce a value, the output value type must be the same as the input value type, you can use a folding method instead, because you can return another type with him.
scala.io.Source.fromFile("test3.tsv")
.getLines
.toList
.map { line =>
val value = line.split("\t")
((value(0), value(1)), (1,1))
}
.foldLeft(0)((response, tuple) => tuple._2._2 + tuple._2._2)
If you care to understand the theory behind this:
Fold explanation
A tutorial on the universality and
expressiveness of fold
A Translation from Attribute Grammars to
Catamorphisms

Related

Checking Array content equality in linear time in Scala

I want to check if 2 arrays (target and arr) of integers have the same content (both arrays are unordered). Example: Array(1,1,3,4,1,5) contents are equal to Array(1,1,1,5,4,3) contents. I'm aware of the sorting solution and I'm trying to work out the linear time solution. I'm starting by getting the profile of the target array by target.groupBy(identity) and then folding the destination array using that profile:
def canBeEqual(target: Array[Int], arr: Array[Int]): Boolean = {
import scala.collection.mutable.Map
var profile = target.groupBy(identity).view.mapValues(_.length).toMap
var mutableProfile = Map(profile.toSeq: _*)
arr.fold(mutableProfile)((p, x) => {p(x) = p(x) - 1; p}).isEmpty
}
The problems I faced:
The default Map is immutable, I rebuilt the profile map using the mutable map. How would that affect performance?
I'm getting this compilation error:
Line 5: error: value update is not a member of Any (in solution.scala)
arr.fold(mutableProfile)((p, x) => {p(x) = p(x) - 1; p}).isEmpty
^
I added the types arr.fold(mutableProfile)((p:Map[Int,Int], x:Int) => {p(x) = p(x) - 1; p}).isEmpty and it failed with this error:
Line 5: error: type mismatch; (in solution.scala)
found : (scala.collection.mutable.Map[Int,Int], Int) => scala.collection.mutable.Map[Int,Int]
required: (Any, Any) => Any
I'm currently stuck at this point. Can't figure out the issue with the mismatching types.
Also any recommendations on how to approach this problem idiomatically (and efficiently) are appreciated.
Disclaimer 1: I'm a Scala beginner. Started reading about Scala last week.
Disclaimer 2: Above question is a leetcode problem #1460.
Not going by algorithm modifications, but to clear the compilation errors:
def canBeEqual(target: Array[Int], arr: Array[Int]): Boolean = {
if (target.length != arr.length) {
false
} else {
var profile = target.groupBy(identity).mapValues(_.length)
arr.forall(x =>
profile.get(x) match {
case Some(e) if e >= 1 =>
profile += (x -> (e - 1))
true
case Some(_) => false
case None => false
})
}
}
Note that adding or removing from an immutable HashMap takes constant time. (ref: https://docs.scala-lang.org/overviews/collections/performance-characteristics.html)
forall preferred over fold as one can break in between when the condition does not match. (try putting a print statements in match to validate)
Or you can do:
target.groupBy(identity).mapValues(_.length) == arr.groupBy(identity).mapValues(_.length)

How do I use pattern matching for elements within a list?

I'm working on Advent of Code's coding challenges and I'm on day one. I've read from a file that contains nothing but ((()(())(( so I'm looking to turn each '(' to 1 and each ')' to a -1 so I can compute them. But I'm having issues when I map findFloor over source. I'm getting a type mismatch. Everything looks right to me and that's the weird part because it's not working.
import scala.io._
object Advent1 extends App {
// Read from file
val source = Source.fromFile("floor1-Input.txt").toList
// Replace each '(' with 1 and each ')' with -1, return List[Int]
def findFloor(input: List[Char]):Int = input match {
case _ if input.contains('(') => 1
case _ if input.contains(')') => -1
}
val floor = source.map(findFloor)
}
Error output
error: type mismatch;
found : List[Char] => Int
required: Char => ?
val floor = source.map(findFloor)
^ one error found
What I'm I doing wrong here ? / what I'm I missing ?
Scala map works over an elements rather than whole collection. Try this:
val floor = source.map {
case '(' => 1
case ')' => -1
}.sum
If you want to compute them in an sequential way, you could even do the computations directly by using foldLeft.
val computation = source.foldLeft(0)( (a, b) => {
b match {
case '(' => a + 1
case ')' => a - 1
}
})
It's simply adding up all values and returning the acummulated value. For ( it's adding and for ')' it's substracting.
The first argument is the starting value, a is the value from the previous step and b is the actual element therefore the char.
You're error is probably caused because the pattern match is not defined for all chars. You missing a match for all others, case _ => 0 for example.
An other option would be to use 'collect' since this accepts a PartialFunction and ignores all not matching elements.
The suggested 'fold' solution is the better approach here I think.

How to find unique elements from list of tuples based on some elements using scala?

I have following list
val a = List(("name1","add1","city1",10),("name1","add1","city1",10),
("name2","add2","city2",10),("name2","add2","city2",20),("name3","add3","city3",20))
I want distinct element from above list based on first three values of tuple. Fourth value should not be consider while finding distinct elements from list.
I want following output:
val output = List(("name1","add1","city1",10),("name2","add2","city2",10),
("name3","add3","city3",20))
Is it possible to get above output?
As per my knowledge, distinct works if whole tuple/value is duplicated. I tried out with distinct like following:
val b = List(("name1","add1","city1",10),("name1","add1","city1",10),("name2","add2","city2",10),
("name2","add2","city2",20),("name3","add3","city3",20)).distinct
but it gives output as -
List(("name1","add1","city1",10),("name2","add2","city2",10),
("name2","add2","city2",20),("name3","add3","city3",20))
Any alternate approach will also appreciated.
Use groupBy like this
a.groupBy( v => (v._1,v._2,v._3)).keys.toList
This constructs a Map where each key is by definition a unique triplet as required in the lambda function above.
Should it include also the last element in the tuple, fetch the first element for each key, like this
a.groupBy( v => (v._1,v._2,v._3)).mapValues(_.head)
If the order of the output list isn't important (i.e. you are happy to get List(("name3","add3","city3",20),("name1","add1","city1",10),("name2","add2","city2",10))), the following works as specified:
a.groupBy(v => (v._1,v._2,v._3)).values.map(_.head).toList
(Due to Scala collections design, you'll see the order kept for output lists up to 4 elements, but above that size HashMap will be used.) If you do need to keep the order, you can do something like (generalizing a bit)
def distinctBy[A, B](xs: Seq[A], f: A => B) = {
val seen = LinkedHashMap.empty[B, A]
xs.foreach { x =>
val key = f(x)
if (!seen.contains(key)) { seen.update(key, x) }
}
seen.values.toList
}
distinctBy(a, v => (v._1, v._2, v._3))
You could try
a.map{case x#(name, add, city, _) => (name,add,city) -> x}.toMap.values.toList
To make sure you have the first one in list kept,
type String3 = (String, String, String)
type String3Int = (String, String, String, Int)
a.foldLeft(collection.immutable.ListMap.empty[String3, String3Int]) {
case (a, b) => if (a.contains((b._1, b._2, b._3))) {
a
} else a + ((b._1, b._2, b._3) -> b)
}.values.toList
On simple solution would be to convert the List to a Set. Sets don't contain duplicates: check the documentation.
val setOfTuples = a.toSet
println(setOfTuples)
Output: Set((1,1), (1,2), (1,3), (2,1))

Running a function against every item in collection

I have data type :
counted: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = MapPartitionsRDD[24] at groupByKey at <console>:28
And I'm trying to apply the following to this type :
def func = 2
counted.flatMap { x => counted.map { y => ((x._1+","+y._1),func) } }
So each sequence is compared to each other and a function is applied. For simplicity the function is just returning 2. When I attempt above function I receive this error :
scala> counted.flatMap { x => counted.map { y => ((x._1+","+y._1),func) } }
<console>:33: error: type mismatch;
found : org.apache.spark.rdd.RDD[(String, Int)]
required: TraversableOnce[?]
counted.flatMap { x => counted.map { y => ((x._1+","+y._1),func) } }
How can this function be applied using Spark ?
I have tried
val dataArray = counted.collect
dataArray.flatMap { x => dataArray.map { y => ((x._1+","+y._1),func) } }
which converts the collection to Array type and applies same function. But I run out of memory when I try this method. I think using an RDD is more efficient than using an Array ? The max amount of memory I can allocate is 7g , is there a mechanism in spark that I can use hard drive memory to augment available RAM memory ?
The collection I'm running this function on contain 20'000 entries so 20'000^2 comparisons (400'000'000) but in Spark terms this is quite small ?
Short answer:
counted.cartesian(counted).map {
case ((x, _), (y, _)) => (x + "," + y, func)
}
Please use pattern matching to extract tuple elements for nested tuples to avoid unreadable chained underscore notation. Using _ for the second elements shows the reader that these values are being ignored.
Now what would be even more readable (and maybe more efficient) if func doesn't use the second elements would be to do this:
val projected = counted.map(_._1)
projected.cartesian(projected).map(x => (x._1 + "," + x._2, func))
Note that you do not need curly braces if your lambda fits in a single semantic line this is a very common mistake in Scala.
I would like to know why you wish to have this Cartesian product, there is often ways to avoid doing this that are significantly more scalable. Please say what your going to do with this Cartesian product and I will try to find a scalable way of doing what you want.
One final point; please put spaces between operators
#RexKerr pointed to me that I was somewhat incorrect in the comment section, so I deleted my comments. But while doing that, I had the chance to read the post again and came up with the idea that might be of some use to you.
Since what you are trying to implement is actually some operation over a cartesian product, you might want to try just calling the RDD#cartesian. Here is a dumb example, but if you can give some real code, maybe I'll be able to do something like this in that case as well:
// get collection with the type corresponding to the type in question:
val v1 = sc.parallelize(List("q"-> (".", 0), "s"->(".", 1), "f" -> (".", 2))).groupByKey
// try doing something
v1.cartesian(v1).map{x => (x._1._1+","+x._1._1, 2)}.foreach(println)

JSON to XML in Scala and dealing with Option() result

Consider the following from the Scala interpreter:
scala> JSON.parseFull("""{"name":"jack","greeting":"hello world"}""")
res6: Option[Any] = Some(Map(name -> jack, greeting -> hello world))
Why is the Map returned in Some() thing? And how do I work with it?
I want to put the values in an xml template:
<test>
<name>name goes here</name>
<greeting>greeting goes here</greeting>
</test>
What is the Scala way of getting my map out of Some(thing) and getting those values in the xml?
You should probably use something like this:
res6 collect { case x: Map[String, String] => renderXml(x) }
Where:
def renderXml(m: Map[String, String]) =
<test><name>{m.get("name") getOrElse ""}</name></test>
The collect method on Option[A] takes a PartialFunction[A, B] and is a combination of filter (by a predicate) and map (by a function). That is:
opt collect pf
opt filter (a => pf isDefinedAt a) map (a => pf(a))
Are both equivalent. When you have an optional value, you should use map, flatMap, filter, collect etc to transform the option in your program, avoiding extracting the option's contents either via a pattern-match or via the get method. You should never, ever use Option.get - it is the canonical sign that you are doing it wrong. Pattern-matching should be avoided because it represents a fork in your program and hence adds to cyclomatic complexity - the only time you might wish to do this might be for performance
Actually you have the issue that the result of the parseJSON method is an Option[Any] (the reason is that it is an Option, presumably, is that the parsing may not succeed and Option is a more graceful way of handling null than, well, null).
But the issue with my code above is that the case x: Map[String, String] cannot be checked at runtime due to type erasure (i.e. scala can check that the option contains a Map but not that the Map's type parameters are both String. The code will get you an unchecked warning.
An Option is returned because parseFull has different possible return values depending on the input, or it may fail to parse the input at all (giving None). So, aside from an optional Map which associates keys with values, an optional List can be returned as well if the JSON string denoted an array.
Example:
scala> import scala.util.parsing.json.JSON._
import scala.util.parsing.json.JSON._
scala> parseFull("""{"name":"jack"}""")
res4: Option[Any] = Some(Map(name -> jack))
scala> parseFull("""[ 100, 200, 300 ]""")
res6: Option[Any] = Some(List(100.0, 200.0, 300.0))
You might need pattern matching in order to achieve what you want, like so:
scala> parseFull("""{"name":"jack","greeting":"hello world"}""") match {
| case Some(m) => Console println ("Got a map: " + m)
| case _ =>
| }
Got a map: Map(name -> jack, greeting -> hello world)
Now, if you want to generate XML output, you can use the above to iterate over the key/value pairs:
import scala.xml.XML
parseFull("""{"name":"jack","greeting":"hello world"}""") match {
case Some(m: Map[_,_]) =>
<test>
{
m map { case (k,v) =>
XML.loadString("<%s>%s</%s>".format(k,v,k))
}
}
</test>
case _ =>
}
parseFull returns an Option because the string may not be valid JSON (in which case it will return None instead of Some).
The usual way to get the value out of a Some is to pattern match against it like this:
result match {
case Some(map) =>
doSomethingWith(map)
case None =>
handleTheError()
}
If you're certain the input will always be valid and so you don't need to handle the case of invalid input, you can use the get method on the Option, which will throw an exception when called on None.
You have two separate problems.
It's typed as Any.
Your data is inside an Option and a Map.
Let's suppose we have the data:
val x: Option[Any] = Some(Map("name" -> "jack", "greeting" -> "hi"))
and suppose that we want to return the appropriate XML if there is something to return, but not otherwise. Then we can use collect to gather those parts that we know how to deal with:
val y = x collect {
case m: Map[_,_] => m collect {
case (key: String, value: String) => key -> value
}
}
(note how we've taken each entry in the map apart to make sure it maps a string to a string--we wouldn't know how to proceed otherwise. We get:
y: Option[scala.collection.immutable.Map[String,String]] =
Some(Map(name -> jack, greeting -> hi))
Okay, that's better! Now if you know which fields you want in your XML, you can ask for them:
val z = for (m <- y; name <- m.get("name"); greet <- m.get("greeting")) yield {
<test><name>{name}</name><greeting>{greet}</greeting></test>
}
which in this (successful) case produces
z: Option[scala.xml.Elem] =
Some(<test><name>jack</name><greeting>hi</greeting></test>)
and in an unsuccessful case would produce None.
If you instead want to wrap whatever you happen to find in your map in the form <key>value</key>, it's a bit more work because Scala doesn't have a good abstraction for tags:
val z = for (m <- y) yield <test>{ m.map { case (tag, text) => xml.Elem(null, tag, xml.Null, xml.TopScope, xml.Text(text)) }}</test>
which again produces
z: Option[scala.xml.Elem] =
Some(<test><name>jack</name><greeting>hi</greeting></test>)
(You can use get to get the contents of an Option, but it will throw an exception if the Option is empty (i.e. None).)