How to print an RDD correctly - Scala

Excuse me, I'm new to Spark. I want to print an RDD in a readable format, but the result looks like this:
(200412169,([Ljava.lang.String;@7515eb2d,[Ljava.lang.String;@72031368))
(200412169,([Ljava.lang.String;@7515eb2d,[Ljava.lang.String;@27ef4b52))
My RDD is:
Array[(String, (Array[String], Array[String]))] =
Array(
(200412169,(Array(gavin),Array(1, 24, 60, 85, 78))),
(200412169,(Array(gavin),Array(2, 22, 20, 85, 78))),
(200412166,(Array(gavin3),Array(1, 54, 80, 78, 98))),
)
and I want to print it like this:
200412169 gavin 2 22 20 85 78
200412169 gavin 1 24 60 85 78
Can someone help me? Thanks very much.

The odd-looking print is the result of calling toString on a Java Array. To get a nice tab-separated printout, you can map each record into a String formatted to your liking, something like:
rdd.map { case (a, (arr1, arr2)) => (a +: arr1) ++ arr2 } // "flatten" into single array
  .map(_.mkString("\t")) // combine into Tab-separated string
  .foreach(println)
// 200412166 gavin3 1 54 80 78 98
// 200412169 gavin 2 22 20 85 78
// 200412169 gavin 1 24 60 85 78
Alternatively, if you want to keep the RDD's structure and just see a proper representation of it when printing, you can simply replace the Arrays (with their not-so-useful toString) with Scala Lists:
rdd.map { case (a, (arr1, arr2)) => (a, arr1.toList, arr2.toList) }
  .foreach(println)
// (200412169,List(gavin),List(1, 24, 60, 85, 78))
// (200412166,List(gavin3),List(1, 54, 80, 78, 98))
// (200412169,List(gavin),List(2, 22, 20, 85, 78))
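To try both approaches without a cluster, here is a minimal sketch that uses a local Seq as a stand-in for the RDD. The object name PrintRecords and the helpers asLine and asLists are made up for this example; the sample values are copied from the question. The same pattern matches can be passed to rdd.map.

```scala
object PrintRecords {
  // Local stand-in for the RDD from the question (assumption: same element type).
  val records: Seq[(String, (Array[String], Array[String]))] = Seq(
    ("200412169", (Array("gavin"), Array("1", "24", "60", "85", "78"))),
    ("200412169", (Array("gavin"), Array("2", "22", "20", "85", "78"))),
    ("200412166", (Array("gavin3"), Array("1", "54", "80", "78", "98")))
  )

  // Flatten each record into one array, then join its fields with tabs.
  def asLine(record: (String, (Array[String], Array[String]))): String =
    record match {
      case (id, (names, scores)) => ((id +: names) ++ scores).mkString("\t")
    }

  // Keep the fields separate, but swap Arrays for Lists so println is readable.
  def asLists(record: (String, (Array[String], Array[String]))): (String, List[String], List[String]) =
    record match {
      case (id, (names, scores)) => (id, names.toList, scores.toList)
    }

  def main(args: Array[String]): Unit = {
    records.map(asLine).foreach(println)
    records.map(asLists).foreach(println)
  }
}
```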

You see the result (200412169,([Ljava.lang.String;@7515eb2d,[Ljava.lang.String;@72031368))
only because println calls toString on the arrays. In Scala, to see the contents of the arrays in your RDD you have to use mkString.
If you want to view the content of an RDD, one way is to use collect():
myRDD.collect().foreach(println)
When the RDD has many lines, use take() to print just a few:
myRDD.take(n).foreach(println)
Example:
val input=sc.parallelize(List(1,2,3,4,5))
print(input.collect().mkString(","))
Result:
1,2,3,4,5


Why does Scala compiler fail with "no ': _*' annotation allowed here" when Row does accept varargs?

I would like to create a Row with multiple arguments without knowing their number. I wrote something like this in Scala:
def customRow(balance: Int,
              globalGrade: Int,
              indicators: Double*): Row = {
  Row(
    balance,
    globalGrade,
    indicators:_*
  )
}
In the Spark source on GitHub, the Row object seems to accept the :_* notation, judging by its apply method:
def apply(values: Any*): Row = new GenericRow(values.toArray)
But at compilation time, this doesn't seem to be allowed:
Error:(212, 19) no ': _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
indicators:_*
What did I miss?
This minimal example may explain better why what you want to do is not allowed:
def f(a: Int, b: Int, c: Int, rest: Int*) = a + b + c + rest.sum
val ns = List(99, 88, 77)
f(11, 22, 33, 44, ns:_*) // Illegal
f(11, 22, 33, ns:_*) // Legal
f(11, 22, ns:_*) // Illegal
Basically, you can use the :_* syntax only to pass a sequence directly as the vararg parameter rest, but it's all-or-nothing. The sequence's items are not shared out between simple and vararg parameters, and the vararg parameter cannot gather values from both the simple arguments and the provided sequence.
In your case, you are trying to call Row as if it had two simple parameters and then a vararg one, but that's not the case. When you create the sequence yourself, you are making it fit correctly into the signature.
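As a sketch of that last point, the snippet below uses a hypothetical makeRow function as a stand-in for Row.apply. It has the same single Any* parameter, but it is not Spark's API and returns a plain List. Building one Seq first lets the whole sequence be passed as the vararg.

```scala
object VarargsSketch {
  // Hypothetical stand-in for Row.apply(values: Any*): one vararg parameter only.
  def makeRow(values: Any*): List[Any] = values.toList

  def customRow(balance: Int, globalGrade: Int, indicators: Double*): List[Any] = {
    // makeRow(balance, globalGrade, indicators: _*) would not compile:
    // fixed arguments cannot be mixed with a ': _*' sequence for the same vararg.
    // Gather everything into one Seq[Any] first, then pass that sequence whole.
    val args: Seq[Any] = Seq[Any](balance, globalGrade) ++ indicators
    makeRow(args: _*)
  }

  def main(args: Array[String]): Unit =
    println(customRow(100, 7, 0.5, 1.5)) // List(100, 7, 0.5, 1.5)
}
```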
Note that in dynamically-typed programming languages, this is typically not an issue. For example, in Python:
>>> def f(a, b, c, *rest):
...     return a + b + c + sum(rest)
...
>>> ns = [99, 88, 77]
>>> f(11, 22, 33, 44, *ns)
374
>>> f(11, 22, 33, *ns)
330
>>> f(11, 22, *ns)
297
I resolved it by adding an intermediary Seq:
def customRow(balance: Int,
              globalGrade: Int,
              indicators: Double*): Row = {
  val args = Seq(
    balance,
    globalGrade
  ) ++ indicators
  Row(
    args:_*
  )
}
But still, I do not know why it works.

Scala return prime numbers from Array

I'm quite new to Scala so apologies for the very basic question.
I have this great one-liner that checks if a number is a prime. What I'm trying to do with it is allow the function to take in an Array and spit out the prime numbers.
How can I best achieve this? Is it possible to do so in a one liner as well? Thanks!
def isPrime(num: Int): Boolean = (2 to num) forall (x => num % x != 0)
You wrote: "What I'm trying to do with it is allow the function to take in an Array and spit out the prime numbers."
You can do the following
def primeNumbs(numbers: Array[Int]) = numbers.filter(x => !((2 until x-1) exists (x % _ == 0)) && x > 1)
and if you pass in an array of numbers such as
println(primeNumbs(Array(1,2,3,6,7,10,11)).toList)
You should be getting
List(2, 3, 7, 11)
I hope the answer is helpful
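Note: your isPrime function doesn't work. The range 2 to num includes num itself, and num % num is 0, so it returns false for every num >= 2.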
You can use this method
def isPrime(num : Int) : Boolean = {
  ((1 to num).filter(e => (num % e == 0)).size) == 2
}
isPrime: (num: Int)Boolean
scala> (1 to 100) filter(isPrime(_)) foreach(e=> print(e+" "))
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
Your isPrime seems completely broken.
Even if you replace to by until, it will still return strange results for 0 and 1.
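A small check makes both problems visible. The names isPrimeTo and isPrimeUntil are made up here to tell the two variants apart:

```scala
object BrokenPrimeCheck {
  // The one-liner from the question: the range includes num itself,
  // and num % num == 0, so every num >= 2 is rejected.
  def isPrimeTo(num: Int): Boolean = (2 to num) forall (x => num % x != 0)

  // With until the range stops at num - 1, which fixes num >= 2,
  // but for 0 and 1 the range is empty and forall on an empty range is true.
  def isPrimeUntil(num: Int): Boolean = (2 until num) forall (x => num % x != 0)

  def main(args: Array[String]): Unit = {
    println((0 to 10).filter(isPrimeTo).toList)    // List(0, 1)
    println((0 to 10).filter(isPrimeUntil).toList) // List(0, 1, 2, 3, 5, 7)
  }
}
```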
Here is a very simple implementation that returns the correct results for 0 and 1, and checks only divisors smaller than (approximately) sqrt(n) instead of n:
def isPrime(n: Int) =
  n == 2 ||
  (n > 1 && (2 to (math.sqrt(n).toInt + 1)).forall(n % _ > 0))
Now you can filter primes from a range (or a list):
(0 to 10000).filter(isPrime).foreach(println)
You could also write it like this:
0 to 10000 filter isPrime foreach println
But this version with explicit lambdas probably generalizes better, even though it's not necessary in this particular case:
(0 to 10000).filter(n => isPrime(n)).foreach(n => println(n))
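Coming back to the original question, a sketch that wraps this isPrime in a function taking an Array could look like the following. The name primesIn is made up for this example.

```scala
object PrimesFromArray {
  // Same check as above: 2 is special-cased, then trial division
  // by every candidate from 2 up to roughly sqrt(n).
  def isPrime(n: Int): Boolean =
    n == 2 ||
      (n > 1 && (2 to (math.sqrt(n).toInt + 1)).forall(n % _ > 0))

  // Keep only the prime elements of the input array, in their original order.
  def primesIn(numbers: Array[Int]): Array[Int] = numbers.filter(isPrime)

  def main(args: Array[String]): Unit =
    println(primesIn(Array(1, 2, 3, 6, 7, 10, 11)).toList) // List(2, 3, 7, 11)
}
```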
I understand that the prime function may be the objective of your assignment/task/interest, but note that a primality test is already available in the JVM as BigInteger.isProbablePrime(). With that, and the fact that Scala can call Java transparently, try the following filter:
import java.math.BigInteger
val r = (1 to 100).filter { BigInteger.valueOf(_).isProbablePrime(25) }.mkString(", ")
// "2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97"
This works by iterating the range of numbers (or your Array, or any TraversableOnce, same syntax) and letting pass only those numbers "_" in the closure that fulfill the condition, i.e. that are prime. And instead of folding with a string concatenation, there's a convenient helper mkString that inserts a separator into a sequence and produces a String for you.
And don't worry about the "probable" prime here. For such small numbers like here, there's no probability involved, despite the method name. That kicks in for numbers with maybe 30+ digits or so.

Is there a better or functional way to compare 2 values in a map in Scala?

I have functions which allow the user to compare two teams contained within a map. The data contained in the map is read in from a text file which contains information about football teams and their points tallies for the past 5 seasons. The data is stored as Map[String, List[Int]]:
Manchester United, 72, 86, 83, 90, 94
Manchester City, 80, 84, 91, 77, 88
Chelsea, 76, 85, 92, 87, 84
Arsenal, 70, 79, 81, 76, 83
Liverpool, 69, 77, 80, 73, 79
The functions below allow the user to enter the names of two teams and compare the difference between the most recent (last) points tallies for the two teams.
val teamdata = readTextFile("teams.txt")
//User presses 2 on keyboard, this invokes menuCompareTeams which invokes compareTeams
def menuOptionTwo(): Boolean = {
  //2 - compare 2 teams selected by the user
  menuCompareTeams(compareTeams)
  true
}
//Function which displays the results of compareTeams
def menuCompareTeams(f: (String, String) => ((String, Int), (String, Int), String)) = {
  val input = f(readLine("Enter first team to compare>"),
                readLine("Enter second team to compare>"))
  println(s"""|Team 1: ${input._1._1} - Points: ${input._1._2}
              |Team 2: ${input._2._1} - Points: ${input._2._2}
              |${input._3}""".stripMargin)
}
//Function which compares the 2 teams - invoked by menuCompareTeams
def compareTeams(team1: String, team2: String): ((String, Int), (String, Int), String) = {
  def lastPoints(list: List[Int]): Int = list match {
    case Nil => throw new Exception("Empty list")
    case h :: Nil => h
    case _ :: tail => lastPoints(tail)
  }
  val team1Points = teamdata.get(team1) match{
    case Some(p) => lastPoints(p)
    case None => 0
  }
  val team2Points = teamdata.get(team2) match{
    case Some(p) => lastPoints(p)
    case None => 0
  }
  val pointsComparison = if(team1Points > team2Points){
    "The team who finished higher is: " + team1 + ", their total points tally for last season was: " + team1Points + ". There was a difference of " + (team1Points-team2Points) + " points between the two teams."
  }
  else if(team1Points == team2Points){
    "The teams had equal points last season: " + (team1Points|team2Points)
  }
  else{
    "The team who finished higher is: " + team2 + ", their total points tally for last season was: " + team2Points + ". There was a difference of " + (team2Points-team1Points) + " points between the two teams."
  }
  ((team1, team1Points), (team2, team2Points), pointsComparison)
}
E.g. The correct output for when the user enters 'Manchester United' and 'Manchester City' is shown below:
Team 1: Manchester United - Points: 94
Team 2: Manchester City - Points: 88
The team who finished higher is: Manchester United, their total points tally for last season was: 94. There was a difference of 6 points between the two teams.
Is there a better or functional way to do what I am currently doing for the comparison of 2 teams?
EDIT: I have edited the question based on Alex's suggestion.
"More Functional"
The one "side effect" you do have is throwing an exception for empty lists. If that is truly an exceptional case then lastPoints should return a Try[Int]. Other than that you maintain referential transparency, use pattern matching, and use recursive functions so you can't get "more functional".
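If the empty list really is an exceptional case, one possible shape for that is sketched below. It wraps the failure in a Try instead of throwing; the name lastPointsTry is only illustrative.

```scala
import scala.util.{Failure, Success, Try}

object LastPointsTry {
  // List#last throws NoSuchElementException on an empty list; Try captures it
  // as a Failure instead of letting it propagate.
  def lastPointsTry(list: List[Int]): Try[Int] = Try(list.last)

  def main(args: Array[String]): Unit =
    lastPointsTry(List(72, 86, 83, 90, 94)) match {
      case Success(points) => println(s"Last season: $points")
      case Failure(error)  => println(s"No points available: ${error.getMessage}")
    }
}
```

The caller then has to decide what a Failure means, for example by falling back to 0 with getOrElse(0).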
"Better"
You could use List#lastOption in lastPoints (assuming you don't throw exceptions) instead of writing your own:
val lastPoints = (_ : List[Int]).lastOption
Which could then be used as follows:
val getPoints =
  (team : String) => teamdata.get(team)
    .flatMap(lastPoints)
    .getOrElse(0)
val team1Points = getPoints(team1)
val team2Points = getPoints(team2)
In general, I always go searching for a collection method that solves my problem before trying to "roll my own". By relying on lastOption, flatMap, and getOrElse there are fewer places for bugs to lurk.
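Put together, the comparison could be sketched as follows. This is one possible arrangement, not a drop-in replacement: the object name is made up, teamdata is hard-coded from two rows of the question's sample instead of being read from teams.txt, and the message strings are copied from the question.

```scala
object CompareTeamsSketch {
  // Assumption: two rows of the question's sample data, hard-coded for the example.
  val teamdata: Map[String, List[Int]] = Map(
    "Manchester United" -> List(72, 86, 83, 90, 94),
    "Manchester City"   -> List(80, 84, 91, 77, 88)
  )

  // Most recent tally, or 0 for an unknown team or an empty list.
  def getPoints(team: String): Int =
    teamdata.get(team).flatMap(_.lastOption).getOrElse(0)

  // Returns the same triple shape as compareTeams in the question.
  def compareTeams(team1: String, team2: String): ((String, Int), (String, Int), String) = {
    val (p1, p2) = (getPoints(team1), getPoints(team2))
    val summary =
      if (p1 == p2) s"The teams had equal points last season: $p1"
      else {
        // Pick the winner once, so the message is written only one time.
        val (winner, top) = if (p1 > p2) (team1, p1) else (team2, p2)
        s"The team who finished higher is: $winner, their total points tally for last season was: $top. " +
          s"There was a difference of ${math.abs(p1 - p2)} points between the two teams."
      }
    ((team1, p1), (team2, p2), summary)
  }

  def main(args: Array[String]): Unit =
    println(compareTeams("Manchester United", "Manchester City")._3)
}
```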

Spark Scala Sequence Split on Integer Criteria

I am using Scala on Spark and need some help splitting a sequence of sets based on specific values within the sets.
Here is an example:
val sets = Array(Seq(Set("A", 15,20 ),Set("B", 17, 21), Set("C", 22,34)),
Seq(Set("D", 15, 20),Set("E", 17, 21), Set("F", 21, 23), Set("G", 25,34)))
I am trying to split each sequence within the array based off the criteria that the first integer within each set is between the two integers within the other sets in the same sequence and return the character value of the sets grouped together.
So for the first sequence you can see that we have the integers 15 and 20 in the first set and 17 and 21 in the second set. Those sets would be grouped together because 17 is between 15 and 20, and the third set would be left alone.
In the second sequence, 15 and 20 overlap with 17 and 21. Also, 17 and 21 overlap with 21 and 23, and then the last set would be left alone.
Essentially I would like to have it return Set(A, B), Set(C), Set(D, E), Set(D, F), Set(G)
I realize this is not great phrasing but if someone could give me a hand that would be very appreciated.
As noted by zero323, Set("A", 15,20 ) should probably not be a set. I suggest converting it to a case class:
case class Item(name: String, start: Int, end: Int) {
  val range = Range.inclusive(start, end)
}
With this class, if you described your problem correctly it could be solved like this:
sets.map { seq =>
  seq.foldLeft(Vector[Vector[Item]]()) { (list, item) =>
    list.lastOption match {
      case Some(lastGroup) if lastGroup.last.range.contains(item.start) =>
        list.init :+ (lastGroup :+ item)
      case _ =>
        list :+ Vector(item)
    }
  }.map(l => l.map(i => i.name).toSet)
}.flatten
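For reference, here is a self-contained sketch of the same fold run on the question's sample data, rewritten as Item values. The object name and the group helper are made up for this example. Note that with this chaining rule D, E and F end up in one group, which differs from the Set(D, E), Set(D, F) listed in the question.

```scala
object GroupByOverlap {
  case class Item(name: String, start: Int, end: Int) {
    val range = Range.inclusive(start, end)
  }

  // The question's sample data, rewritten with the Item case class.
  val sets: List[Seq[Item]] = List(
    Seq(Item("A", 15, 20), Item("B", 17, 21), Item("C", 22, 34)),
    Seq(Item("D", 15, 20), Item("E", 17, 21), Item("F", 21, 23), Item("G", 25, 34))
  )

  // An item joins the current group when its start falls inside the range
  // of the group's last item; otherwise it starts a new group.
  def group(seq: Seq[Item]): Vector[Set[String]] =
    seq.foldLeft(Vector.empty[Vector[Item]]) { (groups, item) =>
      groups.lastOption match {
        case Some(lastGroup) if lastGroup.last.range.contains(item.start) =>
          groups.init :+ (lastGroup :+ item)
        case _ =>
          groups :+ Vector(item)
      }
    }.map(_.map(_.name).toSet)

  def main(args: Array[String]): Unit =
    println(sets.flatMap(group)) // List(Set(A, B), Set(C), Set(D, E, F), Set(G))
}
```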

Implement a partition function using a fold in Scala

I'm new to Scala and I want to write a higher-order function (say "partition2") that takes a list of integers and a function that returns either true or false. The output would be a list of values for which the function is true and a list of values for which the function is false. I'd like to implement this using a fold. I know something like this would be a really straightforward way to do this:
val (passed, failed) = List(49, 58, 76, 82, 88, 90) partition ( _ > 60 )
I'm wondering how this same logic could be applied using a fold.
You can start by thinking about what you want your accumulator to look like. In many cases it'll have the same type as the thing you want to end up with, and that works here—you can use two lists to keep track of the elements that passed and failed. Then you just need to write the cases and add the element to the appropriate list:
List(49, 58, 76, 82, 88, 90).foldRight((List.empty[Int], List.empty[Int])) {
  case (i, (passed, failed)) if i > 60 => (i :: passed, failed)
  case (i, (passed, failed)) => (passed, i :: failed)
}
I'm using a right fold here because prepending to a list is nicer than the alternative, but you could easily rewrite it to use a left fold.
You can do this:
List(49, 58, 76, 82, 88, 90).foldLeft((Vector.empty[Int], Vector.empty[Int])){
  case ((passed, failed), x) =>
    if (x > 60) (passed :+ x, failed)
    else (passed, failed :+ x)
}
Basically you have two accumulators, and as you visit each element, you add it to the appropriate accumulator.
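Since the question asks for a higher-order function, the fold can also be wrapped so that it takes the predicate as a parameter. The following is a minimal sketch: partition2 is the name suggested in the question, while the generic signature with a curried predicate is an assumption.

```scala
object PartitionWithFold {
  // Splits xs into (elements where p holds, the rest).
  // foldRight visits elements from the right, so prepending keeps the original order.
  def partition2[A](xs: List[A])(p: A => Boolean): (List[A], List[A]) =
    xs.foldRight((List.empty[A], List.empty[A])) {
      case (x, (passed, failed)) =>
        if (p(x)) (x :: passed, failed) else (passed, x :: failed)
    }

  def main(args: Array[String]): Unit = {
    val (passed, failed) = partition2(List(49, 58, 76, 82, 88, 90))(_ > 60)
    println(passed) // List(76, 82, 88, 90)
    println(failed) // List(49, 58)
  }
}
```

Putting the predicate in its own parameter list lets the compiler infer A from the list first, so a short lambda like _ > 60 needs no type annotation.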