Get and order the biggest tuples from a list - Scala

I'm trying to order my list and get the five biggest tuples in it, which will then be printed out. Here's the code I have been working with:
import scala.io.Codec.string2codec
import scala.io.Source
import scala.reflect.io.File
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleWordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Word Count")
    val sc = new SparkContext(conf)
    val test = scala.io.Source.fromFile("/home/cloudera/Books/book1.txt").getLines
    val wordCount =
      test
        .flatMap(_.split("\\W+"))
        .foldLeft(Map.empty[String, Int]) {
          (count, word) =>
            count + (word -> (count.getOrElse(word, 0) + 1))
        }
    val formatteWordCount =
      filtered
        .map(tuple => s"${tuple._1} -> ${tuple._2}")
        .mkString("\n", "\n", "\n")
When I try to run the code, the following line gives the error:
diverging implicit expansion for type scala.math.Ordering[B]
starting with method Tuple9 in object Ordering
.sortBy(x => (x._2))
I also tried using .stableSort(k, (x, y) => x._2 < y._2), which gave the error value stableSort is not a member of String,
and .maxBy(_._2), which gave the error diverging implicit expansion for type Ordering[B]
starting with method Tuple9 in object Ordering
println(s"Final Word Count: $formatteWordCount")
}
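One hedged sketch for getting the top five (not tested against the exact code above): since wordCount is a plain Map[String, Int], converting it to a Seq and sorting on the count alone gives the compiler a concrete Ordering[Int] to work with, and take(5) then keeps the five biggest entries:
// Hypothetical sketch: the five most frequent (word, count) pairs from wordCount.
// Sorting by the negated count gives descending order.
val top5 = wordCount.toSeq.sortBy { case (_, count) => -count }.take(5)
println(top5.map { case (word, count) => s"$word -> $count" }.mkString("\n"))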

Related

Assert RDD is not sorted

I have a method called split that accepts an RDD[T] and a splitSize and returns an Array[RDD[T]].
Now, one of the test cases I write for it should verify that this function also randomly shuffles the RDD.
So I create a sorted RDD, and then see the results:
it should "randomize shuffle" in {
  val inputRDD = sc.parallelize((0 until 16))
  val result = RDDUtils.split(inputRDD, 2)
  result.foreach(rdd => {
    rdd.collect.foreach(println)
  })
  // Assert result is not sorted
}
If the results are:
0
1
2
3
..
15
Then it's not working as expected.
A good result can be something like:
11
3
9
14
...
1
6
How can I assert the output Array[RDD[T]] is not sorted?
You could try something like this:
val resultOrder = result.sortBy(....)
assert(!resultOrder.sameElements(result))
or
val resultOrder = result.sortBy(....)
assert(resultOrder.toList != result.toList)
It's important to note that the key is knowing how to sort the Array. For an integer data type it is easy, but for a complex data type you might need an implicit Ordering for it, e.g.:
implicit val ordering: Ordering[T] =
  Ordering.fromLessThan[T]((sa: T, sb: T) => sa < sb)
// OR
implicit val ordering: Ordering[MyClass] =
  Ordering.fromLessThan[MyClass]((sa: MyClass, sb: MyClass) => sa.field1 < sb.field1)
The exact code would depend on your data type.
As a full example:
package tests

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SortArrayRDD {

  val spark = SparkSession
    .builder()
    .appName("SortArrayRDD")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "SortArrayRDD")      // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext

  def main(args: Array[String]): Unit = {
    try {
      Logger.getRootLogger.setLevel(Level.ERROR)

      val arrRDD: Array[RDD[Int]] = Array(
        sc.parallelize(List(2, 3)), sc.parallelize(List(10, 11)), sc.parallelize(List(6, 7)), sc.parallelize(List(8, 9)),
        sc.parallelize(List(4, 5)), sc.parallelize(List(0, 1)), sc.parallelize(List(12, 13)), sc.parallelize(List(14, 15)))

      val aux = arrRDD

      implicit val ordering: Ordering[RDD[Int]] =
        Ordering.fromLessThan[RDD[Int]]((sa: RDD[Int], sb: RDD[Int]) => sa.sum() < sb.sum())

      aux.sorted.foreach(rdd => println(rdd.collect().mkString(",")))

      val resultOrder = aux.sorted
      assert(!resultOrder.sameElements(arrRDD))
      println("It's unordered")
    } finally {
      sc.stop()
    }
  }
}
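One caveat with the example above: the comparator sa.sum() < sb.sum() triggers a Spark job every time two RDDs are compared during the sort. A small hedged variant that computes each sum once and sorts on the cached value, with the same assertion:
// Sketch: pair each RDD with its sum once, sort by that number, then drop it again.
val withSums = arrRDD.map(rdd => (rdd, rdd.sum()))
val resultOrder = withSums.sortBy { case (_, total) => total }.map { case (rdd, _) => rdd }
assert(!resultOrder.sameElements(arrRDD))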

Change datatype in Scala Spark Streaming

In that course, in the Module 3 hands-on lab ... there's an example (Spark Fundamentals 1) that I'm using to learn Scala and Spark.
https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0211EN+2016/courseware/14ec4166bc9b4a3a9592b7960f4a5401/b0c736193c834b01b3c1c5bd4ce2d8a8/
I tried to modify the streaming part to calculate a moving average as the stream comes in. I haven't figured out how to do that yet, and right now I'm facing the problem that I don't know how to change the datatype.
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 7777)

import scala.collection.mutable.Queue
var ints = Queue[Double]()

def movingAverage(values: Queue[Double], period: Int): List[Double] = {
  val first = (values take period).sum / period
  val subtract = values map (_ / period)
  val add = subtract drop period
  val addAndSubtract = add zip subtract map Function.tupled(_ - _)
  val res = (addAndSubtract.foldLeft(first :: List.fill(period - 1)(0.0)) {
    (acc, add) => (add + acc.head) :: acc
  }).reverse
  res
}

val pass = lines.map(_.split(",")).
  map(pass => (pass(7).toDouble))

pass.getClass
// class org.apache.spark.streaming.dstream.MappedDStream

ints ++= List(pass).to[Queue]
Name: Compile Error
Message: <console>:41: error: type mismatch;
found : scala.collection.mutable.Queue[org.apache.spark.streaming.dstream.DStream[Double]]
required: scala.collection.TraversableOnce[Double]
ints ++= List(pass).to[Queue]
^
StackTrace:
val pass2 = movingAverage(ints,2)
pass2.print()
ints.dequeue
ssc.start()
ssc.awaitTermination()
How do I get the streaming data from pass into ints as a queue of doubles?
After a lot of asking around, this is what worked:
val p1 = new scala.collection.mutable.Queue[Double]
pass.foreachRDD { rdd =>
  for (item <- rdd.collect()) {
    p1 += item
    println(item + " - " + movingAverage(p1, 2).last)
  }
}
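Note that the foreachRDD body runs on the driver and rdd.collect() pulls each micro-batch back to it, which is why appending to p1 works. If it helps to see what movingAverage returns on its own, a quick local check (hypothetical sample values, no streaming involved) could look like this:
// Hypothetical check of movingAverage on a plain in-memory queue.
val sample = scala.collection.mutable.Queue(1.0, 2.0, 3.0, 4.0)
println(movingAverage(sample, 2))      // List(0.0, 1.5, 2.5, 3.5): period - 1 zeros, then the averages
println(movingAverage(sample, 2).last) // 3.5, the average of the two most recent values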

Scala: Product with Serializable does not take parameters

My objective is to read data from a CSV file and convert my RDD to a DataFrame in Scala/Spark. This is my code:
package xxx.DataScience.CompensationStudy

import org.apache.spark._
import org.apache.log4j._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object CompensationAnalysis {

  case class GetDF(profil_date: String, profil_pays: String, param_tarif2: String, param_tarif3: String, dt_titre: String, dt_langues: String,
                   dt_diplomes: String, dt_experience: String, dt_formation: String, dt_outils: String, comp_applications: String,
                   comp_interventions: String, comp_competence: String)

  def main(args: Array[String]) {

    Logger.getLogger("org").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("CompensationAnalysis ")
    val sc = new SparkContext(conf)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val lines = sc.textFile("C:/Users/../Downloads/CompensationStudy.csv").flatMap { l =>
      l.split(",") match {
        case field: Array[String] if field.size > 13 => Some(field(0), field(1), field(2), field(3), field(4), field(5), field(6), field(7), field(8), field(9), field(10), field(11), field(12))
        case field: Array[String] if field.size == 1 => Some((field(0), "default value"))
        case _ => None
      }
    }
At this stage, I get the error: Product with Serializable does not take parameters
    val summary = lines.collect().map(x => GetDF(x("profil_date"), x("profil_pays"), x("param_tarif2"), x("param_tarif3"), x("dt_titre"), x("dt_langues"), x("dt_diplomes"), x("dt_experience"), x("dt_formation"), x("dt_outils"), x("comp_applications"), x("comp_interventions"), x("comp_competence")))

    val sum_df = summary.toDF()

    df.printSchema
  }
}
Help, please?
You have several things you should improve. The most urgent problem, which causes the exception, is, as @CyrilleCorpet points out: "the three different lines in the pattern matching return values of types Some[Tuple13], Some[Tuple2] and None.type. The least upper bound is then Option[Product with Serializable], which complies with flatMap's signature (where the result should be an Iterable[T]) modulo some implicit conversion."
Basically, if you had Some[Tuple13], Some[Tuple13], and None or Some[Tuple2], Some[Tuple2], and None, you would be better off.
Also, pattern matching on types is generally a bad idea because of type erasure, and pattern matching isn't even great anyway for your situation.
So you could set default values in your case class:
case class GetDF(profile_date: String,
                 profile_pays: String = "default",
                 param_tarif2: String = "default",
                 ...
                )
Then in your lambda:
val tokens = l.split(",")
if (tokens.length > 13) {
  Some(GetDF(tokens(0), tokens(1), tokens(2), ...))
} else if (tokens.length == 1) {
  Some(GetDF(tokens(0)))
} else {
  None
}
Now in all cases you are returning Option[GetDF]. You can flatMap the RDD to get rid of all the Nones and keep only GetDF instances.
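As a rough end-to-end sketch, assuming the case class with default values above and the import sqlContext.implicits._ from the question (the field list is shortened here, so treat it as a placeholder):
// Hypothetical pipeline: parse each CSV line into Option[GetDF], drop the Nones with
// flatMap, and build a DataFrame from the surviving case class instances.
val parsed = sc.textFile("C:/Users/../Downloads/CompensationStudy.csv").flatMap { l =>
  val tokens = l.split(",")
  if (tokens.length > 13) Some(GetDF(tokens(0), tokens(1), tokens(2)))  // pass the remaining columns as needed
  else if (tokens.length == 1) Some(GetDF(tokens(0)))
  else None
}
val sum_df = parsed.toDF()
sum_df.printSchema()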

How to print a Map[String, Array[Float]] in Scala?

I am using the word2vec function from Spark's MLlib library. I want to print the word vectors that I get as output from the getVectors function.
My code looks like this:
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

object word2vec {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("word2vec")
    val sc = new SparkContext(conf)
    val input = sc.textFile("file:///home/snap-01/balance.csv").map(line => line.split(",").toSeq)
    val word2vec = new Word2Vec()
    val model = word2vec.fit(input)
    model.save(sc, "myModelPath")
    val sameModel = Word2VecModel.load(sc, "myModelPath")
    val vec = sameModel.getVectors
    print(vec)
  }
}
I am getting "Map(Balance -> [F@2932e15f)"
The [F@2932e15f is just the default toString of an Array[Float], which doesn't show its elements. Try this:
vec.foreach { case (key, values) => println("key " + key + " - " + values.mkString("-"))
}
Alternatively,
println(vec.mapValues(_.toList))
But keep an eye on the memory required to do so.
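If the model is large, a hedged middle ground is to print only a slice of it, for example:
// Print a handful of words and only the first few components of each vector,
// so the whole model is never built up as one giant string.
vec.take(5).foreach { case (word, vector) =>
  println(word + " -> " + vector.take(10).mkString(", "))
}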

sortByKey in Spark

New to Spark and Scala. Trying to sort a word counting example. My code is based on this simple example.
I want to sort the results alphabetically by key. If I add the key sort to an RDD:
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
then I get a compile error:
error: No implicit view available from java.io.Serializable => Ordered[java.io.Serializable].
[INFO] val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
I don't know what the lack of an implicit view means. Can someone tell me how to fix it? I am running the Cloudera 5 Quickstart VM. I think it bundles Spark version 0.9.
Source of the Scala job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))
    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        Array("NO NAME")
    }
    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Some (unsorted) output
("INTERNATIONAL EYELETS INC",879)
("SHAQUITA SALLEY",865)
("PAZ DURIGA",791)
("TERESSA ALCARAZ",824)
("MING CHAIX",878)
("JACKSON SHIELDS YEISER",837)
("AUDRY HULLINGER",875)
("GABRIELLE MOLANDS",802)
("TAM TACKER",775)
("HYACINTH VITELA",837)
No implicit view means there is no Scala function like this defined:
implicit def SerializableToOrdered(x: java.io.Serializable) = new Ordered[java.io.Serializable](x) // note this function doesn't work
The reason this error comes up is that your function returns two different types whose common supertype is java.io.Serializable (one branch returns a String, the other an Array[String]), and sortByKey requires an ordering on the key, which java.io.Serializable does not have. Fix it like this:
object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))
    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        "NO NAME"
    }
    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Now the function always returns a String instead of two different types, so sortByKey can find an Ordering for the key.
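As a side note, giving f an explicit return type would surface this kind of mixed-type inference at the definition site instead of at sortByKey; a small sketch:
// With an explicit return type, the branch that used to return Array("NO NAME")
// would be a compile error right here rather than an Ordering problem later.
def f(x: Array[String]): String =
  if (x.length > 3) x(3) else "NO NAME"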