Change datatype on scala Spark Streaming


I'm using an example from the Module 3 hands-on lab of the Spark Fundamentals 1 course to learn Scala and Spark.
https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0211EN+2016/courseware/14ec4166bc9b4a3a9592b7960f4a5401/b0c736193c834b01b3c1c5bd4ce2d8a8/
I tried to modify the streaming part to calculate a moving average as the stream comes in. I haven't figured out how to do that yet, and right now I'm stuck on a more basic problem: I don't know how to change the datatype.
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val ssc = new StreamingContext(sc,Seconds(1))
val lines = ssc.socketTextStream("localhost",7777)
import scala.collection.mutable.Queue
var ints = Queue[Double]()
def movingAverage(values: Queue[Double], period: Int): List[Double] = {
val first = (values take period).sum / period
val subtract = values map (_ / period)
val add = subtract drop period
val addAndSubtract = add zip subtract map Function.tupled(_ - _)
val res = (addAndSubtract.foldLeft(first :: List.fill(period - 1)(0.0)) {
(acc, add) => (add + acc.head) :: acc
}).reverse
res
}
val pass = lines.map(_.split(",")).
map(pass=>(pass(7).toDouble))
pass.getClass
class org.apache.spark.streaming.dstream.MappedDStream
ints ++= List(pass).to[Queue]
Name: Compile Error
Message: console :41: error: type mismatch;
found : scala.collection.mutable.Queue[org.apache.spark.streaming.dstream.DStream[Double]]
required: scala.collection.TraversableOnce[Double]
ints ++= List(pass).to[Queue]
^
StackTrace:
val pass2 = movingAverage(ints,2)
pass2.print()
ints.dequeue
ssc.start()
ssc.awaitTermination()
How do I get the streaming data from pass into ints as a queue of doubles?

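After a lot of asking around, this is what worked:
val p1 = new scala.collection.mutable.Queue[Double]
// The function passed to foreachRDD runs on the driver, so a driver-side queue can be updated here.
pass.foreachRDD { rdd =>
  // collect() already returns an Array[Double], so the extra toArray is not needed.
  for (item <- rdd.collect()) {
    p1 += item
    println(item + " - " + movingAverage(p1, 2).last)
  }
}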
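In the answer above, the queue p1 keeps every value ever received, and movingAverage recomputes the whole series on each item. A minimal sketch of an alternative follows, in plain Scala with no Spark dependency. The class name MovingAverage and its add method are hypothetical helpers, not part of the course lab or of Spark; they keep only the last `period` values.

```scala
import scala.collection.mutable

// Hypothetical helper (not from the original answer): keeps only the last
// `period` values, so the queue does not grow without bound.
final class MovingAverage(period: Int) {
  require(period > 0, "period must be positive")
  private val window = mutable.Queue.empty[Double]
  private var sum = 0.0

  /** Adds a value and returns the average of the most recent values (at most `period`). */
  def add(value: Double): Double = {
    window += value
    sum += value
    if (window.size > period) sum -= window.dequeue()
    sum / window.size
  }
}

val avg = new MovingAverage(2)
val results = List(1.0, 3.0, 5.0).map(avg.add)
```

Inside the foreachRDD loop, the pair `p1 += item` and `movingAverage(p1, 2).last` could then be replaced by a single `avg.add(item)` call, assuming the helper is defined on the driver.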

Related

Get and order biggest tuples from list

I'm trying to sort my list and get the five largest tuples, which will then be printed out. Here's the code I have been working with:
import scala.io.Codec.string2codec
import scala.io.Source
import scala.reflect.io.File
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleWordCount {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Simple Word Count")
val sc = new SparkContext(conf)
val test = scala.io.Source.fromFile("/home/cloudera/Books/book1.txt").getLines
val wordCount =
test
.flatMap(_.split("\\W+"))
.foldLeft(Map.empty[String, Int]) {
(count, word) =>
count + (word -> (count.getOrElse(word, 0) + 1))
}
val formatteWordCount =
filtered
.map(tuple => s"${tuple._1} -> ${tuple._2}")
.mkString("\n", "\n", "\n")
When I try to run the code, the following line gives this error:
diverging implicit expansion for type scala.math.Ordering[B]
starting with method Tuple9 in object Ordering
.sortBy(x => (x._2))
I also tried .stableSort(k, (x, y) => x._2 < y._2), which gave the error: value stableSort is not a member of String
and .maxBy(_._2), which gave the error: diverging implicit expansion for type Ordering[B]
starting with method Tuple9 in object Ordering
println(s"Final Word Count: $formatteWordCount")
}
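The question does not show where `filtered` is defined, so the following is only a sketch under the assumption that the counts live in a Map[String, Int] like the `wordCount` built by foldLeft. The sample data and the names topFive and formatted are made up for illustration. The point it shows is that the sort is applied to the collection of pairs, before the result is turned into a String with mkString.

```scala
// Assumed sample data standing in for the `wordCount` map built by foldLeft.
val wordCount: Map[String, Int] =
  Map("the" -> 12, "spark" -> 7, "scala" -> 9, "a" -> 11, "of" -> 8, "rdd" -> 3)

// Sort the (word, count) pairs by descending count and keep the first five.
val topFive: Seq[(String, Int)] =
  wordCount.toSeq.sortBy { case (_, count) => -count }.take(5)

// Only now format the pairs as text.
val formatted: String =
  topFive.map { case (word, count) => s"$word -> $count" }.mkString("\n")
```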

Scala : map Dataset[Row] to Dataset[Row]

I am trying to use Scala to transform a dataset with arrays into a dataset with a label and vectors, before feeding it into a machine learning algorithm.
So far I have managed to add a double label, but I am stuck on the vectors part. Below is the code that creates the vectors:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{DataTypes, StructField}
import org.apache.spark.sql.{Dataset, Row, _}
import spark.implicits._
def toVectors(withLabelDs: Dataset[Row]) = {
val allLabel = withLabelDs.count()
var countLabel = 0
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
println("schema line {}", line.schema)
//StructType(
// StructField(label,DoubleType,false),
// StructField(code,ArrayType(IntegerType,true),true),
// StructField(score,ArrayType(IntegerType,true),true))
val label = line.getDouble(0)
val indicesList = line.getList(1)
val indicesSize = indicesList.size
val indices = new Array[Int](indicesSize)
val valuesList = line.getList(2)
val values = new Array[Double](indicesSize)
var i = 0
while ( {
i < indicesSize
}) {
indices(i) = indicesList.get(i).asInstanceOf[Int] - 1
values(i) = valuesList.get(i).asInstanceOf[Int].toDouble
i += 1
}
var r: Row = null
try {
r = Row(label, Vectors.sparse(195, indices, values))
countLabel += 1
}
catch {
case e: IllegalArgumentException =>
println("something went wrong with label {} / indices {} / values {}", label, indices, values)
println("", e)
}
println("Still {} labels to process", allLabel - countLabel)
r
})
newDataset
}
With this code, I got this error :
Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
So naturally, I changed my code to pass an encoder:
def toVectors(withLabelDs: Dataset[Row]) = {
...
}, Encoders.bean(Row.getClass))
newDataset
}
But I got this error :
error: overloaded method value map with alternatives:
[U](func: org.apache.spark.api.java.function.MapFunction[org.apache.spark.sql.Row,U],
encoder: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
<and>
[U](func: org.apache.spark.sql.Row => U)
(implicit evidence$6: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
cannot be applied to (org.apache.spark.sql.Row => org.apache.spark.sql.Row, org.apache.spark.sql.Encoder[?0])
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
How can I make this work, i.e. return a Dataset[Row] that contains vectors?

Two things:
.map has the type (T => U)(implicit Encoder[U]) => Dataset[U], but it looks like you are calling it as if it were (T => U, implicit Encoder[U]) => Dataset[U], which is slightly different. Instead of .map(f, encoder), try .map(f)(encoder).
Also, I doubt Encoders.bean(Row.getClass) will work since Row is not a bean. Some quick googling turned up RowEncoder which looks like it should work but I couldn't find much documentation about it.
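The difference between .map(f, encoder) and .map(f)(encoder) can be seen without Spark. The sketch below uses hypothetical stand-ins (Enc for Encoder, Box for Dataset) that only mimic the shape of the method signature; they are not Spark classes.

```scala
// Hypothetical stand-ins for Spark's Encoder and Dataset, used only to show
// how a second (implicit) parameter list is passed explicitly.
trait Enc[U]

final case class Box[T](items: List[T]) {
  // Same shape as Dataset.map: the function first, the implicit encoder second.
  def map[U](f: T => U)(implicit enc: Enc[U]): Box[U] = Box(items.map(f))
}

val stringEnc: Enc[String] = new Enc[String] {}

// Works: the encoder goes in its own parameter list.
val mapped: Box[String] = Box(List(1, 2, 3)).map((i: Int) => i.toString)(stringEnc)

// Would not compile, because map takes one argument per parameter list:
// Box(List(1, 2, 3)).map((i: Int) => i.toString, stringEnc)
```

With Spark on the classpath the analogous call would be withLabelDs.map(f)(encoder). Which encoder is appropriate for a Row that contains a vector is left open here, as it is in the answer above.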

The error message is unfortunately quite poor. import spark.implicits._ is only correct in the spark-shell. What it actually means is import <SparkSession object>.implicits._; spark just happens to be the variable name used for the SparkSession object in the spark-shell.
You can access the SparkSession from a Dataset, so at the top of your method you can add the import:
def toVectors(withLabelDs: Dataset[Row]) = {
val sparkSession = withLabelDs.sparkSession
import sparkSession.implicits._
//rest of method code
}

GraphX not working properly Spark / Scala

I am trying to create a GraphX graph in Apache Spark/Scala, but it doesn't seem to be working for some reason. I have attached an example input file; the actual program code is:
package SGraph
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._
object GooglePlusGraph {
/** Our main function where the action happens */
def main(args: Array[String]) {
// Set the log level to only print errors
Logger.getLogger("org").setLevel(Level.ERROR)
// Create a SparkContext using every core of the local machine
val sc = new SparkContext("local[*]", "GooglePlusGraphX")
val lines = sc.textFile("../Example.txt")
val ratings = lines.map(x => x.toString().split(":")(0))
val verts = ratings.map(line => (line.toLong,line))
val edges = lines.flatMap(makeEdges)
val default = "Nobody"
val graph = Graph(verts, edges, default).cache()
graph.degrees.join(verts).take(10).foreach(println)
}
def makeEdges(line: String) : List[Edge[Int]] = {
import scala.collection.mutable.ListBuffer
var edges = new ListBuffer[Edge[Int]]()
val fields = line.split(",").flatMap(a => a.split(":"))
val origin = fields(0)
for (x <- 1 to (fields.length - 1)) {
// Our attribute field is unused, but in other graphs could
// be used to keep track of physical distances etc.
edges += Edge(origin.toLong, fields(x).toLong, 0)
}
return edges.toList
}
}
The first error I get is the following:
16/12/19 01:28:33 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 3)
java.lang.NumberFormatException: For input string: "935750800736168978117"
Thanks for any help!

This is the same issue as in the following question:
Cannot convert string to a long in scala
The given number has 21 digits, which is more than the maximum number of digits a Long can hold (19).
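The digit counts can be checked in plain Scala. This is a small sketch, with the id copied from the error message above; the value names are made up for illustration.

```scala
import scala.util.Try

val rawId = "935750800736168978117" // the id from the error message

// Long.MaxValue is 9223372036854775807, which has 19 digits.
val maxLongDigits: Int = Long.MaxValue.toString.length

// toLong throws NumberFormatException for this input, so Try captures a failure.
val asLong: Try[Long] = Try(rawId.toLong)

// BigInt has no fixed upper bound and parses the same string without trouble.
val asBigInt: BigInt = BigInt(rawId)
```

GraphX vertex ids are Longs, so BigInt cannot be used as a vertex id directly. One possible workaround, not taken from the answer, is to give each distinct id string its own Long index (for example with zipWithIndex) and keep the original string as the vertex attribute.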

Spark Code Optimization

My task is to write code that reads a big file (one that doesn't fit into memory), reverses it, and outputs the five most frequent words.
I have written the code below and it does the job.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object ReverseFile {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Reverse File")
conf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
val txtFile = "path/README_mid.md"
val txtData = sc.textFile(txtFile)
txtData.cache()
val tmp = txtData.map(l => l.reverse).zipWithIndex().map{ case(x,y) => (y,x)}.sortByKey(ascending = false).map{ case(u,v) => v}
tmp.coalesce(1,true).saveAsTextFile("path/out.md")
val txtOut = "path/out.md"
val txtOutData = sc.textFile(txtOut)
txtOutData.cache()
val wcData = txtOutData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).map(item => item.swap).sortByKey(ascending = false)
wcData.collect().take(5).foreach(println)
}
}
The problem is that I'm new to Spark and Scala. As you can see in the code, I first read the file, reverse it, and save it, then read the reversed file back and output the five most frequent words.
Is there a way to tell Spark to save tmp and compute wcData at the same time (without having to save and reopen the file)? Otherwise it's like reading the file twice.
I'm going to be working with Spark a lot from now on, so if there is any Spark-specific part of the code (not things like the absolute path name) that you think could be written better, I'd appreciate hearing about it.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object ReverseFile {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Reverse File")
conf.set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)
val txtFile = "path/README_mid.md"
val txtData = sc.textFile(txtFile)
txtData.cache()
val reversed = txtData
.zipWithIndex()
.map(_.swap)
.sortByKey(ascending = false)
.map(_._2) // No need to deconstruct the tuple.
// No need for the coalesce, spark should do that by itself.
reversed.saveAsTextFile("path/reversed.md")
// Reuse txtData here.
val wcData = txtData
.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.map(_.swap)
.sortByKey(ascending = false)
wcData
.take(5) // Take already collects.
.foreach(println)
}
}
Always do the collect() last, so Spark can evaluate things on the cluster.

The most expensive part of your code is sorting, so the obvious improvement is to remove it. That is relatively simple in the second case, where a full sort is completely unnecessary:
val wcData = txtData
.flatMap(_.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _) // No need to swap or sort
// Use the top method with an explicit ordering in place of swap / sortByKey
val top5 = wcData.top(5)(scala.math.Ordering.by[(String, Int), Int](_._2))
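To see why selecting the top five does not need a full sort, here is a minimal sketch in plain Scala of the general idea behind a top-k selection: keep at most k elements in a min-heap and discard the smallest whenever the heap grows past k. The function topK and the sample data are hypothetical and only illustrate the technique; they are not Spark's implementation.

```scala
import scala.collection.mutable

// Hypothetical helper illustrating a top-k selection:
// keep at most k elements in a min-heap instead of sorting everything.
def topK(counts: Iterator[(String, Int)], k: Int): List[(String, Int)] = {
  // Reversing the ordering makes the smallest count the first to be dequeued.
  val minHeap = mutable.PriorityQueue.empty[(String, Int)](
    Ordering.by[(String, Int), Int](_._2).reverse)
  counts.foreach { pair =>
    minHeap += pair
    if (minHeap.size > k) minHeap.dequeue() // drop the current smallest
  }
  // Only the k survivors are sorted, largest count first.
  minHeap.toList.sortBy(-_._2)
}

val sample = Iterator("a" -> 4, "b" -> 9, "c" -> 1, "d" -> 7, "e" -> 3, "f" -> 8)
val top3 = topK(sample, 3)
```

Each element is compared against a heap of size k, so the work grows with n log k rather than n log n, and nothing needs to be shuffled just to order elements that will be thrown away.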
Reversing the order of lines is a little bit trickier. First, let's reverse the elements within each partition:
val reversedPartitions = txtData.mapPartitions(_.toList.reverse.toIterator)
Now you have two options.
The first is to use a custom partitioner:
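import org.apache.spark.Partitioner
// Sends the data keyed by partition index k to partition (n - 1 - k).
class ReversePartitioner(n: Int) extends Partitioner {
def numPartitions: Int = n
def getPartition(key: Any): Int = {
val k = key.asInstanceOf[Int]
numPartitions - 1 - k
}
}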
val partitioner = new ReversePartitioner(reversedPartitions.partitions.size)
val reversed = reversedPartitions
// Add current partition number
.mapPartitionsWithIndex((i, iter) => Iterator((i, iter.toList)))
// Repartition to get reversed order
.partitionBy(partitioner)
// Drop partition numbers
.values
// Reshape
.flatMap(identity)
It still requires shuffling but it is relatively portable and data is still accessible in memory.
The second option applies if all you want is to save the reversed data: call saveAsTextFile on reversedPartitions and reorder the output files logically. Since the part-n name format identifies the source partition, all you have to do is rename part-n to part-(number-of-partitions - 1 - n). It requires saving the data, so it is not exactly optimal, but if you use, for example, an in-memory file system it can be a pretty good solution.
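Both options rely on the same observation: reversing the elements inside each partition and then reversing the order of the partitions reverses the whole dataset. A small sketch in plain Scala, with nested Seqs standing in for partitions (the toy data is made up):

```scala
// Assumed toy data: three "partitions" of an ordered file.
val partitions: Seq[Seq[String]] =
  Seq(Seq("line1", "line2"), Seq("line3", "line4"), Seq("line5"))

// Step 1: reverse the elements inside each partition.
val reversedWithin: Seq[Seq[String]] = partitions.map(_.reverse)

// Step 2: reverse the order of the partitions themselves
// (the role played by ReversePartitioner or by renaming part-n files).
val reversedAll: Seq[String] = reversedWithin.reverse.flatten
```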

sortByKey in Spark

New to Spark and Scala. Trying to sort a word counting example. My code is based on this simple example.
I want to sort the results alphabetically by key. If I add the key sort to an RDD:
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
then I get a compile error:
error: No implicit view available from java.io.Serializable => Ordered[java.io.Serializable].
[INFO] val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
I don't know what the lack of an implicit view means. Can someone tell me how to fix it? I am running the Cloudera 5 Quickstart VM. I think it bundles Spark version 0.9.
Source of the Scala job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SparkWordCount {
def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
val files = sc.textFile(args(0)).map(_.split(","))
def f(x:Array[String]) = {
if (x.length > 3)
x(3)
else
Array("NO NAME")
}
val names = files.map(f)
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
System.out.println(wordCounts.collect().mkString("\n"))
}
}
Some (unsorted) output
("INTERNATIONAL EYELETS INC",879)
("SHAQUITA SALLEY",865)
("PAZ DURIGA",791)
("TERESSA ALCARAZ",824)
("MING CHAIX",878)
("JACKSON SHIELDS YEISER",837)
("AUDRY HULLINGER",875)
("GABRIELLE MOLANDS",802)
("TAM TACKER",775)
("HYACINTH VITELA",837)

No implicit view means that no Scala function like this is defined:
implicit def SerializableToOrdered(x :java.io.Serializable) = new Ordered[java.io.Serializable](x) //note this function doesn't work
The error appears because your function returns two different types (one is a String, the other an Array[String]), so the inferred result type is their common supertype, java.io.Serializable. sortByKey requires the key type to be ordered, and no ordering exists for java.io.Serializable. Fix it like this:
object SparkWordCount {
def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
val files = sc.textFile(args(0)).map(_.split(","))
def f(x:Array[String]) = {
if (x.length > 3)
x(3)
else
"NO NAME"
}
val names = files.map(f)
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
System.out.println(wordCounts.collect().mkString("\n"))
}
}
Now the function returns only Strings instead of two different types.
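The effect of the fix can be checked in plain Scala, without Spark. In this sketch the result type of f is written out explicitly so the compiler would reject a branch that returns anything other than a String; the sample rows are made up.

```scala
// Both branches return a String, so the declared result type String is satisfied.
def f(x: Array[String]): String =
  if (x.length > 3) x(3) else "NO NAME"

// Assumed sample rows standing in for the split lines of the input file.
val rows: Seq[Array[String]] = Seq(
  Array("1", "2", "3", "ZED"),
  Array("too", "short"),
  Array("1", "2", "3", "ALPHA"))

// String has an implicit Ordering, so sorting the keys compiles and works.
val sortedNames: Seq[String] = rows.map(f).sorted
```

Annotating the return type is a cheap way to catch this class of mistake early: with `: String` in place, returning Array("NO NAME") from the else branch would fail at the definition of f rather than at the later sortByKey call.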
