How to set type of dataset when applying transformations and how to implement transformations without using spark.sql.functions._? - scala

I am a beginner for Scala and have been working on the following problem:
Example dataset named as given_dataset with player number and points scored
|player_no| |points|
1 25.0
1 20.0
1 21.0
2 15.0
2 18.0
3 24.0
3 25.0
3 29.0
Problem 1:
I have a dataset and need to calculate total points scored, average points per game, and number of games played. I am unable to explicitly set the data type to "double", "int", "float", when I apply the transformations. (Perhaps this is because they are untyped transformations?) Would anyone be able to help on this and correct my error?
No data type specified (but code is able to run)
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").orderBy("player_no")
Please note that I would like to retain the player number as I plan to merge total_points_dataset, games_played_dataset, and avg_points_dataset together.
Data type specified, but code crashes!
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[Double].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[Int].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[Double].orderBy("player_no")
Problem 2:
I would like to implement the above without using the library spark.sql.functions e.g. through functions such as map, groupByKey etc. If possible, could anyone provide an example for this and point me towards the right direction?

If you don't want to use import org.apache.spark.sql.types.{FloatType, IntegerType, StructType} then you have to cast it either at the time of reading or using as[(Int, Double)] in the dataset. Below is the example while reading from CSV file for your dataset:
/** A function that splits a line of input into (player_no, points) tuples. */
def parseLine(line: String): (Int, Float) = {
// Split by commas
val fields = line.split(",")
// Extract the player_no and points fields, and convert to integer & float
val player_no = fields(0).toInt
val points = fields(1).toFloat
// Create a tuple that is our result.
(player_no, points)
}
And then read as below:
val sc = new SparkContext("local[*]", "StackOverflow75354293")
val lines = sc.textFile("data/stackoverflowdata-noheader.csv")
val dataset = lines.map(parseLine)
val total_points_dataset2 = dataset.reduceByKey((x, y) => x + y)
val total_points_dataset2_sorted = total_points_dataset2.sortByKey(ascending = true)
total_points_dataset2_sorted.foreach(println)
val games_played_dataset2 = dataset.countByKey().toList.sorted
games_played_dataset2.foreach(println)
val avg_points_dataset2 =
dataset
.mapValues(x => (x, 1))
.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
.mapValues(x => x._1 / x._2)
.sortByKey(ascending = true)
avg_points_dataset2.collect().foreach(println)
I locally tried running both ways and both are working fine, we can check the below output also:
(3,78.0)
(1,66.0)
(2,33.0)
(1,3)
(2,2)
(3,3)
(1,22.0)
(2,16.5)
(3,26.0)
For details you can see it on mysql page

Regarding "Problem 1" try
val total_points_dataset = given_dataset.groupBy($"player_no").sum("points").as[(Int, Double)].orderBy("player_no")
val games_played_dataset = given_dataset.groupBy($"player_no").count().as[(Int, Long)].orderBy("player_no")
val avg_points_dataset = given_dataset.groupBy($"player_no").avg("points").as[(Int, Double)].orderBy("player_no")

Related

How to filter an rdd by data type?

I have an rdd that i am trying to filter for only float type. Do Spark rdds provide any way of doing this?
I have a csv where I need only float values greater than 40 into a new rdd. To achieve this, i am checking if it is an instance of type float and filtering them. When I filter with a !, all the strings are still there in the output and when i dont use !, the output is empty.
val airports1 = airports.filter(line => !line.split(",")(6).isInstanceOf[Float])
val airports2 = airports1.filter(line => line.split(",")(6).toFloat > 40)
At the .toFloat , i run into NumberFormatException which I've tried to handle in a try catch block.
Since you have a plain string and you are trying to get float values from it, you are not actually filtering by type. But, if they can be parsed to float instead.
You can accomplish that using a flatMap together with Option.
import org.apache.spark.sql.SparkSession
import scala.util.Try
val spark = SparkSession.builder.master("local[*]").appName("Float caster").getOrCreate()
val sc = spark.sparkContext
val data = List("x,10", "y,3.3", "z,a")
val rdd = sc.parallelize(data) // rdd: RDD[String]
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption) // filtered: RDD[Float]
filtered.collect() // res0: Array[Float] = Array(10.0, 3.3)
For the > 40 part you can either, perform another filter after or filter the inner Option.
(Both should perform more or less equals due spark laziness, thus choose the one is more clear for you).
// Option 1 - Another filter.
val filtered2 = filtered.filter(x => x > 40)
// Option 2 - Filter the inner option in one step.
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption.filter(x => x > 40))
Let me know if you have any question.

Compare the similarity between two text in Scala

I want to compare two texts in Scala and calculate the similarity rate. I begin to code this but I'm blocked :
import org.apache.spark._
import org.apache.spark.SparkContext._
object WordCount {
def main(args: Array[String]):Unit = {
val white = "/whiteCat.txt" // "The white cat is eating a white soup"
val black = "/blackCat.txt" // "The black cat is eating a white sandwich"
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
val b = sc.textFile(white)
val words = b.flatMap(line => line.split("\\W+"))
val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
counts.take(10).foreach(println)
//counts.saveAsTextFile(outputFile)
}
}
I succeeded to split the words of each text and count the occurency of each word. For example in the file1 there is :
(The,1)
(white,2)
(cat,1)
(is,1)
(eating,1)
(a,1)
(soup,1)
To calculate the similarity rate. I have to do this algorithm but I'm not experienced with Scala
i=0
foreach word in the first text
j = 0
IF keyFile1[i] == keyFile2[j]
THEN MIN(valueFile1[i], valueFile2[j]) / MAX(valueFile1[i], valueFile2[j])
ELSE j++
i++
Can you help me please ?
You can use leftOuterJoin to join the two key/value-pair RDDs to generate a RDD of type Array[(String, (Int, Option[Int]))], gather both counts from the Tuples, flatten the counts to type Int, and apply your min/max formula, as in the following example:
val wordCountsWhite = sc.textFile("/path/to/whitecat.txt").
flatMap(_.split("\\W+")).
map((_, 1)).
reduceByKey(_ + _)
wordCountsWhite.collect
// res1: Array[(String, Int)] = Array(
// (is,1), (eating,1), (cat,1), (white,2), (The,1), (soup,1), (a,1)
// )
val wordCountsBlack = sc.textFile("/path/to/blackcat.txt").
flatMap(_.split("\\W+")).
map((_, 1)).
reduceByKey(_ + _)
wordCountsBlack.collect
// res2: Array[(String, Int)] = Array(
// (is,1), (eating,1), (cat,1), (white,1), (The,1), (a,1), (sandwich,1), (black,1)
// )
val similarityRDD = wordCountsWhite.leftOuterJoin(wordCountsBlack).map{
case (k: String, (c1: Int, c2: Option[Int])) => {
val counts = Seq(Some(c1), c2).flatten
(k, counts.min.toDouble / counts.max )
}
}
similarityRDD.collect
// res4: Array[(String, Double)] = Array(
// (is,1.0), (eating,1.0), (cat,1.0), (white,0.5), (The,1.0), (soup,1.0), (a,1.0)
// )
That seems straight forward using for comprehension
for( a <- counts1; b <- counts2 if a._1==b._1 ) yield Math.min(a._2,b._2)/Math.max(a._2,b._2)
Edit:
sorry, the above code does not work.
Here's a modified code with for comprehension. counts1 and counts2 are 2 counts from the question.
val result= for( (t1,t2) <- counts1.cartesian(counts2) if( t1._1==t2._1)) yield Math.min(t1._2,t2._2).toDouble/Math.max(t1._2,t2._2).toDouble
the result:
result.foreach(println)
1.0
0.5
1.0
1.0
1.0
There are numerous algorithms to find similarity between strings. One of these methods is edit distance. There are different definitions of edit distance and there are different sets of operations based on the methodology. But main idea is finding minimum series of operations (insertion, deletion, substitution) to convert string a into string b.
Levenshtein distance and Longest Common Subsequence are widely known algorithms to find similarity between strings. But these methods are insensitive to contexts. Because of this situation, you may want to take a look at this article in which n-gram similarity and distance are represented. You can also find Scala implementations of these algorithms in github or rosetta code.
I hope it helps!

Join per line two different RDDs in just one - Scala

I'm programming a K-means algorithm in Spark-Scala.
My model predicts in which cluster is each point.
Data
-6.59 -44.68
-35.73 39.93
47.54 -52.04
23.78 46.82
....
Load the data
val data = sc.textFile("/home/borja/flink/kmeans/points")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
Cluster the data into two classes using KMeans
val numClusters = 10
val numIterations = 100
val clusters = KMeans.train(parsedData, numClusters, numIterations)
Predict
val prediction = clusters.predict(parsedData)
However, I need to put the result and the points in the same file, in the next format:
[no title, numberOfCluster (1,2,3,..10), pointX, pointY]:
6 -6.59 -44.68
8 -35.73 39.93
10 47.54 -52.04
7 23.78 46.82
This is the entry of this executable in Python to print really nice the result.
But my best effort has got just this:
(you can check the first numbers are wrong: 68, 384, ...)
var i = 0
val c = sc.parallelize(data.collect().map(x => {
val tuple = (i, x)
i += 1
tuple
}))
i = 0
val c2 = sc.parallelize(prediction.collect().map(x => {
val tuple = (i, x)
i += 1
tuple
}))
val result = c.join(c2)
result.take(5)
Result:
res94: Array[(Int, (String, Int))] = Array((68,(17.79 13.69,0)), (384,(-33.47 -4.87,8)), (440,(-4.75 -42.21,1)), (4,(-33.31 -13.11,6)), (324,(-39.04 -16.68,6)))
Thanks for your help! :)
I don't have a spark cluster handy to test, but something like this should work:
val result = parsedData.map { v =>
val cluster = clusters.predict(v)
s"$cluster ${v(0)} ${v(1)}"
}
result.saveAsTextFile("/some/output/path")

Finding the average of a data set using Apache Spark

I am learning how to use Apache Spark and I am trying to get the average temperature from each hour from a data set. The data set that I am trying to use is from weather information stored in a csv. I am having trouble finding how to first read in the csv file and then calculating the average temperature for each hour.
From the spark documentation I am using the example Scala line to read in a file.
val textFile = sc.textFile("README.md")
I have given the link for the data file below. I am using the file called JCMB_2014.csv as it is the latest one with all months covered.
Weather Data
Edit:
The code I have tried so far is:
class SimpleCSVHeader(header:Array[String]) extends Serializable {
val index = header.zipWithIndex.toMap
def apply(array:Array[String], key:String):String = array(index(key))
}
val csv = sc.textFile("JCMB_2014.csv")
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header
val header = new SimpleCSVHeader(data.take(1)(0))
val rows = data.filter(line => header(line,"date-time") != "date-time")
val users = rows.map(row => header(row,"date-time")
val usersByHits = rows.map(row => header(row,"date-time") -> header(row,"surface temperature (C)").toInt)
Here is sample code for calculating averages on hourly basis
Step1:Read file, Filter header,extract time and temp columns
scala> val hourlyTemps = lines.map(line=>line.split(",")).filter(entries=>(!"time".equals(entries(3)))).map(entries=>(entries(3).toInt/60,(entries(8).toFloat,1)))
scala> hourlyTemps.take(1)
res25: Array[(Int, (Float, Int))] = Array((9,(10.23,1)))
(time/60) discards minutes and keeps only hours
Step2:Aggregate temperatures and no of occurrences
scala> val aggregateTemps=hourlyTemps.reduceByKey((a,b)=>(a._1+b._1,a._2+b._2))
scala> aggreateTemps.take(1)
res26: Array[(Int, (Double, Int))] = Array((34,(8565.25,620)))
Step2:Calculate Averages using total and no of occurrences
Find the final result below.
val avgTemps=aggregateTemps.map(tuple=>(tuple._1,tuple._2._1/tuple._2._2))
scala> avgTemps.collect
res28: Array[(Int, Float)] = Array((34,13.814922), (4,11.743354), (16,14.227251), (22,15.770312), (28,15.5324545), (30,15.167026), (14,13.177828), (32,14.659948), (36,12.865237), (0,11.994799), (24,15.662579), (40,12.040322), (6,11.398838), (8,11.141323), (12,12.004652), (38,12.329914), (18,15.020147), (20,15.358524), (26,15.631921), (10,11.192643), (2,11.848178), (13,12.616284), (19,15.198371), (39,12.107664), (15,13.706351), (21,15.612191), (25,15.627121), (29,15.432097), (11,11.541124), (35,13.317129), (27,15.602408), (33,14.220147), (37,12.644306), (23,15.83412), (1,11.872819), (17,14.595772), (3,11.78971), (7,11.248139), (9,11.049844), (31,14.901464), (5,11.59693))
You may want to provide Structure definition of your CSV file and convert your RDD to DataFrame, like described in the documentation. Dataframes provide a whole set of useful predefined statistic functions as well as the possibility to write some simple custom functions. You then will be able to compute the average with:
dataFrame.groupBy(<your columns here>).agg(avg(<column to compute average>)

Writing output of the Principal Components Analysis to text file

I have performed a Principal Component Analysis on a matrix I previously loaded with sc.textFile. The output being a org.apache.spark.mllib.linalg.Matrix I then converted it to a RDD[Vector[Double]].
with:
import java.io.PrintWriter
I did:
val pw = new PrintWriter("Matrix.csv")
rows3.collect().foreach(line => pw.println(line))
pw.flush
The output csv is promising. the only problem is that each line is a DenseVector(some values). How do I split each line into the corresponding coefficients?
Thanks a lot
You can use results of the computePrincipalComponents and breeze.linalg.csvwrite:
import java.io.File
import breeze.linalg.{DenseMatrix => BDM, csvwrite}
val mat: RowMatrix = ...
val pca = mat.computePrincipalComponents(...)
csvwrite(
new File("Matrix.csv"),
new BDM[Double](mat.numRows, mat.numCols, mat.toArray))
convert each vector to a string (you can do it either on the driver or the executers)
val pw = new PrintWriter("Matrix.csv")
rows3.map(_.mkString(",")).collect().foreach(line => pw.println(line))
pw.flush
edit:
if your data is too big to fit in the memory of the driver, you can try something like that:
val rdd = rows3.map(_.mkString(",")).zipWithIndex.cache
val total = rdd.count
val step = 10000 //rows in each chunk
val range = 0 to total by step
val limits = ranges.zip(range.drop(1))
limits.foreach { case(start, end) =>
rdd.filter(x => x._2 >= start && x._2 < end)
.map(_._1)
.collect
.foreach(pw.println(_))
}
I can't try this out, but that is the general idea