How to round decimals in Scala Spark

I have a large (~1 million rows) Scala Spark DataFrame with the following data:
id,score
1,0.956
2,0.977
3,0.855
4,0.866
...
How do I discretise/round the scores to the nearest 0.05?
Expected result:
id,score
1,0.95
2,1.00
3,0.85
4,0.85
...
I would like to avoid using a UDF, to maximise performance.

The answer can be simpler:
dataframe.withColumn("rounded_score", round(col("score"), 2))
There is a built-in method
def round(e: Column, scale: Int)
which rounds the value of e to scale decimal places using the HALF_UP rounding mode.
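For example, HALF_UP means that halfway values round away from zero (a small sketch, assuming a SparkSession named spark):
import org.apache.spark.sql.functions.{lit, round}

// prints 3.0: HALF_UP rounds 2.5 up, whereas bround (HALF_EVEN) would give 2.0
spark.range(1).select(round(lit(2.5), 0)).show()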

You can do it using Spark built-in functions, like so:
dataframe.withColumn("rounded_score", round(col("score") * 100 / 5) * 5 / 100)
Multiply the score by 100 so that the precision you want (hundredths) becomes a whole number.
Then divide that number by 5 and round, which snaps it to a whole number of 5-steps.
Multiply it by 5 again, so the value is back to a multiple of 5.
Finally, divide by 100 to get the precision correct again.
Result:
+---+-----+-------------+
| id|score|rounded_score|
+---+-----+-------------+
| 1|0.956| 0.95|
| 2|0.977| 1.0|
| 3|0.855| 0.85|
| 4|0.866| 0.85|
+---+-----+-------------+
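For completeness, here is a self-contained sketch of that approach with the required imports, using the sample data from the question (assumes a SparkSession named spark):
import org.apache.spark.sql.functions.{col, round}
import spark.implicits._

val dataframe = Seq((1, 0.956), (2, 0.977), (3, 0.855), (4, 0.866)).toDF("id", "score")

dataframe
  .withColumn("rounded_score", round(col("score") * 100 / 5) * 5 / 100)  // scale up, snap to a multiple of 5, scale back
  .show()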

You can specify your schema when converting into a DataFrame.
Example:
Use DecimalType(10, 2) for the column in your customSchema when loading the data.
id,score
1,0.956
2,0.977
3,0.855
4,0.866
...
import org.apache.spark.sql.types._
val mySchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("score", DecimalType(10, 2), true)
))

spark.read.format("csv")
  .schema(mySchema)
  .option("header", "true")
  .option("nullValue", "?")
  .load("/path/to/csvfile")
  .show()

Related

How to make the elements of a list lower case?

I have a df in which one of the columns is a set of words. How can I make them lower case in an efficient way?
The df has many columns, but the column that I am trying to make lower case is like this:
B
['Summer','Air Bus','Got']
['Parmin','Home']
Note:
In pandas I do df['B'].str.lower()
If I understood you correctly, you have a column that is an array of strings.
To lowercase the strings, you can use the lower function like this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
data = [
{"B": ["Summer", "Air Bus", "Got"]},
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df = df.withColumn("result", F.expr("transform(B, x -> lower(x))"))
Result:
+----------------------+----------------------+
|B |result |
+----------------------+----------------------+
|[Summer, Air Bus, Got]|[summer, air bus, got]|
+----------------------+----------------------+
A slight variation on @vladsiv's answer, which tries to answer a question in the comments about passing a dynamic column name.
# set the column name dynamically
m = "B"
# use F.transform directly (available in the Python API since Spark 3.1), rather than inside F.expr
df = df.withColumn("result", F.transform(F.col(m), lambda x: F.lower(x)))

Losing precision when moving to Spark for big decimals

Below is the sample test code and its output. I see that Java BigDecimal stores all the digits, whereas Scala BigDecimal loses precision and does some rounding off, and the same is happening with Spark. Is there a way to set the precision, or to say never round off? I do not want to truncate or round off in any case.
val sc = sparkSession
import java.math.BigDecimal
import sc.implicits._
val bigNum : BigDecimal = new BigDecimal(0.02498934809987987982348902384928349)
val convertedNum: scala.math.BigDecimal = scala.math.BigDecimal(bigNum)
val scalaBigNum: scala.math.BigDecimal = scala.math.BigDecimal(0.02498934809987987982348902384928349)
println("Big num in java" + bigNum)
println("Converted " + convertedNum)
println("Big num in scala " + scalaBigNum)
val ds = List(scalaBigNum).toDS()
println(ds.head)
println(ds.toDF.head)
Output
Big num in java0.0249893480998798801773208566601169877685606479644775390625
Converted 0.0249893480998798801773208566601169877685606479644775390625
Big num in scala 0.02498934809987988
0.024989348099879880
[0.024989348099879880]
Based on spark.apache.org/docs:
The precision can be up to 38, and the scale can also be up to 38 (less than or equal to the precision). The default precision and scale is (10, 0).
See also: https://www.scala-lang.org/api/2.12.5/scala/math/BigDecimal.html
But if you want a simple workaround, convert the value to a String before converting to a DF or DS, in order to keep the precise value.
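A minimal sketch of that String workaround (assumes a SparkSession available as spark):
import spark.implicits._

// Keeping the value as a String preserves every digit, since no numeric type is involved
val preciseStr = "0.0249893480998798801773208566601169877685606479644775390625"
val ds = List(preciseStr).toDS()
ds.show(truncate = false)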

How to convert a mllib matrix to a spark dataframe?

I want to pretty print the result of a correlation in a zeppelin notebook:
val Row(coeff: Matrix) = Correlation.corr(data, "features").head
One of the ways to achieve this is to convert the result into a DataFrame with each value in a separate column and call z.show().
However, looking into the Matrix API I don't see any way to do this.
Is there another straight forward way to achieve this?
Edit:
The dataframe has 50 columns. Just converting to a string would not help, as the output gets truncated.
Using the toString method should be the easiest and fastest way if you simply want to print the matrix. You can change the output by passing the maximum number of lines to print, as well as the maximum line width. You can change the formatting by splitting on new lines and ",". For example:
import org.apache.spark.ml.linalg.Matrices  // assuming the ml.linalg Matrix that Correlation.corr returns
val matrix = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
matrix.toString
.split("\n")
.map(_.trim.split(" ").filter(_ != "").mkString("[", ",", "]"))
.mkString("\n")
which will give the following:
[1.0,3.0,5.0]
[2.0,4.0,6.0]
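The overload mentioned above takes the maximum number of lines and the maximum line width, e.g.:
matrix.toString(2, 80)  // print at most 2 lines, each at most 80 characters wide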
However, if you want to convert the matrix to a DataFrame, the easiest way would be to first create an RDD and then use toDF().
import spark.implicits._  // needed for toDF and the $-column syntax below
val matrixRows = matrix.rowIter.toSeq.map(_.toArray)
val df = spark.sparkContext.parallelize(matrixRows).toDF("Row")
Then to put each value in a separate column you can do the following
val numOfCols = matrixRows.head.length
val df2 = (0 until numOfCols).foldLeft(df)((df, num) =>
df.withColumn("Col" + num, $"Row".getItem(num)))
.drop("Row")
df2.show(false)
Result using the example data:
+----+----+----+
|Col0|Col1|Col2|
+----+----+----+
|1.0 |3.0 |5.0 |
|2.0 |4.0 |6.0 |
+----+----+----+

How to generate a DataFrame with random content and N rows?

How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in range (1, 100)?
I know how to create a DataFrame manually, but I cannot automate it:
val df = sc.parallelize(Seq((1,20, 40), (60, 10, 80), (30, 15, 30))).toDF("col1", "col2", "col3")
Generating the data locally and then parallelizing it is totally fine, especially if you don't have to generate a lot of data.
However, should you ever need to generate a huge dataset, you can always implement an RDD that does this for you in parallel, as in the following example.
import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
// Each random partition will hold `numValues` items
final class RandomPartition[A: ClassTag](val index: Int, numValues: Int, random: => A) extends Partition {
def values: Iterator[A] = Iterator.fill(numValues)(random)
}
// The RDD will parallelize the workload across `numSlices`
final class RandomRDD[A: ClassTag](@transient private val sc: SparkContext, numSlices: Int, numValues: Int, random: => A) extends RDD[A](sc, deps = Seq.empty) {
// Based on the item and executor count, determine how many values are
// computed in each executor. Distribute the rest evenly (if any).
private val valuesPerSlice = numValues / numSlices
private val slicesWithExtraItem = numValues % numSlices
// Just ask the partition for the data
override def compute(split: Partition, context: TaskContext): Iterator[A] =
split.asInstanceOf[RandomPartition[A]].values
// Generate the partitions so that the load is as evenly spread as possible
// e.g. 10 partitions and 22 items -> 2 slices with 3 items and 8 slices with 2
override protected def getPartitions: Array[Partition] =
((0 until slicesWithExtraItem).view.map(new RandomPartition[A](_, valuesPerSlice + 1, random)) ++
(slicesWithExtraItem until numSlices).view.map(new RandomPartition[A](_, valuesPerSlice, random))).toArray
}
Once you have this, you can use it by passing your own random data generator to get an RDD[Int]:
val rdd = new RandomRDD(spark.sparkContext, 10, 22, scala.util.Random.nextInt(100) + 1)
rdd.foreach(println)
/*
* outputs:
* 30
* 86
* 75
* 20
* ...
*/
or an RDD[(Int, Int, Int)]
def rand = scala.util.Random.nextInt(100) + 1
val rdd = new RandomRDD(spark.sparkContext, 10, 22, (rand, rand, rand))
rdd.foreach(println)
/*
* outputs:
* (33,22,15)
* (65,24,64)
* (41,81,44)
* (58,7,18)
* ...
*/
and of course you can wrap it in a DataFrame very easily as well:
spark.createDataFrame(rdd).show()
/*
* outputs:
* +---+---+---+
* | _1| _2| _3|
* +---+---+---+
* |100| 48| 92|
* | 34| 40| 30|
* | 98| 63| 61|
* | 95| 17| 63|
* | 68| 31| 34|
* .............
*/
Notice how in this case the generated data is different every time the RDD/DataFrame is acted upon. By changing the implementation of RandomPartition to actually store the values instead of generating them on the fly, you can have a stable set of random items, while still retaining the flexibility and scalability of this approach.
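One way to read that suggestion, as a rough sketch (the class name is made up, and note that in this naive version the values are materialised when the partitions are created on the driver):
import scala.reflect.ClassTag
import org.apache.spark.Partition

// Same shape as RandomPartition above, but the values are generated once,
// stored in the partition object, and simply replayed by `values`
final class StoredRandomPartition[A: ClassTag](val index: Int, numValues: Int, random: => A) extends Partition {
  private val stored: Vector[A] = Vector.fill(numValues)(random)
  def values: Iterator[A] = stored.iterator
}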
One nice property of the stateless approach is that you can generate a huge dataset even locally. The following ran in a few seconds on my laptop:
new RandomRDD(spark.sparkContext, 10, Int.MaxValue, 42).count
// returns: 2147483647
Here you go, Seq.fill is your friend:
def randomInt1to100 = scala.util.Random.nextInt(100)+1
val df = sc.parallelize(
Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
).toDF("col1", "col2", "col3")
You can simply use scala.util.Random to generate random numbers within the range, loop for 100 rows, and finally use the createDataFrame API:
import scala.util.Random
val data = 1 to 100 map(x => (1+Random.nextInt(100), 1+Random.nextInt(100), 1+Random.nextInt(100)))
sqlContext.createDataFrame(data).toDF("col1", "col2", "col3").show(false)
You can use the generic code below:
import scala.util.Random
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// no of rows required
val rows = 15
//no of columns required
val cols = 10
val spark = SparkSession.builder
.master("local[*]")
.appName("testApp")
.config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
.getOrCreate()
import spark.implicits._
val columns = 1 to cols map (i => "col" + i)
// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(StructField(_, IntegerType)))
val lstrows = Seq.fill(rows * cols)(Random.nextInt(100) + 1).grouped(cols).toList.map { x => Row(x: _*) }
val rdd = spark.sparkContext.makeRDD(lstrows)
val df = spark.createDataFrame(rdd, schema)
If you need to create a large amount of random data, Spark provides an object called RandomRDDs that can generate datasets filled with random numbers following a uniform, normal, or various other distributions.
https://spark.apache.org/docs/latest/mllib-statistics.html#random-data-generation
From their example:
import org.apache.spark.mllib.random.RandomRDDs._
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
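To adapt this to the original question (three integer columns in [1, 100]), one option is a sketch along these lines, using uniformVectorRDD and scaling the values (assumes a SparkSession named spark):
import org.apache.spark.mllib.random.RandomRDDs.uniformVectorRDD
import spark.implicits._

// 100 rows of 3 uniform doubles in [0, 1), mapped to integers in [1, 100]
val df = uniformVectorRDD(spark.sparkContext, 100L, 3)
  .map(v => (1 + (v(0) * 100).toInt, 1 + (v(1) * 100).toInt, 1 + (v(2) * 100).toInt))
  .toDF("col1", "col2", "col3")
df.show(false)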

Spark in Scala: how do I compare two columns to number of locations where they are different?

I have a DataFrame in Spark called df. I have trained a machine learning model on a couple of features and simply want to compute the accuracy between the label and prediction columns.
scala> df.columns
res32: Array[String] = Array(feature1, feature2, label, prediction)
This would be mind-numbingly simple in numpy:
accuracy = np.sum(df.label == df.prediction) / float(len(df))
Is there a similarly easy way to do this in Spark using Scala?
I should also mention I'm completely new to Scala.
Required imports:
import org.apache.spark.sql.functions.avg
import spark.implicits._
Example data:
val df = Seq((0, 0), (1, 0), (1, 1), (1, 1)).toDF("label", "prediction")
Solution:
df.select(avg(($"label" === $"prediction").cast("integer")))
Result:
+--------------------------------------+
|avg(CAST((label = prediction) AS INT))|
+--------------------------------------+
| 0.75|
+--------------------------------------+
Add:
.as[Double].first
or
.first.getDouble(0)
if you need a local value. If you want a count instead of the average, replace:
avg(($"label" === $"prediction").cast("integer"))
with
sum(($"label" === $"prediction").cast("integer"))
or
count(when($"label" === $"prediction", true))