Forward fill PySpark to Scala

I've got the following PySpark code; how can I adapt it to Scala? It does forward and backward fill on missing data.
import pyspark.sql.functions as F
from pyspark.sql import Window

df = spark.createDataFrame([
    ('d1', None),
    ('d2', 10),
    ('d3', None),
    ('d4', 30),
    ('d5', None),
    ('d6', None),
], ('day', 'temperature'))

w_forward = Window.partitionBy().orderBy('day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
w_backward = Window.partitionBy().orderBy('day').rowsBetween(Window.currentRow, Window.unboundedFollowing)

df.withColumn('fill_forward', F.last('temperature', ignorenulls=True).over(w_forward)) \
  .withColumn('fill_both', F.first('fill_forward', ignorenulls=True).over(w_backward)).show()

Here:
case class Day(day: String, temperature: Option[Int])

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, last}
import spark.implicits._

val df = spark.createDataFrame[Day](
  Seq(
    Day("d1", None),
    Day("d2", Some(10)),
    Day("d3", None),
    Day("d4", Some(30)),
    Day("d5", None),
    Day("d6", None)
  )
)

val wForward = Window
  .partitionBy()
  .orderBy($"day")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val wBackward = Window
  .partitionBy()
  .orderBy($"day")
  .rowsBetween(Window.currentRow, Window.unboundedFollowing)

df.withColumn(
    "fill_forward",
    last($"temperature", ignoreNulls = true).over(wForward)
  )
  .withColumn(
    "fill_both",
    first("fill_forward", ignoreNulls = true).over(wBackward)
  )
  .show()
Easy, isn't it?
The main difference is that in Scala you can use a case class if you want to avoid setting the DataFrame schema explicitly with Row objects.
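For comparison, here is a rough sketch of what that Row-based alternative could look like. It is only illustrative: the column names match the example above, but the schema and variable names are mine.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Without a case class, Spark cannot infer the column types, so the schema is spelled out by hand.
val schema = StructType(Seq(
  StructField("day", StringType, nullable = false),
  StructField("temperature", IntegerType, nullable = true)
))

// Row objects carry no type information of their own, hence the explicit schema argument.
val rowDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(
    Row("d1", null),
    Row("d2", 10),
    Row("d3", null),
    Row("d4", 30),
    Row("d5", null),
    Row("d6", null)
  )),
  schema
)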

Related

select the first element after sorting column and convert it to list in scala

What is the most efficient way to sort one column in a data frame, convert it to a list, and assign the first element to a variable in Scala? I tried the following:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, first, regexp_replace}
import org.apache.spark.sql.functions._

println(CONFIG.getString("spark.appName"))
val conf = new SparkConf()
  .setAppName(CONFIG.getString("spark.appName"))
  .setMaster(CONFIG.getString("spark.master"))
val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()

val df = spark.read.format("com.databricks.spark.csv").option("delimiter", ",").load("file.csv")
val dfb = df.sort(desc("_c0"))
val list = df.select(df("_c0")).distinct
but I'm still not able to save the first element as a variable.
Use select, orderBy, map & head
This assumes column _c0 is of type string; if it has a different type, adjust the type parameter in _.getAs[<your column datatype>] accordingly (see the sketch after the examples below).
Check below code.
scala> import spark.implicits._
import spark.implicits._

scala> val first = df
         .select($"_c0")
         .orderBy($"_c0".desc)
         .map(_.getAs[String](0))
         .head

Or

scala> import spark.implicits._
import spark.implicits._

scala> val first = df
         .select($"_c0")
         .orderBy($"_c0".desc)
         .head
         .getAs[String](0)
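As a hypothetical illustration, if _c0 were a double column instead of a string one (for example after reading with inferSchema), only the type parameter would change:

scala> val first = df
         .select($"_c0")
         .orderBy($"_c0".desc)
         .map(_.getAs[Double](0))   // only this type parameter changes; it must match the actual column type
         .head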

How to convert RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

I am trying to convert an RDD[Row] to an RDD[Vector], but it throws an exception stating:
java.lang.ClassCastException: org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.mllib.linalg.Vector
My code is
val spark = SparkSession.builder().master("local").getOrCreate()
val df = spark.range(0,10).withColumn("uniform" , rand(10L)).withColumn("normal1" , randn(10L)).withColumn("normal2" , randn(11L))
val assembler = new VectorAssembler().setInputCols(Array("uniform" ,"normal1","normal2")).setOutputCol("features")
val dfVec = assembler.transform(df)
val dfOutlier = dfVec.select("id" , "features").union( spark.createDataFrame(Seq( (10 , org.apache.spark.mllib.linalg.Vectors.dense(3,3,3)) )) )
dfOutlier.show(false)
val scaler = new StandardScaler().setInputCol("features").setOutputCol("Scaled").setWithStd(true).setWithMean(true)
val model = scaler.fit(dfOutlier).transform(dfOutlier)
model.show(false)
val dfVecRdd = model.select("Scaled").rdd.map(_(0).asInstanceOf[org.apache.spark.mllib.linalg.Vector] )
When I perform an action on dfVecRdd, the exception is raised. How can I solve this?
Try removing this import from your code:
org.apache.spark.mllib.linalg.Vector
and importing this one instead:
import org.apache.spark.ml.linalg.Vectors
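A minimal sketch of what the corrected mapping could look like, assuming you stay with the new ml vectors that VectorAssembler and StandardScaler actually produce (the fromML line and the mllibRdd name are only an illustration for the case where an old-style mllib vector is genuinely required):

// The "Scaled" column holds org.apache.spark.ml.linalg.Vector values,
// so read them as that type instead of casting to the old mllib type.
val dfVecRdd = model.select("Scaled").rdd
  .map(_.getAs[org.apache.spark.ml.linalg.Vector](0))

// Only if an RDD of mllib vectors is really needed (e.g. for legacy mllib APIs):
val mllibRdd = dfVecRdd.map(org.apache.spark.mllib.linalg.Vectors.fromML)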

Cannot convert an RDD to Dataframe

I've converted a dataframe to an RDD:
val rows: RDD[Row] = df.orderBy($"Date").rdd
And now I'm trying to convert it back:
val df2 = spark.createDataFrame(rows)
But I'm getting an error:
Edit:
rows.toDF()
Also produces an error:
Cannot resolve symbol toDF
Even though I included this line earlier:
import spark.implicits._
Full code:
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.util._
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd._
object Playground {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("Playground")
      .config("spark.master", "local")
      .getOrCreate()

    import spark.implicits._

    val sc = spark.sparkContext

    val df = spark.read.csv("D:/playground/mre.csv")
    df.show()

    val rows: RDD[Row] = df.orderBy($"Date").rdd
    val df2 = spark.createDataFrame(rows)
    rows.toDF()
  }
}
Your IDE is right: SparkSession.createDataFrame needs a second parameter, either a bean class or a schema.
This will fix your problem:
val df2 = spark.createDataFrame(rows, df.schema)
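As for the rows.toDF() error, here is a likely explanation and a sketch of the working version (same df and rows as in the code above; only the comments are new):

// spark.implicits._ provides toDF via an implicit conversion that requires an Encoder
// for the element type, and there is no implicit Encoder for the generic Row type --
// hence "Cannot resolve symbol toDF". Passing the original schema back in works:
val rows: RDD[Row] = df.orderBy($"Date").rdd
val df2 = spark.createDataFrame(rows, df.schema)
df2.show()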

Scala -- Use evaluation of an expression to write dataframe to a csv file

The goal is to use evaluation (Eval or something similar) of an expression (a string) to write a DataFrame to a CSV file in Scala.
import org.apache.spark.sql.{SaveMode, SparkSession, SQLContext, Row, DataFrame, Column}
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox
import scala.reflect.runtime.currentMirror
val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
val df_write = """df.coalesce(1).write.option("delimiter", "\u001F").csv("file:///var/tmp/test")"""
// This is one of my failed attempts - I have tried using the interpreter as well (code not shown here).
val toolbox = runtimeMirror(getClass.getClassLoader).mkToolBox()
toolbox.eval(toolbox.parse(df_write))
Errors are:
object coalesce is not a member of package df ....
Shiva, try the below code. The issue was that the object variables were not in scope for the toolbox and therefore it was unable to evaluate the expression.
package com.mansoor.test

import org.apache.spark.sql.{DataFrame, SparkSession}

object Driver extends App {
  def evalCode[T](code: String): T = {
    import scala.tools.reflect.ToolBox
    import scala.reflect.runtime.{currentMirror => m}
    val toolbox = m.mkToolBox()
    toolbox.eval(toolbox.parse(code)).asInstanceOf[T]
  }

  val sparkSession: SparkSession = SparkSession.builder().appName("Test")
    .master("local[2]")
    .getOrCreate()

  import sparkSession.implicits._

  val df: DataFrame = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")

  val df_write =
    s"""
       |import com.mansoor.test.Driver._
       |
       |df.coalesce(1).write.option("delimiter", "\u001F").csv("file:///var/tmp/test")
     """.stripMargin

  evalCode[Unit](df_write)

  sparkSession.sparkContext.stop()
}

convert string data in dataframe into double

I have a CSV file containing double values. When I load it into a dataframe I get a message telling me that java.lang.String cannot be cast to java.lang.Double, although my data are numeric. How do I get a dataframe with double columns from this CSV file? How should I modify my code?
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, DoubleType}
import org.apache.spark.sql.functions.split
import scala.collection.mutable._
object Example extends App {
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val data = spark.read.csv("C://lpsa.data").toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")
val data2 = data.select("col2", "col3", "col4", "col5", "col6", "col7")
What should I do to convert these columns to double type? Thanks.
Use select with cast:
import org.apache.spark.sql.functions.col
data.select(Seq("col2", "col3", "col4", "col5", "col6", "col7").map(
c => col(c).cast("double")
): _*)
or pass schema to the reader:
define the schema:
import org.apache.spark.sql.types._
val cols = Seq(
  "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9"
)
val doubleCols = Set("col2", "col3", "col4", "col5", "col6", "col7")

val schema = StructType(cols.map(
  c => StructField(c, if (doubleCols contains c) DoubleType else StringType)
))
and use it as an argument for the schema method:
spark.read.schema(schema).csv(path)
It is also possible to use schema inference:
spark.read.option("inferSchema", "true").csv(path)
but it is much more expensive.
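Either way, a quick sanity check (purely illustrative, reusing the schema value and the path placeholder from above) is to print the resulting schema and confirm the columns come back as double rather than string:

// Columns listed in doubleCols should now show up as double in the printed schema.
spark.read.schema(schema).csv(path).printSchema()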
I believe using Spark's inferSchema option comes in handy while reading the CSV file. Below is the code to automatically detect your columns as double type:
val data = spark.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .load("C://lpsa.data").toDF()
Note: I am using Spark version 2.2.0.