spark create Dataframe in UDF - scala

I have a example, want to create Dataframe in a UDF. Something like the one below
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.feature.VectorAssembler
data to Dataframe
val df = Seq((1,1,34,23,34,56),(2,1,56,34,56,23),(3,0,34,23,23,78),(4,0,23,34,78,23),(5,1,56,23,23,12),
(6,1,67,34,56,34),(7,0,23,23,23,56),(8,0,12,34,45,89),(9,1,12,34,12,34),(10,0,12,34,23,34)).toDF("id","label","tag1","tag2","tag3","tag4")
val assemblerDF = new VectorAssembler().setInputCols(Array("tag1", "tag2", "tag3","tag4")).setOutputCol("features")
val data = assemblerDF.transform(df)
val Array(train,test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val testData=test.toDF
val loadmodel=LogisticRegressionModel.load("/user/xu/savemodel")
sc.broadcast(loadmodel)
val assemblerFe = new VectorAssembler().setInputCols(Array("a", "b", "c","d")).setOutputCol("features")
sc.broadcast(assemblerFe)
UDF
def predict(predictSet:Vector):Double={
val set=Seq((1,2,3,4)).toDF("a","b","c","d")
val predata = assemblerFe.transform(set)
val result=loadmodel.transform(predata)
result.rdd.take(1)(0)(3).toString.toDouble}
spark.udf.register("predict", predict _)
testData.registerTempTable("datatable")
spark.sql("SELECT predict(features) FROM datatable").take(1)
i get an error like
ERROR Executor: Exception in task 3.0 in stage 4.0 (TID 7) [Executor task launch worker for task 7]
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double)
and
WARN TaskSetManager: Lost task 3.0 in stage 4.0 (TID 7, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => double)
Is dataframe not supported? I am using Spark 2.3.0 and Scala 2.11. thanks

As mentioned in comments, you don't need UDF here to apply the Trained model to test data. You can apply the model to test dataframe in the main program itself as below:
val df = Seq((1,1,34,23,34,56),(2,1,56,34,56,23),(3,0,34,23,23,78),(4,0,23,34,78,23),(5,1,56,23,23,12),
(6,1,67,34,56,34),(7,0,23,23,23,56),(8,0,12,34,45,89),(9,1,12,34,12,34),(10,0,12,34,23,34)).toDF("id","label","tag1","tag2","tag3","tag4")
val assemblerDF = new VectorAssembler().setInputCols(Array("tag1", "tag2", "tag3","tag4")).setOutputCol("features")
val data = assemblerDF.transform(df)
val Array(train,test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val testData=test.toDF
val loadmodel=LogisticRegressionModel.load("/user/xu/savemodel")
sc.broadcast(loadmodel)
val assemblerFe = new VectorAssembler().setInputCols(Array("a", "b", "c","d")).setOutputCol("features")
sc.broadcast(assemblerFe)
val set=Seq((1,2,3,4)).toDF("a","b","c","d")
val predata = assemblerFe.transform(set)
val result=loadmodel.transform(predata) // Applying model on predata dataframe. You can apply model on any DataFrame.
result is a DataFrame now, you can Reigister the DataFrame as a table and query predictionLabel and features using SQL OR you can directly select the predictLabel and other fields from DataFrame.
Please note, UDF is a feature of Spark SQL to define new Column-based functions that extend the vocabulary of Spark SQL’s DSL for transforming Datasets. It doesnt return the DataFrame itself as a return type. and generally its not advised to use UDF's unless necessary, refer to: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs-blackbox.html

Related

Scala sortbykey and collect function

I am a beginner in Spark in Scala. So I am writing a program where I am reading a CSV file, then I am counting the total spending done by a particular ID number. So after counting the spending, when I am sorting the RDD using sortByKey(), it's not sorting the RDD properly, but after applying collect() it's printing in a proper manner.
Before collect()
(0,5524.9497)
(51,4975.2197)
(1,4958.5996)
(52,5245.0605)
(2,5994.591)
(53,4945.3)
(3,4659.63)
(4,4815.05)
(5,4561.0703)
(6,5397.8794)
(7,4755.0693)
(8,5517.24)
(9,5322.6494)
(10,4819.6997)```
**After Collect**
```(0,5524.9497)
(1,4958.5996)
(2,5994.591)
(3,4659.63)
(4,4815.05)
(5,4561.0703)
(6,5397.8794)
(7,4755.0693)
(8,5517.24)
(9,5322.6494)
(10,4819.6997) ```
**Code**
``` def main(args: Array[String])= {
Logger.getLogger("org").setLevel(Level.ERROR) //Set for displaying errors in the program if any
val sc = new SparkContext("local[*]", "CustomerSpending")
val lines = sc.textFile("../customer-orders.csv")
val field = lines.map(x => (x.split(",")(0).toInt, x.split(",")(2).toFloat))
val collectThemAll = field.reduceByKey((x,y) => x+y)
val sorted = collectThemAll.sortByKey().collect()
sorted.foreach(println)
}
}
Spark applies transformations lazily i.e. only when you call an action like collect or take etc. So your call to sortByKey() is only applied after you call the collect.
I created an App based on your sample data. I printed the RDD dependency using toDebugString so you can get insight into what is happening behind the scenes.
App
import org.apache.spark.sql.SparkSession
object PlaygroundApp extends App {
val spark = SparkSession
.builder()
.appName("Stackoverflow App")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val lines = sc.parallelize(Seq(
(0, 5524.9497),
(51, 4975.2197),
(1, 4958.5996),
(52, 5245.0605),
(2, 5994.591),
(53, 4945.3),
(9, 5322.6494),
(10, 4819.6997))
)
val collectThemAll = lines.reduceByKey((x, y) => x + y)
println("---Before sort")
collectThemAll.foreach(println)
println(collectThemAll.toDebugString)
println()
println("---After sort")
val sorted = collectThemAll.sortByKey()
sorted.collect().foreach(println)
println(sorted.toDebugString)
}
Output
---Before sort
(2,5994.591)
(53,4945.3)
(0,5524.9497)
(52,5245.0605)
(10,4819.6997)
(9,5322.6494)
(1,4958.5996)
(51,4975.2197)
(12) ShuffledRDD[1] at reduceByKey at PlaygroundApp.scala:28 []
+-(12) ParallelCollectionRDD[0] at parallelize at PlaygroundApp.scala:17 []
---After sort
(0,5524.9497)
(1,4958.5996)
(2,5994.591)
(9,5322.6494)
(10,4819.6997)
(51,4975.2197)
(52,5245.0605)
(53,4945.3)
(8) ShuffledRDD[4] at sortByKey at PlaygroundApp.scala:37 []
+-(12) ShuffledRDD[1] at reduceByKey at PlaygroundApp.scala:28 []
+-(12) ParallelCollectionRDD[0] at parallelize at PlaygroundApp.scala:17 []

Task not serializable in community.cloud.databricks

Databricks community cloud is throwing
an org.apache.spark.SparkException: Task not serializable exception that my local machine is not throwing executing the same code.
The code comes from the Spark in Action book. What the code is doing is reading a json file with github activity data, then reading a file with employees usernames from an invented company, and ultimately ranks the employees by number of pushes.
To avoid extra shuffling, the variable containing the list of employees is broadcasted, however, when its time to return the rank, is when databricks community cloud throws the exception.
import org.apache.spark.sql.SparkSession
import scala.io.Source.fromURL
val spark = SparkSession.builder()
.appName("GitHub push counter")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val inputPath = "/FileStore/tables/2015_03_01_0-a829c.json"
val pushes = spark.read.json(inputPath).filter("type = 'PushEvent'")
val grouped = pushes.groupBy("actor.login").count.orderBy(grouped("count").desc)
val empPath = "https://raw.githubusercontent.com/spark-in-action/first-edition/master/ch03/ghEmployees.txt"
val employees = Set() ++ (for { line <- fromURL(empPath).getLines} yield line.trim)
val bcEmployees = sc.broadcast(employees)
import spark.implicits._
val isEmp = user => bcEmployees.value.contains(user)
val isEmployee = spark.udf.register("SetContainsUdf", isEmp)
val filtered = ordered.filter(isEmployee($"login"))
filtered.show()
If you have custom classes like employee/user then you need to implement serialization interface.Spark when working with custom user objects,needs to serialize them.
Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects

scala spark conversion error when creating dataframe

I am a newbie in scala. Please be patient.
I have this code.
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// create spark session
implicit val spark = SparkSession.builder().appName("clustering").getOrCreate()
// read file
val fileName = """file:///some_location/head_sessions_sample.csv"""
// create DF from file
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileName)
def inputKmeans(df: DataFrame,spark: SparkSession): DataFrame = {
try {
val a = df.select("id", "start_ts", "duration", "ip_dist").map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3)))).toDF("id", "features")
a
}
catch {
case e: java.lang.ClassCastException => spark.emptyDataFrame
}
}
val t = inputKmeans(df).filter( _ != null )
t.foreach(r =>
if (r.get(0) != null)
println(r.get(0)))
For the moment, i want to ignore my conversion errors. But somehow, I still have them.
2018-09-24 11:26:22 ERROR Executor:91 - Exception in task 0.0 in stage
4.0 (TID 6) java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
I dont think there is any point to give a snapshot of the csv. At this point, i just want to ignore conversion errors.
Any ideas why this is happening?
As mentioned in the comment, the issue is because the values are not Double type.
val a = df.select("id", "start_ts", "duration", "ip_dist").map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3)))).toDF("id", "features")
Either cast to the Correct DataType i.e Long Type (you can also provide the Schema explicitly using Case Class and apply the schema to DataFrame).
Or use the VectorAssembler to convert the columns into features. This is easier and recommended approach.
import org.apache.spark.ml.feature.VectorAssembler
def inputKmeans(df: DataFrame,spark: SparkSession): DataFrame = {
val assembler = new VectorAssembler().setInputCols(Array("start_ts", "duration", "ip_dist")).setOutputCol("features")
val output = assembler.transform(df).select("id", "features")
output
}
i think i discovered the problem. the "try catch" is placed at the level of the DF creation, not at the level of the conversion. in consequence, it catches problems related to DF creation, not conversion issues.

Spark reducer and summation result issue

Here is sample file
Department,Designation,costToCompany,State
Sales,Trainee,12000,UP
Sales,Lead,32000,AP
Sales,Lead,32000,LA
Sales,Lead,32000,TN
Sales,Lead,32000,AP
Sales,Lead,32000,TN
Sales,Lead,32000,LA
Sales,Lead,32000,LA
Marketing,Associate,18000,TN
Marketing,Associate,18000,TN
HR,Manager,58000,TN
Produce an output as csv
Group by department, desigination, State
With additional columns with sum(costToCompany) and sum(TotalEmployeeCount)
Result should be like
Dept,Desg,state,empCount,totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000
Following is the solution and writing to file is resulting in an error. What am i doing wrong here?
Step #1: Load file
val file = sc.textFile("data/sales.txt")
Step #2: Create a case class to represt the data
scala> case class emp(Dept:String, Desg:String, totalCost:Double, State:String)
defined class emp
Step #3: Split data and create RDD of emp object
scala> val fileSplit = file.map(_.split(","))
scala> val data = fileSplit.map(x => emp(x(0), x(1), x(2).toDouble, x(3)))
Step #4: Turn the data into Key/value par with key=(dept, desg,state) and value=(1,totalCost)
scala> val keyVals = data.map(x => ((x.Dept,x.Desg,x.State),(1,x.totalCost)))
Step #5: Group by using reduceByKey, as we want summation as well for total number of employees and the cost
scala> val results = keyVals.reduceByKey{(a,b) => (a._1+b._1, a._2+b._2)} //(a.count+ b.count, a.cost+b.cost)
results: org.apache.spark.rdd.RDD[((String, String, String), (Int, Double))] = ShuffledRDD[41] at reduceByKey at <console>:55
Step #6: save the results
scala> results.repartition(1).saveAsTextFile("data/result")
Error
17/08/16 22:16:59 ERROR executor.Executor: Exception in task 0.0 in stage 20.0 (TID 23)
java.lang.NumberFormatException: For input string: "costToCompany"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
at java.lang.Double.parseDouble(Double.java:540)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
at $line85.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
at $line85.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/08/16 22:16:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 20.0 (TID 23, localhost, executor driver): java.lang.NumberFormatException: For input string: "costToCompany"
Update 1
Forgot to remove header. update code here. Save is throwing a different error now. also, need to put the header back in the file.
scala> val file = sc.textFile("data/sales.txt")
scala> val header = fileSplit.first()
scala> val noHeaderData = fileSplit.filter(_(0) != header(0))
scala> case class emp(Dept:String, Desg:String, totalCost:Double, State:String)
scala> val data = noHeaderData.map(x => emp(x(0), x(1), x(2).toDouble, x(3)))
scala> val keyVals = data.map(x => ((x.Dept,x.Desg,x.State),(1,x.totalCost)))
scala> val resultSpecific = results.map(x => (x._1._1, x._1._2, x._1._3, x._2._1, x._2._2))
scala> resultSpecific.repartition(1).saveASTextFile("data/specific")
<console>:64: error: value saveASTextFile is not a member of org.apache.spark.rdd.RDD[(String, String, String, Int, Double)]
resultSpecific.repartition(1).saveASTextFile("data/specific")
To answer your question as well as comments:
It would be easier for you to utilize dataframes in this case, as your file is in csv format you can use the following way to load and save the data. In this way, you do not need to concern yourself with splitting the rows in the file as well as taking care of the header (both when loading and saving).
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = spark.read
.format("com.databricks.spark.csv")
.option("header", "true") //reading the headers
.load("csv/file/path");
The dataframe column names will then be the same as the header in the file. Instead of reduceByKey() you can use the dataframe's groupBy() and agg():
val res = df.groupBy($"Department", $"Designation", $"State")
.agg(count($"costToCompany").alias("empCount"), sum($"costToCompany").alias("totalCost"))
Then save it:
res.coalesce(1)
.write.format("com.databricks.spark.csv")
.option("header", "true")
.save("results.csv")
when you are trying to cast into double, costToCompany string wont cast thats why its stuck when try to fire action. just drop first record from file and then it will work . you can also do such operation on dataframe also which will be easy
The error is straight forward and it says that
:64: error: value saveASTextFile is not a member of
org.apache.spark.rdd.RDD[(String, String, String, Int, Double)]
resultSpecific.repartition(1).saveASTextFile("data/specific")
In fact, you got no method called saveASTextFile(...) but saveAsTextFile(???). you have case error on your method name.

Spark sqlContext UDF acting on Sets

I've been trying to define a function that works within Spark's DataFrame which takes scala sets as input and outputs an integer. I'm getting the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 20 in stage 25.0 failed 1 times, most recent failure: Lost task 20.0 in stage 25.0 (TID 473, localhost): java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.immutable.Set
Here is a simple code that gives the crux of the issue:
// generate sample data
case class Dummy( x:Array[Integer] )
val df = sqlContext.createDataFrame(Seq(
Dummy(Array(1,2,3)),
Dummy(Array(10,20,30,40))
))
// define the UDF
import org.apache.spark.sql.functions._
def setSize(A:Set[Integer]):Integer = {
A.size
}
// For some reason I couldn't get it to work without this valued function
val sizeWrap: (Set[Integer] => Integer) = setSize(_)
val sizeUDF = udf(sizeWrap)
// this produces the error
df.withColumn("colSize", sizeUDF('x)).show
What am I missing here? How can I get this to work? I know I can do this by casting to RDD but I don't want to go back and forth between RDD and DataFrames.
Use Seq:
val sizeUDF = udf((x: Seq) => setSize(x.toSet))