How do I dynamically create a UDF in Spark? - scala

I have a DataFrame where I want to create multiple UDFs dynamically to determine if certain rows match. I am just testing one example right now. My test code looks like the following.
//create the dataframe
import spark.implicits._
val df = Seq(("t","t"), ("t", "f"), ("f", "t"), ("f", "f")).toDF("n1", "n2")
//create the scala function
def filter(v1: Seq[Any], v2: Seq[String]): Int = {
  for (i <- 0 until v1.length) {
    if (!v1(i).equals(v2(i))) {
      return 0
    }
  }
  return 1
}
//create the udf
import org.apache.spark.sql.functions.udf
val fudf = udf(filter(_: Seq[Any], _: Seq[String]))
//apply the UDF
df.withColumn("filter1", fudf(Seq($"n1"), Seq("t"))).show()
However, when I run the last line, I get the following error.
<console>:30: error: not found: value df
df.withColumn("filter1", fudf($"n1", Seq("t"))).show()
^
<console>:30: error: type mismatch;
found : Seq[String]
required: org.apache.spark.sql.Column
df.withColumn("filter1", fudf($"n1", Seq("t"))).show()
^
Any ideas on what I'm doing wrong? Note, I am on Scala v2.11.x and Spark 2.0.x.
On another note, if we can solve this "dynamic" UDF question/concern, my actual use case is to add many such columns to the dataframe. With the test code below, it takes forever (it never finishes; I had to Ctrl-C to break out). I'm guessing that a bunch of .withColumn calls in a for-loop is a bad idea in Spark. If so, please let me know and I'll abandon this approach altogether.
import spark.implicits._
val df = Seq(("t","t"), ("t", "f"), ("f", "t"), ("f", "f")).toDF("n1", "n2")
import org.apache.spark.sql.functions.udf
val fudf = udf( (x: String) => if (x.equals("t")) 1 else 0)
var df2 = df
for (i <- 0 until 10000) {
  df2 = df2.withColumn("filter" + i, fudf($"n1"))
}
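For reference, a minimal sketch of one common alternative (an assumption on my part, not tested against this exact case): build all the derived columns first and add them in a single select, which keeps the logical plan to one projection instead of growing it by one withColumn per iteration.
// Hedged sketch: add all derived columns in one select rather than a withColumn loop
import org.apache.spark.sql.functions.col
val newCols = (0 until 10000).map(i => fudf($"n1").as("filter" + i))
val df2 = df.select(col("*") +: newCols: _*)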

Enclose "t" in lit()
df.withColumn("filter1", fudf($"n1", Seq(lit("t")))).show()
Alternatively, try registering the UDF on the sqlContext/SparkSession; see "Spark 2.0 UDF registration".
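For the registration route, here is a minimal sketch (the name filterUdf and the view name rows are made up for illustration, and the function body is just a hedged re-expression of the question's filter):
// Hedged sketch: register the UDF on the SparkSession (Spark 2.x) and call it from SQL,
// building the array arguments with array(...)
spark.udf.register("filterUdf", (v1: Seq[String], v2: Seq[String]) =>
  if (v1.zip(v2).forall { case (a, b) => a == b }) 1 else 0)
df.createOrReplaceTempView("rows")
spark.sql("SELECT n1, n2, filterUdf(array(n1), array('t')) AS filter1 FROM rows").show()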

Related

Use a set as the key in Dataset#groupByKey

Is there a way to use a Set as the key with Dataset#groupByKey? It looks like, for sets, Spark uses an encoder meant for arrays. This causes the order of values within a set to change the outcome.
Here's an example:
import org.apache.spark.sql._

object Main extends App {
  val spark =
    SparkSession
      .builder
      .appName("spark")
      .master("local")
      .getOrCreate()

  import spark.implicits._

  println {
    List("foo", "bar")
      .toDS()
      .groupByKey {
        case "foo" => Set(1, 2)
        case "bar" => Set(2, 1) // append .toList.sorted.toSet to get expected behavior
      }
      .keys
      .collect
      .mkString("\n")
  }

  spark.close()
}
I expect this to produce a single key, Set(1, 2). Instead, it produces two. The encoder looks like it's meant for ordered collections:
val e: Encoder[Set[Int]] = implicitly[Encoder[Set[Int]]]
println(s"${e}") // class[value[0]: array<int>]
Is this a bug? Should there be an encoder for sets? Is that even feasible?
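Until that is settled, one possible workaround (a sketch of the hint in the comment above, reusing the same session and imports) is to key by a canonical, ordered representation such as a sorted Seq, so that the array-based encoder sees identical values for equal sets:
// Hedged sketch: key by a sorted Seq instead of a Set, so equal sets encode to
// identical arrays and collapse into a single key
List("foo", "bar")
  .toDS()
  .groupByKey {
    case "foo" => Set(1, 2).toSeq.sorted
    case "bar" => Set(2, 1).toSeq.sorted
  }
  .keys
  .collect()
  .foreach(println) // prints a single key: the sorted sequence (1, 2)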

Scala sortbykey and collect function

I am a beginner with Spark in Scala. I am writing a program that reads a CSV file and then sums the total spending per ID. After computing the totals, when I sort the RDD using sortByKey(), it does not print in sorted order, but after applying collect() it prints properly.
Before collect()
(0,5524.9497)
(51,4975.2197)
(1,4958.5996)
(52,5245.0605)
(2,5994.591)
(53,4945.3)
(3,4659.63)
(4,4815.05)
(5,4561.0703)
(6,5397.8794)
(7,4755.0693)
(8,5517.24)
(9,5322.6494)
(10,4819.6997)
After collect()
(0,5524.9497)
(1,4958.5996)
(2,5994.591)
(3,4659.63)
(4,4815.05)
(5,4561.0703)
(6,5397.8794)
(7,4755.0693)
(8,5517.24)
(9,5322.6494)
(10,4819.6997)
Code
def main(args: Array[String]) = {
  Logger.getLogger("org").setLevel(Level.ERROR) //Set for displaying errors in the program if any
  val sc = new SparkContext("local[*]", "CustomerSpending")
  val lines = sc.textFile("../customer-orders.csv")
  val field = lines.map(x => (x.split(",")(0).toInt, x.split(",")(2).toFloat))
  val collectThemAll = field.reduceByKey((x, y) => x + y)
  val sorted = collectThemAll.sortByKey().collect()
  sorted.foreach(println)
}
Spark applies transformations lazily, i.e. only when you call an action such as collect or take. So your sortByKey() only actually runs once an action is invoked. Note also that foreach on an RDD executes on the executors, partition by partition and in parallel, so the printed order is not guaranteed even after a sort; collecting to the driver first is what gives you the ordered printout.
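For example (a small illustrative snippet reusing the question's collectThemAll, not part of the original answer):
// The same sorted RDD, printed two ways. foreach runs on the executors per
// partition, so its print order is not guaranteed; collect() brings the rows
// back to the driver in key order first.
val sortedRdd = collectThemAll.sortByKey()
sortedRdd.foreach(println)            // order of output lines not guaranteed
sortedRdd.collect().foreach(println)  // printed in key order on the driver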
I created an App based on your sample data. I printed the RDD dependency using toDebugString so you can get insight into what is happening behind the scenes.
App
import org.apache.spark.sql.SparkSession

object PlaygroundApp extends App {
  val spark = SparkSession
    .builder()
    .appName("Stackoverflow App")
    .master("local[*]")
    .getOrCreate()
  val sc = spark.sparkContext

  val lines = sc.parallelize(Seq(
    (0, 5524.9497),
    (51, 4975.2197),
    (1, 4958.5996),
    (52, 5245.0605),
    (2, 5994.591),
    (53, 4945.3),
    (9, 5322.6494),
    (10, 4819.6997))
  )

  val collectThemAll = lines.reduceByKey((x, y) => x + y)
  println("---Before sort")
  collectThemAll.foreach(println)
  println(collectThemAll.toDebugString)
  println()

  println("---After sort")
  val sorted = collectThemAll.sortByKey()
  sorted.collect().foreach(println)
  println(sorted.toDebugString)
}
Output
---Before sort
(2,5994.591)
(53,4945.3)
(0,5524.9497)
(52,5245.0605)
(10,4819.6997)
(9,5322.6494)
(1,4958.5996)
(51,4975.2197)
(12) ShuffledRDD[1] at reduceByKey at PlaygroundApp.scala:28 []
+-(12) ParallelCollectionRDD[0] at parallelize at PlaygroundApp.scala:17 []
---After sort
(0,5524.9497)
(1,4958.5996)
(2,5994.591)
(9,5322.6494)
(10,4819.6997)
(51,4975.2197)
(52,5245.0605)
(53,4945.3)
(8) ShuffledRDD[4] at sortByKey at PlaygroundApp.scala:37 []
+-(12) ShuffledRDD[1] at reduceByKey at PlaygroundApp.scala:28 []
+-(12) ParallelCollectionRDD[0] at parallelize at PlaygroundApp.scala:17 []

scala spark conversion error when creating dataframe

I am a newbie in Scala, so please be patient.
I have this code.
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// create spark session
implicit val spark = SparkSession.builder().appName("clustering").getOrCreate()
// read file
val fileName = """file:///some_location/head_sessions_sample.csv"""
// create DF from file
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileName)
def inputKmeans(df: DataFrame, spark: SparkSession): DataFrame = {
  try {
    val a = df.select("id", "start_ts", "duration", "ip_dist")
      .map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3))))
      .toDF("id", "features")
    a
  } catch {
    case e: java.lang.ClassCastException => spark.emptyDataFrame
  }
}
val t = inputKmeans(df).filter( _ != null )
t.foreach(r =>
  if (r.get(0) != null)
    println(r.get(0)))
For the moment, I want to ignore the conversion errors, but somehow I still get them.
2018-09-24 11:26:22 ERROR Executor:91 - Exception in task 0.0 in stage
4.0 (TID 6) java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
I don't think there is any point in giving a snapshot of the CSV. At this point, I just want to ignore conversion errors.
Any ideas why this is happening?
As mentioned in the comment, the issue is that the values are not of Double type.
val a = df.select("id", "start_ts", "duration", "ip_dist").map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3)))).toDF("id", "features")
Either cast to the correct data type, i.e. Long (you can also provide the schema explicitly via a case class and apply it to the DataFrame).
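For example, a minimal sketch of this first option that instead casts the numeric columns to Double up front so the getDouble calls succeed (column names assumed from the question):
// Hedged sketch: cast the columns so r.getDouble(...) no longer hits a
// Long-to-Double ClassCastException
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType
val casted = df.select(
  col("id").cast("int").as("id"),
  col("start_ts").cast(DoubleType).as("start_ts"),
  col("duration").cast(DoubleType).as("duration"),
  col("ip_dist").cast(DoubleType).as("ip_dist"))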
Or use the VectorAssembler to convert the columns into a features vector. This is the easier and recommended approach.
import org.apache.spark.ml.feature.VectorAssembler

def inputKmeans(df: DataFrame, spark: SparkSession): DataFrame = {
  val assembler = new VectorAssembler()
    .setInputCols(Array("start_ts", "duration", "ip_dist"))
    .setOutputCol("features")
  val output = assembler.transform(df).select("id", "features")
  output
}
I think I discovered the problem: the try/catch is placed at the level of the DataFrame creation, not at the level of the conversion. As a consequence, it catches problems related to creating the DataFrame, not conversion issues.

Get Cluster_ID and the rest of table using Spark MLlib KMeans

I have this dataset (I'm showing just a few rows):
11.97,1355,401
3.49,25579,12908
9.29,129186,10882
28.73,10153,22356
3.69,22872,9798
13.49,160371,2911
24.36,106764,867
3.99,163670,16397
19.64,132547,401
And I'm trying to assign all this rows to 4 clusters using K-Means. For that I'm using the code that I see in this post: Spark MLLib Kmeans from dataframe, and back again
val data = sc.textFile("/user/cloudera/TESTE1")
val idPointRDD = data.map(s => (s(0), Vectors.dense(s(1).toInt,s(2).toInt))).cache()
val clusters = KMeans.train(idPointRDD.map(_._2), 4, 20)
val clustersRDD = clusters.predict(idPointRDD.map(_._2))
val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)
val idCluster = idClusterRDD.toDF("purchase","id","product","cluster")
I'm getting these outputs:
scala> import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> val data = sc.textFile("/user/cloudera/TESTE")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/TESTE MapPartitionsRDD[7] at textFile at <console>:29
scala> val idPointRDD = data.map(s => (s(0), Vectors.dense(s(1).toInt,s(2).toInt))).cache()
idPointRDD: org.apache.spark.rdd.RDD[(Char, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[8] at map at <console>:31
But when I run it I'm getting the following error:
java.lang.UnsupportedOperationException: Schema for type Char is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:715)
How can I solve this problem?
Many thanks!
Here is the thing: you are actually reading the CSV into an RDD of String and not converting the values to numbers. Since a String is a collection of characters, calling s(0) does compile, but it gives you the first Char of the line (which Scala will happily convert to an Int or a Double), not the first field you are actually looking for.
You need to split your val data: RDD[String] first:
val data: RDD[String] = ???
val idPointRDD = data.map { s =>
  s.split(",") match {
    case Array(x, y, z) =>
      Vectors.dense(x.toDouble, Integer.parseInt(y).toDouble, Integer.parseInt(z).toDouble)
  }
}.cache()
This should work for you!
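If you also need to keep the id (and the other fields) next to each predicted cluster, a hedged variant along these lines should work (the meaning of the three fields is assumed from the question's column names purchase, id, product):
// Hedged sketch: parse all three fields, train on the two numeric features,
// then zip the original rows with their predicted cluster ids
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val parsed = data.map(_.split(",")).map {
  case Array(purchase, id, product) => (purchase.toDouble, id.toInt, product.toInt)
}.cache()
val features = parsed.map { case (_, id, product) => Vectors.dense(id.toDouble, product.toDouble) }
val clusters = KMeans.train(features, 4, 20)
// parsed and features share the same partitioning and element counts, so zip lines up row by row
val withCluster = parsed.zip(clusters.predict(features))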

Can't run LDA on Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)] in Spark 2.0

I am following this tutorial video on an LDA example and I'm getting the following issue:
<console>:37: error: overloaded method value run with alternatives:
(documents: org.apache.spark.api.java.JavaPairRDD[java.lang.Long,org.apache.spark.mllib.linalg.Vector])org.apache.spark.mllib.clustering.LDAModel <and>
(documents: org.apache.spark.rdd.RDD[(scala.Long, org.apache.spark.mllib.linalg.Vector)])org.apache.spark.mllib.clustering.LDAModel
cannot be applied to (org.apache.spark.sql.Dataset[(scala.Long, org.apache.spark.mllib.linalg.Vector)])
val model = run(lda_countVector)
^
So I want to convert this DF to an RDD, but it always ends up as a Dataset for me. Can anyone please look into this issue?
// Convert DF to RDD
import org.apache.spark.mllib.linalg.Vector
val lda_countVector = countVectors.map { case Row(id: Long, countVector: Vector) => (id, countVector) }
// import org.apache.spark.mllib.linalg.Vector
// lda_countVector: org.apache.spark.sql.Dataset[(Long, org.apache.spark.mllib.linalg.Vector)] = [_1: bigint, _2: vector]
The Spark API changed between the 1.x and 2.x branches. In particular, DataFrame.map returns a Dataset, not an RDD, so the result is not compatible with the old RDD-based MLlib API. You should convert the data to an RDD first, as follows:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.sql.Row
import spark.implicits._ // needed for toDF on a local Seq (already in scope in spark-shell)

val a = Vectors.dense(Array(1.0, 2.0, 3.0))
val b = Vectors.dense(Array(3.0, 4.0, 5.0))
val df = Seq((1L, a), (2L, b), (2L, a)).toDF

val ldaDF = df.rdd.map {
  case Row(id: Long, countVector: Vector) => (id, countVector)
}

val model = new LDA().setK(3).run(ldaDF)
or you can convert to typed dataset and then to RDD:
val model = new LDA().setK(3).run(df.as[(Long, Vector)].rdd)